# Web Crawler (Python)
Implements a polite, trap-aware crawler over the UCI ICS static corpus with absolute-URL extraction, heuristics for trap avoidance, and comprehensive crawl analytics.
## Overview
This was a group assignment to build a focused web crawler that operates against the UCI ICS static corpus. The crawler manages a persistent frontier, fetches pages from the local corpus, extracts absolute links, validates and filters potential traps, and records crawl analytics including subdomain coverage, outlink maxima, the longest page by word count, and the top frequent words. The system is organized around four modules: `frontier.py`, `corpus.py`, `crawler.py`, and `main.py`.
## Key Features

- **Persistent frontier with deduplication:** the queue, membership set, and a fetched counter are checkpointed to disk so runs can be paused and resumed. If no saved state exists, the crawler seeds from `http://www.ics.uci.edu/`.
- **Static corpus access:** URLs are mapped to hashed filenames, then parsed from CBOR to recover the body, headers, status, and size. This enables repeatable experiments off-network.
- **Absolute link extraction:** links are parsed from HTML and normalized to absolute form using `lxml.html`'s `make_links_absolute` and `iterlinks`, which avoids brittle string concatenation (see the sketch after this list).
- **Trap-aware URL validation:** heuristics reject already-processed URLs, missing hosts, excessive length, deep paths, repetitive segments, parameter explosions, session IDs, and non-HTML assets. The validator also screens known dynamic patterns such as `wp-json` and reply-thread parameters.
- **Crawl analytics:** while crawling, the system tracks per-subdomain URL counts, the page with the most links, the longest page by words, frequently used terms (stop-words removed), downloaded URLs, and identified traps, then writes a human-readable `analytics.txt`.
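To make the extraction step concrete, here is a minimal sketch of absolute-link extraction with `lxml.html`, in the spirit of the approach described above; the function name and error handling are illustrative rather than the project's exact code.

```python
from lxml import etree, html

def extract_absolute_links(raw_html, base_url):
    """Parse HTML and return every link resolved to absolute form."""
    try:
        doc = html.fromstring(raw_html)
    except (etree.ParserError, ValueError):
        return []  # unparseable or empty documents yield no links
    # Resolve relative references against the page's own URL.
    doc.make_links_absolute(base_url, resolve_base_href=True)
    # iterlinks() yields (element, attribute, link, position) tuples for
    # href, src, and other link-bearing attributes.
    return [link for _element, _attribute, link, _position in doc.iterlinks()]
```

With `resolve_base_href=True`, a page's `<base href>` tag is honored before relative links are resolved, so no hand-rolled string concatenation or regex is needed.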
## Architecture

- **Frontier:** a double-ended queue plus a membership set, with `save_frontier` and `load_frontier` for crash-safe progress (sketched below).
- **Corpus:** hashed URL mapping (`sha224`) and CBOR decoding that returns a uniform response dictionary of `content`, `content_type`, `http_code`, and redirection fields (sketched below).
- **Crawler:** the main loop fetches a URL, extracts links, validates candidates, adds valid ones back to the frontier, and updates analytics; stop-words come from NLTK and HTML text is extracted with BeautifulSoup.
- **Entry point:** `main.py` wires the modules together and registers a shutdown hook to persist state.
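Below is a minimal sketch of the deque-plus-set frontier, assuming pickle-based checkpointing into the `frontier_state` directory; the state file name and method bodies are illustrative, not the project's exact code.

```python
import os
import pickle
from collections import deque

SEED_URL = "http://www.ics.uci.edu/"
STATE_FILE = "frontier_state/frontier.pkl"  # assumed file name inside frontier_state/

class Frontier:
    def __init__(self):
        self.to_visit = deque([SEED_URL])   # URLs waiting to be fetched
        self.seen = {SEED_URL}              # membership set for deduplication
        self.fetched = 0                    # pages fetched so far

    def add(self, url):
        if url not in self.seen:            # deduplicate before enqueueing
            self.seen.add(url)
            self.to_visit.append(url)

    def save_frontier(self, path=STATE_FILE):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump((self.to_visit, self.seen, self.fetched), f)

    def load_frontier(self, path=STATE_FILE):
        if os.path.exists(path):            # fall back to the seed if no saved state
            with open(path, "rb") as f:
                self.to_visit, self.seen, self.fetched = pickle.load(f)
```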
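A corresponding sketch of the corpus lookup, assuming the `cbor2` package and illustrative record keys (`content`, `http_code`, and so on); the actual field names in the corpus files may differ.

```python
import hashlib
import os
from typing import Optional

import cbor2  # assumed CBOR library; the project may use a different one

def fetch_from_corpus(corpus_dir, url) -> Optional[dict]:
    """Map a URL to its sha224-named corpus file and decode the stored response."""
    filename = hashlib.sha224(url.encode("utf-8")).hexdigest()
    path = os.path.join(corpus_dir, filename)
    if not os.path.exists(path):
        return None                      # URL is not part of the static corpus
    with open(path, "rb") as f:
        record = cbor2.load(f)           # decode the CBOR-encoded response record
    headers = record.get("headers", {})  # key names here are illustrative
    return {
        "url": url,
        "content": record.get("content", b""),
        "content_type": headers.get("Content-Type", ""),
        "http_code": record.get("http_code", 0),
        "is_redirected": record.get("is_redirected", False),
    }
```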
## Trap Avoidance Heuristics
The validator applies layered checks to reduce duplicates and crawler traps:
- protocol check for `http` and `https`, hostname required, fragment removal and URL normalization
- hard caps on URL length and query-parameter count; long query strings are rejected
- detection of repetitive directory patterns and excessive path depth
- session-ID filters, reply-to comment parameters, and JSON API endpoints for CMS platforms
- a broad extension blacklist for non-HTML assets and archives
All of this logic lives in `crawler.py:is_valid`, together with helpers for normalization and repetition checks; a condensed sketch follows.
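The sketch below is an illustrative version of these layered checks; the thresholds, regular expressions, and extension list are example values rather than the ones used in `crawler.py`.

```python
import re
from urllib.parse import parse_qs, urldefrag, urlparse

MAX_URL_LENGTH = 300      # example thresholds, not the project's tuned values
MAX_QUERY_PARAMS = 5
MAX_PATH_DEPTH = 10
BLOCKED_EXTENSIONS = re.compile(
    r".*\.(css|js|bmp|gif|jpe?g|ico|png|pdf|zip|tar|gz|mp3|mp4|avi)$", re.IGNORECASE
)
TRAP_PATTERNS = re.compile(r"(sessionid|sid=|replytocom=|wp-json)", re.IGNORECASE)

def is_valid(url, seen):
    url, _fragment = urldefrag(url)              # normalize: strip fragments
    parsed = urlparse(url)

    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False                             # protocol and hostname checks
    if url in seen or len(url) > MAX_URL_LENGTH:
        return False                             # duplicates and over-long URLs
    if len(parse_qs(parsed.query)) > MAX_QUERY_PARAMS:
        return False                             # query-parameter explosions
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:
        return False                             # excessively deep paths
    if len(segments) != len(set(segments)):
        return False                             # repeated directory names (calendar-style traps)
    if TRAP_PATTERNS.search(url):
        return False                             # session IDs, reply threads, CMS JSON APIs
    if BLOCKED_EXTENSIONS.match(parsed.path):
        return False                             # non-HTML assets and archives
    return True
```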
## Analytics Collected
- Per-subdomain URL counts and the page with the most links
- Longest page by word count (HTML markup excluded)
- Top 50 words across the corpus, excluding English stop-words
- Downloaded URLs and a list of identified traps

Results are written to `analytics.txt` at the end of a crawl.
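A minimal sketch of the bookkeeping behind these statistics, assuming a simple regex tokenizer; the class and method names (`Analytics`, `record_page`) are illustrative, and the real implementation takes its stop-word list from NLTK and extracts page text with BeautifulSoup.

```python
import re
from collections import Counter
from urllib.parse import urlparse

class Analytics:
    def __init__(self, stop_words):
        self.stop_words = set(stop_words)
        self.subdomain_counts = Counter()        # URLs crawled per subdomain
        self.word_counts = Counter()             # term frequencies, stop-words excluded
        self.most_outlinks = ("", 0)             # (url, number of valid outlinks)
        self.longest_page = ("", 0)              # (url, word count)

    def record_page(self, url, text, outlinks):
        self.subdomain_counts[urlparse(url).hostname or ""] += 1
        words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
                 if w not in self.stop_words]
        self.word_counts.update(words)
        if len(outlinks) > self.most_outlinks[1]:
            self.most_outlinks = (url, len(outlinks))
        if len(words) > self.longest_page[1]:
            self.longest_page = (url, len(words))

    def top_words(self, n=50):
        return self.word_counts.most_common(n)   # most frequent non-stop-words
```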
## How to Run
Follow the course brief: install Python 3 and required libraries, download the ICS static corpus, then run the crawler with the corpus directory path. For example:
`python3 main.py /path/to/spacetime_crawler_data`

The frontier will autosave on shutdown. Delete the `frontier_state` directory to reset progress.
## My Role
I collaborated with my teammate on the crawler loop, absolute-URL extraction using lxml, and the trap-aware validator. I also contributed to the analytics subsystem: subdomain aggregation, longest-page detection, and frequency analysis with a custom tokenizer and NLTK stop-words.