clean.go |
Cleans/normalizes URLs and resolves relative links against a base URL. For example, if we are currently on https://github.com/ and we have the href /nishoof, cleaning it would give https://github.com/nishoof. |
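The github.com example above can be reproduced with Go's standard `net/url` package. This is only a minimal sketch of relative-link resolution: `resolveHref` is a hypothetical name, and the real clean.go may do further normalization that is not shown here.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref is a hypothetical helper (not taken from clean.go) that
// resolves href against base using the standard library's resolution rules.
func resolveHref(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference handles absolute hrefs, rooted paths, and relative paths.
	return b.ResolveReference(r).String(), nil
}

func main() {
	cleaned, err := resolveHref("https://github.com/", "/nishoof")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cleaned) // https://github.com/nishoof
}
```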
crawl.go |
Crawls starting from the provided href: downloads the page, extracts its words and hrefs, cleans the hrefs, and then repeats the process on the new hrefs. Uses a queue and tracks visited URLs to avoid cycles. Returns a map from URLs to their extracted words. If an Index is provided, crawl also builds the index by calling the index's increment method. |
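The queue-plus-visited-set loop described above might look roughly like the sketch below. It makes several assumptions: `page`, `crawlSketch`, and the `fetch` callback are hypothetical names, and `fetch` stands in for the download, extract, and clean steps so that the example runs without network access. Robots.txt handling and index building are left out.

```go
package main

import "fmt"

// page is a hypothetical stand-in for the result of downloading a URL and
// extracting its words and (already cleaned) hrefs.
type page struct {
	words []string
	hrefs []string
}

// crawlSketch visits pages breadth-first starting from start. The visited
// set guarantees each URL is queued at most once, so link cycles terminate.
func crawlSketch(start string, fetch func(string) (page, bool)) map[string][]string {
	result := make(map[string][]string)
	visited := map[string]bool{start: true}
	queue := []string{start}

	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]

		p, ok := fetch(current)
		if !ok {
			continue // download failed or page is missing; skip it
		}
		result[current] = p.words

		for _, href := range p.hrefs {
			if !visited[href] {
				visited[href] = true
				queue = append(queue, href)
			}
		}
	}
	return result
}

func main() {
	// A tiny two-page site whose pages link to each other (a cycle).
	site := map[string]page{
		"https://example.com/":  {words: []string{"home"}, hrefs: []string{"https://example.com/a"}},
		"https://example.com/a": {words: []string{"about"}, hrefs: []string{"https://example.com/"}},
	}
	fetch := func(u string) (page, bool) {
		p, ok := site[u]
		return p, ok
	}

	pages := crawlSketch("https://example.com/", fetch)
	fmt.Println(len(pages)) // 2
}
```

Marking a URL as visited when it is enqueued, rather than when it is dequeued, keeps duplicates out of the queue entirely.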
download.go |
Downloads the contents of a web page using HTTP and returns a readable stream for further processing. |
extract.go |
Extracts relevant unique words and hrefs, skipping unwanted elements like <style> and <script>. |
index_in_memory.go |
An in-memory inverted index (implementing the Index interface). Maps each word to a nested map from the documents (URLs) the word appears in to the word's frequency in that document. |
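As an illustration of the nested-map layout, here is a minimal sketch. The type name `memIndex` and the method names and signatures are assumptions made for this example. They are not copied from index_in_memory.go.

```go
package main

import "fmt"

// memIndex is a hypothetical nested map: word -> document URL -> frequency.
type memIndex map[string]map[string]int

// increment records one more occurrence of word in doc, creating the inner
// map the first time a word is seen.
func (idx memIndex) increment(word, doc string) {
	if idx[word] == nil {
		idx[word] = make(map[string]int)
	}
	idx[word][doc]++
}

// getFrequency returns how often word occurs in doc. Reading from a missing
// inner map is safe in Go and yields 0.
func (idx memIndex) getFrequency(word, doc string) int {
	return idx[word][doc]
}

func main() {
	idx := memIndex{}
	idx.increment("crawler", "https://example.com/")
	idx.increment("crawler", "https://example.com/")
	idx.increment("crawler", "https://example.com/a")

	fmt.Println(idx.getFrequency("crawler", "https://example.com/")) // 2
	fmt.Println(len(idx["crawler"]))                                 // 2 documents contain the word
}
```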
index_interface.go |
Defines the interface for an index used to search for documents by keyword. Its methods include GetFrequency(), which returns the frequency of a given word in a given document. The increment method should be called once for every occurrence of every word. |
index_sqlite.go |
An SQLite-based inverted index (implementing the Index interface). Stores the index persistently on disk in an SQLite database with three tables: documents, words, and frequencies. |
robots.go |
Used by crawl to parse a website's robots.txt file so that the crawler follows its rules (including crawl delays and disallowed paths). |
search.go |
Searches the index for documents matching the provided words. Calculates a TF-IDF score for each document with at least one occurrence of the search word and returns a list of results (each containing the document URL, score, and number of occurrences). |
stop.go |
Checks if a word is a stop word. Stop words are common words that we should filter out. |
tfidf.go |
Implements TF-IDF ranking, which scores how relevant a document is to a search word. |