nishoof/search-engine

search-engine-nishoof

| File | Description |
| --- | --- |
| clean.go | Cleans and normalizes URLs and resolves relative links against a base URL. For example, given the base URL https://github.com/ and the href /nishoof, cleaning yields https://github.com/nishoof. |
| crawl.go | Crawls starting from the provided href: downloads the page, extracts its words and hrefs, cleans the hrefs, then repeats the process on the new hrefs. Uses a queue and tracks visited URLs to avoid cycles. Returns a map from URLs to their extracted words. If an Index is provided, crawl also builds the index by calling the index's increment method. |
| download.go | Downloads the contents of a web page over HTTP and returns a readable stream for further processing. |
| extract.go | Extracts the relevant unique words and hrefs from a page, skipping unwanted elements such as `<style>` and `<script>`. |
| index_in_memory.go | An in-memory inverted index (implementing the Index interface). Maps each word to a second map from the documents (URLs) containing that word to the word's frequency in each document. |
| index_interface.go | Defines the interface for an index that supports searching for documents by keyword. Provides methods including GetFrequency(), which returns the frequency of a given word in a given document. The increment method should be called for every occurrence of every word. |
| index_sqlite.go | An SQLite-based inverted index (implementing the Index interface). Stores the index persistently on disk in an SQLite database with 3 tables: documents, words, and frequencies. |
| robots.go | Used by crawl to parse a website's robots.txt file so the crawler follows its rules (including crawl delays and disallowed paths). |
| search.go | Searches the index for documents matching the provided words. Calculates a TF-IDF score for each document with at least one occurrence of the search word and returns a list of results (each containing the document URL, score, and number of occurrences). |
| stop.go | Checks whether a word is a stop word. Stop words are common words that should be filtered out. |
| tfidf.go | Implements TF-IDF ranking to determine how relevant a document is to a search word. |
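
The link resolution described for clean.go can be sketched with the standard library's `net/url` package. This is a minimal illustration, not the repository's code: the function name `cleanHref` and its signature are assumptions, and the real clean.go may apply further normalization.

```go
package main

import (
	"fmt"
	"net/url"
)

// cleanHref is a hypothetical helper (not taken from clean.go) that resolves
// href against base and returns the resulting absolute URL as a string.
func cleanHref(base, href string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base %q: %w", base, err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse href %q: %w", href, err)
	}
	// ResolveReference applies the RFC 3986 rules for relative references.
	return baseURL.ResolveReference(ref).String(), nil
}

func main() {
	// The example from the table: base https://github.com/ and href /nishoof.
	cleaned, err := cleanHref("https://github.com/", "/nishoof")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cleaned)
}
```

Hrefs that are already absolute pass through `ResolveReference` unchanged, so the same helper can be applied to every extracted link.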
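
To make the index and ranking rows concrete, here is a rough sketch of an in-memory inverted index (word to document to count) with a TF-IDF score computed on top of it. Everything beyond the names Index, GetFrequency, and increment is an assumption: the type `memIndex`, the method signatures, the per-document word totals, and the specific formula (term frequency as count divided by document length, inverse document frequency as the natural log of total documents over documents containing the word) are illustrative choices, and tfidf.go may use a different variant.

```go
package main

import (
	"fmt"
	"math"
)

// memIndex is a hypothetical in-memory inverted index:
// word -> document URL -> number of occurrences.
type memIndex struct {
	freq   map[string]map[string]int
	docLen map[string]int // total indexed words per document (assumed, used for TF)
}

func newMemIndex() *memIndex {
	return &memIndex{
		freq:   make(map[string]map[string]int),
		docLen: make(map[string]int),
	}
}

// Increment records one occurrence of word in doc.
func (idx *memIndex) Increment(word, doc string) {
	if idx.freq[word] == nil {
		idx.freq[word] = make(map[string]int)
	}
	idx.freq[word][doc]++
	idx.docLen[doc]++
}

// GetFrequency returns how many times word occurs in doc (0 if never seen).
func (idx *memIndex) GetFrequency(word, doc string) int {
	return idx.freq[word][doc]
}

// tfidf scores doc for word using one common TF-IDF variant (an assumption):
// (count / words in doc) * ln(total docs / docs containing word).
func (idx *memIndex) tfidf(word, doc string) float64 {
	docsWithWord := len(idx.freq[word])
	if docsWithWord == 0 || idx.docLen[doc] == 0 {
		return 0
	}
	tf := float64(idx.GetFrequency(word, doc)) / float64(idx.docLen[doc])
	idf := math.Log(float64(len(idx.docLen)) / float64(docsWithWord))
	return tf * idf
}

func main() {
	idx := newMemIndex()
	for _, w := range []string{"go", "go", "search"} {
		idx.Increment(w, "https://example.com/a")
	}
	for _, w := range []string{"search", "engine"} {
		idx.Increment(w, "https://example.com/b")
	}
	fmt.Printf("go in a:     %.4f\n", idx.tfidf("go", "https://example.com/a"))
	fmt.Printf("search in a: %.4f\n", idx.tfidf("search", "https://example.com/a"))
}
```

With this variant, a word that appears in every document gets an inverse document frequency of zero, so it contributes nothing to the ranking. That is the same intuition behind filtering stop words before indexing.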
