search-engine-nishoof

| File | Description |
| --- | --- |
| `clean.go` | Cleans and normalizes URLs and resolves relative links against a base URL. For example, if the current page is `https://github.com/` and the href is `/nishoof`, cleaning yields `https://github.com/nishoof`. |
| `crawl.go` | Crawls starting from the provided href: downloads the page, extracts its words and hrefs, cleans the hrefs, and repeats the process for each new href. Uses a queue and tracks visited URLs to avoid cycles. Returns a map from URLs to their extracted words. If an Index is provided, crawl also builds the index by calling the index's increment method. |
| `download.go` | Downloads the contents of a web page over HTTP and returns a readable stream for further processing. |
| `extract.go` | Extracts the relevant unique words and hrefs from a page, skipping unwanted elements such as `<style>` and `<script>`. |
| `index_in_memory.go` | An in-memory inverted index (implementing the Index interface). Maps each word to a second map from the documents (URLs) containing that word to the word's frequency in that document. |
| `index_interface.go` | Defines the interface for an index used to search for documents by keyword. Provides methods including `GetFrequency()`, which returns the frequency of a given word in a given document. The increment method should be called for every occurrence of every word. |
| `index_sqlite.go` | An SQLite-based inverted index (implementing the Index interface). Stores the index persistently on disk in an SQLite database with three tables: `documents`, `words`, and `frequencies`. |
| `robots.go` | Used by crawl to parse a website's `robots.txt` file so that the crawler follows its rules (including crawl delays and disallowed paths). |
| `search.go` | Searches the index for documents matching the provided words. Calculates a TF-IDF score for each document with at least one occurrence of the search word and returns a list of results (each containing the document URL, score, and number of occurrences). |
| `stop.go` | Checks whether a word is a stop word. Stop words are common words that should be filtered out. |
| `tfidf.go` | Implements TF-IDF ranking to determine how relevant a document is to a search word. |
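
The relative-link resolution described for `clean.go` can be sketched with Go's standard `net/url` package. This is a minimal sketch, not the repository's actual code: `resolveHref` is a hypothetical name, and the real cleaning step may normalize URLs further.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref is a hypothetical helper (not taken from clean.go).
// It resolves href against base using the standard library's
// RFC 3986 reference resolution.
func resolveHref(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

func main() {
	cleaned, err := resolveHref("https://github.com/", "/nishoof")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cleaned) // https://github.com/nishoof
}
```

An href that is already absolute passes through `ResolveReference` unchanged, so the same helper covers both cases.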
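
The queue-and-visited-set loop described for `crawl.go` might look roughly like the sketch below. To stay self-contained it replaces downloading and extraction with a hypothetical `fetch` function over an in-memory link graph; the `page` type, the `crawl` signature, and the `example.com` URLs are all assumptions, and the robots.txt handling is omitted.

```go
package main

import "fmt"

// page is a hypothetical stand-in for a downloaded and extracted page.
type page struct {
	words []string
	hrefs []string // assumed to be already cleaned, absolute URLs
}

// crawl visits pages breadth-first starting at seed and returns a map
// from each reachable URL to its words. fetch is an assumed callback
// that replaces the download and extract steps.
func crawl(seed string, fetch func(string) (page, bool)) map[string][]string {
	results := make(map[string][]string)
	visited := map[string]bool{seed: true}
	queue := []string{seed}

	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]

		p, ok := fetch(current)
		if !ok {
			continue // page could not be fetched; skip it
		}
		results[current] = p.words

		for _, href := range p.hrefs {
			if !visited[href] {
				visited[href] = true // mark on enqueue so cycles terminate
				queue = append(queue, href)
			}
		}
	}
	return results
}

func main() {
	// A tiny link graph containing a cycle: / -> /a -> / and /b -> /a.
	site := map[string]page{
		"https://example.com/":  {words: []string{"home"}, hrefs: []string{"https://example.com/a", "https://example.com/b"}},
		"https://example.com/a": {words: []string{"alpha"}, hrefs: []string{"https://example.com/"}},
		"https://example.com/b": {words: []string{"beta"}, hrefs: []string{"https://example.com/a"}},
	}
	fetch := func(u string) (page, bool) {
		p, ok := site[u]
		return p, ok
	}
	results := crawl("https://example.com/", fetch)
	fmt.Println(len(results)) // 3
}
```

Marking a URL as visited when it is enqueued, rather than when it is dequeued, keeps the same URL from entering the queue twice.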

Folders and files (70 commits):
templates
testdata
.gitattributes
.gitignore
README.md
clean.go
clean_test.go
crawl.go
crawl_test.go
download.go
download_test.go
extract.go
extract_test.go
go.mod
go.sum
index_in_memory.go
index_interface.go
index_sqlite.go
main.go
robots.go
robots_test.go
search.go
search_test.go
setup_test.go
stop.go
stop_test.go
testUtils.go
tfidf.go
tfidf_test.go
