Indexer

A program to index all pages of a website.

Usage

To index a website, run the program and enter the URL to start from when prompted. For example:

Enter a URL to scrape: https://misnymerch.com/

This will scrape https://misnymerch.com.

Implementation

The website is traversed using the BFS algorithm, meaning:

1. Start at the top page (that is, the URL provided) and add it to a queue.
2. Search the page at the front of the queue for links and take note of them.
3. After all the links on the page have been found, add them to the queue in order from first found to last found.
4. Remove the first item of the queue (that is, the page we just checked), and repeat from step 2 for the new first item in the queue.

The above procedure will run until the queue is empty.
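
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the class name BfsCrawlSketch, the crawl method, and the linksOf function (a stand-in for "download the page and extract its links") are hypothetical. The seen set is also an assumption, added so that pages linking back to each other do not keep the queue from ever emptying.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

class BfsCrawlSketch {

    // Visits pages breadth-first from startUrl and returns them in visit order.
    // linksOf is a hypothetical stand-in for fetching a page and extracting its links.
    static List<String> crawl(String startUrl, Function<String, List<String>> linksOf) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();      // assumption: skip URLs already queued
        List<String> visited = new ArrayList<>();

        // Start at the top page and add it to the queue.
        queue.add(startUrl);
        seen.add(startUrl);

        while (!queue.isEmpty()) {
            // Look at the page at the front of the queue and collect its links.
            String page = queue.peek();
            visited.add(page);

            // Add newly found links to the back of the queue, in the order found.
            for (String link : linksOf.apply(page)) {
                if (seen.add(link)) {
                    queue.add(link);
                }
            }

            // Remove the page we just checked and continue with the next one.
            queue.remove();
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "website" used in place of real HTTP requests.
        Map<String, List<String>> site = Map.of(
            "/", List.of("/shop", "/about"),
            "/shop", List.of("/", "/cart"),
            "/about", List.of("/shop"),
            "/cart", List.of());

        System.out.println(crawl("/", url -> site.getOrDefault(url, List.of())));
    }
}
```

Because links are appended to the back of the queue, every page one link away from the start is visited before any page two links away.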

Once the website has been traversed, the data will be saved to a MongoDB database. This program assumes a database called Indexer has already been configured at address mongodb://localhost:27017/.

The data is saved to a collection named after the provided start URL. The collection contains one object for each page traversed. In addition to its links, each page's title and headings are saved. A reference count is also stored, indicating how many times the page was linked to during the traversal.
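
To make the stored data more concrete, here is a rough sketch of what one page object and the reference count might look like. The PageRecord type, its field names, and the countReferences helper are hypothetical and not taken from the program. The sketch also assumes that every occurrence of a link counts toward the target's reference count, including repeated links on the same page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageRecordSketch {

    // Hypothetical shape of one stored page; the field names are illustrative only.
    record PageRecord(String url, String title, List<String> headings,
                      List<String> links, int referenceCount) {}

    // Counts how many times each URL appears as a link target across all pages.
    static Map<String, Integer> countReferences(Map<String, List<String>> linksByPage) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> links : linksByPage.values()) {
            for (String link : links) {
                counts.merge(link, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Links found on each page during a made-up traversal.
        Map<String, List<String>> linksByPage = new LinkedHashMap<>();
        linksByPage.put("/", List.of("/shop", "/about"));
        linksByPage.put("/shop", List.of("/", "/about"));
        linksByPage.put("/about", List.of());

        Map<String, Integer> counts = countReferences(linksByPage);

        // Build the record for one page from the collected data.
        PageRecord about = new PageRecord(
            "/about", "About us", List.of("Our story"),
            linksByPage.get("/about"), counts.getOrDefault("/about", 0));
        System.out.println(about);
    }
}
```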

Dependencies

This program has two dependencies: JSoup and the MongoDB Java Driver.

JSoup

Version: 1.20.1

JSoup is a wonderful library that assists in parsing HTML content. Its primary use in this program is to make it easier to iterate through HTML elements and pick out links and headings. It also provides a way to connect to a URL and download its contents with a supplied user agent.

MongoDB Driver

Version: 5.5.1

The MongoDB Java Driver provides a means to interact with a MongoDB NoSQL database from Java code. Note that MongoDB offers two variants of the driver: sync and async. This program uses the sync driver because the program is relatively simple in scope: it only performs one crawl at a time.
