tos-kamiya/keyphrase

Keyphrase

keyphrase is a command-line tool that automatically detects key phrases and important sentences in PDF or Markdown files using an LLM (Large Language Model) and annotates them with color highlights. It is designed for academic papers, technical documents, and any text where understanding the main points at a glance is helpful.

Features

  • Supports both PDF and Markdown (.md) files

  • AI-based detection and color-coding of key concepts:

    • Approach/methodology (blue): The main novelty or core contribution of the paper
    • Experimental results (green): Key observations and experimental outcomes
    • Threats to validity (pink): Weaknesses or potential problems with the approach
  • Generates a new, annotated file with color-coded highlights

  • Flexible output filename options, with overwrite protection

  • All LLM inference is done locally via Ollama

  • Customizable highlight colors for each category via command-line options

Installation

1. Install via pipx (recommended)

pipx install git+https://github.com/tos-kamiya/keyphrase.git

If you don't have pipx:

python -m pip install --user pipx
python -m pipx ensurepath

2. Install and set up Ollama

Keyphrase uses Ollama for local LLM inference. Follow the instructions for your platform on the official Ollama site.

3. Download the gpt-oss model for Ollama

Install the required model in your local Ollama server:

ollama pull gpt-oss:20b

Usage

Basic usage

For PDF:

keyphrase input.pdf
  • Annotates input.pdf and writes the result to out.pdf; if out.pdf already exists, an error is raised unless --overwrite is given.

For Markdown:

keyphrase input.md
  • Annotates input.md, outputs as out.md using HTML <span> tags for highlights.

Output options

  • -o OUTPUT, --output OUTPUT: Specify output file name. Use -o - to write output to standard output (Markdown only).
  • -O, --output-auto: Output to INPUT-annotated.pdf or INPUT-annotated.md.
  • By default, output will be out.pdf or out.md. If the file exists, an error is raised unless --overwrite is specified.
  • --overwrite: Overwrite output file if it already exists

Color options

You can fully customize and preview highlight colors for each category using the options below.

Customizing highlight colors

  • Use --color-map to specify colors for each category.
  • Format: name:#rgba or name:#rrggbbaa (e.g., approach:#8edefbb0)
  • Available category names: approach, experiment, threat
  • To disable a specific marker, specify name:0 (e.g., threat:0)
  • This option can be used multiple times.

Example:

# Change 'approach' to yellow, 'experiment' to teal, and disable 'threat'
keyphrase input.pdf --color-map approach:#ffcc00ff --color-map experiment:#44cc99ff --color-map threat:0

Checking your current color settings (legend output)

You can check the currently active highlight colors as a legend in your terminal. This is especially useful when adjusting colors with --color-map.

keyphrase --color-legend text   # Show legend as plain text
keyphrase --color-legend ansi   # Show legend with 24-bit color blocks (background + black text)
keyphrase --color-legend html   # Show legend as a compact HTML table snippet

You can combine this with --color-map to preview your custom color settings:

keyphrase --color-legend ansi --color-map approach:#ffcc00ff --color-map experiment:#44cc99ff
  • ANSI output uses a background color block and black text for visibility (works best in 24-bit color terminals).
  • HTML output can be copy-pasted into documentation.
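
For readers curious what a "24-bit color block" looks like at the escape-sequence level, the sketch below builds one legend entry with standard ANSI SGR codes (48;2;r;g;b for the background, 38;2;0;0;0 for black text). The function name ansi_block is hypothetical, and the exact layout keyphrase prints may differ.

```python
def ansi_block(label: str, rgb: tuple[int, int, int]) -> str:
    """Return label on a 24-bit background with black text (illustrative only)."""
    r, g, b = rgb
    background = f"\x1b[48;2;{r};{g};{b}m"  # SGR: set truecolor background
    black_text = "\x1b[38;2;0;0;0m"         # SGR: set truecolor foreground to black
    reset = "\x1b[0m"                       # SGR: reset all attributes
    return f"{background}{black_text} {label} {reset}"


if __name__ == "__main__":
    # Colors here are example values, not keyphrase's defaults.
    print(ansi_block("approach", (142, 222, 251)))
```

Terminals without 24-bit color support may ignore or approximate these sequences.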

Skim mode (experimental)

  • --skim: Enable skim mode, a simplified highlighting mode intended for survey papers (i.e., papers not following the typical problem → approach → experiment structure). Instead of categorizing sentences by type, this mode highlights only important sentences using a single highlight color.

Logging and verbosity options

  • -q, --quiet: Suppress all progress output and messages.

  • --debug: Enable debug output (show prompts/responses) and progress bar.

  • --verbose: Show progress bar (the default behavior unless --quiet is given).

Other options

  • -m MODEL, --model MODEL: Specify the Ollama model to use (default: gpt-oss:20b)
  • --max-sentence-length N: Maximum sentence length for analysis (default: 80)
  • --buffer-size N: Buffer size for batch LLM queries (in characters, default: 2000). Sentences are processed in batches for efficiency.
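
One way to picture the --buffer-size batching is the Python sketch below, which groups sentences until a character budget would be exceeded. It is only an illustration: batch_sentences is a hypothetical name, and how keyphrase actually counts characters or handles oversized sentences is not stated here.

```python
from typing import Iterable, Iterator


def batch_sentences(sentences: Iterable[str],
                    buffer_size: int = 2000) -> Iterator[list[str]]:
    """Yield lists of sentences whose total length stays within buffer_size."""
    batch: list[str] = []
    used = 0
    for sentence in sentences:
        # Start a new batch when adding this sentence would exceed the budget.
        if batch and used + len(sentence) > buffer_size:
            yield batch
            batch, used = [], 0
        # A single sentence longer than buffer_size still gets its own batch.
        batch.append(sentence)
        used += len(sentence)
    if batch:
        yield batch
```

In this sketch a larger buffer means fewer, bigger batches, so fewer LLM queries at the cost of longer prompts.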

More usage examples

keyphrase paper.pdf -O
# -> Annotates 'paper.pdf', outputs as 'paper-annotated.pdf'

keyphrase notes.md -o highlights.md --buffer-size 5000 --max-sentence-length 100 --verbose
# -> Annotates 'notes.md', outputs to 'highlights.md', using a larger buffer, longer sentences, and showing progress.

Requirements

  • Python 3.10 or newer
  • Ollama running locally
  • gpt-oss:20b model installed in Ollama (ollama pull gpt-oss:20b)

License

MIT

Notes

  • No data is sent to any third-party APIs: all processing is local via Ollama.
  • For best results on scientific papers, use high-quality, clean PDF or Markdown sources.
  • Markdown output uses HTML <span style="background-color:...">...</span> for color highlights.
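
As a rough illustration of the Markdown highlight format, the Python sketch below wraps a sentence in such a span. The helper name highlight_span is hypothetical, and rendering the color as a CSS rgba() value is an assumption; the sentence text is inserted unchanged.

```python
def highlight_span(sentence: str, rgba: tuple[int, int, int, int]) -> str:
    """Wrap a sentence in an HTML span with a background color (illustrative)."""
    r, g, b, a = rgba
    alpha = f"{a / 255:.2f}"  # 0-255 alpha byte -> CSS alpha in [0, 1]
    style = f"background-color:rgba({r},{g},{b},{alpha})"
    return f'<span style="{style}">{sentence}</span>'


if __name__ == "__main__":
    print(highlight_span("Our method is new.", (255, 204, 0, 255)))
```

Markdown renderers that strip inline HTML or style attributes will show the text without the highlight.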
