reddit-eoe

Open-ended analysis of self-reported symptoms in r/EosinophilicE posts.

Eosinophilic Esophagitis (EoE) is an allergic inflammatory condition of the esophagus. This project takes a JSONL scrape of ~10,000 posts from the r/EosinophilicE subreddit and produces an aggregate picture of what symptoms users associate with their EoE — in their own words, not against a predefined symptom list.

Headline result

Symptom	% of symptomatic posts
dysphagia	35.4%
food impaction	29.8%
throat tightness	14.7%
chest pain	13.0%
acid reflux	11.7%
vomiting	11.4%
stomach pain	10.8%
choking	8.6%
difficulty eating	7.9%
nausea	7.4%

Across 4,361 symptomatic posts by 2,811 distinct authors. Per-symptom CSV at .data/symptom_counts.csv.

How it works

Three-stage pipeline using the Anthropic API. Each stage writes resumable artifacts to .data/ and pauses for human review before the next.

Stage 1 — Per-post symptom extraction

src/eoe/extract_submit.py and src/eoe/extract_collect.py drive the Anthropic Message Batches API over Sonnet 4.6. For each post, Claude extracts EoE symptoms in the author's own words plus a grounding quote, with explicit prompt rules to skip negated mentions, hypothetical/educational framing, medication side effects, and comorbidity-list bleed-through. Allowing self-reported symptoms attributed to specific named family members (e.g. a parent describing a child's symptoms) is explicit. See src/eoe/prompts.py for the system prompt.

Output: .data/symptoms_raw.jsonl — one row per post with {post_id, author, created_utc, permalink, title, symptoms: [{phrase, quote}, ...]}.

Stage 2 — Cluster raw phrases into canonical groups

src/eoe/cluster.py does this in two passes:

Pass 1 sends phrases with count ≥ 2 (~1,000 phrases) to Claude in a single call to establish ~60 canonical groups (e.g. dysphagia, food impaction, throat tightness).
--fixup patches Pass 1 output: targeted LLM mapping for any high-count phrases the model missed, plus deterministic dedup of phrases the model placed in two groups (winner = higher-total group).
Pass 2 (--pass2) chunk-maps the long-tail singletons (~6,000 phrases) onto the established canonicals. Phrases the model declines stay in unmapped as the genuine long tail.

Output: mappings/symptom_mapping.json — hand-editable, sorted by group total, every phrase carries its raw count. Tracked in git so reviewers can see and propose changes to the canonical groupings.

A --validate flag reports unmapped phrases and any phrase that ended up in two groups.

Stage 3 — Aggregate and visualize

src/eoe/analyze.py applies the mapping to per-post symptom lists (deduped within a post so dysphagia mentioned 4 times in one post counts once), then writes:

symptom_counts.csv — per-canonical post count, post share, author count, author share, plus example phrases. Both per-post and dedupe-by-author counts side by side, so prolific posters can be spotted.
images/top_symptoms.png — top-30 horizontal bar chart (embedded above).
images/wordcloud.png — word cloud sized by post count.
images/cooccurrence.png — top-20 co-occurrence heatmap, conditional probability P(column | row), diagonal masked.
.data/symptom_quotes.md (locally generated, not tracked) — 5 randomly sampled verbatim quotes per canonical symptom, with permalink and post id.

Percentages use the symptomatic-post denominator (the 4,361 posts that report at least one symptom), not the full 8,262 analyzed posts. The remaining ~3,900 are correctly empty: treatment-only posts, dietary advice, hypothetical questions, etc.

Run it yourself

uv sync
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env

# Place the scraped JSONL at .data/r_EosinophilicE_posts.jsonl

# Stage 1 — extract
uv run python -m eoe.extract_submit --limit 20 --no-batch     # dry-run on 20 posts
uv run python -m eoe.extract_submit                            # submit full corpus batch
uv run python -m eoe.extract_collect --status                  # check progress
uv run python -m eoe.extract_collect                           # poll → download when done

# Stage 2 — cluster
uv run python -m eoe.cluster                                   # Pass 1: establish canonicals
uv run python -m eoe.cluster --fixup                           # patch unmapped + dedup
uv run python -m eoe.cluster --pass2                           # absorb the long tail
uv run python -m eoe.cluster --validate                        # sanity-check the mapping

# Stage 3 — aggregate + plot
uv run python -m eoe.analyze

Stage 1 on Sonnet 4.6 via the Batches API is roughly $5–7 for ~8,000 posts (50% off list pricing). Stages 2 and 3 are cents.

Notes on methodology

Open-ended extraction, not a checklist. The extraction prompt does not list canonical symptoms — Claude returns whatever the author describes. Canonicals emerge from the clustering stage.
Post-level dedup, author-level dedup reported side-by-side. A post mentioning food impaction four times counts once; an author posting about food impaction ten times counts once toward the author column. Both views appear in the CSV.
Authors marked [deleted] or None are excluded from the author counts rather than collapsed into one fake "anonymous" author.
Singletons get genuine LLM judgment, not just deterministic mapping. Pass 2 is an LLM call, so a singleton like food getting stuck deep in my throat ends up in food impaction rather than its own group. Phrases that are genuinely different (allergic reactions, GI bleeding signs, etc.) are left in unmapped — ~143 phrases / 1% of mentions.
No time-trend chart. Posts span ~2014–2026 but treatment availability changed dramatically over that window (Dupixent approved 2022), so a longitudinal view would conflate corpus drift with anything substantive.

Layout

src/eoe/
  prompts.py            # extraction + clustering + fixup prompts
  json_parsing.py       # tolerant JSON extraction (handles fences/multi-block)
  extract_submit.py     # Stage 1a: build + submit Message Batch
  extract_collect.py    # Stage 1b: poll + download → symptoms_raw.jsonl
  cluster.py            # Stage 2: cluster phrases (Pass 1, --fixup, --pass2, --validate)
  analyze.py            # Stage 3: counts + charts + quotes
.data/                  # gitignored; pipeline inputs/outputs
mappings/               # tracked; hand-reviewable phrase→canonical mapping
images/                 # tracked; workflow diagram + chart snapshots

Built with uv and the Anthropic Python SDK.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.data		.data
images		images
mappings		mappings
src/eoe		src/eoe
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

reddit-eoe

Headline result

How it works

Stage 1 — Per-post symptom extraction

Stage 2 — Cluster raw phrases into canonical groups

Stage 3 — Aggregate and visualize

Run it yourself

Notes on methodology

Layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

reddit-eoe

Headline result

How it works

Stage 1 — Per-post symptom extraction

Stage 2 — Cluster raw phrases into canonical groups

Stage 3 — Aggregate and visualize

Run it yourself

Notes on methodology

Layout

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages