Research papers, benchmark code, presentations, and planning resources from the Anote AI Research Fellowship.
The fellowship aims to produce publishable research across NLP, RAG, agentic AI, and annotation efficiency, with each paper yielding a reusable open-source artifact.
| Area | What is here |
|---|---|
| Active research | 7 Summer 2026 paper tracks with LaTeX starters, research questions, venues, and tracking issues |
| Backlog | 18 additional paper ideas for future fellowship cohorts |
| Benchmark code | Experiments for RAG, text classification, question answering, and object detection |
| Research assets | Prior papers, presentations, and video talks for onboarding and background reading |
| Program tracking | A spreadsheet with paper ideas, deadlines, venues, owners, and progress |
| If you are... | Start with... |
|---|---|
| A fellowship intern | Pick your track in Active Papers, read the linked issues, then follow the Intern Workflow |
| A researcher reviewing scope | Skim the active and backlog paper tables to understand the research roadmap |
| Looking for reusable code | Browse researchcode/ by benchmark area |
| Looking for paper drafts | Open the relevant main.tex under researchpapers/ |
| Looking for background material | Review researchpresentations/ and the Video Talks |
See anote_fellowship_tracker.xlsx for the full tracker with 25 paper ideas, deadlines, venues, and progress.
7 primary paper tracks for the Summer 2026 intern cohort. Each has 4 GitHub issues (idea, code, results, paper repo).
| Track | Paper | Research Question | Venue | Issues |
|---|---|---|---|---|
| T1a | AgenticEval | Does BFCL rank predict enterprise trustworthiness? | AAAI 2027 | #1–4 |
| T1b | EnterpriseSynth | Can we generate agentic SFT data from API schemas without live execution? | AAAI 2027 | #5–8 |
| T2a | AnnotateBench | How much labeled data do annotation strategies need across NLP tasks? | AAAI 2027 | #9–12 |
| T2b | AnnotateROI | How should enterprises measure annotation ROI? | AAAI 2027 | #13–16 |
| T3 | Human-AI Teaming | How does human-AI collaboration protocol affect downstream model behavior? | AAAI 2027 | #17–20 |
| T4 | RAG Failure Prop. | How do retrieval errors propagate through agentic RAG pipelines? | AAAI 2027 | #21–24 |
| T5 | RetrievalBench | Which retrieval combination generalizes across domains? | SIGIR 2027 | #25–28 |
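The issue ranges in the table follow a simple pattern: each track owns a block of 4 consecutive issue numbers, assigned in table order. A minimal Python sketch of that mapping is shown below. The helper name `issues_for` is hypothetical, and the order of issues within each block (idea, code, results, paper repo) is an assumption taken from the order they are listed above, so check the actual issues before relying on it.

```python
# Active tracks in the same order as the table above.
ACTIVE_TRACKS = ["T1a", "T1b", "T2a", "T2b", "T3", "T4", "T5"]

# Assumed order of the 4 issues inside each track's block.
ISSUE_KINDS = ["idea", "code", "results", "paper repo"]


def issues_for(track: str) -> dict[str, int]:
    """Map each issue kind to its GitHub issue number for one active track."""
    index = ACTIVE_TRACKS.index(track)  # raises ValueError for unknown tracks
    first = index * len(ISSUE_KINDS) + 1  # T1a starts at #1, T1b at #5, ...
    return {kind: first + offset for offset, kind in enumerate(ISSUE_KINDS)}
```

For example, `issues_for("T4")` covers issues 21 through 24, matching the `#21–24` entry in the table.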
18 additional research ideas for future intern cohorts. Each has a GitHub issue for tracking.
| Track | Paper | Research Question |
|---|---|---|
| T6 | FineTuneBench | When does fine-tuning outperform RAG for domain QA? |
| T7 | MultiHopRAG | Can agentic decomposition improve multi-hop retrieval? |
| T8 | TableRAG | How can RAG be specialized for tabular financial data? |
| T9 | EmbedBench | How do embedding models compare across enterprise domains? |
| T10 | LLMClassifyBench | Do LLMs or fine-tuned models classify better under domain shift? |
| T11 | PrivacyRAG | Can RAG work without exposing sensitive entities to the LLM? |
| T12 | AgentMemory | Which memory architecture best supports long-horizon agents? |
| T13 | SLMFineTune | Which PEFT method works best for enterprise domain adaptation? |
| T14 | NERBench | How much annotation does expert NER need in clinical, legal, and finance domains? |
| T15 | ChunkingTheory | Can we auto-select chunking strategy from document features? |
| T16 | AgenticOrchestration | Which multi-agent pattern works best for enterprise tasks? |
| T17 | SyntheticEval | Can synthetic evaluation datasets reliably rank LLMs? |
| T18 | HallucinationRAG | Which mitigation strategies reduce RAG hallucination most? |
| T19 | MultiDocRAG | How can RAG synthesize answers across multiple documents? |
| T20 | PromptStability | How sensitive are LLM classifiers to prompt paraphrase? |
| T21 | ActiveRAG | Can active learning guide RAG knowledge base curation? |
| T22 | LLMDataAug | When does LLM data augmentation help vs. hurt? |
| T23 | OntologyRAG | Can domain ontologies improve retrieval for medicine and law? |
Research/
├── anote_fellowship_tracker.xlsx # Master tracker (25 paper ideas)
├── main.tex # Reference LaTeX template (RAG paper)
├── researchpapers/
│ ├── T1a-AgenticEval/main.tex # LaTeX starter for each active paper
│ ├── T1b-EnterpriseSynth/main.tex
│ ├── T2a-AnnotateBench/main.tex
│ ├── T2b-AnnotateROI/main.tex
│ ├── T3-HumanAITeaming/main.tex
│ ├── T4-RAGFailureProp/main.tex
│ ├── T5-RetrievalBench/main.tex
│ ├── classification.pdf
│ ├── questionanswering.pdf
│ └── retrieval.pdf # arXiv:2404.07221
├── researchcode/
│ ├── Benchmarking-RAG/
│ ├── Benchmarking-Text-Classification/
│ ├── Benchmarking-Question-Answering/
│ └── Benchmarking-ObjectDetection/
└── researchpresentations/
  ├── RAG.pdf
  ├── TextClassification.pdf
  ├── AI_Talk.pdf
  └── HumanCenteredAI.pdf
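As a quick sanity check on this layout, the sketch below reports which active track folders are missing their LaTeX starter. It is an illustrative helper, not part of the repository: the function name `missing_starters` is hypothetical, and the folder names are copied from the tree above.

```python
from pathlib import Path

# Active track folders, as listed under researchpapers/ in the tree above.
ACTIVE_TRACK_DIRS = [
    "T1a-AgenticEval",
    "T1b-EnterpriseSynth",
    "T2a-AnnotateBench",
    "T2b-AnnotateROI",
    "T3-HumanAITeaming",
    "T4-RAGFailureProp",
    "T5-RetrievalBench",
]


def missing_starters(repo_root: Path) -> list[str]:
    """Return active track folders that lack researchpapers/<track>/main.tex."""
    papers = repo_root / "researchpapers"
    return [
        name
        for name in ACTIVE_TRACK_DIRS
        if not (papers / name / "main.tex").is_file()
    ]
```

Calling `missing_starters(Path("."))` from the repository root returns an empty list when every active track has its `main.tex` in place.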
- Read the GitHub issue for your track (start with the idea improvement issue)
- Design: write a Research Design Doc and link it in the tracker spreadsheet
- Code: build experiments and link the code repo in the tracker
- Results: run experiments and produce tables and figures
- Paper: fill in `researchpapers/<track>/main.tex` with your results
- Repo: create a standalone paper repo (reference structure)
- Update: update `anote_fellowship_tracker.xlsx` with all URLs and status
| Topic | Video |
|---|---|
| Fine Tuning LLMs | YouTube |
| Benchmarking Text Classification | YouTube |
| Benchmarking Q&A Models | YouTube |
| Human Centered AI | YouTube |
| Improving Retrieval for Q&A | YouTube |
| Few Shot Learning TED Talk | YouTube |