truthfulqa

Here are 13 public repositories matching this topic...

vignesh2027 / LLM-Evaluation-Framework

Production-grade LLM Evaluation & Benchmarking Framework — GPT-4, Claude, Gemini, Mistral. Accuracy, latency, cost, hallucination, reasoning metrics.

Updated May 20, 2026
Python

fathom-lab / styxx

Cognitive observability for LLM agents. Cognometric instruments + self-healing reflex (F10) + MCP server. Pure-Python, MIT, no LLM required. 9-for-9 on K=1 phase transition. Every Mind Leaves Vitals (DOI 10.5281/zenodo.19777921).

python mit-license ai-safety nli rag guardrails llm llm-safety hallucination-detection truthfulqa halueval cognometry halubench

Updated Jun 2, 2026
Python

upunaprosk / quantized-lm-confidence

Code for the NAACL paper "When Quantization Affects Confidence of Large Language Models?"

nlp compression quantization efficient-model large-language-models llm gptq truthfulqa

Updated Dec 30, 2024
Jupyter Notebook

NahuelGiudizi / llm-evaluation

Enterprise-grade LLM evaluation framework | Multi-model benchmarking, honest dashboards, system profiling | Academic metrics: MMLU, TruthfulQA, HellaSwag | Zero fake data | PyPI: llm-benchmark-toolkit | Blog: https://dev.to/nahuelgiudizi/building-an-honest-llm-evaluation-framework-from-fake-metrics-to-real-benchmarks-2b90

visualization python benchmarking machine-learning performance-testing academic-metrics mmlu ollama llm-evaluation truthfulqa hellaswag

Updated Dec 5, 2025
Python

tamimmirza / llm-lie-detector

A hallucination detection pipeline for Large Language Models (LLMs).

python pytorch lora rag huggingface wandb weights-and-biases large-language-models llm truthfulqa halueval llama3-2

Updated May 31, 2026
Jupyter Notebook

aaitorm / truthfulqa-llm-evaluation

Evaluation of Llama-3.1-8B Base vs Instruct on TruthfulQA using few-shot prompting and automatic judge models

multilingual nlp evaluation transformers llama llm prompting truthfulqa

Updated Mar 18, 2026
Python

LadyPary / llm-conversational-judgment

Official code for "From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems" (IWSDS 2026)

nlp conversational-ai llm sycophancy truthfulqa llm-as-a-judge

Updated Jan 8, 2026
Python

SamueleSamonini / sycophancy-causal-effect

Does instruction tuning make language models more sycophantic? A paired causal study across Qwen, Llama, and Gemma on TruthfulQA, showing the effect is family-dependent in both magnitude and direction. 7,200 evaluations, 12 ATE estimates with paired t-tests and bootstrap CIs.

nlp evaluation llama causal-inference gemma large-language-models llm instruction-tuning qwen sycophancy truthfulqa

Updated May 29, 2026
Jupyter Notebook

chi-binh-ta / ptgat-transformer-diagnostics

PT-GAT Transformer Diagnostics: task-relative hallucination diagnosis with adequacy triggers, evidence conditioning, and anti-collapse baselines.

nlp machine-learning fever transformer diagnostics hallucination rag factuality truthfulqa

Updated May 6, 2026
Python

sujitha-madda / multilingual-llm-hallucination-evaluation

Multilingual hallucination evaluation framework for Large Language Models across Indian languages using TruthfulQA, NLLB-200, and mechanistic interpretability.

multilingual nlp transformers hallucination huggingface llm mechanistic-interpretability truthfulqa indian-lanugages

Updated May 19, 2026
Jupyter Notebook

alexneilgreen / UCF-ComputerUnderstandingOfNaturalLanguage-LLMReliabilityEval

CAP6640-Spring2026: Benchmarks GPT-3.5, GPT-4, Claude Haiku, and Gemini on GSM8k and TruthfulQA, measuring accuracy, self-consistency, and confidence calibration.

python benchmark natural-language-processing gemini openai self-consistency ucf confidence-calibration anthropic gsm8k llm-evaluation truthfulqa cap6640

Updated May 1, 2026
Python

ravikirankrishnaprasad / multi-agent-hallucination-detection-and-correction

Multi-agent framework for hallucination detection and correction in LLM outputs using retrieval-grounded verification. MSc AI/ML dissertation (LJMU).

nlp machine-learning ai-research ljmu rag llm generative-ai retrieval-augmented-generation hallucination-detection truthfulqa multi-agent-ai medhallu

Updated May 29, 2026
Python

Shuichi346 / llm-benchmark-script

A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K).

python macos benchmark quantization model-evaluation apple-silicon llm gsm8k local-llm mmlu ollama lmstudio truthfulqa deepeval

Updated Mar 14, 2026
Python
