Production-grade LLM Evaluation & Benchmarking Framework — GPT-4, Claude, Gemini, Mistral. Accuracy, latency, cost, hallucination, reasoning metrics.
Updated May 20, 2026 · Python
Cognitive observability for LLM agents. Cognometric instruments + self-healing reflex (F10) + MCP server. Pure-Python, MIT, no LLM required. 9-for-9 on K=1 phase transition. Every Mind Leaves Vitals (DOI 10.5281/zenodo.19777921).
Code for the NAACL paper "When Quantization Affects Confidence of Large Language Models?"
Enterprise-grade LLM evaluation framework | Multi-model benchmarking, honest dashboards, system profiling | Academic metrics: MMLU, TruthfulQA, HellaSwag | Zero fake data | PyPI: llm-benchmark-toolkit | Blog: https://dev.to/nahuelgiudizi/building-an-honest-llm-evaluation-framework-from-fake-metrics-to-real-benchmarks-2b90
A hallucination detection pipeline for Large Language Models (LLMs).
Evaluation of Llama-3.1-8B Base vs Instruct on TruthfulQA using few-shot prompting and automatic judge models
Official code for "From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems" (IWSDS 2026)
Does instruction tuning make language models more sycophantic? A paired causal study across Qwen, Llama, and Gemma on TruthfulQA, showing the effect is family-dependent in both magnitude and direction. 7,200 evaluations, 12 ATE estimates with paired t-tests and bootstrap CIs.
PT-GAT Transformer Diagnostics: task-relative hallucination diagnosis with adequacy triggers, evidence conditioning, and anti-collapse baselines.
Multilingual hallucination evaluation framework for Large Language Models across Indian languages using TruthfulQA, NLLB-200, and mechanistic interpretability.
CAP6640-Spring2026: Benchmarks GPT-3.5, GPT-4, Claude Haiku, and Gemini on GSM8K and TruthfulQA, measuring accuracy, self-consistency, and confidence calibration.
Multi-agent framework for hallucination detection and correction in LLM outputs using retrieval-grounded verification. MSc AI/ML dissertation (LJMU).
A tool to evaluate and compare local LLMs running on Ollama or LM Studio under identical conditions using deepeval's public benchmarks (MMLU, TruthfulQA, GSM8K).