# Truthfulness Evaluator
Multi-model truthfulness evaluation with filesystem-aware evidence gathering. Automates fact-checking for technical documentation.
## Overview
Truthfulness Evaluator extracts verifiable claims from your Markdown files, gathers evidence from web searches and your actual codebase, then uses multiple AI models to verify each claim through weighted consensus.
Stop shipping documentation that drifts from reality. Run this tool in CI to catch outdated READMEs, verify API references match actual code, and ensure your technical claims are backed by evidence.
## How It Works
1. Claim Extraction - An LLM parses the document and extracts verifiable factual statements as structured Pydantic models, skipping opinions and predictions.
2. Evidence Gathering - For each claim, the system searches multiple sources in parallel: web search via DuckDuckGo for external facts, and a filesystem ReAct agent for code-specific claims (it reads files, parses ASTs, and follows imports).
3. Multi-Model Verification - Each claim is sent to multiple AI models independently. The models analyze the evidence and return structured verdicts (SUPPORTS, REFUTES, or NOT_ENOUGH_INFO) with confidence scores.
4. Consensus and Grading - Model votes are aggregated through weighted consensus. The final report includes a letter grade (A+ to F), evidence citations, and detailed explanations.
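The filesystem evidence step (step 2) can be sketched in miniature as a keyword scan over the repository. This is an illustrative stand-in only: the function name and scoring are invented for this example, and the real ReAct agent additionally parses ASTs and follows imports rather than just matching text.

```python
from pathlib import Path

def gather_file_evidence(
    root: str,
    keywords: list[str],
    exts: tuple[str, ...] = (".py", ".toml", ".md"),
) -> list[tuple[str, str]]:
    """Collect (path, line) snippets whose text mentions any claim keyword.

    Hypothetical sketch of filesystem evidence gathering; not the tool's API.
    """
    hits: list[tuple[str, str]] = []
    for path in Path(root).rglob("*"):
        if path.suffix not in exts or not path.is_file():
            continue
        for line in path.read_text(errors="ignore").splitlines():
            # A line counts as evidence if it mentions any keyword (case-insensitive).
            if any(k.lower() in line.lower() for k in keywords):
                hits.append((str(path), line.strip()))
    return hits
```

For the Python-version claim shown in the example report below, scanning for `requires-python` would surface the matching line in `pyproject.toml` as evidence.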
## Key Features
- Multi-Model Consensus - GPT-4o, Claude, and other models vote independently on verdicts, reducing hallucination risk through ensemble verification
- Filesystem Evidence - ReAct agent browses your codebase, reads source files, and follows imports to verify code-specific claims
- Web Search Integration - DuckDuckGo search for external fact verification with URL fetching and content analysis
- Pluggable Workflows - Composable extractors, gatherers, verifiers, and formatters with built-in presets (external, full, quick, internal)
- Structured Outputs - Pydantic models throughout, no brittle JSON parsing, full type safety
- Rich Reports - JSON, Markdown, and HTML output formats with evidence citations and confidence scores
- LangGraph 1.0+ - Durable execution with checkpointing, streaming, and human-in-the-loop support
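To make the multi-model consensus idea concrete, here is a rough sketch of weighted voting and grading. The model weights, the scoring rule, and the grade cut-offs are all invented for this illustration; the tool's actual aggregation may differ.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    model: str
    verdict: str        # "SUPPORTS", "REFUTES", or "NOT_ENOUGH_INFO"
    confidence: float   # 0.0 - 1.0

# Hypothetical per-model weights; stronger models count more.
MODEL_WEIGHTS = {"gpt-4o": 1.0, "gpt-4o-mini": 0.5}

def consensus(votes: list[Vote]) -> tuple[str, float]:
    """Pick the verdict with the highest total weight * confidence,
    returning it alongside its share of the overall score."""
    scores: dict[str, float] = {}
    for v in votes:
        w = MODEL_WEIGHTS.get(v.model, 1.0)
        scores[v.verdict] = scores.get(v.verdict, 0.0) + w * v.confidence
    verdict = max(scores, key=scores.get)
    return verdict, scores[verdict] / sum(scores.values())

def letter_grade(supported_fraction: float) -> str:
    """Map the share of supported claims to a letter grade (illustrative cut-offs)."""
    for cutoff, grade in [(0.97, "A+"), (0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]:
        if supported_fraction >= cutoff:
            return grade
    return "F"
```

With two SUPPORTS votes, as in the example report below, the consensus verdict is SUPPORTS with a 100% score share; the per-claim outcomes then roll up into the document's letter grade.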
## Use Cases
- Documentation Review - Catch outdated claims in READMEs before release
- Technical Writing - Verify API claims against actual code signatures
- Content Validation - Fact-check blog posts and tutorials before publishing
- CI/CD Integration - Fail builds when documentation drifts from code
- Code Review - Validate docstrings and comments match implementation
## Quick Start
```bash
pip install truthfulness-evaluator
export OPENAI_API_KEY="sk-..."

# Verify a document
truth-eval README.md

# Generate a Markdown report
truth-eval README.md -o report.md

# Check docs against your codebase
truth-eval README.md --root-path . --mode both
```

## Example Output
```text
Truthfulness Evaluation Report
Grade: A | Overall Confidence: 87.3%

Extracted 5 claims from README.md

SUPPORTED (95% confidence)
  Claim: "Requires Python 3.11 or higher"
  Votes: gpt-4o: SUPPORTS, gpt-4o-mini: SUPPORTS
  Evidence: pyproject.toml (requires-python = ">=3.11")

SUPPORTED (92% confidence)
  Claim: "Built on LangGraph 1.0+ and LangChain 1.0+"
  Votes: gpt-4o: SUPPORTS, gpt-4o-mini: SUPPORTS
  Evidence: pyproject.toml (langgraph = "^1.0.0")
```
Full documentation: https://sosoka-labs.github.io/truthfulness-evaluator/
