# Truthfulness Evaluator
Multi-model truthfulness evaluation with filesystem-aware evidence gathering. Automates fact-checking for technical documentation.
## Overview
Truthfulness Evaluator extracts verifiable claims from your Markdown files, gathers evidence from web searches and your actual codebase, then uses multiple AI models to verify each claim through weighted consensus.
Stop shipping documentation that drifts from reality. Run this tool in CI to catch outdated READMEs, verify API references match actual code, and ensure your technical claims are backed by evidence.
## How It Works
1. Claim Extraction - An LLM parses the document and extracts verifiable factual statements as structured Pydantic models, skipping opinions and predictions.
2. Evidence Gathering - For each claim, the system searches multiple sources in parallel: web search via DuckDuckGo for external facts, and a filesystem ReAct agent for code-specific claims (it reads files, parses ASTs, and follows imports).
3. Multi-Model Verification - Each claim is sent to multiple AI models independently. The models analyze the evidence and return structured verdicts (SUPPORTS, REFUTES, or NOT_ENOUGH_INFO) with confidence scores.
4. Consensus and Grading - Model votes are aggregated through weighted consensus. The final report includes a letter grade (A+ to F), evidence citations, and detailed explanations.
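The filesystem evidence step (step 2) can be sketched in miniature as a keyword scan over the repository. This is an illustrative stand-in only: the function name and scoring are invented for this example, and the real ReAct agent additionally parses ASTs and follows imports rather than just matching text.

```python
from pathlib import Path

def gather_file_evidence(
    root: str,
    keywords: list[str],
    exts: tuple[str, ...] = (".py", ".toml", ".md"),
) -> list[tuple[str, str]]:
    """Collect (path, line) snippets whose text mentions any claim keyword.

    Hypothetical sketch of filesystem evidence gathering; not the tool's API.
    """
    hits: list[tuple[str, str]] = []
    for path in Path(root).rglob("*"):
        if path.suffix not in exts or not path.is_file():
            continue
        for line in path.read_text(errors="ignore").splitlines():
            # A line counts as evidence if it mentions any keyword (case-insensitive).
            if any(k.lower() in line.lower() for k in keywords):
                hits.append((str(path), line.strip()))
    return hits
```

For the Python-version claim shown in the example report below, scanning for `requires-python` would surface the matching line in `pyproject.toml` as evidence.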
## Key Features
- Multi-Model Consensus - GPT-4o, Claude, and other models vote independently on verdicts, reducing hallucination risk through ensemble verification
- Filesystem Evidence - ReAct agent browses your codebase, reads source files, and follows imports to verify code-specific claims
- Web Search Integration - DuckDuckGo search for external fact verification with URL fetching and content analysis
- Pluggable Workflows - Composable extractors, gatherers, verifiers, and formatters with built-in presets (external, full, quick, internal)
- Structured Outputs - Pydantic models throughout, no brittle JSON parsing, full type safety
- Rich Reports - JSON, Markdown, and HTML output formats with evidence citations and confidence scores
- LangGraph 1.0+ - Durable execution with checkpointing, streaming, and human-in-the-loop support
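To make the multi-model consensus idea concrete, here is a rough sketch of weighted voting and grading. The model weights, the scoring rule, and the grade cut-offs are all invented for this illustration; the tool's actual aggregation may differ.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    model: str
    verdict: str        # "SUPPORTS", "REFUTES", or "NOT_ENOUGH_INFO"
    confidence: float   # 0.0 - 1.0

# Hypothetical per-model weights; stronger models count more.
MODEL_WEIGHTS = {"gpt-4o": 1.0, "gpt-4o-mini": 0.5}

def consensus(votes: list[Vote]) -> tuple[str, float]:
    """Pick the verdict with the highest total weight * confidence,
    returning it alongside its share of the overall score."""
    scores: dict[str, float] = {}
    for v in votes:
        w = MODEL_WEIGHTS.get(v.model, 1.0)
        scores[v.verdict] = scores.get(v.verdict, 0.0) + w * v.confidence
    verdict = max(scores, key=scores.get)
    return verdict, scores[verdict] / sum(scores.values())

def letter_grade(supported_fraction: float) -> str:
    """Map the share of supported claims to a letter grade (illustrative cut-offs)."""
    for cutoff, grade in [(0.97, "A+"), (0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]:
        if supported_fraction >= cutoff:
            return grade
    return "F"
```

With two SUPPORTS votes, as in the example report below, the consensus verdict is SUPPORTS with a 100% score share; the per-claim outcomes then roll up into the document's letter grade.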
## Use Cases
- Documentation Review - Catch outdated claims in READMEs before release
- Technical Writing - Verify API claims against actual code signatures
- Content Validation - Fact-check blog posts and tutorials before publishing
- CI/CD Integration - Fail builds when documentation drifts from code
- Code Review - Validate docstrings and comments match implementation
## Quick Start
```bash
pip install truthfulness-evaluator
export OPENAI_API_KEY="sk-..."

# Verify a document
truth-eval README.md

# Generate a Markdown report
truth-eval README.md -o report.md

# Check docs against your codebase
truth-eval README.md --root-path . --mode both
```

## Example Output
```text
Truthfulness Evaluation Report
Grade: A | Overall Confidence: 87.3%

Extracted 5 claims from README.md

SUPPORTED (95% confidence)
  Claim: "Requires Python 3.11 or higher"
  Votes: gpt-4o: SUPPORTS, gpt-4o-mini: SUPPORTS
  Evidence: pyproject.toml (requires-python = ">=3.11")

SUPPORTED (92% confidence)
  Claim: "Built on LangGraph 1.0+ and LangChain 1.0+"
  Votes: gpt-4o: SUPPORTS, gpt-4o-mini: SUPPORTS
  Evidence: pyproject.toml (langgraph = "^1.0.0")
```
Full documentation: https://sosoka-labs.github.io/truthfulness-evaluator/
