Data Source + Security Layer
Data Sources (Live Read Access)
The system connects to multiple data sources through secure, read-only access: product analytics via MCP for behavioral data, and Airtable via REST API for business context (revenue, programs, experiments).
Events (raw + custom)
Event properties
User properties
Business context (Airtable)
Metrics + Segments
Memory files
Every entity pre-cataloged into a machine-readable schema file
Security Firewall (3 Layers per Source)
Read-only tools whitelisted: 18
Write tools permanently blocked: 5
Injection detection layers: 3
Airtable approved bases: 3
No agent can create, modify, or delete anything in any data source. All queries are validated against allow-lists before execution. Prompt injection patterns are detected and rejected. Airtable access is locked to 3 approved base IDs with GET-only methods.
amplitude_firewall.py airtable_firewall.py schema_validator.py
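As a minimal sketch of the allow-list pattern a firewall like amplitude_firewall.py could implement: tool names, the injection patterns, and the function name are illustrative assumptions, not the production code (the real firewalls whitelist 18 read-only tools and run 3 detection layers).

```python
import re

# Illustrative allow-lists -- the real system whitelists 18 read-only tools
# and permanently blocks 5 write tools.
READ_ONLY_TOOLS = {"query_events", "get_user_properties", "list_segments"}
WRITE_TOOLS = {"create_event", "delete_cohort"}

# Example injection patterns (one of several detection layers).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"system prompt", re.I),
]

def check_request(tool: str, payload: str) -> bool:
    """Return True only if the tool is whitelisted and the payload is clean."""
    if tool in WRITE_TOOLS or tool not in READ_ONLY_TOOLS:
        return False  # not on the read-only allow-list
    return not any(p.search(payload) for p in INJECTION_PATTERNS)
```

Anything that fails the check is rejected before any query reaches the data source.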
orchestrator.py (~1,500 lines) -- Coordinates all 6 agents + 4 feedback loops. Injects memory, context, and prevention rules into every prompt.
Agent Pipeline (Sequential with Feedback Loops)
Loop 1 target
Agent 1 (runs twice)
Product Analyst
"The Detective" -- Phase 1: Decision-first hypotheses + query data. Phase 2: Journey mapping + opportunity sizing.
Phase 1: Receives the business question. Forms decision-driven hypotheses (each tied to a specific decision leadership needs to make). Formulates structured queries. Pulls live data via MCP. Extracts every quantitative claim as a machine-readable ClaimSpec (numerator, denominator, computation type).

Phase 2 (after Analytics Engineer certification): Takes the certified data back and maps user journeys for key cohorts (converters vs non-converters). Identifies the divergence point where user paths split. Segments deeply (channel, platform, geography, user tenure). Quantifies the opportunity using ONLY measured data -- no assumed LTV, no fabricated revenue.
hypothesis_formatter.py query_planner.py claim_tagger.py
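The ClaimSpec idea -- every quantitative claim carries its raw inputs so code can recompute it later -- can be sketched as a small dataclass. The field names here are illustrative assumptions; the real claim_tagger.py may structure it differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClaimSpec:
    """One quantitative claim with its raw inputs, so downstream code
    (not an LLM) can independently recompute the number.
    Field names are illustrative, not the production schema."""
    claim_text: str
    numerator: float
    denominator: float
    computation: str    # e.g. "percentage", "ratio", "count"
    claimed_value: float

# Hypothetical example claim:
spec = ClaimSpec(
    claim_text="3.2% of signups converted to paid",
    numerator=32, denominator=1000,
    computation="percentage", claimed_value=3.2,
)
```

Because the raw numerator and denominator travel with the claim, the deterministic verification layer can later recompute the value without trusting the prose.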
Loop 1: Discovery
Max 3 retries
Rejected findings sent back to Product Analyst
Agent 2
Analytics Engineer
"Data Guardian" -- Pure data quality validation. Does NOT analyze data or find patterns.
Runs 5-Point Data Admissibility Check: (C1) event existence in schema, (C2) property spelling and type, (C3) metric definition accuracy, (C4) filter logic validity, (C5) sample size adequacy. Certifies or rejects each finding. Rejected findings are sent back to the Product Analyst for re-query. Certified findings are passed back to the Product Analyst for Phase 2 deep analysis.
schema_validator.py
Schema validation
5-point admissibility check
Certify or reject
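The 5-point admissibility check can be sketched as a function returning the list of failed checks. The schema, threshold, and function name are illustrative assumptions; C3 (metric definitions) and C4 (filter logic) are omitted here for brevity.

```python
# Illustrative schema and threshold -- stand-ins, not the production
# amplitude_schema.json or the real C5 cutoff.
SCHEMA = {
    "signup_completed": {"platform": str, "channel": str},
}
MIN_SAMPLE = 100  # assumed C5 threshold

def admissibility_check(event: str, props: dict, sample_size: int) -> list:
    """Return failed checks; an empty list means the finding is certified."""
    failures = []
    if event not in SCHEMA:                          # C1: event exists in schema
        failures.append("C1: unknown event")
        return failures
    for name, value in props.items():                # C2: property spelling + type
        expected = SCHEMA[event].get(name)
        if expected is None:
            failures.append(f"C2: unknown property '{name}'")
        elif not isinstance(value, expected):
            failures.append(f"C2: wrong type for '{name}'")
    if sample_size < MIN_SAMPLE:                     # C5: sample adequacy
        failures.append("C5: sample too small")
    return failures
```

Any non-empty result sends the finding back to the Product Analyst for re-query (Loop 1).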
Loop 2: Evidence
Max 2 retries
Quarantined findings sent back to Product Analyst
Agent 3
Data Scientist
"The Statistician" -- Confounder detection, significance testing, effect size quantification, predictive indicators
Highest priority: confounder detection. For EVERY finding, identifies specific alternative explanations that could create a spurious relationship. Each confounder must be specific and testable -- not vague labels like "user bias," but concrete mechanisms (e.g., "iOS users skew wealthier; when we control for geography, the platform effect drops from 4pp to 1.2pp"). Rates confounder risk as LOW / MEDIUM / HIGH. Quarantines findings with high risk.

Also runs statistical significance tests (chi-squared, z-test) where appropriate, quantifies effect sizes in business terms, and identifies predictive indicators. Does NOT fabricate or assume any numbers not measured in the investigation.
statistical_tests.py
Significance testing
Confounder detection
Effect size (Cohen's h/d)
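The tests named above (z-test, Cohen's h) are standard formulas; here is a sketch using them, assuming two-proportion comparisons. This is textbook math, not the project's statistical_tests.py.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for two proportions,
    h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z-test statistic."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

A large |z| supports significance; |h| gives the magnitude in standardized terms, which the agent then translates into business language.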
Loop 3 target
Agent 4
Product Strategist
"The Translator" -- Adds business context, converts findings into actionable strategy
Sources business context from: (1) Airtable business data (revenue, programs, experiments -- fetched via read-only firewall), (2) Memory Bank files (project context, active priorities, past decisions, prevention rules), (3) investigation mission and implicit signals, (4) strategy frameworks (ICE scoring, Impact vs Effort). Builds impact projections using ONLY measured data -- does not fabricate LTV, revenue-per-user, or any unmeasured financial metric.
impact_model.py
Revenue projection
Sensitivity analysis
ICE scoring
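ICE scoring itself is a simple product of three ratings. A sketch, assuming the common 1-10 convention for each factor; the opportunity names and numbers below are purely illustrative, not outputs of impact_model.py.

```python
def ice_score(impact: float, confidence: float, ease: float) -> float:
    """ICE = Impact x Confidence x Ease, each rated 1-10 (common convention)."""
    for v in (impact, confidence, ease):
        if not 1 <= v <= 10:
            raise ValueError("ICE inputs are rated 1-10")
    return impact * confidence * ease

# Hypothetical opportunities with illustrative ratings:
opportunities = {
    "surface_ai_onboarding": ice_score(8, 6, 7),
    "fix_checkout_drop": ice_score(9, 8, 4),
}
ranked = sorted(opportunities, key=opportunities.get, reverse=True)
```

The ranking feeds the strategy output; the impact inputs themselves must come from measured data, per the no-fabrication rule.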
Loop 3: Challenge
Max 2 retries
Logic-blocked claims routed to Strategist + DS
Agent 5
Red Team
"The Challenger" -- Adversarial review of logic, reasoning, and assumptions (not data or math)
Only reviews findings that PASSED the Data Scientist. Quarantined findings are not re-reviewed (the DS already identified the core problem). Executes a 7-Point Challenge Protocol across 13 bias types.

Distinct from the Data Scientist: the DS challenges the DATA (confounders, statistical validity); the Red Team challenges the LOGIC (could there be an alternative interpretation? Does the reasoning hold? Will the effect hold at scale?). Example: DS validates that AI engagement correlates with conversion. Red Team asks: "If you make AI more prominent, will the new users who discover it convert at the same rate?"
bias_detector.py
13 bias types
7 challenge protocols
Logic-based confounder detection
Agent 6
Verification Analyst
"The Cross-Checker" -- Arithmetic, consistency, fabrication auditing
Runs 5 verification categories: (1) Arithmetic -- recomputes every percentage, ratio, and count independently; (2) Internal Consistency -- checks that numbers across different sections agree with each other; (3) Fabrication Auditing (HARD RULE) -- rejects any claim built on unmeasured data (e.g., if LTV was never measured, any revenue estimate using LTV is flagged as FABRICATED and rejected); (4) Sample Adequacy -- flags small samples; (5) Double-Counting -- catches overlapping populations across findings.
verify_report.py
5 verification categories
Fabrication auditing (hard gate)
Double-counting detection
Deterministic Verification (Pure Python -- No LLM Involved)
Loop 4: Verification
Max 2 retries
Blocked claims sent all the way back to Product Analyst
Triple Check Code Gate
This is the most critical layer in the system. Everything before this was AI agents reviewing each other's work. This part is pure Python. No LLM involved. Every quantitative claim -- every percentage, every count, every ratio -- must pass all 3 independent verification methods or the entire report is blocked from publication.
Method 1
Direct Computation
Recomputes the claimed number from the raw numerator and denominator independently
Method 2
Cross-Reference
Derives the same number via a completely different formula path to confirm convergence
Method 3
Bounds Check
Validates mathematical possibility: percentages 0-100, counts non-negative, funnels decreasing
Any disagreement between methods = claim blocked. Report will not publish.
triple_check.py verification_engine.py number_validator.py
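A minimal sketch of the three-method gate for a percentage claim; the tolerance, signature, and cross-reference path are assumptions, not the real triple_check.py.

```python
def triple_check(numerator: float, denominator: float, claimed_pct: float,
                 tolerance: float = 0.05) -> bool:
    """All 3 independent methods must agree, or the claim is blocked."""
    # Method 1: direct computation from the raw numerator and denominator
    direct = numerator / denominator * 100
    # Method 2: cross-reference via a different formula path
    # (here: 100 minus the complement's percentage)
    cross = 100 - (denominator - numerator) / denominator * 100
    # Method 3: bounds check -- a percentage must lie in [0, 100]
    in_bounds = 0 <= claimed_pct <= 100
    return (abs(direct - claimed_pct) <= tolerance
            and abs(cross - claimed_pct) <= tolerance
            and in_bounds)
```

A single False from this gate holds the entire report from publication.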
Why the Verification Layer Matters
LLMs are powerful at pattern recognition and narrative generation, but they routinely hallucinate numbers. A model might silently round 3.19% to 3.2%, invert a ratio, or confuse a numerator with a denominator. In a business context, a single wrong number can lead to a wrong decision. The verification layer exists to guarantee that no number leaves the system unchecked.
The Problem
AI agents generate insights with numbers embedded in natural language. These numbers are plausible-sounding but not always mathematically correct. Rounding errors, formula mistakes, and hallucinated statistics are common failure modes.
The Solution
Every quantitative claim is extracted as a structured ClaimSpec (with raw inputs, computation type, and expected output). This ClaimSpec is then passed to Python code -- not an LLM -- that independently recomputes, cross-references, and bounds-checks the result using 3 independent methods.
The Guarantee
If any verification method disagrees with the claimed value, the claim is blocked and the entire report is held from publication. The system can refuse to publish. That is the feature -- the ability to say "I don't trust this number" is what separates this from a chatbot.
Memory System -- How the Team Learns
The system does not start from zero on each investigation. It maintains persistent memory files that are injected into every agent's prompt, so past mistakes inform future behavior.
projectContext.md
Stores the environment setup, schema version, analytics platform configuration, and entity counts. Updated when the data source changes. Every agent reads this at the start of their execution so they know exactly what data is available.
activeContext.md
Tracks the current investigation state: which hypothesis is being tested, which agents have completed their work, which feedback loops are active, and what the current status is. This is the "working memory" of the system.
decisionLog.md
Records every decision every agent made, with the rationale behind it. "Product Analyst chose event X over event Y because..." "Red Team flagged survivorship bias because..." This creates a full audit trail of the system's reasoning.
errorLog.md -- Generates PREVENTION RULES
When the system makes an error -- a wrong query, a failed verification, a bias it missed -- it logs the error and generates a PREVENTION RULE. These rules are injected into all future agent prompts.

Example: If Investigation #1 failed because it used a deprecated event name, the prevention rule "Never query 'old_event_name'. Use 'new_event_name' instead." is automatically added to every agent's prompt in Investigation #2 and beyond.
How memory flows into prompts
The orchestrator reads all 4 memory files before each agent runs. It injects the relevant context into the agent's system prompt as structured sections: [ENVIRONMENT], [ACTIVE CONTEXT], [DECISIONS SO FAR], [PREVENTION RULES]. Each agent sees the full history of what happened before it, plus explicit rules about what not to do.
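A sketch of how the orchestrator could assemble a prompt from the four memory files. The file names and section tags match the doc; the function name and formatting are illustrative assumptions.

```python
from pathlib import Path

# The four memory files, mapped to the structured prompt sections they feed.
MEMORY_SECTIONS = [
    ("[ENVIRONMENT]", "projectContext.md"),
    ("[ACTIVE CONTEXT]", "activeContext.md"),
    ("[DECISIONS SO FAR]", "decisionLog.md"),
    ("[PREVENTION RULES]", "errorLog.md"),
]

def build_system_prompt(base_prompt: str, memory_dir: Path) -> str:
    """Prepend the agent's base prompt, then append each memory file
    that exists under its section tag."""
    parts = [base_prompt]
    for tag, filename in MEMORY_SECTIONS:
        path = memory_dir / filename
        if path.exists():
            parts.append(f"{tag}\n{path.read_text()}")
    return "\n\n".join(parts)
```

Missing files are simply skipped, so an agent always gets whatever history exists plus the explicit prevention rules.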
Feedback Loops Summary
Loop 1: Discovery
Analytics Engineer rejects findings with schema violations or data quality issues. Sends feedback to Product Analyst to re-query.
Max 3 retries.
Loop 2: Evidence
Data Scientist quarantines findings that fail significance tests or have high confounder risk. Routes back to Product Analyst.
Max 2 retries.
Loop 3: Challenge
Red Team blocks claims with logic gaps. Routes to Product Strategist and Data Scientist to address.
Max 2 retries.
Loop 4: Verification
Triple Check code gate blocks wrong numbers. Sends all the way back to Product Analyst.
Max 2 retries.
Complete Script Inventory (16 Files)
Verification
triple_check.py
verification_engine.py
number_validator.py
verify_report.py
Security + Schema + Context
amplitude_firewall.py
airtable_firewall.py
schema_validator.py
business_context.py
amplitude_schema.json
Analysis + Storage
hypothesis_formatter.py
query_planner.py
claim_tagger.py
statistical_tests.py
bias_detector.py
impact_model.py
data_store.py