Data Source + Security Layer
Data Sources (Live Read Access)
The system connects to multiple data sources through secure, read-only access: product analytics via MCP for behavioral data, and Airtable via REST API for business context (revenue, programs, experiments).
Events (raw + custom)
Event properties
User properties
Business context (Airtable)
Metrics + Segments
Memory files
Every entity pre-cataloged into a machine-readable schema file
Security Firewall (3 Layers per Source)
Read-only tools whitelisted: 18
Write tools permanently blocked: 5
Injection detection layers: 3
Airtable approved bases: 3
No agent can create, modify, or delete anything in any data source. All queries are validated against allow-lists before execution. Prompt injection patterns are detected and rejected. Airtable access is locked to 3 approved base IDs with GET-only methods.
amplitude_firewall.py airtable_firewall.py schema_validator.py
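As a minimal sketch of the allow-list pattern a firewall like amplitude_firewall.py could implement: tool names, the injection patterns, and the function name are illustrative assumptions, not the production code (the real firewalls whitelist 18 read-only tools and run 3 detection layers).

```python
import re

# Illustrative allow-lists -- the real system whitelists 18 read-only tools
# and permanently blocks 5 write tools.
READ_ONLY_TOOLS = {"query_events", "get_user_properties", "list_segments"}
WRITE_TOOLS = {"create_event", "delete_cohort"}

# Example injection patterns (one of several detection layers).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"system prompt", re.I),
]

def check_request(tool: str, payload: str) -> bool:
    """Return True only if the tool is whitelisted and the payload is clean."""
    if tool in WRITE_TOOLS or tool not in READ_ONLY_TOOLS:
        return False  # not on the read-only allow-list
    return not any(p.search(payload) for p in INJECTION_PATTERNS)
```

Anything that fails the check is rejected before any query reaches the data source.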
orchestrator.py (~1,500 lines) -- Coordinates all 6 agents + 4 feedback loops. Injects memory, context, and prevention rules into every prompt.
Agent Pipeline (Sequential with Feedback Loops)
Loop 1 target
Agent 1 (runs twice)
Product Analyst
"The Detective" -- Phase 1: Decision-first hypotheses + query data. Phase 2: Journey mapping + opportunity sizing.
Phase 1: Receives the business question. Forms decision-driven hypotheses (each tied to a specific decision leadership needs to make). Formulates structured queries. Pulls live data via MCP. Extracts every quantitative claim as a machine-readable ClaimSpec (numerator, denominator, computation type).

Phase 2 (after Analytics Engineer certification): Takes the certified data back and maps user journeys for key cohorts (converters vs non-converters). Identifies the divergence point where user paths split. Segments deeply (channel, platform, geography, user tenure). Quantifies the opportunity using ONLY measured data -- no assumed LTV, no fabricated revenue.
hypothesis_formatter.py query_planner.py claim_tagger.py
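The ClaimSpec idea -- every quantitative claim carries its raw inputs so code can recompute it later -- can be sketched as a small dataclass. The field names here are illustrative assumptions; the real claim_tagger.py may structure it differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClaimSpec:
    """One quantitative claim with its raw inputs, so downstream code
    (not an LLM) can independently recompute the number.
    Field names are illustrative, not the production schema."""
    claim_text: str
    numerator: float
    denominator: float
    computation: str    # e.g. "percentage", "ratio", "count"
    claimed_value: float

# Hypothetical example claim:
spec = ClaimSpec(
    claim_text="3.2% of signups converted to paid",
    numerator=32, denominator=1000,
    computation="percentage", claimed_value=3.2,
)
```

Because the raw numerator and denominator travel with the claim, the deterministic verification layer can later recompute the value without trusting the prose.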
Loop 1: Discovery
Max 3 retries
Rejected findings sent back to Product Analyst
Agent 2
Analytics Engineer
"Data Guardian" -- Pure data quality validation. Does NOT analyze data or find patterns.
Runs 5-Point Data Admissibility Check: (C1) event existence in schema, (C2) property spelling and type, (C3) metric definition accuracy, (C4) filter logic validity, (C5) sample size adequacy. Certifies or rejects each finding. Rejected findings are sent back to the Product Analyst for re-query. Certified findings are passed back to the Product Analyst for Phase 2 deep analysis.
schema_validator.py
Schema validation
5-point admissibility check
Certify or reject
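The 5-point admissibility check can be sketched as a function returning the list of failed checks. The schema, threshold, and function name are illustrative assumptions; C3 (metric definitions) and C4 (filter logic) are omitted here for brevity.

```python
# Illustrative schema and threshold -- stand-ins, not the production
# amplitude_schema.json or the real C5 cutoff.
SCHEMA = {
    "signup_completed": {"platform": str, "channel": str},
}
MIN_SAMPLE = 100  # assumed C5 threshold

def admissibility_check(event: str, props: dict, sample_size: int) -> list:
    """Return failed checks; an empty list means the finding is certified."""
    failures = []
    if event not in SCHEMA:                          # C1: event exists in schema
        failures.append("C1: unknown event")
        return failures
    for name, value in props.items():                # C2: property spelling + type
        expected = SCHEMA[event].get(name)
        if expected is None:
            failures.append(f"C2: unknown property '{name}'")
        elif not isinstance(value, expected):
            failures.append(f"C2: wrong type for '{name}'")
    if sample_size < MIN_SAMPLE:                     # C5: sample adequacy
        failures.append("C5: sample too small")
    return failures
```

Any non-empty result sends the finding back to the Product Analyst for re-query (Loop 1).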
Loop 2: Evidence
Max 2 retries
Quarantined findings sent back to Product Analyst
Agent 3
Data Scientist
"The Statistician" -- Confounder detection, significance testing, effect size quantification, predictive indicators
Highest priority: confounder detection. For EVERY finding, identifies specific alternative explanations that could create a spurious relationship. Each confounder must be specific and testable -- not vague labels like "user bias," but concrete mechanisms (e.g., "iOS users skew wealthier; when we control for geography, the platform effect drops from 4pp to 1.2pp"). Rates confounder risk as LOW / MEDIUM / HIGH. Quarantines findings with high risk.

Also runs statistical significance tests (chi-squared, z-test) where appropriate, quantifies effect sizes in business terms, and identifies predictive indicators. Does NOT fabricate or assume any numbers not measured in the investigation.
statistical_tests.py
Significance testing
Confounder detection
Effect size (Cohen's h/d)
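The tests named above (z-test, Cohen's h) are standard formulas; here is a sketch using them, assuming two-proportion comparisons. This is textbook math, not the project's statistical_tests.py.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for two proportions,
    h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z-test statistic."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

A large |z| supports significance; |h| gives the magnitude in standardized terms, which the agent then translates into business language.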
Loop 3 target
Agent 4
Product Strategist
"The Translator" -- Adds business context, converts findings into actionable strategy
Sources business context from: (1) Airtable business data (revenue, programs, experiments -- fetched via read-only firewall), (2) Memory Bank files (project context, active priorities, past decisions, prevention rules), (3) investigation mission and implicit signals, (4) strategy frameworks (ICE scoring, Impact vs Effort). Builds impact projections using ONLY measured data -- does not fabricate LTV, revenue-per-user, or any unmeasured financial metric.
impact_model.py
Revenue projection
Sensitivity analysis
ICE scoring
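ICE scoring itself is a simple product of three ratings. A sketch, assuming the common 1-10 convention for each factor; the opportunity names and numbers below are purely illustrative, not outputs of impact_model.py.

```python
def ice_score(impact: float, confidence: float, ease: float) -> float:
    """ICE = Impact x Confidence x Ease, each rated 1-10 (common convention)."""
    for v in (impact, confidence, ease):
        if not 1 <= v <= 10:
            raise ValueError("ICE inputs are rated 1-10")
    return impact * confidence * ease

# Hypothetical opportunities with illustrative ratings:
opportunities = {
    "surface_ai_onboarding": ice_score(8, 6, 7),
    "fix_checkout_drop": ice_score(9, 8, 4),
}
ranked = sorted(opportunities, key=opportunities.get, reverse=True)
```

The ranking feeds the strategy output; the impact inputs themselves must come from measured data, per the no-fabrication rule.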
Loop 3: Challenge
Max 2 retries
Logic-blocked claims routed to Strategist + DS
Agent 5
Red Team
"The Challenger" -- Adversarial review of logic, reasoning, and assumptions (not data or math)
Only reviews findings that PASSED the Data Scientist. Quarantined findings are not re-reviewed (the DS already identified the core problem). Executes a 7-Point Challenge Protocol across 13 bias types.

Distinct from the Data Scientist: the DS challenges the DATA (confounders, statistical validity); the Red Team challenges the LOGIC (could there be an alternative interpretation? Does the reasoning hold? Will the effect hold at scale?). Example: DS validates that AI engagement correlates with conversion. Red Team asks: "If you make AI more prominent, will the new users who discover it convert at the same rate?"
bias_detector.py
13 bias types
7 challenge protocols
Logic-based confounder detection
Agent 6
Verification Analyst
"The Cross-Checker" -- Arithmetic, consistency, fabrication auditing
Runs 5 verification categories: (1) Arithmetic -- recomputes every percentage, ratio, and count independently; (2) Internal Consistency -- checks that numbers across different sections agree with each other; (3) Fabrication Auditing (HARD RULE) -- rejects any claim built on unmeasured data (e.g., if LTV was never measured, any revenue estimate using LTV is flagged as FABRICATED and rejected); (4) Sample Adequacy -- flags small samples; (5) Double-Counting -- catches overlapping populations across findings.
verify_report.py
5 verification categories
Fabrication auditing (hard gate)
Double-counting detection
Deterministic Verification (Pure Python -- No LLM Involved)
Loop 4: Verification
Max 2 retries
Blocked claims sent all the way back to Product Analyst
Triple Check Code Gate
This is the most critical layer in the system. Everything before this was AI agents reviewing each other's work. This part is pure Python. No LLM involved. Every quantitative claim -- every percentage, every count, every ratio -- must pass all 3 independent verification methods or the entire report is blocked from publication.
Method 1
Direct Computation
Recomputes the claimed number from the raw numerator and denominator independently
Method 2
Cross-Reference
Derives the same number via a completely different formula path to confirm convergence
Method 3
Bounds Check
Validates mathematical possibility: percentages 0-100, counts non-negative, funnels decreasing
Any disagreement between methods = claim blocked. Report will not publish.
triple_check.py verification_engine.py number_validator.py
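A minimal sketch of the three-method gate for a percentage claim; the tolerance, signature, and cross-reference path are assumptions, not the real triple_check.py.

```python
def triple_check(numerator: float, denominator: float, claimed_pct: float,
                 tolerance: float = 0.05) -> bool:
    """All 3 independent methods must agree, or the claim is blocked."""
    # Method 1: direct computation from the raw numerator and denominator
    direct = numerator / denominator * 100
    # Method 2: cross-reference via a different formula path
    # (here: 100 minus the complement's percentage)
    cross = 100 - (denominator - numerator) / denominator * 100
    # Method 3: bounds check -- a percentage must lie in [0, 100]
    in_bounds = 0 <= claimed_pct <= 100
    return (abs(direct - claimed_pct) <= tolerance
            and abs(cross - claimed_pct) <= tolerance
            and in_bounds)
```

A single False from this gate holds the entire report from publication.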
Why the Verification Layer Matters
LLMs are powerful at pattern recognition and narrative generation, but they routinely hallucinate numbers. A model might silently round 3.19% to 3.2%, invert a ratio, or confuse a numerator with a denominator. In a business context, a single wrong number can lead to a wrong decision. The verification layer exists to guarantee that no number leaves the system unchecked.
The Problem
AI agents generate insights with numbers embedded in natural language. These numbers are plausible-sounding but not always mathematically correct. Rounding errors, formula mistakes, and hallucinated statistics are common failure modes.
The Solution
Every quantitative claim is extracted as a structured ClaimSpec (with raw inputs, computation type, and expected output). This ClaimSpec is then passed to Python code -- not an LLM -- that independently recomputes, cross-references, and bounds-checks the result using 3 independent methods.
The Guarantee
If any verification method disagrees with the claimed value, the claim is blocked and the entire report is held from publication. The system can refuse to publish. That is the feature -- the ability to say "I don't trust this number" is what separates this from a chatbot.
Memory System -- How the Team Learns
The system does not start from zero on each investigation. It maintains persistent memory files that are injected into every agent's prompt, so past mistakes inform future behavior.
projectContext.md
Stores the environment setup, schema version, analytics platform configuration, and entity counts. Updated when the data source changes. Every agent reads this at the start of their execution so they know exactly what data is available.
activeContext.md
Tracks the current investigation state: which hypothesis is being tested, which agents have completed their work, which feedback loops are active, and what the current status is. This is the "working memory" of the system.
decisionLog.md
Records every decision every agent made, with the rationale behind it. "Product Analyst chose event X over event Y because..." "Red Team flagged survivorship bias because..." This creates a full audit trail of the system's reasoning.
errorLog.md -- Generates PREVENTION RULES
When the system makes an error -- a wrong query, a failed verification, a bias it missed -- it logs the error and generates a PREVENTION RULE. These rules are injected into all future agent prompts.

Example: If Investigation #1 failed because it used a deprecated event name, the prevention rule "Never query 'old_event_name'. Use 'new_event_name' instead." is automatically added to every agent's prompt in Investigation #2 and beyond.
How memory flows into prompts
The orchestrator reads all 4 memory files before each agent runs. It injects the relevant context into the agent's system prompt as structured sections: [ENVIRONMENT], [ACTIVE CONTEXT], [DECISIONS SO FAR], [PREVENTION RULES]. Each agent sees the full history of what happened before it, plus explicit rules about what not to do.
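A sketch of how the orchestrator could assemble a prompt from the four memory files. The file names and section tags match the doc; the function name and formatting are illustrative assumptions.

```python
from pathlib import Path

# The four memory files, mapped to the structured prompt sections they feed.
MEMORY_SECTIONS = [
    ("[ENVIRONMENT]", "projectContext.md"),
    ("[ACTIVE CONTEXT]", "activeContext.md"),
    ("[DECISIONS SO FAR]", "decisionLog.md"),
    ("[PREVENTION RULES]", "errorLog.md"),
]

def build_system_prompt(base_prompt: str, memory_dir: Path) -> str:
    """Prepend the agent's base prompt, then append each memory file
    that exists under its section tag."""
    parts = [base_prompt]
    for tag, filename in MEMORY_SECTIONS:
        path = memory_dir / filename
        if path.exists():
            parts.append(f"{tag}\n{path.read_text()}")
    return "\n\n".join(parts)
```

Missing files are simply skipped, so an agent always gets whatever history exists plus the explicit prevention rules.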
Feedback Loops Summary
Loop 1: Discovery
Analytics Engineer rejects findings with schema violations or data quality issues. Sends feedback to Product Analyst to re-query.
Max 3 retries.
Loop 2: Evidence
Data Scientist quarantines findings that fail significance tests or have high confounder risk. Routes back to Product Analyst.
Max 2 retries.
Loop 3: Challenge
Red Team blocks claims with logic gaps. Routes to Product Strategist and Data Scientist to address.
Max 2 retries.
Loop 4: Verification
Triple Check code gate blocks wrong numbers. Sends all the way back to Product Analyst.
Max 2 retries.
Complete Script Inventory (16 Files)
Verification
triple_check.py
verification_engine.py
number_validator.py
verify_report.py
Security + Schema + Context
amplitude_firewall.py
airtable_firewall.py
schema_validator.py
business_context.py
amplitude_schema.json
Analysis + Storage
hypothesis_formatter.py
query_planner.py
claim_tagger.py
statistical_tests.py
bias_detector.py
impact_model.py
data_store.py