Data Quality Framework

How we measure, maintain, and communicate the quality of every attribute in PropPie's canonical store. Quality is not optional — the Honest Broker's credibility is only as good as the data underneath.

Core principle

Every attribute carries its own quality passport: source, confidence, freshness, lineage. Users and downstream systems never consume a number without knowing how trustworthy it is.


1. The five quality dimensions

| Dimension | Definition | Measured as |
|---|---|---|
| Confidence | How likely is the value to be correct? | 0.0 to 1.0 per attribute instance |
| Freshness | How recent is the underlying observation? | extracted_at timestamp + staleness threshold |
| Completeness | Is the value present and non-null? | % fill rate per attribute across the corpus |
| Lineage | Can we trace the value back to its exact source? | source, source_url, source_doc_id, source_doc_page |
| Consistency | Do overlapping sources agree? | Conflict flag + resolution record |

2. Confidence scoring

2.1 How confidence is assigned

Every extracted or derived value gets a confidence score at write-time. Three methods:

| Method | When used | Example |
|---|---|---|
| Rule-based | Structured fields from authoritative sources | MahaRERA number from RERA page → 0.99 |
| Model-based | LLM/OCR extraction from unstructured docs | Address from scanned PDF → model confidence |
| Human-verified | Manual QA pass | Title chain reviewed by analyst → 0.95 |

2.2 Confidence thresholds

| Tier | Range | Policy |
|---|---|---|
| High | 0.85 – 1.0 | Display to users without qualification |
| Medium | 0.60 – 0.84 | Display with "approximate" label or confidence indicator |
| Low | 0.30 – 0.59 | Display only with explicit caveat; exclude from derived-attribute inputs unless no alternative |
| Unusable | < 0.30 | Do not display; do not use in derivations; flag for re-extraction or manual review |
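
A minimal sketch of how these tiers might be applied at read time; the function and tier names are illustrative, not the pipeline's actual API.

```python
def confidence_tier(score: float) -> str:
    """Map a 0-1 confidence score to a display/usage tier per the table above."""
    if score >= 0.85:
        return "high"        # display without qualification
    if score >= 0.60:
        return "medium"      # display with "approximate" label
    if score >= 0.30:
        return "low"         # display only with explicit caveat
    return "unusable"        # do not display; flag for re-extraction or review
```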

2.3 Confidence propagation in derived attributes

When a derived attribute (e.g., risk.zone_risk_index) depends on multiple inputs:

derived_confidence = min(input_confidences) * formula_confidence

Where formula_confidence is the intrinsic reliability of the computation method (e.g., simple arithmetic = 1.0; Monte Carlo simulation = 0.85; LLM synthesis = 0.80).

This is conservative by design. A chain is only as strong as its weakest link.
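
A minimal sketch of this propagation rule, using the method reliabilities above; the dictionary keys and function name are illustrative.

```python
# Intrinsic reliability of each computation method (values from section 2.3).
FORMULA_CONFIDENCE = {
    "simple_arithmetic": 1.0,
    "monte_carlo": 0.85,
    "llm_synthesis": 0.80,
}

def derived_confidence(input_confidences: list[float], method: str) -> float:
    """Weakest-link propagation: min of inputs, discounted by the method's reliability."""
    return min(input_confidences) * FORMULA_CONFIDENCE[method]

# Example: a derived index built from three inputs via Monte Carlo simulation.
print(derived_confidence([0.95, 0.88, 0.92], "monte_carlo"))  # 0.88 * 0.85 = 0.748
```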

2.4 Confidence display to users

Users never see raw 0-1 numbers. Display as:

| Confidence | User-facing label | Visual |
|---|---|---|
| ≥ 0.85 | (no label — treated as reliable) | Solid text |
| 0.60 – 0.84 | "Approximate" or "Based on limited data" | Slightly muted, info icon |
| 0.30 – 0.59 | "Low confidence — verify independently" | Dashed border, warning icon |
| < 0.30 | Not shown | |

3. Freshness rules

3.1 Staleness thresholds per source

| Source | Expected update frequency | Stale after | Action when stale |
|---|---|---|---|
| MahaRERA project pages | Quarterly (Form C) + ad-hoc | 120 days since last scrape | Re-scrape; flag "last updated X months ago" |
| IGR transactions (priority micromarkets) | Daily | 3 days | Alert pipeline team; fall back to weekly |
| IGR transactions (other) | Weekly | 14 days | Alert; acceptable lag |
| Government Resolutions | Daily | 48 hours | Alert; NLP classifier may miss fast-breaking policy |
| MahaBhulekh / 7-12 | On-demand | 180 days since lookup | Re-fetch on next query |
| GIS layers | Quarterly | 6 months | Acceptable; layers change slowly |
| News | Hourly | 6 hours | Alert if feed breaks |
| Social | Hourly | 6 hours | Alert; non-critical |
| Macro (repo rate, CPI) | Weekly | 14 days | Alert |
| Internal (Fractional) | Real-time / daily | 7 days | Alert |

3.2 Freshness display

Every user-facing data point shows "as of [date]" when the underlying observation is older than:

  • 7 days for transaction data
  • 30 days for project data
  • 90 days for structural/spatial data
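
A sketch of how this rule might be applied at render time; the category keys and helper name are assumptions, not the product's actual code.

```python
from datetime import datetime, timezone

# Show "as of [date]" once the observation is older than these limits (days).
AS_OF_THRESHOLD_DAYS = {"transaction": 7, "project": 30, "structural_spatial": 90}

def needs_as_of_label(category: str, observed_at: datetime) -> bool:
    """True when the data point must carry an 'as of [date]' qualifier.

    `observed_at` is assumed to be timezone-aware.
    """
    age_days = (datetime.now(timezone.utc) - observed_at).days
    return age_days > AS_OF_THRESHOLD_DAYS[category]
```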

3.3 Freshness monitoring

Pipeline produces a daily freshness report:

For each (source, attribute_category):
  - Last successful extraction timestamp
  - Records updated in last 24h / 7d / 30d
  - % of records past staleness threshold
  - Alert: YES/NO

Alerts go to the Slack channel and the pipeline dashboard. Any critical-priority attribute going stale triggers an on-call investigation within 4 hours.
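
A sketch of how one row of the report could be computed from extraction timestamps; the record layout, field names, and the 5% alert threshold (borrowed from the dashboard target in section 8.1) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def freshness_report(records, staleness_days: int):
    """Summarise freshness for one (source, attribute_category) group.

    `records` is assumed to be a list of dicts with a timezone-aware
    `extracted_at` datetime.
    """
    now = datetime.now(timezone.utc)
    last_success = max(r["extracted_at"] for r in records)
    updated_24h = sum(r["extracted_at"] > now - timedelta(days=1) for r in records)
    stale = sum(r["extracted_at"] < now - timedelta(days=staleness_days) for r in records)
    stale_pct = 100.0 * stale / len(records)
    return {
        "last_successful_extraction": last_success,
        "updated_last_24h": updated_24h,
        "pct_past_staleness_threshold": round(stale_pct, 1),
        "alert": stale_pct > 5.0,  # assumed threshold, from the 8.1 dashboard target
    }
```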


4. Completeness rules

4.1 Fill-rate targets

| Priority | Target fill rate | Policy if below |
|---|---|---|
| critical | ≥ 90% across active projects | Pipeline bug or source degradation — investigate immediately |
| high | ≥ 70% | Acceptable gap; note in data-quality dashboard |
| medium | ≥ 50% | Expected for some attributes (floor plans, parking details) |
| low | No target | Nice-to-have |

4.2 Null handling

  • Explicit null with reason: every missing value stores a null_reason enum: source_missing, extraction_failed, not_applicable, awaiting_enrichment
  • No silent nulls: if a field is missing, the system must record why
  • User display: missing critical fields show "Not available — [reason]" not blank space
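
A sketch of what an explicit null might look like in the store; the enum class and record shape are illustrative, with field names drawn from sections 4.2 and 5.1.

```python
from enum import Enum

class NullReason(str, Enum):
    SOURCE_MISSING = "source_missing"
    EXTRACTION_FAILED = "extraction_failed"
    NOT_APPLICABLE = "not_applicable"
    AWAITING_ENRICHMENT = "awaiting_enrichment"

# A missing value is still written, with the reason attached rather than a silent null.
parking_details = {
    "value": None,
    "null_reason": NullReason.SOURCE_MISSING,
    "source": "maharera",
    "extracted_at": "2026-05-15T14:30:00+05:30",
}
```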

5. Lineage and audit trail

5.1 Per-attribute lineage record

Every attribute instance carries:

| Field | Type | Example |
|---|---|---|
| source | enum | maharera, igr, gr_portal, mahabhulekh, internal, enrichment, derived |
| source_url | string | https://maharera.maharashtra.gov.in/project/P52100012345 |
| source_doc_id | string | Internal document UUID |
| source_doc_page | int/null | Page number in PDF |
| extracted_at | datetime | 2026-05-15T14:30:00+05:30 |
| extraction_method | enum | scraper_v3, llm_gpt4o, ocr_tesseract, manual, formula |
| extraction_model_version | string | gpt-4o-2025-12-01 |
| confidence | float | 0.92 |
| confidence_method | enum | rule, model, human |
| human_verified | bool | false |
| human_verified_by | string/null | aishvarya |
| human_verified_at | datetime/null | |
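
Taken together, one attribute instance might carry a lineage block like the following; the dict representation and placeholder values are illustrative, with field names matching the table above.

```python
lineage = {
    "source": "maharera",
    "source_url": "https://maharera.maharashtra.gov.in/project/P52100012345",
    "source_doc_id": "internal-document-uuid",   # placeholder
    "source_doc_page": 4,
    "extracted_at": "2026-05-15T14:30:00+05:30",
    "extraction_method": "llm_gpt4o",
    "extraction_model_version": "gpt-4o-2025-12-01",
    "confidence": 0.92,
    "confidence_method": "model",
    "human_verified": False,
    "human_verified_by": None,
    "human_verified_at": None,
}
```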

5.2 Change history

For attributes that change over time (project cost revisions, completion dates, ownership):

change_history: [
  { value: "2027-03-31", source: "Form B v3", extracted_at: "2026-01-14", confidence: 0.95 },
  { value: "2026-12-31", source: "Form B v2", extracted_at: "2025-06-20", confidence: 0.95 },
  { value: "2025-12-31", source: "Form B v1", extracted_at: "2024-03-15", confidence: 0.95 }
]

This is critical for delay forensics — the number of revisions IS the signal.

5.3 Audit requirements

For DPDP and general defensibility:

  • Immutable extraction log: every extraction event logged with inputs, outputs, model version. Append-only.
  • Derived-attribute audit: every derived score stores the input attribute IDs + values at computation time, so a future query can reconstruct "why was this score X on date Y?"
  • PII audit: all access to PII-containing attributes (investor data) logged with accessor, timestamp, purpose.
  • Retention policy: extraction logs retained for 3 years minimum. PII logs per DPDP retention schedule.

6. Conflict resolution

6.1 When sources disagree

The source-of-truth hierarchy (from data-sources.md):

| Priority | Source | Wins on |
|---|---|---|
| 1 | IGR Index-II | Price, parties, transaction date |
| 2 | MahaRERA | Project metadata, dates, promoter |
| 3 | MahaBhulekh | Land ownership, area |
| 4 | GRs | Policy, infra timelines |
| 5 | GIS layers | Geospatial |
| 6 | Internal (Fractional) | Realised yields, vacancy |
| 7 | Licensed feeds | Where they aggregate primaries |
| 8 | News | Context only |
| 9 | Social | Sentiment only |
| 10 | Listings | Discovery only |

6.2 Conflict resolution process

When a new extraction disagrees with the existing canonical value:

1. If new_source priority > existing_source priority:
     → Replace. Log old value in change_history.

2. If new_source priority == existing_source priority:
     → Use more recent extraction (freshness wins within same tier).
     → Log conflict in conflict_log with both values + sources.

3. If new_source priority < existing_source priority:
     → Do NOT replace. Store as supplementary evidence.
     → Flag if the delta is > 20% on numeric values (investigate).

4. If the delta is > 50% on a numeric attribute from same-tier sources:
     → Flag for human review. Do not auto-resolve.
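
A compact sketch of these rules; the priority mapping follows section 6.1 (rank 1 = highest), the 20% and 50% delta checks follow the steps above, and everything else (names, record shape) is illustrative.

```python
def resolve_conflict(existing, incoming, priority):
    """Apply the rules in 6.2. Returns (resolution, needs_review).

    `existing` / `incoming` are dicts with numeric `value` and `source`;
    `priority` maps source name -> rank (1 = highest, per section 6.1).
    """
    delta = abs(incoming["value"] - existing["value"]) / max(abs(existing["value"]), 1e-9)
    new_rank, old_rank = priority[incoming["source"]], priority[existing["source"]]

    if new_rank < old_rank:                      # rule 1: higher-priority source wins
        return "replaced", False                 # old value is logged in change_history
    if new_rank == old_rank:                     # rules 2 and 4: same tier
        if delta > 0.50:
            return "flagged_for_review", True    # never auto-resolve large same-tier deltas
        return "replaced", False                 # freshness wins; conflict still logged
    # rule 3: lower-priority source never replaces; kept as supplementary evidence
    return "kept_existing", delta > 0.20         # investigate if delta exceeds 20%
```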

6.3 Conflict log

Maintain a queryable conflict log:

| Field | Type |
|---|---|
| attribute_id | string |
| entity_id | string (project/transaction/micromarket) |
| existing_value | any |
| existing_source | string |
| conflicting_value | any |
| conflicting_source | string |
| delta_pct | float |
| resolution | enum: replaced, kept_existing, flagged_for_review |
| resolved_by | string |
| resolved_at | datetime |

Monthly report: top 20 conflicting attributes by volume. Systemic conflicts indicate source degradation.


7. Quality SLAs per attribute priority

| Priority | Confidence floor | Fill rate floor | Max staleness | Conflict resolution SLA |
|---|---|---|---|---|
| critical | 0.85 | 90% | Per source table | 4 hours if flagged |
| high | 0.70 | 70% | Per source table | 24 hours |
| medium | 0.60 | 50% | 2x source table | 72 hours |
| low | No floor | No floor | No enforcement | Best-effort |

Violations against critical SLAs trigger an alert + entry in the quality incident log.


8. Quality monitoring and alerting

8.1 Daily quality dashboard

Generated by pipeline, consumed by product + data teams:

| Metric | Granularity | Target |
|---|---|---|
| Extraction success rate | Per source, per day | > 95% |
| Confidence distribution | Per attribute category | Median > 0.85 for critical |
| Fill rate | Per attribute | Per SLA table |
| Staleness % | Per source | < 5% of records stale |
| Conflict volume | Per attribute | < 2% of records |
| Human review queue depth | Global | < 50 items |

8.2 Alerts

| Alert | Trigger | Channel | Response SLA |
|---|---|---|---|
| Source down | Scraper returns errors for > 1 hour | Slack #pipeline-alerts | 2 hours |
| Confidence drop | Attribute category median drops > 10% day-over-day | Slack | 4 hours |
| Fill rate drop | Critical attribute fill rate drops > 5% | Slack | 4 hours |
| Staleness breach | Any critical attribute source past threshold | Slack | 4 hours |
| Conflict spike | > 10% conflict rate on any attribute in a day | Slack | 24 hours |
| Hallucination report | User reports incorrect AI output | Slack #incidents | 1 hour |

8.3 Quality incident log

Every alert that required intervention is logged:

| Field | Type |
|---|---|
| incident_id | UUID |
| detected_at | datetime |
| alert_type | enum |
| affected_attributes | list |
| affected_entity_count | int |
| root_cause | text |
| resolution | text |
| resolved_at | datetime |
| resolved_by | string |
| user_impact | bool (did any user see incorrect data?) |

Monthly review: patterns, systemic issues, improvement priorities.


9. Human-in-the-loop verification

9.1 When humans verify

  • Always for Title Clarity Score on new projects (before first display)
  • Always for Developer Trust Score on first computation
  • Always for flagged conflicts (> 50% delta, same-tier sources)
  • Sampling (5% random) on all LLM-extracted attributes weekly
  • On user report of any incorrect data

9.2 Verification workflow

1. Item enters review queue with context (attribute, entity, source docs, extraction, confidence)
2. Reviewer confirms or corrects
3. If corrected:
   a. Canonical store updated
   b. Old value moved to change_history
   c. confidence_method set to 'human'
   d. human_verified = true
   e. If correction indicates systemic extraction issue → raise pipeline bug
4. If confirmed:
   a. human_verified = true
   b. confidence boosted by 0.05 (capped at 1.0)
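
A sketch of folding a reviewer's decision back into the record, following the steps above; the record shape and helper name are illustrative.

```python
def apply_review(record: dict, corrected_value=None) -> dict:
    """Fold a human review back into an attribute record per section 9.2."""
    if corrected_value is not None and corrected_value != record["value"]:
        # Correction: preserve the old value in change_history, then overwrite.
        record.setdefault("change_history", []).insert(0, {
            "value": record["value"],
            "source": record["source"],
            "extracted_at": record["extracted_at"],
            "confidence": record["confidence"],
        })
        record["value"] = corrected_value
        record["confidence_method"] = "human"
    else:
        # Confirmation: small confidence boost, capped at 1.0.
        record["confidence"] = min(1.0, record["confidence"] + 0.05)
    record["human_verified"] = True
    return record
```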

9.3 Review capacity planning

At scale (~40,000 projects, ~500 new/month):

  • Title Clarity: ~500 new projects/month → 500 reviews/month
  • Developer Trust: ~100 new developers/year → manageable
  • Sampling: 5% of LLM extractions at ~10 attrs/project × 500 projects × 5% = ~2,500 spot-checks/month
  • Conflicts: estimate ~200/month flagged

Total: ~3,200 review items/month. At 5 min/item, that is ~270 hours/month. Budget for 1.5 FTE of data-quality analyst time, or distribute across the team with tooling.


10. Data quality for AI/LLM outputs

Special rules for derived attributes that involve LLM generation (narratives, summaries, explanations):

10.1 Grounding enforcement

Every LLM-generated output must:

1. Reference only attributes present in the canonical store
2. Not assert any fact not traceable to a stored attribute
3. Include inline citations that resolve to real source_urls
4. Pass a post-generation grounding check (automated: verify all cited numbers match canonical store)
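
A minimal sketch of the automated check in step 4; the claim representation, store interface, and tolerance are assumptions.

```python
def grounding_check(claims, canonical_store, tolerance=0.01) -> bool:
    """Verify every cited number matches the canonical store within tolerance.

    `claims` is assumed to be a list of (attribute_id, cited_value) pairs extracted
    from the generated text; `canonical_store` maps attribute_id -> stored value.
    """
    for attribute_id, cited_value in claims:
        stored = canonical_store.get(attribute_id)
        if stored is None:                      # claim not traceable to a stored attribute
            return False
        if abs(cited_value - stored) > tolerance * max(abs(stored), 1.0):
            return False                        # cited number drifted from the canonical value
    return True
```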

10.2 Hallucination detection

Pipeline for every LLM output:

1. Generate output with citations
2. Extract all factual claims (regex + NLI model)
3. For each claim:
   a. Resolve citation to canonical store
   b. Verify numerical value matches (within tolerance)
   c. Verify entity reference is correct
4. If any claim fails verification:
   a. Flag output, do NOT serve to user
   b. Regenerate with stricter prompt
   c. If still fails: serve partial output with failed claims removed + "I couldn't verify X" note
5. Log all verification results for eval monitoring
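
A sketch of the serve/regenerate gate in steps 4 and 5, assuming callables that verify a single claim and regenerate with a stricter prompt; all names are illustrative.

```python
def gate_llm_output(draft, verify_claim, regenerate, max_attempts=2):
    """Serve only verified output; regenerate on failure, then degrade to partial output.

    `draft` is assumed to be a dict with a `claims` list; `verify_claim` checks one
    claim against the canonical store; `regenerate` reruns generation with a
    stricter prompt targeting the failed claims.
    """
    for _ in range(max_attempts):
        failed = [c for c in draft["claims"] if not verify_claim(c)]
        if not failed:
            return draft, []            # every claim verified: safe to serve
        draft = regenerate(draft, failed)
    # Still failing after regeneration: the caller serves a partial answer with the
    # failed claims removed and an explicit "I couldn't verify X" note per claim.
    failed = [c for c in draft["claims"] if not verify_claim(c)]
    return draft, failed
```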

10.3 Tone compliance

Every LLM output is checked for Honest Broker compliance:

  • No "you should" / "I recommend" / "good investment" language (regex + classifier)
  • Probability framing present on projections
  • "I don't know" present when confidence is low

Violations: block output, regenerate with compliance prompt injection.
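
A sketch of the regex half of this check; the pattern list is drawn from the phrases above and is illustrative, with a classifier assumed to sit behind it for paraphrases.

```python
import re

# Phrases the Honest Broker never uses; a classifier catches paraphrases these miss.
PROHIBITED = [
    r"\byou should\b",
    r"\bI recommend\b",
    r"\bgood investment\b",
]

def violates_tone(text: str) -> bool:
    """True if the output contains prescriptive or recommending language."""
    return any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in PROHIBITED)
```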


11. Data quality culture

11.1 Ownership

Every attribute category has a named owner (see data-attributes.md). The owner is accountable for quality SLAs on their attributes.

11.2 Weekly quality standup

15-minute weekly review:

  • Top 5 quality incidents
  • Fill rate and staleness trends
  • Human review queue depth
  • Any new conflict patterns

11.3 Quarterly quality retro

  • Full SLA compliance review
  • Confidence calibration: are our confidence scores accurate? (Compare human-verified subset against automated scores.)
  • Extraction model performance review
  • Source health review (any sources degrading?)
  • Update SLA thresholds if needed

12. Bootstrapping quality (first 90 days)

During the foundation phase, before full automation:

| Week | Focus |
|---|---|
| 1-2 | Establish confidence baseline: manually grade 100 projects across all critical attributes |
| 3-4 | Set up freshness monitoring for MahaRERA + IGR scrapers |
| 5-6 | Implement automated conflict detection on first 1,000 projects |
| 7-8 | Deploy hallucination detection on LLM extraction pipeline |
| 9-10 | First quality dashboard live; first weekly standup |
| 11-12 | First derived-attribute human verification pass (Title Clarity, Developer Trust) |

See also:

  • data-attributes.md — attribute catalogue with per-attribute ownership
  • derived-attributes-spec.md — math specs including validation methods
  • pipeline-spec-for-vishal.md — SLAs consumed by pipeline
  • ../../.cursor/skills/proppie-data-sources/SKILL.md — source hierarchy