Data Sources — Detailed Catalogue

This is the narrative companion to /.cursor/skills/proppie-data-sources/SKILL.md. The SKILL is the quick-reference; this doc has the longer narrative, the layered architecture diagram, the source-of-truth rules, and the gap analysis.

The five layers of PropPie's data

flowchart TD
    L0[Layer 0: Raw Capture<br/>Scrapers / APIs / Document OCR / NLP]
    L1[Layer 1: Canonical Extraction<br/>Vishal's schema v2 - structured per source]
    L2[Layer 2: Entity Resolution + Fusion<br/>Same project across RERA + IGR + News]
    L3[Layer 3: Derived Attributes<br/>Scores, indices, velocities, trajectories]
    L4[Layer 4: Product-facing API<br/>Asset cards, dashboards, conversational]

    L0 --> L1 --> L2 --> L3 --> L4

This document covers L0 and feeds L1. The derivations live in derived-attributes-spec.md. Lineage/confidence/freshness rules live in data-quality-framework.md.


Source-of-truth hierarchy

When two sources disagree about the same attribute, resolve in this order:

| Priority | Source | Wins on |
| --- | --- | --- |
| 1 | IGR Index-II (registered deed) | Sale price, transaction date, parties, stamp duty paid |
| 2 | MahaRERA project filings | Project metadata, plans, completion dates, promoter, financials |
| 3 | MahaBhulekh / 7-12 / Property Card | Land ownership, area, mutations |
| 4 | Government Resolutions (GRs) | Policy, infra projects, zoning |
| 5 | GIS layers (MRSAC, OSM, Bhuvan) | Geospatial, proximity, environmental |
| 6 | Internal data (Fractional platform) | Realised yields, vacancy, secondary trades |
| 7 | Licensed feeds (PropEquity, CRE Matrix) | Where they aggregate primary sources |
| 8 | News articles | Event detection, context — not factual claims |
| 9 | Social / forums | Sentiment only, never factual |
| 10 | Listing portals | Discovery only, not pricing truth |

Every attribute carries its source, source_url, extracted_at, and confidence — so resolution is auditable.
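As a sketch, the resolution rule above reduces to a priority lookup with deterministic tie-breaking. The `SOURCE_PRIORITY` keys, the `AttributeRecord` shape, and the tie-break order (freshness, then confidence) are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical source keys mirroring the hierarchy above (lower number wins).
SOURCE_PRIORITY = {
    "igr_index_ii": 1, "maharera": 2, "mahabhulekh": 3, "gr": 4, "gis": 5,
    "internal": 6, "licensed": 7, "news": 8, "social": 9, "listing": 10,
}

@dataclass
class AttributeRecord:
    value: object
    source: str           # key into SOURCE_PRIORITY
    source_url: str
    extracted_at: datetime
    confidence: float     # 0..1

def resolve(records: list[AttributeRecord]) -> AttributeRecord:
    """Pick the winner: highest-priority source first, then the
    freshest extraction, then the highest confidence."""
    return min(records, key=lambda r: (
        SOURCE_PRIORITY[r.source],
        -r.extracted_at.timestamp(),
        -r.confidence,
    ))
```

Because every record keeps its source and source_url, the losing values can still be surfaced in an audit trail rather than silently discarded.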


Source #1 — MahaRERA

Access

  • Portal: https://maharera.maharashtra.gov.in/
  • API: None public
  • Approach: Scraping with robust monitors
  • Polite limits: 2 req/sec sustained, courtesy delays during peak hours
  • robots.txt: Generally permissive for project pages
  • Legal basis: Public records under RERA Act § 11
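A minimal sketch of the polite-limits point above, assuming a single-threaded fetcher. `PoliteFetcher` is a hypothetical helper and omits the retries, backoff, and robots.txt handling a production scraper needs:

```python
import time
import urllib.request

class PoliteFetcher:
    """Keeps a sustained request rate (default 2 req/sec). Illustrative
    only: no retries, backoff, robots.txt checks, or error handling."""

    def __init__(self, max_per_sec: float = 2.0):
        self.min_interval = 1.0 / max_per_sec
        # Start "one interval ago" so the first request is not delayed.
        self._last = time.monotonic() - self.min_interval

    def _throttle(self) -> None:
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()

    def fetch(self, url: str) -> bytes:
        self._throttle()
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()
```

If scraping is parallelised, the same budget has to be enforced across workers (e.g. a shared token bucket), not per worker.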

What we extract (mapped to Vishal's schema v2)

Vishal's schema covers this comprehensively. The project_identity, location, land_and_area_metrics, plot_details, parties, legal_and_compliance, rera_form_b, architectural_plans, escrow_bank_account, loan_disclosure, financial_details, project_info, documents objects all derive primarily from MahaRERA.

Document types ingested

| Document | What it tells us | Quality | Typical issues |
| --- | --- | --- | --- |
| Form A (application) | Project identity, promoter, location | Good | Older filings have scanned PDFs |
| Form B (project details) | Phases, towers, units, timeline, costs | Good | Free-form text in some fields |
| Form C (quarterly progress) | Construction status updates | Medium | Often perfunctory, hard to verify |
| Building Plan Approval | Sanctioned built-up, FSI, layout | Variable | Large PDFs, plan drawings |
| Commencement Certificate (CC) | Construction legally started | Good | |
| Occupancy / Part Occupancy Cert | Habitability | Good | |
| Title Search Report | Chain of title summary | Variable | Lawyer-written, varying detail |
| Title Certificate | Legal opinion on title | Variable | |
| Search Report (SR office) | Encumbrances | Variable | |
| Land Plan, Site Plan | Layout | Variable | Vector vs scanned |
| Layout Plan, Floor Plan | Spatial layout | Variable | |
| Sale Agreement (template) | Buyer terms | Good | Often boilerplate |
| Allotment Letter (template) | Booking terms | Good | |
| Parking Plan | Parking provision | | Sometimes missing |
| Environment Clearance | EC certificate | Sparse | Only large projects |
| Stamp Payment Receipt | Stamp duty paid | | Indirect — implied by stamp duty rate |

Update cadence

  • Project page re-scrape: Monthly
  • Document delta: Detect file-list changes, re-extract changed documents
  • Form C (progress): Quarterly per project
  • Complaint records: Weekly

Known gotchas

  1. Shell registrations — some projects register but never upload documents. Detect via documents count = 0.
  2. Name changes — promoter mergers/demergers don't auto-link. Use PAN/CIN as primary key.
  3. Phase confusion — large projects register phases separately; reconcile via alternate_names and address proximity.
  4. Withdrawn projects — re-appear later under different names. Maintain a withdrawal-aware history.
  5. Scanned PDFs — older filings (pre-2020) often scanned; OCR + LLM extraction needed.
  6. Marathi text — addresses, descriptions, court orders often in Marathi.
  7. Bulk download is rate-limited — be polite, parallelise carefully.
  8. Date formats — DD/MM/YYYY, DD-MM-YYYY, "31st March 2026" all appear; normalise on extraction.
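Gotcha 8's normalisation can be sketched as below. The three format strings cover exactly the spellings listed above; real filings will need more, so treat the list as a starting assumption:

```python
import re
from datetime import datetime

# Formats seen in MahaRERA filings (per gotcha 8); extend as new ones appear.
_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %B %Y")

def normalise_date(raw: str) -> str:
    """Return ISO 8601 (YYYY-MM-DD) for the date spellings listed above."""
    # Strip ordinal suffixes: "31st March 2026" -> "31 March 2026".
    cleaned = re.sub(r"(\d{1,2})(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")
```

Normalising at extraction time (as the gotcha says) means every downstream layer sees one canonical format.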

Volume estimates

  • ~40,000 registered projects (residential + commercial) since 2017
  • ~5-10 documents per project on average → ~300,000 documents total
  • ~500-1,000 new project registrations per month
  • ~50,000 quarterly updates per quarter

Source #2 — IGR Maharashtra (Inspector General of Registration)

Access

  • Public portal: https://igrmaharashtra.gov.in/
  • Search interface: District/SR-office → Year → Index-II lookup
  • Bulk feeds: Licensed via PropEquity, CRE Matrix, etc. (₹2-25 Lakh/year/state)
  • Approach: Hybrid — official portal for sample/spot-check, licensed feed for bulk
  • Legal basis: Public records under Registration Act 1908 § 57

Index-II structure (what each deed gives us)

| Field | Always present | Notes |
| --- | --- | --- |
| Deed number, SR office, sub-district, registration date | Yes | Primary key |
| Type of deed (Sale, L&L, Gift, Mortgage, etc.) | Yes | |
| Property identifier (CTS / Survey, address) | Yes | Address often Marathi |
| Parties (vendor / vendee / lessor / lessee) | Yes | With addresses, sometimes PAN |
| Consideration (sale price OR rent + tenure) | Yes | The declared amount |
| Market value (per ASR) | Yes | The government's view |
| Stamp duty + registration fee paid | Yes | |
| Document URL / scan | Sometimes | Often paywalled or rate-limited |

Update cadence

  • Daily for priority micromarkets (Pune Tier-1, MMR core, Bhiwandi warehousing)
  • Weekly for tier-2 micromarkets
  • Monthly rollups for trend analytics

Known gotchas

  1. Declared price ≠ actual price for high-value transactions (cash component is illegal but real). ASR Gap % helps detect; we flag, never accuse.
  2. Marathi property descriptions require good translation pipeline.
  3. Same property registered multiple times (correction deeds, family transfers). Dedupe by (property_id, parties, date proximity).
  4. L&L agreements with tenure under 11 months are typically structured to avoid compulsory registration; when one is registered anyway, it often warrants scrutiny.
  5. Stamp duty rates differ by deed type, property type (residential vs commercial), and exemption status.
  6. Indexed within 24-72 hours of registration — sub-day claims should be hedged.
  7. Bulk feed costs are non-trivial; budget item.
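The ASR Gap % heuristic from gotcha 1 can be sketched as below. The 20% review threshold is an assumed value for illustration, not a calibrated cut-off, and the output is only ever an internal review flag, never an accusation:

```python
def asr_gap_pct(declared: float, asr_market_value: float) -> float:
    """Percentage by which the declared consideration falls below the
    government ASR market value (negative = declared above ASR)."""
    return round(100.0 * (asr_market_value - declared) / asr_market_value, 2)

def flag_for_review(declared: float, asr_market_value: float,
                    threshold_pct: float = 20.0) -> bool:
    # Flags the transaction for internal review only; we flag, never accuse.
    return asr_gap_pct(declared, asr_market_value) >= threshold_pct
```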

Linking IGR → MahaRERA

The hardest fusion problem. We use:

  • Survey / CTS number overlap
  • Geocoded address proximity (< 200m + textual fuzzy match)
  • Promoter name match (Class I) + parties match for sales from promoter to buyer
  • Project name in property description

Expected match rate: 40-60% for residential; higher for promoter→buyer commercial transactions, lower for the secondary market.
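The proximity-plus-fuzzy gate above can be sketched with the standard library. `difflib` stands in for whatever fuzzy matcher the pipeline actually uses, and the 0.6 similarity cut-off is an illustrative assumption:

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def candidate_match(igr_rec: dict, rera_rec: dict) -> bool:
    """The gate from the doc: geocoded proximity under 200 m AND a fuzzy
    textual match (0.6 is an assumed cut-off for illustration)."""
    close = haversine_m(igr_rec["lat"], igr_rec["lon"],
                        rera_rec["lat"], rera_rec["lon"]) < 200.0
    similar = name_similarity(igr_rec["address"], rera_rec["project_name"]) >= 0.6
    return close and similar
```

Survey/CTS overlap and promoter-party matching (the stronger signals) would be checked before falling back to this gate.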


Source #3 — Government Resolutions (GRs) and Notifications

Access

  • Master portal: https://gr.maharashtra.gov.in/
  • Department portals: UDD, Revenue, Housing, Industries, Environment, Town Planning
  • Approach: Scraping + RSS where available
  • Update cadence: Daily ingest, hourly during volatile periods (budget season, election notifications)

Departments to track

| Department | Why it matters |
| --- | --- |
| Urban Development Department (UDD) | Master plan changes, TP schemes, DCR notifications, FSI revisions, TDR |
| Revenue Department | Land conversion, NA orders, stamp duty changes |
| Housing Department | Affordable housing schemes, slum rehabilitation |
| Public Works Department | Road, bridge, infrastructure projects |
| Industries Department | MIDC plot allotments, SEZ notifications |
| Environment Department | EC clearances, CRZ, forest |
| Town Planning | DP changes, zoning, reserve plot use changes |

NLP pipeline

GRs are unstructured. Pipeline:

  1. Ingest PDFs from portals (often Marathi)
  2. OCR + translation to English (LLM-assisted for nuance)
  3. Classify real-estate relevance (binary: relevant/not, then category)
  4. Entity extraction — affected districts, talukas, project names, survey numbers, coordinates, effective dates
  5. Impact scoring — high/medium/low impact on real-estate markets
  6. Store with full text + classification + entities + impact
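Steps 3-4 of the pipeline can be sketched with a keyword gate and a regex. The production classifier is LLM-assisted, so treat `RELEVANCE_TERMS` and the survey-number pattern below as illustrative assumptions, not the real model:

```python
import re

# Illustrative keyword gate only; the production relevance classifier is
# LLM-assisted and works on the translated full text.
RELEVANCE_TERMS = ("FSI", "TDR", "DCR", "zoning", "land acquisition",
                   "stamp duty", "development plan", "NA order")

# Assumed pattern for survey / gat numbers such as "Survey No. 123/4".
SURVEY_NO = re.compile(
    r"\b(?:survey|s\.?\s?no\.?|gat)\s*(?:no\.?\s*)?(\d+(?:/\d+)*)\b",
    re.IGNORECASE,
)

def is_relevant(text: str) -> bool:
    lower = text.lower()
    return any(term.lower() in lower for term in RELEVANCE_TERMS)

def extract_survey_numbers(text: str) -> list[str]:
    return SURVEY_NO.findall(text)
```

Even a crude gate like this matters because ~80% of GRs are irrelevant (see gotchas below); anything it passes still goes through the full classification and entity-extraction steps.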

Known gotchas

  1. ~80% of GRs are irrelevant; classifier critical to keep noise out
  2. Effective date ≠ publication date — extract both
  3. Amendments — a single decision may have 4+ amendment GRs; chain them
  4. Annexures with coordinates / shape data are often what matters most
  5. Marathi nuance — the same word, e.g. "उपलब्ध" (available), can mean very different things in different bureaucratic constructions; LLM-assisted translation handles this

Coverage targets

  • 100% of GRs from priority departments (UDD, Revenue, Housing) within 24 hours of publication
  • 100% archive back to 2017 for trend analysis

Source #4 — MahaBhulekh and Land Records

Access

  • Rural 7/12: https://bhulekh.mahabhumi.gov.in/
  • Digital 7/12: https://digitalsatbara.mahabhumi.gov.in/
  • Property Card (urban): City-specific portals (Mumbai's is at https://mahabhulekh.maharashtra.gov.in/PropertyCard)
  • Approach: Per-property lookup (no bulk available); on-demand caching

What we extract

| Attribute | Source | Notes |
| --- | --- | --- |
| Holder name(s), share | 7-12 / PC | |
| Area (hectares + gunthas / sqm) | 7-12 / PC | |
| Land class (Class I / Class II) | 7-12 | Class II = restricted transfer |
| Crops / use | 7-12 | Agricultural |
| Mutations (changes in ownership) | 7-12 / PC | Dates + reasons |
| Encumbrances / charges | 7-12 (Other Rights column) | Often partial |
| NA conversion order ref | 7-12 / Revenue records | If applicable |

Known gotchas

  1. Survey numbers split over time — track mutation chain
  2. Class II land can't be transferred without permission — always flag
  3. 7-12 stale on mutations by 30-90 days — note last_mutation_date
  4. Urban CTS ≠ rural survey numbers — different systems
  5. Property Cards are less digitised; some cities require physical request

Source #5 — GIS layers

Sources by use case

| Use case | Primary source | Secondary | License |
| --- | --- | --- | --- |
| Flood / inundation zones | MRSAC | BMC stormwater drainage data | MRSAC may require request |
| Forest cover | MRSAC + Forest Dept GIS | ISRO Bhuvan | Public + restricted |
| CRZ (Coastal Reg Zone) | MoEFCC notifications | MRSAC | Public |
| Land use / zoning | City Development Plan shapefiles | Bhuvan | Variable |
| Roads | OpenStreetMap | Google Roads API | OSM open / Google paid |
| Transit (metro, suburban rail) | Maha-Metro, MMRC, MMRDA | OSM | Mostly public |
| MIDC industrial boundaries | MIDC portal | | Public |
| SEZ boundaries | Government notifications | | Public |
| POIs (schools, hospitals, malls) | Google Places API | OSM | Google paid / OSM free |
| Demographic micro-areas | Census 2011 boundaries | MRSAC | Public but stale |

Pipeline

  • Normalise all coordinates to EPSG:4326 on ingest
  • Vector layers stored in PostGIS
  • Spatial queries via PostGIS / GeoPandas; complex routing via OSRM
  • Heat layers (sentiment, velocity) computed via H3 hex grid for consistent aggregation
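For spot checks outside the database, the zone-containment question behind these layers ("is this property in a flood zone?") is an ordinary point-in-polygon test. In production this is a PostGIS query, so treat this pure-Python ray-casting version as a sketch:

```python
def point_in_polygon(lat: float, lon: float,
                     ring: list[tuple[float, float]]) -> bool:
    """Ray-casting test: is (lat, lon) inside a polygon ring of
    (lat, lon) vertices in EPSG:4326? Sketch only — no holes,
    multipolygons, or edge-case handling."""
    inside = False
    n = len(ring)
    for i in range(n):
        y1, x1 = ring[i]
        y2, x2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            # Longitude where the polygon edge crosses this latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```

The PostGIS equivalent is roughly `ST_Contains(zone.geom, ST_SetSRID(ST_MakePoint(lon, lat), 4326))`, which is what bulk queries should use.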

Known gotchas

  1. Coordinate systems vary (WGS84, EPSG:32643). Normalise on ingest.
  2. MRSAC layers often need formal request — budget time
  3. DP shapefiles poor for tier-2 cities
  4. OSM building footprints incomplete in new developments
  5. Bhuvan older than ground truth in fast-changing areas

Source #6 — News, analyst reports, social

News sources

| Source | Type | Notes |
| --- | --- | --- |
| Economic Times Realty | Mainstream, English | Daily |
| Hindustan Times Real Estate | Mainstream, English | Daily |
| Mint Real Estate | Analytical, English | Weekly + breaking |
| Business Standard | Business, English | Daily |
| Moneycontrol Real Estate | Investor angle, English | Daily |
| Loksatta / Maharashtra Times | Local, Marathi | Daily; hyperlocal infra news |
| Knight Frank, JLL, CBRE, Anarock, Cushman, Colliers reports | Analyst, English | Quarterly |
| MagicBricks / Housing / NoBroker blogs | Aspirational, English | Discovery only |

Pipeline

  1. Ingest (RSS + scraping + occasional vendor API)
  2. Entity linking — tag each article to project / micromarket / developer / authority / GR
  3. Sentiment — VADER + LLM-based per-entity sentiment
  4. Topic classification — pricing / launch / delay / regulation / infra / dispute
  5. Aggregate weekly per entity → input to derived sentiment score
  6. Never quote as our own claim — always attribute
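Step 5's weekly roll-up can be sketched as a grouped mean over (entity, ISO week) buckets. The article dict shape — `entity`, `published` (a date), `sentiment` in [-1, 1] — is an assumption for this sketch:

```python
from collections import defaultdict
from datetime import date

def weekly_entity_sentiment(articles: list[dict]) -> dict:
    """Aggregate per-article, per-entity sentiment into
    (entity, ISO year, ISO week) buckets, returning the mean score."""
    buckets = defaultdict(list)
    for art in articles:
        iso = art["published"].isocalendar()
        buckets[(art["entity"], iso[0], iso[1])].append(art["sentiment"])
    return {key: round(sum(vals) / len(vals), 3)
            for key, vals in buckets.items()}
```

Aggregating to the entity level before anything user-facing is what keeps rule 6 (attribute, never quote as our own claim) enforceable downstream.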

Social listening

| Source | Use |
| --- | --- |
| Twitter / X | Real-time event, complaint sentiment |
| Reddit (r/india, r/mumbai, r/pune, r/IndianStreetBets) | Investor sentiment, complaints |
| YouTube (real-estate channels) | Comments + transcript sentiment |
| Local forums (Pune Whispers etc.) | Hyperlocal trust signal |

Vendor: Brandwatch / Talkwalker / Mention (₹15-30L/yr) OR self-built scraping + Meilisearch.

Always aggregate to entity-level signal. Never cite individual posts in user-facing content (defamation risk).

Cadence

  • News: Hourly
  • Social: Hourly
  • Analyst reports: Quarterly

Source #7 — Internal data (PropPie Fractional)

What we have

| Data | Sensitivity | Use |
| --- | --- | --- |
| Listed assets and SPV structures | Commercial | Ground truth for asset attributes, AI overlay |
| Investor KYC records | High (PII) | Personalisation (consent-required); never AI training without consent |
| Investment portfolios | Medium | Portfolio-fit recommendations (info-only) |
| Distribution history | Medium | Realised yields per asset — gold standard |
| Vacancy / occupancy data | Medium | Ground truth for cap-rate / yield validation |
| Secondary market trades | Medium | Liquidity signals |
| Asset documents (lease, title, valuation) | Variable | Document Q&A |

Access governance

  • AI layer gets pseudonymised investor data only (no PAN/Aadhaar in prompts)
  • Asset-level documents are commercial data — accessible
  • PII vault separate from analytics layer
  • Per-investor consent for AI features (separate from KYC consent)

Source #8 — Third-party / licensed feeds (budget)

| Feed | What it gives | Approx cost (₹/year) | Necessity |
| --- | --- | --- | --- |
| PropEquity | Pan-India transaction + project database | 5-25L | High — speeds up B2B Analytix |
| CRE Matrix | Commercial lease + sale transactions | 3-15L | High for B2B commercial |
| Liases Foras | Residential micromarket data + forecasts | 5-20L | Medium |
| CIBIL Commercial | Developer credit ratings | Per-pull | Medium for Developer Trust Score |
| MCA21 / Tofler / Probe42 | Company filings, directors, litigation | 1-5L | Medium for promoter diligence |
| CERSAI | Property charges registry | API access | Required for encumbrance checks |
| News API (Aggregate Intelligence, etc.) | Structured news | 2-10L | Optional — can self-scrape |
| Brandwatch / Talkwalker | Social listening | 15-30L | Optional — can self-build |
| Skymet / IMD | Weather / climate | Free + paid tiers | Medium for climate-risk attributes |
| Google Maps Platform | POIs, distance matrix | Per-call | Medium |

Total budget estimate for "full data stack" pre-revenue: ₹40-80L/year. Cheaper alternative (self-scrape primary, skip licensed feeds where possible): ₹15-25L/year + significant engineering time.


Gaps — Data we don't have but want

These are explicit. Call out the gap whenever a product feature depends on them.

| Gap | Workaround today | Future fix |
| --- | --- | --- |
| Demographic micro-data (ward-level income, education) | Census 2011 + proxies | Anonymised app-analytics partnerships; targeted surveys |
| Employment density shapefile | SEZ / IT-park boundaries | Telco data partnership (anonymised) |
| Crime statistics by area | NCRB aggregates only | RTI requests for selected hotspots |
| School / hospital quality | Review proxies | Partnership with rating agencies |
| Construction quality / structural audits | Only when complaints surface | Independent audit partnership |
| Builder financial stress | Partial via MCA + news | Subscription to credit feeds |
| NRI investor sentiment | None | Custom surveys / community partnerships |
| Real-time policy impact (post-GR) | News + manual analysis | Government engagement / advisory |
| Sub-market commute times | Google Distance Matrix (paid) | Local mobility partnerships |
| Property condition (interior, age, refurbishment) | Listings (unreliable) | User-submitted with verification |

Putting it together — the source landscape

flowchart LR
    subgraph Govt[Government / Public]
        RERA[MahaRERA]
        IGR[IGR Index-II]
        BHU[MahaBhulekh / PC]
        GR[GR Portals]
        MRSAC[MRSAC GIS]
    end

    subgraph Lic[Licensed / Paid]
        PE[PropEquity]
        CRE[CRE Matrix]
        CIBIL[CIBIL Commercial]
        MCA[MCA / Tofler]
        Brand[Brandwatch]
    end

    subgraph Open[Open / Free]
        OSM[OpenStreetMap]
        News[News RSS]
        Soc[Social Scraping]
        Bhuvan[ISRO Bhuvan]
    end

    subgraph Int[Internal]
        Frac[PropPie Fractional DB]
    end

    Govt --> Pipeline[Vishal's Pipeline]
    Lic --> Pipeline
    Open --> Pipeline
    Int --> Pipeline

    Pipeline --> CanonicalStore[Canonical Attribute Store]
    CanonicalStore --> Products[Broker / Analytix / Fractional AI overlay]

Source-attribution display rules (user-facing)

When showing data to users, follow this:

| Attribute type | Display format |
| --- | --- |
| RERA-sourced | "Per MahaRERA Form B v3 (filed 14 Jan 2026)" + link |
| IGR-sourced | "Per registered sale deed, Pune SR-7, 22 Mar 2026" + link |
| Aggregated transactions | "Median of 23 registered transactions, Jan-Mar 2026, ±5% confidence" |
| GR-sourced | "Per GR No. UDD/2025/123, dated 14 Aug 2025" + link |
| News-sourced | "As reported by ET Realty, 18 Feb 2026" + link |
| Derived (Score) | "PropPie-derived score; click to see inputs" (links to expand) |
| Forecast | "Model projection; range ±X% at 80% confidence" with assumptions |
| Internal (Fractional) | "From PropPie Fractional asset records; verified" |

Display the source on every user-facing data point. No exceptions.
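A sketch of how the table above could drive rendering. The template set and the `record` field names are assumptions for illustration, not the product API:

```python
def attribution_line(record: dict) -> str:
    """Render a user-facing attribution string from a source record.
    Templates mirror the display-rules table; field names are assumed."""
    templates = {
        "rera": "Per MahaRERA {doc} (filed {date})",
        "igr": "Per registered sale deed, {office}, {date}",
        "gr": "Per GR No. {gr_no}, dated {date}",
        "news": "As reported by {outlet}, {date}",
    }
    # str.format ignores unused keys, so records may carry extra fields.
    return templates[record["source_type"]].format(**record)
```

A rendering layer built this way fails loudly (KeyError) on an unknown source type rather than showing an unattributed number, which is the behaviour the "no exceptions" rule wants.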


See also:

  • /.cursor/skills/proppie-data-sources/SKILL.md — quick reference
  • data-attributes.md — the ~140-attribute spec
  • derived-attributes-spec.md — math for derived attributes
  • data-quality-framework.md — lineage / confidence / freshness
  • pipeline-spec-for-vishal.md — contract for the pipeline