Data Sources — Detailed Catalogue

This is the narrative companion to /.cursor/skills/proppie-data-sources/SKILL.md. The SKILL is the quick-reference; this doc has the longer narrative, the layered architecture diagram, the source-of-truth rules, and the gap analysis.

The five layers of PropPie's data

flowchart TD
    L0[Layer 0: Raw Capture<br/>Scrapers / APIs / Document OCR / NLP]
    L1[Layer 1: Canonical Extraction<br/>Vishal's schema v2 - structured per source]
    L2[Layer 2: Entity Resolution + Fusion<br/>Same project across RERA + IGR + News]
    L3[Layer 3: Derived Attributes<br/>Scores, indices, velocities, trajectories]
    L4[Layer 4: Product-facing API<br/>Asset cards, dashboards, conversational]

    L0 --> L1 --> L2 --> L3 --> L4

This document covers L0 and feeds L1. The derivations live in derived-attributes-spec.md. Lineage/confidence/freshness rules live in data-quality-framework.md.


Source-of-truth hierarchy

When two sources disagree about the same attribute, resolve in this order:

| Priority | Source | Wins on |
| --- | --- | --- |
| 1 | IGR Index-II (registered deed) | Sale price, transaction date, parties, stamp duty paid |
| 2 | MahaRERA project filings | Project metadata, plans, completion dates, promoter, financials |
| 3 | MahaBhulekh / 7-12 / Property Card | Land ownership, area, mutations |
| 4 | Government Resolutions (GRs) | Policy, infra projects, zoning |
| 5 | GIS layers (MRSAC, OSM, Bhuvan) | Geospatial, proximity, environmental |
| 6 | Internal data (Fractional platform) | Realised yields, vacancy, secondary trades |
| 7 | Licensed feeds (PropEquity, CRE Matrix) | Where they aggregate primary sources |
| 8 | News articles | Event detection, context — not factual claims |
| 9 | Social / forums | Sentiment only, never factual |
| 10 | Listing portals | Discovery only, not pricing truth |

Every attribute carries its source, source_url, extracted_at, and confidence — so resolution is auditable.
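As a sketch, the resolution rule above reduces to a priority lookup with deterministic tie-breaking. The `SOURCE_PRIORITY` keys, the `AttributeRecord` shape, and the tie-break order (freshness, then confidence) are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical source keys mirroring the hierarchy above (lower number wins).
SOURCE_PRIORITY = {
    "igr_index_ii": 1, "maharera": 2, "mahabhulekh": 3, "gr": 4, "gis": 5,
    "internal": 6, "licensed": 7, "news": 8, "social": 9, "listing": 10,
}

@dataclass
class AttributeRecord:
    value: object
    source: str           # key into SOURCE_PRIORITY
    source_url: str
    extracted_at: datetime
    confidence: float     # 0..1

def resolve(records: list[AttributeRecord]) -> AttributeRecord:
    """Pick the winner: highest-priority source first, then the
    freshest extraction, then the highest confidence."""
    return min(records, key=lambda r: (
        SOURCE_PRIORITY[r.source],
        -r.extracted_at.timestamp(),
        -r.confidence,
    ))
```

Because every record keeps its source and source_url, the losing values can still be surfaced in an audit trail rather than silently discarded.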


Source #1 — MahaRERA

Access

  • Portal: https://maharera.maharashtra.gov.in/
  • API: None public
  • Approach: Scraping with robust monitors
  • Polite limits: 2 req/sec sustained, courtesy delays during peak hours
  • robots.txt: Generally permissive for project pages
  • Legal basis: Public records under RERA Act § 11
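A minimal sketch of the polite-limits point above, assuming a single-threaded fetcher. `PoliteFetcher` is a hypothetical helper and omits the retries, backoff, and robots.txt handling a production scraper needs:

```python
import time
import urllib.request

class PoliteFetcher:
    """Keeps a sustained request rate (default 2 req/sec). Illustrative
    only: no retries, backoff, robots.txt checks, or error handling."""

    def __init__(self, max_per_sec: float = 2.0):
        self.min_interval = 1.0 / max_per_sec
        # Start "one interval ago" so the first request is not delayed.
        self._last = time.monotonic() - self.min_interval

    def _throttle(self) -> None:
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()

    def fetch(self, url: str) -> bytes:
        self._throttle()
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()
```

If scraping is parallelised, the same budget has to be enforced across workers (e.g. a shared token bucket), not per worker.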

What we extract (mapped to Vishal's schema v2)

Vishal's schema covers this comprehensively. The project_identity, location, land_and_area_metrics, plot_details, parties, legal_and_compliance, rera_form_b, architectural_plans, escrow_bank_account, loan_disclosure, financial_details, project_info, documents objects all derive primarily from MahaRERA.

Document types ingested

| Document | What it tells us | Quality | Typical issues |
| --- | --- | --- | --- |
| Form A (application) | Project identity, promoter, location | Good | Older filings have scanned PDFs |
| Form B (project details) | Phases, towers, units, timeline, costs | Good | Free-form text in some fields |
| Form C (quarterly progress) | Construction status updates | Medium | Often perfunctory, hard to verify |
| Building Plan Approval | Sanctioned built-up, FSI, layout | Variable | Large PDFs, plan drawings |
| Commencement Certificate (CC) | Construction legally started | Good | |
| Occupancy / Part Occupancy Cert | Habitability | Good | |
| Title Search Report | Chain of title summary | Variable | Lawyer-written, varying detail |
| Title Certificate | Legal opinion on title | Variable | |
| Search Report (SR office) | Encumbrances | Variable | |
| Land Plan, Site Plan | Layout | Variable | Vector vs scanned |
| Layout Plan, Floor Plan | Spatial layout | Variable | |
| Sale Agreement (template) | Buyer terms | Good | Often boilerplate |
| Allotment Letter (template) | Booking terms | Good | |
| Parking Plan | Parking provision | | Sometimes missing |
| Environment Clearance | EC certificate | Sparse | Only large projects |
| Stamp Payment Receipt | Stamp duty paid | | Indirect — implied by stamp duty rate |

Update cadence

  • Project page re-scrape: Monthly
  • Document delta: Detect file-list changes, re-extract changed documents
  • Form C (progress): Quarterly per project
  • Complaint records: Weekly

Known gotchas

  1. Shell registrations — some projects register but never upload documents. Detect via documents count = 0.
  2. Name changes — promoter mergers/demergers don't auto-link. Use PAN/CIN as primary key.
  3. Phase confusion — large projects register phases separately; reconcile via alternate_names and address proximity.
  4. Withdrawn projects — re-appear later under different names. Maintain a withdrawal-aware history.
  5. Scanned PDFs — older filings (pre-2020) often scanned; OCR + LLM extraction needed.
  6. Marathi text — addresses, descriptions, court orders often in Marathi.
  7. Bulk download is rate-limited — be polite, parallelise carefully.
  8. Date formats — DD/MM/YYYY, DD-MM-YYYY, "31st March 2026" all appear; normalise on extraction.
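Gotcha 8's normalisation can be sketched as below. The three format strings cover exactly the spellings listed above; real filings will need more, so treat the list as a starting assumption:

```python
import re
from datetime import datetime

# Formats seen in MahaRERA filings (per gotcha 8); extend as new ones appear.
_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %B %Y")

def normalise_date(raw: str) -> str:
    """Return ISO 8601 (YYYY-MM-DD) for the date spellings listed above."""
    # Strip ordinal suffixes: "31st March 2026" -> "31 March 2026".
    cleaned = re.sub(r"(\d{1,2})(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")
```

Normalising at extraction time (as the gotcha says) means every downstream layer sees one canonical format.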

Volume estimates

  • ~40,000 registered projects (residential + commercial) since 2017
  • ~5-10 documents per project on average → ~300,000 documents total
  • ~500-1,000 new project registrations per month
  • ~50,000 quarterly updates per quarter

Source #2 — IGR Maharashtra (Inspector General of Registration)

Access

  • Public portal: https://igrmaharashtra.gov.in/
  • Search interface: District/SR-office → Year → Index-II lookup
  • Bulk feeds: Licensed via PropEquity, CRE Matrix, etc. (₹2-25 Lakh/year/state)
  • Approach: Hybrid — official portal for sample/spot-check, licensed feed for bulk
  • Legal basis: Public records under Registration Act 1908 § 57

Index-II structure (what each deed gives us)

| Field | Always present | Notes |
| --- | --- | --- |
| Deed number, SR office, sub-district, registration date | Yes | Primary key |
| Type of deed (Sale, L&L, Gift, Mortgage, etc.) | Yes | |
| Property identifier (CTS / Survey, address) | Yes | Address often Marathi |
| Parties (vendor / vendee / lessor / lessee) | Yes | With addresses, sometimes PAN |
| Consideration (sale price OR rent + tenure) | Yes | The declared amount |
| Market value (per ASR) | Yes | The government's view |
| Stamp duty + registration fee paid | Yes | |
| Document URL / scan | Sometimes | Often paywalled or rate-limited |

Update cadence

  • Daily for priority micromarkets (Pune Tier-1, MMR core, Bhiwandi warehousing)
  • Weekly for tier-2 micromarkets
  • Monthly rollups for trend analytics

Known gotchas

  1. Declared price ≠ actual price for high-value transactions (cash component is illegal but real). ASR Gap % helps detect; we flag, never accuse.
  2. Marathi property descriptions require good translation pipeline.
  3. Same property registered multiple times (correction deeds, family transfers). Dedupe by (property_id, parties, date proximity).
  4. L&L agreements with tenure under 11 months are typically structured to avoid compulsory registration; when one is registered anyway, it often warrants scrutiny.
  5. Stamp duty rates differ by deed type, property type (residential vs commercial), and exemption status.
  6. Indexed within 24-72 hours of registration — sub-day claims should be hedged.
  7. Bulk feed costs are non-trivial; budget item.
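The ASR Gap % heuristic from gotcha 1 can be sketched as below. The 20% review threshold is an assumed value for illustration, not a calibrated cut-off, and the output is only ever an internal review flag, never an accusation:

```python
def asr_gap_pct(declared: float, asr_market_value: float) -> float:
    """Percentage by which the declared consideration falls below the
    government ASR market value (negative = declared above ASR)."""
    return round(100.0 * (asr_market_value - declared) / asr_market_value, 2)

def flag_for_review(declared: float, asr_market_value: float,
                    threshold_pct: float = 20.0) -> bool:
    # Flags the transaction for internal review only; we flag, never accuse.
    return asr_gap_pct(declared, asr_market_value) >= threshold_pct
```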

Linking IGR → MahaRERA

The hardest fusion problem. We use:

  • Survey / CTS number overlap
  • Geocoded address proximity (< 200m + textual fuzzy match)
  • Promoter name match (Class I) + parties match for sales from promoter to buyer
  • Project name in property description

Expected match rate: 40-60% for residential; higher for promoter→buyer commercial transactions, lower for the secondary market.
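The proximity-plus-fuzzy gate above can be sketched with the standard library. `difflib` stands in for whatever fuzzy matcher the pipeline actually uses, and the 0.6 similarity cut-off is an illustrative assumption:

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def candidate_match(igr_rec: dict, rera_rec: dict) -> bool:
    """The gate from the doc: geocoded proximity under 200 m AND a fuzzy
    textual match (0.6 is an assumed cut-off for illustration)."""
    close = haversine_m(igr_rec["lat"], igr_rec["lon"],
                        rera_rec["lat"], rera_rec["lon"]) < 200.0
    similar = name_similarity(igr_rec["address"], rera_rec["project_name"]) >= 0.6
    return close and similar
```

Survey/CTS overlap and promoter-party matching (the stronger signals) would be checked before falling back to this gate.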


Source #3 — Government Resolutions (GRs) and Notifications

Access

  • Master portal: https://gr.maharashtra.gov.in/
  • Department portals: UDD, Revenue, Housing, Industries, Environment, Town Planning
  • Approach: Scraping + RSS where available
  • Update cadence: Daily ingest, hourly during volatile periods (budget season, election notifications)

Departments to track

| Department | Why it matters |
| --- | --- |
| Urban Development Department (UDD) | Master plan changes, TP schemes, DCR notifications, FSI revisions, TDR |
| Revenue Department | Land conversion, NA orders, stamp duty changes |
| Housing Department | Affordable housing schemes, slum rehabilitation |
| Public Works Department | Road, bridge, infrastructure projects |
| Industries Department | MIDC plot allotments, SEZ notifications |
| Environment Department | EC clearances, CRZ, forest |
| Town Planning | DP changes, zoning, reserve plot use changes |

NLP pipeline

GRs are unstructured. Pipeline:

  1. Ingest PDFs from portals (often Marathi)
  2. OCR + translation to English (LLM-assisted for nuance)
  3. Classify real-estate relevance (binary: relevant/not, then category)
  4. Entity extraction — affected districts, talukas, project names, survey numbers, coordinates, effective dates
  5. Impact scoring — high/medium/low impact on real-estate markets
  6. Store with full text + classification + entities + impact
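Steps 3-4 of the pipeline can be sketched with a keyword gate and a regex. The production classifier is LLM-assisted, so treat `RELEVANCE_TERMS` and the survey-number pattern below as illustrative assumptions, not the real model:

```python
import re

# Illustrative keyword gate only; the production relevance classifier is
# LLM-assisted and works on the translated full text.
RELEVANCE_TERMS = ("FSI", "TDR", "DCR", "zoning", "land acquisition",
                   "stamp duty", "development plan", "NA order")

# Assumed pattern for survey / gat numbers such as "Survey No. 123/4".
SURVEY_NO = re.compile(
    r"\b(?:survey|s\.?\s?no\.?|gat)\s*(?:no\.?\s*)?(\d+(?:/\d+)*)\b",
    re.IGNORECASE,
)

def is_relevant(text: str) -> bool:
    lower = text.lower()
    return any(term.lower() in lower for term in RELEVANCE_TERMS)

def extract_survey_numbers(text: str) -> list[str]:
    return SURVEY_NO.findall(text)
```

Even a crude gate like this matters because ~80% of GRs are irrelevant (see gotchas below); anything it passes still goes through the full classification and entity-extraction steps.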

Known gotchas

  1. ~80% of GRs are irrelevant; classifier critical to keep noise out
  2. Effective date ≠ publication date — extract both
  3. Amendments — a single decision may have 4+ amendment GRs; chain them
  4. Annexures with coordinates / shape data are often what matters most
  5. Marathi nuance — the same word, e.g. "उपलब्ध" (available), can mean very different things in different bureaucratic constructions; LLM-assisted translation handles this

Coverage targets

  • 100% of GRs from priority departments (UDD, Revenue, Housing) within 24 hours of publication
  • 100% archive back to 2017 for trend analysis

Source #4 — MahaBhulekh and Land Records

Access

  • Rural 7/12: https://bhulekh.mahabhumi.gov.in/
  • Digital 7/12: https://digitalsatbara.mahabhumi.gov.in/
  • Property Card (urban): City-specific portals (Mumbai's is at https://mahabhulekh.maharashtra.gov.in/PropertyCard)
  • Approach: Per-property lookup (no bulk available); on-demand caching

What we extract

| Attribute | Source | Notes |
| --- | --- | --- |
| Holder name(s), share | 7-12 / PC | |
| Area (hectares + gunthas / sqm) | 7-12 / PC | |
| Land class (Class I / Class II) | 7-12 | Class II = restricted transfer |
| Crops / use | 7-12 | Agricultural |
| Mutations (changes in ownership) | 7-12 / PC | Dates + reasons |
| Encumbrances / charges | 7-12 (Other Rights column) | Often partial |
| NA conversion order ref | 7-12 / Revenue records | If applicable |

Known gotchas

  1. Survey numbers split over time — track mutation chain
  2. Class II land can't be transferred without permission — always flag
  3. 7-12 stale on mutations by 30-90 days — note last_mutation_date
  4. Urban CTS ≠ rural survey numbers — different systems
  5. Property Cards are less digitised; some cities require physical request

Source #5 — GIS layers

Sources by use case

| Use case | Primary source | Secondary | License |
| --- | --- | --- | --- |
| Flood / inundation zones | MRSAC | BMC stormwater drainage data | MRSAC may require request |
| Forest cover | MRSAC + Forest Dept GIS | ISRO Bhuvan | Public + restricted |
| CRZ (Coastal Reg Zone) | MoEFCC notifications | MRSAC | Public |
| Land use / zoning | City Development Plan shapefiles | Bhuvan | Variable |
| Roads | OpenStreetMap | Google Roads API | OSM open / Google paid |
| Transit (metro, suburban rail) | Maha-Metro, MMRC, MMRDA | OSM | Mostly public |
| MIDC industrial boundaries | MIDC portal | | Public |
| SEZ boundaries | Government notifications | | Public |
| POIs (schools, hospitals, malls) | Google Places API | OSM | Google paid / OSM free |
| Demographic micro-areas | Census 2011 boundaries | MRSAC | Public but stale |

Pipeline

  • Normalise all coordinates to EPSG:4326 on ingest
  • Vector layers stored in PostGIS
  • Spatial queries via PostGIS / GeoPandas; complex routing via OSRM
  • Heat layers (sentiment, velocity) computed via H3 hex grid for consistent aggregation
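For spot checks outside the database, the zone-containment question behind these layers ("is this property in a flood zone?") is an ordinary point-in-polygon test. In production this is a PostGIS query, so treat this pure-Python ray-casting version as a sketch:

```python
def point_in_polygon(lat: float, lon: float,
                     ring: list[tuple[float, float]]) -> bool:
    """Ray-casting test: is (lat, lon) inside a polygon ring of
    (lat, lon) vertices in EPSG:4326? Sketch only — no holes,
    multipolygons, or edge-case handling."""
    inside = False
    n = len(ring)
    for i in range(n):
        y1, x1 = ring[i]
        y2, x2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            # Longitude where the polygon edge crosses this latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```

The PostGIS equivalent is roughly `ST_Contains(zone.geom, ST_SetSRID(ST_MakePoint(lon, lat), 4326))`, which is what bulk queries should use.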

Known gotchas

  1. Coordinate systems vary (WGS84, EPSG:32643). Normalise on ingest.
  2. MRSAC layers often need formal request — budget time
  3. DP shapefiles poor for tier-2 cities
  4. OSM building footprints incomplete in new developments
  5. Bhuvan older than ground truth in fast-changing areas

Source #6 — News, analyst reports, social

News sources

| Source | Type | Notes |
| --- | --- | --- |
| Economic Times Realty | Mainstream, English | Daily |
| Hindustan Times Real Estate | Mainstream, English | Daily |
| Mint Real Estate | Analytical, English | Weekly + breaking |
| Business Standard | Business, English | Daily |
| Moneycontrol Real Estate | Investor angle, English | Daily |
| Loksatta / Maharashtra Times | Local, Marathi | Daily; hyperlocal infra news |
| Knight Frank, JLL, CBRE, Anarock, Cushman, Colliers reports | Analyst, English | Quarterly |
| MagicBricks / Housing / NoBroker blogs | Aspirational, English | Discovery only |

Pipeline

  1. Ingest (RSS + scraping + occasional vendor API)
  2. Entity linking — tag each article to project / micromarket / developer / authority / GR
  3. Sentiment — VADER + LLM-based per-entity sentiment
  4. Topic classification — pricing / launch / delay / regulation / infra / dispute
  5. Aggregate weekly per entity → input to derived sentiment score
  6. Never quote as our own claim — always attribute
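Step 5's weekly roll-up can be sketched as a grouped mean over (entity, ISO week) buckets. The article dict shape — `entity`, `published` (a date), `sentiment` in [-1, 1] — is an assumption for this sketch:

```python
from collections import defaultdict
from datetime import date

def weekly_entity_sentiment(articles: list[dict]) -> dict:
    """Aggregate per-article, per-entity sentiment into
    (entity, ISO year, ISO week) buckets, returning the mean score."""
    buckets = defaultdict(list)
    for art in articles:
        iso = art["published"].isocalendar()
        buckets[(art["entity"], iso[0], iso[1])].append(art["sentiment"])
    return {key: round(sum(vals) / len(vals), 3)
            for key, vals in buckets.items()}
```

Aggregating to the entity level before anything user-facing is what keeps rule 6 (attribute, never quote as our own claim) enforceable downstream.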

Social listening

| Source | Use |
| --- | --- |
| Twitter / X | Real-time event, complaint sentiment |
| Reddit (r/india, r/mumbai, r/pune, r/IndianStreetBets) | Investor sentiment, complaints |
| YouTube (real-estate channels) | Comments + transcript sentiment |
| Local forums (Pune Whispers etc.) | Hyperlocal trust signal |

Vendor: Brandwatch / Talkwalker / Mention (₹15-30L/yr) OR self-built scraping + Meilisearch.

Always aggregate to entity-level signal. Never cite individual posts in user-facing content (defamation risk).

Cadence

  • News: Hourly
  • Social: Hourly
  • Analyst reports: Quarterly

Source #7 — Internal data (PropPie Fractional)

What we have

| Data | Sensitivity | Use |
| --- | --- | --- |
| Listed assets and SPV structures | Commercial | Ground truth for asset attributes, AI overlay |
| Investor KYC records | High (PII) | Personalisation (consent-required); never AI training without consent |
| Investment portfolios | Medium | Portfolio-fit recommendations (info-only) |
| Distribution history | Medium | Realised yields per asset — gold standard |
| Vacancy / occupancy data | Medium | Ground truth for cap-rate / yield validation |
| Secondary market trades | Medium | Liquidity signals |
| Asset documents (lease, title, valuation) | Variable | Document Q&A |

Access governance

  • AI layer gets pseudonymised investor data only (no PAN/Aadhaar in prompts)
  • Asset-level documents are commercial data — accessible
  • PII vault separate from analytics layer
  • Per-investor consent for AI features (separate from KYC consent)

Source #8 — Third-party / licensed feeds (budget)

| Feed | What it gives | Approx cost (₹/year) | Necessity |
| --- | --- | --- | --- |
| PropEquity | Pan-India transaction + project database | 5-25L | High — speeds up B2B Analytix |
| CRE Matrix | Commercial lease + sale transactions | 3-15L | High for B2B commercial |
| Liases Foras | Residential micromarket data + forecasts | 5-20L | Medium |
| CIBIL Commercial | Developer credit ratings | Per-pull | Medium for Developer Trust Score |
| MCA21 / Tofler / Probe42 | Company filings, directors, litigation | 1-5L | Medium for promoter diligence |
| CERSAI | Property charges registry | API access | Required for encumbrance checks |
| News API (Aggregate Intelligence, etc.) | Structured news | 2-10L | Optional — can self-scrape |
| Brandwatch / Talkwalker | Social listening | 15-30L | Optional — can self-build |
| Skymet / IMD | Weather / climate | Free + paid tiers | Medium for climate-risk attributes |
| Google Maps Platform | POIs, distance matrix | Per-call | Medium |

Total budget estimate for "full data stack" pre-revenue: ₹40-80L/year. Cheaper alternative (self-scrape primary, skip licensed feeds where possible): ₹15-25L/year + significant engineering time.


Gaps — Data we don't have but want

These are explicit. Call out the gap whenever a product feature depends on them.

| Gap | Workaround today | Future fix |
| --- | --- | --- |
| Demographic micro-data (ward-level income, education) | Census 2011 + proxies | Anonymised app-analytics partnerships; targeted surveys |
| Employment density shapefile | SEZ / IT-park boundaries | Telco data partnership (anonymised) |
| Crime statistics by area | NCRB aggregates only | RTI requests for selected hotspots |
| School / hospital quality | Review proxies | Partnership with rating agencies |
| Construction quality / structural audits | Only when complaints surface | Independent audit partnership |
| Builder financial stress | Partial via MCA + news | Subscription to credit feeds |
| NRI investor sentiment | None | Custom surveys / community partnerships |
| Real-time policy impact (post-GR) | News + manual analysis | Government engagement / advisory |
| Sub-market commute times | Google Distance Matrix (paid) | Local mobility partnerships |
| Property condition (interior, age, refurbishment) | Listings (unreliable) | User-submitted with verification |

Putting it together — the source landscape

flowchart LR
    subgraph Govt[Government / Public]
        RERA[MahaRERA]
        IGR[IGR Index-II]
        BHU[MahaBhulekh / PC]
        GR[GR Portals]
        MRSAC[MRSAC GIS]
    end

    subgraph Lic[Licensed / Paid]
        PE[PropEquity]
        CRE[CRE Matrix]
        CIBIL[CIBIL Commercial]
        MCA[MCA / Tofler]
        Brand[Brandwatch]
    end

    subgraph Open[Open / Free]
        OSM[OpenStreetMap]
        News[News RSS]
        Soc[Social Scraping]
        Bhuvan[ISRO Bhuvan]
    end

    subgraph Int[Internal]
        Frac[PropPie Fractional DB]
    end

    Govt --> Pipeline[Vishal's Pipeline]
    Lic --> Pipeline
    Open --> Pipeline
    Int --> Pipeline

    Pipeline --> CanonicalStore[Canonical Attribute Store]
    CanonicalStore --> Products[Broker / Analytix / Fractional AI overlay]

Source-attribution display rules (user-facing)

When showing data to users, follow this:

| Attribute type | Display format |
| --- | --- |
| RERA-sourced | "Per MahaRERA Form B v3 (filed 14 Jan 2026)" + link |
| IGR-sourced | "Per registered sale deed, Pune SR-7, 22 Mar 2026" + link |
| Aggregated transactions | "Median of 23 registered transactions, Jan-Mar 2026, ±5% confidence" |
| GR-sourced | "Per GR No. UDD/2025/123, dated 14 Aug 2025" + link |
| News-sourced | "As reported by ET Realty, 18 Feb 2026" + link |
| Derived (Score) | "PropPie-derived score; click to see inputs" (links to expand) |
| Forecast | "Model projection; range ±X% at 80% confidence" with assumptions |
| Internal (Fractional) | "From PropPie Fractional asset records; verified" |

Display the source on every user-facing data point. No exceptions.
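A sketch of how the table above could drive rendering. The template set and the `record` field names are assumptions for illustration, not the product API:

```python
def attribution_line(record: dict) -> str:
    """Render a user-facing attribution string from a source record.
    Templates mirror the display-rules table; field names are assumed."""
    templates = {
        "rera": "Per MahaRERA {doc} (filed {date})",
        "igr": "Per registered sale deed, {office}, {date}",
        "gr": "Per GR No. {gr_no}, dated {date}",
        "news": "As reported by {outlet}, {date}",
    }
    # str.format ignores unused keys, so records may carry extra fields.
    return templates[record["source_type"]].format(**record)
```

A rendering layer built this way fails loudly (KeyError) on an unknown source type rather than showing an unattributed number, which is the behaviour the "no exceptions" rule wants.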


See also:

  • /.cursor/skills/proppie-data-sources/SKILL.md — quick reference
  • data-attributes.md — the ~140-attribute spec
  • derived-attributes-spec.md — math for derived attributes
  • data-quality-framework.md — lineage / confidence / freshness
  • pipeline-spec-for-vishal.md — contract for the pipeline