Data Attributes — Canonical Catalogue

This is the canonical attribute spec for PropPie. It refines the original 92-attribute CSV and aligns it with Vishal's extraction schema v2, the SOUL/compliance posture, and the wow-moments roadmap. It covers ~140 attributes across 16 categories.

The machine-readable version is data-attributes.csv. The derivation math for every score / index lives in derived-attributes-spec.md. Lineage and quality rules are in data-quality-framework.md.

How to read this catalogue

Every attribute has:

| Column | Meaning |
| --- | --- |
| ID | Stable identifier (used in pipeline, API, derived references) |
| Category | One of 16 buckets |
| Attribute | Human name |
| Type | raw (extracted directly) / derived (computed from others) / enrichment (from external API/feed) |
| Priority | critical / high / medium / low; drives sequencing |
| Source | Primary upstream source(s) |
| Refresh cadence | How often it's updated |
| Source confidence (target) | Baseline confidence (0-1) we expect on extraction |
| Legal basis | What makes ingestion lawful |
| Used in | Which product surface(s): Broker, Analytix, Fractional |
| Wow moment | The user-facing magic this enables |
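
As a rough illustration of how one catalogue row could be represented and validated in code, here is a minimal sketch. The class name `AttributeSpec` and its field names are assumptions made for this example; the actual headers in data-attributes.csv may differ. Only the allowed values for type, priority and product surface come from the column table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Allowed values copied from the column table above.
ATTRIBUTE_TYPES = {"raw", "derived", "enrichment"}
PRIORITIES = {"critical", "high", "medium", "low"}
SURFACES = {"Broker", "Analytix", "Fractional"}


@dataclass(frozen=True)
class AttributeSpec:
    """One catalogue row. Field names are hypothetical, not the real CSV headers."""

    id: str
    category: str
    attribute: str
    type: str
    priority: str
    source: str
    refresh_cadence: Optional[str] = None
    source_confidence_target: Optional[float] = None
    legal_basis: Optional[str] = None
    used_in: Tuple[str, ...] = ()
    wow_moment: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the enumerations the catalogue defines.
        if self.type not in ATTRIBUTE_TYPES:
            raise ValueError(f"unknown type: {self.type!r}")
        if self.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority!r}")
        target = self.source_confidence_target
        if target is not None and not 0.0 <= target <= 1.0:
            raise ValueError("source confidence target must be within 0-1")
        unknown = set(self.used_in) - SURFACES
        if unknown:
            raise ValueError(f"unknown product surface(s): {sorted(unknown)}")


# Example row, using values from the Project Identity table.
spec = AttributeSpec(
    id="proj.maharera_number",
    category="Project Identity",
    attribute="MahaRERA registration number",
    type="raw",
    priority="critical",
    source="MahaRERA",
)
```

Validating at construction time means a mistyped priority or type fails when the catalogue is loaded rather than later in the pipeline.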

Categorisation framework

flowchart TD
    Static[Static / Structural<br/>What the property IS] --> Transactional[Transactional<br/>What happens TO it]
    Static --> Legal[Legal & Compliance<br/>What's allowed]
    Static --> Spatial[Spatial / Environmental<br/>WHERE it sits]

    Transactional --> Market[Market Signals<br/>What the market is doing]
    Transactional --> Tenant[Tenant & Lease Intel<br/>Cash flow reality]

    Market --> Derived[Derived / AI<br/>Scores, projections, narratives]
    Tenant --> Derived
    Legal --> Derived
    Spatial --> Derived
    Policy[Policy & Infra<br/>What's coming] --> Derived

    Persona[Investor Persona] --> Derived
    Macro[Macro Overlay] --> Derived

    Operations[Operations / Lineage] -.-> Static
    Operations -.-> Transactional
    Operations -.-> Derived

The 16 categories

| # | Category | Approx count | Role |
| --- | --- | --- | --- |
| 01 | Project Identity | 12 | The "who and what" |
| 02 | Location & Geospatial | 12 | The "where" |
| 03 | Land & Area Metrics | 10 | Physical size, FSI |
| 04 | Parties & Developer | 8 | The "by whom" |
| 05 | Legal & Compliance | 11 | Title, encumbrance, litigation |
| 06 | Approvals & Permissions | 8 | Regulatory clearances |
| 07 | Financial & Transaction | 14 | Real prices, real rents |
| 08 | Tenant & Lease Intelligence | 9 | Commercial cash-flow |
| 09 | Unit & Configuration | 7 | Inventory and product mix |
| 10 | Market Signals | 10 | Velocity, momentum, yields |
| 11 | Policy & Infrastructure | 8 | GRs and infra catalysts |
| 12 | Risk Intelligence | 9 | Environmental, zonal, composite |
| 13 | AI / Derived Insights | 12 | Scores, projections, narratives |
| 14 | Investor Persona | 8 | For Broker personalisation |
| 15 | Macro Overlay | 6 | Repo, rates, employment |
| 16 | Fractional-specific | 7 | SPV, escrow, waterfall, secondary |
| Ops | Data Lineage & Audit | (per-attribute, not separate) | Confidence, freshness, source |

Total: ~140 functional attributes (excluding per-attribute lineage columns which apply to all).


Category 01 — Project Identity (12)

The "who and what" — the entity resolution backbone.

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| proj.id | Canonical project ID | derived | critical | Internal | UUID; primary key across system |
| proj.maharera_number | MahaRERA registration number | raw | critical | MahaRERA | Anchor for entity resolution |
| proj.name | Project name | raw | critical | MahaRERA | |
| proj.name_aliases | Alternate / marketed names | raw | high | MahaRERA + News | List |
| proj.building_or_wing_name | Sub-name | raw | high | MahaRERA | For multi-tower projects |
| proj.type | Type (Residential, Commercial, Industrial, Mixed-Use) | raw | critical | MahaRERA | Drives all segment filtering |
| proj.phases_count | Number of phases | raw | high | MahaRERA | |
| proj.towers_count | Number of towers | raw | high | MahaRERA | |
| proj.blocks_count | Number of blocks | raw | medium | MahaRERA | |
| proj.promised_completion_dates | All declared/revised completion dates with versions | raw | critical | MahaRERA Form B | Used in dev.delay_history |
| proj.actual_completion_date | Actual delivery date | raw | high | MahaRERA OC/POC | |
| proj.estimated_cost_history | Original + revised cost figures with versions | raw | high | MahaRERA Form A/B | Cost overrun signal |

Category 02 — Location & Geospatial (12)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| loc.full_address | Full address + PIN | raw | critical | MahaRERA | |
| loc.unit_number, loc.street, loc.landmark | Address parts | raw | medium | MahaRERA | |
| loc.locality | Locality / colony | raw | high | MahaRERA + reverse geocoding | |
| loc.taluka, loc.district, loc.state | Admin hierarchy | raw | high | MahaRERA | |
| loc.pin_code | PIN code | raw | high | MahaRERA | |
| loc.lat, loc.lng | Coordinates | raw | critical | MahaRERA + Google Geocoding fallback | Normalised EPSG:4326 |
| loc.survey_numbers | Survey/CTS/Gat/Hissa/TP numbers | raw | critical | MahaRERA + 7-12 | List |
| loc.boundaries_n_s_e_w | Cardinal boundaries | raw | medium | MahaRERA + 7-12 | |
| loc.micromarket_id | Micromarket bucket | derived | critical | GIS clustering | See derived-spec |
| loc.micromarket_name | Human-readable label | derived | critical | Internal taxonomy | |
| loc.infra_proximity_score | Composite 0-100 | derived | critical | GIS + metro/SEZ/etc. | See derived-spec |
| loc.commute_times | Commute times to key job centres | enrichment | medium | Google Distance Matrix | |

Category 03 — Land & Area Metrics (10)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| area.total_land_sqm | Total land area | raw | critical | MahaRERA | |
| area.land_for_registration_sqm | Registered land area | raw | high | MahaRERA | |
| area.permissible_builtup_sqm | Permissible built-up | raw | critical | MahaRERA / BPA | |
| area.sanctioned_builtup_sqm | Sanctioned built-up | raw | critical | BPA | |
| area.open_space_sqm | Aggregate open space | raw | medium | MahaRERA | |
| area.carpet_sqft | Carpet area (per-unit or total) | raw | critical | Floor plans | RERA-mandated |
| area.built_up_sqft | Built-up area | raw | high | Floor plans | |
| area.super_built_up_sqft | Super built-up | raw | medium | Floor plans | Being phased out |
| area.fsi_consumed | FSI consumed | raw | critical | MahaRERA / BPA | |
| area.permissible_fsi | Permissible FSI | raw | critical | MahaRERA / DCR | |

Category 04 — Parties & Developer (8)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| party.promoter_name | Promoter / developer name | raw | critical | MahaRERA | |
| party.promoter_pan | Promoter PAN | raw | critical | MahaRERA | Primary key for promoter |
| party.promoter_cin | CIN if corporate | raw | high | MahaRERA + MCA | |
| party.promoter_gst | GST | raw | medium | MahaRERA | |
| party.land_owners | Landowner names + share | raw | high | MahaRERA + 7-12 | |
| party.architects | Engaged architects | raw | low | MahaRERA | |
| party.contractors | Engaged contractors | raw | low | MahaRERA | |
| dev.trust_score | Developer Trust Score 0-100 | derived | critical | Multi-factor | See derived-spec; defamation-guarded |

Category 05 — Legal & Compliance (11)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| legal.title_search_report_url | TSR URL | raw | high | MahaRERA | |
| legal.title_certificate_url | TC URL | raw | high | MahaRERA | |
| legal.chain_of_title | Historical owners list | raw | high | TSR + 7-12 | Structured |
| legal.encumbrances | Charges, mortgages, liens | raw | critical | TSR + CERSAI | |
| legal.cersai_charges | CERSAI charge records | raw | high | CERSAI API | |
| legal.litigation_records | Court cases | raw | high | TSR + court portals | |
| legal.complaints_maharera | MahaRERA complaints count + status | raw | high | MahaRERA | |
| legal.land_class | Class I / Class II | raw | critical | 7-12 | Class II = restricted |
| legal.na_order_status | NA conversion done? | raw | high | Revenue records | |
| legal.crz_status | Within CRZ? | enrichment | high | MoEFCC + MRSAC | |
| legal.title_clarity_score | Title Clarity Score 0-100 | derived | critical | Multi-input | See derived-spec |

Category 06 — Approvals & Permissions (8)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| appr.building_plan_approval | BPA reference + date | raw | critical | MahaRERA + municipal | |
| appr.commencement_certificate | CC reference + date | raw | critical | MahaRERA | |
| appr.part_occupancy_certificate | POC ref + scope | raw | high | MahaRERA | |
| appr.occupancy_certificate | OC ref + date | raw | critical | MahaRERA | Habitability anchor |
| appr.environment_clearance | EC ref + date | raw | high | MahaRERA + MoEFCC | |
| appr.forest_clearance | FC if applicable | raw | high | MoEFCC | |
| appr.fire_noc | Fire NOC | raw | medium | Municipal | |
| appr.water_sewerage_noc | Water/sewerage NOC | raw | medium | Municipal | |

Category 07 — Financial & Transaction (14)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| fin.sale_price | Declared sale price (per transaction) | raw | critical | IGR Index-II | |
| fin.sale_date | Transaction date | raw | critical | IGR | |
| fin.sale_parties | Buyer + seller | raw | critical | IGR | |
| fin.stamp_duty_paid | Stamp duty paid | raw | high | IGR | |
| fin.registration_fee_paid | Registration fee | raw | high | IGR | |
| fin.market_value_asr | Market value per ASR | raw | high | IGR | Government view |
| fin.asr_value_per_sqft | ASR per sqft for the address | enrichment | critical | ASR portal | |
| fin.asr_gap_pct | (Sale price - ASR) / ASR * 100 | derived | critical | Sale + ASR | See derived-spec |
| fin.price_per_sqft_carpet | Price ÷ carpet area | derived | critical | Sale + area | |
| fin.estimated_cost_revised | Revised project cost | raw | high | MahaRERA | |
| fin.cost_overrun_pct | (Revised - Original) / Original * 100 | derived | high | Cost history | See derived-spec |
| fin.escrow_account_details | 70%-escrow account | raw | high | MahaRERA | |
| fin.escrow_balance_proxy | Estimated balance (from filings) | derived | medium | Escrow + sale velocity | Approximate |
| fin.loan_disclosure | Project loan disclosure | raw | high | MahaRERA Form B | |
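
The three derived financial attributes in this table state their formulas inline, so they can be sketched directly. The function names and the guard clauses below are assumptions for illustration; the authoritative definitions (including edge-case handling) are in derived-attributes-spec.md. The sketch also assumes the sale price and the ASR value are expressed on the same basis (both totals, or both per sqft).

```python
def asr_gap_pct(sale_price: float, asr_value: float) -> float:
    """fin.asr_gap_pct: (Sale price - ASR) / ASR * 100."""
    if asr_value <= 0:
        raise ValueError("ASR value must be positive")
    return (sale_price - asr_value) / asr_value * 100


def price_per_sqft_carpet(sale_price: float, carpet_sqft: float) -> float:
    """fin.price_per_sqft_carpet: price divided by carpet area."""
    if carpet_sqft <= 0:
        raise ValueError("carpet area must be positive")
    return sale_price / carpet_sqft


def cost_overrun_pct(original_cost: float, revised_cost: float) -> float:
    """fin.cost_overrun_pct: (Revised - Original) / Original * 100."""
    if original_cost <= 0:
        raise ValueError("original cost must be positive")
    return (revised_cost - original_cost) / original_cost * 100


# Made-up figures, purely to show the call shape.
gap = asr_gap_pct(sale_price=12_000_000, asr_value=10_000_000)
rate = price_per_sqft_carpet(sale_price=12_000_000, carpet_sqft=800)
overrun = cost_overrun_pct(original_cost=500.0, revised_cost=575.0)
```

A positive `gap` means the registered price sits above the government ASR value, and a negative one means it sits below.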

Category 08 — Tenant & Lease Intelligence (9)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| lease.rent_amount_monthly | Monthly rent | raw | critical | IGR L&L | |
| lease.tenure_months | Tenure | raw | critical | IGR L&L | |
| lease.lock_in_months | Lock-in period | raw | high | IGR L&L | |
| lease.security_deposit | Security deposit | raw | high | IGR L&L | |
| lease.escalation_pct | Annual escalation % | raw | high | IGR L&L | |
| lease.tenant_name | Tenant name | raw | high | IGR L&L | |
| lease.tenant_industry | Tenant industry (classified) | derived | high | Tenant name + MCA enrichment | |
| lease.tenant_credit_score | Tenant credit profile | enrichment | high | CIBIL Commercial | |
| lease.tenant_anchor_quality_score | Anchor tenant score 0-100 | derived | critical | Industry + credit + size | See derived-spec |

Category 09 — Unit & Configuration (7)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| unit.total_count | Total units | raw | critical | MahaRERA | |
| unit.booked_count | Units booked | raw | high | MahaRERA quarterly | |
| unit.sold_count | Units sold | raw | high | MahaRERA quarterly | |
| unit.unsold_count | Units unsold | derived | high | Total - sold | |
| unit.type_breakdown | Mix (1BHK/2BHK/Office/Retail) | raw | high | Floor plans | Object/array |
| unit.parking_provision | Covered/Open/Stilt counts | raw | high | MahaRERA + parking plan | |
| unit.amenities | Amenities list | raw | medium | MahaRERA | Pool, gym, club, etc. |

Category 10 — Market Signals (10)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| mkt.transaction_velocity_90d | Registered txn count in 90d (micromarket) | derived | critical | IGR rolling | See derived-spec |
| mkt.transaction_velocity_180d | Same, 180d | derived | critical | IGR rolling | |
| mkt.median_price_per_sqft_90d | Median ₹/sqft in micromarket | derived | critical | IGR aggregates | |
| mkt.price_appreciation_yoy_pct | YoY price change | derived | high | IGR aggregates | |
| mkt.absorption_rate | Sold / Launched per quarter | derived | high | MahaRERA + IGR | |
| mkt.days_on_market_median | Listing-to-sale time | derived | medium | Listing portals + IGR | Approximate |
| mkt.sector_momentum_pct | % change in sector volume vs baseline | derived | critical | IGR sector aggregates | |
| mkt.cap_rate_median | Median cap rate by sector/zone | derived | critical | L&L + sale aggregates | |
| mkt.yield_benchmark | Gross yield benchmark by segment | derived | critical | L&L aggregates | |
| mkt.micromarket_lifecycle_stage | Emerging / Growing / Mature / Cooling | derived | high | Time-series clustering | See derived-spec |
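
To make the rolling-window attributes concrete, here is a minimal sketch of `mkt.transaction_velocity_90d` / `_180d` and `mkt.median_price_per_sqft_90d`. It assumes the input has already been filtered to a single micromarket, that price per sqft is computed on carpet area, and that the window excludes its start date and includes the as-of date. None of these conventions is stated in this catalogue, so check them against derived-attributes-spec.md; the function names are hypothetical.

```python
from datetime import date, timedelta
from statistics import median
from typing import Iterable, Optional, Tuple


def transaction_velocity(txn_dates: Iterable[date], as_of: date, window_days: int = 90) -> int:
    """Count registered transactions in the window (start, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in txn_dates if start < d <= as_of)


def median_price_per_sqft(
    txns: Iterable[Tuple[date, float, float]],  # (sale_date, sale_price, carpet_sqft)
    as_of: date,
    window_days: int = 90,
) -> Optional[float]:
    """Median price per sqft over the same window; None if no usable transactions."""
    start = as_of - timedelta(days=window_days)
    values = [
        price / area
        for sale_date, price, area in txns
        if start < sale_date <= as_of and area > 0
    ]
    return median(values) if values else None


# Made-up transactions for one micromarket.
as_of = date(2024, 6, 30)
txns = [
    (date(2024, 1, 15), 8_000_000.0, 500.0),
    (date(2024, 4, 2), 9_000_000.0, 600.0),
    (date(2024, 6, 30), 12_000_000.0, 750.0),
]
velocity_90d = transaction_velocity((t[0] for t in txns), as_of, 90)
velocity_180d = transaction_velocity((t[0] for t in txns), as_of, 180)
median_90d = median_price_per_sqft(txns, as_of, 90)
```

Returning `None` for an empty window keeps a thin micromarket from reporting a misleading zero.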

Category 11 — Policy & Infrastructure (8)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| policy.applicable_grs | List of relevant GRs (last 12 mo) | raw | critical | GR portals | Filtered |
| policy.tailwind_flags | Boolean flags for positive policies | derived | critical | GR NLP classifier | See derived-spec |
| policy.headwind_flags | Boolean flags for restrictive policies | derived | high | GR NLP classifier | |
| infra.upcoming_projects | List of nearby infra projects with timelines | derived | critical | GRs + news + Maha-Metro | Event extraction |
| infra.metro_proximity_status | Operational / Under-construction / Approved / Speculated | derived | high | Maha-Metro filings | |
| infra.sez_industrial_proximity | Distance + status of SEZ/MIDC | derived | high | Government data | |
| infra.airport_proximity_km | Distance to nearest airport | enrichment | medium | Geospatial | |
| infra.school_hospital_proximity | Count of quality schools/hospitals within 5km | derived | medium | Google Places + filters | |

Category 12 — Risk Intelligence (9)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| risk.flood_zone_flag | In flood zone? | enrichment | critical | MRSAC + BMC | |
| risk.forest_zone_flag | In forest area? | enrichment | high | Forest Dept GIS | |
| risk.crz_flag | In CRZ? | enrichment | high | MoEFCC + MRSAC | |
| risk.aqi_annual_avg | Annual avg AQI | enrichment | medium | CPCB data | |
| risk.heat_island_index | Urban heat island index | enrichment | medium | MRSAC | |
| risk.rainfall_trend | 5-yr rainfall trend | enrichment | low | IMD | |
| risk.urban_flooding_history | Past flooding incidents in pincode | enrichment | medium | BMC / news | |
| risk.zone_risk_index | Composite Zone Risk Index 0-10 | derived | critical | Ensemble | See derived-spec |
| risk.title_risk_flag | Composite title risk flag | derived | critical | Title score < threshold | |

Category 13 — AI / Derived Insights (12)

These are the magic. Every one has a math spec in derived-attributes-spec.md.

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| ai.alpha_narrative | "Why this, why now" plain-English summary | derived | critical | LLM over all signals | Cited |
| ai.risk_narrative | "What could go wrong" plain-English | derived | critical | LLM over risk attrs | Cited |
| ai.hidden_costs_breakdown | Full cost breakdown (stamp, registration, GST, society, etc.) | derived | critical | Calculator over attrs | Personalised to user state |
| ai.wealth_trajectory_paths | Monte Carlo wealth paths | derived | critical | Simulation | See derived-spec |
| ai.projected_irr_p50_p20_p80 | IRR projections at 50/20/80 percentiles | derived | critical | Simulation | |
| ai.comparable_set | 5-10 comparable transactions / projects | derived | critical | Similarity + filters | |
| ai.sentiment_score | Aggregated sentiment (news+social) | derived | high | NLP aggregation | |
| ai.developer_track_record_summary | Plain-English developer summary | derived | high | LLM over developer data | Defamation-guarded |
| ai.title_chain_explanation | Plain-English title walkthrough | derived | high | LLM over legal docs | |
| ai.gr_impact_summary | Plain-English GR impact on this asset | derived | high | LLM over relevant GRs | |
| ai.persona_fit_score | Fit between user persona and asset | derived | critical | Cosine similarity | Personalises ranking |
| ai.brutal_honesty_flags | List of red flags + "consider before buying" notes | derived | high | Rule + LLM | Honest broker discipline |
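
The table names cosine similarity as the basis for `ai.persona_fit_score`. A minimal sketch of that computation follows. How personas and assets are encoded as vectors is not defined in this catalogue, so the three-dimensional vectors below are purely hypothetical, and any rescaling of the result into a final score is left to derived-attributes-spec.md.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical encoding: (yield focus, growth focus, risk appetite), each in 0-1.
persona_vector = [0.9, 0.2, 0.3]
asset_vector = [0.8, 0.3, 0.2]
fit = cosine_similarity(persona_vector, asset_vector)
```

With non-negative features the result falls between 0 and 1, which makes it straightforward to use as a ranking key.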

Category 14 — Investor Persona (8)

Only stored with explicit consent. Used to personalise Broker outputs (info-only, not advice).

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| persona.investment_goal | Retirement / Yield / Capital growth / Lifestyle / Tax planning | raw | high | User self-report | |
| persona.horizon_years | Investment horizon | raw | high | User self-report | |
| persona.risk_tolerance | Conservative / Moderate / Aggressive | raw | high | User self-report | |
| persona.tax_bracket | User's effective tax bracket | raw | medium | User self-report | |
| persona.residency_status | Resident / NRI | raw | high | User self-report | Affects FEMA flows |
| persona.budget_range | Investable surplus range | raw | high | User self-report | |
| persona.family_stage | Single / Couple / Young Family / Empty Nester / Retired | raw | medium | User self-report | |
| persona.existing_exposure | Other RE exposure (city/segment) | raw | medium | User self-report | Diversification |

Category 15 — Macro Overlay (6)

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| macro.repo_rate | RBI repo rate | enrichment | high | RBI | Weekly refresh |
| macro.home_loan_rate_median | Median prime home loan rate | enrichment | high | Major banks | |
| macro.it_hiring_index | IT-ITES hiring index (Maharashtra) | enrichment | medium | NASSCOM / job boards | |
| macro.salary_growth_yoy | YoY salary growth (key segments) | enrichment | medium | LinkedIn / industry | |
| macro.gdp_growth_state | Maharashtra GDP growth | enrichment | low | Govt | |
| macro.cpi_inflation | CPI inflation | enrichment | medium | MoSPI | |

Category 16 — Fractional-specific (7)

Augments the existing PropPie Fractional asset records.

| ID | Attribute | Type | Priority | Source | Notes |
| --- | --- | --- | --- | --- | --- |
| frac.spv_id | SPV identifier | raw | critical | Internal | |
| frac.scheme_size_cr | Scheme size | raw | critical | Internal | SM-REIT threshold check |
| frac.investor_count | Current investor count | raw | critical | Internal | SM-REIT threshold check |
| frac.distribution_waterfall | Waterfall structure | raw | high | Internal | |
| frac.secondary_market_active | Active on secondary market? | raw | high | Internal | |
| frac.realised_yield_ttm | Trailing 12-month realised yield | derived | critical | Internal distributions | Ground truth |
| frac.vacancy_pct | Current vacancy | raw | critical | Internal | |

Operational columns (apply to every attribute)

Each attribute row in the canonical store carries these:

| Operational column | Purpose |
| --- | --- |
| source | Where the value came from (RERA / IGR / GR / Internal / etc.) |
| source_url | Specific deep link if available |
| source_doc_id | Internal pointer to source document |
| source_doc_page | Page reference if PDF |
| extracted_at | When ingested |
| confidence | 0-1 |
| confidence_method | How confidence was computed (rule-based / model-based / human-verified) |
| human_verified | Bool: has a human signed off |
| conflict_resolution_applied | If sources disagreed, which rule resolved |
| last_changed_at | When this value last changed |
| change_history | (optional) prior values with timestamps |

See data-quality-framework.md for the full rules.
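
One possible shape for a stored value carrying these operational columns is sketched below. The class name, the `attribute_id` and `value` fields, and the `record` method are assumptions for illustration. In particular, pairing each prior value in `change_history` with the time it last changed is one reading of "prior values with timestamps", not a rule taken from data-quality-framework.md.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, List, Optional, Tuple

CONFIDENCE_METHODS = {"rule-based", "model-based", "human-verified"}


@dataclass
class AttributeValue:
    """One stored value plus the operational columns listed above."""

    attribute_id: str  # hypothetical link back to the catalogue ID
    value: Any         # hypothetical field holding the value itself
    source: str
    extracted_at: datetime
    confidence: float
    confidence_method: str
    human_verified: bool = False
    source_url: Optional[str] = None
    source_doc_id: Optional[str] = None
    source_doc_page: Optional[int] = None
    conflict_resolution_applied: Optional[str] = None
    last_changed_at: Optional[datetime] = None
    change_history: List[Tuple[datetime, Any]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0-1")
        if self.confidence_method not in CONFIDENCE_METHODS:
            raise ValueError(f"unknown confidence method: {self.confidence_method!r}")
        if self.last_changed_at is None:
            self.last_changed_at = self.extracted_at

    def record(self, new_value: Any, seen_at: datetime) -> bool:
        """Apply a re-extraction; archive the prior value only if it changed."""
        self.extracted_at = seen_at
        if new_value == self.value:
            return False
        self.change_history.append((self.last_changed_at, self.value))
        self.value = new_value
        self.last_changed_at = seen_at
        return True


# Made-up values for illustration.
first_seen = datetime(2024, 4, 1, 10, 0)
second_seen = datetime(2024, 7, 1, 10, 0)
sold = AttributeValue(
    attribute_id="unit.sold_count",
    value=120,
    source="RERA",
    extracted_at=first_seen,
    confidence=0.9,
    confidence_method="rule-based",
)
changed = sold.record(134, second_seen)
```

Keeping `extracted_at` and `last_changed_at` separate lets a freshness check distinguish "re-read recently, value unchanged" from "value changed recently".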


Owner column (per attribute, for the team)

Each attribute has an owner, who is responsible for its definition, quality, and changes.

| Owner | Categories |
| --- | --- |
| Vishal (CEO) | All pipeline / ingestion / extraction-layer attributes (raw) |
| Aishvarya (COO) | All derived attributes, scoring methodology, AI / persona / Broker-facing |
| TBD (Compliance Officer) | All legal / compliance / lineage / audit attributes |
| Designated dev per category | (Will be filled in per the team-and-roles doc) |

What's intentionally NOT in this catalogue (yet)

  • Construction material quality — only via complaints; underreported
  • NRI repatriation profile — too dependent on individual circumstances
  • Listing portal asking prices — used only for trend detection, not authority
  • Influencer / blog mentions — too noisy
  • Property condition / age — interior, refurbishment status; user-submitted with verification later

Add when reliably sourceable.


Migration path from the original 92-attribute CSV

| Original CSV | Refined catalogue |
| --- | --- |
| 92 attributes | ~140 attributes |
| Manual "Captured?" column | Operational columns (confidence, human_verified, extracted_at) |
| Free-form "Suggested Data Sources" | Structured source enum with hierarchy |
| Implicit derivation | Explicit math spec per derived attribute |
| No persona / macro / fractional | Categories 14, 15, 16 added |
| No lineage / audit | Operational columns mandated |

Next steps

  1. Vishal to map this catalogue against the extraction schema v2 and identify what is already covered and what needs new extraction work
  2. Define each derived attribute's math in derived-attributes-spec.md (done) and validate with sample data
  3. Establish source-confidence baselines per attribute by sampling 100 projects and human-grading extractions
  4. Build the canonical store schema (Vishal's call: SQL? graph? key-value? lakehouse?)
  5. Wire up freshness monitors per source — alarms if a feed goes stale
  6. Document owner-per-attribute in a separate operational sheet (not this doc)
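
The freshness monitor in step 5 could start as something as small as the following sketch: compare each feed's last successful ingestion against a maximum allowed age and report the feeds that exceed it. The function name, the feed keys, and the thresholds are all illustrative assumptions; real thresholds would be derived from each attribute's refresh cadence.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def stale_feeds(
    last_ingested: Dict[str, Optional[datetime]],
    max_age: Dict[str, timedelta],
    now: datetime,
) -> List[str]:
    """Return feeds that have never ingested or whose last ingestion is older than allowed."""
    stale = []
    for feed, limit in max_age.items():
        seen = last_ingested.get(feed)
        if seen is None or now - seen > limit:
            stale.append(feed)
    return sorted(stale)


# Illustrative thresholds only; they are not taken from the catalogue.
max_age = {
    "RBI": timedelta(days=7),
    "MahaRERA": timedelta(days=92),
    "IGR": timedelta(days=1),
}
now = datetime(2024, 7, 1, 9, 0)
last_ingested = {
    "RBI": datetime(2024, 6, 20, 9, 0),
    "MahaRERA": datetime(2024, 5, 15, 9, 0),
    # "IGR" has no entry: treated as never ingested.
}
alarms = stale_feeds(last_ingested, max_age, now)
```

Treating a feed with no recorded ingestion as stale means a source that was configured but never wired up raises an alarm instead of passing silently.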

See also:

  • data-attributes.csv: machine-readable version
  • derived-attributes-spec.md: math for derived attributes
  • data-quality-framework.md: quality rules
  • pipeline-spec-for-vishal.md: pipeline contract