# Architectural Analysis: Root Causes of Data Discrepancies

**Date**: October 13, 2025
**Issue**: Fundamental data inconsistencies causing incorrect counts and confused state

## Current Problems Identified

### 1. Multiple Sources of Truth
**The Core Issue**: System has 3+ independent data sources with no synchronization:

```
extracted_property_urls.csv (156 URLs)
    ↓ (no sync)
enriched_data.json (199 properties)
    ↓ (no sync)
analysis_output.csv (historic GPT results)
    ↓ (no sync)
UI displays (computed independently)
```

**Symptoms:**
- Pipeline Overview shows 89 pending
- Analyze Only processes 156 properties
- Actual need: 96 properties
- Each source tells a different story

### 2. Inconsistent Data Models

**enriched_data.json**:
```json
{
  "url": "...",
  "gpt_score": 0.0,
  "custom_score": 3.1,
  "overall_score": 2.09  // Can be >0 even if gpt_score=0!
}
```

**Problem**: `overall_score > 0` doesn't mean "analyzed" - it can just have custom_score

**UI Logic**: Checked `overall_score === 0` (wrong)
**Script Logic**: Checked `gpt_score === 0` (correct)
**Result**: UI and backend disagreed on what "pending" means

### 3. State Spread Across Files

**Different files for different states:**
- `extracted_property_urls.csv` - "properties to process"
- `enriched_data.json` - "current state"
- `analysis_output.csv` - "GPT results history"
- `.gpt_cache/` - "what's been processed"
- `/tmp/farmmatch_progress_*.json` - "current operation"

**Problem**: No single place to ask "What needs analysis?"

### 4. No Validation Layer

**Missing safeguards:**
- No check that CSV and JSON are in sync
- No validation of data flow integrity
- No alerts when counts don't match
- No automated reconciliation

### 5. Implicit Assumptions

**Hidden business rules:**
- "Property is analyzed if gpt_score > 0"
- "Overall score can exist without GPT score"
- "CSV is subset of JSON (or is it?)"
- "Status 'Active' means... what exactly?"

**Problem**: Rules exist in code comments, not enforced

## Root Cause Categories

### 1. **Architectural Debt**
- Started as simple scraper
- Evolved organically without redesign
- Multiple data formats added over time
- No migration strategy between versions

### 2. **Lack of Schema Enforcement**
- JSON files can be any shape
- No required fields validation
- No type checking
- Silent failures when fields missing

### 3. **Distributed State Management**
- Each component maintains its own view
- No central coordinator
- No event system for state changes
- Race conditions possible

### 4. **No Single Source of Truth**
- Which is authoritative: CSV or JSON?
- What if they disagree?
- How to resolve conflicts?
- Who updates what when?

## Why This Keeps Happening

### Pattern 1: Add Feature → Add File
```
Need to track favorites → add favorites.json
Need GPT cache → add .gpt_cache/
Need progress → add /tmp/progress_*.json
Need history → add analysis_output.csv
```
**Result**: 5 files, 0 synchronization

### Pattern 2: Each Script Reads What It Needs
```python
# favorites_scraper.py
data = json.load('enriched_data.json')
# Updates enriched_data.json

# analyze_from_urls_optimized.py
df = pd.read_csv('extracted_property_urls.csv')
# Reads CSV, updates analysis_output.csv

# criteria_manager.html
fetch('enriched_data.json')
# Displays based on JSON
```
**Result**: No coordination, each does its own thing

### Pattern 3: Counts Calculated Differently Everywhere
```javascript
// UI
pendingAnalysis = props.filter(p => p.overall_score === 0)

// Python
analyzed_urls = {p['url'] for p in data if p.get('gpt_score', 0) > 0}

// Another script
unanalyzed = len(df) - len(existing_df)
```
**Result**: 3 different definitions of "pending"

## Recommended Architecture

### Phase 1: Single Source of Truth (Immediate)

```
┌─────────────────────────────────────┐
│     properties.db (SQLite)          │
│  - Single authoritative database    │
│  - ACID guarantees                  │
│  - Schema enforcement               │
│  - Indexes for performance          │
└─────────────────────────────────────┘
            ↑         ↓
    All reads    All writes
    go here      go here
            │         │
┌───────────┴─────────┴───────────┐
│    data_manager.py (API Layer)  │
│  - get_properties()             │
│  - update_property()            │
│  - get_pending_analysis()       │
│  - atomic operations            │
└─────────────────────────────────┘
```

**Benefits:**
- ONE place to query state
- ONE definition of "pending"
- Transactions prevent inconsistency
- SQL enforces data integrity

### Phase 2: Data Validation Layer

```python
class PropertyValidator:
    """Enforce business rules at write time"""

    @staticmethod
    def validate_property(prop: dict) -> ValidationResult:
        errors = []

        # Required fields
        if not prop.get('url'):
            errors.append("URL is required")

        # Business rules
        if prop.get('overall_score', 0) > 0:
            if prop.get('gpt_score', 0) == 0 and prop.get('custom_score', 0) == 0:
                errors.append("overall_score > 0 but no component scores")

        # Data integrity
        if prop.get('status') == 'Removed' and prop.get('gpt_score', 0) == 0:
            errors.append("Removed property should have been analyzed")

        return ValidationResult(valid=len(errors)==0, errors=errors)
```

**Benefits:**
- Catch issues at write time
- Explicit business rules
- Self-documenting constraints
- Prevent bad data from entering

### Phase 3: Reconciliation System

```python
class DataReconciler:
    """Find and fix inconsistencies"""

    def check_integrity(self) -> IntegrityReport:
        issues = []

        # Check 1: All URLs in DB should have enriched data
        db_urls = set(db.get_all_urls())
        json_urls = set(json.load('enriched_data.json'))
        missing = db_urls - json_urls
        if missing:
            issues.append(f"{len(missing)} URLs in DB but not in JSON")

        # Check 2: Properties with overall_score should have components
        invalid = db.query("""
            SELECT COUNT(*) FROM properties
            WHERE overall_score > 0
            AND (gpt_score = 0 OR custom_score = 0)
        """)
        if invalid > 0:
            issues.append(f"{invalid} properties have invalid score composition")

        # Check 3: Pending count should match across systems
        pending_db = db.count_pending()
        pending_ui = ui.count_pending()
        if pending_db != pending_ui:
            issues.append(f"Pending count mismatch: DB={pending_db}, UI={pending_ui}")

        return IntegrityReport(healthy=len(issues)==0, issues=issues)
```

**Benefits:**
- Detect problems automatically
- Alert before user sees issues
- Provide fix suggestions
- Can run on schedule

### Phase 4: Event-Driven Updates

```python
class PropertyEventBus:
    """Coordinate updates across system"""

    def property_scraped(self, url: str, data: dict):
        # 1. Update database
        db.upsert_property(url, data)

        # 2. Emit event
        self.emit('property.scraped', {'url': url})

        # 3. Trigger dependent updates
        if not data.get('gps_coordinates'):
            geocoder.queue_geocoding(url)

        # 4. Update caches
        cache.invalidate(url)

        # 5. Notify UI
        websocket.broadcast('property_updated', url)
```

**Benefits:**
- Atomic updates
- Automatic propagation
- No stale data
- Real-time UI updates

## Implementation Plan

### Week 1: Database Migration

**Goal**: Move from JSON files to SQLite database

```sql
CREATE TABLE properties (
    url TEXT PRIMARY KEY,
    title TEXT,
    summary TEXT,
    status TEXT DEFAULT 'Active' CHECK(status IN ('Active', 'Removed')),

    -- Scores (with constraints)
    gpt_score REAL DEFAULT 0 CHECK(gpt_score >= 0),
    custom_score REAL DEFAULT 0 CHECK(custom_score >= 0),
    overall_score REAL DEFAULT 0 CHECK(overall_score >= 0),

    -- Location
    lat REAL,
    lon REAL,
    location TEXT,

    -- Property details
    price REAL,
    land_size_m2 REAL,
    bedrooms INTEGER,
    bathrooms INTEGER,

    -- Metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_analyzed_at TIMESTAMP,

    -- Constraints
    CHECK (overall_score = 0 OR gpt_score > 0 OR custom_score > 0),
    CHECK (status = 'Removed' OR (lat IS NOT NULL AND lon IS NOT NULL))
);

CREATE INDEX idx_status ON properties(status);
CREATE INDEX idx_gpt_score ON properties(gpt_score);
CREATE INDEX idx_overall_score ON properties(overall_score);

-- View for easy querying
CREATE VIEW properties_pending_analysis AS
SELECT * FROM properties
WHERE status = 'Active'
AND gpt_score = 0;
```

**Migration Script**:
```python
def migrate_to_database():
    # 1. Create database
    db = sqlite3.connect('properties.db')
    db.execute(SCHEMA)

    # 2. Load existing data
    with open('enriched_data.json') as f:
        properties = json.load(f)

    # 3. Validate and insert
    for prop in properties:
        try:
            validated = PropertyValidator.validate(prop)
            if validated.valid:
                db.insert_property(validated.data)
            else:
                log_migration_error(prop['url'], validated.errors)
        except Exception as e:
            log_migration_error(prop['url'], str(e))

    # 4. Verify migration
    assert db.count() == len(properties)
    assert db.count_pending() == expected_pending

    # 5. Backup old files
    shutil.move('enriched_data.json', '_backups/enriched_data_pre_migration.json')
```

### Week 2: Data Manager API

**Goal**: Single interface for all data operations

```python
class PropertyDataManager:
    """Centralized data access layer"""

    def __init__(self, db_path='properties.db'):
        self.db = sqlite3.connect(db_path)
        self.validator = PropertyValidator()

    def get_property(self, url: str) -> Optional[Property]:
        """Get single property by URL"""
        row = self.db.execute(
            "SELECT * FROM properties WHERE url = ?",
            (url,)
        ).fetchone()
        return Property.from_row(row) if row else None

    def get_pending_analysis(self) -> List[Property]:
        """Get all properties needing GPT analysis"""
        rows = self.db.execute("""
            SELECT * FROM properties
            WHERE status = 'Active'
            AND gpt_score = 0
            ORDER BY created_at ASC
        """).fetchall()
        return [Property.from_row(row) for row in rows]

    def update_gpt_score(self, url: str, score: float, analysis: str):
        """Update GPT analysis results atomically"""
        # Validate
        if not 0 <= score <= 5:
            raise ValueError(f"Invalid score: {score}")

        # Begin transaction
        with self.db:
            # Update scores
            self.db.execute("""
                UPDATE properties
                SET gpt_score = ?,
                    last_analyzed_at = CURRENT_TIMESTAMP,
                    updated_at = CURRENT_TIMESTAMP
                WHERE url = ?
            """, (score, url))

            # Recalculate overall score
            self.db.execute("""
                UPDATE properties
                SET overall_score = (
                    (gpt_score * ?) + (custom_score * ?)
                ) / (? + ?)
                WHERE url = ?
            """, (GPT_WEIGHT, CUSTOM_WEIGHT, GPT_WEIGHT, CUSTOM_WEIGHT, url))

            # Store analysis text
            self.db.execute("""
                INSERT INTO analyses (url, analysis_text, created_at)
                VALUES (?, ?, CURRENT_TIMESTAMP)
            """, (url, analysis))

        # Clear caches
        self.invalidate_cache(url)

    def get_statistics(self) -> Statistics:
        """Get system-wide statistics"""
        return Statistics(
            total=self.db.execute("SELECT COUNT(*) FROM properties").fetchone()[0],
            active=self.db.execute("SELECT COUNT(*) FROM properties WHERE status='Active'").fetchone()[0],
            with_gpt=self.db.execute("SELECT COUNT(*) FROM properties WHERE gpt_score > 0").fetchone()[0],
            pending=self.db.execute("SELECT COUNT(*) FROM properties WHERE status='Active' AND gpt_score=0").fetchone()[0],
            avg_score=self.db.execute("SELECT AVG(overall_score) FROM properties WHERE overall_score > 0").fetchone()[0]
        )
```

**Update All Scripts**:
```python
# OLD WAY
with open('enriched_data.json') as f:
    data = json.load(f)
pending = [p for p in data if p.get('gpt_score', 0) == 0]

# NEW WAY
from data_manager import PropertyDataManager
dm = PropertyDataManager()
pending = dm.get_pending_analysis()
```

### Week 3: Validation & Integrity Checks

**Goal**: Prevent bad data from entering system

```python
# Add to API endpoints
@app.route('/api/update-property', methods=['POST'])
def update_property():
    data = request.json

    # Validate before writing
    validation = validator.validate_property(data)
    if not validation.valid:
        return jsonify({
            'success': False,
            'errors': validation.errors
        }), 400

    # Write to database
    dm.update_property(data['url'], data)

    return jsonify({'success': True})
```

**Add Integrity Checks**:
```python
# Run automatically on startup and every hour
def check_data_integrity():
    reconciler = DataReconciler()
    report = reconciler.check_integrity()

    if not report.healthy:
        # Log issues
        logger.error(f"Data integrity issues found: {report.issues}")

        # Send alert
        send_admin_alert(report)

        # Try to fix
        for issue in report.issues:
            if issue.auto_fixable:
                issue.fix()
```

### Week 4: Deprecate Old Files

**Goal**: Remove JSON files, force all access through DB

```python
# 1. Add read-only mode to JSON files
def load_enriched_data():
    warnings.warn(
        "enriched_data.json is deprecated. Use PropertyDataManager instead.",
        DeprecationWarning
    )
    return json.load(open('enriched_data.json'))

# 2. After migration period, remove files
os.rename('enriched_data.json', '_deprecated/enriched_data.json.old')
os.rename('analysis_output.csv', '_deprecated/analysis_output.csv.old')

# 3. Update documentation
# 4. Update all scripts to use database
```

## Benefits of New Architecture

### 1. Single Source of Truth
```
Before: 3+ files, each with different "truth"
After:  1 database, ONE answer to every question
```

### 2. Data Integrity Guaranteed
```
Before: overall_score can be wrong, no validation
After:  Database constraints prevent invalid states
```

### 3. Consistent Definitions
```
Before: "pending" defined differently in each script
After:  SELECT * FROM properties_pending_analysis
```

### 4. Atomic Operations
```
Before: Update JSON, crash, file corrupted
After:  Transaction rollback, data always consistent
```

### 5. Performance
```
Before: Load entire 199-property JSON to count pending
After:  SELECT COUNT(*) FROM properties WHERE gpt_score=0  (instant)
```

### 6. Scalability
```
Before: 10,000 properties = 10MB JSON file, slow to parse
After:  10,000 properties = indexed DB, still fast
```

## Immediate Actions (Today)

### 1. Document the Rules
Create `BUSINESS_RULES.md`:
```markdown
# Property Analysis Business Rules

## Property States
- **Pending Analysis**: status='Active' AND gpt_score=0
- **Analyzed**: gpt_score > 0
- **Complete**: gpt_score > 0 AND custom_score > 0

## Score Calculation
- overall_score = (gpt_score * 0.6) + (custom_score * 0.4)
- overall_score can only be > 0 if at least one component > 0

## Data Flow
1. Scraper adds property → enriched_data.json
2. GPT analysis → updates gpt_score
3. Custom criteria → updates custom_score
4. Overall score → recalculated automatically
```

### 2. Add Validation Functions
```python
def validate_pending_count():
    """Ensure all systems agree on pending count"""
    db_pending = len(get_pending_from_json())
    csv_urls = load_csv_urls()

    # This should always be true
    assert db_pending <= len(csv_urls), \
        f"Pending ({db_pending}) > CSV entries ({len(csv_urls)})"

    return db_pending
```

### 3. Fix UI to Match Backend
```javascript
// ALWAYS use the same logic
const pendingAnalysis = properties
    .filter(p => p.status !== 'Removed' && (!p.gpt_score || p.gpt_score === 0))
    .length;
```

### 4. Add Warning System
```python
# In criteria_api.py startup
@app.before_first_request
def check_data_consistency():
    issues = []

    # Load all sources
    json_data = json.load(open('enriched_data.json'))
    csv_data = pd.read_csv('extracted_property_urls.csv')

    # Check consistency
    json_pending = len([p for p in json_data if p.get('gpt_score', 0) == 0])
    csv_total = len(csv_data)

    if json_pending > csv_total:
        issues.append(f"⚠️  JSON has {json_pending} pending but CSV only has {csv_total} URLs")

    if issues:
        logger.warning("DATA INCONSISTENCY DETECTED:\n" + "\n".join(issues))
```

## Long-term Vision

### Ideal Architecture (6 months)

```
┌─────────────────────────────────────────────────────────┐
│                    Web Interface                        │
│          (criteria_manager.html, map_viewer)            │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ↓
┌─────────────────────────────────────────────────────────┐
│                    API Layer                            │
│              (FastAPI with Pydantic)                    │
│  - Type checking                                        │
│  - Validation                                           │
│  - Authentication                                       │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ↓
┌─────────────────────────────────────────────────────────┐
│                 Business Logic Layer                    │
│              (Services with interfaces)                 │
│  - PropertyService                                      │
│  - AnalysisService                                      │
│  - ScoringService                                       │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ↓
┌─────────────────────────────────────────────────────────┐
│                    Data Layer                           │
│              (SQLAlchemy ORM)                           │
│  - Models with validation                               │
│  - Migrations                                           │
│  - Relationships                                        │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ↓
┌─────────────────────────────────────────────────────────┐
│                   PostgreSQL                            │
│              (with triggers, views, constraints)        │
└─────────────────────────────────────────────────────────┘
```

**Key Features:**
- Type safety at every layer (Pydantic models)
- Database migrations (Alembic)
- Automatic API documentation (OpenAPI)
- Real-time updates (WebSockets)
- Comprehensive logging & monitoring
- Automated testing at every layer

## Success Metrics

After implementing this architecture, you should see:

✅ **Zero count discrepancies** between UI and backend
✅ **Database constraints prevent** invalid states
✅ **All scripts use** same data source
✅ **Automated tests catch** inconsistencies before production
✅ **Clear error messages** when something goes wrong
✅ **Audit log** of all data changes
✅ **Performance improvement** (100x faster queries)

## Conclusion

The root cause is **lack of architectural foundation**. The system grew organically from simple scripts to complex application, but retained the multi-file, unvalidated, eventually-consistent architecture.

**The fix is not a patch—it's a redesign:**
1. Single source of truth (SQLite/PostgreSQL)
2. Validation layer (Pydantic models)
3. Centralized API (FastAPI)
4. Automated integrity checks
5. Comprehensive testing

This prevents these issues **by design**, not by hoping developers remember to check 3 different files.
