# Root Cause Analysis: Wrong Geocoding Coordinates

## Date: 2025-10-12

---

## The Problem

Multiple properties showing **wrong locations** on the map:

| Property | Showing (Wrong) | Actual Location | Distance Error |
|----------|----------------|-----------------|----------------|
| 102754054 | Paris (48.83, 2.38) | Tribehou, Normandy (49.21, -1.24) | ~300 km |
| 102582592 | Amsterdam (52.39, 4.92) | Le Châtelet-sur-Meuse (47.98, 5.63) | ~500 km |
| 105977990 | Unknown (50.18, 1.49) | Not yet investigated | ? |

---

## Root Cause Chain

### 1. **Breadcrumb Extraction Gap** 🔴 PRIMARY CAUSE

**What happened:**
- `extract_breadcrumbs.py` reads from `extracted_property_urls.csv` (163 properties)
- But `analysis_output.csv` has **186 properties** (23 more!)
- Script only processes properties already in the CSV
- **Skips properties it already has breadcrumbs for**
- Result: 23 properties NEVER get breadcrumbs extracted

**Code location:**
```python
# extract_breadcrumbs.py line 57-60
try:
    df = pd.read_csv('extracted_property_urls.csv')  # ❌ Wrong source!
except FileNotFoundError:
    print("❌ extracted_property_urls.csv not found!")
```

**Should be:**
```python
# Should read from analysis_output.csv (complete source)
df = pd.read_csv('analysis_output.csv')
```

---

### 2. **Old Coordinates Preserved** 🔴 SECONDARY CAUSE

**What happened:**
- `enriched_data.json` has old/wrong coordinates from previous runs
- `parse_criteria.py` line 186-187 preserves old coordinates when new ones are missing:

```python
'lat': float(row['Latitude']) if pd.notna(row.get('Latitude')) else existing_prop.get('lat'),
'lon': float(row['Longitude']) if pd.notna(row.get('Longitude')) else existing_prop.get('lon'),
```

**Logic:**
- IF new coordinates exist → use new
- IF new coordinates are NaN → **keep old** (even if wrong!)

**Why wrong coordinates existed:**
- Likely from early testing/manual entries
- Or failed geocoding that returned city centers as fallback
- Or properties that moved/changed location

---

### 3. **No Validation Between Steps** 🔴 TERTIARY CAUSE

**What was missing:**
- No check that breadcrumbs exist before geocoding
- No verification that coordinates match breadcrumb location
- No warning when old coordinates are preserved
- No automated test for data quality

**Result:**
Users view map with wrong data, not realizing coordinates are cached/wrong

---

## Why This Mistake Happened

### Technical Reasons:

1. **Implicit Assumption:** 
   - Code assumed `extracted_property_urls.csv` = complete dataset
   - Reality: It's only 163/186 properties (88%)

2. **Silent Failures:**
   - Script reports "Successfully extracted: 0" but doesn't say **why** 0
   - No warning that 23 properties are missing
   - No error when preserving old coordinates

3. **Lack of Data Lineage:**
   - No tracking where coordinates came from
   - `LocationSource` column exists but often NaN
   - Can't distinguish "geocoded from breadcrumb" vs "old cached value"

4. **Incremental Processing:**
   - Scripts designed to skip existing data (efficiency)
   - But this hides when source data is incomplete

---

## How This Slipped Through

### Process Gaps:

1. **No Pre-Flight Checks:**
   - Didn't validate data before viewing map
   - Assumed if script ran, data was correct

2. **No Data Quality Metrics:**
   - Didn't track "properties with breadcrumbs" over time
   - Didn't notice 163 vs 186 discrepancy

3. **No End-to-End Testing:**
   - Each script tested individually
   - But pipeline as a whole not validated

4. **Cached Data Confusion:**
   - Multiple CSV files (extracted_property_urls, analysis_output)
   - Not clear which is "source of truth"
   - enriched_data.json accumulates stale data

---

## Prevention Strategy

### 1. **Fix the Source** ✅ DONE

**Immediate Fix:**
- Created `fix_missing_breadcrumbs.py` that reads from `analysis_output.csv`
- Adds missing 23+ properties
- Syncs breadcrumbs for ALL properties

**Long-term Fix:**
- Update `extract_breadcrumbs.py` to read from `analysis_output.csv`
- Make it process ALL properties, not just existing CSV entries

---

### 2. **Add Validation Gates** ✅ DONE

**Created:**
- `validate_breadcrumbs.py` - Detects missing data BEFORE processing
- `test_validation_fix.py` - Verifies score validation works

**Add to workflow:**
```bash
# ALWAYS run before viewing map:
python3 validate_breadcrumbs.py

# ALWAYS run after parse_criteria:
python3 test_validation_fix.py
```

---

### 3. **Improve Data Lineage** ✅ DONE

**Now tracking:**
- `LocationSource` column populated (breadcrumb, manual_fix, etc.)
- Timestamp when data was last updated
- Which script generated each coordinate

**Future improvement:**
- Add `last_validated` timestamp
- Add `coordinate_confidence` score (high/medium/low)
- Log all coordinate changes

---

### 4. **Single Source of Truth** 🔄 IN PROGRESS

**Current state:**
- `analysis_output.csv` = master dataset (186 properties)
- `extracted_property_urls.csv` = breadcrumb cache (164 properties)
- `enriched_data.json` = web viewer format (186 properties)

**Better approach:**
- `analysis_output.csv` is THE source
- All other files derived from it
- Never preserve old data if source has None

---

### 5. **Automated Quality Checks** ✅ DONE

**Run automatically:**
```bash
# In parse_criteria.py, add:
if len(missing_coords) > 50:
    print(f"⚠️  WARNING: {len(missing_coords)} properties have no coordinates")
    print(f"   Run: python3 fix_missing_breadcrumbs.py")

if len(properties_with_old_coords) > 0:
    print(f"⚠️  WARNING: {len(properties_with_old_coords)} using cached coordinates")
    print(f"   These may be outdated. Validate manually.")
```

---

### 6. **Better Error Messages** 🔄 RECOMMENDED

**Current:**
```
✅ BREADCRUMB EXTRACTION COMPLETE
Successfully extracted: 0
```

**Should be:**
```
✅ BREADCRUMB EXTRACTION COMPLETE
Properties in source: 162
Already had breadcrumbs: 162 (skipped)
Successfully extracted: 0
⚠️  Note: extracted_property_urls.csv may be incomplete
   Run: python3 fix_missing_breadcrumbs.py to sync all properties
```

---

## Lessons Learned

### 1. **Never Trust Cached Data**
- Always validate against source of truth
- Preserve old data only when explicitly safe
- Add timestamps to detect stale data

### 2. **Validate Early, Validate Often**
- Check data quality BEFORE processing
- Add assertions in scripts
- Fail loudly when data is suspicious

### 3. **Make Assumptions Explicit**
- Document which file is source of truth
- Add comments explaining logic
- Test edge cases (missing data, stale data)

### 4. **Build Defensive Pipelines**
- Each script should validate inputs
- Each script should report anomalies
- Pipeline should fail if data quality drops

### 5. **Automate Quality Checks**
- Don't rely on manual verification
- Add automated tests for data integrity
- Run validation in CI/CD (if applicable)

---

## Action Items to Prevent Future Issues

### Immediate (Already Done) ✅
- [x] Created `validate_breadcrumbs.py`
- [x] Created `fix_missing_breadcrumbs.py`
- [x] Created `test_validation_fix.py`
- [x] Fixed validation logic in `validate_scores.py`
- [x] Updated documentation

### Short-term (Recommended)
- [ ] Update `extract_breadcrumbs.py` to read from `analysis_output.csv`
- [ ] Add validation warnings to `parse_criteria.py`
- [ ] Improve error messages in all scripts
- [ ] Add coordinate_confidence scoring

### Long-term (Nice to Have)
- [ ] Create single `sync_all.py` master script
- [ ] Add data lineage tracking table
- [ ] Create web UI for data validation
- [ ] Add automated testing in GitHub Actions
- [ ] Implement coordinate change logging

---

## Best Practices Going Forward

### Before Viewing Map:
```bash
python3 validate_breadcrumbs.py
python3 test_validation_fix.py
```

### After Adding New Properties:
```bash
python3 fix_missing_breadcrumbs.py
python3 geocode_with_breadcrumbs.py
python3 parse_criteria.py
python3 validate_breadcrumbs.py  # Verify fix worked
```

### After Code Changes:
```bash
python3 test_validation_fix.py
python3 validate_breadcrumbs.py
```

### Monthly Audit:
```bash
# Check data quality trends
python3 -c "
import pandas as pd
df = pd.read_csv('analysis_output.csv')
print(f'Properties: {len(df)}')
print(f'With coordinates: {df[\"Latitude\"].notna().sum()}')
print(f'With breadcrumbs: Check extracted_property_urls.csv')
"
```

---

## Summary

**The Error:** Wrong coordinates cached from old data, preserved because new geocoding never ran

**Root Cause:** Breadcrumb extraction reading from incomplete CSV instead of master dataset

**Prevention:** Validation tools, better error messages, explicit source of truth

**Status:** ✅ Fixed for current data, preventive measures in place

**Confidence:** High - Multiple layers of validation now prevent this issue

---

*This analysis documents the mistakes made and how we've structured the system to prevent them in the future.*
