# FarmMatch Business Rules
**Authoritative source for system behavior**

## Core Definitions

### Property States

**Pending Analysis**
- Condition: `status = 'Active' AND gpt_score = 0`
- Meaning: Property needs GPT analysis
- Should appear in "Analyze Only" count
- Should appear in Pipeline Overview "Pending Analysis" count

**Analyzed**
- Condition: `gpt_score > 0`
- Meaning: GPT has analyzed this property
- May or may not have custom_score yet

**Complete**
- Condition: `gpt_score > 0 AND custom_score > 0`
- Meaning: Both GPT and custom analysis done
- Has valid overall_score

**Removed**
- Condition: `status = 'Removed'`
- Meaning: Property no longer available
- Should not appear in any active counts
- Should not be analyzed
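
The state conditions above can be sketched as small predicates. This is a minimal illustration, assuming each property is a dict shaped like the records in `enriched_data.json`; the function names are hypothetical and not taken from the codebase.

```python
def is_removed(prop: dict) -> bool:
    # Removed: status = 'Removed'
    return prop.get("status") == "Removed"


def is_pending_analysis(prop: dict) -> bool:
    # Pending Analysis: status = 'Active' AND gpt_score = 0
    return prop.get("status") == "Active" and prop.get("gpt_score", 0) == 0


def is_analyzed(prop: dict) -> bool:
    # Analyzed: gpt_score > 0 (custom_score may still be 0)
    return prop.get("gpt_score", 0) > 0


def is_complete(prop: dict) -> bool:
    # Complete: gpt_score > 0 AND custom_score > 0
    return prop.get("gpt_score", 0) > 0 and prop.get("custom_score", 0) > 0
```

The states are not mutually exclusive: every Complete property is also Analyzed, so callers should test for the most specific state first.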

### Score Calculation

**GPT Score** (`gpt_score`)
- Range: 0.0 to 5.0
- Source: OpenAI GPT-4o-mini analysis
- Criteria: Market garden (2.0), Guest accommodation (2.5), Workshop (2.0), Rental units (1.5), Location (3.0), Local market (1.5)
- Zero means: Not yet analyzed

**Custom Score** (`custom_score`)
- Range: 0.0 to 5.0
- Source: Objective data (rainfall, temperature, airport distance, population)
- Calculated by: `custom_criteria.py`
- Zero means: Not yet calculated OR insufficient data

**Overall Score** (`overall_score`)
- Formula: `(gpt_score * 0.6) + (custom_score * 0.4)`
- Range: 0.0 to 5.0
- **RULE**: Can only be > 0 if at least one component score is > 0
- **RULE**: If `gpt_score = 0`, then `overall_score = custom_score * 0.4`
- **RULE**: If `custom_score = 0`, then `overall_score = gpt_score * 0.6`
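
As a rough sketch, the formula and its rules could be implemented as below. The function and constant names are illustrative assumptions. The two-decimal rounding follows the "On Score Update" checklist later in this file.

```python
GPT_WEIGHT = 0.6
CUSTOM_WEIGHT = 0.4


def calculate_overall_score(gpt_score: float, custom_score: float) -> float:
    """Weighted overall score; a component of 0 simply contributes nothing."""
    for name, value in (("gpt_score", gpt_score), ("custom_score", custom_score)):
        if not 0.0 <= value <= 5.0:
            raise ValueError(f"{name} must be between 0 and 5, got {value}")
    return round(gpt_score * GPT_WEIGHT + custom_score * CUSTOM_WEIGHT, 2)


print(calculate_overall_score(4.0, 3.0))  # both components present
print(calculate_overall_score(4.0, 0))    # GPT only: gpt_score * 0.6
print(calculate_overall_score(0, 3.0))    # custom only: custom_score * 0.4
```

One consequence of the weights: a property with only a GPT score can reach at most 3.0, and one with only a custom score at most 2.0.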

### Data Integrity Rules

**Rule 1: Score Composition**
```
IF overall_score > 0
THEN (gpt_score > 0 OR custom_score > 0)
```
Violation: `overall_score > 0` while both component scores are 0

**Rule 2: Pending Definition**
```
Property is "pending analysis" IFF:
  status = 'Active' AND gpt_score = 0
```
This is THE ONLY definition. All UI, scripts, and queries MUST use this.

**Rule 3: Geographic Data**
```
IF status = 'Active'
THEN (lat IS NOT NULL AND lon IS NOT NULL)
```
Violation: Active property without coordinates

**Rule 4: URL Uniqueness**
```
url IS PRIMARY KEY
No two properties can have same URL
```

**Rule 5: Status Values**
```
status IN ('Active', 'Removed')
No other values allowed
```
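
The checkable rules above (1, 3, 4, and 5) could be verified in a single pass over the data. The following is a sketch only, assuming `properties` is the list of dicts loaded from `enriched_data.json`; `check_integrity` is a hypothetical helper name. Rule 2 is a definition rather than a constraint, so it is not checked here.

```python
from collections import Counter

ALLOWED_STATUSES = {"Active", "Removed"}


def check_integrity(properties: list[dict]) -> list[str]:
    """Return one message per violation of Rules 1, 3, 4 and 5."""
    violations = []
    for p in properties:
        url = p.get("url")
        gpt = p.get("gpt_score", 0)
        custom = p.get("custom_score", 0)
        # Rule 1: overall_score > 0 requires at least one component score > 0
        if p.get("overall_score", 0) > 0 and not (gpt > 0 or custom > 0):
            violations.append(f"Rule 1: {url} has overall_score but no component scores")
        # Rule 3: Active properties need coordinates
        if p.get("status") == "Active" and (p.get("lat") is None or p.get("lon") is None):
            violations.append(f"Rule 3: {url} is Active without coordinates")
        # Rule 5: only 'Active' and 'Removed' are valid statuses
        if p.get("status") not in ALLOWED_STATUSES:
            violations.append(f"Rule 5: {url} has invalid status {p.get('status')!r}")
    # Rule 4: URLs must be unique
    for url, count in Counter(p.get("url") for p in properties).items():
        if count > 1:
            violations.append(f"Rule 4: {url} appears {count} times")
    return violations
```

Note that a freshly scraped property has no coordinates until geocoding runs, so a check like this would report it under Rule 3 if run between those two steps.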

## Data Flow

### 1. Property Scraping
```
favorites_scraper.py
    ↓
enriched_data.json (adds property)
    ↓
Property state:
  - url: set
  - status: 'Active'
  - gpt_score: 0
  - custom_score: 0
  - overall_score: 0
  - lat/lon: NULL (needs geocoding)
```

### 2. Geocoding
```
bulletproof_geocoding.py
    ↓
Updates enriched_data.json
    ↓
Property state:
  - lat/lon: set
  - gps_source: 'js_object' | 'breadcrumb' | etc
```

### 3. GPT Analysis
```
analyze_from_urls_optimized.py
    ↓
Updates enriched_data.json
    ↓
Property state:
  - gpt_score: 1.0-5.0
  - analysis: full text
  - criteria: {market_garden: 4, ...}
  - overall_score: recalculated
```

### 4. Custom Criteria
```
custom_criteria.py
    ↓
Updates enriched_data.json
    ↓
Property state:
  - custom_score: 1.0-5.0 (stays 0 if data is insufficient)
  - overall_score: recalculated (uses gpt_score = 0 if GPT analysis has not run yet)
```

### 5. Property Removal
```
smart_unfavorite.py OR check_availability.py
    ↓
Updates enriched_data.json
    ↓
Property state:
  - status: 'Removed'
  - availability_status_code: 404 | 410 | etc
  - No longer counted in active properties
```

## Count Calculations

### Total Properties
```python
total = len([p for p in properties if p['status'] == 'Active'])
```

### Pending Analysis
```python
pending = len([p for p in properties
               if p['status'] == 'Active'
               and p.get('gpt_score', 0) == 0])
```

### With GPT Scores
```python
with_gpt = len([p for p in properties
                if p.get('gpt_score', 0) > 0])
```

### Average Score
```python
scores = [p['overall_score'] for p in properties
          if p['status'] == 'Active'
          and p.get('overall_score', 0) > 0]
avg = sum(scores) / len(scores) if scores else 0
```

### Top Properties
```python
top = len([p for p in properties
           if p['status'] == 'Active'
           and p.get('overall_score', 0) > 4.0])
```

## Validation Rules

### On Property Create
- [x] URL is required and valid
- [x] Status defaults to 'Active'
- [x] All scores default to 0
- [x] Title and summary should not be empty
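
A creation helper that applies these checks might look like the sketch below. Several details are assumptions made for illustration: "valid" is taken to mean an http(s) URL with a host, an empty title or summary is treated as a hard error, and the `title` and `summary` field names are guessed from the checklist wording.

```python
from urllib.parse import urlparse


def create_property(url: str, title: str, summary: str) -> dict:
    """Build a new property record with the documented defaults."""
    parsed = urlparse(url or "")
    # Assumption: a "valid" URL is http(s) with a host part
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid URL: {url!r}")
    if not title.strip() or not summary.strip():
        raise ValueError("title and summary should not be empty")
    return {
        "url": url,
        "title": title,
        "summary": summary,
        "status": "Active",   # default status
        "gpt_score": 0,       # all scores default to 0
        "custom_score": 0,
        "overall_score": 0,
        "lat": None,          # filled in later by geocoding
        "lon": None,
    }
```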

### On Property Update
- [x] Cannot change URL (it's the primary key)
- [x] Status can only be 'Active' or 'Removed'
- [x] Scores must be 0-5 range
- [x] If status changes to 'Removed', preserve all data
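
The update rules can be sketched as a function that validates a set of changes before merging them. `apply_update` is a hypothetical name. Merging into a copy, rather than replacing the record, is one way to satisfy the "preserve all data" rule.

```python
ALLOWED_STATUSES = {"Active", "Removed"}
SCORE_FIELDS = ("gpt_score", "custom_score", "overall_score")


def apply_update(prop: dict, changes: dict) -> dict:
    """Return an updated copy of prop, or raise ValueError if a rule is broken."""
    if "url" in changes and changes["url"] != prop["url"]:
        raise ValueError("url is the primary key and cannot be changed")
    if "status" in changes and changes["status"] not in ALLOWED_STATUSES:
        raise ValueError(f"invalid status: {changes['status']!r}")
    for field in SCORE_FIELDS:
        if field in changes and not 0 <= changes[field] <= 5:
            raise ValueError(f"{field} must be in the 0-5 range")
    # Merge instead of replace, so marking a property 'Removed' keeps
    # its scores, analysis text, and every other existing field.
    return {**prop, **changes}
```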

### On Score Update
- [x] When gpt_score changes, recalculate overall_score
- [x] When custom_score changes, recalculate overall_score
- [x] overall_score = (gpt_score * 0.6) + (custom_score * 0.4)
- [x] Round to 2 decimal places

### On Batch Operations
- [x] All properties in extracted_property_urls.csv must exist in enriched_data.json
- [x] All Active properties should have lat/lon
- [x] No property should have overall_score > 0 with both component scores = 0
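
The first batch check, that every queued URL has a record, might be sketched as follows. The layout of `extracted_property_urls.csv` is not specified in this file, so the header row and the `url` column name are assumptions, as is the helper name. In-memory stand-ins replace the real files to keep the example self-contained.

```python
import csv
import io


def urls_missing_from_json(csv_file, properties, url_column="url"):
    """Return queued URLs (from the CSV) that have no record in the JSON data."""
    known_urls = {p["url"] for p in properties}
    return [row[url_column] for row in csv.DictReader(csv_file)
            if row[url_column] not in known_urls]


# Stand-ins for extracted_property_urls.csv and enriched_data.json
queue_csv = io.StringIO(
    "url\nhttps://example.com/listing/1\nhttps://example.com/listing/2\n"
)
enriched = [{"url": "https://example.com/listing/1", "status": "Active"}]

missing = urls_missing_from_json(queue_csv, enriched)
print(missing)  # URLs that are queued but not yet in the JSON data
```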

## Cache Behavior

### GPT Analysis Cache
Location: `.gpt_cache/`

**Cache Key**: Hash of (url + text + prompt)

**Cache Hit**: Skip GPT API call, reuse previous analysis

**Cache Miss**: Call GPT API, save to cache

**Cache Invalidation**:
- Manual deletion of cache files
- Content change detection (a changed url, text, or prompt produces a different hash, so the old entry is no longer hit)

**Rule**: Cache is optimization only. enriched_data.json is source of truth.
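
A minimal sketch of the hit/miss behavior is shown below. The hash algorithm (SHA-256), the one-JSON-file-per-key layout, and the `analyze` callable that stands in for the GPT API call are all assumptions; the actual cache format used by `analyze_from_urls_optimized.py` may differ. The sketch also adds a separator byte between the hashed parts, which goes slightly beyond plain concatenation.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".gpt_cache")


def cache_key(url: str, text: str, prompt: str) -> str:
    """Hash of (url + text + prompt); any change yields a new key."""
    digest = hashlib.sha256()
    for part in (url, text, prompt):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return digest.hexdigest()


def get_or_analyze(url, text, prompt, analyze, cache_dir=CACHE_DIR):
    """Return the cached analysis if present, otherwise call analyze() and cache it."""
    path = Path(cache_dir) / f"{cache_key(url, text, prompt)}.json"
    if path.exists():
        # Cache hit: skip the API call
        return json.loads(path.read_text(encoding="utf-8"))
    # Cache miss: run the analysis and save the result
    result = analyze(text, prompt)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because the cache holds nothing that cannot be regenerated, deleting the directory only costs extra API calls; the result still has to be written to `enriched_data.json` by the caller.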

## File Responsibilities

### enriched_data.json
- **Role**: Single source of truth for all property data
- **Updated by**: All scripts (scraper, analyzer, custom criteria, etc.)
- **Read by**: UI, all scripts
- **Authority**: Final state of every property

### extracted_property_urls.csv
- **Role**: Queue of URLs to process
- **Updated by**: favorites_scraper.py
- **Read by**: analyze_from_urls_optimized.py
- **Authority**: "What URLs should we analyze?"

### analysis_output.csv
- **Role**: Historical record of GPT analyses
- **Updated by**: analyze_from_urls_optimized.py
- **Read by**: Rarely (historical review only)
- **Authority**: Archive, not source of truth

### .gpt_cache/
- **Role**: Performance optimization
- **Updated by**: analyze_from_urls_optimized.py
- **Read by**: analyze_from_urls_optimized.py
- **Authority**: None (can be deleted safely)

## Error Handling

### Inconsistency Detection
When counts don't match between UI and backend:
1. Check which definition is being used
2. Ensure both use `gpt_score = 0` for pending
3. Ensure both exclude `status = 'Removed'`
4. Log discrepancy with details
5. Alert admin if difference > 5%
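
Steps 4 and 5 could be sketched as below. This file does not say what the 5% is measured against, so the sketch assumes the difference is relative to the larger of the two counts; the logger name and function name are also illustrative.

```python
import logging

logger = logging.getLogger("farmmatch.consistency")
ALERT_THRESHOLD = 0.05  # alert admin if difference > 5%


def check_count(name: str, ui_count: int, backend_count: int) -> bool:
    """Log any mismatch; return True if the admin should be alerted."""
    if ui_count == backend_count:
        return False
    # Assumption: measure the difference relative to the larger count.
    # The counts differ and are non-negative, so the baseline is at least 1.
    baseline = max(ui_count, backend_count)
    relative_diff = abs(ui_count - backend_count) / baseline
    logger.warning(
        "%s count mismatch: ui=%d backend=%d (%.1f%%)",
        name, ui_count, backend_count, relative_diff * 100,
    )
    return relative_diff > ALERT_THRESHOLD
```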

### Missing Data
If property missing required data:
1. Log warning with property URL
2. Set to pending analysis
3. Queue for re-scraping if needed
4. Do not remove from database

### Validation Failures
If property fails validation:
1. Log error with full details
2. Do not write invalid data
3. Return error to caller
4. Suggest corrective action

## Migration Path

### Current Architecture Issues
- Multiple sources of truth (CSV, JSON, cache)
- No validation layer
- Inconsistent definitions
- Manual synchronization

### Target Architecture
- Single database (SQLite → PostgreSQL)
- Schema enforcement
- Atomic operations
- Automated consistency checks
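
To illustrate what schema enforcement could look like, here is a hedged sketch of a SQLite table that encodes Rules 1, 4, and 5 plus the score ranges as constraints. The table and column layout are assumptions, not a finalized schema. Rule 3 is left out of the constraints because newly scraped properties have no coordinates until geocoding runs.

```python
import sqlite3

SCHEMA = """
CREATE TABLE properties (
    url           TEXT PRIMARY KEY NOT NULL,
    status        TEXT NOT NULL DEFAULT 'Active'
                  CHECK (status IN ('Active', 'Removed')),
    gpt_score     REAL NOT NULL DEFAULT 0 CHECK (gpt_score BETWEEN 0 AND 5),
    custom_score  REAL NOT NULL DEFAULT 0 CHECK (custom_score BETWEEN 0 AND 5),
    overall_score REAL NOT NULL DEFAULT 0 CHECK (overall_score BETWEEN 0 AND 5),
    lat           REAL,
    lon           REAL,
    -- Rule 1: overall_score > 0 requires a component score > 0
    CHECK (overall_score = 0 OR gpt_score > 0 OR custom_score > 0)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A new property picks up the documented defaults
conn.execute("INSERT INTO properties (url) VALUES (?)",
             ("https://example.com/listing/1",))

# An invalid status is rejected by the database itself
try:
    conn.execute("INSERT INTO properties (url, status) VALUES (?, ?)",
                 ("https://example.com/listing/2", "Sold"))
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```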

### Transition Period
1. Document current rules (this file)
2. Fix immediate inconsistencies (UI vs backend)
3. Add validation functions
4. Create database schema
5. Migrate data
6. Update all scripts to use database
7. Deprecate JSON files

## Decision Log

### Why gpt_score = 0 for "pending"?
- Clear, unambiguous definition
- Easy to query
- Distinguishes "not analyzed" from "analyzed with low score"

### Why not use overall_score = 0?
- Property can have custom_score but no gpt_score
- Its overall_score would then be `custom_score * 0.4`, which is > 0, so it would incorrectly drop out of "pending" before GPT analysis has run
- Creates confusion about what "pending" means

### Why separate gpt_score and custom_score?
- Different data sources
- Different update frequencies
- Allows partial analysis
- User can adjust weights independently

### Why 60/40 weight split?
- GPT analysis is more comprehensive (6 criteria)
- Custom criteria are more objective but rely on limited data
- User testing showed this balance worked best
- Can be adjusted in future

## Enforcement

This document is authoritative. When in doubt:
1. Check this file
2. If behavior doesn't match, this file is correct
3. Update code to match this file
4. Update this file only with explicit decision

Last updated: October 13, 2025
