# Data Quality Dashboard

## Overview

The Data Quality Dashboard monitors geocoding accuracy and location data integrity. Campaign administrators use it to identify and resolve data quality issues, track geocoding provider performance, and keep map data reliable for canvassing operations.

**Key Features:**

- Real-time geocoding quality metrics
- Provider success rate tracking
- Low-confidence location detection
- Duplicate location identification
- Bulk re-geocoding operations
- Address validation reporting
- Interactive quality charts
- Export quality reports

**Use Cases:**

- Monthly data quality audits
- NAR import validation
- Geocoding provider evaluation
- Pre-canvass data verification
- Address database cleanup
- Campaign planning accuracy checks

**Architecture Highlights:**

- Aggregate statistics via database queries
- Confidence threshold filtering (0-100 scale)
- Provider performance comparison
- Duplicate detection via coordinate matching
- Manual review workflows
- Prometheus metrics integration

## Architecture

```mermaid
flowchart TB
    subgraph Admin Interface
        Admin[Admin User]
        Dashboard[DataQualityDashboardPage]
        LocationsPage[LocationsPage]
    end

    subgraph API Layer
        StatsAPI["/api/locations/geocode-stats"]
        LocationsAPI["/api/locations"]
        DuplicatesAPI["/api/locations/duplicates"]
        RegeocodeAPI["/api/locations/:id/regeocode"]
        BulkGeocodeAPI["/api/locations/bulk-geocode"]
    end

    subgraph Database
        LocationsDB[(Locations)]
        Indexes[(Indexes)]
    end

    subgraph Geocoding Service
        GeocodingService[GeocodingService]
        Providers[6 Providers]
        Cache[Redis Cache]
    end

    subgraph Monitoring
        Prometheus[Prometheus]
        Metrics[cm_locations_low_confidence_count]
    end

    Admin --> Dashboard
    Admin --> LocationsPage
    Dashboard --> StatsAPI
    Dashboard --> LocationsAPI
    Dashboard --> DuplicatesAPI
    LocationsPage --> RegeocodeAPI
    LocationsPage --> BulkGeocodeAPI

    StatsAPI --> LocationsDB
    LocationsAPI --> LocationsDB
    DuplicatesAPI --> LocationsDB
    RegeocodeAPI --> GeocodingService
    BulkGeocodeAPI --> GeocodingService

    LocationsDB --> Indexes
    GeocodingService --> Providers
    GeocodingService --> Cache

    StatsAPI --> Prometheus
    Prometheus --> Metrics
```

**Data Flow:**

1. **Statistics Aggregation:**
   - Query all locations with geocoding metadata
   - Calculate aggregate metrics (total, geocoded %, avg confidence)
   - Group by provider for success rate comparison
   - Identify low-confidence locations (< 50)
   - Detect duplicates via coordinate matching

2. **Quality Review:**
   - Admin views dashboard statistics
   - Filters low-confidence locations
   - Reviews individual location details
   - Identifies patterns (provider failures, address format issues)

3. **Remediation:**
   - Manual address correction
   - Single location re-geocoding
   - Bulk re-geocoding with different provider
   - Duplicate merging or marking

4. **Monitoring:**
   - Prometheus metrics track quality trends
   - Alert rules trigger for quality degradation
   - Grafana dashboards visualize provider performance

## Database Models

### Location Model

```prisma
model Location {
  id                Int       @id @default(autoincrement())
  address           String
  latitude          Float?
  longitude         Float?
  postalCode        String?
  province          String?

  // Geocoding metadata
  geocodeConfidence Int?      // 0-100 quality score
  geocodeProvider   String?   // Provider used for geocoding
  geocodedAt        DateTime? // Timestamp of last geocode

  // NAR import fields
  locGuid           String?   @unique
  federalDistrict   String?
  buildingUse       Int?      // 1 = Residential

  addresses         Address[]

  createdAt         DateTime  @default(now())
  updatedAt         DateTime  @updatedAt

  @@index([geocodeConfidence])
  @@index([geocodeProvider])
  @@index([latitude, longitude])
  // A partial index (WHERE latitude IS NOT NULL AND longitude IS NOT NULL) cannot be
  // declared with @@index here; if it is needed, create it in a raw SQL migration.
}
```

**Geocode Confidence Scale:**

- 0-20: Very Low (manual review required)
- 21-40: Low (likely incorrect, re-geocode recommended)
- 41-60: Medium (acceptable but consider verification)
- 61-80: Good (likely accurate)
- 81-100: Excellent (high confidence)

**Geocode Provider Enum:**

```typescript
enum GeocodeProvider {
  GOOGLE = 'GOOGLE',
  MAPBOX = 'MAPBOX',
  NOMINATIM = 'NOMINATIM',
  PHOTON = 'PHOTON',
  LOCATIONIQ = 'LOCATIONIQ',
  ARCGIS = 'ARCGIS',
  UNKNOWN = 'UNKNOWN'
}
```

### Address Model

```prisma
model Address {
  id           Int       @id @default(autoincrement())
  locationId   Int
  location     Location  @relation(fields: [locationId], references: [id], onDelete: Cascade)

  unitNumber   String?
  firstName    String?
  lastName     String?
  supportLevel Int?
  notes        String?

  // Address validation
  isValidated  Boolean   @default(false)
  validatedAt  DateTime?

  createdAt    DateTime  @default(now())
  updatedAt    DateTime  @updatedAt

  @@index([locationId])
}
```

## API Endpoints

### GET /api/locations/geocode-stats

Fetch aggregate geocoding quality statistics.
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)

**Response:**

```json
{
  "total": 1500,
  "geocoded": 1450,
  "geocodedPercent": 96.67,
  "avgConfidence": 78.5,
  "providerBreakdown": {
    "GOOGLE": 800,
    "MAPBOX": 350,
    "NOMINATIM": 200,
    "PHOTON": 100,
    "ARCGIS": 0,
    "LOCATIONIQ": 0,
    "UNKNOWN": 50
  },
  "confidenceDistribution": {
    "0-20": 15,
    "21-40": 35,
    "41-60": 150,
    "61-80": 450,
    "81-100": 800
  },
  "lowConfidenceCount": 50,
  "missingCoordinates": 50,
  "duplicatesCount": 12
}
```

**Implementation:**

```typescript
// locations.service.ts
async getGeocodeStats() {
  const locations = await prisma.location.findMany({
    select: {
      latitude: true,
      longitude: true,
      geocodeConfidence: true,
      geocodeProvider: true
    }
  });

  const total = locations.length;
  // Compare against null explicitly so a coordinate of 0 still counts as geocoded
  const geocoded = locations.filter(l => l.latitude !== null && l.longitude !== null).length;
  // Locations without a score count as 0; guard against an empty table
  const avgConfidence = total > 0
    ? locations.reduce((sum, l) => sum + (l.geocodeConfidence || 0), 0) / total
    : 0;

  const providerBreakdown = locations.reduce((acc, l) => {
    const provider = l.geocodeProvider || 'UNKNOWN';
    acc[provider] = (acc[provider] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  const confidenceDistribution = {
    '0-20': 0,
    '21-40': 0,
    '41-60': 0,
    '61-80': 0,
    '81-100': 0
  };

  locations.forEach(l => {
    const conf = l.geocodeConfidence || 0;
    if (conf <= 20) confidenceDistribution['0-20']++;
    else if (conf <= 40) confidenceDistribution['21-40']++;
    else if (conf <= 60) confidenceDistribution['41-60']++;
    else if (conf <= 80) confidenceDistribution['61-80']++;
    else confidenceDistribution['81-100']++;
  });

  const lowConfidenceCount = locations.filter(l => (l.geocodeConfidence || 0) < 50).length;

  return {
    total,
    geocoded,
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence,
    providerBreakdown,
    confidenceDistribution,
    lowConfidenceCount,
    missingCoordinates: total - geocoded,
    duplicatesCount: await this.countDuplicates()
  };
}
```

### GET /api/locations?geocodeConfidence=lt:50

Fetch locations filtered by geocode confidence.
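The `geocodeConfidence` value in the URL above uses an `operator:value` form. As a minimal sketch of how such a value could be turned into a Prisma-style filter, assuming the operators `lt`, `gt`, `eq` and the literal `null`: `parseConfidenceFilter` and `ConfidenceFilter` are hypothetical names used for illustration and are not taken from the project's code.

```typescript
// Hypothetical helper (not from the codebase): maps "lt:50", "gt:80", "eq:70" or "null"
// to a value usable as `where: { geocodeConfidence: ... }` in a Prisma query.
type ConfidenceFilter = { lt: number } | { gt: number } | { equals: number } | null;

function parseConfidenceFilter(raw: string): ConfidenceFilter | undefined {
  // "null" selects locations that have no confidence score at all
  if (raw === 'null') return null;

  const match = /^(lt|gt|eq):(\d{1,3})$/.exec(raw);
  if (!match) return undefined; // unrecognised syntax

  const value = Number(match[2]);
  if (value > 100) return undefined; // confidence is a 0-100 score

  switch (match[1]) {
    case 'lt':
      return { lt: value };
    case 'gt':
      return { gt: value };
    default:
      return { equals: value };
  }
}
```

Returning `undefined` for malformed input keeps it distinguishable from the valid `null` filter, so a caller can answer with a 400 response instead of silently ignoring the parameter.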
**Authentication:** Required

**Query Parameters:**

- `geocodeConfidence` (filter): `lt:X`, `gt:X`, `eq:X`, `null`
- `geocodeProvider` (filter): Provider name (GOOGLE, MAPBOX, etc.)
- `page` (optional): Page number (default: 1)
- `limit` (optional): Results per page (default: 50)
- `sortBy` (optional): Field to sort by (default: "geocodeConfidence")
- `order` (optional): "asc" or "desc" (default: "asc")

**Examples:**

```
GET /api/locations?geocodeConfidence=lt:50
GET /api/locations?geocodeConfidence=null
GET /api/locations?geocodeProvider=NOMINATIM&geocodeConfidence=lt:70
GET /api/locations?geocodeConfidence=gt:80&sortBy=address
```

**Response:**

```json
{
  "data": [
    {
      "id": 1001,
      "address": "123 Main St",
      "latitude": 43.6532,
      "longitude": -79.3832,
      "postalCode": "M5H 2N2",
      "geocodeConfidence": 45,
      "geocodeProvider": "NOMINATIM",
      "geocodedAt": "2025-02-10T10:00:00Z",
      "addresses": [...]
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 50,
    "total": 150,
    "pages": 3
  }
}
```

### GET /api/locations/duplicates

Identify locations with identical coordinates.
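Exact coordinate equality misses entries that sit a fraction of a metre apart. As a minimal sketch of distance-based matching, assuming a threshold expressed in meters, the helper below groups points whose haversine distance to a group's first member is within that threshold. `Point`, `haversineMeters`, and `groupByProximity` are hypothetical names for illustration and are not part of the service code.

```typescript
// Illustrative sketch only; names and shapes are assumptions, not the project's API.
interface Point {
  id: number;
  latitude: number;
  longitude: number;
}

const EARTH_RADIUS_M = 6371000; // mean Earth radius in meters

// Great-circle distance between two points, in meters
function haversineMeters(a: Point, b: Point): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(b.latitude - a.latitude) / 2);
  const sinLng = Math.sin(toRad(b.longitude - a.longitude) / 2);
  const h =
    sinLat * sinLat +
    Math.cos(toRad(a.latitude)) * Math.cos(toRad(b.latitude)) * sinLng * sinLng;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// Greedy O(n^2) grouping: each point joins the first group whose seed
// (first member) lies within thresholdMeters; groups of one are dropped.
function groupByProximity(points: Point[], thresholdMeters: number): Point[][] {
  const groups: { seed: Point; members: Point[] }[] = [];
  for (const p of points) {
    const group = groups.find(g => haversineMeters(g.seed, p) <= thresholdMeters);
    if (group) {
      group.members.push(p);
    } else {
      groups.push({ seed: p, members: [p] });
    }
  }
  return groups.filter(g => g.members.length > 1).map(g => g.members);
}
```

The pairwise comparison is quadratic, so for large tables a spatial index or a coarse grid pre-filter would be a more realistic choice; the sketch favors readability.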
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)

**Query Parameters:**

- `threshold` (optional): Distance threshold in meters (default: 1, matches exact duplicates)

**Response:**

```json
{
  "duplicates": [
    {
      "coordinates": {
        "latitude": 43.6532,
        "longitude": -79.3832
      },
      "count": 3,
      "locations": [
        {
          "id": 1001,
          "address": "123 Main St",
          "postalCode": "M5H 2N2"
        },
        {
          "id": 1002,
          "address": "123 Main Street",
          "postalCode": "M5H 2N2"
        },
        {
          "id": 1003,
          "address": "123 Main St, Unit 1",
          "postalCode": "M5H 2N2"
        }
      ]
    }
  ],
  "total": 12
}
```

**Implementation:**

```typescript
// locations.service.ts
async findDuplicates(thresholdMeters: number = 1) {
  // Note: thresholdMeters is not applied below; locations are grouped by
  // coordinates rounded to 6 decimal places.
  const locations = await prisma.location.findMany({
    where: {
      AND: [
        { latitude: { not: null } },
        { longitude: { not: null } }
      ]
    },
    select: {
      id: true,
      address: true,
      latitude: true,
      longitude: true,
      postalCode: true
    }
  });

  // Map from "lat,lng" key to the locations that share it
  const coordMap = new Map<string, typeof locations>();

  locations.forEach(loc => {
    // Round to 6 decimal places (~0.1m precision)
    const key = `${loc.latitude!.toFixed(6)},${loc.longitude!.toFixed(6)}`;
    if (!coordMap.has(key)) {
      coordMap.set(key, []);
    }
    coordMap.get(key)!.push(loc);
  });

  const duplicates = Array.from(coordMap.entries())
    .filter(([_, locs]) => locs.length > 1)
    .map(([coords, locs]) => {
      const [lat, lng] = coords.split(',').map(Number);
      return {
        coordinates: { latitude: lat, longitude: lng },
        count: locs.length,
        locations: locs
      };
    });

  return {
    duplicates,
    // Total number of locations that belong to any duplicate group
    total: duplicates.reduce((sum, dup) => sum + dup.count, 0)
  };
}
```

### POST /api/locations/:id/regeocode

Re-geocode a single location with a specified provider.
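When no provider is named in the request, this document says the fallback chain is used. As a rough sketch of how the provider order might be resolved from the `GEOCODE_PRIMARY_PROVIDER` and `GEOCODE_FALLBACK_PROVIDERS` settings described under Configuration: `resolveProviderOrder`, `KNOWN_PROVIDERS`, and the validation behavior are assumptions made for illustration, not the service's actual logic.

```typescript
// Hypothetical sketch; function and constant names are illustrative assumptions.
const KNOWN_PROVIDERS = ['GOOGLE', 'MAPBOX', 'NOMINATIM', 'PHOTON', 'LOCATIONIQ', 'ARCGIS'] as const;
type ProviderName = (typeof KNOWN_PROVIDERS)[number];

function isProviderName(value: string): value is ProviderName {
  return (KNOWN_PROVIDERS as readonly string[]).indexOf(value) !== -1;
}

interface GeocodeEnv {
  GEOCODE_PRIMARY_PROVIDER?: string;
  GEOCODE_FALLBACK_PROVIDERS?: string; // comma-separated
}

function resolveProviderOrder(requested: string | undefined, env: GeocodeEnv): ProviderName[] {
  // An explicit provider in the request bypasses the fallback chain
  if (requested !== undefined) {
    const name = requested.trim().toUpperCase();
    if (!isProviderName(name)) {
      throw new Error(`Unknown geocoding provider: ${requested}`);
    }
    return [name];
  }

  // Otherwise: primary first, then fallbacks, using the documented defaults
  const primary = env.GEOCODE_PRIMARY_PROVIDER ?? 'GOOGLE';
  const fallbacks = (env.GEOCODE_FALLBACK_PROVIDERS ?? 'MAPBOX,NOMINATIM').split(',');

  const order: ProviderName[] = [];
  for (const raw of [primary, ...fallbacks]) {
    const name = raw.trim().toUpperCase();
    // Skip unknown names and duplicates rather than failing the whole request
    if (isProviderName(name) && order.indexOf(name) === -1) {
      order.push(name);
    }
  }
  return order;
}
```

Rejecting an unknown provider in the request while silently skipping unknown names in the environment is a design choice of this sketch: a bad request should fail loudly, whereas a configuration typo should not take geocoding offline.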
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)

**Request Body:**

```json
{
  "provider": "GOOGLE",
  "address": "123 Main St, Toronto ON M5H 2N2"
}
```

**Parameters:**

- `provider` (optional): Specific provider to use (default: fallback chain)
- `address` (optional): Override address string (default: use existing)

**Response:**

```json
{
  "id": 1001,
  "address": "123 Main St",
  "latitude": 43.6532,
  "longitude": -79.3832,
  "geocodeConfidence": 95,
  "geocodeProvider": "GOOGLE",
  "geocodedAt": "2025-02-13T10:30:00Z"
}
```

### POST /api/locations/bulk-geocode

Bulk re-geocode multiple locations.

**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)

**Request Body:**

```json
{
  "locationIds": [1001, 1002, 1003],
  "provider": "GOOGLE",
  "confidenceThreshold": 50
}
```

**Parameters:**

- `locationIds` (optional): Specific location IDs (default: all with confidence < threshold)
- `provider` (optional): Specific provider to use (default: fallback chain)
- `confidenceThreshold` (optional): Only re-geocode locations below this confidence (default: 50)

**Response:**

```json
{
  "jobId": "bulk-geocode-20250213-103000",
  "status": "queued",
  "total": 150,
  "message": "Bulk geocoding job started"
}
```

**Job Progress Endpoint:**

```
GET /api/locations/bulk-geocode/:jobId
```

**Job Status Response:**

```json
{
  "jobId": "bulk-geocode-20250213-103000",
  "status": "processing",
  "progress": {
    "total": 150,
    "processed": 75,
    "successful": 70,
    "failed": 5,
    "percent": 50
  },
  "startedAt": "2025-02-13T10:30:00Z",
  "estimatedCompletion": "2025-02-13T10:35:00Z"
}
```

## Configuration

### Environment Variables

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| GEOCODE_CONFIDENCE_THRESHOLD | number | 50 | Minimum confidence for acceptable geocoding |
| GEOCODE_PRIMARY_PROVIDER | string | GOOGLE | Primary geocoding provider |
| GEOCODE_FALLBACK_PROVIDERS | string | MAPBOX,NOMINATIM | Comma-separated fallback providers |
| GEOCODE_CACHE_TTL | number | 2592000 | Cache TTL in seconds (30 days) |

### Quality Thresholds

| Metric | Warning | Critical | Description |
|--------|---------|----------|-------------|
| Geocoded % | < 95% | < 90% | Percentage of locations with coordinates |
| Avg Confidence | < 70 | < 60 | Average geocode confidence score |
| Low Confidence Count | > 50 | > 100 | Locations with confidence < 50 |
| Duplicates | > 20 | > 50 | Locations with identical coordinates |
| Missing Coordinates | > 5% | > 10% | Locations without lat/lng |

### Prometheus Metrics

**Custom Metrics:**

```typescript
// api/src/utils/metrics.ts
import { Gauge } from 'prom-client';

// Note: each gauge calls the service on every scrape, so three of the four
// gauges recompute the same statistics independently.
export const geocodingQualityGauge = new Gauge({
  name: 'cm_geocoding_avg_confidence',
  help: 'Average geocoding confidence score (0-100)',
  async collect() {
    const stats = await locationsService.getGeocodeStats();
    this.set(stats.avgConfidence);
  }
});

export const lowConfidenceLocationsGauge = new Gauge({
  name: 'cm_locations_low_confidence_count',
  help: 'Number of locations with geocode confidence < 50',
  async collect() {
    const stats = await locationsService.getGeocodeStats();
    this.set(stats.lowConfidenceCount);
  }
});

export const geocodedPercentGauge = new Gauge({
  name: 'cm_locations_geocoded_percent',
  help: 'Percentage of locations with coordinates',
  async collect() {
    const stats = await locationsService.getGeocodeStats();
    this.set(stats.geocodedPercent);
  }
});

export const duplicateLocationsGauge = new Gauge({
  name: 'cm_locations_duplicates_count',
  help: 'Number of duplicate location entries',
  async collect() {
    const duplicates = await locationsService.findDuplicates();
    this.set(duplicates.total);
  }
});
```

**Alert Rules:**

```yaml
# configs/prometheus/alerts.yml
groups:
  - name: data_quality
    interval: 5m
    rules:
      - alert: LowGeocodingConfidence
        expr: cm_geocoding_avg_confidence < 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low average geocoding confidence
          description: "Average geocoding confidence is {{ $value }}, below threshold of 60"

      - alert: HighLowConfidenceLocations
        expr: cm_locations_low_confidence_count > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High number of low-confidence locations
          description: "{{ $value }} locations have geocoding confidence < 50"

      - alert: LowGeocodedPercent
        expr: cm_locations_geocoded_percent < 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low percentage of geocoded locations
          description: "Only {{ $value }}% of locations have coordinates"

      - alert: HighDuplicateLocations
        expr: cm_locations_duplicates_count > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: High number of duplicate locations
          description: "{{ $value }} duplicate location entries detected"
```

## Quality Metrics

### Geocoding Confidence

**Calculation:**

Geocoding confidence is calculated from three factors: match type, address component completeness, and the provider's own score.

```typescript
interface GeocodeResult {
  latitude: number;
  longitude: number;
  matchType: 'exact' | 'interpolated' | 'approximate' | 'fallback';
  addressComponents: {
    streetNumber?: string;
    street?: string;
    city?: string;
    postalCode?: string;
    province?: string;
  };
  providerConfidence?: number; // Provider-specific score
}

function calculateConfidence(result: GeocodeResult, inputAddress: string): number {
  let confidence = 0;

  // Match type (0-40 points)
  switch (result.matchType) {
    case 'exact':
      confidence += 40;
      break;
    case 'interpolated':
      confidence += 30;
      break;
    case 'approximate':
      confidence += 20;
      break;
    case 'fallback':
      confidence += 10;
      break;
  }

  // Address component completeness (0-30 points)
  const components = result.addressComponents;
  if (components.streetNumber) confidence += 10;
  if (components.street) confidence += 10;
  if (components.postalCode) confidence += 10;

  // Provider-specific confidence (0-30 points)
  if (result.providerConfidence) {
    confidence += (result.providerConfidence / 100) * 30;
  }

  return Math.min(Math.round(confidence), 100);
}
```

**Confidence Levels:**

- **81-100 (Excellent):** Exact match with full address components
- **61-80 (Good):** Interpolated match with most components
- **41-60 (Medium):** Approximate match, missing some components
- **21-40 (Low):** Fallback geocoding, significant uncertainty
- **0-20 (Very Low):** Minimal match, likely incorrect

### Provider Success Rates

**Metrics Tracked:**

```typescript
interface ProviderMetrics {
  provider: GeocodeProvider;
  totalAttempts: number;
  successfulGeocodes: number;
  successRate: number; // 0-100%
  avgConfidence: number; // 0-100
  avgResponseTime: number; // milliseconds
  errorCount: number;
  lastError?: string;
}
```

**Success Rate Calculation:**

```typescript
// Assumes groupBy comes from lodash
import { groupBy } from 'lodash';

const calculateProviderMetrics = async (): Promise<ProviderMetrics[]> => {
  const locations = await prisma.location.findMany({
    select: {
      geocodeProvider: true,
      geocodeConfidence: true,
      latitude: true,
      longitude: true
    }
  });

  // Locations without a provider are grouped under UNKNOWN
  const providerGroups = groupBy(locations, l => l.geocodeProvider ?? 'UNKNOWN');

  return Object.entries(providerGroups).map(([provider, locs]) => {
    const total = locs.length;
    const successful = locs.filter(l => l.latitude !== null && l.longitude !== null).length;
    const avgConf = locs.reduce((sum, l) => sum + (l.geocodeConfidence || 0), 0) / total;

    return {
      provider: provider as GeocodeProvider,
      totalAttempts: total,
      successfulGeocodes: successful,
      successRate: (successful / total) * 100,
      avgConfidence: avgConf,
      avgResponseTime: 0, // Would need separate tracking
      errorCount: total - successful
    };
  });
};
```

### Duplicate Detection

**Detection Methods:**

1. **Exact Coordinate Match:**

   ```typescript
   // Round to 6 decimal places (~0.1m precision)
   const isDuplicateExact = (loc1: Location, loc2: Location): boolean => {
     return loc1.latitude!.toFixed(6) === loc2.latitude!.toFixed(6) &&
            loc1.longitude!.toFixed(6) === loc2.longitude!.toFixed(6);
   };
   ```

2. **Proximity Threshold:**

   ```typescript
   // Haversine distance check (haversineDistance is assumed to return meters)
   const isDuplicateProximity = (loc1: Location, loc2: Location, thresholdM: number): boolean => {
     const distance = haversineDistance(
       [loc1.latitude!, loc1.longitude!],
       [loc2.latitude!, loc2.longitude!]
     );
     return distance < thresholdM;
   };
   ```

3. **Address Similarity:**

   ```typescript
   import { distance as levenshteinDistance } from 'fastest-levenshtein';

   const normalizeAddress = (address: string): string => {
     return address
       .toLowerCase()
       .replace(/\bstreet\b/g, 'st')
       .replace(/\bavenue\b/g, 'ave')
       .replace(/\broad\b/g, 'rd')
       .replace(/\bdrive\b/g, 'dr')
       .replace(/[^a-z0-9]/g, '');
   };

   const isDuplicateAddress = (addr1: string, addr2: string): boolean => {
     const normalized1 = normalizeAddress(addr1);
     const normalized2 = normalizeAddress(addr2);

     const dist = levenshteinDistance(normalized1, normalized2);
     const similarity = 1 - (dist / Math.max(normalized1.length, normalized2.length));

     return similarity > 0.9; // 90% similar
   };
   ```

### Address Validation

**Validation Checks:**

```typescript
interface AddressValidationResult {
  isValid: boolean;
  issues: string[];
  suggestions?: string[];
}

const validateAddress = (address: string): AddressValidationResult => {
  const issues: string[] = [];

  // Check minimum length
  if (address.length < 5) {
    issues.push('Address too short');
  }

  // Check for street number
  if (!/^\d+/.test(address)) {
    issues.push('Missing street number');
  }

  // Check for street name
  if (!/\d+\s+([A-Za-z]+\s*)+/.test(address)) {
    issues.push('Missing street name');
  }

  // Check for postal code (Canadian format, case-insensitive)
  if (!/[A-Z]\d[A-Z]\s?\d[A-Z]\d/i.test(address)) {
    issues.push('Missing or invalid postal code');
  }

  // Check for unusual characters
  if (/[^A-Za-z0-9\s,.-]/.test(address)) {
    issues.push('Contains unusual characters');
  }

  return {
    isValid: issues.length === 0,
    issues
  };
};
```

## Admin Workflow

### Navigate to Data Quality Dashboard

**Step 1: Access Dashboard**

1. Log in as SUPER_ADMIN or MAP_ADMIN
2. Click **Map** in sidebar
3. Click **Data Quality** submenu
4. 
Dashboard loads with statistics **Step 2: Review Overall Statistics** Dashboard displays 4 main statistic cards: ```plaintext ┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐ │ Total Locations │ Geocoded │ Avg Confidence │ Low Confidence │ │ 1,500 │ 1,450 (96.7%) │ 78.5 │ 50 │ └──────────────────┴──────────────────┴──────────────────┴──────────────────┘ ``` **Step 3: Analyze Provider Performance** Provider breakdown table shows: | Provider | Count | Success Rate | Avg Confidence | |----------|-------|--------------|----------------| | GOOGLE | 800 | 99.2% | 85.3 | | MAPBOX | 350 | 97.1% | 82.1 | | NOMINATIM | 200 | 94.5% | 75.8 | | PHOTON | 100 | 91.0% | 68.2 | | UNKNOWN | 50 | N/A | 0 | **Step 4: Review Confidence Distribution** Bar chart displays confidence distribution: ```plaintext Confidence Distribution 100 | ┌──────┐ 80 | │ │ 60 | ┌──────┤ │ 40 | ┌──────┤ │ │ 20 | │ │ │ │ 0 └──┴──────┴──────┴──────┴──────┘ 0-20 21-40 41-60 61-80 81-100 15 35 150 450 800 ``` ### Identify and Review Low-Confidence Locations **Step 1: Filter Low-Confidence Locations** 1. Click **Low Confidence** tab on dashboard 2. Table loads with locations where confidence < 50 3. Sort by confidence (ascending) to prioritize worst **Step 2: Review Location Details** Click row to open detail drawer: ```plaintext ┌─────────────────────────────────────────┐ │ Location Details │ ├─────────────────────────────────────────┤ │ Address: 123 Main St │ │ Postal Code: M5H 2N2 │ │ Coordinates: 43.6532, -79.3832 │ │ │ │ Geocoding Info: │ │ Confidence: 45 (Low) │ │ Provider: NOMINATIM │ │ Geocoded: Feb 10, 2025 10:00 AM │ │ │ │ Issues: │ │ • Missing street number in response │ │ • Approximate match only │ │ │ │ [Re-geocode] [Edit Address] [View Map] │ └─────────────────────────────────────────┘ ``` **Step 3: Take Action** Options for remediation: 1. 
**Re-geocode with different provider:** - Click **Re-geocode** button - Select provider (GOOGLE recommended for low confidence) - Click **Geocode Now** - New confidence displayed 2. **Edit address:** - Click **Edit Address** - Correct typos or formatting issues - Save changes - Auto-triggers re-geocoding 3. **View on map:** - Click **View Map** - Verify location accuracy visually - Drag marker to correct position if needed ### Bulk Re-geocoding **Step 1: Select Locations** 1. In Low Confidence tab, use table checkboxes to select locations 2. Or click **Select All** to select all visible 3. Selected count displays: "50 selected" **Step 2: Choose Provider** 1. Click **Bulk Re-geocode** button 2. Modal opens with provider selection: ```plaintext ┌─────────────────────────────────────┐ │ Bulk Re-geocode │ ├─────────────────────────────────────┤ │ Re-geocode 50 locations │ │ │ │ Provider: [GOOGLE ▼] │ │ │ │ Options: │ │ ☑ Only if confidence < 50 │ │ ☑ Cache results │ │ ☐ Overwrite existing coordinates │ │ │ │ Estimated time: ~2 minutes │ │ │ │ [Cancel] [Start Re-geocoding] │ └─────────────────────────────────────┘ ``` **Step 3: Monitor Progress** 1. Job starts, progress bar appears: ```plaintext Re-geocoding in progress... 25/50 (50%) [████████████░░░░░░░░░░░░] 50% ``` 2. Real-time updates: - Total processed - Successful geocodes - Failed geocodes - Average new confidence **Step 4: Review Results** Job completion summary: ```plaintext ┌─────────────────────────────────────┐ │ Bulk Re-geocode Complete │ ├─────────────────────────────────────┤ │ Processed: 50 │ │ Successful: 47 (94%) │ │ Failed: 3 (6%) │ │ │ │ Quality Improvement: │ │ Avg Confidence Before: 42.5 │ │ Avg Confidence After: 81.3 │ │ Improvement: +38.8 │ │ │ │ [View Failed] [Close] │ └─────────────────────────────────────┘ ``` ### Handle Duplicates **Step 1: View Duplicates Tab** 1. Click **Duplicates** tab on dashboard 2. 
Table groups locations by coordinates **Step 2: Review Duplicate Groups** Table displays: | Coordinates | Count | Addresses | Action | |-------------|-------|-----------|--------| | 43.6532, -79.3832 | 3 | 123 Main St, 123 Main Street, 123 Main St Unit 1 | [Review] | | 43.6540, -79.3825 | 2 | 456 Bay St, 456 Bay Street | [Review] | **Step 3: Resolve Duplicates** Click **Review** to open resolution modal: ```plaintext ┌─────────────────────────────────────┐ │ Resolve Duplicates │ ├─────────────────────────────────────┤ │ 3 locations at 43.6532, -79.3832 │ │ │ │ ○ Merge into single location │ │ Primary: 123 Main St │ │ Merge units from duplicates │ │ │ │ ○ Keep as separate multi-unit │ │ Mark as validated multi-unit │ │ │ │ ○ Re-geocode individually │ │ Try to get unique coordinates │ │ │ │ [Cancel] [Resolve] │ └─────────────────────────────────────┘ ``` **Resolution Options:** 1. **Merge:** Combine into single Location with multiple Address records 2. **Multi-unit:** Mark as legitimate multi-unit building 3. 
**Re-geocode:** Attempt to get unique coordinates for each

## Quality Improvement Strategies

### Multi-Provider Geocoding

**Fallback Chain:**

```typescript
// geocoding.service.ts
const PROVIDER_CHAIN: GeocodeProvider[] = [
  GeocodeProvider.GOOGLE,    // Primary: Best accuracy, paid
  GeocodeProvider.MAPBOX,    // Fallback 1: Good accuracy, paid
  GeocodeProvider.NOMINATIM, // Fallback 2: Free, decent accuracy
  GeocodeProvider.PHOTON,    // Fallback 3: Free, lower accuracy
  GeocodeProvider.ARCGIS     // Fallback 4: Free, basic accuracy
];

async geocode(address: string): Promise<GeocodeResult | null> {
  for (const provider of PROVIDER_CHAIN) {
    try {
      const result = await this.geocodeWithProvider(address, provider);
      if (result && result.confidence >= 50) {
        return result; // Success, confidence acceptable
      }
    } catch (error) {
      logger.warn(`Geocoding failed with ${provider}:`, error);
      // Try next provider
    }
  }
  // Every provider failed or returned confidence below 50
  return null;
}
```

**Benefits:**

- Increases success rate (90% → 96%+)
- Reduces dependency on single provider
- Cost optimization (use free providers as fallback)
- Provider outage resilience

### Address Normalization

**Pre-Geocoding Normalization:**

```typescript
const normalizeAddressForGeocoding = (address: string): string => {
  let normalized = address;

  // Remove extra whitespace
  normalized = normalized.replace(/\s+/g, ' ').trim();

  // Standardize abbreviations
  const replacements: Record<string, string> = {
    'Street': 'St',
    'Avenue': 'Ave',
    'Road': 'Rd',
    'Drive': 'Dr',
    'Boulevard': 'Blvd',
    'Apartment': 'Apt',
    'Unit': 'Unit',
    'Suite': 'Ste'
  };

  Object.entries(replacements).forEach(([long, short]) => {
    const regex = new RegExp(`\\b${long}\\b`, 'gi');
    normalized = normalized.replace(regex, short);
  });

  // Ensure postal code spacing (Canadian format)
  normalized = normalized.replace(/([A-Z]\d[A-Z])(\d[A-Z]\d)/, '$1 $2');

  // Remove periods from abbreviations
  normalized = normalized.replace(/\./g, '');

  return normalized;
};
```

**Improvements:**

- Reduces geocoding errors by 10-15%
- Increases confidence scores
- Better cache hit rate

### Geocoding Cache

**Redis Cache Implementation:**

```typescript
// geocoding.service.ts
private async geocodeWithCache(address: string): Promise<GeocodeResult | null> {
  const cacheKey = `geocode:${normalizeAddress(address)}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    logger.debug('Geocoding cache hit:', address);
    return JSON.parse(cached);
  }

  // Cache miss, geocode
  const result = await this.geocode(address);

  if (result) {
    // Cache for 30 days
    await redis.setex(cacheKey, 2592000, JSON.stringify(result));
  }

  return result;
}
```

**Benefits:**

- Reduces API costs (90% cache hit rate)
- Faster response times (Redis: <5ms vs API: 200-500ms)
- Consistent results for same address
- Provider API rate limit avoidance

### Manual Verification

**Critical Location Verification:**

Manually verify high-priority locations:

1. **Campaign offices:** Ensure exact coordinates
2. **Shift start points:** Verify accessibility
3. **Event venues:** Confirm entrance location
4. **Polling stations:** Critical for voter info

**Verification Process:**

```typescript
// Mark location as manually verified
await prisma.location.update({
  where: { id: locationId },
  data: {
    geocodeConfidence: 100,
    geocodeProvider: 'MANUAL',
    geocodedAt: new Date()
  }
});
```

### Regular Audits

**Monthly Quality Audit Checklist:**

1. **Run quality report:**

   ```bash
   # The endpoint requires a SUPER_ADMIN or MAP_ADMIN token
   curl -H "Authorization: Bearer $TOKEN" http://localhost:4000/api/locations/geocode-stats
   ```

2. **Check metrics against thresholds:**
   - Geocoded % > 95%
   - Avg confidence > 70
   - Low confidence count < 50
   - Duplicates < 20

3. **Review low-confidence locations:**
   - Filter locations with confidence < 50
   - Review top 20 by address
   - Identify patterns (specific streets, providers)

4. **Bulk re-geocode low confidence:**
   - Use GOOGLE provider for accuracy
   - Monitor improvement in avg confidence

5. **Resolve duplicates:**
   - Review all duplicate groups
   - Merge or mark as multi-unit
   - Update addresses as needed

6. 
**Export quality report:** ```typescript const report = await generateQualityReport(); fs.writeFileSync(`quality-report-${date}.json`, JSON.stringify(report, null, 2)); ``` ## Code Examples ### DataQualityDashboardPage.tsx ```typescript import React, { useEffect, useState } from 'react'; import { Card, Row, Col, Statistic, Table, Tabs, Button, message } from 'antd'; import { WarningOutlined, CheckCircleOutlined } from '@ant-design/icons'; import { api } from '@/lib/api'; import { Bar } from 'react-chartjs-2'; interface GeocodeStats { total: number; geocoded: number; geocodedPercent: number; avgConfidence: number; providerBreakdown: Record; confidenceDistribution: Record; lowConfidenceCount: number; missingCoordinates: number; duplicatesCount: number; } const DataQualityDashboardPage: React.FC = () => { const [stats, setStats] = useState(null); const [lowConfLocations, setLowConfLocations] = useState([]); const [duplicates, setDuplicates] = useState([]); const [loading, setLoading] = useState(false); useEffect(() => { fetchStats(); fetchLowConfidenceLocations(); fetchDuplicates(); }, []); const fetchStats = async () => { setLoading(true); try { const { data } = await api.get('/locations/geocode-stats'); setStats(data); } catch (error) { message.error('Failed to load statistics'); } finally { setLoading(false); } }; const fetchLowConfidenceLocations = async () => { try { const { data } = await api.get('/locations?geocodeConfidence=lt:50&limit=100'); setLowConfLocations(data.data); } catch (error) { message.error('Failed to load low-confidence locations'); } }; const fetchDuplicates = async () => { try { const { data } = await api.get('/locations/duplicates'); setDuplicates(data.duplicates); } catch (error) { message.error('Failed to load duplicates'); } }; const handleRegeocodeLocation = async (locationId: number) => { try { await api.post(`/locations/${locationId}/regeocode`, { provider: 'GOOGLE' }); message.success('Location re-geocoded successfully'); 
fetchStats(); fetchLowConfidenceLocations(); } catch (error) { message.error('Failed to re-geocode location'); } }; const confidenceChartData = stats ? { labels: Object.keys(stats.confidenceDistribution), datasets: [{ label: 'Locations', data: Object.values(stats.confidenceDistribution), backgroundColor: [ '#e74c3c', // 0-20: Red '#f39c12', // 21-40: Orange '#f1c40f', // 41-60: Yellow '#3498db', // 61-80: Blue '#27ae60' // 81-100: Green ] }] } : null; const lowConfColumns = [ { title: 'Address', dataIndex: 'address', key: 'address' }, { title: 'Confidence', dataIndex: 'geocodeConfidence', key: 'confidence', render: (val: number) => ( {val} )}, { title: 'Provider', dataIndex: 'geocodeProvider', key: 'provider' }, { title: 'Action', key: 'action', render: (_: any, record: any) => ( )} ]; return (

Data Quality Dashboard

{/* Statistics Cards */} } /> 95 ? '#27ae60' : '#f39c12' }} /> 70 ? '#27ae60' : '#f39c12' }} /> } valueStyle={{ color: (stats?.lowConfidenceCount || 0) > 50 ? '#e74c3c' : '#f39c12' }} /> {/* Charts and Tables */} {confidenceChartData && } ({ provider, count })) : []} columns={[ { title: 'Provider', dataIndex: 'provider', key: 'provider' }, { title: 'Count', dataIndex: 'count', key: 'count' } ]} pagination={false} /> ) }, { key: 'low-confidence', label: `Low Confidence (${lowConfLocations.length})`, children: (
) }, { key: 'duplicates', label: `Duplicates (${duplicates.length})`, children: (
`${record.coordinates.latitude.toFixed(6)}, ${record.coordinates.longitude.toFixed(6)}` }, { title: 'Count', dataIndex: 'count', key: 'count' }, { title: 'Addresses', key: 'addresses', render: (_, record: any) => record.locations.map((l: any) => l.address).join(', ') } ]} rowKey={(record) => `${record.coordinates.latitude}-${record.coordinates.longitude}`} /> ) } ]} /> ); }; export default DataQualityDashboardPage; ```

### Geocode Statistics Service

```typescript
// locations.service.ts
import { prisma } from '@/config/database';
import type { GeocodeProvider } from '@prisma/client';
// Assumption: the geocoding service exports a `geocodingService` instance at this relative path
import { geocodingService } from '../geocoding/geocoding.service';

export class LocationsService {
  async getGeocodeStats() {
    const locations = await prisma.location.findMany({
      select: {
        id: true,
        latitude: true,
        longitude: true,
        geocodeConfidence: true,
        geocodeProvider: true
      }
    });

    const total = locations.length;
    // Compare against null so a coordinate of exactly 0 still counts as geocoded
    const geocoded = locations.filter(l => l.latitude !== null && l.longitude !== null).length;
    // Locations without a confidence score count as 0 in the average
    const sumConfidence = locations.reduce((sum, l) => sum + (l.geocodeConfidence ?? 0), 0);
    const avgConfidence = total > 0 ? sumConfidence / total : 0;

    // Provider breakdown
    const providerBreakdown: Record<string, number> = {};
    locations.forEach(l => {
      const provider = l.geocodeProvider || 'UNKNOWN';
      providerBreakdown[provider] = (providerBreakdown[provider] || 0) + 1;
    });

    // Confidence distribution
    const confidenceDistribution = {
      '0-20': 0,
      '21-40': 0,
      '41-60': 0,
      '61-80': 0,
      '81-100': 0
    };
    locations.forEach(l => {
      const conf = l.geocodeConfidence ?? 0;
      if (conf <= 20) confidenceDistribution['0-20']++;
      else if (conf <= 40) confidenceDistribution['21-40']++;
      else if (conf <= 60) confidenceDistribution['41-60']++;
      else if (conf <= 80) confidenceDistribution['61-80']++;
      else confidenceDistribution['81-100']++;
    });

    const lowConfidenceCount = locations.filter(l => (l.geocodeConfidence ?? 0) < 50).length;
    const duplicatesCount = await this.countDuplicates();

    return {
      total,
      geocoded,
      geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
      avgConfidence,
      providerBreakdown,
      confidenceDistribution,
      lowConfidenceCount,
      missingCoordinates: total - geocoded,
      duplicatesCount
    };
  }

  // Counts the locations that share coordinates (to 6 decimal places) with at least
  // one other location. The result is a number of locations, not a number of groups.
  async countDuplicates(): Promise<number> {
    const locations = await prisma.location.findMany({
      where: {
        AND: [
          { latitude: { not: null } },
          { longitude: { not: null } }
        ]
      },
      select: { latitude: true, longitude: true }
    });

    const coordMap = new Map<string, number>();
    locations.forEach(l => {
      const key = `${l.latitude!.toFixed(6)},${l.longitude!.toFixed(6)}`;
      coordMap.set(key, (coordMap.get(key) || 0) + 1);
    });

    return Array.from(coordMap.values())
      .filter(count => count > 1)
      .reduce((sum, count) => sum + count, 0);
  }

  async regeocode(locationId: number, provider?: GeocodeProvider) {
    const location = await prisma.location.findUnique({ where: { id: locationId } });
    if (!location) {
      throw new Error('Location not found');
    }

    const result = await geocodingService.geocode(location.address, provider);
    if (!result) {
      throw new Error('Geocoding failed');
    }

    return await prisma.location.update({
      where: { id: locationId },
      data: {
        latitude: result.latitude,
        longitude: result.longitude,
        geocodeConfidence: result.confidence,
        geocodeProvider: result.provider,
        geocodedAt: new Date()
      }
    });
  }
}
```

## Troubleshooting

### Problem: Many low-confidence locations

**Symptoms:**
- More than 100 locations with confidence below 50
- Average confidence below 60
- Prometheus alert firing

**Solutions:**

1. **Check provider API keys:**
   ```bash
   # Test Google Geocoding API
   curl "https://maps.googleapis.com/maps/api/geocode/json?address=123+Main+St+Toronto&key=YOUR_KEY"

   # Verify key in .env
   echo "$GEOCODE_GOOGLE_API_KEY"
   ```

2. **Try a different primary provider:**
   ```env
   # In .env, change the primary provider
   # Most accurate:
   GEOCODE_PRIMARY_PROVIDER=GOOGLE
   # Or try a good alternative:
   GEOCODE_PRIMARY_PROVIDER=MAPBOX
   ```

3. **Verify address format:**
   ```typescript
   // Bad: Missing city/postal
   "123 Main St"

   // Good: Full address
   "123 Main St, Toronto ON M5H 2N2"
   ```

4. 
   **Use postal code for better accuracy:**
   ```typescript
   // Append postal code if available
   const fullAddress = location.postalCode
     ? `${location.address}, ${location.postalCode}`
     : location.address;
   ```

5. **Bulk re-geocode with Google:**
   ```bash
   # Via API (the JSON body needs an explicit Content-Type header)
   curl -X POST http://localhost:4000/api/locations/bulk-geocode \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"provider":"GOOGLE","confidenceThreshold":50}'
   ```

### Problem: Duplicate locations detected

**Symptoms:**
- Multiple locations at the same coordinates
- Duplicates tab shows many groups
- Inflated location counts in cuts

**Solutions:**

1. **Check whether the location is legitimately multi-unit:**
   ```sql
   -- Find buildings with multiple addresses
   SELECT l.id, l.address, COUNT(a.id) AS unit_count
   FROM "Location" l
   JOIN "Address" a ON a."locationId" = l.id
   GROUP BY l.id
   HAVING COUNT(a.id) > 1;
   ```

2. **Verify geocoding precision:**
   ```typescript
   // Check whether two locations only match because of coarse rounding
   type Coords = { latitude: number; longitude: number };

   const isDuplicateRounding = (loc1: Coords, loc2: Coords): boolean => {
     // Use 4 decimal places (~11 m precision) instead of 6 (~0.1 m)
     return loc1.latitude.toFixed(4) === loc2.latitude.toFixed(4)
       && loc1.longitude.toFixed(4) === loc2.longitude.toFixed(4);
   };
   ```

3. **Review NAR import process:**
   ```typescript
   // Upsert on the unique locGuid so a re-import updates rows instead of duplicating them
   const location = await prisma.location.upsert({
     where: { locGuid: narRecord.LOC_GUID },
     update: { /* update fields */ },
     create: { /* create fields */ }
   });
   ```

4. **Merge duplicates:**
   ```typescript
   // Merge function
   const mergeDuplicates = async (primaryId: number, duplicateIds: number[]) => {
     // Run both steps in one transaction so a failure cannot leave a partial merge
     await prisma.$transaction([
       // Move addresses to primary location
       prisma.address.updateMany({
         where: { locationId: { in: duplicateIds } },
         data: { locationId: primaryId }
       }),
       // Delete duplicates
       prisma.location.deleteMany({
         where: { id: { in: duplicateIds } }
       })
     ]);
   };
   ```

### Problem: Geocoding stats slow to load

**Symptoms:**
- `GET /api/locations/geocode-stats` takes more than 5 seconds
- Dashboard timeout errors
- High database CPU

**Solutions:**

1. 
   **Add database indexes:**
   ```sql
   -- Column names are quoted because PostgreSQL folds unquoted identifiers to lowercase.
   -- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
   CREATE INDEX CONCURRENTLY idx_locations_geocode_confidence
     ON "Location"("geocodeConfidence");

   CREATE INDEX CONCURRENTLY idx_locations_geocode_provider
     ON "Location"("geocodeProvider");

   CREATE INDEX CONCURRENTLY idx_locations_coords
     ON "Location"(latitude, longitude)
     WHERE latitude IS NOT NULL AND longitude IS NOT NULL;
   ```

2. **Cache stats in Redis:**
   ```typescript
   // Cache for 5 minutes
   const getCachedStats = async () => {
     const cached = await redis.get('geocode:stats');
     if (cached) return JSON.parse(cached);

     const stats = await locationsService.getGeocodeStats();
     await redis.setex('geocode:stats', 300, JSON.stringify(stats));
     return stats;
   };
   ```

3. **Aggregate in the database:**
   ```typescript
   // Raw SQL for better performance; returns one row per provider.
   // Casts keep the results as plain numbers (COUNT otherwise comes back as BigInt).
   // COALESCE treats a missing confidence as 0, matching getGeocodeStats().
   const stats = await prisma.$queryRaw`
     SELECT
       "geocodeProvider",
       COUNT(*)::int AS total,
       (COUNT(*) FILTER (WHERE latitude IS NOT NULL AND longitude IS NOT NULL))::int AS geocoded,
       AVG(COALESCE("geocodeConfidence", 0))::float AS avg_confidence,
       (COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50))::int AS low_confidence
     FROM "Location"
     GROUP BY "geocodeProvider"
   `;
   ```

4. 
   **Materialize stats view:**
   ```sql
   -- Create materialized view
   CREATE MATERIALIZED VIEW geocode_stats_mv AS
   SELECT
     COUNT(*) AS total,
     COUNT(*) FILTER (WHERE latitude IS NOT NULL AND longitude IS NOT NULL) AS geocoded,
     AVG(COALESCE("geocodeConfidence", 0)) AS avg_confidence,
     COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50) AS low_confidence
   FROM "Location";

   -- Run hourly (for example from a scheduled job) to refresh the view
   REFRESH MATERIALIZED VIEW geocode_stats_mv;
   ```

## Performance Considerations

### Database Query Optimization

**Indexes:**
- `geocodeConfidence` (filtering)
- `geocodeProvider` (grouping)
- `(latitude, longitude)` composite (duplicate detection)
- Partial index on non-null coordinates

**Query Performance:**
- geocode-stats: ~500ms (1500 locations)
- Low confidence filter: ~100ms (with index)
- Duplicate detection: ~200ms (coordinate grouping)
- Bulk re-geocode: ~2-5 min (150 locations, depends on provider)

### API Rate Limits

**Provider Limits:**
- Google: 50 QPS, $5/1000 requests
- Mapbox: 100,000/month free, then $0.50/1000
- Nominatim: 1 QPS (public), no commercial use
- Photon: No official limit, self-hosted recommended
- ArcGIS: 100,000/month free

**Optimization:**
- Use Redis cache (30-day TTL)
- Batch geocoding jobs to stay under rate limits
- Fall back to free providers for non-critical lookups
- Monitor usage via provider dashboards

### Caching Strategy

**Cache Layers:**

1. **Application Cache (Redis):**
   ```typescript
   // 30-day TTL for geocode results
   const cacheKey = `geocode:${normalizeAddress(address)}`;
   await redis.setex(cacheKey, 2592000, JSON.stringify(result));
   ```

2. **Statistics Cache:**
   ```typescript
   // 5-minute TTL for stats
   await redis.setex('geocode:stats', 300, JSON.stringify(stats));
   ```

3. 
   **Provider Response Cache:**
   ```typescript
   // Cache raw provider responses separately
   await redis.setex(`provider:${provider}:${address}`, 604800, JSON.stringify(rawResponse));
   ```

**Cache Hit Rates:**
- Geocoding: 90%+ (repeated addresses)
- Statistics: 95%+ (frequent dashboard views)
- Provider responses: 85%+ (re-geocoding attempts)

## Related Documentation

### Backend Documentation

- **Locations Service:** `api/src/modules/map/locations/locations.service.ts`
  - Geocode stats aggregation
  - Duplicate detection
  - Re-geocoding operations
- **Geocoding Service:** `api/src/modules/map/geocoding/geocoding.service.ts`
  - Multi-provider fallback
  - Confidence calculation
  - Cache integration
- **Bulk Geocoding:** `api/src/modules/map/locations/bulk-geocode.routes.ts`
  - Job queue integration
  - Progress tracking
  - Error handling

### Frontend Documentation

- **Data Quality Dashboard:** `admin/src/pages/DataQualityDashboardPage.tsx`
  - Statistics display
  - Charts and tables
  - Bulk actions
- **Locations Page:** `admin/src/pages/LocationsPage.tsx`
  - CSV import/export
  - Inline geocoding
  - Address editing

### Database Documentation

- **Location Model:** `api/prisma/schema.prisma`
  - Geocoding metadata fields
  - Indexes for performance
  - Relations to Address

### Monitoring Documentation

- **Prometheus Metrics:** `api/src/utils/metrics.ts`
  - Custom geocoding metrics
  - Quality gauges
  - Alert integration
- **Grafana Dashboard:** `configs/grafana/dashboards/data-quality.json`
  - Quality trend charts
  - Provider comparison
  - Alert visualization

### External Resources

- **Google Geocoding API:** https://developers.google.com/maps/documentation/geocoding
- **Mapbox Geocoding API:** https://docs.mapbox.com/api/search/geocoding
- **Nominatim API:** https://nominatim.org/release-docs/latest/api/Search
- **Photon API:** https://photon.komoot.io