# Data Quality Dashboard
## Overview
The Data Quality Dashboard provides comprehensive monitoring and management of geocoding accuracy and location data integrity. This feature enables campaign administrators to identify and resolve data quality issues, track geocoding provider performance, and ensure reliable map data for canvassing operations.
**Key Features:**
- Real-time geocoding quality metrics
- Provider success rate tracking
- Low-confidence location detection
- Duplicate location identification
- Bulk re-geocoding operations
- Address validation reporting
- Interactive quality charts
- Export quality reports
**Use Cases:**
- Monthly data quality audits
- NAR import validation
- Geocoding provider evaluation
- Pre-canvass data verification
- Address database cleanup
- Campaign planning accuracy checks
**Architecture Highlights:**
- Aggregate statistics via database queries
- Confidence threshold filtering (0-100 scale)
- Provider performance comparison
- Duplicate detection via coordinate matching
- Manual review workflows
- Prometheus metrics integration
## Architecture
```mermaid
flowchart TB
subgraph Admin Interface
Admin[Admin User]
Dashboard[DataQualityDashboardPage]
LocationsPage[LocationsPage]
end
subgraph API Layer
StatsAPI["/api/locations/geocode-stats"]
LocationsAPI["/api/locations"]
DuplicatesAPI["/api/locations/duplicates"]
RegeocodeAPI["/api/locations/:id/regeocode"]
BulkGeocodeAPI["/api/locations/bulk-geocode"]
end
subgraph Database
LocationsDB[(Locations)]
Indexes[(Indexes)]
end
subgraph Geocoding Service
GeocodingService[GeocodingService]
Providers[6 Providers]
Cache[Redis Cache]
end
subgraph Monitoring
Prometheus[Prometheus]
Metrics[cm_locations_low_confidence_count]
end
Admin --> Dashboard
Admin --> LocationsPage
Dashboard --> StatsAPI
Dashboard --> LocationsAPI
Dashboard --> DuplicatesAPI
LocationsPage --> RegeocodeAPI
LocationsPage --> BulkGeocodeAPI
StatsAPI --> LocationsDB
LocationsAPI --> LocationsDB
DuplicatesAPI --> LocationsDB
RegeocodeAPI --> GeocodingService
BulkGeocodeAPI --> GeocodingService
LocationsDB --> Indexes
GeocodingService --> Providers
GeocodingService --> Cache
StatsAPI --> Prometheus
Prometheus --> Metrics
```
**Data Flow:**
1. **Statistics Aggregation:**
   - Query all locations with geocoding metadata
   - Calculate aggregate metrics (total, geocoded %, avg confidence)
   - Group by provider for success rate comparison
   - Identify low-confidence locations (< 50)
   - Detect duplicates via coordinate matching
2. **Quality Review:**
   - Admin views dashboard statistics
   - Filters low-confidence locations
   - Reviews individual location details
   - Identifies patterns (provider failures, address format issues)
3. **Remediation:**
   - Manual address correction
   - Single location re-geocoding
   - Bulk re-geocoding with different provider
   - Duplicate merging or marking
4. **Monitoring:**
   - Prometheus metrics track quality trends
   - Alert rules trigger for quality degradation
   - Grafana dashboards visualize provider performance
## Database Models
### Location Model
```prisma
model Location {
id Int @id @default(autoincrement())
address String
latitude Float?
longitude Float?
postalCode String?
province String?
// Geocoding metadata
geocodeConfidence Int? // 0-100 quality score
geocodeProvider String? // Provider used for geocoding
geocodedAt DateTime? // Timestamp of last geocode
// NAR import fields
locGuid String? @unique
federalDistrict String?
buildingUse Int? // 1 = Residential
addresses Address[]
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@index([geocodeConfidence])
@@index([geocodeProvider])
@@index([latitude, longitude])
// Partial index on non-null coordinates: @@index does not accept this raw SQL `where:` clause,
// so define it in a SQL migration instead of the Prisma schema.
}
```
**Geocode Confidence Scale:**
- 0-20: Very Low (manual review required)
- 21-40: Low (likely incorrect, re-geocode recommended)
- 41-60: Medium (acceptable but consider verification)
- 61-80: Good (likely accurate)
- 81-100: Excellent (high confidence)
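
The scale above can be expressed as a small lookup. The following is a minimal sketch, not code from the application: `confidenceBand` and `ConfidenceBand` are hypothetical names introduced for illustration. Note that the dashboard's low-confidence filter uses a separate cutoff (`< 50`), which falls inside the Medium band.

```typescript
// Hypothetical helper: maps a 0-100 geocode confidence score to the band names listed above.
type ConfidenceBand = 'Very Low' | 'Low' | 'Medium' | 'Good' | 'Excellent';

function confidenceBand(score: number): ConfidenceBand {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`Confidence must be between 0 and 100, got ${score}`);
  }
  if (score <= 20) return 'Very Low';
  if (score <= 40) return 'Low';
  if (score <= 60) return 'Medium';
  if (score <= 80) return 'Good';
  return 'Excellent';
}
```

For example, `confidenceBand(45)` falls in the Medium band even though the same location would appear in the dashboard's low-confidence list.
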
**Geocode Provider Enum:**
```typescript
enum GeocodeProvider {
GOOGLE = 'GOOGLE',
MAPBOX = 'MAPBOX',
NOMINATIM = 'NOMINATIM',
PHOTON = 'PHOTON',
LOCATIONIQ = 'LOCATIONIQ',
ARCGIS = 'ARCGIS',
UNKNOWN = 'UNKNOWN'
}
```
### Address Model
```prisma
model Address {
id Int @id @default(autoincrement())
locationId Int
location Location @relation(fields: [locationId], references: [id], onDelete: Cascade)
unitNumber String?
firstName String?
lastName String?
supportLevel Int?
notes String?
// Address validation
isValidated Boolean @default(false)
validatedAt DateTime?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@index([locationId])
}
```
## API Endpoints
### GET /api/locations/geocode-stats
Fetch aggregate geocoding quality statistics.
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
**Response:**
```json
{
"total": 1500,
"geocoded": 1450,
"geocodedPercent": 96.67,
"avgConfidence": 78.5,
"providerBreakdown": {
"GOOGLE": 800,
"MAPBOX": 350,
"NOMINATIM": 200,
"PHOTON": 100,
"ARCGIS": 0,
"LOCATIONIQ": 0,
"UNKNOWN": 50
},
"confidenceDistribution": {
"0-20": 15,
"21-40": 35,
"41-60": 150,
"61-80": 450,
"81-100": 800
},
"lowConfidenceCount": 50,
"missingCoordinates": 50,
"duplicatesCount": 12
}
```
**Implementation:**
```typescript
// locations.service.ts
async getGeocodeStats() {
  const locations = await prisma.location.findMany({
    select: {
      latitude: true,
      longitude: true,
      geocodeConfidence: true,
      geocodeProvider: true
    }
  });
  const total = locations.length;
  // Compare against null explicitly so a coordinate of exactly 0 still counts as geocoded
  const geocoded = locations.filter(l => l.latitude != null && l.longitude != null).length;
  // Locations without a score count as 0; guard against division by zero on an empty table
  const avgConfidence = total > 0
    ? locations.reduce((sum, l) => sum + (l.geocodeConfidence ?? 0), 0) / total
    : 0;
  const providerBreakdown = locations.reduce((acc, l) => {
    const provider = l.geocodeProvider || 'UNKNOWN';
    acc[provider] = (acc[provider] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
  const confidenceDistribution = {
    '0-20': 0,
    '21-40': 0,
    '41-60': 0,
    '61-80': 0,
    '81-100': 0
  };
  locations.forEach(l => {
    const conf = l.geocodeConfidence ?? 0;
    if (conf <= 20) confidenceDistribution['0-20']++;
    else if (conf <= 40) confidenceDistribution['21-40']++;
    else if (conf <= 60) confidenceDistribution['41-60']++;
    else if (conf <= 80) confidenceDistribution['61-80']++;
    else confidenceDistribution['81-100']++;
  });
  // Unscored locations (null confidence) are treated as 0 and therefore counted as low confidence
  const lowConfidenceCount = locations.filter(l =>
    (l.geocodeConfidence ?? 0) < 50).length;
  return {
    total,
    geocoded,
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence,
    providerBreakdown,
    confidenceDistribution,
    lowConfidenceCount,
    missingCoordinates: total - geocoded,
    duplicatesCount: await this.countDuplicates()
  };
}
```
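
The implementation above loads every location into memory before counting. As a hedged alternative, the same headline numbers could be computed in the database with Prisma's `count`, `aggregate`, and `groupBy`. This is a sketch under assumptions: `LocationDelegate` is a hypothetical structural stand-in for `prisma.location` so the example can run against a stub, and `getGeocodeStatsAggregated` is an illustrative name. Because SQL `AVG` and `lt` comparisons skip `NULL`, unscored locations are excluded here rather than counted as 0, so results can differ from the in-memory version.

```typescript
// Hypothetical structural type covering only the delegate calls used below.
// In the application this role would be played by `prisma.location`.
interface LocationDelegate {
  count(args?: { where?: Record<string, unknown> }): Promise<number>;
  aggregate(args: { _avg: { geocodeConfidence: true } }): Promise<{
    _avg: { geocodeConfidence: number | null };
  }>;
  groupBy(args: { by: ['geocodeProvider']; _count: { _all: true } }): Promise<
    Array<{ geocodeProvider: string | null; _count: { _all: number } }>
  >;
}

async function getGeocodeStatsAggregated(location: LocationDelegate) {
  // Run the independent queries in parallel; each one is aggregated by the database.
  const [total, geocoded, lowConfidenceCount, avg, byProvider] = await Promise.all([
    location.count(),
    location.count({ where: { latitude: { not: null }, longitude: { not: null } } }),
    location.count({ where: { geocodeConfidence: { lt: 50 } } }),
    location.aggregate({ _avg: { geocodeConfidence: true } }),
    location.groupBy({ by: ['geocodeProvider'], _count: { _all: true } }),
  ]);

  // Rows with no provider recorded are reported under UNKNOWN.
  const providerBreakdown: Record<string, number> = {};
  for (const row of byProvider) {
    providerBreakdown[row.geocodeProvider ?? 'UNKNOWN'] = row._count._all;
  }

  return {
    total,
    geocoded,
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence: avg._avg.geocodeConfidence ?? 0, // null when no location has a score
    providerBreakdown,
    lowConfidenceCount,
    missingCoordinates: total - geocoded,
  };
}
```

The confidence distribution and duplicate count are left out of this sketch; they would need additional bucketed queries.
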
### GET /api/locations?geocodeConfidence=lt:50
Fetch locations filtered by geocode confidence.
**Authentication:** Required
**Query Parameters:**
- `geocodeConfidence` (filter): `lt:X`, `gt:X`, `eq:X`, `null`
- `geocodeProvider` (filter): Provider name (GOOGLE, MAPBOX, etc.)
- `page` (optional): Page number (default: 1)
- `limit` (optional): Results per page (default: 50)
- `sortBy` (optional): Field to sort by (default: "geocodeConfidence")
- `order` (optional): "asc" or "desc" (default: "asc")
**Examples:**
```
GET /api/locations?geocodeConfidence=lt:50
GET /api/locations?geocodeConfidence=null
GET /api/locations?geocodeProvider=NOMINATIM&geocodeConfidence=lt:70
GET /api/locations?geocodeConfidence=gt:80&sortBy=address
```
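
One way the `geocodeConfidence` query value could be translated into a database filter is sketched below. This is an assumption about the server side, not the documented implementation: `parseConfidenceFilter` and `ConfidenceWhere` are hypothetical names, and the returned shape simply mirrors the number filters (`lt`, `gt`, `equals`) that Prisma accepts in a `where` clause.

```typescript
// Hypothetical result type: a Prisma-style number filter, or null for "no confidence score".
type ConfidenceWhere = { lt: number } | { gt: number } | { equals: number } | null;

function parseConfidenceFilter(raw: string): ConfidenceWhere {
  if (raw === 'null') return null; // match locations that were never scored

  // Accept only the documented operators followed by an integer.
  const match = /^(lt|gt|eq):(\d{1,3})$/.exec(raw);
  if (!match) {
    throw new Error(`Unsupported geocodeConfidence filter: ${raw}`);
  }
  const value = Number(match[2]);
  if (value > 100) {
    throw new Error(`Confidence must be between 0 and 100, got ${value}`);
  }
  switch (match[1]) {
    case 'lt': return { lt: value };
    case 'gt': return { gt: value };
    default: return { equals: value };
  }
}
```

With this sketch, `lt:50` becomes `{ lt: 50 }`, which could be passed as `where: { geocodeConfidence: { lt: 50 } }`; rejecting unknown operators keeps malformed input out of the query.
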
**Response:**
```json
{
"data": [
{
"id": 1001,
"address": "123 Main St",
"latitude": 43.6532,
"longitude": -79.3832,
"postalCode": "M5H 2N2",
"geocodeConfidence": 45,
"geocodeProvider": "NOMINATIM",
"geocodedAt": "2025-02-10T10:00:00Z",
"addresses": [...]
}
],
"pagination": {
"page": 1,
"limit": 50,
"total": 150,
"pages": 3
}
}
```
### GET /api/locations/duplicates
Identify locations with identical coordinates.
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
**Query Parameters:**
- `threshold` (optional): Distance threshold in meters (default: 1, matches exact duplicates)
**Response:**
```json
{
"duplicates": [
{
"coordinates": {
"latitude": 43.6532,
"longitude": -79.3832
},
"count": 3,
"locations": [
{
"id": 1001,
"address": "123 Main St",
"postalCode": "M5H 2N2"
},
{
"id": 1002,
"address": "123 Main Street",
"postalCode": "M5H 2N2"
},
{
"id": 1003,
"address": "123 Main St, Unit 1",
"postalCode": "M5H 2N2"
}
]
}
],
"total": 12
}
```
**Implementation:**
```typescript
// locations.service.ts
async findDuplicates(thresholdMeters: number = 1) {
const locations = await prisma.location.findMany({
where: {
AND: [
{ latitude: { not: null } },
{ longitude: { not: null } }
]
},
select: {
id: true,
address: true,
latitude: true,
longitude: true,
postalCode: true
}
});
const coordMap = new Map<string, typeof locations>();
locations.forEach(loc => {
// Group by coordinates rounded to 6 decimal places (~0.1m precision).
// Note: thresholdMeters is not applied by this exact-match grouping.
const key = `${loc.latitude!.toFixed(6)},${loc.longitude!.toFixed(6)}`;
if (!coordMap.has(key)) {
coordMap.set(key, []);
}
coordMap.get(key)!.push(loc);
});
const duplicates = Array.from(coordMap.entries())
.filter(([_, locs]) => locs.length > 1)
.map(([coords, locs]) => {
const [lat, lng] = coords.split(',').map(Number);
return {
coordinates: { latitude: lat, longitude: lng },
count: locs.length,
locations: locs
};
});
return {
duplicates,
total: duplicates.reduce((sum, dup) => sum + dup.count, 0)
};
}
```
### POST /api/locations/:id/regeocode
Re-geocode a single location, optionally with a specific provider or a corrected address string.
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
**Request Body:**
```json
{
"provider": "GOOGLE",
"address": "123 Main St, Toronto ON M5H 2N2"
}
```
**Parameters:**
- `provider` (optional): Specific provider to use (default: fallback chain)
- `address` (optional): Override address string (default: use existing)
**Response:**
```json
{
"id": 1001,
"address": "123 Main St",
"latitude": 43.6532,
"longitude": -79.3832,
"geocodeConfidence": 95,
"geocodeProvider": "GOOGLE",
"geocodedAt": "2025-02-13T10:30:00Z"
}
```
### POST /api/locations/bulk-geocode
Bulk re-geocode multiple locations.
**Authentication:** Required (SUPER_ADMIN, MAP_ADMIN)
**Request Body:**
```json
{
"locationIds": [1001, 1002, 1003],
"provider": "GOOGLE",
"confidenceThreshold": 50
}
```
**Parameters:**
- `locationIds` (optional): Specific location IDs (default: all with confidence < threshold)
- `provider` (optional): Specific provider to use (default: fallback chain)
- `confidenceThreshold` (optional): Only re-geocode locations below this confidence (default: 50)
**Response:**
```json
{
"jobId": "bulk-geocode-20250213-103000",
"status": "queued",
"total": 150,
"message": "Bulk geocoding job started"
}
```
**Job Progress Endpoint:**
```
GET /api/locations/bulk-geocode/:jobId
```
**Job Status Response:**
```json
{
"jobId": "bulk-geocode-20250213-103000",
"status": "processing",
"progress": {
"total": 150,
"processed": 75,
"successful": 70,
"failed": 5,
"percent": 50
},
"startedAt": "2025-02-13T10:30:00Z",
"estimatedCompletion": "2025-02-13T10:35:00Z"
}
```
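
The document does not say how `percent` and `estimatedCompletion` are derived. A plausible approach is linear extrapolation from the items processed so far, sketched below. `estimateCompletion` and `JobProgress` are hypothetical names, and a real job may use a different estimate.

```typescript
// Hypothetical shape matching the `progress` object in the job status response.
interface JobProgress {
  total: number;
  processed: number;
  successful: number;
  failed: number;
}

function estimateCompletion(
  progress: JobProgress,
  startedAt: Date,
  now: Date
): { percent: number; estimatedCompletion: Date | null } {
  const percent = progress.total > 0
    ? Math.round((progress.processed / progress.total) * 100)
    : 0;

  // No estimate is possible until at least one item has been processed.
  if (progress.total === 0 || progress.processed === 0) {
    return { percent, estimatedCompletion: null };
  }

  // Assume the remaining items take as long, on average, as the ones already done.
  const elapsedMs = now.getTime() - startedAt.getTime();
  const msPerItem = elapsedMs / progress.processed;
  const remainingMs = msPerItem * (progress.total - progress.processed);
  return { percent, estimatedCompletion: new Date(now.getTime() + remainingMs) };
}
```

Under this assumption, a job started at 10:30:00 that has processed 75 of 150 locations by 10:32:30 would report 50% and an estimated completion of 10:35:00, consistent with the sample response above.
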
## Configuration
### Environment Variables
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| GEOCODE_CONFIDENCE_THRESHOLD | number | 50 | Minimum confidence for acceptable geocoding |
| GEOCODE_PRIMARY_PROVIDER | string | GOOGLE | Primary geocoding provider |
| GEOCODE_FALLBACK_PROVIDERS | string | MAPBOX,NOMINATIM | Comma-separated fallback providers |
| GEOCODE_CACHE_TTL | number | 2592000 | Cache TTL in seconds (30 days) |
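
The variables above could be read and validated once at startup. The loader below is a minimal sketch, assuming the defaults from the table; `loadGeocodeConfig` and `GeocodeConfig` are hypothetical names, and the application may parse its configuration differently.

```typescript
// Hypothetical typed view of the geocoding environment variables.
interface GeocodeConfig {
  confidenceThreshold: number;
  primaryProvider: string;
  fallbackProviders: string[];
  cacheTtlSeconds: number;
}

function loadGeocodeConfig(env: Record<string, string | undefined>): GeocodeConfig {
  // Parse a non-negative integer, falling back to the documented default when unset.
  const toInt = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === '') return fallback;
    const value = Number(raw);
    if (!Number.isInteger(value) || value < 0) {
      throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
    }
    return value;
  };

  const confidenceThreshold = toInt('GEOCODE_CONFIDENCE_THRESHOLD', 50);
  if (confidenceThreshold > 100) {
    throw new Error('GEOCODE_CONFIDENCE_THRESHOLD must be between 0 and 100');
  }

  return {
    confidenceThreshold,
    primaryProvider: (env['GEOCODE_PRIMARY_PROVIDER'] ?? 'GOOGLE').trim().toUpperCase(),
    fallbackProviders: (env['GEOCODE_FALLBACK_PROVIDERS'] ?? 'MAPBOX,NOMINATIM')
      .split(',')
      .map(p => p.trim().toUpperCase())
      .filter(p => p.length > 0),
    cacheTtlSeconds: toInt('GEOCODE_CACHE_TTL', 2592000), // 30 days
  };
}
```

In a Node.js service this would typically be called as `loadGeocodeConfig(process.env)`; failing fast on malformed values avoids silently geocoding with a threshold of `NaN`.
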
### Quality Thresholds
| Metric | Warning | Critical | Description |
|--------|---------|----------|-------------|
| Geocoded % | < 95% | < 90% | Percentage of locations with coordinates |
| Avg Confidence | < 70 | < 60 | Average geocode confidence score |
| Low Confidence Count | > 50 | > 100 | Locations with confidence < 50 |
| Duplicates | > 20 | > 50 | Locations with identical coordinates |
| Missing Coordinates | > 5% | > 10% | Locations without lat/lng |
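
The thresholds in this table can be checked mechanically against the `/api/locations/geocode-stats` response. The sketch below is illustrative only: `evaluateQuality`, `QualityStats`, and `Severity` are hypothetical names, and the numeric cutoffs are copied from the table above.

```typescript
type Severity = 'ok' | 'warning' | 'critical';
type QualityMetric =
  | 'geocodedPercent'
  | 'avgConfidence'
  | 'lowConfidenceCount'
  | 'duplicates'
  | 'missingCoordinates';

// Hypothetical subset of the geocode-stats response needed for the checks.
interface QualityStats {
  total: number;
  geocodedPercent: number;
  avgConfidence: number;
  lowConfidenceCount: number;
  missingCoordinates: number;
  duplicatesCount: number;
}

// "Lower is worse" metrics: critical below `crit`, warning below `warn`.
const below = (value: number, warn: number, crit: number): Severity =>
  value < crit ? 'critical' : value < warn ? 'warning' : 'ok';

// "Higher is worse" metrics: critical above `crit`, warning above `warn`.
const above = (value: number, warn: number, crit: number): Severity =>
  value > crit ? 'critical' : value > warn ? 'warning' : 'ok';

function evaluateQuality(stats: QualityStats): Record<QualityMetric, Severity> {
  // Missing coordinates are thresholded as a percentage of all locations.
  const missingPercent = stats.total > 0
    ? (stats.missingCoordinates / stats.total) * 100
    : 0;
  return {
    geocodedPercent: below(stats.geocodedPercent, 95, 90),
    avgConfidence: below(stats.avgConfidence, 70, 60),
    lowConfidenceCount: above(stats.lowConfidenceCount, 50, 100),
    duplicates: above(stats.duplicatesCount, 20, 50),
    missingCoordinates: above(missingPercent, 5, 10),
  };
}
```

Applied to the sample statistics shown earlier (96.67% geocoded, average confidence 78.5, 50 low-confidence locations, 12 duplicates, 50 of 1,500 missing coordinates), every metric evaluates to `ok`.
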
### Prometheus Metrics
**Custom Metrics:**
```typescript
// api/src/utils/metrics.ts
import { Gauge } from 'prom-client';
export const geocodingQualityGauge = new Gauge({
name: 'cm_geocoding_avg_confidence',
help: 'Average geocoding confidence score (0-100)',
async collect() {
const stats = await locationsService.getGeocodeStats();
this.set(stats.avgConfidence);
}
});
export const lowConfidenceLocationsGauge = new Gauge({
name: 'cm_locations_low_confidence_count',
help: 'Number of locations with geocode confidence < 50',
async collect() {
const stats = await locationsService.getGeocodeStats();
this.set(stats.lowConfidenceCount);
}
});
export const geocodedPercentGauge = new Gauge({
name: 'cm_locations_geocoded_percent',
help: 'Percentage of locations with coordinates',
async collect() {
const stats = await locationsService.getGeocodeStats();
this.set(stats.geocodedPercent);
}
});
export const duplicateLocationsGauge = new Gauge({
name: 'cm_locations_duplicates_count',
help: 'Number of duplicate location entries',
async collect() {
const duplicates = await locationsService.findDuplicates();
this.set(duplicates.total);
}
});
```
**Alert Rules:**
```yaml
# configs/prometheus/alerts.yml
groups:
  - name: data_quality
    interval: 5m
    rules:
      - alert: LowGeocodingConfidence
        expr: cm_geocoding_avg_confidence < 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low average geocoding confidence
          description: "Average geocoding confidence is {{ $value }}, below threshold of 60"
      - alert: HighLowConfidenceLocations
        expr: cm_locations_low_confidence_count > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High number of low-confidence locations
          description: "{{ $value }} locations have geocoding confidence < 50"
      - alert: LowGeocodedPercent
        expr: cm_locations_geocoded_percent < 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Low percentage of geocoded locations
          description: "Only {{ $value }}% of locations have coordinates"
      - alert: HighDuplicateLocations
        expr: cm_locations_duplicates_count > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: High number of duplicate locations
          description: "{{ $value }} duplicate location entries detected"
```
## Quality Metrics
### Geocoding Confidence
**Calculation:**
Geocoding confidence is calculated based on multiple factors:
```typescript
interface GeocodeResult {
latitude: number;
longitude: number;
matchType: 'exact' | 'interpolated' | 'approximate' | 'fallback';
addressComponents: {
streetNumber?: string;
street?: string;
city?: string;
postalCode?: string;
province?: string;
};
providerConfidence?: number; // Provider-specific score
}
function calculateConfidence(result: GeocodeResult, inputAddress: string): number {
let confidence = 0;
// Match type (0-40 points)
switch (result.matchType) {
case 'exact': confidence += 40; break;
case 'interpolated': confidence += 30; break;
case 'approximate': confidence += 20; break;
case 'fallback': confidence += 10; break;
}
// Address component completeness (0-30 points)
const components = result.addressComponents;
if (components.streetNumber) confidence += 10;
if (components.street) confidence += 10;
if (components.postalCode) confidence += 10;
// Provider-specific confidence (0-30 points)
if (result.providerConfidence) {
confidence += (result.providerConfidence / 100) * 30;
}
return Math.min(Math.round(confidence), 100);
}
```
**Confidence Levels:**
- **81-100 (Excellent):** Exact match with full address components
- **61-80 (Good):** Interpolated match with most components
- **41-60 (Medium):** Approximate match, missing some components
- **21-40 (Low):** Fallback geocoding, significant uncertainty
- **0-20 (Very Low):** Minimal match, likely incorrect
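
To make the point allocation concrete, the sketch below breaks a score into its three parts using the same weights as `calculateConfidence` (match type up to 40, address components up to 30, provider score up to 30). `scoreBreakdown` and its `componentsPresent` parameter are hypothetical simplifications introduced for this worked example.

```typescript
type MatchType = 'exact' | 'interpolated' | 'approximate' | 'fallback';

// Points per match type, as in calculateConfidence above.
const MATCH_POINTS: Record<MatchType, number> = {
  exact: 40,
  interpolated: 30,
  approximate: 20,
  fallback: 10,
};

// componentsPresent: how many of street number, street, and postal code were returned (0-3).
function scoreBreakdown(matchType: MatchType, componentsPresent: number, providerConfidence = 0) {
  const match = MATCH_POINTS[matchType];
  const components = Math.min(Math.max(componentsPresent, 0), 3) * 10;
  const provider = (providerConfidence / 100) * 30;
  return {
    match,
    components,
    provider,
    total: Math.min(Math.round(match + components + provider), 100),
  };
}
```

An exact match with all three components and a provider score of 90 totals 40 + 30 + 27 = 97 (Excellent), while an approximate match with one component and a provider score of 50 totals 20 + 10 + 15 = 45.
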
### Provider Success Rates
**Metrics Tracked:**
```typescript
interface ProviderMetrics {
provider: GeocodeProvider;
totalAttempts: number;
successfulGeocodes: number;
successRate: number; // 0-100%
avgConfidence: number; // 0-100
avgResponseTime: number; // milliseconds
errorCount: number;
lastError?: string;
}
```
**Success Rate Calculation:**
```typescript
import { groupBy } from 'lodash';
const calculateProviderMetrics = async (): Promise<ProviderMetrics[]> => {
  const locations = await prisma.location.findMany({
    select: {
      geocodeProvider: true,
      geocodeConfidence: true,
      latitude: true,
      longitude: true
    }
  });
  // Group rows by provider; rows with no provider recorded fall under UNKNOWN
  const providerGroups = groupBy(locations, l => l.geocodeProvider ?? 'UNKNOWN');
  return Object.entries(providerGroups).map(([provider, locs]) => {
    const total = locs.length; // never 0: groupBy only creates non-empty groups
    // Compare against null explicitly so a coordinate of exactly 0 still counts
    const successful = locs.filter(l => l.latitude != null && l.longitude != null).length;
    const avgConf = locs.reduce((sum, l) => sum + (l.geocodeConfidence ?? 0), 0) / total;
    return {
      provider: provider as GeocodeProvider,
      totalAttempts: total,
      successfulGeocodes: successful,
      successRate: (successful / total) * 100,
      avgConfidence: avgConf,
      avgResponseTime: 0, // Not derivable from stored locations; needs separate tracking
      errorCount: total - successful
    };
  });
};
```
### Duplicate Detection
**Detection Methods:**
1. **Exact Coordinate Match:**
```typescript
// Round to 6 decimal places (~0.1m precision)
const isDuplicateExact = (loc1: Location, loc2: Location): boolean => {
return loc1.latitude!.toFixed(6) === loc2.latitude!.toFixed(6) &&
loc1.longitude!.toFixed(6) === loc2.longitude!.toFixed(6);
};
```
2. **Proximity Threshold:**
```typescript
// Haversine distance check
const isDuplicateProximity = (loc1: Location, loc2: Location, thresholdM: number): boolean => {
const distance = haversineDistance(
[loc1.latitude!, loc1.longitude!],
[loc2.latitude!, loc2.longitude!]
);
return distance < thresholdM;
};
```
3. **Address Similarity:**
```typescript
import { distance as levenshteinDistance } from 'fastest-levenshtein';
const isDuplicateAddress = (addr1: string, addr2: string): boolean => {
const normalized1 = normalizeAddress(addr1);
const normalized2 = normalizeAddress(addr2);
const dist = levenshteinDistance(normalized1, normalized2);
const similarity = 1 - (dist / Math.max(normalized1.length, normalized2.length));
return similarity > 0.9; // 90% similar
};
const normalizeAddress = (address: string): string => {
return address
.toLowerCase()
.replace(/\bstreet\b/g, 'st')
.replace(/\bavenue\b/g, 'ave')
.replace(/\broad\b/g, 'rd')
.replace(/\bdrive\b/g, 'dr')
.replace(/[^a-z0-9]/g, '');
};
```
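
The proximity check above calls a `haversineDistance` helper that is not shown. A minimal sketch of such a helper follows, assuming coordinates in decimal degrees and a result in meters; the `LatLng` tuple type and the mean Earth radius constant are choices made for this example.

```typescript
const EARTH_RADIUS_M = 6_371_000; // mean Earth radius in meters

type LatLng = [latitude: number, longitude: number];

// Great-circle distance between two points, in meters, using the haversine formula.
function haversineDistance([lat1, lon1]: LatLng, [lat2, lon2]: LatLng): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  // Clamp to 1 to guard against floating-point overshoot before taking asin.
  return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1, Math.sqrt(a)));
}
```

Comparing every pair of locations this way is quadratic in the number of locations, so for large datasets a first pass that buckets by rounded coordinates (as in `findDuplicates`) can narrow the candidates before the distance check.
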
### Address Validation
**Validation Checks:**
```typescript
interface AddressValidationResult {
isValid: boolean;
issues: string[];
suggestions?: string[];
}
const validateAddress = (address: string): AddressValidationResult => {
const issues: string[] = [];
// Check minimum length
if (address.length < 5) {
issues.push('Address too short');
}
// Check for street number
if (!/^\d+/.test(address)) {
issues.push('Missing street number');
}
// Check for street name
if (!/\d+\s+([A-Za-z]+\s*)+/.test(address)) {
issues.push('Missing street name');
}
// Check for postal code (Canadian format)
if (!/[A-Z]\d[A-Z]\s?\d[A-Z]\d/.test(address)) {
issues.push('Missing or invalid postal code');
}
// Check for unusual characters
if (/[^A-Za-z0-9\s,.-]/.test(address)) {
issues.push('Contains unusual characters');
}
return {
isValid: issues.length === 0,
issues
};
};
```
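
The postal code check above is case-sensitive and accepts any letter. When validating the standalone `postalCode` field, a stricter pattern is possible, because Canadian postal codes never use the letters D, F, I, O, Q, or U, and the first letter is never W or Z. The sketch below is an optional refinement, not part of the documented validator; `normalizePostalCode` is a hypothetical name.

```typescript
// First letter: no D, F, I, O, Q, U, W, Z. Other letters: no D, F, I, O, Q, U.
const POSTAL_CODE_RE =
  /^[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ -]?\d[ABCEGHJ-NPRSTV-Z]\d$/;

// Returns the code in canonical "A1A 1A1" form, or null if it is not a valid pattern.
function normalizePostalCode(raw: string): string | null {
  const candidate = raw.trim().toUpperCase();
  if (!POSTAL_CODE_RE.test(candidate)) return null;
  const compact = candidate.replace(/[ -]/g, '');
  return `${compact.slice(0, 3)} ${compact.slice(3)}`;
}
```

This only checks the format; it does not confirm that the code is actually assigned to an address.
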
## Admin Workflow
### Navigate to Data Quality Dashboard
**Step 1: Access Dashboard**
1. Log in as SUPER_ADMIN or MAP_ADMIN
2. Click **Map** in sidebar
3. Click **Data Quality** submenu
4. Dashboard loads with statistics
**Step 2: Review Overall Statistics**
The dashboard displays four main statistic cards:
```plaintext
┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐
│ Total Locations  │ Geocoded         │ Avg Confidence   │ Low Confidence   │
│ 1,500            │ 1,450 (96.7%)    │ 78.5             │ 50               │
└──────────────────┴──────────────────┴──────────────────┴──────────────────┘
```
**Step 3: Analyze Provider Performance**
Provider breakdown table shows:
| Provider | Count | Success Rate | Avg Confidence |
|----------|-------|--------------|----------------|
| GOOGLE | 800 | 99.2% | 85.3 |
| MAPBOX | 350 | 97.1% | 82.1 |
| NOMINATIM | 200 | 94.5% | 75.8 |
| PHOTON | 100 | 91.0% | 68.2 |
| UNKNOWN | 50 | N/A | 0 |
**Step 4: Review Confidence Distribution**
Bar chart displays confidence distribution:
```plaintext
Confidence Distribution
100 |                       ┌──────┐
 80 |                       │      │
 60 |                ┌──────┤      │
 40 |         ┌──────┤      │      │
 20 |         │      │      │      │
  0 └──┴──────┴──────┴──────┴──────┘
   0-20 21-40  41-60  61-80  81-100
    15    35    150    450    800
```
### Identify and Review Low-Confidence Locations
**Step 1: Filter Low-Confidence Locations**
1. Click **Low Confidence** tab on dashboard
2. Table loads with locations where confidence < 50
3. Sort by confidence (ascending) to prioritize worst
**Step 2: Review Location Details**
Click row to open detail drawer:
```plaintext
┌─────────────────────────────────────────┐
│ Location Details                        │
├─────────────────────────────────────────┤
│ Address: 123 Main St                    │
│ Postal Code: M5H 2N2                    │
│ Coordinates: 43.6532, -79.3832          │
│                                         │
│ Geocoding Info:                         │
│   Confidence: 45 (Low)                  │
│   Provider: NOMINATIM                   │
│   Geocoded: Feb 10, 2025 10:00 AM       │
│                                         │
│ Issues:                                 │
│   • Missing street number in response   │
│   • Approximate match only              │
│                                         │
│ [Re-geocode] [Edit Address] [View Map]  │
└─────────────────────────────────────────┘
```
**Step 3: Take Action**
Options for remediation:
1. **Re-geocode with different provider:**
   - Click **Re-geocode** button
   - Select provider (GOOGLE recommended for low confidence)
   - Click **Geocode Now**
   - New confidence displayed
2. **Edit address:**
   - Click **Edit Address**
   - Correct typos or formatting issues
   - Save changes
   - Auto-triggers re-geocoding
3. **View on map:**
   - Click **View Map**
   - Verify location accuracy visually
   - Drag marker to correct position if needed
### Bulk Re-geocoding
**Step 1: Select Locations**
1. In Low Confidence tab, use table checkboxes to select locations
2. Or click **Select All** to select all visible rows
3. Selected count displays: "50 selected"
**Step 2: Choose Provider**
1. Click **Bulk Re-geocode** button
2. Modal opens with provider selection:
```plaintext
┌─────────────────────────────────────┐
│ Bulk Re-geocode                     │
├─────────────────────────────────────┤
│ Re-geocode 50 locations             │
│                                     │
│ Provider: [GOOGLE ▼]                │
│                                     │
│ Options:                            │
│   ☑ Only if confidence < 50         │
│   ☑ Cache results                   │
│   ☐ Overwrite existing coordinates  │
│                                     │
│ Estimated time: ~2 minutes          │
│                                     │
│ [Cancel] [Start Re-geocoding]       │
└─────────────────────────────────────┘
```
**Step 3: Monitor Progress**
1. Job starts, progress bar appears:
```plaintext
Re-geocoding in progress... 25/50 (50%)
[████████████░░░░░░░░░░░░] 50%
```
2. Real-time updates:
   - Total processed
   - Successful geocodes
   - Failed geocodes
   - Average new confidence
**Step 4: Review Results**
Job completion summary:
```plaintext
┌─────────────────────────────────────┐
│ Bulk Re-geocode Complete            │
├─────────────────────────────────────┤
│ Processed: 50                       │
│ Successful: 47 (94%)                │
│ Failed: 3 (6%)                      │
│                                     │
│ Quality Improvement:                │
│   Avg Confidence Before: 42.5       │
│   Avg Confidence After: 81.3        │
│   Improvement: +38.8                │
│                                     │
│ [View Failed] [Close]               │
└─────────────────────────────────────┘
```
### Handle Duplicates
**Step 1: View Duplicates Tab**
1. Click **Duplicates** tab on dashboard
2. Table groups locations by coordinates
**Step 2: Review Duplicate Groups**
Table displays:
| Coordinates | Count | Addresses | Action |
|-------------|-------|-----------|--------|
| 43.6532, -79.3832 | 3 | 123 Main St, 123 Main Street, 123 Main St Unit 1 | [Review] |
| 43.6540, -79.3825 | 2 | 456 Bay St, 456 Bay Street | [Review] |
**Step 3: Resolve Duplicates**
Click **Review** to open resolution modal:
```plaintext
┌─────────────────────────────────────┐
│ Resolve Duplicates                  │
├─────────────────────────────────────┤
│ 3 locations at 43.6532, -79.3832    │
│                                     │
│ ○ Merge into single location        │
│   Primary: 123 Main St              │
│   Merge units from duplicates       │
│                                     │
│ ○ Keep as separate multi-unit       │
│   Mark as validated multi-unit      │
│                                     │
│ ○ Re-geocode individually           │
│   Try to get unique coordinates     │
│                                     │
│ [Cancel] [Resolve]                  │
└─────────────────────────────────────┘
```
**Resolution Options:**
1. **Merge:** Combine into single Location with multiple Address records
2. **Multi-unit:** Mark as legitimate multi-unit building
3. **Re-geocode:** Attempt to get unique coordinates for each
## Quality Improvement Strategies
### Multi-Provider Geocoding
**Fallback Chain:**
```typescript
// geocoding.service.ts
const PROVIDER_CHAIN: GeocodeProvider[] = [
'GOOGLE', // Primary: Best accuracy, paid
'MAPBOX', // Fallback 1: Good accuracy, paid
'NOMINATIM', // Fallback 2: Free, decent accuracy
'PHOTON', // Fallback 3: Free, lower accuracy
'ARCGIS' // Fallback 4: Free, basic accuracy
];
async geocode(address: string): Promise<GeocodeResult | null> {
for (const provider of PROVIDER_CHAIN) {
try {
const result = await this.geocodeWithProvider(address, provider);
if (result && result.confidence >= 50) {
return result; // Success, confidence acceptable
}
} catch (error) {
logger.warn(`Geocoding failed with ${provider}:`, error);
// Try next provider
}
}
return null; // Every provider failed or returned confidence below 50
}
```
**Benefits:**
- Increases success rate (90% → 96%+)
- Reduces dependency on single provider
- Cost optimization (use free providers as fallback)
- Provider outage resilience
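
The chain shown above returns `null` when no provider reaches the confidence threshold, which discards any lower-confidence coordinates that were found along the way. A possible variation, sketched below, keeps the best result seen so that it can still be stored and flagged for review. This is an alternative design, not the documented behavior; `geocodeBestEffort`, `ScoredResult`, and `ProviderFn` are hypothetical names, and providers are passed in as plain functions so the sketch stays self-contained.

```typescript
interface ScoredResult {
  latitude: number;
  longitude: number;
  confidence: number; // 0-100
  provider: string;
}

// A provider takes an address and resolves to coordinates with a confidence, or null.
type ProviderFn = (address: string) => Promise<Omit<ScoredResult, 'provider'> | null>;

async function geocodeBestEffort(
  address: string,
  providers: Array<[name: string, fn: ProviderFn]>,
  threshold = 50
): Promise<ScoredResult | null> {
  let best: ScoredResult | null = null;
  for (const [name, fn] of providers) {
    try {
      const result = await fn(address);
      if (!result) continue; // provider found nothing, try the next one
      const scored: ScoredResult = { ...result, provider: name };
      if (scored.confidence >= threshold) return scored; // good enough, stop early
      if (!best || scored.confidence > best.confidence) best = scored;
    } catch {
      // Provider error (timeout, quota, outage): fall through to the next provider.
    }
  }
  // Either the best below-threshold result, or null if every provider failed.
  return best;
}
```

Returning the best below-threshold result keeps the location on the map while its low score still surfaces it in the dashboard's low-confidence list.
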
### Address Normalization
**Pre-Geocoding Normalization:**
```typescript
const normalizeAddressForGeocoding = (address: string): string => {
let normalized = address;
// Remove extra whitespace
normalized = normalized.replace(/\s+/g, ' ').trim();
// Standardize abbreviations
const replacements: Record<string, string> = {
'Street': 'St',
'Avenue': 'Ave',
'Road': 'Rd',
'Drive': 'Dr',
'Boulevard': 'Blvd',
'Apartment': 'Apt',
'Unit': 'Unit',
'Suite': 'Ste'
};
Object.entries(replacements).forEach(([long, short]) => {
const regex = new RegExp(`\\b${long}\\b`, 'gi');
normalized = normalized.replace(regex, short);
});
// Ensure postal code spacing (Canadian format)
normalized = normalized.replace(/([A-Z]\d[A-Z])(\d[A-Z]\d)/, '$1 $2');
// Remove periods from abbreviations
normalized = normalized.replace(/\./g, '');
return normalized;
};
```
**Improvements:**
- Reduces geocoding errors by 10-15%
- Increases confidence scores
- Better cache hit rate
### Geocoding Cache
**Redis Cache Implementation:**
```typescript
// geocoding.service.ts
private async geocodeWithCache(address: string): Promise<GeocodeResult | null> {
const cacheKey = `geocode:${normalizeAddress(address)}`;
// Check cache
const cached = await redis.get(cacheKey);
if (cached) {
logger.debug('Geocoding cache hit:', address);
return JSON.parse(cached);
}
// Cache miss, geocode
const result = await this.geocode(address);
if (result) {
// Cache for 30 days
await redis.setex(cacheKey, 2592000, JSON.stringify(result));
}
return result;
}
```
**Benefits:**
- Reduces API costs (90% cache hit rate)
- Faster response times (Redis: <5ms vs API: 200-500ms)
- Consistent results for same address
- Provider API rate limit avoidance
### Manual Verification
**Critical Location Verification:**
Manually verify high-priority locations:
1. **Campaign offices:** Ensure exact coordinates
2. **Shift start points:** Verify accessibility
3. **Event venues:** Confirm entrance location
4. **Polling stations:** Critical for voter info
**Verification Process:**
```typescript
// Mark location as manually verified
await prisma.location.update({
where: { id: locationId },
data: {
geocodeConfidence: 100,
geocodeProvider: 'MANUAL',
geocodedAt: new Date()
}
});
```
### Regular Audits
**Monthly Quality Audit Checklist:**
1. **Run quality report:**
```bash
# Requires SUPER_ADMIN or MAP_ADMIN credentials; add your session's auth header or cookie
curl http://localhost:4000/api/locations/geocode-stats
```
2. **Check metrics against thresholds:**
   - Geocoded % > 95%
   - Avg confidence > 70
   - Low confidence count < 50
   - Duplicates < 20
3. **Review low-confidence locations:**
   - Filter locations with confidence < 50
   - Review top 20 by address
   - Identify patterns (specific streets, providers)
4. **Bulk re-geocode low confidence:**
   - Use GOOGLE provider for accuracy
   - Monitor improvement in avg confidence
5. **Resolve duplicates:**
   - Review all duplicate groups
   - Merge or mark as multi-unit
   - Update addresses as needed
6. **Export quality report:**
```typescript
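import { writeFileSync } from 'node:fs';
const date = new Date().toISOString().slice(0, 10); // e.g. 2025-02-13
const report = await generateQualityReport(); // assumed helper, not shown in this document
writeFileSync(`quality-report-${date}.json`, JSON.stringify(report, null, 2));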
```
## Code Examples
### DataQualityDashboardPage.tsx
```typescript
import React, { useEffect, useState } from 'react';
import { Card, Row, Col, Statistic, Table, Tabs, Button, message } from 'antd';
import { WarningOutlined, CheckCircleOutlined } from '@ant-design/icons';
import { api } from '@/lib/api';
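import 'chart.js/auto'; // Chart.js v3+ needs its scales and elements registered before <Bar> renders
import { Bar } from 'react-chartjs-2';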
interface GeocodeStats {
total: number;
geocoded: number;
geocodedPercent: number;
avgConfidence: number;
providerBreakdown: Record<string, number>;
confidenceDistribution: Record<string, number>;
lowConfidenceCount: number;
missingCoordinates: number;
duplicatesCount: number;
}
const DataQualityDashboardPage: React.FC = () => {
const [stats, setStats] = useState<GeocodeStats | null>(null);
const [lowConfLocations, setLowConfLocations] = useState<any[]>([]);
const [duplicates, setDuplicates] = useState<any[]>([]);
const [loading, setLoading] = useState(false);
useEffect(() => {
fetchStats();
fetchLowConfidenceLocations();
fetchDuplicates();
}, []);
const fetchStats = async () => {
setLoading(true);
try {
const { data } = await api.get<GeocodeStats>('/locations/geocode-stats');
setStats(data);
} catch (error) {
message.error('Failed to load statistics');
} finally {
setLoading(false);
}
};
const fetchLowConfidenceLocations = async () => {
try {
const { data } = await api.get('/locations?geocodeConfidence=lt:50&limit=100');
setLowConfLocations(data.data);
} catch (error) {
message.error('Failed to load low-confidence locations');
}
};
const fetchDuplicates = async () => {
try {
const { data } = await api.get('/locations/duplicates');
setDuplicates(data.duplicates);
} catch (error) {
message.error('Failed to load duplicates');
}
};
const handleRegeocodeLocation = async (locationId: number) => {
try {
await api.post(`/locations/${locationId}/regeocode`, { provider: 'GOOGLE' });
message.success('Location re-geocoded successfully');
fetchStats();
fetchLowConfidenceLocations();
} catch (error) {
message.error('Failed to re-geocode location');
}
};
const confidenceChartData = stats ? {
labels: Object.keys(stats.confidenceDistribution),
datasets: [{
label: 'Locations',
data: Object.values(stats.confidenceDistribution),
backgroundColor: [
'#e74c3c', // 0-20: Red
'#f39c12', // 21-40: Orange
'#f1c40f', // 41-60: Yellow
'#3498db', // 61-80: Blue
'#27ae60' // 81-100: Green
]
}]
} : null;
const lowConfColumns = [
{ title: 'Address', dataIndex: 'address', key: 'address' },
{ title: 'Confidence', dataIndex: 'geocodeConfidence', key: 'confidence', render: (val: number) => (
<span style={{ color: val < 30 ? '#e74c3c' : '#f39c12' }}>{val}</span>
)},
{ title: 'Provider', dataIndex: 'geocodeProvider', key: 'provider' },
{ title: 'Action', key: 'action', render: (_: any, record: any) => (
<Button size="small" onClick={() => handleRegeocodeLocation(record.id)}>
Re-geocode
</Button>
)}
];
return (
<div>
<h1>Data Quality Dashboard</h1>
{/* Statistics Cards */}
<Row gutter={16} style={{ marginBottom: 24 }}>
<Col span={6}>
<Card>
<Statistic
title="Total Locations"
value={stats?.total || 0}
prefix={<CheckCircleOutlined />}
/>
</Card>
</Col>
<Col span={6}>
<Card>
<Statistic
title="Geocoded"
value={stats?.geocoded || 0}
suffix={`(${stats?.geocodedPercent.toFixed(1) || 0}%)`}
valueStyle={{ color: (stats?.geocodedPercent || 0) > 95 ? '#27ae60' : '#f39c12' }}
/>
</Card>
</Col>
<Col span={6}>
<Card>
<Statistic
title="Avg Confidence"
value={stats?.avgConfidence.toFixed(1) || 0}
valueStyle={{ color: (stats?.avgConfidence || 0) > 70 ? '#27ae60' : '#f39c12' }}
/>
</Card>
</Col>
<Col span={6}>
<Card>
<Statistic
title="Low Confidence"
value={stats?.lowConfidenceCount || 0}
prefix={<WarningOutlined />}
valueStyle={{ color: (stats?.lowConfidenceCount || 0) > 50 ? '#e74c3c' : '#f39c12' }}
/>
</Card>
</Col>
</Row>
{/* Charts and Tables */}
<Tabs
items={[
{
key: 'overview',
label: 'Overview',
children: (
<div>
<Card title="Confidence Distribution" style={{ marginBottom: 24 }}>
{confidenceChartData && <Bar data={confidenceChartData} />}
</Card>
<Card title="Provider Performance">
<Table
dataSource={stats ? Object.entries(stats.providerBreakdown).map(([provider, count]) => ({
provider,
count
})) : []}
columns={[
{ title: 'Provider', dataIndex: 'provider', key: 'provider' },
{ title: 'Count', dataIndex: 'count', key: 'count' }
]}
pagination={false}
/>
</Card>
</div>
)
},
{
key: 'low-confidence',
label: `Low Confidence (${lowConfLocations.length})`,
children: (
<Table
dataSource={lowConfLocations}
columns={lowConfColumns}
rowKey="id"
loading={loading}
/>
)
},
{
key: 'duplicates',
label: `Duplicates (${duplicates.length})`,
children: (
<Table
dataSource={duplicates}
columns={[
{ title: 'Coordinates', key: 'coords', render: (_, record: any) =>
`${record.coordinates.latitude.toFixed(6)}, ${record.coordinates.longitude.toFixed(6)}`
},
{ title: 'Count', dataIndex: 'count', key: 'count' },
{ title: 'Addresses', key: 'addresses', render: (_, record: any) =>
record.locations.map((l: any) => l.address).join(', ')
}
]}
rowKey={(record) => `${record.coordinates.latitude}-${record.coordinates.longitude}`}
/>
)
}
]}
/>
</div>
);
};
export default DataQualityDashboardPage;
```
### Geocode Statistics Service
```typescript
// locations.service.ts
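import { prisma } from '@/config/database';
import type { GeocodeProvider } from '@prisma/client';
// Assumed export name and relative path; see api/src/modules/map/geocoding/geocoding.service.ts
import { geocodingService } from '../geocoding/geocoding.service';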
export class LocationsService {
async getGeocodeStats() {
const locations = await prisma.location.findMany({
select: {
id: true,
latitude: true,
longitude: true,
geocodeConfidence: true,
geocodeProvider: true
}
});
const total = locations.length;
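// Compare against null explicitly so a legitimate 0 coordinate still counts as geocoded
const geocoded = locations.filter(l => l.latitude != null && l.longitude != null).length;
// Locations without a confidence value contribute 0 to the average
const sumConfidence = locations.reduce((sum, l) => sum + (l.geocodeConfidence || 0), 0);
const avgConfidence = total > 0 ? sumConfidence / total : 0;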
// Provider breakdown
const providerBreakdown: Record<string, number> = {};
locations.forEach(l => {
const provider = l.geocodeProvider || 'UNKNOWN';
providerBreakdown[provider] = (providerBreakdown[provider] || 0) + 1;
});
// Confidence distribution
const confidenceDistribution = {
'0-20': 0,
'21-40': 0,
'41-60': 0,
'61-80': 0,
'81-100': 0
};
locations.forEach(l => {
const conf = l.geocodeConfidence || 0;
if (conf <= 20) confidenceDistribution['0-20']++;
else if (conf <= 40) confidenceDistribution['21-40']++;
else if (conf <= 60) confidenceDistribution['41-60']++;
else if (conf <= 80) confidenceDistribution['61-80']++;
else confidenceDistribution['81-100']++;
});
// A missing confidence is treated as 0, so ungeocoded locations count as low confidence
const lowConfidenceCount = locations.filter(l => (l.geocodeConfidence || 0) < 50).length;
const duplicatesCount = await this.countDuplicates();
return {
total,
geocoded,
geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
avgConfidence,
providerBreakdown,
confidenceDistribution,
lowConfidenceCount,
missingCoordinates: total - geocoded,
duplicatesCount
};
}
async countDuplicates(): Promise<number> {
const locations = await prisma.location.findMany({
where: {
AND: [
{ latitude: { not: null } },
{ longitude: { not: null } }
]
},
select: { latitude: true, longitude: true }
});
const coordMap = new Map<string, number>();
locations.forEach(l => {
const key = `${l.latitude!.toFixed(6)},${l.longitude!.toFixed(6)}`;
coordMap.set(key, (coordMap.get(key) || 0) + 1);
});
// Count every location that shares its rounded coordinates with at least one other location
return Array.from(coordMap.values())
  .filter(count => count > 1)
  .reduce((sum, count) => sum + count, 0);
}
async regeocode(locationId: number, provider?: GeocodeProvider) {
const location = await prisma.location.findUnique({
where: { id: locationId }
});
if (!location) {
throw new Error('Location not found');
}
const result = await geocodingService.geocode(location.address, provider);
if (!result) {
throw new Error('Geocoding failed');
}
return await prisma.location.update({
where: { id: locationId },
data: {
latitude: result.latitude,
longitude: result.longitude,
geocodeConfidence: result.confidence,
geocodeProvider: result.provider,
geocodedAt: new Date()
}
});
}
}
```
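`countDuplicates()` only returns a total, while the Duplicates tab renders groups with `coordinates`, `count`, and `locations`. The following is a minimal sketch of how such groups could be built from the same coordinate key. The `LocationRow` and `DuplicateGroup` types and the `groupDuplicates` helper are assumptions for illustration, not the service's actual implementation.

```typescript
// Hypothetical row shape; field names mirror the Location fields used above.
interface LocationRow {
  id: number;
  address: string;
  latitude: number | null;
  longitude: number | null;
}

// Shape consumed by the Duplicates tab (coordinates, count, locations).
interface DuplicateGroup {
  coordinates: { latitude: number; longitude: number };
  count: number;
  locations: LocationRow[];
}

// Group geocoded rows by coordinates rounded to `decimals` places and
// keep only the groups that contain more than one location.
function groupDuplicates(rows: LocationRow[], decimals = 6): DuplicateGroup[] {
  const byKey = new Map<string, DuplicateGroup>();
  for (const row of rows) {
    if (row.latitude == null || row.longitude == null) continue; // skip ungeocoded rows
    const key = `${row.latitude.toFixed(decimals)},${row.longitude.toFixed(decimals)}`;
    const group = byKey.get(key);
    if (group) {
      group.count += 1;
      group.locations.push(row);
    } else {
      byKey.set(key, {
        coordinates: { latitude: row.latitude, longitude: row.longitude },
        count: 1,
        locations: [row],
      });
    }
  }
  const duplicates: DuplicateGroup[] = [];
  byKey.forEach(group => {
    if (group.count > 1) duplicates.push(group);
  });
  return duplicates;
}

// Sample data: two rows share coordinates, one is unique, one is ungeocoded.
const sampleRows: LocationRow[] = [
  { id: 1, address: '10 King St W', latitude: 43.6489, longitude: -79.3785 },
  { id: 2, address: '10 King St W Unit 2', latitude: 43.6489, longitude: -79.3785 },
  { id: 3, address: '200 Bay St', latitude: 43.6466, longitude: -79.3796 },
  { id: 4, address: 'Unknown Rd', latitude: null, longitude: null },
];
const duplicateGroups = groupDuplicates(sampleRows);
```

Keeping the grouping as a pure function makes it testable without a database. For large tables the same grouping is probably better done in SQL.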
## Troubleshooting
### Problem: Many low-confidence locations
**Symptoms:**
- More than 100 locations with confidence below 50
- Average confidence below 60
- Prometheus alert firing
**Solutions:**
1. **Check provider API keys:**
```bash
# Test Google Geocoding API
curl "https://maps.googleapis.com/maps/api/geocode/json?address=123+Main+St+Toronto&key=YOUR_KEY"
# Verify the key is set in .env
grep GEOCODE_GOOGLE_API_KEY .env
```
2. **Try a different primary provider:**
```env
# In .env, change primary provider
GEOCODE_PRIMARY_PROVIDER=GOOGLE # Most accurate
# Or try:
GEOCODE_PRIMARY_PROVIDER=MAPBOX # Good alternative
```
3. **Verify address format:**
```typescript
// Bad: Missing city/postal
"123 Main St"
// Good: Full address
"123 Main St, Toronto ON M5H 2N2"
```
4. **Use postal code for better accuracy:**
```typescript
// Append postal code if available
const fullAddress = location.postalCode
? `${location.address}, ${location.postalCode}`
: location.address;
```
5. **Bulk re-geocode with Google:**
```bash
# Via API
curl -X POST http://localhost:4000/api/locations/bulk-geocode \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"provider":"GOOGLE","confidenceThreshold":50}'
```
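Solutions 3 and 4 can be combined into one helper that builds the string sent to the geocoder. This is a sketch only: `buildGeocodeQuery` and `AddressParts` are hypothetical names, and the pattern assumes Canadian postal codes such as `M5H 2N2`. It appends the postal code only when the address does not already contain one, so re-running it never duplicates the code.

```typescript
// Loose Canadian postal code pattern (letter-digit-letter, optional
// space or hyphen, digit-letter-digit), e.g. "M5H 2N2" or "m5h2n2".
const POSTAL_CODE_PATTERN = /[A-Z]\d[A-Z][ -]?\d[A-Z]\d/i;

// Hypothetical input shape for illustration.
interface AddressParts {
  address: string;
  postalCode?: string | null;
}

function buildGeocodeQuery(loc: AddressParts): string {
  // Trim and collapse repeated whitespace in the street address.
  const address = loc.address.trim().replace(/\s+/g, ' ');
  const postal = loc.postalCode ? loc.postalCode.trim().toUpperCase() : '';
  // Nothing to append, or the address already carries a postal code.
  if (!postal || POSTAL_CODE_PATTERN.test(address)) return address;
  return `${address}, ${postal}`;
}

const withPostal = buildGeocodeQuery({ address: '123 Main St, Toronto ON', postalCode: 'm5h 2n2' });
const alreadyComplete = buildGeocodeQuery({ address: '123 Main St, Toronto ON M5H 2N2', postalCode: 'M5H 2N2' });
```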
### Problem: Duplicate locations detected
**Symptoms:**
- Multiple locations at same coordinates
- Duplicates tab shows many groups
- Inflated location counts in cuts
**Solutions:**
1. **Check if legitimately multi-unit:**
```sql
-- Find buildings with multiple addresses
SELECT l.id, l.address, COUNT(a.id) as unit_count
FROM "Location" l
JOIN "Address" a ON a."locationId" = l.id
GROUP BY l.id
HAVING COUNT(a.id) > 1;
```
2. **Verify geocoding precision:**
```typescript
// Check whether two records differ only below the rounding precision
type Coords = { latitude: number; longitude: number };
const isDuplicateRounding = (loc1: Coords, loc2: Coords): boolean => {
  // Use 4 decimal places (~11m precision) instead of 6 (~0.1m)
  return loc1.latitude.toFixed(4) === loc2.latitude.toFixed(4) &&
    loc1.longitude.toFixed(4) === loc2.longitude.toFixed(4);
};
```
3. **Review NAR import process:**
```typescript
// Ensure LOC_GUID unique constraint
const location = await prisma.location.upsert({
where: { locGuid: narRecord.LOC_GUID },
update: { /* update fields */ },
create: { /* create fields */ }
});
```
4. **Merge duplicates:**
```typescript
// Merge duplicates into the primary location.
// Both steps run in one transaction so a failure cannot leave orphaned addresses.
const mergeDuplicates = async (primaryId: number, duplicateIds: number[]) => {
  await prisma.$transaction([
    // Move addresses to the primary location
    prisma.address.updateMany({
      where: { locationId: { in: duplicateIds } },
      data: { locationId: primaryId }
    }),
    // Delete the duplicates
    prisma.location.deleteMany({
      where: { id: { in: duplicateIds } }
    })
  ]);
};
```
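`mergeDuplicates` needs a `primaryId` and a list of `duplicateIds`, but the steps above do not say how to choose the primary. One possible rule, sketched below, keeps the record with the highest geocode confidence and breaks ties by lowest `id`. The `planMerge` helper and the `DuplicateCandidate` type are hypothetical, and the rule itself is an assumption; confirm a group is not a legitimate multi-unit building (solution 1) before merging it.

```typescript
// Hypothetical minimal shape of a location inside a duplicate group.
interface DuplicateCandidate {
  id: number;
  geocodeConfidence: number | null;
}

interface MergePlan {
  primaryId: number;
  duplicateIds: number[];
}

// Pick the record to keep and the records to fold into it.
// Returns null when the group has nothing to merge.
function planMerge(group: DuplicateCandidate[]): MergePlan | null {
  if (group.length < 2) return null;
  // Highest confidence first; ties broken by lowest id so the result is deterministic.
  const sorted = group.slice().sort(
    (a, b) => (b.geocodeConfidence ?? 0) - (a.geocodeConfidence ?? 0) || a.id - b.id
  );
  return {
    primaryId: sorted[0].id,
    duplicateIds: sorted.slice(1).map(candidate => candidate.id),
  };
}

const mergePlan = planMerge([
  { id: 3, geocodeConfidence: 40 },
  { id: 1, geocodeConfidence: 90 },
  { id: 2, geocodeConfidence: 90 },
]);
```

The resulting plan can be passed straight to `mergeDuplicates(plan.primaryId, plan.duplicateIds)`.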
### Problem: Geocoding stats slow to load
**Symptoms:**
- `GET /api/locations/geocode-stats` takes more than 5 seconds
- Dashboard timeout errors
- High database CPU
**Solutions:**
1. **Add database indexes:**
```sql
-- Quote camelCase columns; unquoted identifiers are folded to lowercase
CREATE INDEX CONCURRENTLY idx_locations_geocode_confidence
ON "Location"("geocodeConfidence");
CREATE INDEX CONCURRENTLY idx_locations_geocode_provider
ON "Location"("geocodeProvider");
CREATE INDEX CONCURRENTLY idx_locations_coords
ON "Location"(latitude, longitude)
WHERE latitude IS NOT NULL AND longitude IS NOT NULL;
```
2. **Cache stats in Redis:**
```typescript
// Cache for 5 minutes
const getCachedStats = async () => {
const cached = await redis.get('geocode:stats');
if (cached) return JSON.parse(cached);
const stats = await locationsService.getGeocodeStats();
await redis.setex('geocode:stats', 300, JSON.stringify(stats));
return stats;
};
```
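3. **Aggregate in SQL instead of in application code:**
```typescript
// Raw SQL for better performance; returns one row per provider.
// Counts are cast to int because Prisma returns COUNT(*) as BigInt,
// which JSON.stringify cannot serialize.
const stats = await prisma.$queryRaw`
  SELECT
    "geocodeProvider",
    COUNT(*)::int AS total,
    COUNT(*) FILTER (WHERE latitude IS NOT NULL AND longitude IS NOT NULL)::int AS geocoded,
    AVG(COALESCE("geocodeConfidence", 0))::float AS avg_confidence,
    COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50)::int AS low_confidence
  FROM "Location"
  GROUP BY "geocodeProvider"
`;
```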
4. **Materialize stats view:**
```sql
-- Create materialized view
CREATE MATERIALIZED VIEW geocode_stats_mv AS
SELECT
  COUNT(*) AS total,
  COUNT(*) FILTER (WHERE latitude IS NOT NULL AND longitude IS NOT NULL) AS geocoded,
  AVG(COALESCE("geocodeConfidence", 0)) AS avg_confidence,
  COUNT(*) FILTER (WHERE COALESCE("geocodeConfidence", 0) < 50) AS low_confidence
FROM "Location";
-- Run this on a schedule (for example hourly via cron or pg_cron)
REFRESH MATERIALIZED VIEW geocode_stats_mv;
```
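The grouped query in solution 3 returns one row per provider, so the dashboard totals still have to be combined in application code. A sketch of that step follows, assuming each row has already been converted to plain numbers and uses the column aliases from the query (`total`, `geocoded`, `avg_confidence`, `low_confidence`). `ProviderRow` and `combineProviderRows` are hypothetical names. The overall average must be weighted by each provider's row count; a plain mean of the per-provider averages would be wrong.

```typescript
// Hypothetical row shape matching the column aliases of the grouped query.
interface ProviderRow {
  geocodeProvider: string | null;
  total: number;
  geocoded: number;
  avg_confidence: number;
  low_confidence: number;
}

// Fold per-provider rows into the overall stats object the dashboard reads.
function combineProviderRows(rows: ProviderRow[]) {
  let total = 0;
  let geocoded = 0;
  let lowConfidenceCount = 0;
  let weightedConfidence = 0;
  const providerBreakdown: Record<string, number> = {};

  for (const row of rows) {
    total += row.total;
    geocoded += row.geocoded;
    lowConfidenceCount += row.low_confidence;
    // Weight each provider's average by its number of locations.
    weightedConfidence += row.avg_confidence * row.total;
    const provider = row.geocodeProvider || 'UNKNOWN';
    providerBreakdown[provider] = (providerBreakdown[provider] || 0) + row.total;
  }

  return {
    total,
    geocoded,
    geocodedPercent: total > 0 ? (geocoded / total) * 100 : 0,
    avgConfidence: total > 0 ? weightedConfidence / total : 0,
    providerBreakdown,
    lowConfidenceCount,
    missingCoordinates: total - geocoded,
  };
}

const combinedStats = combineProviderRows([
  { geocodeProvider: 'GOOGLE', total: 100, geocoded: 100, avg_confidence: 90, low_confidence: 5 },
  { geocodeProvider: 'NOMINATIM', total: 50, geocoded: 40, avg_confidence: 60, low_confidence: 20 },
]);
```

This does not produce `confidenceDistribution` or `duplicatesCount`; those would need their own queries.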
## Performance Considerations
### Database Query Optimization
**Indexes:**
- `geocodeConfidence` (filtering)
- `geocodeProvider` (grouping)
- `(latitude, longitude)` composite, partial on non-null coordinates (duplicate detection)
**Query Performance:**
- geocode-stats: ~500ms (1500 locations)
- Low confidence filter: ~100ms (with index)
- Duplicate detection: ~200ms (coordinate grouping)
- Bulk re-geocode: ~2-5 min (150 locations, depends on provider)
### API Rate Limits
**Provider Limits:**
- Google: 50 QPS, $5/1000 requests
- Mapbox: 100,000/month free, then $0.50/1000
- Nominatim: 1 request/second maximum (public instance), bulk geocoding discouraged
- Photon: No official limit, self-hosted recommended
- ArcGIS: 100,000/month free
**Optimization:**
- Use Redis cache (30-day TTL)
- Batch geocoding jobs (avoid rate limits)
- Fall back to free providers for non-critical lookups
- Monitor usage via provider dashboards
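To plan a batch job around these limits, it helps to estimate the minimum run time before starting. The sketch below uses two hypothetical helpers, `chunk` and `minDurationSeconds`, that are not part of the codebase. For example, 150 locations against a 1 request/second provider need at least 150 seconds, which is in line with the 2-5 minute bulk re-geocode figure above.

```typescript
// Split a list of work items into batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Lower bound on wall-clock seconds for `requests` calls at `qps` requests
// per second. Real jobs take longer because of latency and retries.
function minDurationSeconds(requests: number, qps: number): number {
  if (qps <= 0) throw new RangeError('qps must be positive');
  return Math.ceil(requests / qps);
}

const locationIds = [101, 102, 103, 104, 105];
const batches = chunk(locationIds, 2);
const slowProviderSeconds = minDurationSeconds(150, 1);
const fastProviderSeconds = minDurationSeconds(150, 50);
```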
### Caching Strategy
**Cache Layers:**
1. **Application Cache (Redis):**
```typescript
// 30-day TTL for geocode results
const cacheKey = `geocode:${normalizeAddress(address)}`;
await redis.setex(cacheKey, 2592000, JSON.stringify(result));
```
2. **Statistics Cache:**
```typescript
// 5-minute TTL for stats
await redis.setex('geocode:stats', 300, JSON.stringify(stats));
```
3. **Provider Response Cache:**
```typescript
// 7-day TTL for raw provider responses, keyed by provider and normalized address
await redis.setex(`provider:${provider}:${normalizeAddress(address)}`, 604800, JSON.stringify(rawResponse));
```
**Cache Hit Rates:**
- Geocoding: 90%+ (repeated addresses)
- Statistics: 95%+ (frequent dashboard views)
- Provider responses: 85%+ (re-geocoding attempts)
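The hit rates above depend on how consistently addresses map to cache keys. The cache key example calls `normalizeAddress`, which is not shown in this document; the following is one possible sketch, not the project's actual implementation. It lowercases the input, treats a few common punctuation characters as separators, and collapses whitespace so that trivially different spellings share a key.

```typescript
// Hypothetical normalizer: the real implementation may differ.
function normalizeAddress(address: string): string {
  return address
    .toLowerCase()
    .replace(/[.,#]/g, ' ') // treat common punctuation as separators
    .replace(/\s+/g, ' ')   // collapse runs of whitespace
    .trim();
}

// Build the cache key in the same format as the geocode result cache.
const geocodeCacheKey = (address: string): string =>
  `geocode:${normalizeAddress(address)}`;

const keyA = geocodeCacheKey('  123 Main St.,  Toronto ON ');
const keyB = geocodeCacheKey('123 MAIN ST, TORONTO ON');
```

Hyphens are left untouched here because they can carry meaning in unit-number prefixes such as `12-345 Main St`.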
## Related Documentation
### Backend Documentation
- **Locations Service:** `api/src/modules/map/locations/locations.service.ts`
- Geocode stats aggregation
- Duplicate detection
- Re-geocoding operations
- **Geocoding Service:** `api/src/modules/map/geocoding/geocoding.service.ts`
- Multi-provider fallback
- Confidence calculation
- Cache integration
- **Bulk Geocoding:** `api/src/modules/map/locations/bulk-geocode.routes.ts`
- Job queue integration
- Progress tracking
- Error handling
### Frontend Documentation
- **Data Quality Dashboard:** `admin/src/pages/DataQualityDashboardPage.tsx`
- Statistics display
- Charts and tables
- Bulk actions
- **Locations Page:** `admin/src/pages/LocationsPage.tsx`
- CSV import/export
- Inline geocoding
- Address editing
### Database Documentation
- **Location Model:** `api/prisma/schema.prisma`
- Geocoding metadata fields
- Indexes for performance
- Relations to Address
### Monitoring Documentation
- **Prometheus Metrics:** `api/src/utils/metrics.ts`
- Custom geocoding metrics
- Quality gauges
- Alert integration
- **Grafana Dashboard:** `configs/grafana/dashboards/data-quality.json`
- Quality trend charts
- Provider comparison
- Alert visualization
### External Resources
- **Google Geocoding API:** https://developers.google.com/maps/documentation/geocoding
- **Mapbox Geocoding API:** https://docs.mapbox.com/api/search/geocoding
- **Nominatim API:** https://nominatim.org/release-docs/latest/api/Search
- **Photon API:** https://photon.komoot.io