initial commit: ocm_autopull docs

admin 2025-06-10 12:57:19 -06:00
commit a90ddb0f02
11 changed files with 1585 additions and 0 deletions

83
README.md Normal file

@ -0,0 +1,83 @@
# ocm_autopull
The following are instructions for exporting all OCM data (City of Edmonton contact form data) into a Google Sheet using Google Apps Script.
Using internal resources, this process builds a single spreadsheet with all available OCM emails from a user's account.
## How To
### Step 1 - Create Label
In Gmail, create a new label called `ocm_autopull`.
Search your email and label all OCM emails. For example:
To: `councillor's email`
Has the words: This message is intended for Councillor `councillor name` and their staff.
Select all those emails and label them with the new label you just created.
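The same search can also be typed directly into the Gmail search box. The snippet below is a minimal sketch of how that query string is put together; `buildOcmSearchQuery` is a hypothetical helper, and the address and name are placeholders to replace with your councillor's details.

```javascript
// Builds a Gmail search-box query equivalent to the "To" and "Has the words"
// fields shown above. The address and name passed in are placeholders.
function buildOcmSearchQuery(councillorEmail, councillorName) {
  const phrase =
    'This message is intended for Councillor ' + councillorName + ' and their staff.';
  // Quoting the phrase asks Gmail to match it as an exact phrase.
  return 'to:' + councillorEmail + ' "' + phrase + '"';
}

console.log(buildOcmSearchQuery('councillor@example.com', 'Jane Doe'));
```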
### Step 2 - Create Apps Script Project
In your web browser go to
```
https://script.google.com/home/
```
Then create a new project called `ocm_autopull`.
### Step 3 - Add Script to ocm_autopull project
In your new project, delete the default function and replace it with the code from here:
#### [code](code.md)
![alt text](image-1.png)
Remember to click save after you have copied your code over.
#### Configure
You can configure the script using the configuration settings at the top. If you already have a Gmail label for OCM emails, update the label name in the configuration section of the script:
![alt text](image-7.png)
### Step 4 - Test
After saving, you will have new functions available. The first one to run is `testSmallSample`. Select the function, then click run:
![alt text](image-2.png)
This will test a few emails from your `ocm_autopull` label and provide a readout.
![alt text](image-3.png)
Review the sheet it produces and, if it looks good, continue to the next step.
### Step 5 - Run Script
If the test passes, you are ready to run the script. Select the function called `downloadEmailsByLabel` and click run:
![alt text](image-4.png)
**Now wait for the script to run.** It will take a few minutes, on average about a minute per hundred emails.
The script runs in 5-minute batches so that it stays within Google's execution-time limits. Some batches will be shorter; this is normal.
When you click `run`, the script outputs the location of the sheet. You can observe this sheet, and its logs, to see data being added in real time.
![alt text](image-5.png)
### Failure Case
On occasion, a Google Apps Script run can fail. If this happens, you should be able to click `run` again and the script will continue processing emails as normal.
![alt text](image-6.png)
#### Total Breakage
If the script fails entirely, delete the output spreadsheet and try running it again.

1150
code.md Normal file

File diff suppressed because it is too large

352
code_function.md Normal file

@ -0,0 +1,352 @@
# Gmail to Google Sheets Export Script - Function Documentation
## Overview
This Google Apps Script automatically exports OCM (Office of City Manager) contact form submissions from Gmail to Google Sheets. The script processes emails with a specific label, parses contact form data, and creates a structured spreadsheet with real-time output and checkpoint/resume functionality.
## Configuration Constants
### Email Processing
- `LABEL_NAME`: Gmail label to process ("ocm_autopull")
- `BATCH_SIZE`: Number of emails to write before updating checkpoint (10)
- `MAX_EXECUTION_TIME`: Maximum script runtime before checkpoint (5 minutes)
### Spreadsheet Setup
- `SPREADSHEET_NAME_PREFIX`: Prefix for generated spreadsheet names
- `MAIN_SHEET_NAME`: Name of the main data sheet
- `ERROR_SHEET_NAME`: Name of the error logging sheet
- `LOG_SHEET_NAME`: Name of the processing log sheet
### Data Validation
- `REQUIRED_OCM_PHRASES`: Keywords that must be present in OCM emails
- `REPLY_INDICATORS`: Subject line patterns that indicate replies/forwards
- `SKIP_SENDER_DOMAINS`: Email domains to skip (e.g., "edmonton.ca")
- `EXCLUDE_COUNTRIES`: Countries to exclude from full address construction
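As a rough illustration, the constants above could be declared as follows. Only the label name, batch size, 5-minute limit, `edmonton.ca`, and `Canada` come from this document; every other value (and the choice of milliseconds as the time unit) is an assumed placeholder, not the contents of code.md.

```javascript
// Illustrative configuration block. Values marked "assumed" are placeholders.
const LABEL_NAME = 'ocm_autopull';            // from this document
const BATCH_SIZE = 10;                        // emails written between checkpoint updates
const MAX_EXECUTION_TIME = 5 * 60 * 1000;     // 5 minutes, in milliseconds (unit assumed)

const SPREADSHEET_NAME_PREFIX = 'OCM Export'; // assumed
const MAIN_SHEET_NAME = 'OCM Data';           // assumed
const ERROR_SHEET_NAME = 'Errors';            // assumed
const LOG_SHEET_NAME = 'Log';                 // assumed

// Assumed: taken from the search phrase used when labelling emails.
const REQUIRED_OCM_PHRASES = ['This message is intended for Councillor'];
const REPLY_INDICATORS = ['Re:', 'Fwd:', 'FW:']; // assumed
const SKIP_SENDER_DOMAINS = ['edmonton.ca'];     // example from this document
const EXCLUDE_COUNTRIES = ['Canada'];            // example from this document
```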
---
## Main Functions
### `downloadEmailsByLabel()`
**Primary export function that orchestrates the entire process**
**Purpose**: Main entry point for exporting emails by label to Google Sheets with real-time output and checkpoint/resume capability.
**Process**:
1. Checks for existing checkpoint to determine if resuming or starting fresh
2. Creates or connects to spreadsheet and initializes sheets
3. Scans Gmail for all threads with the specified label
4. Processes each thread's root email sequentially
5. Saves checkpoints periodically and on timeout
6. Formats final spreadsheet and sends completion notification
**Returns**: Spreadsheet URL
**Key Features**:
- Real-time data output (writes immediately to sheets)
- Automatic checkpoint saving every 10 emails
- Time-based checkpointing (5-minute limit)
- Automatic resume scheduling on timeout
- Progress logging and error handling
- Email notification on completion
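The combination of count-based and time-based checkpointing can be sketched as a loop like the one below. This is not the code from code.md: `runWithTimeBudget` and its options are hypothetical names, and the clock, per-item work, and checkpoint writer are passed in so the sketch can run outside Apps Script.

```javascript
// Processes items from startIndex until finished or the time budget is spent.
// processItem, saveCheckpoint and now are injected (all hypothetical names).
function runWithTimeBudget(items, startIndex, options) {
  const { maxMillis, batchSize, processItem, saveCheckpoint, now } = options;
  const startedAt = now();
  let index = startIndex;

  while (index < items.length) {
    // Time-based checkpoint: stop before the platform stops the script.
    if (now() - startedAt >= maxMillis) {
      saveCheckpoint(index);
      return { done: false, nextIndex: index };
    }

    processItem(items[index]);
    index += 1;

    // Count-based checkpoint: record progress every batchSize items.
    if ((index - startIndex) % batchSize === 0) {
      saveCheckpoint(index);
    }
  }
  return { done: true, nextIndex: index };
}
```

When the result has `done: false`, a later run can call the function again with `nextIndex` as the new `startIndex`, which mirrors the resume behaviour described above.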
---
## Sheet Management Functions
### `initializeSheets(spreadsheet, isResume)`
**Creates or connects to spreadsheet sheets based on operation type**
**Parameters**:
- `spreadsheet`: The Google Spreadsheet object
- `isResume`: Boolean indicating if this is a resume operation
**Fresh Start Mode** (`isResume = false`):
- Creates new sheets with unique timestamped names
- Clears any existing content
- Sets up headers and formatting
- Freezes header rows
**Resume Mode** (`isResume = true`):
- Connects to existing sheets without clearing data
- Preserves all previously processed data
- Creates missing sheets as fallback
- Logs number of existing data rows
**Creates Three Sheets**:
1. **Main Data Sheet**: OCM contact form data
2. **Error Sheet**: Processing errors and exceptions
3. **Log Sheet**: Detailed processing activity log
### `getUniqueSheetNames()`
**Generates unique sheet names with timestamps to avoid conflicts**
**Returns**: Object containing unique names for main, error, and log sheets
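One simple way to produce such names is to append a timestamp to a base name. The sketch below is an assumption about the approach, not the actual implementation; `uniqueSheetNamesSketch` and the base names are hypothetical, and the timestamp is taken in UTC.

```javascript
// Returns sheet names suffixed with a UTC timestamp such as 20250610_125719.
// Base names are placeholders. Names are unique to one-second resolution only.
function uniqueSheetNamesSketch(now) {
  const iso = (now || new Date()).toISOString(); // e.g. 2025-06-10T12:57:19.000Z
  const datePart = iso.slice(0, 10).replace(/-/g, '');
  const timePart = iso.slice(11, 19).replace(/:/g, '');
  const stamp = datePart + '_' + timePart;
  return {
    main: 'OCM Data ' + stamp,
    error: 'Errors ' + stamp,
    log: 'Log ' + stamp,
  };
}
```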
### `writeRowToSheet(sheet, rowData)`
**Writes a single row of data immediately to the specified sheet**
**Parameters**:
- `sheet`: Target Google Sheet object
- `rowData`: Array of values to write
**Purpose**: Enables real-time output by writing each processed email immediately rather than batching.
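A minimal sketch of an immediate row write, assuming the sheet object exposes Apps Script's `appendRow`. The helper name `appendRowNow` is hypothetical, and the `SpreadsheetApp.flush()` call is guarded so the snippet also runs outside Apps Script.

```javascript
// Appends one row and asks Apps Script to apply pending changes right away.
function appendRowNow(sheet, rowData) {
  sheet.appendRow(rowData);
  // flush() applies pending spreadsheet changes; skipped when not in Apps Script.
  if (typeof SpreadsheetApp !== 'undefined') {
    SpreadsheetApp.flush();
  }
}
```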
### `formatSheet(sheet, totalRows)`
**Applies formatting to improve spreadsheet readability**
**Formatting Applied**:
- Auto-resizes all columns
- Sets specific column widths for OCM data fields
- Applies alternating row banding (light grey)
- Formats date columns with consistent date/time format
- Optimizes layout for OCM contact form structure
---
## Data Processing Functions
### `getAllThreadsForLabel(label)`
**Retrieves ALL email threads for a Gmail label, handling pagination**
**Parameters**:
- `label`: Gmail Label object
**Process**:
- Handles Gmail's 500-thread pagination limit
- Retrieves threads in batches with progress logging
- Includes brief pauses between batches to avoid rate limits
**Returns**: Array of all Gmail Thread objects
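The pagination described above can be sketched with `GmailLabel.getThreads(start, max)`, which returns at most one page of threads per call. The function name `fetchAllThreads` and the `pauseMs` parameter are hypothetical; the pause uses `Utilities.sleep` only when running inside Apps Script.

```javascript
// Collects every thread for a label by requesting pages until a short page appears.
function fetchAllThreads(label, pageSize, pauseMs) {
  const size = pageSize || 500;
  const all = [];
  let start = 0;

  while (true) {
    const page = label.getThreads(start, size);
    all.push(...page);
    if (page.length < size) break; // a short (or empty) page means nothing is left
    start += size;
    // Brief pause between pages; skipped when not running in Apps Script.
    if (pauseMs && typeof Utilities !== 'undefined') {
      Utilities.sleep(pauseMs);
    }
  }
  return all;
}
```

In Apps Script, the `label` argument would come from `GmailApp.getUserLabelByName('ocm_autopull')`.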
### `parseOCMContactForm(message, thread, threadNumber)`
**Parses OCM contact form data from email body using regex patterns**
**Parameters**:
- `message`: Gmail Message object
- `thread`: Gmail Thread object
- `threadNumber`: Sequential thread number for logging
**Extracted Fields**:
- Confirmation Number
- Submission Date
- Personal Info: First Name, Last Name, Email, Phone, Fax
- Organization details
- Address: Street, City, Province, Country, Postal Code
- Subject and Comments
- Gmail metadata: Date, From, Subject, Thread ID, Message ID
**Data Cleaning**:
- Removes email quote markers and formatting artifacts
- Normalizes whitespace and removes colons
- Cleans email addresses from mailto links
- Handles empty organization and fax fields
- Strips footer text from comments
**Filtering**:
- Skips obvious replies/forwards based on subject indicators
- Excludes emails from specified domains (e.g., edmonton.ca)
- Processes only root emails from each thread
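The subject and sender filters might look roughly like this. The helper names and the default indicator list are assumptions for illustration; the real lists live in `REPLY_INDICATORS` and `SKIP_SENDER_DOMAINS`.

```javascript
// True when the subject starts with a reply/forward marker (case-insensitive).
function isReplyOrForward(subject, indicators) {
  const s = String(subject || '').trim().toLowerCase();
  return indicators.some((marker) => s.startsWith(marker.toLowerCase()));
}

// True when the sender's domain matches, or is a subdomain of, a skipped domain.
function isSkippedSender(from, domains) {
  const match = String(from || '').match(/<([^>]+)>/); // handles "Name <a@b.c>"
  const address = (match ? match[1] : String(from || '')).trim().toLowerCase();
  const domain = address.slice(address.lastIndexOf('@') + 1);
  return domains.some((d) => domain === d || domain.endsWith('.' + d));
}

// Combines both checks; default lists are assumed example values.
function shouldSkipEmail(subject, from, indicators, domains) {
  return (
    isReplyOrForward(subject, indicators || ['Re:', 'Fwd:', 'FW:']) ||
    isSkippedSender(from, domains || ['edmonton.ca'])
  );
}
```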
**Error Handling**:
- Returns error row with available metadata if parsing fails
- Logs detailed error information for troubleshooting
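To illustrate the regex-based extraction and cleaning, here is a small sketch that assumes each field appears in the email body as a `Label: value` line, possibly prefixed by quote markers. That layout, the helper names, and the subset of fields shown are all assumptions; the real patterns are in code.md.

```javascript
// Extracts the value following "Label:" on its own line, ignoring ">" quote markers.
function extractField(body, label) {
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp('^[>\\s]*' + escaped + '[ \\t]*:[ \\t]*(.*)$', 'mi');
  const match = body.match(pattern);
  if (!match) return '';
  return match[1]
    .replace(/<?mailto:([^>\s]+)>?/i, '$1') // keep only the address from a mailto link
    .replace(/\s+/g, ' ')                   // normalize whitespace
    .trim();
}

// Parses a few of the documented fields into an object (empty string when absent).
function parseContactFormSketch(body) {
  return {
    confirmationNumber: extractField(body, 'Confirmation Number'),
    firstName: extractField(body, 'First Name'),
    lastName: extractField(body, 'Last Name'),
    email: extractField(body, 'Email'),
    fax: extractField(body, 'Fax'),
    postalCode: extractField(body, 'Postal Code'),
  };
}
```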
### `constructFullAddress(street, streetCont, city, province, country, postal)`
**Builds formatted full address from individual components**
**Address Construction Rules**:
- Validates and cleans each component
- Excludes specified countries (e.g., "Canada")
- Removes field prefixes that leaked through parsing
- Joins components with commas
- Handles empty or duplicate fields gracefully
**Returns**: Formatted address string
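The rules above can be sketched as follows. This is an illustrative reconstruction, not the code from code.md: `buildFullAddress`, the extra `excludeCountries` parameter, and the exact prefix pattern are assumptions.

```javascript
// Joins address parts with commas, skipping empty values, duplicates,
// and excluded countries. Leaked "City:"-style prefixes are stripped first.
function buildFullAddress(street, streetCont, city, province, country, postal, excludeCountries) {
  const excluded = (excludeCountries || ['Canada']).map((c) => c.toLowerCase());
  const prefix = /^(street address(\s*\(cont\.\))?|city|province|country|postal code)\s*:\s*/i;
  const clean = (v) =>
    String(v || '').replace(/\s+/g, ' ').trim().replace(prefix, '').trim();

  const countryClean = clean(country);
  const parts = [
    clean(street),
    clean(streetCont),
    clean(city),
    clean(province),
    excluded.includes(countryClean.toLowerCase()) ? '' : countryClean,
    clean(postal),
  ];

  const seen = new Set();
  return parts
    .filter((part) => {
      if (!part) return false;
      const key = part.toLowerCase();
      if (seen.has(key)) return false; // drop repeated components
      seen.add(key);
      return true;
    })
    .join(', ');
}
```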
---
## Logging and Error Handling
### `logMessage(action, threadNumber, message, details)`
**Logs processing activities to the log sheet and console**
**Parameters**:
- `action`: Type of action (START, PROCESS, SUCCESS, ERROR, etc.)
- `threadNumber`: Current thread number
- `message`: Primary log message
- `details`: Additional details (optional)
**Log Entry Format**:
- Timestamp
- Thread Number
- Action Type
- Message with Details
- Status (SUCCESS/ERROR)
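A log row in the format above could be assembled like this. The helper name `buildLogRow`, the ` | ` separator, and the rule that only the `ERROR` action produces an `ERROR` status are assumptions.

```javascript
// Builds one log row: [timestamp, thread number, action, message with details, status].
function buildLogRow(action, threadNumber, message, details, now) {
  const text = details ? message + ' | ' + details : message;
  const status = action === 'ERROR' ? 'ERROR' : 'SUCCESS'; // assumed rule
  const timestamp = (now || new Date()).toISOString();
  return [timestamp, threadNumber, action, text, status];
}
```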
### `logError(threadNumber, errorType, errorDetails, emailDate, emailFrom)`
**Logs errors to the dedicated error sheet**
**Error Entry Format**:
- Thread Number
- Error Type
- Detailed Error Description
- Email Date and Sender
- Error Timestamp
**Common Error Types**:
- Parsing Error: Failed to extract form data
- Thread Processing Error: General thread handling failure
- Sheet Writing Error: Failed to write to spreadsheet
---
## Checkpoint and Resume System
### `saveCheckpoint(checkpointData)`
**Saves current progress to Google Apps Script Properties**
**Checkpoint Data Includes**:
- Spreadsheet ID
- Label name
- Total thread count
- Number of threads processed
- Number of emails found
- Start timestamp
### `getCheckpoint()`
**Retrieves saved checkpoint data**
**Returns**: Checkpoint object or null if no checkpoint exists
### `clearCheckpoint()`
**Removes checkpoint data when export completes**
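The three checkpoint functions can be sketched on top of a Properties store, where the checkpoint is kept as a JSON string under a single key. The key name and helper names are hypothetical; the store is passed in so the sketch can be exercised outside Apps Script.

```javascript
const CHECKPOINT_KEY = 'OCM_EXPORT_CHECKPOINT'; // hypothetical key name

// Serializes the checkpoint object and stores it under one property.
function saveCheckpointTo(props, checkpointData) {
  props.setProperty(CHECKPOINT_KEY, JSON.stringify(checkpointData));
}

// Returns the stored checkpoint object, or null when none exists.
function getCheckpointFrom(props) {
  const raw = props.getProperty(CHECKPOINT_KEY);
  return raw ? JSON.parse(raw) : null;
}

// Removes the stored checkpoint.
function clearCheckpointIn(props) {
  props.deleteProperty(CHECKPOINT_KEY);
}
```

In Apps Script, `props` would typically be `PropertiesService.getScriptProperties()`.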
### `scheduleResume()`
**Creates time-based trigger to automatically resume processing**
**Process**:
- Deletes any existing resume triggers
- Creates new trigger to run `resumeExport()` after specified delay
- Default delay: 1 minute
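A sketch of this trigger handling, using the `ScriptApp` trigger calls (`getProjectTriggers`, `deleteTrigger`, `newTrigger(...).timeBased().after(...).create()`). The function name `scheduleResumeSketch` and the injected `scriptApp` parameter are hypothetical; in Apps Script you would pass `ScriptApp` itself.

```javascript
// Removes existing resumeExport triggers, then schedules a new one-off trigger.
function scheduleResumeSketch(scriptApp, delayMs) {
  for (const trigger of scriptApp.getProjectTriggers()) {
    if (trigger.getHandlerFunction() === 'resumeExport') {
      scriptApp.deleteTrigger(trigger); // avoid stacking duplicate resume triggers
    }
  }
  return scriptApp
    .newTrigger('resumeExport')
    .timeBased()
    .after(delayMs || 60 * 1000) // default delay: 1 minute
    .create();
}
```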
### `resumeExport()`
**Triggered function that resumes export from checkpoint**
**Process**:
- Cleans up the trigger that called it
- Calls `downloadEmailsByLabel()` to continue processing
---
## Testing and Utility Functions
### `testLabelCount()`
**Comprehensive readiness test - RUN THIS FIRST**
**Test Sequence**:
1. **Label Existence**: Verifies the specified Gmail label exists
2. **Thread Count**: Gets total number of threads with the label
3. **Message Analysis**: Analyzes sample threads for size and structure
4. **Parsing Test**: Tests email parsing on sample messages
**Output**:
- Detailed test results and statistics
- Available Gmail labels if target label not found
- Thread size analysis (min/max/average messages per thread)
- Parsing success rate on sample data
### `testSmallSample()`
**Processes 3 sample emails with real-time output - RUN THIS SECOND**
**Purpose**:
- Validates parsing logic with real data
- Tests spreadsheet creation and formatting
- Verifies real-time output functionality
- Creates test spreadsheet for review
**Process**:
- Creates test spreadsheet with all sheets
- Processes up to 3 labeled emails
- Applies full formatting
- Returns test spreadsheet URL
### `checkCheckpointStatus()`
**Displays current checkpoint information**
**Shows**:
- Spreadsheet ID and URL
- Processing progress (threads/emails)
- Start time and label name
- Resume instructions
### `clearCheckpointAndTriggers()`
**Emergency function to clear stuck checkpoints and triggers**
**Use Cases**:
- Stuck or corrupted checkpoint data
- Multiple resume triggers created
- Starting completely fresh export
### `getCurrentSpreadsheetUrl()`
**Gets URL of currently active export spreadsheet**
**Returns**: Spreadsheet URL or null if no active export
---
## Data Structure
### Main Sheet Columns
1. **Thread Number**: Sequential processing number
2. **Confirmation Number**: OCM form confirmation ID
3. **Submission Date**: Date form was submitted
4. **First Name**: Submitter's first name
5. **Last Name**: Submitter's last name
6. **Email Address**: Contact email
7. **Phone Number**: Contact phone
8. **Fax**: Fax number (often empty)
9. **Organization**: Organization name (often empty)
10. **Street Address**: Primary street address
11. **Street Address (cont.)**: Address continuation
12. **City**: City name
13. **Province**: Province/state
14. **Country**: Country name
15. **Postal Code**: Postal/ZIP code
16. **Full Address**: Constructed complete address
17. **Subject**: Form subject line
18. **Comments**: Form comments/message
19. **Gmail Date**: Email received date
20. **Gmail From**: Email sender
21. **Gmail Subject**: Email subject line
22. **Thread ID**: Gmail thread identifier
23. **Email ID**: Gmail message identifier
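For reference, the column order above can be expressed as a header array, with a small helper that turns a keyed record into a row in that order. The names `MAIN_SHEET_HEADERS` and `recordToRow` are hypothetical and not taken from code.md.

```javascript
// Header row for the main sheet, in the documented column order.
const MAIN_SHEET_HEADERS = [
  'Thread Number', 'Confirmation Number', 'Submission Date',
  'First Name', 'Last Name', 'Email Address', 'Phone Number', 'Fax',
  'Organization', 'Street Address', 'Street Address (cont.)',
  'City', 'Province', 'Country', 'Postal Code', 'Full Address',
  'Subject', 'Comments',
  'Gmail Date', 'Gmail From', 'Gmail Subject', 'Thread ID', 'Email ID',
];

// Maps an object keyed by header name to an array in column order.
// Missing fields become empty strings.
function recordToRow(record) {
  return MAIN_SHEET_HEADERS.map((header) =>
    record[header] !== undefined ? record[header] : ''
  );
}
```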
---
## Usage Instructions
### Initial Setup
1. Update `LABEL_NAME` constant with your Gmail label
2. Run `testLabelCount()` to verify setup
3. Run `testSmallSample()` to test parsing
4. Review test results before full export
### Full Export
1. Run `downloadEmailsByLabel()` for complete export
2. Monitor console logs for progress
3. Export automatically resumes if interrupted
4. Check email for completion notification
### Troubleshooting
- Use `checkCheckpointStatus()` to view progress
- Use `clearCheckpointAndTriggers()` to reset if stuck
- Check error and log sheets for detailed information
- Verify Gmail label exists and contains expected emails
### Performance Notes
- Processes ~10-20 emails per minute depending on content
- Checkpoints every 10 emails and whenever the 5-minute execution limit is reached
- Resumes automatically with 1-minute delay
- Real-time output shows immediate results
- Handles large exports (1000+ emails) through checkpointing
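As a rough worked estimate based on the figures above, the number of runs and total wall-clock time for an export can be approximated as below. `estimateExport` is a hypothetical helper; it ignores the time each run spends scanning the label and assumes a constant processing rate.

```javascript
// Estimates run count and total minutes for an export.
// Assumes a constant rate and ignores per-run startup overhead.
function estimateExport(emailCount, emailsPerMinute, batchMinutes, resumeDelayMinutes) {
  const processingMinutes = emailCount / emailsPerMinute;
  const runs = Math.max(1, Math.ceil(processingMinutes / batchMinutes));
  const totalMinutes = processingMinutes + (runs - 1) * resumeDelayMinutes;
  return { processingMinutes, runs, totalMinutes };
}

// 1000 emails at 10 per minute: 100 minutes of processing, 20 runs, 119 minutes total.
console.log(estimateExport(1000, 10, 5, 1));
// 1000 emails at 20 per minute: 50 minutes of processing, 10 runs, 59 minutes total.
console.log(estimateExport(1000, 20, 5, 1));
```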

image-1.png Normal file (binary, 42 KiB)
image-2.png Normal file (binary, 68 KiB)
image-3.png Normal file (binary, 53 KiB)
image-4.png Normal file (binary, 8.5 KiB)
image-5.png Normal file (binary, 37 KiB)
image-6.png Normal file (binary, 39 KiB)
image-7.png Normal file (binary, 74 KiB)
image.png Normal file (binary, 29 KiB)