A modular pipeline for analyzing public comments on federal regulations using AI-powered theme and entity discovery.
This tool processes public comments to:
- **Load** comments from the regulations.gov API or CSV bulk downloads.
- **Condense** verbose comments into structured summaries.
- **Discover Themes** by building a hierarchical taxonomy of topics.
- **Extract Theme-Specific Content** from each comment for precise analysis.
- **Summarize Themes** by synthesizing all relevant comment extracts into a narrative.
- **Discover Entities** by identifying organizations, programs, and concepts.
- **Build a Website** to explore the results interactively.
```bash
# Install Bun (if not already installed)
curl -fsSL https://bun.sh/install | bash

# Install dependencies
bun install
```

This section provides a high-level overview of the main commands. For a detailed explanation of how each step works, see the "Pipeline Deep Dive" section below.
All commands follow the pattern:
```bash
bun run cli <command> <document-id> [options]
```

Load comments from a CSV file (recommended) or the regulations.gov API.
From CSV:
```bash
bun run cli load CMS-2025-0050-0031.csv --limit 500
```

From API:
```bash
bun run cli load CMS-2025-0050-0031 --limit 100
```

Generate condensed, structured versions of each comment using an AI model.
```bash
bun run cli condense CMS-2025-0050-0031 --limit 100
```

Analyze comments to build a hierarchical taxonomy of the themes and topics discussed.
```bash
bun run cli discover-themes CMS-2025-0050-0031 --filter-duplicates
```

- Use `--filter-duplicates` to remove form letters and improve theme quality.
For each comment, extract the specific sentences and arguments that are relevant to each theme in the taxonomy. This is a critical step for high-quality summaries.
```bash
bun run cli extract-theme-content CMS-2025-0050-0031
```

Synthesize all the theme-specific extracts into a comprehensive narrative analysis for each theme.
```bash
bun run cli summarize-themes-v2 CMS-2025-0050-0031 --filter-duplicates
```

Extract named entities (organizations, programs, etc.) from comments and build a taxonomy.
```bash
bun run cli discover-entities-v2 CMS-2025-0050-0031
```

Generate all necessary JSON files for the interactive web dashboard.
```bash
bun run cli build-website CMS-2025-0050-0031
```

Clean and optimize the SQLite database to reduce file size.
```bash
bun run cli vacuum-db CMS-2025-0050-0031
```

Execute the entire 8-step pipeline in sequence with automatic crash recovery.
```bash
# Run the complete pipeline with a CSV file
bun run cli pipeline CMS-2025-0050-0031.csv

# Start from a specific step (e.g., step 4 = extract-theme-content)
bun run cli pipeline CMS-2025-0050-0031.csv --start-at 4

# Filter duplicates and set a similarity threshold
bun run cli pipeline CMS-2025-0050-0031.csv --filter-duplicates --similarity-threshold 0.75
```
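As a rough illustration of what the `pipeline` command does, the orchestration can be sketched as a loop over the eight step names with a `--start-at` offset. The step names below come from this README; the `runStep` dispatcher is a hypothetical stand-in for the real command implementations in `src/commands/`:

```typescript
// Hypothetical pipeline orchestration with --start-at resume support.
// Step names are from this README; dispatch details are assumptions.
const steps = [
  "load",
  "condense",
  "discover-themes",
  "extract-theme-content",
  "summarize-themes-v2",
  "discover-entities-v2",
  "build-website",
  "vacuum-db",
] as const;

async function runStep(step: string, documentId: string): Promise<void> {
  // Placeholder: the real CLI dispatches into src/commands/.
}

async function runPipeline(documentId: string, startAt = 1) {
  for (let i = startAt - 1; i < steps.length; i++) {
    console.log(`Step ${i + 1}/${steps.length}: ${steps[i]}`);
    await runStep(steps[i], documentId);
  }
}
```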
Generate the `dist/index.html` landing page that lists all available regulation dashboards.

```bash
bun run cli generate-landing-page
```

Inspect and manage the LLM prompt/response cache stored in the database.
```bash
# View cache statistics
bun run cli cache stats CMS-2025-0050-0031

# Clear the entire cache
bun run cli cache clear CMS-2025-0050-0031 --all
```

## Pipeline Deep Dive

This section explains the how and why behind each step of the analysis pipeline.
### Step 1: Load Comments (`load`)

- Purpose: To get the raw comment data into a local SQLite database.
- How it Works: It can either fetch comments one-by-one from the regulations.gov API or, more efficiently, parse a bulk-downloaded CSV file. It also downloads any attachments associated with the comments, with built-in support for extracting text from PDF and DOCX files.
- The AI's Role: None. This step is purely data ingestion.
- Database Impact:
  - `comments`: Populated with raw comment data.
  - `attachments`: Populated with PDF data.
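As a minimal sketch of this ingestion step, assuming a simple `comments(id, text)` table (the real schema almost certainly has more columns), loading a CSV with `bun:sqlite` might look like:

```typescript
// Hypothetical sketch: load a bulk-download CSV into SQLite with bun:sqlite.
// Table and column names are assumptions, not the project's actual schema.
import { Database } from "bun:sqlite";

const db = new Database("dbs/CMS-2025-0050-0031.sqlite");
db.run(`CREATE TABLE IF NOT EXISTS comments (id TEXT PRIMARY KEY, text TEXT)`);

const csv = await Bun.file("CMS-2025-0050-0031.csv").text();
const rows = csv.split("\n").slice(1); // skip header; a real parser must handle quoting

const insert = db.query("INSERT OR IGNORE INTO comments (id, text) VALUES (?, ?)");
for (const row of rows) {
  const [id, ...rest] = row.split(",");
  if (id) insert.run(id, rest.join(","));
}
```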
### Step 2: Condense Comments (`condense`)

- Purpose: To transform long, unstructured comments (especially PDFs) into a clean, consistent, and structured format for easier analysis.
- How it Works: It takes the full text of each comment (including text extracted from PDFs) and sends it to an AI model.
- The AI's Role:
  - Prompt Goal: To act as an expert analyst, reading a comment and summarizing it into a structured format with predefined sections.
  - Key Prompt Snippet: From `src/prompts/condense.ts`: `You MUST organize every comment into these exact sections with these exact headers: ### DETAILED CONTENT, ### ONE-LINE SUMMARY, ### COMMENTER PROFILE, ### CORE POSITION...`
- Conceptual Example:
  - Before: A 10-page, unstructured PDF attachment.
  - After: A clean summary with sections like `### CORE POSITION` and `### KEY RECOMMENDATIONS`.
- Database Impact:
  - Reads from: `comments`, `attachments`.
  - Writes to: `condensed_comments` (stores the structured summary).
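A minimal sketch of the condense call, assuming an OpenAI-compatible client (the project's actual model wrapper lives in `src/lib` and may differ):

```typescript
// Minimal sketch of the condense call, assuming an OpenAI-compatible API.
import OpenAI from "openai";

// Excerpt of the instructions; the full prompt lives in src/prompts/condense.ts.
const CONDENSE_PROMPT =
  "You MUST organize every comment into these exact sections with these exact headers: " +
  "### DETAILED CONTENT, ### ONE-LINE SUMMARY, ### COMMENTER PROFILE, ### CORE POSITION...";

const client = new OpenAI();

async function condenseComment(fullText: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder; not necessarily the model this project uses
    messages: [
      { role: "system", content: CONDENSE_PROMPT },
      { role: "user", content: fullText },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```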
### Step 3: Discover Themes (`discover-themes`)

- Purpose: To understand the main topics of discussion by creating a hierarchical taxonomy of themes from the ground up.
- How it Works: It takes thousands of condensed comments, batches them together into large contexts, and asks the AI to identify the recurring themes and organize them.
- The AI's Role:
  - Prompt Goal: To read a large volume of comments and act as a research synthesizer, identifying and organizing the key themes into a MECE (Mutually Exclusive, Collectively Exhaustive) hierarchy.
  - Key Prompt Snippet: From `src/prompts/theme-discovery.ts`: `[Number]. [Label]. [Brief Description] || [Detailed Guidelines]`.
- Conceptual Example:
  - Before: Thousands of comments mentioning "paperwork," "forms," and "reporting."
  - After: A structured theme is created: `1.1. Administrative Burden`, with detailed guidelines explaining what is and isn't included in this theme.
- Handling Large Datasets (Batching & Merging): If the total word count of comments exceeds a threshold (e.g., >250k words), this command automatically splits the comments into smaller batches and generates a separate theme taxonomy for each batch. It then recursively merges these taxonomies using a special `THEME_MERGE_PROMPT` until a single, unified hierarchy remains. This allows the pipeline to process millions of words, far beyond the context window of a single AI call (see the sketch after this list).
- Database Impact:
  - Reads from: `condensed_comments`.
  - Writes to: `theme_hierarchy` (stores the final taxonomy).
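The batch-and-merge behavior can be sketched as follows; the two stub functions and the exact word budget are assumptions standing in for the real AI calls:

```typescript
// Hypothetical sketch of batch-then-merge theme discovery. The stubs
// stand in for real AI calls (the merge uses THEME_MERGE_PROMPT).
const WORD_LIMIT = 250_000; // per-batch word budget, per the threshold above

async function generateTaxonomy(batch: string[]): Promise<string> {
  return `taxonomy drafted from ${batch.length} comments`; // AI call stub
}

async function mergeTaxonomies(a: string, b: string): Promise<string> {
  return `merged(${a} + ${b})`; // THEME_MERGE_PROMPT call stub
}

function batchByWordCount(comments: string[], limit: number): string[][] {
  const batches: string[][] = [[]];
  let words = 0;
  for (const c of comments) {
    const n = c.split(/\s+/).length;
    if (words + n > limit && batches[batches.length - 1].length > 0) {
      batches.push([]); // start a new batch once the budget is exhausted
      words = 0;
    }
    batches[batches.length - 1].push(c);
    words += n;
  }
  return batches;
}

async function discoverThemes(comments: string[]): Promise<string> {
  let taxonomies = await Promise.all(
    batchByWordCount(comments, WORD_LIMIT).map(generateTaxonomy),
  );
  // Merge pairwise, round after round, until one unified hierarchy remains.
  while (taxonomies.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < taxonomies.length; i += 2) {
      next.push(
        i + 1 < taxonomies.length
          ? await mergeTaxonomies(taxonomies[i], taxonomies[i + 1])
          : taxonomies[i],
      );
    }
    taxonomies = next;
  }
  return taxonomies[0];
}
```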
### Step 4: Extract Theme Content (`extract-theme-content`)

- Purpose: To find the "semantic signal" by precisely identifying which parts of a comment discuss which themes. This is the foundation for high-quality theme summaries.
- How it Works: This step iterates through each comment. For every individual comment, it makes a single AI call, providing the full comment text along with the entire theme taxonomy. This is a highly efficient "one-to-many" operation.
- The AI's Role:
  - Prompt Goal: To read a single comment and, for every theme in the taxonomy, extract the exact text (positions, concerns, recommendations) that pertains to that theme.
  - Key Prompt Snippet: From `src/prompts/theme-extract.ts`: `For each theme, extract EXACTLY what this commenter says about it... If they don't address a theme, mark it as not addressed.`
- Conceptual Example:
  - Input: A full comment and the theme "1.1 Small Practice Impact".
  - Output: A JSON object for that comment and theme containing just the relevant sentence: `{"positions": ["As a solo practitioner, this rule would force me to hire a full-time administrator I cannot afford."]}`.
- Database Impact:
  - Reads from: `condensed_comments`, `theme_hierarchy`.
  - Writes to: `comment_theme_extracts` (stores the JSON snippets).
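Based on the prompt's wording (positions, concerns, recommendations), the extract records might be typed as follows; the exact field names beyond `positions` and the keying by theme code are assumptions:

```typescript
// Hypothetical shape of a per-comment, per-theme extract. Only "positions"
// appears in the example above; "concerns" and "recommendations" are inferred
// from the prompt's wording and may differ from the real schema.
interface ThemeExtract {
  positions?: string[];
  concerns?: string[];
  recommendations?: string[];
}

// One AI call per comment returns extracts for every theme it addresses,
// keyed here by theme code (e.g., "1.1"); unaddressed themes are absent.
type CommentThemeExtracts = Record<string, ThemeExtract>;

const example: CommentThemeExtracts = {
  "1.1": {
    positions: [
      "As a solo practitioner, this rule would force me to hire a full-time administrator I cannot afford.",
    ],
  },
};
```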
### Step 5: Summarize Themes (`summarize-themes-v2`)

- Purpose: To create a rich, narrative summary for each theme, explaining the consensus points, debates, and key perspectives.
- How it Works: For a given theme (e.g., "1.1 Small Practice Impact"), it gathers all the specific extracts created in the previous step. It then sends this collection of targeted content to the AI to be synthesized.
- The AI's Role:
  - Prompt Goal: To act as a policy analyst, reading all the pre-filtered, on-topic extracts and synthesizing them into a structured report with sections for consensus, debate, stakeholder views, and more.
  - Key Prompt Snippet: From `src/prompts/theme-extract.ts`: `You are a policy analyst synthesizing public input... Your analysis will inform decision-makers... Required Analysis Sections: ### CONSENSUS POINTS, ### AREAS OF DEBATE...`
- Conceptual Example:
  - Before: Dozens of individual JSON extracts about the impact on small practices.
  - After: A final, structured summary for theme 1.1 that begins: `### EXECUTIVE SUMMARY: Small practices universally oppose the rule, citing cost concerns as the primary driver...`
- Handling Large Datasets (Batching & Merging): If a single theme has a very large volume of extracted text, this step will automatically batch the extracts, generate a summary for each batch, and then use a final `EXTRACT_MERGE_PROMPT` to combine the partial summaries into one comprehensive analysis (see the sketch after this list).
- Database Impact:
  - Reads from: `comment_theme_extracts`, `theme_hierarchy`.
  - Writes to: `theme_summaries` (stores the final narrative analysis).
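A sketch of that fallback, with stubs standing in for the real AI calls and an assumed per-call word budget:

```typescript
// Hypothetical sketch of per-theme summarization with the batch fallback.
const BATCH_WORD_LIMIT = 100_000; // assumed per-call budget

async function summarizeBatch(extracts: string[]): Promise<string> {
  return `partial summary of ${extracts.length} extracts`; // AI call stub
}

async function mergePartials(partials: string[]): Promise<string> {
  return `merged(${partials.length} partials)`; // EXTRACT_MERGE_PROMPT stub
}

async function summarizeTheme(extracts: string[]): Promise<string> {
  // Group extracts into batches that fit within the word budget.
  const batches: string[][] = [[]];
  let words = 0;
  for (const e of extracts) {
    const n = e.split(/\s+/).length;
    if (words + n > BATCH_WORD_LIMIT && batches[batches.length - 1].length > 0) {
      batches.push([]);
      words = 0;
    }
    batches[batches.length - 1].push(e);
    words += n;
  }
  if (batches.length === 1) return summarizeBatch(batches[0]); // fits in one call
  const partials = await Promise.all(batches.map(summarizeBatch));
  return mergePartials(partials); // combine partial summaries into one analysis
}
```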
### Step 6: Discover Entities (`discover-entities-v2`)

- Purpose: To identify and categorize all the specific organizations, regulations, programs, and technical terms mentioned in the comments.
- How it Works: Similar to theme discovery, it sends a large batch of comments to the AI and asks it to generate a taxonomy of named entities. It then scans all comments to annotate which comments mention which entities.
- The AI's Role:
  - Prompt Goal: To identify named entities and group them into logical categories (e.g., "Government Agencies," "Medical Conditions").
  - Key Prompt Snippet: From `src/prompts/entity-discovery.ts`: `Return a JSON object with this structure: { "category_name": [ { "label": "Entity Name", "definition": "...", "terms": ["exact term 1", "variant spellings"] } ] }`
- Database Impact:
  - Reads from: `condensed_comments`.
  - Writes to: `entity_taxonomy` (the categories and definitions) and `comment_entities` (the mapping between comments and entities).
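The prompt's JSON structure translates directly into types, and the annotation pass can be sketched as a simple term scan (the real matcher may be more sophisticated):

```typescript
// Types mirroring the JSON structure requested by the entity-discovery prompt.
interface Entity {
  label: string;      // e.g., "Entity Name"
  definition: string;
  terms: string[];    // exact terms and variant spellings to match
}
type EntityTaxonomy = Record<string, Entity[]>; // keyed by category name

// Hypothetical annotation pass: record which entities a comment mentions
// by scanning for each entity's terms.
function findEntities(commentText: string, taxonomy: EntityTaxonomy): string[] {
  const mentioned: string[] = [];
  const haystack = commentText.toLowerCase();
  for (const entities of Object.values(taxonomy)) {
    for (const entity of entities) {
      if (entity.terms.some((t) => haystack.includes(t.toLowerCase()))) {
        mentioned.push(entity.label);
      }
    }
  }
  return mentioned;
}
```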
### Step 7: Build Website (`build-website`)

- Purpose: To export all the processed data from the database into static JSON files that the web dashboard can consume.
- How it Works: This is a pure export script. It queries the various database tables and writes the results into a series of JSON files.
- The AI's Role: None.
- Database Impact: Reads from all major tables (`comments`, `themes`, `entities`, etc.). Writes to the `dist/data` directory.
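A minimal sketch of what such an export looks like with `bun:sqlite` (the table and file names here are illustrative, not the dashboard's actual data contract):

```typescript
// Hypothetical export step: query a table, write JSON into dist/data.
import { Database } from "bun:sqlite";
import { mkdir } from "node:fs/promises";

const db = new Database("dbs/CMS-2025-0050-0031.sqlite", { readonly: true });
await mkdir("dist/data", { recursive: true });

// Dump one table as a static JSON file the dashboard can fetch.
const themes = db.query("SELECT * FROM theme_hierarchy").all();
await Bun.write("dist/data/themes.json", JSON.stringify(themes, null, 2));
```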
### Step 8: Vacuum Database (`vacuum-db`)

- Purpose: To optimize the SQLite database file, reclaiming unused space and improving performance.
- How it Works: It runs the `VACUUM` command on the SQLite database, which is good practice after a large number of writes.
- The AI's Role: None.
- Database Impact: Rewrites the database file, potentially reducing its size.
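Conceptually, the whole step amounts to a single statement against the document's database:

```typescript
import { Database } from "bun:sqlite";

const db = new Database("dbs/CMS-2025-0050-0031.sqlite");
db.run("VACUUM"); // rebuilds the file, reclaiming space freed by deleted rows
```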
Project layout:

```
src/
├── commands/   # CLI command implementations
├── lib/        # Shared utilities (database, AI, comment processing)
├── prompts/    # AI prompt templates
├── types/      # TypeScript type definitions
└── cli.ts      # Main entry point
dbs/            # SQLite databases (one per document)
debug/          # Debug outputs when --debug flag is used
dist/           # Output for built website files
```
Each document gets its own SQLite database in `dbs/<document-id>.sqlite` containing:
- `comments`: Raw comment data from regulations.gov.
- `attachments`: PDF and other attachments with extracted text content.
- `condensed_comments`: AI-generated structured summaries of comments.
- `theme_hierarchy`: The hierarchical taxonomy of themes.
- `comment_theme_extracts`: (New) Stores theme-specific text extracted from each comment.
- `theme_summaries`: (New) Stores the final AI-generated narrative analysis for each theme.
- `entity_taxonomy`: The taxonomy of discovered entities (organizations, etc.).
- `comment_entities`: Maps which comments mention which entities.
- `llm_cache`: (New) Caches AI prompts and responses to avoid re-running expensive calls.
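As an illustration of how `llm_cache` might be consulted (the column names below are assumptions about the real table layout), each call can be keyed by a hash of the prompt so identical prompts are never re-run:

```typescript
// Hypothetical llm_cache lookup: key responses by a SHA-256 hash of the prompt.
import { Database } from "bun:sqlite";

const db = new Database("dbs/CMS-2025-0050-0031.sqlite");

async function cachedCall(
  prompt: string,
  call: (p: string) => Promise<string>,
): Promise<string> {
  const key = new Bun.CryptoHasher("sha256").update(prompt).digest("hex");
  const hit = db
    .query("SELECT response FROM llm_cache WHERE key = ?")
    .get(key) as { response: string } | null;
  if (hit) return hit.response; // cache hit: skip the expensive AI call

  const response = await call(prompt);
  db.query("INSERT OR REPLACE INTO llm_cache (key, response) VALUES (?, ?)")
    .run(key, response);
  return response;
}
```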
Build separate dashboard instances for each regulation database:

```bash
./scripts/build-all-dashboards.sh
```

This will:
- Find all SQLite databases in `dbs/`.
- Generate data files for each regulation using the `build-website` command.
- Output to `dist/<regulation-id>/` directories.
- Create an index page at `dist/index.html` listing all dashboards.
The web dashboard provides:
- Interactive Theme Explorer: Browse hierarchical theme structure with comment counts.
- Theme Summaries: Read detailed narrative analyses of key themes.
- Entity Browser: Explore discovered entities by category.
- Comment Search: Full-text search across all comments.
- Copy for LLM: Export data in LLM-friendly formats.
Build a dashboard for a specific regulation:
```bash
./scripts/build-single-dashboard.sh CMS-2025-0050-0031
```

To preview the built site locally:

```bash
cd dist
python -m http.server 8000
# Then visit http://localhost:8000
```

Type-check the codebase:

```bash
bun run typecheck
```

Remove all generated databases and debug output:

```bash
bun run clean # Removes all dbs/* and debug/*
```

License: MIT