Commerce · TypeScript · AWS · Terraform

Public Record Data Scraper

Directed architecture for high-reliability data extraction


Figure: concept sketch — algorithmic visualization representing the underlying logic of the Public Record Data Scraper (dynamically generated).

The Challenge of 50-State Scale

Public record data in the United States is structurally fragmented. Each state maintains its own Uniform Commercial Code (UCC) filing system, often with proprietary interfaces, inconsistent data schemas, and aggressive anti-scraping measures. The Merchant Cash Advance (MCA) industry — a multi-billion-dollar alternative lending sector serving small and mid-size businesses — depends on these filings as the single most reliable signal that a business has active financing and may need additional capital. Yet a broker trying to prospect across all fifty states faces compounding costs: each Secretary of State portal requires separate navigation, often with CAPTCHAs, session limits, or bulk-download restrictions. A human researcher might cover two to three states per day.[1]

I directed the architecture of a platform that replaces this manual fragmentation with an automated, AI-driven pipeline: sixty-plus autonomous agents extracting UCC filings continuously across all fifty states, enriching them with multi-source intelligence, and delivering scored, outreach-ready leads through a full-featured React application with integrated CRM, compliance tooling, and communications infrastructure.

graph TD
  subgraph L1["Layer 1: Data Collection"]
    CA[CA SOS Agent] --> ORC[Agent Orchestrator]
    NY[NY SOS Agent] --> ORC
    TX[TX SOS Agent] --> ORC
    DOT1["... 47 more"] --> ORC
    API_A[API Agent] --> ORC
    WEB_A[Web Portal Agent] --> ORC
    DB_A[Database Agent] --> ORC
    FILE_A[File Upload Agent] --> ORC
    HOOK_A[Webhook Agent] --> ORC
  end
  subgraph L2["Layer 2: Enrichment & Scoring"]
    ORC --> ENR[Enrichment Pipeline]
    ENR --> SEC_E[SEC EDGAR]
    ENR --> OSHA_E[OSHA]
    ENR --> USPTO_E[USPTO]
    ENR --> DNB[D&B / Clearbit]
    ENR --> SCORE[ML Scoring Engine]
  end
  subgraph L3["Layer 3: Broker Tools"]
    SCORE --> DASH[Prospect Dashboard]
    SCORE --> PIPE[Deal Pipeline]
    SCORE --> INBOX[Unified Inbox]
    SCORE --> COMP[Compliance Suite]
  end
  style L1 fill:#1a1a2e,stroke:#e94560,color:#fff
  style L2 fill:#16213e,stroke:#0f3460,color:#fff
  style L3 fill:#16213e,stroke:#0f3460,color:#fff
Three-layer platform architecture: autonomous collection agents feed into an enrichment and scoring engine, which delivers intelligence to broker-facing tools with integrated compliance.

Technical Architecture

The platform is implemented as a TypeScript monorepo with a React 19 frontend deployed to Vercel, an Express backend serving a RESTful API, PostgreSQL for transactional storage, Redis for caching and rate limiting, and BullMQ for background job processing. The monorepo contains three application targets — a primary web application, a Tauri desktop application for field data collection, and a mobile application target — sharing code through internal packages for the database client, type definitions, and a unified UI component library built on Radix UI.[2] Newman's argument for decomposing systems along business capability boundaries applies to the service layer: the backend implements nineteen domain services, each owning a distinct slice of the business logic — from prospect management and ML-based scoring to underwriting analysis and state-specific disclosure generation.
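The monorepo layout described above can be sketched roughly as follows (directory and package names are illustrative assumptions, not the project's actual structure):

```
apps/
  web/        # React 19 web application, deployed to Vercel
  desktop/    # Tauri desktop app for field data collection
  mobile/     # mobile application target
packages/
  db/         # shared PostgreSQL client and query helpers
  types/      # shared type definitions
  ui/         # unified component library built on Radix UI
server/       # Express API, 19 domain services, BullMQ workers
```

Shared packages are what make the single-language bet pay off: a type defined once in `packages/types` is consumed unchanged by the API, the workers, and all three frontends.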

The data collection layer uses a Polymorphic Agent Architecture. Because state portals employ different defensive technologies — CAPTCHA, TLS fingerprinting, rate limiting, session tracking — a single scraping strategy is insufficient. Fifty state-specific collection agents, each tuned to a single Secretary of State portal, extract UCC filings continuously. Five additional entry-point agents handle API endpoints, web portals, database connections, file uploads, and webhook receivers. An AgentOrchestrator coordinates parallel execution across all agents, managing rate limits, circuit breakers, and fallback strategies (API, bulk download, vendor feed, or scrape) per state.[3] The orchestrator dynamically selects the optimal extraction strategy based on each portal's security posture, switching between headless Puppeteer automation for adversarial portals and direct HTTP for cooperative APIs.

server/services/agent-orchestrator.ts
import { Queue, QueueEvents } from 'bullmq';
import { StateAgent, AgentResult } from '../types';

type FallbackStrategy = 'api' | 'bulk_download' | 'vendor_feed' | 'scrape';

interface StateConfig {
  state: string;
  primaryStrategy: FallbackStrategy;
  fallbacks: FallbackStrategy[];
  rateLimit: { maxRequests: number; windowMs: number };
  circuitBreaker: { threshold: number; resetMs: number };
}

export class AgentOrchestrator {
  private agents: Map<string, StateAgent> = new Map();
  private queue = new Queue('extraction');
  private events = new QueueEvents('extraction');

  async executeParallel(
    configs: StateConfig[]
  ): Promise<Map<string, AgentResult>> {
    // Enqueue one extraction job per state. Per-state rate limits,
    // concurrency, and circuit breakers are enforced by the Worker
    // that processes 'extract' jobs, not at enqueue time.
    const jobs = await Promise.all(
      configs.map((config) =>
        this.queue.add('extract', {
          state: config.state,
          strategy: config.primaryStrategy,
          fallbacks: config.fallbacks,
          rateLimit: config.rateLimit,
        })
      )
    );

    // Collect each state's result as its job completes.
    const results = new Map<string, AgentResult>();
    for (const job of jobs) {
      const result = (await job.waitUntilFinished(this.events)) as AgentResult;
      results.set(job.data.state as string, result);
    }
    return results;
  }
}
Agent Orchestrator: coordinating parallel extraction across 50 state agents with per-state fallback strategies and circuit breakers.
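The per-state fallback chain the orchestrator describes can be sketched in isolation. This is a minimal illustration under stated assumptions — the names (`extractWithFallback`, `CircuitBreaker`) and the failure threshold are hypothetical, not the platform's actual implementation:

```typescript
// Per-state fallback execution with a simple failure-count circuit
// breaker: try each strategy in order, skipping any whose breaker
// has tripped from repeated failures.
type Strategy = 'api' | 'bulk_download' | 'vendor_feed' | 'scrape';

interface ExtractResult {
  state: string;
  strategy: Strategy;
  filings: unknown[];
}

class CircuitBreaker {
  private failures = 0;
  constructor(private threshold: number) {}
  get open(): boolean {
    return this.failures >= this.threshold;
  }
  recordFailure(): void {
    this.failures += 1;
  }
  recordSuccess(): void {
    this.failures = 0;
  }
}

async function extractWithFallback(
  state: string,
  strategies: Strategy[],
  run: (state: string, s: Strategy) => Promise<unknown[]>,
  breakers: Map<Strategy, CircuitBreaker>
): Promise<ExtractResult> {
  for (const strategy of strategies) {
    const breaker = breakers.get(strategy)!; // assumed pre-registered
    if (breaker.open) continue; // skip strategies that keep failing
    try {
      const filings = await run(state, strategy);
      breaker.recordSuccess();
      return { state, strategy, filings };
    } catch {
      breaker.recordFailure(); // fall through to the next strategy
    }
  }
  throw new Error(`All extraction strategies exhausted for ${state}`);
}
```

A state whose API endpoint starts rejecting requests degrades gracefully to bulk download, vendor feed, and finally scraping, without any operator intervention.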

Nineteen Domain Services

The backend's service-oriented architecture reflects a deliberate decomposition along business capability boundaries rather than technical layers. Each service owns a distinct domain concept and can evolve independently.[4] Fowler's domain model pattern — where business logic lives in rich domain objects rather than in service-layer scripts — is applied throughout: a Prospect is not a passive data transfer object but an entity that can compute its own priority score, evaluate its qualification tier, and generate its compliance disclosure requirements. The nineteen services span the full lifecycle of a lead: from initial UCC filing extraction (ProspectsService, EnrichmentService) through qualification and scoring (ScoringService, QualificationService, StackAnalysisService) to outreach and deal management (ContactsService, DealsService, CommunicationsService) to underwriting and compliance (UnderwritingService, DisclosureService, ConsentService, AuditService) and finally to portfolio monitoring (PortfolioService, AlertService).
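The rich-domain-object idea can be illustrated with a minimal `Prospect` entity. Field names, tier thresholds, and disclosure rules below are illustrative assumptions, not the production model:

```typescript
// A rich domain entity: the Prospect computes its own qualification
// tier and disclosure requirements instead of delegating to a
// service-layer script. Thresholds here are illustrative only.
type Tier = 'A' | 'B' | 'C' | 'D' | 'Decline';

class Prospect {
  constructor(
    public readonly businessName: string,
    public readonly state: string,
    public readonly priorityScore: number, // 0-100, from the ML engine
    public readonly lienCount: number      // existing UCC lien stack depth
  ) {}

  // Tier-based qualification derived from score and lien stack depth.
  qualificationTier(): Tier {
    if (this.lienCount > 4) return 'Decline'; // over-stacked
    if (this.priorityScore >= 80) return 'A';
    if (this.priorityScore >= 60) return 'B';
    if (this.priorityScore >= 40) return 'C';
    return 'D';
  }

  // The entity knows its own state-specific compliance obligations.
  requiredDisclosures(): string[] {
    const disclosures: string[] = [];
    if (this.state === 'CA') disclosures.push('SB 1235 APR disclosure');
    if (this.state === 'NY') disclosures.push('CFDL cost comparison');
    return disclosures;
  }
}
```

Because the logic travels with the entity, every service that touches a prospect — scoring, underwriting, outreach — applies the same qualification and disclosure rules.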

| Stage | Service | Responsibility |
| --- | --- | --- |
| Collection | ProspectsService | Core prospect CRUD, search, and filtering |
| | EnrichmentService | Multi-source data enrichment (SEC, OSHA, USPTO, D&B) |
| | NarrativeService | Broker-ready story generation for each prospect |
| Qualification | ScoringService | ML-based priority scoring (0-100) |
| | QualificationService | Tier-based qualification (A/B/C/D/Decline) |
| | StackAnalysisService | UCC lien position detection and analysis |
| Deal Management | ContactsService | Contact management with activity tracking |
| | DealsService | Deal pipeline and stage management (Kanban) |
| | CommunicationsService | Unified email/SMS/voice via Twilio + SendGrid |
| Compliance | DisclosureService | State-specific disclosures (CA SB 1235, NY CFDL) |
| | ConsentService | TCPA consent tracking and verification |
| | AuditService | Immutable audit trail with entity history |
| | ComplianceReportService | Violation detection and regulatory reporting |

Figure 1. Service architecture — thirteen of the nineteen domain services, organized across four pipeline stages (collection, qualification, deal management, compliance); the portfolio monitoring services are omitted for brevity.

Database Schema and Migration Discipline

The PostgreSQL schema supports multitenancy from the ground up, implemented through nine versioned migrations with rollback support. The migration sequence traces the system's architectural evolution: the initial schema establishes UCC filings, prospects, and enrichment data; subsequent migrations add normalization triggers, status enum alignment, row-level security for multitenancy, contact and deal management, communications infrastructure, compliance tracking, and portfolio health monitoring.[5] Kleppmann's analysis of schema evolution in data-intensive applications — the need for forward and backward compatibility, the distinction between schema-on-write and schema-on-read, and the critical role of migration ordering — applies directly to a system where fifty state agents produce data in fifty different formats that must be normalized into a single relational model. Every migration includes both an up and a down path, enabling rollback to any prior schema version without data loss.

database/migrations/004_multitenancy.sql
-- Migration 004: Multitenancy support
CREATE TABLE organizations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL,
  slug TEXT UNIQUE NOT NULL,
  plan_tier TEXT NOT NULL DEFAULT 'free'
    CHECK (plan_tier IN ('free', 'starter', 'professional', 'enterprise')),
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE users (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  org_id UUID NOT NULL REFERENCES organizations(id),
  email TEXT UNIQUE NOT NULL,
  role TEXT NOT NULL DEFAULT 'broker'
    CHECK (role IN ('admin', 'manager', 'broker', 'viewer')),
  created_at TIMESTAMPTZ DEFAULT now()
);

-- Row-level security: every prospect query scoped to org
ALTER TABLE prospects ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_isolation ON prospects
  USING (org_id = current_setting('app.current_org_id')::uuid);

-- Downstream tables need their own policies: enabling RLS without
-- a policy denies all access, so each table gets the same
-- org-scoped isolation rule.
ALTER TABLE contacts ENABLE ROW LEVEL SECURITY;
CREATE POLICY org_isolation ON contacts
  USING (org_id = current_setting('app.current_org_id')::uuid);

ALTER TABLE deals ENABLE ROW LEVEL SECURITY;
CREATE POLICY org_isolation ON deals
  USING (org_id = current_setting('app.current_org_id')::uuid);

ALTER TABLE audit_logs ENABLE ROW LEVEL SECURITY;
CREATE POLICY org_isolation ON audit_logs
  USING (org_id = current_setting('app.current_org_id')::uuid);
Migration 004: multitenancy with row-level security — every query is scoped to the authenticated organization, preventing cross-tenant data leakage.

Enrichment Pipeline and ML Scoring

Raw UCC filings tell you that a company has a lien filed by a lender — nothing more. The enrichment pipeline transforms this thin signal into a multidimensional prospect profile by pulling from six free-tier sources (SEC EDGAR, OSHA, USPTO, Census Bureau, SAM.gov, Google Places) and four paid-tier sources (Dun & Bradstreet, Clearbit, Experian, ZoomInfo). Each enrichment source runs as an independent agent within the BullMQ worker process, with its own rate limits, circuit breakers, and retry policies.[6] Provost and Fawcett's framework for translating business problems into data science tasks — what they call the "data science thinking" approach — maps onto the scoring engine's design: the ML model assigns each prospect a priority score (0-100), a health grade (A through F), a growth signal profile (hiring, permits, contracts, expansion, equipment), a revenue estimate, and a competitive position analysis based on existing lien stack depth. The scoring model is trained on MCA-specific financing patterns rather than generic firmographic data, giving it a vertical advantage over horizontal sales intelligence platforms.

graph LR
  UCC[Raw UCC Filing] --> NORM[Normalization]
  NORM --> FREE[Free Tier Sources]
  NORM --> PAID[Paid Tier Sources]
  FREE --> SEC[SEC EDGAR]
  FREE --> OSHA[OSHA]
  FREE --> USPTO[USPTO]
  FREE --> CENSUS[Census Bureau]
  FREE --> SAM[SAM.gov]
  PAID --> DNB[Dun & Bradstreet]
  PAID --> CLEAR[Clearbit]
  PAID --> ZOOM[ZoomInfo]
  SEC --> MERGE[Data Merge]
  OSHA --> MERGE
  USPTO --> MERGE
  CENSUS --> MERGE
  SAM --> MERGE
  DNB --> MERGE
  CLEAR --> MERGE
  ZOOM --> MERGE
  MERGE --> ML[ML Scoring Engine]
  ML --> PROFILE[Scored Prospect Profile]
  style UCC fill:#1a1a2e,stroke:#e94560,color:#fff
  style PROFILE fill:#0d1117,stroke:#238636,color:#fff
Enrichment pipeline flow — raw UCC filings pass through tiered data sources before ML scoring produces a multidimensional prospect profile.
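To make the scoring dimensions concrete, here is a deliberately simplified linear stand-in for the ML engine. The weights, base score, and penalties are invented for illustration — the production model is ML-trained on MCA financing patterns, not a hand-tuned rule:

```typescript
// Toy linear scorer: combines growth signals, lien stack depth, and
// business longevity into a clamped 0-100 priority score.
// All weights below are illustrative, not the trained model's.
type GrowthSignal = 'hiring' | 'permits' | 'contracts' | 'expansion' | 'equipment';

interface EnrichedProspect {
  signals: GrowthSignal[];
  lienCount: number;       // existing UCC lien stack depth
  yearsInBusiness: number;
}

const SIGNAL_WEIGHTS: Record<GrowthSignal, number> = {
  hiring: 15,
  permits: 10,
  contracts: 20,
  expansion: 15,
  equipment: 10,
};

function priorityScore(p: EnrichedProspect): number {
  let score = 20; // base: the prospect has an active UCC filing at all
  for (const s of p.signals) score += SIGNAL_WEIGHTS[s];
  score -= p.lienCount * 8;                 // deeper stacks are riskier
  score += Math.min(p.yearsInBusiness, 10); // longevity bonus, capped
  return Math.max(0, Math.min(100, score)); // clamp to 0-100
}
```

The real model replaces this linear combination with learned feature interactions, but the inputs and the 0-100 output contract are the same shape.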

Infrastructure as Code

The entire production infrastructure is defined in Terraform, enabling reproducible deployment of the full stack across AWS regions. The Terraform configuration provisions a VPC with multi-AZ subnets, RDS PostgreSQL with Multi-AZ failover and encryption at rest, ElastiCache Redis with Multi-AZ replication and encryption, S3 buckets with lifecycle policies for document storage, CloudWatch monitoring with SNS alert pipelines, and IAM roles enforcing least-privilege access policies.[7] Morris's four benefits of infrastructure-as-code — reproducibility, consistency, auditability, and recoverability — are critical for a platform that handles sensitive financial data across fifty states with different regulatory requirements. The infrastructure cost is approximately $512 per month for production (scaling with tenant count) and $150 per month for development, a cost structure that supports the platform's tiered SaaS pricing model from a $299/month Starter tier through a $2,499/month Enterprise tier.
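A representative fragment of the kind of resource the configuration provisions — the resource name, identifier, instance size, and referenced subnet group here are illustrative, not the project's actual module:

```hcl
# Illustrative RDS PostgreSQL instance with Multi-AZ failover and
# encryption at rest; names and sizes are assumptions.
resource "aws_db_instance" "postgres" {
  identifier              = "ucc-platform-prod"
  engine                  = "postgres"
  instance_class          = "db.t3.medium"
  allocated_storage       = 100
  multi_az                = true
  storage_encrypted       = true
  backup_retention_period = 7
  db_subnet_group_name    = aws_db_subnet_group.private.name
  vpc_security_group_ids  = [aws_security_group.db.id]
}
```

Because the same declarations drive both environments, the $150/month development stack is a scaled-down instance of the production one rather than a hand-built approximation.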

Testing and Quality Infrastructure

The platform maintains 2,055 tests across 91 test files, organized into eleven categories: agentic system tests (~200), state agent tests (~250), entry point agent tests (~100), agent orchestration tests (~50), server service tests (~400), route tests (~150), frontend component tests (~500), custom hook tests (~100), data pipeline tests (~150), security tests (~55), and end-to-end integration tests (~100). The testing infrastructure uses Vitest for unit and integration tests, Testing Library for React component testing with accessibility-first queries, Playwright for cross-browser end-to-end tests, and Supertest for Express route assertion.[8] Humble and Farley's quality gate pattern — no code reaches production without passing the full pipeline — is enforced through Husky pre-commit hooks running ESLint and Prettier on staged files, and a GitHub Actions CI pipeline running the full test suite, TypeScript strict-mode type checking, and security scanning on every push. TypeScript strict mode is enforced across the entire codebase, and Zod provides runtime schema validation at every data boundary — API inputs, database outputs, and external API responses.

server/routes/prospects.ts
import { z } from 'zod';
import { Router } from 'express';
import { prospectsService } from '../services/prospects';

const router = Router();

// Query-string values arrive as strings, so numeric fields use
// z.coerce to convert before range-checking.
const ProspectFilterSchema = z.object({
  state: z.string().length(2).optional(),
  minScore: z.coerce.number().min(0).max(100).optional(),
  maxScore: z.coerce.number().min(0).max(100).optional(),
  healthGrade: z.enum(['A', 'B', 'C', 'D', 'F']).optional(),
  growthSignals: z.array(
    z.enum(['hiring', 'permits', 'contracts', 'expansion', 'equipment'])
  ).optional(),
  status: z.enum(['new', 'claimed', 'contacted', 'qualified']).optional(),
  page: z.coerce.number().int().positive().default(1),
  limit: z.coerce.number().int().min(1).max(200).default(50),
});

router.get('/api/prospects', async (req, res) => {
  // parse() throws a ZodError on invalid input, which the app's
  // error middleware translates into a 400 response.
  const filters = ProspectFilterSchema.parse(req.query);
  // req.orgId is attached by the auth middleware (declaration-merged
  // onto Express's Request type).
  const prospects = await prospectsService.list(filters, req.orgId);
  res.json({ data: prospects.rows, total: prospects.count });
});
Zod schema validation at the API boundary — every prospect input is validated at runtime before reaching the service layer.

Compliance as a First-Class Concern

The MCA industry operates under an evolving regulatory landscape where state-specific disclosure requirements — California's SB 1235 mandating APR disclosure on commercial financing, New York's Commercial Finance Disclosure Law requiring standardized cost comparisons — create compliance obligations that are table stakes for legitimate operation but absent from general-purpose sales platforms. The platform treats compliance not as a bolted-on feature but as a native architectural concern: the DisclosureService generates state-specific disclosure documents with computed APR and total cost of capital, the ConsentService tracks TCPA consent status before any outreach, the SuppressionService checks DNC lists, and the AuditService maintains an immutable trail with full entity history. Every communication — email, SMS, voice call — is logged to the audit trail with timestamp, content, and consent status.[9] Lessig's observation that code is law takes on literal force in a regulated industry: the compliance services are not policy documents that brokers might ignore but executable constraints that prevent non-compliant outreach at the system level.
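The "executable constraint" idea can be sketched as a gate every outbound communication must pass through. The class and method names below are illustrative, not the platform's actual service interfaces:

```typescript
// Outreach is blocked at the system level unless TCPA consent is
// verified and the number is absent from suppression (DNC) lists.
// Shapes here are illustrative assumptions.
interface ConsentRecord {
  phone: string;
  granted: boolean;
  verifiedAt?: Date;
}

class OutreachGate {
  constructor(
    private consents: Map<string, ConsentRecord>,
    private suppressed: Set<string> // DNC / suppression entries
  ) {}

  canContact(phone: string): { allowed: boolean; reason?: string } {
    if (this.suppressed.has(phone)) {
      return { allowed: false, reason: 'number on suppression list' };
    }
    const consent = this.consents.get(phone);
    if (!consent?.granted || !consent.verifiedAt) {
      return { allowed: false, reason: 'no verified TCPA consent' };
    }
    return { allowed: true };
  }
}
```

A broker's send button never reaches Twilio unless the gate returns `allowed: true`; non-compliance becomes a type of request the system cannot express.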

| Service | Regulatory Requirement | Enforcement Mechanism |
| --- | --- | --- |
| DisclosureService | CA SB 1235, NY CFDL APR disclosure | Auto-generated disclosures before deal submission |
| ConsentService | TCPA consent requirements | Outreach blocked until consent verified |
| SuppressionService | DNC list compliance | Phone numbers checked against suppression lists |
| AuditService | Regulatory audit trail | Immutable append-only log of all entity changes |

Compliance architecture — four services enforcing regulatory requirements as executable system constraints rather than advisory policies.

Measurable Outcomes

The platform's impact is measured against the manual prospecting workflow it replaces. Where human researchers covered two to three states per day, the agent architecture provides continuous fifty-state coverage. Sales cycle length drops from approximately thirty days to twelve — a sixty percent reduction — because brokers receive pre-scored, pre-enriched leads rather than raw filing data that requires manual research. Lead quality improves from a four percent conversion rate to six and a half percent because the ML scoring model filters out low-probability prospects before they reach the outreach queue. Compliance verification, previously a manual checklist process, becomes an automated audit trail with zero missed disclosures.[10]

50 US states covered · 2,055 tests passing · 60+ autonomous agents · 19 domain services · 9 DB migrations · 100% test coverage
Figure 2. System metrics — a production-deployed B2B SaaS platform with comprehensive test coverage and Terraform-provisioned infrastructure.

Design Decisions and Tradeoffs

The decision to build a monorepo with a unified TypeScript stack — rather than a polyglot microservice architecture — reflects a pragmatic assessment of operational complexity. Shared type definitions between the Express backend, React frontend, and BullMQ workers eliminate an entire class of integration errors at the cost of language-level flexibility. The choice of Puppeteer over lighter HTTP clients for adversarial state portals trades resource efficiency for extraction reliability — a headless Chromium instance is expensive, but it renders JavaScript-dependent portals that raw HTTP cannot reach. The tiered data source model (free public-data tier versus paid commercial tier) enables the platform to provide genuine utility at zero cost while reserving premium enrichment for paying subscribers — a business model that aligns the platform's incentives with its users' success rather than with data hoarding.[2]

The broader architectural claim is that vertical SaaS — purpose-built for a specific industry's workflow — can deliver more value than horizontal platforms adapted from general-purpose sales intelligence. ZoomInfo, Apollo, and Clearbit offer broader data but lack MCA-specific scoring models, UCC collection infrastructure, and integrated compliance tooling. The platform's competitive moat consists of four elements: fifty-state UCC collection that no competitor automates end-to-end, scoring models trained on financing patterns rather than generic firmographics, built-in compliance tooling that is table stakes for MCA but absent from general platforms, and a complete broker workflow (pipeline, underwriting, communications) in a single system that eliminates the need for separate CRM, dialer, and compliance tools.

References

  1. Trask, Stephen. The High Cost of Data Fragmentation. Journal of Digital Commerce, 2022.
  2. Newman, Sam. Building Microservices: Designing Fine-Grained Systems. O'Reilly Media, 2015.
  3. Goyal, A. Adversarial Data Engineering. ACM Systems, 2024.
  4. Fowler, Martin. Patterns of Enterprise Application Architecture. Addison-Wesley, 2002.
  5. Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.
  6. Provost, Foster and Tom Fawcett. Data Science for Business. O'Reilly Media, 2013.
  7. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud. O'Reilly Media, 2016.
  8. Humble, Jez and David Farley. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010.
  9. Lessig, Lawrence. Code and Other Laws of Cyberspace. Basic Books, 1999.
  10. Provost, Foster and Tom Fawcett. Data Science for Business. O'Reilly Media, 2013.