Case study · Applied AI architecture
Case Study: Founding an AI Delivery Function at a Regulated Commercial Lender
In a small, regulated, system-of-record-centric business, the highest-leverage AI work is not the cleverest agent. It is the platform that lets non-engineers ship reviewed, observable, drafts-only automation on top of the existing stack, without ever competing with the system of record.
01 — Engagement overview
I led the founding of the AI delivery function at a regulated commercial lender of roughly three dozen people. This was an intensive two-month contract engagement in spring 2026: I was brought in as Principal AI Engineer, reporting to the CTO and sponsored by the CEO, to build the platform and operating model. My charter, confirmed directly with the CEO, was deliberately not AI literacy. It was technical architecture, skill and agent engineering, and integration with the existing lending stack, with deal and loan throughput as the terminal metric.
One constraint shaped everything. This is a small company with no large engineering organization: the CTO owns a proprietary loan-origination engine that is the system of record, and that team plus me is essentially the entire technical bench. Every architectural decision was bounded by a single rule: AI tooling builds on top of the core platform, never competing with it or forking its data.
02 — Situation on arrival
An executive roadmap with no delivery architecture beneath it
On arrival there was an executive-authored five-phase AI roadmap and genuine top-down enthusiasm, but no delivery architecture beneath it. Production AI adoption was effectively zero: a chat tool used ad hoc by individuals, no shared skills, no governance, no observability, and no defined relationship between AI workflows and the systems of record. The roadmap assumed a march from literacy to autonomous agents to fine-tuned self-hosted models, but it was missing the three things that actually determine whether such a program survives contact with a regulated lender: integration with the existing stack, a data-security and permission model, and fair-lending and model-risk controls.
The concrete problems followed from that. Deal truth, document truth, and file storage lived in three systems with no shared join key. There was no safe path for a non-engineer to ship automation without leaking credentials or fabricating numbers, and no way to see what any AI workflow had actually done, which is disqualifying for anything credit-adjacent. And the executive bar for autonomy, a 99.5% draft-acceptance threshold before removing human review, was being discussed as a near-term target rather than the long-horizon aspiration it is, against an industry norm closer to 70 to 80 percent for comparable sales-AI workflows.
03 — Architectural approach
Treat the program as a platform problem, not a collection of prompts
My central decision was to treat the program as a platform problem, not a collection of prompts. The unit of delivery is a skill: a versioned, reviewed instruction module with a declared safety classification and a fixed set of connector permissions. Skills are authored in a Git repository, distributed to the agentic client as an auto-synced plugin marketplace, and loaded on demand by description match. That gave me one place to enforce review, one audit trail, and a clean separation between the authors and the consumers of a capability. The alternative, letting each department accumulate private prompts inside the tool, is how a lender ends up with unreviewable shadow automation. I rejected it explicitly.
The second decision was that the system of record always wins. AI workflows read truth from the origination engine, the CRM, and the document system; they never recompute or overwrite financial state. Every dashboard I built is a human-review aid, not a ledger. Accounts-payable aging, for instance, is assembled from invoice email for the weekly finance meeting, but the accounting system still owns the authoritative balance. That constraint keeps the system of record out of the AI layer's blast radius and keeps me out of competition with the CTO's platform.
The third decision was drafts-only by default, with human-in-the-loop as an architectural property rather than a policy reminder. No skill I shipped sends an email, posts to a borrower, or mutates a system of record without a human in the path: outreach skills draft into the operator's own mailbox, and diligence skills produce a memo for a named reviewer. This is also what makes the 99.5%-autonomy conversation tractable: autonomy becomes a dial turned per-skill once telemetry justifies it, not a binary the whole program waits on.
The fourth decision concerned distribution and access, and it rested on an uncomfortable truth: the skill marketplace is not a permission layer. Installed skills are visible across the organization, so department segmentation is documentation and ergonomics only, and the real access boundary is the underlying connector credential. I organized departments as separate plugins for discoverability and ownership, but never relied on plugin membership for security.
Two further decisions shaped the runtime. First, live dashboards can only make short calls to backing services, with a ceiling on the order of sixty seconds, so any heavy work runs as an asynchronous job behind a poll contract rather than inside the dashboard: thin artifacts, fat services, with credentials held server-side in one place. Second, observability is two jobs, not one. Operational analytics (who uses the platform and at what cost) has different requirements from agent-quality observability (why a skill produced a bad output). I stood up a wide-event backend for the operational job because it ingests arbitrary telemetry with no schema gate, and reserved a purpose-built LLM-observability platform for the higher-stakes later-phase agents.
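The thin-artifact, fat-service split can be sketched as a submit-then-poll contract. This is a minimal in-process illustration under stated assumptions, not the deployed service: `JobStore`, `submit`, `poll`, `wait_for`, and the status strings are hypothetical names, and a background thread stands in for the real backing worker.

```python
import threading
import time
import uuid


class JobStore:
    """In-memory stand-in for a backing service that owns long-running work."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, work, *args):
        """Start work in the background and return a job id immediately."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None, "error": None}
        worker = threading.Thread(target=self._run, args=(job_id, work, args), daemon=True)
        worker.start()
        return job_id

    def _run(self, job_id, work, args):
        try:
            update = {"status": "done", "result": work(*args), "error": None}
        except Exception as exc:  # surface the failure to the poller
            update = {"status": "failed", "result": None, "error": str(exc)}
        with self._lock:
            self._jobs[job_id] = update

    def poll(self, job_id):
        """Cheap status read: the only call the dashboard repeats."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job else {"status": "unknown", "result": None, "error": None}


def wait_for(store, job_id, interval=0.05, timeout=5.0):
    """Client-side loop: many short polls instead of one long call."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = store.poll(job_id)
        if state["status"] != "running":
            return state
        time.sleep(interval)
    return {"status": "timeout", "result": None, "error": None}
```

Each `poll` returns well inside the dashboard's call ceiling regardless of how long the job itself runs, and any credential the job needs stays with the service rather than the artifact.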
On the executive roadmap I pushed back on two points and carried both. Supervised fine-tuning on a self-hosted model should be conditional on earlier-phase performance data rather than assumed; frontier models plus retrieval will plausibly cover ninety percent of the value at a fraction of the operational cost. And rather than let strategic priority override adoption readiness, I split the training tracks, sequenced by readiness, from agent investment, sequenced by strategic order, so neither blocked the other.
04 — Systems and skills built
An honest inventory: shipped, prototyped, specified
What follows distinguishes what shipped to production from what I prototyped and what I specified in design. Department coaches authored many of the department-specific skills as forks of reference patterns I established; I owned the platform, the foundational skills, and the engineering review gate.
Platform and governance layer · production
- Skill marketplace. A Git repository deployed as a plugin marketplace: ten plugins enumerated through a manifest (one per department plus a shared plugin), with a CI pipeline that rebuilds the catalog and announces diffs to a team channel. The distribution backbone for the entire program.
- Governance pipeline. A publish-and-audit path. A publish step writes a candidate skill to a review folder and posts metadata to a helpdesk channel, which auto-files a triage ticket; an auditor then runs mechanical checks and surfaces judgment flags with explicit reviewer routing. The merge gate that makes the catalog governable.
- Helpdesk intake. A chat-based helpdesk that converts member questions into tracked tickets, live since late spring 2026.
- Enrichment service. A standalone service that runs a research-and-drafting pipeline against contact cohorts and writes finished drafts back into the CRM as custom contact properties, behind a versioned JSON data contract, with a per-contact audit trail and a read-only health monitor. Runs daily and on demand. Shipped and running against real investor and broker cohorts.
- Workflow audit. A skill that produces a ranked list of automation candidates for an individual employee, with scope, estimated hours saved, and a recommended delivery pattern. The on-ramp by which non-technical staff find their first good target.
›deal-followup / SKILL.md · shipped
---
name: deal-followup
description: Draft daily borrower follow-ups for the application fields still missing on pre-quote SMB deals.
owner: smb-origination
safety_classification: drafts-only   # never sends
connectors:
  - crm:read           # deal stage, owner, contact
  - lending-api:read   # application field state
  - email:draft        # operator mailbox only
  - chat:write         # one-line run summary
review:
  required: true
  reviewer: human-operator   # named on every artifact
schedule: "0 7 * * 1-5"      # weekday 07:00 CT; on-demand too
---
Department and shared skills · production, drafts-only
- Lending operations. A loan-document sorter that routes email attachments into the correct numbered deal subfolder, enforces recording numbers, and direct-messages the operator a gap list against the closing checklist; a prescreen credit-memo drafter; a credit-package-to-signature-envelope PDF converter; and a sponsor diligence and adverse-media research memo. All human-triggered, all draft output.
- SMB origination. A scheduled weekday job that drafts borrower follow-up emails for missing application fields, plus live pipeline-board and pre-quote dashboards over the CRM. Drafts into the operator's mailbox; never sends.
- Investor relations and growth. Warm-outreach and dormant-broker re-engagement tooling, professional-network enrichment and outreach, and an accounts-payable aging dashboard for the weekly finance review.
- Productivity and brand. A per-user daily brief, recurring team priority boards, meeting action-item extraction, a session-handoff tool, and a brand-and-voice layer used as a final pass across other skills.
Prototype & design-spec
- Stakeholder interview bot (prototype). A branching-questionnaire web application over the model API producing structured plus narrative output, deployed for field research into automation priorities. Not hardened for production.
- Semi-autonomous agents (specified). Nightly portfolio surveillance, covenant monitoring, automated diligence assembly, and recurring investor-report drafting, each specified with human approval gates and a per-agent model card. An autonomous credit-memo agent is queued behind resolving CRM write-scope gaps. Specified, not built.
Deployed: one workflow on a real desk
The clearest test of the platform was taking one skill to a single operator and watching it survive contact. The skill was deal-followup, the scheduled weekday job that drafts borrower follow-ups for missing application fields. The operator was an SMB loan originator whose pre-quote book ran to roughly two dozen active deals, each needing the same unglamorous sweep: open the deal, work out which Stage 1 fields are still missing (entity name, EIN, term, estimated credit, and so on), and write the borrower an ask for exactly those. By hand that was an estimated five to ten minutes a deal, so a full sweep ran into hours and, in practice, slipped; deals sat for days waiting on the same three or four fields.
Velocity mattered, and the build showed it. The marketplace was stood up, and within about two weeks four onboarding skills had been tested and the audit-to-publish flow validated in a dry run; the first department-integrated workflow, this one, was in the SMB team's hands for live testing roughly a week after that. First spec to team sign-off took sixteen days.
The first time it ran in front of the team, the skill was about to ask a broker for a borrower's Social Security number. The model caught the ask and fell back to a neutral template on its own; the assumption behind it was still wrong, and that was the lesson.
The skill had assumed the primary CRM contact was the borrower. On the first pull, four of the first five contacts were brokers, not borrowers, and the draft logic was about to request personal items (SSN, date of birth) from a broker, the exact NPI-adjacent ask that torches a broker relationship. A second failure surfaced in the same run: an internal audit stamp and a run of em-dashes had leaked into borrower-facing copy. The drafts were signing off with "Assembled by the agent. Reviewed by [operator], [date]," which belongs in an audit log, not a borrower's inbox. A reviewer's QA pass the next day filed four more precise defects, including: contacts with a tax ID already on file were still being asked for an SSN; a deal with no borrower name rendered a broken subject line; and one contact that the application API typed as a sponsor but the CRM's lead-status field tagged as broker outreach exposed a source-of-truth conflict.
Each failure became a rule, and the integration got simpler, not cleverer:
- Dropped the direct lending-API call and its service token for the application MCP tool, so there is no secret in the skill and no key to manage, and fixed a field-shape mismatch so the missing-field diff runs as written.
- Made the deal-list query carry contact email and phone, so in degraded mode (CRM offline) the skill still drafts instead of skipping every deal for a missing recipient.
- Skipped broker-intermediated deals entirely, and treated the CRM's lead-status field as authoritative over the application API's type when the two disagree, so the skill never drafts a borrower ask to a broker.
- Gated the SSN ask on the tax-ID-on-file flag so it stops asking for an SSN already on record, and added a borrower-name fallback so the subject line never breaks.
- Stripped the audit stamp and the em-dashes from borrower-facing drafts, and moved the stamp to the Slack run summary, where reviewers actually need it.
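The recipient and field rules above can be sketched as one small gating function. This is an illustration under stated assumptions, not the shipped skill logic: the `Deal` fields, the `"ssn"` field name, the substring test on the lead status, and the fallback subject wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Deal:
    borrower_name: Optional[str]
    crm_lead_status: str      # e.g. "borrower" or "broker outreach" (assumed values)
    api_contact_type: str     # e.g. "borrower" or "sponsor" (assumed values)
    tax_id_on_file: bool
    missing_fields: List[str] = field(default_factory=list)


def plan_draft(deal):
    """Return None to skip the deal, else the subject and the fields to ask for."""
    # The CRM lead status is authoritative. api_contact_type is deliberately
    # not consulted, so a conflict between the two resolves to the CRM's view.
    if "broker" in deal.crm_lead_status.lower():
        return None  # never draft a borrower ask to a broker

    # Do not ask for an SSN when a tax ID is already on record.
    asks = [f for f in deal.missing_fields
            if not (f == "ssn" and deal.tax_id_on_file)]
    if not asks:
        return None  # nothing left to request

    # Borrower-name fallback so the subject line never renders broken.
    name = deal.borrower_name or "your application"
    return {"subject": f"Items still needed for {name}", "asks": asks}
```

Keeping every rule in one pure function means each defect from the QA pass maps to a single testable branch.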
The honest delta: the first clean pass triaged the whole live pre-quote book in a single run of a few minutes, against a manual sweep measured in hours, and the originator's sign-off was a one-line "let's go." The catch the telemetry forces me to state is that the skill did not then run for two sustained weeks of solo production. It was deliberately folded into a single interactive pre-quote artifact that became the one owner of morning drafting, to avoid duplicate drafts. So the measured reality is a validated workflow and a fast convergence onto the interactive surface, not a multi-week hours-saved curve. The rollout was early; that is the same honest framing as the rest of this account.
05 — Hard technical problems solved
The problems that determined whether any of it worked
Cross-system data join
Deal status lives in the CRM, document status in the lending document system, and the files in cloud storage referenced by path; no shared primary key spans them. I made the loan number embedded in the deal-folder title the canonical join key and built the document sorter to treat that token as the source of truth rather than trusting any single system's record. That turned a three-way reconciliation problem into a deterministic lookup.
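A minimal sketch of that lookup, under loud assumptions: the text does not give the loan-number format, so the `LN-` plus six digits pattern is hypothetical, as are the function names and the idea that each system's records can be indexed by loan number once the token is extracted.

```python
import re

# ASSUMPTION: loan numbers look like "LN-" plus six digits. The real format
# is not stated in the text; swap in the actual pattern.
LOAN_NUMBER = re.compile(r"\bLN-\d{6}\b")


def loan_key(folder_title):
    """Extract the loan number from a deal-folder title, or None if absent or ambiguous."""
    matches = set(LOAN_NUMBER.findall(folder_title))
    return matches.pop() if len(matches) == 1 else None


def join_by_loan(crm_deals, doc_statuses, folder_titles):
    """Join three systems on the loan number found in each folder title.

    crm_deals and doc_statuses are dicts keyed by loan number (assumed);
    folder_titles maps storage path -> folder title.
    """
    joined, unmatched = {}, []
    for path, title in folder_titles.items():
        key = loan_key(title)
        if key is None:
            unmatched.append(path)  # route to a human instead of guessing
            continue
        joined[key] = {
            "path": path,
            "deal": crm_deals.get(key),
            "documents": doc_statuses.get(key),
        }
    return joined, unmatched
```

Titles with no token, or with two different tokens, land in `unmatched` rather than being guessed at, which is what keeps the lookup deterministic.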
Connector permission model
The CRM connector reported write failures on specific object types through a distinct permission flag rather than a connection error. I diagnosed it as a per-object scope gap, with note and email writes blocked, and routed it to the platform owner as a narrowly scoped grant framed around agent activity-logging rather than blanket write access. The broader lesson, that the marketplace is not the access boundary and the connector credential is, hardened into a standing principle.
Enrichment integration without new infrastructure
Rather than build a bespoke adapter, I used CRM custom properties as the integration bus. A request flag, surfaced as a re-enrich control in the live dashboards, acts as a priority override into the enrichment queue; results return as a single versioned JSON payload the consuming dashboard parses, with graceful fallback to a template. I trimmed the property schema from twelve fields to six, on the rule that a field exists only if the CRM itself must query, sort, filter, or display it; everything else folds into the payload. The deterministic halves, candidate selection and write-back, are standard-library Python bracketing the model-driven core, and they compute the staleness hash from one shared module so the selector and the writer can never disagree.
›_hs_common.py · the shared staleness contract · shipped
# _hs_common.py: the one module both sides import
import hashlib

def input_hash(firstname, lastname, company, stage):
    # A CONTRACT shared by selection and write-back. The selector marks a
    # contact stale when this != the stored hash; the writer stores exactly
    # this. Both compute it identically, so it lives here and nowhere else.
    parts = [(firstname or ""), (lastname or ""), (company or ""), (stage or "")]
    key = "|".join(p.strip().lower() for p in parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# selector: is_stale = (stored_hash != input_hash(fn, ln, co, stage))
# writer:   props["enrichment_input_hash"] = input_hash(fn, ln, co, stage)
# one definition, imported on both sides; the two can never drift apart.
›enrichment · data contract · v1.2.0
{
"schema_version": "1.2.0",
"contact_id": "…",
"enriched_at": "2026-05-…T…Z",
"signal_score": 0.0, // CRM sorts on this
"draft_subject": "…", // CRM displays this
"draft_body": "…", // CRM displays this
"payload": { /* everything else — opaque blob */ }
}
// 6 first-class fields; the rule: a property exists only if
// the CRM must query, sort, filter, or display it.
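On the consuming side, the "parse, else fall back to a template" behavior might look like the sketch below. It is an assumption-laden illustration: `load_draft`, `TEMPLATE`, the template wording, and the major-version check are hypothetical, and it assumes the stored payload is plain JSON without the explanatory comments shown in the contract above.

```python
import json

SUPPORTED_MAJOR = 1  # assumed: the dashboard understands 1.x payloads
TEMPLATE = {
    "draft_subject": "Following up",   # placeholder wording, not the real template
    "draft_body": "(template body)",
    "source": "template",
}


def load_draft(raw):
    """Parse an enrichment payload; fall back to the template on any problem."""
    try:
        data = json.loads(raw)
        major = int(str(data["schema_version"]).split(".")[0])
        if major != SUPPORTED_MAJOR:
            raise ValueError("unsupported schema major version")
        subject, body = data["draft_subject"], data["draft_body"]
        if not (isinstance(subject, str) and subject.strip()):
            raise ValueError("empty subject")
        if not (isinstance(body, str) and body.strip()):
            raise ValueError("empty body")
    except (TypeError, ValueError, KeyError):
        # Malformed JSON, wrong shape, missing field, or unknown version:
        # the dashboard still renders something usable.
        return dict(TEMPLATE)
    return {"draft_subject": subject, "draft_body": body, "source": "enrichment"}
```

Tagging the result with its `source` lets the dashboard show a reviewer whether a draft came from enrichment or from the fallback.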
Observability under a logging gap
The agentic client's own activity is excluded from its audit logs, compliance API, and data exports, a real gap for anything credit-adjacent. I treated agent and skill execution as a telemetry problem and routed OpenTelemetry traces, spans, and outcomes into an independent backend so they are queryable regardless of the client's native logging. That telemetry is the compensating control that makes credit-adjacent and scheduled workflows defensible. A secondary problem fell out of it: cost data lived on model spans while skill names lived on skill spans, and almost no span carried both (under 1% of spend carried a skill tag). The two span types share a session identifier, so a derived column joined on that identifier produced cost-per-skill-invocation with no new instrumentation.
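The session-identifier join can be illustrated offline in a few lines. This is a sketch, not the backend's derived column: the span field names (`session_id`, `cost_usd`, `skill`) are hypothetical, and splitting a session's cost evenly across the skills it invoked is an assumption made here for simplicity.

```python
from collections import defaultdict


def cost_per_skill(model_spans, skill_spans):
    """Attribute model cost to skills via the shared session identifier.

    model_spans: iterable of {"session_id", "cost_usd"}
    skill_spans: iterable of {"session_id", "skill"}
    Returns {skill: average cost per invocation}.
    """
    # Total model cost per session.
    session_cost = defaultdict(float)
    for span in model_spans:
        session_cost[span["session_id"]] += span["cost_usd"]

    # Skills invoked per session.
    session_skills = defaultdict(list)
    for span in skill_spans:
        session_skills[span["session_id"]].append(span["skill"])

    totals, runs = defaultdict(float), defaultdict(int)
    for session_id, skills in session_skills.items():
        # ASSUMPTION: a session's cost is split evenly across its skill runs.
        share = session_cost.get(session_id, 0.0) / len(skills)
        for skill in skills:
            totals[skill] += share
            runs[skill] += 1
    return {skill: totals[skill] / runs[skill] for skill in totals}
```

Sessions that never invoked a skill contribute nothing to the result, so their spend stays unattributed rather than being spread over the skills.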
Confused-deputy risk across tools
I threat-modeled whether a prompt injection delivered through untrusted content (an inbound email or a scraped page) could chain a draft-only email connector with browser automation to actually send a message. The mitigations were architectural rather than exhortatory: isolate the sessions that process untrusted content from the sessions that can act on the mailbox so the chain cannot form, and place the human checkpoint at a layer the agent cannot bypass (the provider's native send confirmation) rather than trusting the model to stop itself.
Figure: prompt-injection success rate, per published platform evals (bars scaled to the unprotected case). Residual risk concentrates in well-camouflaged injections and confirmation fatigue, so the checkpoint sits below the agent, not inside it.
Sandbox-to-document-API authentication
An in-place document formatter needed a capability the storage connector could not perform, and the agent's execution sandbox sits one trust layer below the connectors and never sees their OAuth tokens. I worked the four real options and recommended a service account as the immediate unblock, accepting the loss of per-user audit, with a custom protocol server as the clean answer to adopt before any further skills depend on that API. The discipline was separating unblock-testing-this-week from the-pattern-we-commit-to.
A second model for the jobs Claude couldn't reach
Claude and the agentic client drive the program, but a couple of Google-native jobs sat outside the sandbox's reach, and each got handed to the surface that already holds the data and the credentials. The agentic client can see an email's attachments but cannot download the bytes, so a Gemini-powered Google Workspace Flow ingests every non-image attachment into a shared Drive folder, where the document sorter and the AP-aging dashboard pick them up. And because a full skill file or plugin bundle exceeds the Drive connector's per-argument cap, the publish path writes through a Google Apps Script web app running as a service account, where the write credential already lives. The in-place Google Docs edit from the previous problem is still open on the same options, a service account or a custom protocol server, not Gemini. The principle that generalizes: put each job next to the data it needs and let the primary agent orchestrate, rather than bending one platform to do everything.
06 — Architecture principles established
Positions that hardened into durable patterns
- The system of record always wins. AI reads truth and drafts against it; it never recomputes or overwrites it.
- Drafts-only by default. A knowledgeable human validates every output against a known source of truth; autonomy is earned per-skill through telemetry rather than granted up front. Every artifact carries a review line naming the human who signed off.
- The marketplace is distribution, not permission. The connector credential is the real access boundary, and the design must reflect that.
- Thin artifacts, fat services. Anything past the artifact timeout is an asynchronous job behind a poll contract, with keys held server-side in one place.
- Observability is two jobs, not one. Operational analytics and agent-quality observability have different requirements; keep raw loan-data traces out of any backend that does not redact at ingest.
- Schema minimalism on shared systems. A field exists only if the system must query, sort, filter, or display it.
- Precise status language is load-bearing. Drafted, prototyped, and built mean different things, because status inflation is how an AI program loses credibility with risk and audit.
- Put the human checkpoint where injection cannot reach it. Design the boundary one layer below the agent rather than trusting the agent to stop itself.
- Put each job next to its data and credentials. Claude orchestrates, but Gmail-attachment ingestion runs as a Gemini-powered Workspace Flow into Drive, and oversized Drive writes go through an Apps Script service account; the agent calls whichever surface owns the data rather than forcing everything through its sandbox.
07 — Scope and impact
What it reached, and the measured reality
The program reached an organization of roughly three dozen people. I stood up a ten-plugin marketplace, a CI distribution pipeline, a chat-to-ticket governance and helpdesk path, a scheduled enrichment service with its own health monitoring, and on the order of thirty distinct skills, nineteen of them firm-authored. Those skills populate six of the ten plugin namespaces: CRE and lending operations, SMB origination, investor relations, finance, technology, and shared. The remaining four (marketing, legal, HR, and asset management) were scaffolded for ownership but not yet populated. The skills call seven first-class connectors: the CRM, email, cloud storage, calendar, team chat, project management, and a single Cowork-managed lending API that exposes document upload as well as the application and admin reads (the old direct API host was retired). The observability backend counts as an eighth connector, but it is a telemetry sink the skills never call. Zoom, Sheets, and browser automation appear in a handful of skills. Observability was stood up from zero: traces flow into the wide-event backend at roughly 345,000 events a month across the agent datasets, with cost-per-skill derived through a session-identifier join and cost-threshold alerting on top. That record also partially closed the platform's compliance blind spot by creating a who-did-what-when history where none had existed.
›cost & skill usage · measured · trailing 30d
-- model-span cost, joined to skill names on session_id
30-day model spend       ~$8,980   across 3,685 sessions (~$2.44 each)
carries a skill tag      ~$60      <1% of spend; the rest sits on model spans

most-invoked skills, 30d           runs
publish-skill                      37   # platform-meta
internal-skill-creator             30   # platform-meta
internal-voice                     11
dormant-broker-outreach            8    # value skill
daily-brief                        5    # value skill
deal-followup                      4    # value skill

-- cost lives on model spans, names on skill spans; <1% carries
-- both, so the session_id join is the only path to $/skill.
Most measured invocations were skill-building, not production runs of the value-generating department skills. The rollout was early, and that is the measured reality, not a projection.
Figure: adoption distribution, skills by repeat-use cohort. Recurring use concentrated in the authoring and platform-meta skills rather than the department skills. An honest early-rollout curve, not a finished one.
Figure: hours returned per week, measured vs designed ceiling. The range is gated on adoption, not a midpoint. The ceiling assumes full intended adoption and was not reached during the engagement; the telemetry confirms it, so I would not lead with it.
Tied back to the charter's terminal metric, deal and loan throughput: on the one workflow taken to a real desk, a manual pre-quote sweep measured in hours collapsed to a single multi-minute run that drafted field-specific asks across the live book, and the credit-side prescreen and diligence skills target two to four analyst hours per deal. The throughput logic is that the freed originator and analyst time converts into faster time-to-quote and more deals worked per head. I will not claim the closed-loan number, because the engagement ended before sustained production usage could show that conversion as a curve. The honest statement is a validated time-to-value at the workflow level and a plausible but unproven link to throughput at the portfolio level.
08 — Governance and compliance
Governance as load-bearing, not paperwork
Because this is a regulated lender, I treated governance as load-bearing rather than paperwork. I argued for fair-lending controls (ECOA and FCRA considerations) alongside model-risk guidance (SR 11-7) in any credit-adjacent workflow, and made human-in-the-loop a property of the architecture so that no model output reaches a borrower or a credit decision without a named human in the path. The auditor flags credit-adjacent logic, financial calculations, writes to systems of record, and the handling of non-public personal information, routing each to the right reviewer rather than rubber-stamping it.
The auditor's design point is that it produces a report, not a verdict. It runs in two halves: deterministic scripts handle the mechanical checks, so whether a regex matched a credential is never an LLM judgment call, and an LLM pass handles the judgment flags and routes each to a named reviewer with a blocking-or-advisory severity. The merge stays a human action against the pull request; if the model in the loop were the gate, the gate would be broken. The auditor will auto-patch a narrow set of purely mechanical defects (frontmatter shape, connector-list coercion, naming, a drafted anti-trigger) so schema nits do not cost a review round-trip, but it never edits workflow logic, never overrides a judgment flag, and never commits to the marketplace itself.
›skill-auditor · the merge gate · report, not a verdict
mechanical checks: deterministic, run by scripts (exit 0/1/2)
  ✓ secrets / PII scan                       scan-secrets.sh
  ✓ frontmatter shape + fields               validate-skill.py
  ✓ name is bare kebab-case                  # plugin folder is the namespace
  ✓ marketplace dedup                        # same-folder collision = RED
  ✓ connectors declared = used               # used-but-undeclared = RED
  ✓ description has trigger + anti-trigger
  ✓ dependencies on allowlist

judgment flags: LLM analysis, routed to a named human
  ⚑ credit-adjacent         → Compliance reviewer (fair lending)    blocking
  ⚑ financial calculation   → AI team reviewer (+ compliance)       blocking
  ⚑ system-of-record write  → Platform owner (permissions)          blocking
  ⚑ NPI handling            → General Counsel (data review)         blocking
  ⛔ external send w/o HITL  → hard reject, bounce to author

auto-patch (schema only): frontmatter shape · connector-list coercion · naming bare-form · drafted anti-trigger.
body + judgment fixes stay with author.
a human reads the report and merges the PR. the auditor never merges.
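The deterministic half of the auditor can be sketched as plain functions over parsed frontmatter. This is a hypothetical illustration, not `validate-skill.py` itself: the `email:send` scope name, the `YELLOW` severity, and the meaning assigned to exit codes 0, 1, and 2 are assumptions, and treating a send scope on a drafts-only skill as a mechanical check is a simplification of the hard-reject rule.

```python
import re

KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
SEND_SCOPES = {"email:send"}  # hypothetical scope name for an outbound send


def mechanical_checks(frontmatter, connectors_used):
    """Return a list of (severity, message) findings; an empty list means clean."""
    findings = []

    name = frontmatter.get("name", "")
    if not KEBAB.match(name):
        findings.append(("RED", f"name {name!r} is not bare kebab-case"))

    declared = set(frontmatter.get("connectors", []))
    used = set(connectors_used)
    for scope in sorted(used - declared):
        findings.append(("RED", f"connector used but not declared: {scope}"))
    for scope in sorted(declared - used):
        findings.append(("YELLOW", f"connector declared but unused: {scope}"))

    drafts_only = frontmatter.get("safety_classification") == "drafts-only"
    if drafts_only and (declared | used) & SEND_SCOPES:
        findings.append(("RED", "drafts-only skill requests a send scope"))
    return findings


def exit_code(findings):
    """ASSUMED mapping: 0 clean, 1 advisory only, 2 at least one blocking finding."""
    severities = {severity for severity, _ in findings}
    if "RED" in severities:
        return 2
    return 1 if severities else 0
```

Because nothing here calls a model, the same input always yields the same findings, which is the property the report relies on for its mechanical section.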
I flagged the client's audit-log gap as a compliance constraint and selected an external observability backend as the compensating control. One data-protection caveat shaped the choice: the operational store ingests raw prompt text that can contain borrower names and deal addresses, acceptable for an internal analytics tool the AI team controls, but not the long-term home for agent traces once real loan data flows through later-phase workflows, where pre-redaction before storage is required.
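Pre-redaction before storage could start from something like the sketch below. It is deliberately narrow and hypothetical: the event keys (`prompt`, `completion`) and the replacement labels are assumptions, and it only covers identifiers with a fixed shape. Borrower names and deal addresses, the cases named above, have no fixed shape and would need a different technique than pattern matching.

```python
import re

# Fixed-shape identifiers only. Names and street addresses are NOT handled here.
PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("EIN", re.compile(r"\b\d{2}-\d{7}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


def redact(text):
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


def redact_event(event, text_keys=("prompt", "completion")):
    """Return a copy of a telemetry event with its free-text fields redacted.

    The original event is left untouched so the caller decides what to keep.
    """
    clean = dict(event)
    for key in text_keys:
        if isinstance(clean.get(key), str):
            clean[key] = redact(clean[key])
    return clean
```

Running this step before the exporter hands events to the backend means the raw text never reaches storage, rather than being scrubbed after the fact.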
The highest-leverage governance insight was structural. The brand-and-voice layer sits, by design, in the path of nearly every outgoing communication, which makes it the single best instrumentation point for a communications-layer compliance check covering prohibited claims, disclosure language, ECOA-adjacent phrasing, and accidental disclosure of non-public pipeline detail. Placing the check there avoids building a separate compliance skill or relying on every author to invoke one.
09 — Key learnings
What worked, what I would do differently, and the thesis
What worked was leading with platform and governance before breadth. Building the marketplace, the review gate, and the observability backend first meant that when department skills proliferated they landed on rails instead of as scattered prompts. Drafts-only as a default aged extremely well; it defused the autonomy debate by turning autonomy into a per-skill dial. Using CRM custom properties as an integration bus shipped in days. The loan number as a canonical join key removed an entire category of reconciliation bugs. And quantifying everything moved leadership conversations from opinion to arithmetic.
What I would do differently: instrument adoption from day one rather than after skills shipped, so the hours-saved story rests on measured data. Force the connector permission model to an explicit decision earlier. Resolve system-of-record write scopes before designing any skill that depends on them. Push harder and sooner on separating the roadmap's aspirations (autonomy targets and fine-tuning) from its near-term deliverables. And invest earlier in a catalog-level overlap reviewer, not just the per-skill auditor, because overlapping authorship is the failure mode that quietly kills a skills platform and a per-skill check cannot see it.
Get the substrate right and capability compounds safely; get it wrong and you accumulate unauditable risk that someone eventually has to unwind.
The generalizable thesis: in a small, regulated, system-of-record-centric business, the highest-leverage AI work is not the cleverest agent. It is the platform that lets non-engineers ship reviewed, observable, drafts-only automation on top of the existing stack without ever competing with the system of record.