Economics: 72% Cache, 90% Margin — Profiled Technical Blog

The economics of an AI-powered platform are typically discussed in terms of capability: which model, which context length, which quality level. What is less often discussed is the compounding structural advantage that emerges when you combine semantic caching, intelligent model routing, and behavioral profile density. The Profiled platform has been running in production for six months with a stable $0.02 average cost per query — a 5-10x cost advantage over systems that naively route every query to a frontier model. The mechanism behind this advantage is worth understanding precisely.

The headline metrics: $0.02 average cost per query (vs. $0.10-$0.20 industry standard), 72% semantic cache hit rate, and 70-90% gross margins depending on tier. These are not aspirational — they are production-measured over six months of real user interactions. The structural properties that produce them do not degrade at scale; they improve.

$0.02 average cost per query (vs. $0.10-$0.20 standard)
72% cache hit rate (semantic equivalence)
70-90% gross margins (by revenue tier)
6 months production proven (real user data)

The Semantic Cache Mechanism

The semantic cache is the primary driver of the cost reduction. To understand why it achieves a 72% hit rate rather than the 20-30% hit rate of literal string-matching caches, you have to understand how behavioral queries are distributed.

Suppose user A asks "explain transformer attention mechanism" and user B asks "how does attention in transformers work". These are different strings, so a literal cache misses the match. A semantic cache stores queries as embedding vectors and retrieves matches within a cosine similarity threshold, so it correctly identifies the two as semantically equivalent and returns user A's answer to user B at near-zero cost.

The 72% hit rate means that 72% of queries cost nothing beyond the embedding lookup (a fraction of a cent), because an earlier user has already asked a semantically equivalent question. Only the remaining 28% require actual model inference: the novel queries that have no semantic match in the cache.
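The lookup described above can be sketched as follows. This is a minimal in-memory illustration, not the platform's implementation: the `SemanticCache` class, the 0.9 threshold, and the toy 3-dimensional vectors are all assumptions, and the embedding model that would produce real vectors is not specified in the source.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  // The threshold value is hypothetical; the production value is not given.
  constructor(threshold = 0.9) {
    this.threshold = threshold;
    this.entries = []; // each entry: { embedding, answer }
  }

  // Return the most similar cached answer at or above the threshold,
  // or null on a miss (the caller would then run model inference).
  lookup(embedding) {
    let best = null;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score > bestScore) {
        bestScore = score;
        best = entry;
      }
    }
    return best !== null && bestScore >= this.threshold ? best.answer : null;
  }

  store(embedding, answer) {
    this.entries.push({ embedding, answer });
  }
}

// Toy vectors stand in for real query embeddings.
const cache = new SemanticCache(0.9);
cache.store([0.9, 0.1, 0.0], 'cached answer about transformer attention');
const hit = cache.lookup([0.88, 0.12, 0.01]); // nearly the same direction
const miss = cache.lookup([0.0, 0.1, 0.95]); // unrelated direction
console.log(hit !== null, miss === null);
```

The linear scan keeps the sketch short. At production corpus sizes a vector index would normally replace it, but the hit-or-miss decision is the same.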

Why 72% Is the Right Number

The hit rate is a function of corpus size and query distribution. With a small corpus, many queries are novel and the hit rate is low. As the corpus grows, the hit rate increases asymptotically. 72% after 6 months of production use suggests the query distribution is relatively concentrated: most users are asking about a common set of behavioral and developmental topics, with a long tail of unique queries. The ceiling is estimated at somewhere above 85%, and the hit rate should approach it as user volume and corpus size grow.

The behavioral intelligence domain is particularly amenable to high cache rates because behavioral queries cluster around a limited set of universal human developmental themes. Questions about career transitions, identity clarity, relationship patterns, and purpose alignment appear in many semantically equivalent forms across many different users. The semantic cache captures this clustering and turns it into a structural cost advantage.

The Economic Routing Table

The second cost driver is intelligent model routing. Not every query goes to the most expensive model. The routing system assigns each query to a model tier based on the revenue it is expected to generate, then applies the cheapest model that can deliver acceptable quality at that revenue level.

Gem Cost	Revenue	Model Strategy	Target Margin
100+ gems	$10+	Claude (premium quality)	70%
50-99 gems	$5-9.99	Gemini → Claude (balanced)	80%
25-49 gems	$2.50-4.99	Gemini → GPT-4o-mini (efficient)	85%
<25 gems	<$2.50	Gemini → GPT-4o-mini (budget)	90%
Free	$0	Gemini (cheapest)	N/A

The logic is simple: high-revenue queries justify expensive models because the margin is still healthy at 70%. Low-revenue queries require cheap models to hit margin targets. The routing system ensures that every query is served by the most economical model that can maintain the quality standard for that query type.

A critical property: some task types always use specific models regardless of cost. Life composition analysis always uses Claude, because the quality difference between Claude and cheaper models on this task type is significant enough to affect user trust and retention. Quest generation uses economic routing because the quality difference is less pronounced and the volume is much higher. Task-type overrides take precedence over gem-cost routing when they conflict.
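The routing rules above, including the precedence of task-type overrides, can be sketched as a small lookup. The function name `selectRoute`, the task-type key spelling, and the choice to keep the tier's target margin on an override are assumptions for illustration. The strategy strings are copied from the table as labels, without interpreting what the arrows mean operationally.

```javascript
// Gem-cost tiers transcribed from the routing table, highest tier first.
// Gem costs are assumed to be whole numbers.
const TIERS = [
  { minGems: 100, strategy: 'Claude (premium quality)', targetMargin: 70 },
  { minGems: 50, strategy: 'Gemini → Claude (balanced)', targetMargin: 80 },
  { minGems: 25, strategy: 'Gemini → GPT-4o-mini (efficient)', targetMargin: 85 },
  { minGems: 1, strategy: 'Gemini → GPT-4o-mini (budget)', targetMargin: 90 },
  { minGems: 0, strategy: 'Gemini (cheapest)', targetMargin: null }, // free tier
];

// Task types pinned to a model regardless of gem cost. Only life
// composition is named in the text; the key spelling is hypothetical.
const TASK_OVERRIDES = new Map([['life-composition', 'Claude']]);

function selectRoute(taskType, gemCost) {
  const tier = TIERS.find((t) => gemCost >= t.minGems);
  if (!tier) throw new RangeError(`Invalid gem cost: ${gemCost}`);

  // Task-type overrides take precedence over gem-cost routing.
  if (TASK_OVERRIDES.has(taskType)) {
    return {
      strategy: TASK_OVERRIDES.get(taskType),
      targetMargin: tier.targetMargin,
      source: 'task-override',
    };
  }
  return { strategy: tier.strategy, targetMargin: tier.targetMargin, source: 'gem-tier' };
}

console.log(selectRoute('quest-generation', 25));
console.log(selectRoute('life-composition', 10));
```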

Real-Time Margin Health Output

Every query through the routing system generates a margin health log. This is not post-hoc reporting — it is real-time margin monitoring at the individual query level, which enables immediate detection of routing decisions that are not hitting target margins.

services/unifiedAI.js (margin health output):

```javascript
console.log(result.marginHealth);
// {
//   healthy: true,
//   margin: 87.3,
//   targetMargin: 85,
//   profit: 2.18,
//   revenue: 2.50,
//   cost: 0.32,
//   modelUsed: 'gemini',
//   recommendation: 'Margin healthy - maintain current model selection'
// }
```

The output above is from a 25-gem query (revenue $2.50, target margin 85%). The actual achieved margin is 87.3% — slightly above target. Model used: Gemini. Cost: $0.32. Profit: $2.18. The recommendation field provides the routing system's self-assessment of whether this query's economics justify the current model assignment.
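The fields in that log follow from revenue and cost by simple arithmetic. The sketch below assumes a hypothetical `computeMarginHealth` helper; it is not the code in services/unifiedAI.js. With the rounded cost of $0.32 it yields a margin of 87.2 rather than the logged 87.3, so the logged margin was presumably computed from an unrounded cost. That is an inference, not something the source states.

```javascript
// Round to a fixed number of decimals for display.
function roundTo(value, decimals) {
  const factor = 10 ** decimals;
  return Math.round(value * factor) / factor;
}

// Hypothetical helper that derives a margin health record.
function computeMarginHealth({ revenue, cost, targetMargin, modelUsed }) {
  if (!(revenue > 0)) throw new RangeError('revenue must be positive');
  const profit = revenue - cost;
  const margin = (profit / revenue) * 100; // percent of revenue kept
  const healthy = margin >= targetMargin;
  return {
    healthy,
    margin: roundTo(margin, 1),
    targetMargin,
    profit: roundTo(profit, 2),
    revenue,
    cost,
    modelUsed,
    recommendation: healthy
      ? 'Margin healthy - maintain current model selection'
      : 'Margin below target - review model selection', // hypothetical wording
  };
}

const health = computeMarginHealth({
  revenue: 2.5,
  cost: 0.32,
  targetMargin: 85,
  modelUsed: 'gemini',
});
console.log(health);
```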

When the margin for a given query type drops below its target for 3+ consecutive requests, an automatic alert is generated. This early warning prevents margin erosion from accumulating: a model that has become more expensive, or a query type whose complexity has increased, is detected and re-routed before it materially impacts the P&L.
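One way to implement the consecutive-miss rule is a per-query-type streak counter. The `MarginAlertTracker` class below is a hypothetical sketch. Firing the alert once, when the streak first reaches three, is a design assumption; the source only says an alert is generated at 3+ consecutive below-target requests.

```javascript
class MarginAlertTracker {
  constructor(limit = 3) {
    this.limit = limit;
    this.streaks = new Map(); // queryType -> consecutive below-target count
  }

  // Record one request's outcome. Returns true exactly when the streak of
  // below-target requests for this query type first reaches the limit.
  record(queryType, healthy) {
    if (healthy) {
      this.streaks.set(queryType, 0); // a healthy request resets the streak
      return false;
    }
    const streak = (this.streaks.get(queryType) || 0) + 1;
    this.streaks.set(queryType, streak);
    return streak === this.limit;
  }
}

const tracker = new MarginAlertTracker();
const outcomes = [false, false, true, false, false, false];
const alerts = outcomes.map((healthy) => tracker.record('quest-generation', healthy));
console.log(alerts);
```

In this run only the last request raises an alert, because the healthy third request resets the streak.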

The Three-Tier Fallback Chain

The routing system does not rely on any single model. Every query has a three-tier fallback chain: Primary Model → Fallback → Emergency. If the primary model fails (API timeout, rate limit, service error), the system immediately attempts the fallback. If the fallback fails, the emergency model handles the query. The user sees no interruption.

Fallback Chain Examples

Life-composition (100 gems): Claude → Gemini → GPT-4o
Quest-generation (25 gems): Gemini → GPT-4o-mini → Gemini
Discovery-analysis (high complexity): Claude → Gemini → GPT-4o
Story-generation (high volume): Gemini → GPT-4o-mini → Gemini

The fallback selection preserves quality as much as possible. For high-sensitivity tasks, Gemini is the fallback rather than GPT-4o-mini because Gemini is closer in quality to Claude for analytical tasks. For high-volume, lower-sensitivity tasks, GPT-4o-mini is the fallback because it maintains acceptable quality at minimum cost. The fallback chain is designed, not defaulted.
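A fallback chain of this kind can be sketched as a loop that tries each model in order and moves on when a call fails. The chains are copied from the examples above. `callWithFallback` and `stubCallModel` are hypothetical names, and the stub stands in for real provider API calls, which the source does not show.

```javascript
const FALLBACK_CHAINS = {
  'life-composition': ['Claude', 'Gemini', 'GPT-4o'],
  'quest-generation': ['Gemini', 'GPT-4o-mini', 'Gemini'],
  'discovery-analysis': ['Claude', 'Gemini', 'GPT-4o'],
  'story-generation': ['Gemini', 'GPT-4o-mini', 'Gemini'],
};

// Try primary, then fallback, then emergency. callModel(model, prompt) is
// supplied by the caller and is expected to throw on timeout, rate limit,
// or service error.
async function callWithFallback(taskType, prompt, callModel) {
  const chain = FALLBACK_CHAINS[taskType];
  if (!chain) throw new Error(`No fallback chain for task type: ${taskType}`);

  const errors = [];
  for (const model of chain) {
    try {
      const text = await callModel(model, prompt);
      return { model, text, attempts: errors.length + 1 };
    } catch (err) {
      errors.push(`${model}: ${err.message}`);
    }
  }
  throw new Error(`All models failed (${errors.join('; ')})`);
}

// Stub provider call that simulates the primary model being rate limited.
async function stubCallModel(model, prompt) {
  if (model === 'Claude') throw new Error('rate limit');
  return `[${model}] response to: ${prompt}`;
}

callWithFallback('life-composition', 'summarize my week', stubCallModel)
  .then((result) => console.log(result.model, result.attempts));
```

With the stub above, the primary call fails and the request is served by Gemini on the second attempt.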

Why $0.02 When Claude Charges ~$0.10

The question that most often comes up when these numbers are presented: if Claude costs approximately $0.10 per query and this system is averaging $0.02, where does the 5x difference come from?

Three sources. First: 72% cache hits. 72% of queries cost essentially nothing beyond an embedding lookup. Only 28% of queries reach the model layer at all. Second: intelligent routing within the 28% of uncached queries. Of those that do reach the model layer, a large fraction go to Gemini Flash or GPT-4o-mini rather than Claude — cheaper by a factor of 3-10 depending on the specific model and context length. Third: token efficiency. The behavioral profile context is stored in a compressed structured format that uses fewer tokens per query than a naive approach of concatenating full conversation history.
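The decomposition can be checked with a blended-cost formula: average cost = hit rate × lookup cost + (1 − hit rate) × average uncached cost. The sketch below treats the embedding lookup as free, which is a simplifying assumption. Under it, the cache alone would bring a $0.10 query down to about $0.028 on average, and the reported $0.02 average implies that uncached queries cost about $0.07 each, with routing and token efficiency accounting for that remainder.

```javascript
// Average cost per query given a cache hit rate and per-path costs.
function blendedCost(hitRate, lookupCost, uncachedCost) {
  return hitRate * lookupCost + (1 - hitRate) * uncachedCost;
}

// Solve the same formula for the average cost of an uncached query.
function impliedUncachedCost(averageCost, hitRate, lookupCost) {
  return (averageCost - hitRate * lookupCost) / (1 - hitRate);
}

// Cache only: every uncached query goes to a ~$0.10 frontier model.
const cacheOnly = blendedCost(0.72, 0, 0.10);

// What the reported $0.02 average implies for the uncached 28%.
const uncached = impliedUncachedCost(0.02, 0.72, 0);

console.log(cacheOnly.toFixed(3), uncached.toFixed(4));
```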

72% cache effect (28% actually reach models)
3-10x routing savings (Gemini/mini vs. Claude)
5-10x total advantage (vs. naive frontier routing)

At 10,000 daily queries: industry standard $0.10-$0.20 × 10,000 = $1,000-$2,000/day. Platform average $0.02 × 10,000 = $200/day. The $800-$1,800 daily savings at that query volume add up quickly. At 100,000 daily queries, cost avoidance is $8,000-$18,000/day. At a constant $0.02 average, savings scale linearly with volume, so the cost structure stays 5-10x more efficient at every scale point.
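The daily figures are straightforward to reproduce. The helper below is a hypothetical sketch that assumes the per-query costs quoted above stay constant as volume changes.

```javascript
// Daily cost avoidance versus a baseline per-query cost, rounded to cents.
function dailySavings(queriesPerDay, baselineCost, platformCost) {
  const savings = queriesPerDay * (baselineCost - platformCost);
  return Math.round(savings * 100) / 100;
}

for (const volume of [10000, 100000]) {
  const low = dailySavings(volume, 0.10, 0.02);
  const high = dailySavings(volume, 0.20, 0.02);
  console.log(`${volume} queries/day: $${low}-$${high} saved`);
}
```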

The Compounding Effect: Getting Cheaper and Better Simultaneously

The most important economic property of this architecture is that scale improves both quality and economics simultaneously — a property almost unique to systems with behavioral profile depth and semantic caching combined.

The compounding mechanism: every interaction improves the behavioral profile, which improves recommendation quality. Improved quality increases user engagement. Increased engagement increases query volume. Increased query volume increases cache density. Higher cache density further reduces average cost per query. Lower cost enables more users at the same unit economics, which further increases cache density. The system is self-reinforcing on both the quality dimension and the cost dimension.

"The system gets cheaper and better simultaneously. Every new user at scale benefits from the cache built by every user who came before them — and improves it for everyone who comes after."

This is qualitatively different from standard SaaS unit economics where marginal cost of serving an additional user is roughly constant. The behavioral intelligence platform's marginal cost of the n-th user is lower than the marginal cost of the (n-1)-th user if the n-th user's queries have semantic matches in the cache from previous users. In a sufficiently dense corpus, the marginal cost approaches the embedding lookup cost — essentially zero — for most users' most frequent query types.

Content Seeding Economics: The Empty-to-Engaged Pipeline

The content seeding architecture addresses a specific economic problem: a new user with no interaction history generates only low-quality behavioral data, which means the system can only serve them generic content, which produces lower engagement, which generates less revenue per user in the early phase. The cold-start problem is not just a quality problem — it has direct economic consequences.

The hybrid seeding approach addresses this by front-loading value delivery even before the behavioral profile emerges. Registration triggers DNA seeding with behavioral templates calibrated to the user's stated goals. First quest completion produces the initial behavioral signals that begin the profile emergence process. The seeding architecture ensures that the transition from generic to personalized content happens as quickly as possible — typically within 10 interactions — because every interaction in the generic phase is economically less efficient than an interaction in the personalized phase.

Content Seeding Architecture Stages

Registration → DNA seeding (generic): Template content calibrated to stated goals. Engagement: moderate. Revenue yield: baseline.

First quest completion → profile emergence: Initial behavioral signals extracted. Seeding begins personalizing. Engagement: improving.

10+ interactions → personalized seeding active: Full behavioral intelligence driving recommendations. Engagement: high. Revenue yield: optimal.
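The three stages amount to a small state decision based on a user's activity. The sketch below is illustrative only: the `seedingStage` function, its field names, and the stage labels are assumptions, and the thresholds are taken from the stage descriptions above.

```javascript
// Map a user's activity counters to a seeding stage.
function seedingStage({ questsCompleted = 0, interactions = 0 } = {}) {
  if (interactions >= 10) return 'personalized'; // personalized seeding active
  if (questsCompleted >= 1) return 'profile-emergence'; // first signals extracted
  return 'dna-seeding'; // generic templates from stated goals
}

console.log(seedingStage({ questsCompleted: 0, interactions: 0 }));
console.log(seedingStage({ questsCompleted: 1, interactions: 3 }));
console.log(seedingStage({ questsCompleted: 4, interactions: 12 }));
```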

Business Roadmap: $3.75M ARR Path

The unit economics justify a specific commercial trajectory. The primary B2B2C channel: HR platforms (Workday, SAP SuccessFactors, Oracle HCM), enterprise L&D platforms, and direct enterprise licensing. The pricing model: $10 per behavioral assessment with 90% margins = $9 profit per assessment at scale.

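The path to $3.75M ARR: 375,000 assessments per year across the B2B2C channel. At 10,000 enterprise users generating 37.5 assessments each annually (roughly 3 per month), this is achievable with modest enterprise penetration. The unit economics at that scale: 375,000 × $10 revenue = $3,750,000. If each assessment cost only the $0.02 platform-average query, COGS would be 375,000 × $0.02 = $7,500, a 99.8% gross margin. That is a best case, because it assumes one average-cost query per assessment. Under the 90% margin assumed in the pricing model above, COGS is $375,000 and gross profit is $3,375,000.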
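The projection can be reproduced with a few multiplications. The `arrProjection` helper is a hypothetical sketch. Its `costPerAssessment` input is the key assumption: $0.02 corresponds to one platform-average query per assessment, while $1.00 corresponds to the 90% margin stated in the pricing model.

```javascript
// Annual revenue, COGS, and gross margin for an assessment business.
function arrProjection({ users, assessmentsPerUser, price, costPerAssessment }) {
  const assessments = users * assessmentsPerUser;
  const revenue = assessments * price;
  const cogs = assessments * costPerAssessment;
  const grossMarginPct = ((revenue - cogs) / revenue) * 100;
  return { assessments, revenue, cogs, grossMarginPct };
}

const base = { users: 10000, assessmentsPerUser: 37.5, price: 10 };
const bestCase = arrProjection({ ...base, costPerAssessment: 0.02 });
const pricingModel = arrProjection({ ...base, costPerAssessment: 1.0 });

console.log(bestCase.revenue, bestCase.grossMarginPct.toFixed(1));
console.log(pricingModel.cogs, pricingModel.grossMarginPct.toFixed(1));
```

The gap between the two scenarios shows how sensitive the margin is to the number and cost of queries behind each assessment.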

The discovery engine component adds an additional revenue layer that is not included in the above projection. As the discovery engine produces peer-reviewed results, enterprise scientific organizations — pharmaceutical companies, research labs, academic institutions — become potential customers for discovery engine access. The incremental infrastructure cost is near-zero because the discovery engine runs on the same architecture as the behavioral intelligence system.

Why the Unit Economics Are Defensible

A competitor entering this market today faces a specific challenge: they cannot replicate the cache density without the interaction history, and they cannot replicate the interaction history without the behavioral intelligence quality that drives engagement, and they cannot replicate the behavioral intelligence quality without the 22-layer architecture and 300-dimension profile depth. The unit economics are a lagging indicator of behavioral corpus depth — and behavioral corpus depth requires time and interactions, not just capital.

The Honest Caveat on Margins

The margin figures require one honest qualification. The 70-90% gross margin range covers AI inference costs only. Platform infrastructure (MongoDB Atlas, hosting, CDN, logging), human support, and sales and marketing are not included in the gross margin calculation. Net margin after all operating costs is lower — the 70-90% figures are COGS margins, not operating margins.

This is narrower than the usual SaaS definition of gross margin, which typically counts hosting and customer support as cost of revenue. The inference-only COGS structure is genuinely favorable, but a complete economic picture includes the operating cost structure, which at current scale includes infrastructure costs that are not yet fully amortized across a large user base. The path to the unit economics described here at scale is real; the current economics during the growth phase include the fixed cost overhead of a platform that is not yet at scale.

The structural advantage — semantic cache density improving with scale, intelligent routing preventing cost inflation, behavioral profiles getting denser and cheaper to serve over time — is real and compounding. The current economics during the scaling phase are the cost of building the dense corpus that makes the mature-state economics achievable. This is the correct framing: invest in behavioral corpus density now; harvest the compounding unit economics at scale.