The demos always work. The hallucinations, agent loops, prompt drift, and latency spikes happen after you go live. Here is the unfiltered story of building Genie, our agentic pipeline, LLM chatbot, and GenAI content engine at Edvoy.
Between 2022 and 2024, we shipped four production AI systems at Edvoy: customer-facing products serving more than a million students across 90+ countries as they made real decisions about universities, visas, and scholarships. The failure modes were creative, and almost none of them were covered in the blog posts we read before we started. This is the article I wish had existed.
Every AI system we shipped taught us something new about the gap between prototype and production. Here are the failure modes that hit us hardest, and what we did about each one.
For Genie, hallucination was a trust and liability issue. A wrong answer about UK Tier 4 visa requirements could result in a refusal, a lost deposit, and a furious support ticket.
Three weeks after Genie launched, we discovered that for a specific query pattern involving conditional visa rules, the system was occasionally synthesising a plausible-sounding but incorrect answer by blending retrieval results from two different destination country policies. Our CSAT scores for that query category dropped to 2.1 out of 5. We caught it from user feedback, not from our eval suite.
The fix was architectural: domain-scoped retrieval (UK queries only retrieve from UK-tagged nodes), confidence thresholds with hard fallbacks to human counsellors below 0.72, and weekly adversarial query audits to find failure patterns before students did.
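The routing logic behind that fix is small enough to sketch. This is an illustrative Python sketch, not our production code; the `RetrievedAnswer` shape, the `route_answer` name, and the escalation sentinel are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class RetrievedAnswer:
    text: str
    confidence: float  # pipeline confidence score, 0.0-1.0
    country: str       # destination-country tag on the retrieved nodes

CONFIDENCE_FLOOR = 0.72  # below this, hand off to a human counsellor

def route_answer(answer: RetrievedAnswer, query_country: str) -> str:
    # Domain scoping: never serve an answer assembled from another
    # country's policy nodes, no matter how confident it looks.
    if answer.country != query_country:
        return "ESCALATE_TO_COUNSELLOR"
    # Hard fallback below the confidence threshold.
    if answer.confidence < CONFIDENCE_FLOOR:
        return "ESCALATE_TO_COUNSELLOR"
    return answer.text
```

The point of the sketch is that both checks are hard gates, not soft signals fed back into the model.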
Our agentic pipeline combined multiple agents, tool use, memory persistence, and channel integrations. In staging: flawless. In production: creative failure modes we never imagined.
One of our nudge agents entered a retry loop after a WhatsApp delivery failure. Because the failure was not surfaced to the orchestrator correctly, the agent kept re-triggering. One student received 23 WhatsApp messages in 40 minutes before our monitoring alert fired. We discovered it because the student called us, not because our system caught it.
Three lessons: agent actions that touch external channels must be idempotent and rate-limited at the infrastructure level; orchestrators need circuit breakers, with hard limits per entity per time window; and agentic monitoring must track action chains, not just individual API calls.
Our LLM chatbot launched with well-tested prompts. Three months later, conversation quality CSAT had declined 11% from baseline. No incidents. No errors. Just slow, invisible degradation.
> Prompt drift is not a single failure event. It is the slow accumulation of edge cases, model version updates, and prompt interactions that degrade quality in a way that no single monitoring alert will catch. By the time you notice it, it has been happening for weeks.
Root causes: the model had been silently updated by the provider, new user query patterns exposed gaps, and iteratively-added prompt components had introduced contradictions. Fix: a prompt versioning system with A/B testing before rollout, and weekly conversation quality audits — 50 randomly sampled conversations reviewed against a rubric. Tedious. Absolutely necessary.
When Genie produced a bad answer, we assumed the LLM failed. Most of the time, the retrieval had failed — the LLM was generating confident prose from the wrong chunks. Fixed-size chunking split critical context across boundaries. We switched to semantic chunking and added explicit metadata filters (Canadian study permit queries must never retrieve UK Tier 4 guidance, regardless of cosine similarity). Both should have been in the architecture from day one.
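The metadata-filter rule is worth making concrete: the country tag is a hard filter applied before similarity ranking, so cosine similarity can never pull in another country's guidance. An illustrative sketch with plain-Python vectors; the node dict shape is an assumption:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_top_k(query_vec, nodes, country: str, k: int = 5):
    """Rank by cosine similarity, but only among nodes tagged with the
    query's country. The filter runs first, so similarity never overrides it."""
    scoped = [n for n in nodes if n["country"] == country]
    return sorted(scoped, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)[:k]
```

In a real vector store this is typically expressed as a metadata pre-filter on the query rather than a post-hoc list comprehension, but the ordering of operations is the point.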
Genie's response latency averaged 1.8 seconds in testing. At peak load it rose to 6–8 seconds, and a correct answer delivered 7 seconds late is a failed product experience. We implemented semantic similarity caching, streaming responses, and async pre-fetching. Peak latency dropped to 1.2 seconds.
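Semantic similarity caching is the least familiar of those three techniques, so here is a minimal sketch: serve a stored answer when a new query's embedding is close enough to a previously answered one. A linear scan stands in for the approximate-nearest-neighbour index a real deployment would use, and the threshold is illustrative:

```python
import math

class SemanticCache:
    """Serve a cached answer when a new query embedding is close enough
    to one we have already answered, skipping the LLM call entirely."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query_vec):
        best = max(self._entries,
                   key=lambda e: self._cosine(query_vec, e[0]),
                   default=None)
        if best and self._cosine(query_vec, best[0]) >= self.threshold:
            return best[1]  # cache hit
        return None  # cache miss: caller runs the full pipeline

    def put(self, query_vec, answer: str):
        self._entries.append((query_vec, answer))
```

The threshold is the trade-off dial: too low and paraphrases of different questions collide; too high and the cache never fires.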
Publishing content at scale meant quality failures had SEO, brand, and accuracy consequences. Two issues dominated:
At 10,000 pieces a month, every university profile started sounding identical: same adjectives, same templates, same closing sentences. We built a style diversity injection system with explicit structural variation, and a similarity check that flagged pieces with cosine similarity above 0.85 to recent content, triggering re-generation at a higher temperature.
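The similarity gate itself reduces to a single pass over embeddings. A sketch, assuming each published piece already has an embedding vector; the function name is illustrative, and only the 0.85 default mirrors the threshold above:

```python
import math

def needs_regeneration(new_vec, recent_vecs, limit: float = 0.85) -> bool:
    """Flag a draft whose embedding sits too close to any recently
    published piece; the caller re-generates at higher temperature."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    return any(cosine(new_vec, v) > limit for v in recent_vecs)
```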
When source data was stale, the AI generated fluent content about outdated fees and discontinued programmes. It passed every fluency check and failed every accuracy check. We introduced a data freshness gate: no content generated from data older than 90 days without a human review flag. Throughput dropped 15%; trust failures eliminated.
After shipping four AI systems, we developed an internal framework for evaluating the production-readiness of any AI feature before launch. We call it the SHARP checklist:
| Letter | Dimension | The Question We Ask |
|---|---|---|
| S | Safety | What is the worst thing this system can tell a user, and have we tested for it explicitly? |
| H | Handoff | When and how does this system escalate to a human, and is that path clearly designed? |
| A | Accuracy | Do we have an eval suite that measures factual accuracy, not just fluency, on domain-specific test cases? |
| R | Reliability | Have we load-tested at 3x expected peak? Do we have circuit breakers, rate limits, and fallbacks? |
| P | Performance | Is the latency acceptable on the worst-case connection our users are likely to have? |
No AI feature ships at Edvoy without a green on all five dimensions. It has slowed us down slightly and saved us from at least three significant incidents.
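As a sketch, the gate is nothing more than five booleans that must all be true; the value is in forcing each question to be answered explicitly before launch. This dataclass is an illustration of the process, not an internal tool:

```python
from dataclasses import dataclass

@dataclass
class SharpChecklist:
    safety: bool       # worst-case output identified and tested explicitly
    handoff: bool      # human escalation path designed and reachable
    accuracy: bool     # factual eval suite on domain-specific test cases
    reliability: bool  # load-tested at 3x peak; breakers and fallbacks in place
    performance: bool  # latency acceptable on worst-case user connections

    def ship(self) -> bool:
        # Green on all five dimensions, or the feature does not launch.
        return all((self.safety, self.handoff, self.accuracy,
                    self.reliability, self.performance))
```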
All four AI systems are live, stable, and improving. The early failures were expensive in time and credibility. The lessons we extracted from them are now institutional knowledge that makes every subsequent AI build faster and safer.
The gap between AI that works in a demo and AI that works reliably in production is not a technical gap — it is a product discipline gap. The organisations building lasting AI products treat it like any other customer-facing system: rigorous design, eval suites, monitoring, iteration. Production AI is harder than demo AI not because of the technology but because of the accountability. And that accountability is exactly what makes it worth building well.