The demos always work. The hallucinations, agent loops, prompt drift, and latency spikes happen after you go live. Here is the unfiltered story of building Genie, our agentic pipeline, LLM chatbot, and GenAI content engine at Edvoy.
Between 2022 and 2024, we shipped four production AI systems at Edvoy: customer-facing products serving more than a million students across 90+ countries as they made real decisions about universities, visas, and scholarships. The failure modes were creative, and almost none of them were covered in the blog posts we read before we started. This is the article I wish had existed.
Every AI system we shipped taught us something new about the gap between prototype and production. Here are the failure modes that hit us hardest, and what we did about each one.
For Genie, hallucination was a trust and liability issue. A wrong answer about UK Tier 4 visa requirements could result in a refusal, a lost deposit, and a furious support ticket.
Three weeks after Genie launched, we discovered that for a specific query pattern involving conditional visa rules, the system was occasionally synthesising a plausible-sounding but incorrect answer by blending retrieval results from two different destination country policies. Our CSAT scores for that query category dropped to 2.1 out of 5. We caught it from user feedback, not from our eval suite.
The fix was architectural: domain-scoped retrieval (UK queries only retrieve from UK-tagged nodes), confidence thresholds with hard fallbacks to human counsellors below 0.72, and weekly adversarial query audits to find failure patterns before students did.
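The routing logic behind that fix is small enough to sketch. This is an illustrative Python sketch, not our production code; the `RetrievedAnswer` shape, the `route_answer` name, and the escalation sentinel are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class RetrievedAnswer:
    text: str
    confidence: float  # pipeline confidence score, 0.0-1.0
    country: str       # destination-country tag on the retrieved nodes

CONFIDENCE_FLOOR = 0.72  # below this, hand off to a human counsellor

def route_answer(answer: RetrievedAnswer, query_country: str) -> str:
    # Domain scoping: never serve an answer assembled from another
    # country's policy nodes, no matter how confident it looks.
    if answer.country != query_country:
        return "ESCALATE_TO_COUNSELLOR"
    # Hard fallback below the confidence threshold.
    if answer.confidence < CONFIDENCE_FLOOR:
        return "ESCALATE_TO_COUNSELLOR"
    return answer.text
```

The point of the sketch is that both checks are hard gates, not soft signals fed back into the model.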
Our agentic pipeline combined multiple agents, tool use, memory persistence, and channel integrations. In staging: flawless. In production: creative failure modes we never imagined.
One of our nudge agents entered a retry loop after a WhatsApp delivery failure. Because the failure was not surfaced to the orchestrator correctly, the agent kept re-triggering. One student received 23 WhatsApp messages in 40 minutes before our monitoring alert fired. We discovered it because the student called us, not because our system caught it.
Three lessons: agent actions that touch external channels must be idempotent and rate-limited at the infrastructure level; orchestrators need circuit breakers, with hard limits per entity per time window; and agentic monitoring must track action chains, not just individual API calls.
Our LLM chatbot launched with well-tested prompts. Three months later, conversation quality CSAT had declined 11% from baseline. No incidents. No errors. Just slow, invisible degradation.
> Prompt drift is not a single failure event. It is the slow accumulation of edge cases, model version updates, and prompt interactions that degrade quality in a way that no single monitoring alert will catch. By the time you notice it, it has been happening for weeks.
Root causes: the model had been silently updated by the provider, new user query patterns exposed gaps, and iteratively-added prompt components had introduced contradictions. Fix: a prompt versioning system with A/B testing before rollout, and weekly conversation quality audits — 50 randomly sampled conversations reviewed against a rubric. Tedious. Absolutely necessary.
When Genie produced a bad answer, we assumed the LLM failed. Most of the time, the retrieval had failed — the LLM was generating confident prose from the wrong chunks. Fixed-size chunking split critical context across boundaries. We switched to semantic chunking and added explicit metadata filters (Canadian study permit queries must never retrieve UK Tier 4 guidance, regardless of cosine similarity). Both should have been in the architecture from day one.
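The metadata-filter rule is worth making concrete: the country tag is a hard filter applied before similarity ranking, so cosine similarity can never pull in another country's guidance. An illustrative sketch with plain-Python vectors; the node dict shape is an assumption:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_top_k(query_vec, nodes, country: str, k: int = 5):
    """Rank by cosine similarity, but only among nodes tagged with the
    query's country. The filter runs first, so similarity never overrides it."""
    scoped = [n for n in nodes if n["country"] == country]
    return sorted(scoped, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)[:k]
```

In a real vector store this is typically expressed as a metadata pre-filter on the query rather than a post-hoc list comprehension, but the ordering of operations is the point.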
Genie's response latency averaged 1.8 seconds in testing. At peak load it rose to 6–8 seconds, and a correct answer delivered 7 seconds late is a failed product experience. We implemented semantic similarity caching, streaming responses, and async pre-fetching. Peak latency dropped to 1.2 seconds.
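Semantic similarity caching is the least familiar of those three techniques, so here is a minimal sketch: serve a stored answer when a new query's embedding is close enough to a previously answered one. A linear scan stands in for the approximate-nearest-neighbour index a real deployment would use, and the threshold is illustrative:

```python
import math

class SemanticCache:
    """Serve a cached answer when a new query embedding is close enough
    to one we have already answered, skipping the LLM call entirely."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query_vec):
        best = max(self._entries,
                   key=lambda e: self._cosine(query_vec, e[0]),
                   default=None)
        if best and self._cosine(query_vec, best[0]) >= self.threshold:
            return best[1]  # cache hit
        return None  # cache miss: caller runs the full pipeline

    def put(self, query_vec, answer: str):
        self._entries.append((query_vec, answer))
```

The threshold is the trade-off dial: too low and paraphrases of different questions collide; too high and the cache never fires.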
Publishing content at scale meant quality failures had SEO, brand, and accuracy consequences. Two issues dominated:
At 10,000 pieces a month, every university profile started sounding identical: same adjectives, same templates, same closing sentences. We built a style diversity injection system with explicit structural variation, and a similarity check that flagged pieces with cosine similarity above 0.85 to recent content, triggering re-generation at a higher temperature.
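The similarity gate itself reduces to a single pass over embeddings. A sketch, assuming each published piece already has an embedding vector; the function name is illustrative, and only the 0.85 default mirrors the threshold above:

```python
import math

def needs_regeneration(new_vec, recent_vecs, limit: float = 0.85) -> bool:
    """Flag a draft whose embedding sits too close to any recently
    published piece; the caller re-generates at higher temperature."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    return any(cosine(new_vec, v) > limit for v in recent_vecs)
```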
When source data was stale, the AI generated fluent content about outdated fees and discontinued programmes. It passed every fluency check and failed every accuracy check. We introduced a data freshness gate: no content generated from data older than 90 days without a human review flag. Throughput dropped 15%; trust failures eliminated.
After shipping four AI systems, we developed an internal framework for evaluating the production-readiness of any AI feature before launch. We call it the SHARP checklist:
| Letter | Dimension | The Question We Ask |
|---|---|---|
| S | Safety | What is the worst thing this system can tell a user, and have we tested for it explicitly? |
| H | Handoff | When and how does this system escalate to a human, and is that path clearly designed? |
| A | Accuracy | Do we have an eval suite that measures factual accuracy, not just fluency, on domain-specific test cases? |
| R | Reliability | Have we load-tested at 3x expected peak? Do we have circuit breakers, rate limits, and fallbacks? |
| P | Performance | Is the latency acceptable on the worst-case connection our users are likely to have? |
No AI feature ships at Edvoy without a green on all five dimensions. It has slowed us down slightly and saved us from at least three significant incidents.
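As a sketch, the gate is nothing more than five booleans that must all be true; the value is in forcing each question to be answered explicitly before launch. This dataclass is an illustration of the process, not an internal tool:

```python
from dataclasses import dataclass

@dataclass
class SharpChecklist:
    safety: bool       # worst-case output identified and tested explicitly
    handoff: bool      # human escalation path designed and reachable
    accuracy: bool     # factual eval suite on domain-specific test cases
    reliability: bool  # load-tested at 3x peak; breakers and fallbacks in place
    performance: bool  # latency acceptable on worst-case user connections

    def ship(self) -> bool:
        # Green on all five dimensions, or the feature does not launch.
        return all((self.safety, self.handoff, self.accuracy,
                    self.reliability, self.performance))
```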
All four AI systems are live, stable, and improving. The early failures were expensive in time and credibility. The lessons we extracted from them are now institutional knowledge that makes every subsequent AI build faster and safer.
The gap between AI that works in a demo and AI that works reliably in production is not a technical gap — it is a product discipline gap. The organisations building lasting AI products treat it like any other customer-facing system: rigorous design, eval suites, monitoring, iteration. Production AI is harder than demo AI not because of the technology but because of the accountability. And that accountability is exactly what makes it worth building well.