{"id":1546,"date":"2026-06-03T08:10:40","date_gmt":"2026-06-03T15:10:40","guid":{"rendered":"https:\/\/www.kenwalger.com\/blog\/?p=1546"},"modified":"2026-06-03T08:57:53","modified_gmt":"2026-06-03T15:57:53","slug":"operating-real-time-ai-slas-observability-and-knowing-when-its-broken","status":"publish","type":"post","link":"https:\/\/www.kenwalger.com\/blog\/ai\/operating-real-time-ai-slas-observability-and-knowing-when-its-broken\/","title":{"rendered":"Operating Real-Time AI: SLAs, Observability, and Knowing When It&#8217;s Broken"},"content":{"rendered":"<p>The previous four posts in this series covered the three architectural pillars of real-time AI at scale: feature pipelines, feature stores, and vector search. Each post addressed the design decisions and failure modes specific to one layer of the stack.<\/p>\n<p>This final post is about the layer that sits above all of them: operations.<\/p>\n<p>You can design a technically sound pipeline, a well-structured feature store, and a carefully maintained vector index \u2014 and still have a system that&#8217;s difficult to run in production, slow to recover from failures, and chronically unclear about whether it&#8217;s actually working. The difference between a system that&#8217;s architecturally sound and one that&#8217;s operationally mature is the difference between a system that was designed and one that was <em>operated<\/em>.<\/p>\n<p>This post is about what operational maturity looks like for real-time AI systems: how to define what &#8220;working&#8221; means, how to know when it isn&#8217;t, and how to recover when things go wrong.<\/p>\n<hr \/>\n<h2>Start With the SLA: What Are You Actually Promising?<\/h2>\n<p>Every discussion of operations should begin with the service level agreement \u2014 not as a compliance document, but as a forcing function for clarity.<\/p>\n<p>An SLA for a real-time AI system needs to answer four questions:<\/p>\n<p><strong>1. 
What is the latency target?<\/strong><br \/>\nNot just average latency \u2014 P99. The 99th percentile is where user-visible degradation lives. &#8220;Average latency is 50ms&#8221; is compatible with &#8220;1% of requests take 2 seconds,&#8221; which is likely unacceptable for a real-time user-facing system. Define your latency target at P99, and optionally P999 for systems where tail latency matters especially.<\/p>\n<p><strong>2. What is the availability target?<\/strong><br \/>\nWhat fraction of requests must succeed, over what time window? 99.9% availability means roughly 8.7 hours of allowable downtime per year. 99.99% means 52 minutes. The difference in operational complexity between those two targets is significant \u2014 know which one you&#8217;re designing for.<\/p>\n<p><strong>3. What is the freshness target?<\/strong><br \/>\nFor real-time AI specifically, this is a dimension that generic SLA frameworks often omit. How stale can features be before the system is considered degraded? How old can vector index updates be before search quality is affected? Freshness is a correctness dimension, not just a performance dimension.<\/p>\n<p><strong>4. What is the recall target?<\/strong><br \/>\nFor systems that use vector search, recall is part of the quality contract. A system returning search results with 60% recall is functionally broken for many use cases, even if it&#8217;s technically available and within latency targets. Define a minimum acceptable recall threshold and treat violations as SLA breaches.<\/p>\n<p>These four dimensions \u2014 latency, availability, freshness, recall \u2014 form the complete SLA surface for a real-time AI system. Most teams define the first two and ignore the last two. 
The last two are where silent degradation hides.<\/p>\n<hr \/>\n<h2>The Latency Budget: Where Time Actually Goes<\/h2>\n<p>Once you have a P99 latency target, the next step is a latency budget \u2014 an explicit allocation of that target across each component in the serving path.<\/p>\n<p>A typical real-time inference serving path looks something like this:<\/p>\n<pre><code>Request received\n    \u2502\n    \u251c\u2500\u2500 Feature retrieval (online store lookup)\n    \u2502\n    \u251c\u2500\u2500 Vector search (ANN index query)\n    \u2502\n    \u251c\u2500\u2500 Feature assembly (merge, null handling, type coercion)\n    \u2502\n    \u251c\u2500\u2500 Model inference (forward pass)\n    \u2502\n    \u251c\u2500\u2500 Post-processing (result formatting, business logic)\n    \u2502\n    \u2514\u2500\u2500 Response returned\n<\/code><\/pre>\n<p>Without a latency budget, each component is implicitly allocated &#8220;whatever it takes.&#8221; With a budget, each component has an explicit ceiling, and crossing that ceiling is an actionable signal rather than background noise.<\/p>\n<p>A worked example for a 100ms P99 target:<\/p>\n<table>\n<thead>\n<tr>\n<th>Component<\/th>\n<th>Budget<\/th>\n<th>Notes<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Network (ingress + egress)<\/td>\n<td>10ms<\/td>\n<td>Largely fixed; optimize for geographic proximity<\/td>\n<\/tr>\n<tr>\n<td>Feature retrieval<\/td>\n<td>15ms<\/td>\n<td>Batch point lookup; single round-trip<\/td>\n<\/tr>\n<tr>\n<td>Vector search<\/td>\n<td>25ms<\/td>\n<td>ANN query; tunable via <code>ef<\/code> parameter<\/td>\n<\/tr>\n<tr>\n<td>Feature assembly<\/td>\n<td>5ms<\/td>\n<td>In-process; should be negligible<\/td>\n<\/tr>\n<tr>\n<td>Model inference<\/td>\n<td>35ms<\/td>\n<td>Depends on model size and hardware<\/td>\n<\/tr>\n<tr>\n<td>Post-processing<\/td>\n<td>5ms<\/td>\n<td>Business logic; should be bounded<\/td>\n<\/tr>\n<tr>\n<td><strong>Total<\/strong><\/td>\n<td><strong>95ms<\/strong><\/td>\n<td>5ms 
headroom at P99<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The budget makes tradeoffs visible. If the model inference step takes 60ms instead of 35ms, you know immediately which other components need to compress to compensate \u2014 or that the overall target needs to be renegotiated. Without the budget, a 60ms model inference step is just &#8220;the model is slow,&#8221; with no clear next action.<\/p>\n<p>Latency budgets should be enforced in monitoring. If feature retrieval regularly exceeds its allocation, that&#8217;s an alert, not just a data point.<\/p>\n<hr \/>\n<h2>Observability: The Full Signal Stack<\/h2>\n<p>Observability for real-time AI systems requires monitoring signals at every layer of the stack. Most infrastructure monitoring covers the compute and network layers well. The AI-specific layers \u2014 feature freshness, value distributions, recall \u2014 are almost always underinstrumented.<\/p>\n<p>The complete signal stack looks like this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.kenwalger.com\/blog\/wp-content\/uploads\/2026\/06\/mermaid-diagram-2026-06-03-085534-scaled.png\"><\/p>\n<p>A few of these signals deserve particular attention because they&#8217;re routinely absent from production monitoring even in mature engineering organizations.<\/p>\n<p><strong>Feature null rate at inference time.<\/strong> When a feature value is missing \u2014 because an entity is new, because a pipeline failed, because a schema changed \u2014 most feature stores serve a default value silently. The null rate tells you how often this is happening. A sudden spike in null rate is a leading indicator of pipeline failure, schema drift, or cold start volume changes. 
Without tracking it, you&#8217;re flying blind on a significant dimension of input quality.<\/p>\n<p><strong>Prediction distribution drift.<\/strong> If the statistical distribution of your model&#8217;s outputs shifts \u2014 more extreme scores, a different mean, a collapsed variance \u2014 something upstream has changed. It might be a feature pipeline issue, a data quality problem, or genuine change in the underlying population. Monitoring output distribution doesn&#8217;t tell you which, but it tells you something changed, which is the signal that starts the investigation.<\/p>\n<p><strong>Training-serving skew over time.<\/strong> We covered training-serving skew as an architectural problem in Posts 2 and 3. Here it&#8217;s an operational metric. Periodically sampling serving-time feature values and comparing their distribution to training-time values catches skew that accumulates gradually \u2014 not from a single bad deployment, but from slow drift in source data, transformation logic, or serving behavior.<\/p>\n<hr \/>\n<h2>Failure Modes and Recovery Patterns<\/h2>\n<h3>Pipeline Failures<\/h3>\n<p>Batch pipeline failures are the most straightforward: a job fails, the scheduler reports it, and the on-call engineer can rerun it. The question is whether the feature store degrades gracefully in the interim.<\/p>\n<p><strong>Design for stale-but-available.<\/strong> A feature store that returns stale values when the pipeline is delayed is better than one that returns errors. Stale values keep the model running, possibly with reduced quality. Errors stop the model from running entirely. Build explicit staleness thresholds: values older than N minutes trigger alerts; values older than M minutes trigger fallback behavior.<\/p>\n<p>Streaming pipeline failures are more complex. A streaming job that falls behind on processing \u2014 accumulating lag in the event queue \u2014 may not fail outright. 
It may continue processing, but with increasing delay, silently delivering features that are progressively more stale. <strong>Stream lag monitoring<\/strong> is the signal: track the gap between when events are produced and when they&#8217;re processed, and alert when it crosses a threshold.<\/p>\n<pre><code class=\"language-python\"># Stream lag alert (conceptual sketch)\n# get_lag, get_processing_rate, and alert are placeholder callables for your\n# own Kafka client and alerting stack; they are not real library functions.\ndef check_stream_lag(consumer_group, max_lag_seconds, get_lag, get_processing_rate, alert):\n    lag = get_lag(consumer_group)  # messages behind the newest offset\n    processing_rate = get_processing_rate(consumer_group)  # messages per second\n\n    if lag == 0:\n        return  # fully caught up; an idle consumer is not a stalled one\n\n    # Time to drain the backlog at the current rate; a stalled consumer never catches up\n    estimated_catchup_seconds = lag \/ processing_rate if processing_rate &gt; 0 else float('inf')\n\n    if estimated_catchup_seconds &gt; max_lag_seconds:\n        alert(\n            f\"Stream lag critical: {lag} messages behind, \"\n            f\"estimated {estimated_catchup_seconds:.0f}s to catch up\"\n        )\n<\/code><\/pre>\n<h3>Feature Store Failures<\/h3>\n<p>The online store is on the critical path for every inference request. 
Its failure mode is a total serving outage unless the system is designed with a fallback.<\/p>\n<p><strong>Fallback strategies in priority order:<\/strong><\/p>\n<ol>\n<li><strong>Serve from cache.<\/strong> If the serving layer caches recent feature retrievals, a brief online store outage can be absorbed without user impact for entities whose features were recently accessed.<\/p>\n<\/li>\n<li><strong>Serve defaults.<\/strong> Pre-computed default feature vectors \u2014 global averages, segment priors, or zero vectors \u2014 can keep the model running at reduced quality during an outage.<\/p>\n<\/li>\n<li><strong>Degrade gracefully.<\/strong> For some use cases, serving a simpler non-ML fallback (most popular items, rule-based decisions) is preferable to serving degraded ML predictions.<\/p>\n<\/li>\n<li><strong>Fail fast.<\/strong> For use cases where prediction quality is critical and degraded predictions are worse than no predictions, explicit failure with a clear error is the right answer.<\/p>\n<\/li>\n<\/ol>\n<p>The right strategy depends on your use case. What&#8217;s universally wrong is having no strategy \u2014 discovering during an incident that the serving layer has no fallback path and needs to be designed under pressure.<\/p>\n<h3>Vector Index Failures<\/h3>\n<p>Vector index failures are typically not binary. The index doesn&#8217;t go down \u2014 it degrades. Recall drops. Latency increases. Results become less relevant.<\/p>\n<p>The operational response to index degradation depends on how it&#8217;s detected:<\/p>\n<p><strong>If recall drops below threshold:<\/strong> Trigger an index rebuild or compaction. In a segment-based architecture, compacting the most degraded segments may be sufficient. In a monolithic index, a full rebuild is required \u2014 which means managing traffic during the rebuild window.<\/p>\n<p><strong>If latency increases without load increase:<\/strong> Check tombstone accumulation. 
An index with a high fraction of deleted vectors will show latency increases before recall visibly degrades. Triggering a cleanup or rebuild early \u2014 before recall becomes a problem \u2014 is cheaper than reacting after the fact.<\/p>\n<p><strong>During an embedding model migration:<\/strong> The dual-index serving strategy is the safest path. Route queries to both the old and new index, returning results from the new index where available and falling back to the old index for records not yet recomputed. Monitor the migration percentage and recall on both indices throughout.<\/p>\n<hr \/>\n<h2>Capacity Planning: Designing Ahead of the Problem<\/h2>\n<p>Real-time AI systems fail at scale in predictable ways. Capacity planning is the practice of anticipating those failures before they occur.<\/p>\n<p><strong>Feature store capacity<\/strong> is driven by three variables: the number of entities, the number of features per entity, and the update rate. As any of these grow, both storage cost and write throughput requirements increase. The online store is typically the binding constraint \u2014 it&#8217;s expensive, and adding capacity requires planning time.<\/p>\n<p>Model the growth of each variable separately. A user feature store that grows linearly with your user base is predictable. One that grows with user activity \u2014 where active users generate many feature updates per day \u2014 can grow superlinearly. Know which one you have.<\/p>\n<p><strong>Vector index capacity<\/strong> is driven by vector count, vector dimensionality, and query rate. 
Memory requirements for HNSW indices are roughly:<\/p>\n<pre><code>Memory (bytes) \u2248 num_vectors \u00d7 (dimension \u00d7 4 bytes + M \u00d7 8 bytes)\n\nWhere M is the HNSW connectivity parameter (typically 16-64).\nThe first term is the raw float32 vectors; the second approximates the graph links.\n\nExample: 10M vectors, 1536 dimensions, M=32\n\u2248 10M \u00d7 (1536 \u00d7 4 + 32 \u00d7 8)\n\u2248 10M \u00d7 (6144 + 256)\n\u2248 10M \u00d7 6400\n\u2248 64 GB\n<\/code><\/pre>\n<p>At 10 million vectors of typical embedding dimensionality, you&#8217;re looking at 50-100GB of memory for the raw vectors and graph links combined, with the vectors themselves accounting for most of it (about 96% in the example above). The estimate excludes per-vector metadata and any replicas. Planning for this before you hit the wall is significantly cheaper than scaling under pressure.<\/p>\n<p><strong>Inference compute capacity<\/strong> is the most familiar capacity planning domain, but AI workloads have spikier profiles than many web workloads. Model inference is CPU- or GPU-bound rather than I\/O-bound, and new instances typically have to load model weights before they can serve, which means autoscaling has a longer warmup tail. Design for headroom that can absorb spikes without triggering cold start of new inference instances under load.<\/p>\n<hr \/>\n<h2>Incident Response: What to Do When It Breaks<\/h2>\n<p>When a real-time AI system degrades in production, the diagnosis path should be structured \u2014 not because engineers aren&#8217;t capable of reasoning under pressure, but because structured diagnosis is faster and less error-prone than ad hoc investigation.<\/p>\n<p>A simple decision tree for real-time AI incidents:<\/p>\n<pre><code>Is end-to-end latency elevated?\n\u251c\u2500\u2500 YES \u2192 Check component latency breakdown\n\u2502         \u251c\u2500\u2500 Feature retrieval elevated? \u2192 Online store health\n\u2502         \u251c\u2500\u2500 Vector search elevated? \u2192 Index health (recall, tombstones)\n\u2502         \u2514\u2500\u2500 Model inference elevated? \u2192 Compute resource saturation\n\u2502\n\u2514\u2500\u2500 NO \u2192 Is prediction quality degraded?\n         \u251c\u2500\u2500 Is feature freshness stale? 
\u2192 Pipeline health (lag, job failures)\n         \u251c\u2500\u2500 Is null rate elevated? \u2192 Schema change or cold start spike\n         \u251c\u2500\u2500 Is output distribution shifted? \u2192 Feature distribution drift\n         \u2514\u2500\u2500 Is recall below threshold? \u2192 Index degradation\n<\/code><\/pre>\n<p>The key discipline is following the tree rather than jumping to conclusions. In complex systems, the symptom that&#8217;s most visible is often not the one that&#8217;s most actionable. A latency spike might be caused by vector search, or by feature retrieval, or by upstream traffic patterns that are saturating the online store. The monitoring signals tell you which \u2014 if they&#8217;re in place.<\/p>\n<p><strong>Runbooks<\/strong> \u2014 documented step-by-step procedures for common failure scenarios \u2014 dramatically reduce mean time to recovery. A runbook for &#8220;online store latency spike&#8221; that lists the specific metrics to check, the commands to run, and the escalation path removes the cognitive load of structuring the investigation under pressure. Writing runbooks before incidents is one of the highest-leverage operational investments a team can make.<\/p>\n<hr \/>\n<h2>The Operational Maturity Progression<\/h2>\n<p>Operational maturity for real-time AI systems isn&#8217;t a binary state. It develops in layers, and most teams are somewhere in the middle. A useful progression:<\/p>\n<p><strong>Level 0 \u2014 Reactive<\/strong>: The team discovers problems when users report them. No AI-specific monitoring. Recovery is ad hoc.<\/p>\n<p><strong>Level 1 \u2014 Instrumented<\/strong>: Basic metrics are in place for latency and availability. AI-specific signals (freshness, recall, distribution drift) are absent or manual.<\/p>\n<p><strong>Level 2 \u2014 Alerted<\/strong>: Alerts exist for the key AI-specific signals. On-call engineers are notified of degradation before users report it. 
Recovery is faster but still manual.<\/p>\n<p><strong>Level 3 \u2014 Documented<\/strong>: Runbooks exist for common failure scenarios. Incident response is structured and consistent. Post-mortems are conducted and drive improvements.<\/p>\n<p><strong>Level 4 \u2014 Automated<\/strong>: Common remediation actions are automated. Stream lag triggers automatic consumer group scaling. Index tombstone thresholds trigger automatic compaction. Freshness violations trigger automatic pipeline retries.<\/p>\n<p>Most teams building real-time AI systems for the first time are at Level 0 or 1. Getting to Level 2 \u2014 instrumented and alerted on the AI-specific signals \u2014 is the single highest-leverage operational investment available. Levels 3 and 4 follow from the foundation that Level 2 provides.<\/p>\n<hr \/>\n<h2>Closing the Series<\/h2>\n<p>This series started with a simple observation: real-time AI systems that hum in development routinely hit problems in production, and those problems aren&#8217;t model problems \u2014 they&#8217;re infrastructure and operations problems.<\/p>\n<p>The five posts have traced the full operational arc:<\/p>\n<ul>\n<li><strong>Post 1<\/strong>: The gap between development and production, and the three categories of pressure that expose it<\/li>\n<li><strong>Post 2<\/strong>: Feature pipelines \u2014 how to get features from raw events to a computed state with the freshness your model needs<\/li>\n<li><strong>Post 3<\/strong>: Feature stores \u2014 the dual-store architecture, consistency enforcement, and the governance layer that makes reuse possible<\/li>\n<li><strong>Post 4<\/strong>: Vector search \u2014 index degradation, recall monitoring, and hybrid filtering at scale<\/li>\n<li><strong>Post 5<\/strong>: Operations \u2014 SLAs, latency budgets, the full observability stack, and the incident response patterns that reduce recovery time<\/li>\n<\/ul>\n<p>The through-line is a shift in mindset: from thinking of the model as the 
system, to thinking of the pipeline as the system. At scale, the model is one component \u2014 a critical one, but one that depends entirely on the infrastructure surrounding it.<\/p>\n<p>Building that infrastructure well \u2014 with explicit SLAs, comprehensive observability, thoughtful fallback strategies, and a documented path from alert to recovery \u2014 is what separates systems that scale from systems that struggle.<\/p>\n<p>The problems are identifiable. The patterns are known. The investment pays for itself the first time a monitoring alert catches a degradation that would otherwise have reached your users.<\/p>\n<hr \/>\n<p><em>Thanks for following along through this series. If you found it useful, the best thing you can do is share it with a teammate who&#8217;s building these systems for the first time \u2014 or forward it to someone who&#8217;s hitting these problems and doesn&#8217;t yet know why.<\/em><\/p>\n<h3>When Your AI Pipeline Grows Up Series<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.kenwalger.com\/blog\/ai\/when-your-ai-pipeline-grows-up-infrastructure-thinking-for-real-time-inference-at-scale\">Real Time AI at Scale<\/a><\/li>\n<li><a href=\"https:\/\/www.kenwalger.com\/blog\/ai\/feature-freshness-designing-pipelines-that-keep-up-with-the-world\">Feature Freshness<\/a><\/li>\n<li><a href=\"https:\/\/www.kenwalger.com\/blog\/ai\/the-feature-store-consistency-and-latency-are-both-non-negotiable\/\">Feature Store<\/a><\/li>\n<li><a href=\"\">Vector Search<\/a><\/li>\n<li>Operations &#8211; <em>This Post!<\/em><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The previous four posts in this series covered the three architectural pillars of real-time AI at scale: feature pipelines, feature stores, and vector search. Each post addressed the design decisions and failure modes specific to one layer of the stack. This final post is about the layer that sits above all of them: operations. 
You &hellip; <a href=\"https:\/\/www.kenwalger.com\/blog\/ai\/operating-real-time-ai-slas-observability-and-knowing-when-its-broken\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Operating Real-Time AI: SLAs, Observability, and Knowing When It&#8217;s Broken&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":1561,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"pmpro_default_level":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[1669],"tags":[],"yst_prominent_words":[1061,108,290,1054],"class_list":["post-1546","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","pmpro-has-access"],"jetpack_featured_media_url":"https:\/\/www.kenwalger.com\/blog\/wp-content\/uploads\/2026\/04\/blog-of-ken-w.-alger-69f0d1277247a.png","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8lx70-oW","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/posts\/1546","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/comments?post=1546"}],"version-history":[{"count":6,"href":"https:\/\/www.kenwalger.com\/blog\/wp-jso
n\/wp\/v2\/posts\/1546\/revisions"}],"predecessor-version":[{"id":1671,"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/posts\/1546\/revisions\/1671"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/media\/1561"}],"wp:attachment":[{"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/media?parent=1546"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/categories?post=1546"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/tags?post=1546"},{"taxonomy":"yst_prominent_words","embeddable":true,"href":"https:\/\/www.kenwalger.com\/blog\/wp-json\/wp\/v2\/yst_prominent_words?post=1546"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}