When OpenAI went down in December, one of TrueFoundry’s customers faced a crisis that had nothing to do with chatbots or content generation. The company uses large language models to help refill prescriptions. Every second of downtime meant thousands of dollars in lost revenue — and patients who could not access their medications on time.
TrueFoundry, an enterprise AI infrastructure company, announced Wednesday a new product called TrueFailover designed to prevent exactly that scenario. The system automatically detects when AI providers experience outages, slowdowns, or quality degradation, then seamlessly reroutes traffic to backup models and regions before users notice anything has gone wrong.
"The challenge is that in the AI world, failover is no longer that simple," said Nikunj Bajaj, co-founder and chief executive of TrueFoundry, in an exclusive interview with VentureBeat. "When you move from one model to another, you also have to consider things like output quality, latency, and whether the prompt even works the same way. In many cases, the prompt needs to be adjusted in real-time to prevent results from degrading. That is not something most teams are set up to manage manually."
The announcement arrives at a pivotal moment for enterprise AI adoption. Companies have moved far beyond experimentation. AI now powers prescription refills at pharmacies, generates sales proposals, assists software developers, and handles customer support inquiries. When these systems fail, the consequences ripple through entire organizations.
Why enterprise AI systems remain dangerously dependent on single providers
Large language models from OpenAI, Anthropic, Google, and other providers have become essential infrastructure for thousands of businesses. But unlike traditional cloud services from Amazon Web Services or Microsoft Azure — which offer robust uptime guarantees backed by decades of operational experience — AI providers operate complex, resource-intensive systems that remain prone to unexpected failures.
"Major LLM providers experience outages, slowdowns, or latency spikes every few weeks or months, and we regularly see the downstream impact on businesses that rely on a single provider," Bajaj told VentureBeat.
The December OpenAI outage that affected TrueFoundry's pharmacy customer illustrates the stakes. "At their scale, even seconds of downtime can translate into thousands of dollars in lost revenue," Bajaj explained. "Beyond the economic impact, there is also a human consequence when patients cannot access prescriptions on time. Because this customer had our failover solution in place, they were able to reroute requests to another model provider within minutes of detecting the outage. Without that setup, recovery would likely have taken hours."
The problem extends beyond complete outages. Partial failures — where a model slows down or produces lower-quality responses without going fully offline — can quietly destroy user experience and violate service-level agreements. These "slow but technically up" scenarios often prove more damaging than dramatic crashes because they evade traditional monitoring systems while steadily eroding performance.
Inside the technology that keeps AI applications online when providers fail
TrueFailover operates as a resilience layer on top of TrueFoundry's AI Gateway, which already processes more than 10 billion requests per month for Fortune 1000 companies. The system weaves together several interconnected capabilities into a unified safety net for enterprise AI.
At its core, the product enables multi-model failover by allowing enterprises to define primary and backup models across providers. If OpenAI becomes unavailable, traffic automatically shifts to Anthropic, Google's Gemini, Mistral, or self-hosted alternatives. The routing happens transparently, without requiring application teams to rewrite code or manually intervene.
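TrueFoundry has not published TrueFailover's internals, but the core pattern is simple to sketch. The snippet below is a minimal illustration, not the product's API: the call_model() client, provider names, and model identifiers are all hypothetical stand-ins.

```python
# Minimal sketch of an ordered failover chain. call_model() is a
# stand-in for a real SDK or HTTP call; providers and model names
# are illustrative, not TrueFoundry's configuration.

class ProviderUnavailable(Exception):
    """Raised when a provider times out or returns a server error."""

def call_model(provider: str, model: str, prompt: str) -> str:
    # Placeholder for a real client call; here we simulate an
    # OpenAI outage so the chain falls through to the first backup.
    if provider == "openai":
        raise ProviderUnavailable("openai timed out")
    return f"[{provider}/{model}] {prompt}"

FAILOVER_CHAIN = [
    ("openai", "gpt-4o"),
    ("anthropic", "claude-sonnet-4"),
    ("google", "gemini-1.5-pro"),
]

def complete(prompt: str) -> str:
    last_error = None
    for provider, model in FAILOVER_CHAIN:
        try:
            return call_model(provider, model, prompt)
        except ProviderUnavailable as err:
            last_error = err  # move on to the next configured backup
    raise RuntimeError(f"all providers failed: {last_error}")

print(complete("Summarize this refill request."))
```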
The system extends this protection across geographic boundaries through multi-region and multi-cloud resilience. By distributing AI endpoints across zones and cloud providers, health-based routing can detect problems in specific regions and divert traffic to healthy alternatives. What would otherwise escalate into a visible outage becomes a routine routing adjustment that users never perceive.
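Region-level routing follows the same logic one layer down. The sketch below assumes an invented is_healthy() probe and region names; it shows only the shape of health-based preference ordering, not TrueFoundry's implementation.

```python
# Sketch of health-based region selection. is_healthy() stands in for
# whatever probe (error rate, p99 latency) a gateway uses; region
# names are invented.
REGIONS = ["us-east", "us-west", "eu-west"]

def is_healthy(region: str) -> bool:
    return region != "us-east"  # simulate a regional incident

def pick_region(preferred: list[str] = REGIONS) -> str | None:
    for region in preferred:
        if is_healthy(region):
            return region  # first healthy region in preference order
    return None  # no healthy region: escalate to cross-provider failover
```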
Perhaps most critically, TrueFailover employs degradation-aware routing that continuously monitors latency, error rates, and quality signals. "We look at a combination of signals that together indicate when a model's performance is starting to degrade," Bajaj explained. "Large language models are shared resources. Providers run the same model instance across many customers, so when demand spikes for one user or workload, it can affect everyone else using that model."
The system watches for rising response times, increasing error rates, and patterns suggesting instability. "Individually, none of these signals tell the full story," Bajaj said. "But taken together, they allow us to detect early signs that a model is slowing down or becoming unreliable. Those signals feed into an AI-driven system that can decide when and how to reroute traffic before users experience a noticeable drop in quality."
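Bajaj did not describe the scoring math, but a conventional way to blend such signals is an exponentially weighted moving average per signal with reroute thresholds. In the sketch below, the smoothing factor, latency baseline, and cutoffs are invented for illustration.

```python
# Illustrative degradation tracker: smoothed latency and error-rate
# signals combine into a reroute decision. Weights and thresholds are
# invented, not TrueFoundry's actual model.
from dataclasses import dataclass

@dataclass
class HealthTracker:
    alpha: float = 0.2            # EWMA smoothing factor
    ewma_latency_ms: float = 0.0
    ewma_error_rate: float = 0.0

    def record(self, latency_ms: float, is_error: bool) -> None:
        a = self.alpha
        self.ewma_latency_ms = a * latency_ms + (1 - a) * self.ewma_latency_ms
        self.ewma_error_rate = a * (1.0 if is_error else 0.0) + (1 - a) * self.ewma_error_rate

    def should_reroute(self, baseline_ms: float = 800.0) -> bool:
        # Trip on sustained slowdown (2x baseline) or elevated errors:
        # either alone is weak evidence, but the smoothed combination
        # catches "slow but technically up" before it becomes an outage.
        return self.ewma_latency_ms > 2 * baseline_ms or self.ewma_error_rate > 0.05
```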
Strategic caching rounds out the protection by shielding providers from sudden traffic spikes and preventing rate-limit cascades during high-demand periods. This lets systems absorb demand surges and ride out provider rate limits without brownouts or surprise throttling.
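A rough sketch of that idea, assuming exact-match keying and an invented TTL (a production gateway might also use semantic matching), looks like this:

```python
# Sketch of a response cache as a pressure-relief valve: repeated
# requests are served locally instead of hitting the provider, so a
# traffic spike is less likely to trip upstream rate limits. The TTL
# and exact-match keying are illustrative choices only.
import hashlib
import time
from typing import Callable

_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 300

def cached_complete(prompt: str, complete: Callable[[str], str]) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]            # cache hit: no provider call at all
    response = complete(prompt)  # cache miss: go upstream once
    _cache[key] = (time.time(), response)
    return response
```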
The approach represents a fundamental shift in how enterprises should think about AI reliability. "TrueFailover is designed to handle that complexity automatically," Bajaj said. "It continuously monitors how models behave across many customers and use cases, looks for early warning signs like rising latency, and takes action before things break. Most individual enterprises do not have that kind of visibility because they are only able to see their own systems."
The engineering challenge of switching models without sacrificing output quality
One of the thorniest challenges in AI failover involves maintaining consistent output quality when switching between models. A prompt optimized for GPT-5 may produce different results on Claude or Gemini. TrueFoundry addresses this through several mechanisms that balance speed against precision.
"Some teams rely on the fact that large models have become good enough that small differences in prompts do not materially affect the output," Bajaj explained. "In those cases, switching from one provider to another can happen with some visible impact — that's not ideal, but some teams choose to do it."
More sophisticated implementations maintain provider-specific prompts for the same application. "When traffic shifts from one model to another, the prompt shifts with it," Bajaj said. "In that case, failover is not just switching models. It is switching to a configuration that has already been tested."
TrueFailover automates this process. The system dynamically routes requests and adjusts prompts based on which model handles the query, keeping quality within acceptable ranges without manual intervention. The key, Bajaj emphasized, is that "failover is planned, not reactive. The logic, prompts, and guardrails are defined ahead of time, which is why end users typically do not notice when a switch happens."
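In code terms, that planning amounts to storing a tested prompt per provider and swapping the pair together. The templates below are invented examples rather than anything TrueFoundry ships:

```python
# Sketch of "the prompt shifts with the model": each approved provider
# carries its own pre-tested prompt template, and failover selects the
# (model, prompt) pair together. Templates are invented examples.
PROMPTS = {
    "openai": "You are a pharmacy assistant. Answer concisely.\n\n{query}",
    "anthropic": ("You are a pharmacy assistant. Reason through the "
                  "request step by step, then answer concisely.\n\n{query}"),
}

def build_request(provider: str, query: str) -> str:
    # Failing over to a provider means using that provider's tested
    # prompt, not replaying the primary provider's prompt verbatim.
    return PROMPTS[provider].format(query=query)
```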
Importantly, many failover scenarios do not require changing providers at all. "It can be routing traffic from the same model in one region to another region, such as from the East Coast to the West Coast, where no prompt changes are required," Bajaj noted. This geographic flexibility provides a first line of defense before more complex cross-provider switches become necessary.
How regulated industries can use AI failover without compromising compliance
For enterprises in healthcare, financial services, and other regulated sectors, the prospect of AI traffic automatically routing to different providers raises immediate compliance concerns. Patient data cannot simply flow to whichever model happens to be available. Financial records require strict controls over where they travel. TrueFoundry built explicit guardrails to address these constraints.
"TrueFailover will never route data to a model or provider that an enterprise has not explicitly approved," Bajaj said. "Everything is controlled through an admin configuration layer where teams set clear guardrails upfront."
Enterprises define exactly which models qualify for failover, which providers can receive traffic, and even which regions or model categories — such as closed-source versus open-source — are acceptable. Once those rules take effect, TrueFailover operates only within them.
"If a model is not on the approved list, it is simply not an option for routing," Bajaj emphasized. "There is no scenario where traffic is automatically sent somewhere unexpected. The idea is to give teams full control over compliance and data boundaries, while still allowing the system to respond quickly when something goes wrong. That way, reliability improves without compromising security or regulatory requirements."
This design reflects lessons learned from TrueFoundry's existing enterprise deployments. A Fortune 50 healthcare company already uses the platform to handle more than 500 million IVR calls annually through an agentic AI system. That customer required the ability to run workloads across both cloud and on-premise infrastructure while maintaining strict data residency controls — exactly the kind of hybrid environment where failover policies must be precisely defined.
Where automatic failover cannot help and what enterprises must plan for
TrueFoundry acknowledges that TrueFailover cannot solve every reliability problem. The system operates within the guardrails enterprises configure, and those configurations determine what protection is possible.
"If a team allows failover from a large, high-capacity model to a much smaller model without adjusting prompts or expectations, TrueFailover cannot guarantee the same output quality," Bajaj explained. "The system can route traffic, but it cannot make a smaller model behave like a larger one without appropriate configuration."
Infrastructure constraints also limit protection. If an enterprise hosts its own models and all of them run on the same GPU cluster, TrueFailover cannot help when that infrastructure fails. "When there is no alternate infrastructure available, there is nothing to fail over to," Bajaj said.
The question of simultaneous multi-provider failures occasionally surfaces in enterprise risk discussions. Bajaj argues this scenario, while theoretically possible, rarely matches reality. "In practice, 'going down' usually does not mean an entire provider is offline across all models and regions," he explained. "What happens far more often is a slowdown or disruption in a specific model or region because of traffic spikes or capacity issues."
When that occurs, failover can happen at multiple levels — from on-premise to cloud, cloud to on-premise, one region to another, one model to another, or even within the same provider before switching providers entirely. "That alone makes it very unlikely that everything fails at once," Bajaj said. "The key point is that reliability is built on layers of redundancy. The more providers, regions, and models that are included in the guardrails, the smaller the chance that users experience a complete outage."
A startup that built its platform inside Fortune 500 AI deployments
TrueFoundry has established itself as infrastructure for some of the world's largest AI deployments, providing crucial context for its failover ambitions. The company raised $19 million in Series A funding in February 2025, led by Intel Capital with participation from Eniac Ventures, Peak XV Partners, and Jump Capital. Angel investors including Gokul Rajaram and Mohit Aron also joined the round, bringing total funding to $21 million.
The San Francisco-based company was founded in 2021 by Bajaj and co-founders Abhishek Choudhary and Anuraag Gutgutia, all former Meta engineers who met as classmates at IIT Kharagpur. Initially focused on accelerating machine learning deployments, TrueFoundry pivoted to support generative AI capabilities as the technology went mainstream in 2023.
The company's customer roster demonstrates enterprise-scale adoption that few AI infrastructure startups can match. Nvidia employs TrueFoundry to build multi-agent systems that optimize GPU cluster utilization across data centers worldwide — a use case where even small improvements in utilization translate into substantial business impact given the insatiable demand for GPU capacity. Adopt AI routes more than 15 million requests and 40 billion input tokens through TrueFoundry's AI Gateway to power its enterprise agentic workflows.
Gaming company Games24x7 serves machine learning models to more than 100 million users through the platform at scales exceeding 200 requests per second. Digital adoption platform Whatfix migrated to a microservices architecture on TrueFoundry, shortening its release cycle sixfold and cutting testing time by 40 percent.
TrueFoundry currently reports more than 30 paid customers worldwide and has indicated it exceeded $1.5 million in annual recurring revenue last year while quadrupling its customer base. The company manages more than 1,000 clusters for machine learning workloads across its client base.
TrueFailover will be offered as an add-on module on top of the existing TrueFoundry AI Gateway and platform, with usage-based pricing tied to traffic volume and to the number of users, models, providers, and regions involved. An early access program for design partners opens in the coming weeks.
Why traditional cloud uptime guarantees may never apply to AI providers
Enterprise technology buyers have long demanded uptime commitments from infrastructure providers. Amazon Web Services, Microsoft Azure, and Google Cloud all offer service-level agreements with financial penalties for failures. Will AI providers eventually face similar expectations?
Bajaj sees fundamental constraints that make traditional SLAs difficult to achieve in the current generation of AI infrastructure. "Most foundational LLMs today operate as shared resources, which is what enables the standard pricing you see publicly advertised," he explained. "Providers do offer higher uptime commitments, but that usually means dedicated capacity or reserved infrastructure, and the cost increases significantly."
Even with substantial budgets, enterprises face usage quotas that create unexpected exposure. "If traffic spikes beyond those limits, requests can still spill back into shared infrastructure," Bajaj said. "That makes it hard to achieve the kind of hard guarantees enterprises are used to with cloud providers."
The economics of running large language models create additional barriers that may persist for years. "LLMs are still extremely complex and expensive to run. They require massive infrastructure and energy, and we do not expect a near-term future where most companies run multiple, fully dedicated model instances just to guarantee uptime."
This reality drives demand for solutions like TrueFailover that provide resilience regardless of what individual providers can promise. "Enterprises are realizing that reliability cannot come from the model provider alone," Bajaj said. "It requires additional layers of protection to handle the realities of how these systems operate today."
The new calculus for companies that built AI into critical business processes
The timing of TrueFoundry's announcement reflects a fundamental shift in how enterprises use AI — and what they stand to lose when it fails. What began as internal experimentation has evolved into customer-facing applications where disruptions directly affect revenue and reputation.
"Many enterprises experimented with Gen AI and agentic systems in the past, and production use cases were largely internal-facing," Bajaj observed. "There was no immediate impact on their top line or the public perception of the enterprise."
That era has ended. "Now that these enterprises have launched public-facing applications, where both the top line and public perception can be impacted if an outage occurs, the stakes are much higher than they were even six months ago. That's why we are seeing more and more attention on this now."
For companies that have woven AI into critical business processes — from prescription refills to customer support to sales operations — the calculus has changed entirely. The question is no longer which model performs best on benchmarks or which provider offers the most compelling features. The question that now keeps technology leaders awake is far simpler and far more urgent: what happens when the AI disappears at the worst possible moment?
Somewhere, a pharmacist is filling a prescription. A customer support agent is resolving a complaint. A sales team is generating a proposal for a deal that closes tomorrow. All of them depend on AI systems that depend on providers that, despite their scale and sophistication, still go dark without warning.
TrueFoundry is betting that enterprises will pay handsomely to ensure those moments of darkness never reach the people who matter most — their customers.