OpenAI unveils root cause of ChatGPT’s November 8th outage

Web Desk
Nov 16, 2023

WEB DESK: OpenAI has released a comprehensive postmortem on a significant service outage on November 8th, which lasted from 5:42 AM to 7:16 AM PT.

During this period, a substantial number of requests to OpenAI encountered 502 or 503 error codes, affecting both models and API endpoints.

The root cause of the outage was traced back to routing layer nodes reaching memory limits and failing readiness checks.

This led to a cascading effect that left a considerable portion of the service unavailable and unable to handle incoming traffic.

Notably, the morning of the incident saw an unprecedented surge in completions, which put further strain on the service’s already limited capacity.

To mitigate the issue, OpenAI employed a combination of strategies, including limiting incoming traffic, performing a mass redeployment of the service, and gradually restoring traffic levels.

As part of the incident response, OpenAI has already implemented several measures to address the underlying problems:

Memory allocation optimisation: The chronic memory issues were linked to the continuous allocation of new response buffers in a loop, causing delays in garbage collection.

OpenAI resolved this by pre-allocating the buffer and reusing it, resulting in a 3X improvement in both memory and CPU usage.

Adjusted memory limits: OpenAI has reconfigured memory limits to a more appropriate level, ensuring the service now maintains significant available headroom.

Rate limit controls: A series of rate limit controls have been introduced to enable more graceful load shedding of traffic during peak periods.

Increased service capacity: As an additional precaution, OpenAI has increased the service’s capacity to enhance resilience against future incidents.
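To illustrate the buffer fix described above, here is a minimal sketch in Python. OpenAI's postmortem, as reported here, does not include code or name the language of the affected service, so everything below is an assumption for illustration only: the function names, the chunk size, and the use of in-memory streams are all hypothetical. The first function allocates a new object for every chunk it reads in the loop, while the second pre-allocates one buffer and reuses it.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical buffer size, not a figure from OpenAI


def copy_allocating(src, dst):
    """Copy src to dst, allocating a new bytes object on every iteration."""
    total = 0
    while True:
        chunk = src.read(CHUNK_SIZE)  # fresh allocation each time round the loop
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)


def copy_reusing(src, dst):
    """Copy src to dst through a single pre-allocated, reused buffer."""
    buf = bytearray(CHUNK_SIZE)   # allocated once, before the loop
    view = memoryview(buf)        # lets us slice the buffer without copying it
    total = 0
    while True:
        n = src.readinto(buf)     # fills the existing buffer in place
        if not n:
            return total
        dst.write(view[:n])       # write only the bytes that were actually read
        total += n


if __name__ == "__main__":
    payload = bytes(range(256)) * 1000
    out = io.BytesIO()
    copied = copy_reusing(io.BytesIO(payload), out)
    print(copied == len(payload), out.getvalue() == payload)
```

Both functions produce the same output. The difference is that the second creates far fewer short-lived objects, which in a garbage-collected runtime generally means less work for the collector. The 3X figure quoted above is OpenAI's own measurement and is not something this sketch reproduces.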

How will OpenAI prevent such incidents in future?

Looking ahead, OpenAI is committed to preventing similar incidents and improving service reliability.

Future measures include:

Alerting changes: OpenAI will implement alerting changes to identify underlying memory behaviour issues before they escalate into potential service disruptions.

Auto scaling configuration: With the resolution of the underlying issue, OpenAI will configure auto-scaling for this service, allowing for dynamic adjustments to handle varying workloads.

OpenAI acknowledges the impact of extended API outages on customers’ products and businesses and expresses a dedication to preventing such incidents in the future.

The company seems determined to enhance its service reliability and minimise the adverse effects of potential disruptions.
