Key Highlights
- Cloudflare experienced a significant network outage on November 18, 2025.
- The issue was triggered by a change in database permissions, which caused a larger-than-expected feature file to be generated for the Bot Management system.
- Responders initially misdiagnosed the symptoms and suspected that the incident might be an attack.
- The core proxy system failed when the oversized file exceeded a preset limit, resulting in HTTP 5xx error codes.
Cloudflare’s November 18, 2025 Network Outage: A Technical Deep Dive
On November 18, 2025, Cloudflare faced a significant network outage that disrupted core services for its customers. This incident highlights the intricate challenges of maintaining robust and secure digital infrastructure.
The Trigger: Database Permission Changes
The outage traces back to a change to database permissions made on November 18, 2025. The “feature file” used by Cloudflare’s Bot Management system is generated by a query running on a ClickHouse database cluster, and that cluster was being gradually updated to improve permissions management. The change inadvertently caused the query to write multiple entries into the feature file, making it larger than expected.
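To illustrate how a permissions change can produce extra entries, here is a minimal sketch. It assumes the generating query reads column metadata without filtering on the database name, so widening a user's access makes the same columns visible twice. The table, database, and column names, and the helper functions, are all hypothetical and are not taken from Cloudflare's systems.

```python
# Hypothetical metadata rows: (database, table, column). The database and
# column names are illustrative only.
COLUMN_METADATA = [
    ("default", "features", "bot_score_a"),
    ("default", "features", "bot_score_b"),
    ("underlying", "features", "bot_score_a"),
    ("underlying", "features", "bot_score_b"),
]


def visible_columns(table, accessible_databases):
    """Return column names for `table` across every accessible database.

    The result is not grouped or filtered by database name, which is the
    assumed flaw: more access means more (duplicate) rows.
    """
    return [
        column
        for database, tbl, column in COLUMN_METADATA
        if tbl == table and database in accessible_databases
    ]


def build_feature_file(columns):
    """Build the feature list, dropping repeated names but keeping order."""
    return list(dict.fromkeys(columns))


before = visible_columns("features", {"default"})
after = visible_columns("features", {"default", "underlying"})

print(len(before), len(after))    # 2 4
print(build_feature_file(after))  # ['bot_score_a', 'bot_score_b']
```

In this sketch the row count doubles once the second database becomes visible, and de-duplicating by name restores the original list. Filtering the query on the database name would be another way to get the same result.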
Initial Misdiagnosis and Response
Initially, the symptoms led Cloudflare’s incident response team to believe that a hyper-scale Distributed Denial of Service (DDoS) attack was in progress. This misdiagnosis slowed the resolution, because resources were diverted toward countering an attack when the cause was in fact an internal technical issue.
The Impact: Broad and Diverse
The larger-than-expected feature file caused significant disruptions across various Cloudflare services. Key systems such as the Core CDN, security products, Turnstile, Workers KV, Dashboard, and Access Authentication were all impacted. The increased load on these services further exacerbated the problem.
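The following sketch shows one way an oversized file can turn into request failures, and how a fallback changes the outcome. It is a simplified model, not Cloudflare's proxy code: the cap value, the exception, and the function names are assumptions chosen for illustration.

```python
class FeatureFileTooLarge(Exception):
    """Raised when a feature file exceeds the configured cap."""


MAX_FEATURES = 4  # illustrative cap; the real limit is not given in this article


def load_strict(features):
    """Reject an oversized file outright; the caller sees an exception."""
    if len(features) > MAX_FEATURES:
        raise FeatureFileTooLarge(f"{len(features)} > {MAX_FEATURES}")
    return list(features)


def load_with_fallback(features, last_good):
    """Keep serving with the last good file when the new one is rejected."""
    try:
        return load_strict(features)
    except FeatureFileTooLarge:
        return last_good


def handle_request(active_features):
    """Return an HTTP status: 500 if no usable feature set is loaded."""
    return 200 if active_features else 500


last_good = ["f1", "f2", "f3"]
oversized = last_good * 2  # duplicated entries double the file

# Strict loading with no fallback leaves the handler without a feature set.
try:
    active = load_strict(oversized)
except FeatureFileTooLarge:
    active = None
print(handle_request(active))  # 500

# Falling back to the previous file keeps requests succeeding.
active = load_with_fallback(oversized, last_good)
print(handle_request(active))  # 200
```

The design point is that a generated configuration file is treated like untrusted input: it is validated before use, and a rejection degrades to the previous version instead of failing every request.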
Resolving the Issue
To address this issue, Cloudflare stopped the generation and propagation of the bad feature file and manually inserted a known good version into the distribution queue. They also forced a restart of their core proxy to clear out the remaining bad states.
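The remediation steps described above (halt generation, then push a known good file) can be sketched as a small model of a distribution queue with a kill switch. The class, its methods, and the version labels are hypothetical and only illustrate the sequence of actions; they do not describe Cloudflare's actual tooling.

```python
from collections import deque


class DistributionQueue:
    """Toy stand-in for a config distribution queue (hypothetical design)."""

    def __init__(self):
        self._queue = deque()
        self.generation_enabled = True

    def publish_generated(self, version, features):
        """Enqueue a freshly generated file unless generation is halted."""
        if not self.generation_enabled:
            return False
        self._queue.append((version, features))
        return True

    def halt_generation(self):
        """Kill switch: stop new generated files and drop any still queued."""
        self.generation_enabled = False
        self._queue.clear()

    def insert_known_good(self, version, features):
        """Manually enqueue a trusted file, bypassing the kill switch."""
        self._queue.append((version, features))

    def next_file(self):
        """Return the next file to propagate, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


queue = DistributionQueue()
queue.publish_generated("v42-bad", ["f1", "f1", "f2", "f2"])

# Step 1: stop generation and propagation of bad files.
queue.halt_generation()
accepted = queue.publish_generated("v43-bad", ["f1", "f1"])

# Step 2: manually insert a known good version.
queue.insert_known_good("v41-good", ["f1", "f2"])
delivered = queue.next_file()

print(accepted)   # False
print(delivered)  # ('v41-good', ['f1', 'f2'])
```

Consumers that already loaded a bad file would still need a restart to pick up the good one, which matches the forced proxy restart mentioned above.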
Recovery took several hours. Teams worked to mitigate the increased load and restarted services that had entered a bad state, confirming that all systems were functioning as intended before declaring the issue resolved. Normal operations were restored by 17:06 UTC on November 18, 2025.
Lessons Learned and Future Improvements
This incident underscores the importance of robust change management processes in large-scale infrastructure like Cloudflare’s. The company has already begun implementing hardening measures to prevent similar issues from arising in the future, including enhancing their query behavior and improving access controls.
Cloudflare acknowledged that an outage of this kind is unacceptable given its role in the Internet ecosystem. Its commitment to continuous improvement demonstrates a dedication to maintaining the reliability and security of its services for all users.