Cloudflare has confirmed that a faulty software update recently caused it to lose customer log data. The incident lasted around 3.5 hours, and more than half (55%) of the logs that would normally have been sent to customers in that window were lost.

Embarrassed that the error occurred, the California-based company apologized to customers in a blog post and said a similar issue should not happen again.

Cloudflare also noted that failures within systems at scale are inevitable, but subsystems should be built to protect themselves in the event of wider issues.

Cloudflare admits to losing data logs

The problem originated with Cloudflare’s Logpush service, which bundles and sends logs from its global network to customers for compliance, debugging and analytics. A routine update to support a new data set ended up misconfiguring the service, causing the issue.

The company says a configuration bug effectively told one of its internal services, Logfwdr, that no customers had configured logs to be sent. Although engineers identified and fixed the bug within five minutes, it had already triggered a second, deeper bug.

A built-in fail-safe, which sends logs for all customers rather than just those with active Logpush jobs, ended up overwhelming the system. The buffering system, Buftee, was suddenly asked to handle around 40 times its usual load, which left it unresponsive.
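The fail-safe behavior described above can be illustrated with a minimal Python sketch. It is not Cloudflare's code: the function name, the customer names, and the counts (200 customers, 5 with active jobs) are hypothetical, chosen only so the amplification comes out at the 40-fold figure reported.

```python
def customers_to_forward(configured, all_customers):
    """Return the set of customers whose logs should be forwarded.

    Fail open (assumed behavior for this sketch): an empty configuration is
    treated as "configuration unavailable", so logs are forwarded for every
    customer rather than for none.
    """
    if not configured:
        return set(all_customers)
    return set(configured) & set(all_customers)


# Hypothetical numbers, picked to reproduce a 40x jump in downstream load.
all_customers = {f"customer-{i}" for i in range(200)}
active_jobs = {f"customer-{i}" for i in range(5)}

normal = customers_to_forward(active_jobs, all_customers)
failed_open = customers_to_forward(set(), all_customers)  # blank config arrives
amplification = len(failed_open) / len(normal)
```

In this sketch a blank configuration does not stop forwarding. It widens it to every customer, so a downstream buffer sized for the normal case sees many times its usual work.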

“We accept that mistakes and misconfigurations are inevitable. All our systems at Cloudflare need to respond to these predictably and gracefully,” the company wrote.

Looking ahead, Cloudflare has committed to conducting regular overload tests to simulate this error, providing confidence that its systems can handle future bugs of a similar nature.
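As a rough illustration of what an overload test might check, the Python sketch below drives a buffer pool with 40 times its usual number of customers and confirms that it sheds the excess instead of growing without bound. The class, the function, and the limits are assumptions made for this example and do not describe how Buftee works.

```python
class BoundedBufferPool:
    """Hypothetical buffer pool that refuses new buffers beyond a fixed cap."""

    def __init__(self, max_buffers):
        self.max_buffers = max_buffers
        self.buffers = {}   # customer name -> list of buffered log lines
        self.rejected = 0   # writes shed because the pool was full

    def write(self, customer, line):
        """Buffer a line, or shed it if a new buffer would exceed the cap."""
        if customer not in self.buffers:
            if len(self.buffers) >= self.max_buffers:
                self.rejected += 1
                return False
            self.buffers[customer] = []
        self.buffers[customer].append(line)
        return True


def overload_test(pool, usual_customers, factor):
    """Write one line for factor x the usual number of distinct customers."""
    for i in range(usual_customers * factor):
        pool.write(f"customer-{i}", "log line")
    return len(pool.buffers), pool.rejected


pool = BoundedBufferPool(max_buffers=10)
held, shed = overload_test(pool, usual_customers=5, factor=40)
```

Here the pool keeps serving the 10 customers it has room for and counts the rest as shed, which is one way a subsystem can degrade predictably under a sudden surge.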

