Flammable clouds

Mike Puchol
11 min read · Mar 12, 2021

Just before 1AM CET on March 10th, thousands of players were busy hacking away at their bases on Rust, dozens of criminals were busy running their malware scams through command & control servers, and a popular radio station in Barcelona was streaming audio to listeners.

Without warning, everything went dark. Or bright, depending on where you stood.

Cloudy IPOs

OVH was founded in 1999 by the Klaba family, with Octave Klaba as chairman and CEO, and is headquartered in France. In 2013 they opened the first Data Center (DC) for business in Paris, and by 2016, $250 million in funding had been raised, which allowed expansion into several other countries and markets. OVH currently operates 27 DCs worldwide at 11 locations, and claims it can produce 400,000 servers per year.

In our planet’s collective drive to adopt The Cloud, many companies have turned to OVH’s cost-effective virtual and dedicated servers to host their services, placing a bet on greater OPEX against almost zero CAPEX, plus the perceived benefits of not having to hire a bunch of nerds to run in-house metal, cover maintenance costs, etc. There are, however, several trade-offs that are not always taken into account, as we shall see.

This drive has caused explosive growth in the managed infrastructure sector, with OVH reaping great benefits: $712 million in 2019. On March 8th, OVH announced its intention to file for an IPO, with the Klaba family still holding onto 80% of the company.

Moteur 110, échelle 51, bataillon 20!

I imagine the Bas-Rhin Fire Department’s automated alert system blares out something along these lines — loosely translated from the systems used in US firehouses — when a callout arrives to respond to an incident.

Some time after 00:47, the call came in — “Working fire at the OVH data center on Rue du Bassin”. The two firehouses nearest to OVH that I could find in Strasbourg are between 6.5 km and 8 km away, and according to Google Maps, crews would take around 15 minutes to reach the site. Interestingly, across the German border in Kehl, a firehouse is only 3.5 km / 7 minutes away…

From bell to rigs out the door you are looking at 90 seconds; add travel time, size-up, and the start of the attack, and crews could have been working on the fire within some 20 minutes. I would argue that the total destruction of one DC unit, and considerable damage to a second, is only possible in a case like this due to a combination of factors that came about prior to the incident.

During these 20 minutes, an unchecked flame can turn into a raging inferno, given the right conditions. In a DC, these conditions should be made as wrong as possible for a fire.

Fire under control, view taken from Bas-Rhin drone footage

What actions could, or should, have been taken by the staff at the OVH DC in this situation?

Fire suppression in data centers

A DC is home to a vast amount of equipment that is highly allergic to contact with water. It is not only the equipment that must be considered: the large power requirements lead to considerable power distribution infrastructure, which can in turn cause fires through heat or short-circuits, for example. You never drop water on a large electrical installation while it is live.

An average warehouse will employ an array of fire suppression and prevention measures, including heat-resistant cladding, fire-resistant or fire-retardant doors, water sprinklers, and manual tools such as hoses and extinguishers. In addition, heat and smoke detectors are used to provide warning of a fire in progress as soon as possible.

Rosenbauer warehouse sprinkler system

In a DC, all normal measures are pushed towards the extremes. Heat and smoke detection needs to be extremely sensitive, and suppression capabilities need tuning to the unique environment. As an example, water demands extra caution, as it can be destructive enough to equipment to negate the suppression capability it provides. Thus, alternative means are employed, such as inert gases (CO2, nitrogen, or argon) or proprietary agents such as FM200 or Novec 1230 by 3M, which displace oxygen and smother the fire, or curtail the chemical reaction. Handheld fire extinguishers will tend to use CO2 or halon-type clean agents, rather than powder, as the latter can also be disruptive to machinery very distant from the seat of the fire.

Fire control systems can be automated or manual. In the case of sprinklers, there are types where the water pipes are always charged and under pressure, and heat from a fire fuses an element in the sprinkler head, releasing the water immediately and without intervention. Other systems use dry pipes, and only operate once valves to specific sprinkler circuits are opened, either manually or automatically.

In many jurisdictions, fire code dictates that water sprinkler systems must be installed no matter what, typically pre-action (“dry pipe”) systems, which can supplement an alternate “clean” system that operates in the first instance. If the fire cannot be contained by the gas-based system, the sprinklers can be put into action. At this stage, the fire is likely releasing so much energy that additional water damage to equipment becomes irrelevant, and the priority is to contain the fire and stop its spread to the rest of the infrastructure.
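
To make the distinction concrete, here is a minimal sketch (my own simplification, not any vendor’s fire-panel logic) of when each type of system admits water: a wet-pipe head releases as soon as its element fuses, while a pre-action circuit also requires its valve to be opened, automatically on a detection signal or manually by an operator.

```python
# Simplified sketch of sprinkler release conditions; illustration only,
# not a real fire-panel algorithm.
from dataclasses import dataclass


@dataclass
class SprinklerCircuit:
    head_fused: bool       # heat has operated a sprinkler head's element
    detector_alarm: bool   # the detection system is in alarm for this compartment
    manual_release: bool   # an operator opened the circuit valve by hand


def wet_pipe_releases_water(c: SprinklerCircuit) -> bool:
    """Wet pipe: charged pipes, so a fused head alone releases water."""
    return c.head_fused


def pre_action_releases_water(c: SprinklerCircuit) -> bool:
    """Pre-action/dry pipe: the circuit valve must open (detection or manual)
    AND a head must fuse before any water flows."""
    valve_open = c.detector_alarm or c.manual_release
    return valve_open and c.head_fused


circuit = SprinklerCircuit(head_fused=True, detector_alarm=False, manual_release=False)
print(wet_pipe_releases_water(circuit))    # True
print(pre_action_releases_water(circuit))  # False: valve still shut
```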

You would think that all DCs would use some form of fire suppression system, right?

Risk versus Reward

As the fire was brought under control by heavy application of water and water-based foam by the fire crews (including a fire boat), the scale of the damage to the infrastructure became evident.

Photos taken during the fire show the structure completely involved, to a scale on par with fires at chemical plants or warehouses storing large volumes of flammable materials.

There are a few reasons I can think of that would explain a fire of this magnitude, and total destruction of SBG1:

  • A fire detection system that malfunctioned until the fire was too large to be contained.
  • Malfunctions in the suppression system that prevented it from operating properly.
  • A manual suppression system that was not activated after the fire was detected (by the detection system, or staff noticing it).
  • An automated suppression system that failed to operate.
  • Insufficient suppression capability due to cost-cutting design, e.g. only handheld fire extinguishers available.
  • No detection and/or suppression systems installed at all.

Contributing factors that could have facilitated fire spread are:

  • Ventilation systems kept operating at the start of the fire — these move massive amounts of air in DCs.
  • Poor design of compartments and of the fire resistance of divisions, leaving openings or vertical conduits that act as chimneys.
  • Protocols not followed, e.g. fire-resistant doors left open, or electrical systems kept running.

Competition between DC operators is fierce, and this leads to tight controls over costs, which in turn can lead to cutting corners. Sadly, fire prevention and suppression is usually seen as an unnecessary cost, because a fire will never happen to me. Until it does.

Assuming all systems were operational at the time the fire broke out, this should have been the approximate chain of events:

  • Detection system activates for a particular compartment in the DC. An alarm sounds in the control room and, at a minimum, in the affected and adjacent compartments, or across the whole floor.
  • Passive elements are closed and shut down, such as ventilation systems and any open fire-resistant doors. In some setups, PDUs are turned off automatically.
  • DC operators verify the fire condition via CCTV and direct inspection. If a working fire is confirmed and it is manageable with, e.g., a fire extinguisher, and attending personnel are trained in its use, suppression may be attempted. Otherwise, the compartment is evacuated and sealed. Upon confirmation of a working fire or smoke, the fire department is alerted.
  • At this stage, either automated systems take over, and trigger suppression systems, or the operators manually initiate suppression by triggering the gas-based systems.
  • If the gas-based systems fail to contain the fire, water-based sprinkler systems are activated.
  • Fire department arrives and takes over, taking command of in-place suppression systems, and adding its own.
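
Purely as an illustration of that escalation ladder (the stage names are mine, and nothing here reflects OVH’s actual systems), the sequence can be sketched as a simple state machine:

```python
# Hypothetical sketch of the escalation chain described above; the stages and
# transitions mirror the list, not any real DC's fire-control logic.
from enum import Enum, auto


class Stage(Enum):
    DETECTION_ALARM = auto()
    PASSIVE_SHUTDOWN = auto()    # stop ventilation, close fire doors, cut PDUs
    HUMAN_VERIFICATION = auto()  # CCTV and direct inspection, alert fire department
    GAS_SUPPRESSION = auto()
    WATER_SPRINKLERS = auto()
    FIRE_DEPARTMENT = auto()


def next_stage(stage: Stage, fire_confirmed: bool, fire_contained: bool) -> Stage:
    """Advance one step, escalating only while the fire remains unconfirmed or uncontained."""
    if stage is Stage.DETECTION_ALARM:
        return Stage.PASSIVE_SHUTDOWN
    if stage is Stage.PASSIVE_SHUTDOWN:
        return Stage.HUMAN_VERIFICATION
    if stage is Stage.HUMAN_VERIFICATION and fire_confirmed:
        return Stage.GAS_SUPPRESSION
    if stage is Stage.GAS_SUPPRESSION and not fire_contained:
        return Stage.WATER_SPRINKLERS
    if stage is Stage.WATER_SPRINKLERS and not fire_contained:
        return Stage.FIRE_DEPARTMENT
    return stage  # contained or unconfirmed: hold at the current stage
```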

All this time, you are either containing the fire and reducing its spread, or buying yourself time until the fire crews arrive in their big red trucks. If you are not containing it, DC equipment is being lost as time goes on, and application of water, by sprinklers and eventually fire crews, may be the only way to avoid further losses.

Be careful what cloud you choose to dream on

It may just be a tad too flammable. DCs can be certified under a tier level system created by the Uptime Institute, which in essence establishes how much downtime a DC is allowed per year at the certified tier, and imposes a large number of requirements on the DC for it to comply.

To go up a tier, the amount of CAPEX and OPEX required is significant, as you need more redundancy, more capable systems, regular maintenance, multiple power sources, and so on. Of course, if you want to offer dedicated servers for $50/month, you may not be able to justify the cost of certifying even to Tier I.
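
As a rough illustration of what those tiers mean in practice, the availability figures commonly quoted for the Uptime Institute tiers (roughly 99.671% for Tier I up to 99.995% for Tier IV; assumed here for illustration, not taken from OVH’s materials) translate directly into a yearly downtime budget:

```python
# Yearly downtime implied by commonly quoted Uptime Institute tier availability
# targets. The percentages are assumptions for illustration, not OVH figures.
TIER_AVAILABILITY = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

HOURS_PER_YEAR = 365 * 24  # 8,760

for tier, pct in TIER_AVAILABILITY.items():
    downtime_hours = HOURS_PER_YEAR * (1 - pct / 100)
    print(f"{tier}: {pct}% uptime -> ~{downtime_hours:.1f} hours of downtime per year")

# Roughly: Tier I ~28.8 h, Tier II ~22.7 h, Tier III ~1.6 h, Tier IV ~0.4 h per year
```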

As a result, OVH only lists certifications on management systems for IT services, data protection, or security and risk assessment.

The big ISO misconception

I want to digress for a few lines on a major misconception many people have about what ISO-type standards and certifications mean. ISO 9001, to take the easy one everyone knows, “sets out the criteria for a quality management system and is the only standard in the family that can be certified to (although this is not a requirement)”. Note that it sets criteria for a quality management system, not for quality itself. Your company could define quality to mean “50% of the fridges we make could fail within one year”, and get ISO 9001 certified. As long as you implement a management system to make sure you are hitting that 50% failure rate, clearly define how you respond to quality issues, customer complaints, and failures, and keep documents and records of it all, you can sport a shiny “ISO 9001 Certified” logo on your website.

The next time someone claims their products are top quality because their company is ISO 9001 certified, ask to see the quality manual. It’s a document they must provide customers and partners, and it can tell you how they handle and document quality processes and issues. However, it won’t tell you how good their products actually are.

Tiers? We don’t need no stinkin’ tiers!

In the “cheap and cheerful” DC segment, ISO 27001 seems to be the norm everyone cites as the badge of honor. This standard provides “requirements for an information security management system”. It does not deal with fire prevention, detection, or suppression. It did not help OVH to have this certificate in their efforts to fight the disastrous fire. The two other standards they list, ISO 27002 and 27005, are also concerned with IT security.

OVH are not alone in being shy about their approach to fire events. Hetzner, a provider I have happily used for years, also claims to have ISO 27001 certification, and in their virtual 360 tour of their facilities, there is no fire suppression system that I can spot, other than hand-held extinguishers.

The roof over server racks at Hetzner, minus sprinklers or gas systems

In Hetzner’s ISO 27001 Statement of Applicability, Annex 11, on Physical and Environmental Security, we find nothing (as expected, see here if you are bored!) that relates to fire damage. The only mention of fire on Hetzner’s website states “High security standards and early fire detection throughout the data center park”, but fails to list exactly what those standards are. Note they mention early fire detection, not suppression capabilities.

So what do you get when you pay for expensive hosting at a DC certified to resiliency standards? Specific mention of fire detection and suppression measures, such as the INERGEN gas system at the East Africa Data Centre in Nairobi, Kenya, or icolo.io in Mombasa, which operates an FM200 gas system, a VESDA fire alarm, and addressable detection in all areas.

Activate your Disaster Recovery Plan

These are the very last words you ever want to hear if you operate a business — any business. A trucking company could have a DRP (also known as a Business Continuity Plan, or BCP) in case all their trucks are consumed by fire in the parking lot, so this is not for IT sector businesses only.

During the 9/11 attacks on the WTC, some companies managed to get back to quasi-normal operations in hours, whereas others disappeared, not having a DRP/BCP in place.

Many companies relied on OVH to host their mission-critical data on shared or dedicated servers, and fell into the common trap of believing that clouds are not flammable, or that they implement redundancy, failover, and data backups by default. Data has been lost forever, and some companies may not survive the terrible fire. Simply backing up your mission-critical data on a regular basis to a different physical location, even if it’s a different “cloud”, could be the difference between life and death for your business.

I’m on a raised paranoia scale when it comes to backups, having suffered hard drive failures in my desktop and laptop computers within 30 minutes of each other (I lost some data that time, but still had DVDs… nostalgia!). Nowadays, for my personal data, I use Dropbox to sync between two laptops and a desktop. Every few weeks, I also take a snapshot of data into an external drive which I keep away from my house. In the past, I had an external drive inside a waterproofed container buried in the garden, and ran an iSCSI connection to it from inside my home office.

What about work data? At Poa Internet, we use a Tier III DC where we colocate our own Dell servers, plus routing and switching infrastructure. We back up all data, on a daily basis, to an enterprise cloud storage service, and every 7 days, one snapshot is downloaded to a laptop as an extra precaution. Said laptop is backed up by Dropbox too.
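
For anyone wanting a starting point, here is a minimal sketch of that kind of routine: a dated archive, a checksum, and a copy to a second location. Every path here is a placeholder, and this is not Poa Internet’s actual tooling.

```python
# Minimal sketch of a daily off-site snapshot: archive, checksum, copy.
# Paths are placeholders; in practice the destination would be a different
# physical location (another cloud, an office NAS, etc.).
import hashlib
import shutil
import tarfile
from datetime import date
from pathlib import Path

SOURCE = Path("/srv/critical-data")    # hypothetical data to protect
STAGING = Path("/var/backups")         # local staging area
OFFSITE = Path("/mnt/offsite-bucket")  # hypothetical off-site mount


def snapshot() -> Path:
    """Create a dated tar.gz of SOURCE, record its SHA-256, and copy both off-site."""
    STAGING.mkdir(parents=True, exist_ok=True)
    archive = STAGING / f"snapshot-{date.today():%Y%m%d}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(SOURCE, arcname=SOURCE.name)

    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    checksum_file = archive.parent / (archive.name + ".sha256")
    checksum_file.write_text(f"{digest}  {archive.name}\n")

    for f in (archive, checksum_file):
        shutil.copy2(f, OFFSITE / f.name)
    return archive


if __name__ == "__main__":
    snapshot()
```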

Recommendations

I know you’re at the TL;DR stage by now, so here are some rapid-fire suggestions on how you can reduce your exposure when using low-cost cloud solutions:

  • Ask for photos of the safety measures of the DC — fire detection, prevention and suppression.
  • Ask for supporting documentation of what is installed and how often it is maintained.
  • Ask for certificates relative to OHS, fire, and electrical safety, as imposed by local regulations.
  • If you cannot get clear answers, your DC is not certified to any tier level, or you have doubts, double down on your backup and disaster recovery efforts. This could happen again tomorrow.
  • Be sure to design and implement a DRP/BCP.
  • Hire staff that have chops in DC operations, even if they’ll never see one where you operate — they will call bullshit for you when they see it.
  • Test your backups. Test your backups. Test your backups again. The worst time to find out the backups got corrupted is when you hear those damned words.
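
On that last point, the simplest possible restore test is worth automating. Here is a hedged sketch that pairs with the snapshot example earlier (file names are hypothetical): verify the recorded checksum, then actually unpack the archive somewhere disposable.

```python
# Minimal restore test for the snapshot produced above: verify the checksum,
# then unpack into a scratch directory and confirm files actually came out.
import hashlib
import tarfile
import tempfile
from pathlib import Path


def verify_and_restore(archive: Path) -> bool:
    """Return True only if the archive matches its .sha256 file and unpacks non-empty."""
    recorded = (archive.parent / (archive.name + ".sha256")).read_text().split()[0]
    actual = hashlib.sha256(archive.read_bytes()).hexdigest()
    if actual != recorded:
        return False  # the backup is corrupted; do not trust it
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return any(Path(scratch).rglob("*"))


# Example: verify_and_restore(Path("/mnt/offsite-bucket/snapshot-20210312.tar.gz"))
```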
