Turing's Man Blog
- Last Updated on Monday, 06 January 2014 22:00
- Published on Sunday, 25 November 2012 21:26
- Written by Pawel Wawrzyniak
I wondered... What does a software bug look like? I don't mean the place in the requirements analysis, design or code where we can point to the actual bug. I was thinking about something epic, visible and spectacular – or even iconic. There are several examples, but one inspired me the most: the Ariane 5 crash on 4 June 1996. This case may look far removed from the data center world, but... Today, software is everywhere. We should remember that a simple mistake in the software layer can result in a tremendous failure of even the most reliable and redundant infrastructure, all the way down to the physical level. Let's see what happened to Ariane 5 and how it relates to the data center world.
Ariane – what is it?
Ariane is a name which comes from the French spelling of Ariadne, a female mythological character. It describes a series of European civilian expendable launch vehicles operated from the Centre Spatial Guyanais at Kourou in French Guiana. The Ariane project itself started in 1973 after an agreement between France, Germany and the UK. In fact, this was Western Europe's second attempt to develop its own launcher. Previously, there was the unsuccessful Europa programme, which consisted of four project areas (with the following results):
- Europa 1 – 4 unsuccessful launches
- Europa 2 – 1 unsuccessful launch
- Europa 3 – cancelled before any launch occurred
- Europa 4 – study only, later cancelled
Hence the importance of the Ariane project and its need for success. However, not everything went smoothly here either, as the Ariane 5 explosion mentioned above shows. So far, the statistics for the Ariane rockets are:
- Ariane 1, operational 1979-1986, with 9 successes out of 11 launches
- Ariane 2, operational 1986-1989, with 5 successes out of 6 launches
- Ariane 3, operational 1984-1989, with 10 successes out of 11 launches
- Ariane 4, operational 1990-2003, with 113 successes out of 116 launches
- Ariane 5, operational 1996-present, with 62 successes out of 66 launches (by November 2012)
Ariane 5 – the most current technology
ESA (European Space Agency) spent 10 years and about 7 billion dollars to produce the Ariane 5 rocket. This huge investment was intended to give Europe overwhelming supremacy in the commercial space business. Ariane 5 was capable of carrying two satellites, each weighing three tons, into orbit. It's easy to understand how important the project was from a business and commercial perspective, especially when it comes to civilian communication systems. There were 6 successful launches in 2012 (2012-03-23 04:34, 2012-05-15 22:13, 2012-07-05 21:36, 2012-08-02 20:54, 2012-09-28 21:18, 2012-11-10 21:05) out of a total of 62 successful launches since 1996.
We will now take a closer look at the very first launch of Ariane 5. It took place in 1996 and it failed because of one simple software bug related to data conversion (according to official sources this was the main cause of the failure, although other contributing causes were identified in system design and management).
Careless data type casting?
1996-06-04 12:34:06. Centre Spatial Guyanais at Kourou in French Guiana. The Ariane 5 rocket is ready for its maiden voyage. Standing proudly on the ground. All systems ready. Countdown sequence initiated. Launch. Everything that happened next is now a piece of software engineering history. According to James Gleick, who described the failure of the Ariane 5 launch in June 1996:
(a) shutdown occurred 36.7 seconds after launch, when the guidance system's own computer tried to convert one piece of data -- the sideways velocity of the rocket -- from a 64-bit format to a 16-bit format. The number was too big, and an overflow error resulted. When the guidance system shut down, it passed control to an identical, redundant unit, which was there to provide backup in case of just such a failure. But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software.
So, the same bug in both redundant systems resulted in a crash. Wikipedia gives some more detailed information:
The greater horizontal acceleration caused a data conversion from a 64-bit floating point number to a 16-bit signed integer value to overflow and cause a hardware exception. Efficiency considerations had omitted range checks for this particular variable, though conversions of other variables in the code were protected. The exception halted the reference platforms, resulting in the destruction of the flight.
And here it is. The tremendous explosion.
Ariane 5 maiden flight - as presented on YouTube by JeiceTheWarrior
How does it relate to the data center?
What can we learn from the data center perspective? We should realize that the reliability of our infrastructure is not only the result of good design practices, high-quality redundant components and trained engineering staff. It depends on all layers, software included. Sometimes a critical bug in code can stay invisible through all stages of software development. It can remain unnoticed while a given device is in normal operation, until special conditions occur, just as in the case of Ariane 5. The bug itself doesn't have to be something sophisticated. It can be a simple data conversion issue: as long as the value of the 64-bit floating point number (source parameter) was small enough to fit into a 16-bit signed integer (destination parameter), all systems were running correctly. Their operation was even secured by the required level of redundancy.
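To make the conversion issue more tangible, here is a minimal sketch in Python. It is not the original flight code (the function names and sample values are my own assumptions); it only illustrates three ways to squeeze a 64-bit float into a 16-bit signed integer: wrapping silently, refusing the value, or clamping it. Note that, according to the description quoted above, the Ariane 5 conversion raised an unhandled hardware exception rather than wrapping silently.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Truncate toward zero, then keep only the low 16 bits (two's complement)."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Truncate toward zero, but refuse values that do not fit in 16 bits."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Truncate toward zero and clamp to the nearest representable bound."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    # Hypothetical sample values: one that fits, one that does not.
    for sample in (1234.5, 40000.0):
        print(sample, to_int16_unchecked(sample), to_int16_saturating(sample))
```

Which variant is right depends on the system. A range check only helps if somebody has decided what should happen when the check fails.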
That's why we should apply firmware updates when infrastructure vendors recommend them, try to diversify the vendors of critical components (where possible) and have a solid backup.
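Why diversification matters can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `unit_a` and `unit_b` are hypothetical components, not anything taken from Ariane 5 or from a real product. Two identical units fail on the same input, while an independently built backup may survive it.

```python
from typing import Callable, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def unit_a(velocity: float) -> int:
    """Hypothetical component: fails when the value exceeds the 16-bit range."""
    truncated = int(velocity)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError("unit A: value out of range")
    return truncated


def unit_b(velocity: float) -> int:
    """Hypothetical independent component: clamps instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(velocity)))


def first_working(units: Sequence[Callable[[float], int]], velocity: float) -> int:
    """Try each redundant unit in order and return the first result."""
    errors = []
    for unit in units:
        try:
            return unit(velocity)
        except OverflowError as exc:
            errors.append(str(exc))
    raise RuntimeError(f"all {len(units)} units failed: {errors}")


if __name__ == "__main__":
    # Diverse pair: the backup handles the input that breaks the primary.
    print(first_working([unit_a, unit_b], 40000.0))
    # An identical pair, first_working([unit_a, unit_a], 40000.0), would raise.
```

Diversity is not free, and whether a clamped value is actually safe to use depends on the system, so treat this as a picture of the idea rather than a recipe.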
The other thing worth remembering from the Ariane 5 case is related directly to the data conversion issue. In the data center field, it translates to the allowed site capacity (power, cooling and physical space load). We should remember that the physical world is full of barriers. That's why it's so nice to break them one by one through continuous development, but not through irresponsible ignorance. We cannot expect too much flexibility when it comes to physical limits. This is not a virtual world. One cannot pour 200 ml of tea into a cup made to hold 150 ml. One cannot put more devices in the server room when there is no guaranteed power left on the UPS system. One cannot overload a circuit; otherwise, in the best possible scenario, the circuit breaker will trip, resulting in a partial loss of power for other devices. This rule applies to all infrastructure layers: cooling, physical space, communication, storage and even pure software (e.g. type casting). However, some IT guys act as if they were unable to accept the limitations of the physical world (too many computer games?). If they cannot understand the metaphor of cups and tea, cannot grasp the basics of electrical load capacity, or simply can't accept the fact that we cannot plug three connectors concurrently (and safely) into two sockets... we can present them with the case of Ariane 5 and tell the story of an integer overflow error during type casting. What is important: the failure doesn't have to come immediately after the bug is introduced. Sometimes, as in the case of Ariane 5, the epic fail can be postponed to the most spectacular moment, when all eyes are on us and expectations are at their highest.
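The cup-of-tea rule can be written down as a simple admission check. The sketch below is a simplification with made-up names and numbers (the `PowerBudget` fields are my assumptions, not a real DCIM model): a new device is accepted only if both the UPS and the target circuit still have room for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerBudget:
    ups_capacity_kw: float    # guaranteed UPS power available to the room
    breaker_limit_kw: float   # allowable load on the target circuit
    room_load_kw: float       # current total load on the UPS
    circuit_load_kw: float    # current load on the target circuit


def can_add_device(budget: PowerBudget, device_kw: float) -> bool:
    """Return True only if the device fits within both the UPS and circuit limits."""
    if device_kw < 0:
        raise ValueError("device power must not be negative")
    fits_ups = budget.room_load_kw + device_kw <= budget.ups_capacity_kw
    fits_circuit = budget.circuit_load_kw + device_kw <= budget.breaker_limit_kw
    return fits_ups and fits_circuit


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    budget = PowerBudget(ups_capacity_kw=100.0, breaker_limit_kw=3.5,
                         room_load_kw=96.0, circuit_load_kw=2.0)
    print(can_add_device(budget, 1.0))  # fits both limits
    print(can_add_device(budget, 2.0))  # would overload the circuit
```

A real check would also have to account for cooling, physical space and redundancy, but the principle is the same as with range checks in code: compare against the limit before committing, not after.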
On the other hand, we have to remember that continuous capacity control is one of the most important requirements for data center reliability. When no proper capacity management practices and tools are in place, the risk of unexpected downtime is very high. Therefore, data center professionals have to take care of all critical data center resources: power, cooling and physical space (including cabling, too). It's not a simple A+B. We have to consider the required redundancy level on all layers and the resulting limitations, the design details of electrical installations (allowable load on the circuits, circuit breaker selectivity, balanced load per phase, etc.), construction issues (e.g. allowable load in kN/sqm), cooling system efficiency (directly related to the power consumed by all the devices in the server room) and so on, including all possible interrelations between data center infrastructure domains, as a change in one area can influence the capacity of other resource areas directly or indirectly. Also, we should keep in mind that the overall load is not constant over time: there are differences between daytime operations (e.g. online processing) and night-time operations (e.g. offline, batch processing), between mid-month and end-of-month operations (e.g. end-of-month batch processing), etc. One has to be very careful and use dedicated tools for monitoring and calculations. One also has to communicate the results clearly.
We should remember the Ariane 5 case of 4 June 1996. For me, it's my Personal-World Reliability Day.
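A small sketch can show why this is not a simple A+B. It assumes a modular power system with N+1 redundancy and a load that differs between periods; all names and figures are hypothetical. The capacity we may actually plan against is what remains after the redundant module is set aside, and it has to cover the worst period, not the average one.

```python
from typing import Dict


def usable_capacity_kw(modules: int, module_kw: float, redundant: int = 1) -> float:
    """Capacity still available after `redundant` modules fail (N+1 if redundant=1)."""
    if not 0 <= redundant < modules:
        raise ValueError("need at least one non-redundant module")
    return (modules - redundant) * module_kw


def headroom_by_period(loads_kw: Dict[str, float], capacity_kw: float) -> Dict[str, float]:
    """Remaining headroom in kW per period; a negative value means overload."""
    return {period: capacity_kw - load for period, load in loads_kw.items()}


def worst_period(loads_kw: Dict[str, float]) -> str:
    """Name of the period with the highest load."""
    return max(loads_kw, key=lambda period: loads_kw[period])


if __name__ == "__main__":
    # Hypothetical load profile in kW.
    loads = {"day_online": 60.0, "night_batch": 75.0, "month_end_batch": 95.0}
    capacity = usable_capacity_kw(modules=4, module_kw=30.0)
    print(worst_period(loads), headroom_by_period(loads, capacity))
```

In this made-up example the installed capacity looks comfortable on an ordinary day, yet the end-of-month batch run exceeds what is left once redundancy is respected. Dedicated monitoring tools do far more than this, but the arithmetic they start from is the same.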
Coming back from these far-reaching conclusions about the data center to software engineering and the Ariane 5 failure, it's good to remember that there is a Cleanroom software engineering process, which aims to produce software with a certifiable level of reliability. This process focuses mainly on defect prevention rather than defect removal (the more classic approach). Cleanroom was introduced in the 80s. The first demonstration projects started in the 90s, in the military industry. The name Cleanroom was chosen as an analogy to the cleanrooms used in the production of semiconductors. The key characteristics of Cleanroom are:
- Incremental Development Life Cycle
- Defect Prevention: Quality Assessment through Statistical Testing
- Disciplined SE methods required to create correct, verifiable software
What is also important – "management" is not stressed in Cleanroom.
In conclusion, we can add that not only does reliable software require reliable infrastructure, but this relation has to work in both directions. Therefore, no matter what is mission-critical for our business, we have to take care of both ends of this story, physical infrastructure and software, with the same level of attention. Otherwise, be ready for a spectacular failure.
More information on the Ariane 5 failure and the Cleanroom software engineering process:
- James Gleick, "A Bug and a Crash. Sometimes a Bug Is More Than a Nuisance", first published in the New York Times Magazine 1 December 1996, http://www.around.com/ariane.html
- Wikipedia: Ariane (rocket family), Ariane 5, Europa (rocket), Cleanroom software engineering