Soft Disasters

Has the Software Industry taken the Wrong Turn?

You are probably not aware that we have been having a spot of excitement in Cambridge University over the past couple of months, which has somewhat disturbed the normal peace and tranquillity of an academic summer. The story starts a couple of years ago when the University Administration (commonly referred to as 'The Old Schools') decided to update the University's accounting system. Prior to this each Department had more or less managed its own affairs, each having its own accounting staff who handled the placing of orders, the paying of bills, etc. The Central University was only involved at the top level, setting the total budget for each Department and getting the total for income and expenditure returned to it. This clearly was not the right way to do things in the new computerised 21st century, so plans were made to replace all this anarchy by a grand computer-based central accounting system. Under the new system everything would be handled in a central database and the University Financial Officers would, at the press of a key, be able to find out exactly how much had been spent (on paper clips, for example) up to five seconds ago. A Software Company was appointed to design this system, contracts were placed, and the Financial Officers settled back, waiting in eager anticipation for their new system.

At the start of this summer the new computers were installed, the software set up and trials of the new system commenced. About a couple of months ago stories started to circulate amongst the academics. It was said that in order to place an order (for paper clips, for example) it was necessary to enter information on seven separate forms, displayed one by one on the computer screen. Since each form hid the previous one it was difficult to refer back to earlier entries to check for consistency. Also, there could be up to half an hour's delay before the next form appeared so it was in any case almost impossible to remember what had been entered on previous forms. To put it simply, the system did not work! Then other stories started to circulate. For example, the story of the sad plight of four Research Students living in a University-owned house in West Cambridge who received a letter to say that their telephone was being cut off because they had not paid the bill. (This bill should have been handled by the University.)

About three weeks ago panic finally set in. In the Engineering Department all the Computer Officers were pulled off their normal work to prepare an emergency accounting system to keep the Department running. Similar steps were being taken elsewhere. Prudent academics started to withdraw into their financial bunkers and to batten down the hatches. Then optimistic reports started to come in. 'The Accounting System is now running at an acceptable speed!' 'It has now been running for a week without major problems!!' The panic started to subside. It appears that there was only one major bug in the system and it had finally been tracked down to its lair. But what if there had been a second bug hiding behind the first, or perhaps twenty other bugs waiting to pounce? The University has been lucky, perhaps more lucky than it deserves.

The point is that this was not a new problem. There have been numerous examples of previous big software systems going wrong. In this country in the past few years we have had the failure of the Passport Office system. Then there was the Taurus Share Accounting System initiated by the Stock Exchange, which was a total failure and abandoned after several million pounds had been spent on it. There is now the new Air Traffic Control system which is running several years late and is still not working. You can no doubt think of other such cases. The University should have known about these disasters and realised the dangers lying on the path it was taking. It should at least have made provision to fall back to the old systems if things went badly wrong.

The problem was not even new and unknown when work started on the first of these British Disasters. There was an article on this subject six years ago in Scientific American ('Software's Chronic Crisis', Scientific American, Sept. 1994). This described, amongst other things, the horrors of the Denver Airport project; a magnificent new airport, twice the size of Manhattan Island and 10 times as wide as London Airport, which was designed to be capable of landing three 747s simultaneously on different runways, in poor visibility. This airport was unable to open for over a year due to bugs in the computer-controlled baggage handling systems. The Denver Airport Authority faced losses of several hundred million dollars due to this delay.

At the time the article was written it was claimed that the average software project was overrunning its completion date by 50%, that 75% of the larger projects completed were classed as 'operational failures', and that 25% of all large projects were abandoned before completion as total failures. It had, even at that time, been known for over twenty years that there were serious problems in software development for large projects, but the Software Industry seemed to be unable to solve these problems. The situation is still much the same today.

There is, however, one area in which the story is very different. About the time the Scientific American article appeared Cambridge University was installing a new telephone system. Before this, each College and Department had its own little switchboard, and all of these were connected up to form the University Network. The scheme was to replace all of these by one large computer-controlled Central University Telephone Exchange. In many ways, this system was similar to the Accounting System. It had to handle the setting up of each new telephone call: get the details of the calling telephone, identify the number called and give the necessary instructions to the network to try to make a connection. It then had to check whether the called number was available and, if it was engaged or unavailable, send back the appropriate 'busy' or 'unavailable' tone. If the called number responded it had to start timing the call. It then had to monitor the progress of the call, note when it ended, work out the charges and enter these in the accounting database under the caller's account. All these things had to be done in real time with perhaps over 100 callers using the system at any given moment. This system was installed by British Telecom, was completed on time and then during a single night all the telephones in Cambridge University and the Colleges were reconnected to the new exchange. The next morning it all worked perfectly, and has continued to work almost perfectly (there have been two or three faults affecting some of the users for a few hours) ever since. This is not an exception. At that period most of the British Telephone Network was being converted to computer-controlled digital systems. These were based on very large and complex electronic telephone exchanges. For example, a big exchange in a financial centre, such as the City of London, can be handling over 50,000 simultaneous telephone calls. Responding to each call request does not take these systems half an hour. It is done in about half a second - the time taken to lift the telephone handset off its base and get it up to your ear. By the time you get it up the dialling tone is already sounding.
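The sequence of steps the exchange had to carry out for each call can be sketched in a few lines. This is only an illustration of the steps listed above, not a description of the real exchange software: the names Exchange, set_up_call and end_call, the extension numbers and the tariff of 4 pence per minute are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class LineState(Enum):
    AVAILABLE = "available"
    ENGAGED = "engaged"
    UNAVAILABLE = "unavailable"


class Outcome(Enum):
    CONNECTED = "connected"
    BUSY_TONE = "busy tone"
    UNAVAILABLE_TONE = "unavailable tone"


PENCE_PER_MINUTE = 4  # hypothetical tariff, for illustration only


@dataclass
class Exchange:
    lines: Dict[str, LineState]
    # Pence owed by each caller (the "accounting database" of the sketch).
    accounts: Dict[str, int] = field(default_factory=dict)

    def set_up_call(self, caller: str, called: str) -> Outcome:
        # An unknown number is treated as unavailable.
        state = self.lines.get(called, LineState.UNAVAILABLE)
        if state is LineState.ENGAGED:
            return Outcome.BUSY_TONE
        if state is LineState.UNAVAILABLE:
            return Outcome.UNAVAILABLE_TONE
        # The called line is free: connect and mark both ends as engaged.
        self.lines[caller] = LineState.ENGAGED
        self.lines[called] = LineState.ENGAGED
        return Outcome.CONNECTED

    def end_call(self, caller: str, called: str, seconds: int) -> int:
        # Release both lines, then charge the caller per started minute.
        self.lines[caller] = LineState.AVAILABLE
        self.lines[called] = LineState.AVAILABLE
        minutes = -(-seconds // 60)  # round up to whole minutes
        charge = minutes * PENCE_PER_MINUTE
        self.accounts[caller] = self.accounts.get(caller, 0) + charge
        return charge
```

A call from a free extension to another free extension returns Outcome.CONNECTED, and ending it after 61 seconds posts two minutes' worth of charges to the caller's account. The real system, of course, has to do this for many callers at once and in real time, which this sketch makes no attempt to show.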

All these systems have been functioning nearly perfectly ever since. In recent years we have added mobile radio phones, which have the additional complication that the system has to know where every cell-phone is located at any given time. The telephone system also carries all of the traffic for the Internet, and has taken on this task without any signs of distress. People now expect to be able to contact anywhere in the world, either by telephone or by the Internet, on demand, and with virtually no delays. What then is the difference between the telecommunications services and other software-based systems?

To answer this question we need to go back over 100 years, to a time well before computers were even dreamed about. When the first telephone cable was laid between Britain and France a critical problem became apparent. If the British were to be able to speak to the French by telephone it was essential that a French telephone should be capable of working when connected to a British telephone - and vice versa. This meant that, as seen at its terminals, it should be impossible to distinguish a British telephone from a French one. They had to emit signals of the same voltages, work on the same input signals, respond to ringing tones in exactly the same way, handle dialling the same way, and work with similar types of telephone cable. In other words the interface between a telephone and the telephone network had to be completely and precisely specified. It had to be the same for every telephone in the world. Consequent on this an International Committee was set up, the International Telegraph and Telephone Consultative Committee (commonly called the C.C.I.T.T. and, since 1993, the Telecommunication Standardization Sector of the International Telecommunication Union (I.T.U.)). This had the initial job of specifying telephones, but as time passed it became clear that it had to specify virtually every other aspect of the interconnections between different parts of the telephone systems. For example, the codes used for dialling and signalling had to be standardised. It became necessary to standardise on telephone cables having defined characteristics. When multichannel carrier telephony systems appeared it became necessary to define their signal formats, frequencies and signal levels, otherwise multichannel transmission across national frontiers would not have been possible.

In passing, it is interesting to note that the C.C.I.T.T. never had any authority to enforce its standards. They were always known as 'recommendations', but the consequences of ignoring them were so serious that no telephone operating authority dared do so. They became, in effect, carved in stone alongside the Ten Commandments. A by-product of this rigorous specification was that telephone systems tended to be treated in a modular sense. They were regarded as a combination of black boxes defined entirely by the characteristics seen at the boxes' terminals. It was never necessary to tell the manufacturers how they should design and construct the innards of each box. At another level, there has always been a strong emphasis on reliability in communication systems. This stems from the need to have rapid and reliable communications in emergencies.

All of this background was carried over into computer hardware and software design when these were introduced into the communication network. Software tended to be treated as a set of rigorously specified functional modules. A great deal of thought went into the initial specification of these modules before any attempts were made to implement them. Their specifications covered every possible combination of inputs and defined exactly what each module should, and should not, do in every possible situation. The background emphasis on reliability meant that there was no hesitation in spending money to achieve this. You will often find that telecommunications systems employ duplicate - or even triplicate - computers running in parallel. If one fails there is another ready to take over immediately.
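To make these two ideas concrete, here is a minimal sketch of a module whose specification is written out as a table with one row for every combination of inputs, together with a standby arrangement in which a second unit takes over if the first fails. The inputs (the state of the called line and whether the caller is barred), the actions and all the names are assumptions invented for the illustration. Real exchanges run their duplicate computers in parallel; the sketch simplifies this to trying the units one after another.

```python
from enum import Enum
from itertools import product


class Line(Enum):
    FREE = 1
    ENGAGED = 2
    OUT_OF_SERVICE = 3


class Action(Enum):
    RING = 1
    BUSY_TONE = 2
    UNAVAILABLE_TONE = 3


# The specification: one row per combination of
# (state of the called line, is the caller barred?).
SPEC = {
    (Line.FREE, False): Action.RING,
    (Line.FREE, True): Action.UNAVAILABLE_TONE,
    (Line.ENGAGED, False): Action.BUSY_TONE,
    (Line.ENGAGED, True): Action.UNAVAILABLE_TONE,
    (Line.OUT_OF_SERVICE, False): Action.UNAVAILABLE_TONE,
    (Line.OUT_OF_SERVICE, True): Action.UNAVAILABLE_TONE,
}


def spec_is_complete() -> bool:
    """True only if the table has a row for every possible input pair."""
    return set(SPEC) == set(product(Line, (False, True)))


def decide(line: Line, caller_barred: bool) -> Action:
    """The module itself: it does exactly what the table says."""
    return SPEC[(line, caller_barred)]


def failed_unit(line: Line, caller_barred: bool) -> Action:
    """Stands in for a unit that has suffered a fault."""
    raise RuntimeError("simulated hardware fault")


def run_with_standby(units, line: Line, caller_barred: bool) -> Action:
    """Try each duplicate unit in turn and use the first one that answers."""
    for unit in units:
        try:
            return unit(line, caller_barred)
        except Exception:
            continue  # this unit has failed; hand over to the next
    raise RuntimeError("all units failed")
```

With units set to [failed_unit, decide], run_with_standby still returns the specified action, because the second unit takes over from the first. The check in spec_is_complete is the part that matters most: it can be run before any of the module is implemented.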

My own belief is that this example might offer possible solutions to the software industry's problems. For a start, I think that the initial specifications of the systems need to be more detailed and precise. Rigorous use of a modular approach then means that software can be broken down into small independent units. It is easier to define precisely what each of these units should do and to ensure that this actually happens. This could eliminate many of the bugs.

At this point I have to bring up another of my pet obsessions - the C programming language! I regard C (and C++) as being major causes of software unreliability. The trouble is that they allow the programmer too much freedom and their notation is, at the same time, too obscure. As one textbook put it - 'A C programming statement can be breathtakingly incomprehensible.' The fact that a C programmer can easily get at the lowest level of the program and, for example, meddle with digits in a processor register, is also a case of asking for trouble. The combination of obscure source code and freedom to break all the rules when you feel like it is asking for trouble squared!

The argument used is that C (or C++) programs are very efficient, but I wonder if this is really valid. A more restrictive language, such as Pascal, which uses strong variable typing (i.e. you are not allowed to freely mix variables of different types) and which also produces more long-winded but more comprehensible source code is claimed to be less efficient. However, experience shows that the loss of efficiency is not very great, and is small compared with the enhancements that can now be achieved by using faster hardware.
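What is meant by not being allowed to mix variables of different types can be shown with a small sketch. It is written in Python rather than Pascal, so the mistake is caught when the program runs and not, as a Pascal compiler would do, before it runs at all. The types Seconds and Pence are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seconds:
    """A call duration. It may only be added to another duration."""
    value: int

    def __add__(self, other):
        if not isinstance(other, Seconds):
            return NotImplemented  # Python then raises TypeError
        return Seconds(self.value + other.value)


@dataclass(frozen=True)
class Pence:
    """A sum of money. It may only be added to another sum of money."""
    value: int

    def __add__(self, other):
        if not isinstance(other, Pence):
            return NotImplemented  # Python then raises TypeError
        return Pence(self.value + other.value)


def mixing_is_rejected() -> bool:
    """Adding a duration to a charge is a mistake, and is refused."""
    try:
        Seconds(90) + Pence(12)
    except TypeError:
        return True
    return False
```

Adding Seconds(90) to Seconds(30) gives Seconds(120), while adding Seconds(90) to Pence(12) is refused. Had both values been held as plain integers, the sum would have been computed without complaint and the error would have surfaced, if at all, somewhere far from its cause.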

Hence I wonder if the following approach might be better.

  1. Use strongly modular software (more cumbersome but more precisely defined) along with programming languages that emphasise clarity and error detection rather than speed and efficiency.
  2. Then use faster, albeit more expensive, hardware to counteract the lower software efficiency that may result.

The increase in costs could well be insignificant compared with the losses that arise from the continuing software failures we now experience.

Derek Ingram.

 
