software engineering
...ems with the ALHS. One problem occurred when the photo eye at a particular location could not detect the pile of bags on the belt and hence could not signal the system to stop. The baggage system loaded bags into telecarts that were already full, so some bags fell onto the tracks and again caused the telecarts to jam. This led to a further problem, which occurred because the system had lost track of which telecarts were loaded or unloaded during a previous jam: when the system came back on-line, it failed to show that the telecarts were loaded. In addition, the timing between the conveyor belts and the moving telecarts was not properly synchronized, causing bags to fall between the conveyor belt and the telecarts. The bags then became wedged under the telecarts. Eventually the problems were so numerous that the system needed a major overhaul.

The government report concluded that the ALHS at the new airport was afflicted by "serious mechanical and software problems". However, one cannot help wondering how much the city was to blame for not demanding proper testing. Denver International Airport had to install a $51 million alternative system to get around the problem. However, United Airlines still continues to use the ALHS. A copy of the report can be found at http://www.bts.gov/smart/cat/rc9535br.html.

2.2 Political / Commercial pressures – the Challenger Disaster

There are many examples of failures caused by political or commercial pressure. One of the most famous is the Challenger disaster. On 28th January 1986 the Challenger space shuttle exploded shortly after launch, killing all seven astronauts on board. This was initially blamed on the design of the booster rockets and on the decision to allow the launch to proceed in cold weather. However, it was later revealed that "there was a decision along the way to economize on the sensors and on their computer interpretation by removing the sensors on the booster rockets.
There is speculation that those sensors might have permitted earlier detection of the booster-rocket failure, and possible early separation of the shuttle in an effort to save the astronauts. Other shortcuts were also taken so that the team could adhere to an accelerated launch sequence." (Neumann). This was not the first time there had been problems with space shuttle missions. A presidential commission was set up, and the Chicago Tribune reported what some astronauts said: "…that poor organization of shuttle operations led to such chronic problems as crucial mission software arriving just before shuttle launches and the constant cannibalization of orbiters for spare parts." Obviously the pressure to get a space shuttle launch and mission to run smoothly and on time is huge. However, there has to be a limit on how many shortcuts can be taken.

Another example of commercial pressure is the case of a Fortune 500 company. (A Fortune 500 company is one that appears in Fortune magazine's listing of the top 500 U.S. companies ranked by revenue.) According to Jones, "the client executive and the senior software manager disliked each other so intensely that they could never reach agreement on the features, schedules, and effort for the project (a sales support system of about 3000 function points)". They both appealed to their higher executives to dismiss the other person. The project was eventually abandoned after incurring expenses of up to $500 000. Jones reported a similar case in a different Fortune 500 company: "…two second-line managers on an expert system (a project of about 2500 function points) were political opponents. They both devoted the bulk of their energies to challenging and criticizing the work products of the opposite teams." Not surprisingly, the project was abandoned after costing the company $1.5 million.
2.3 Incorrect analysis and assumptions – the Three Mile Island accident

Incorrect assumptions can seem very obvious once they are pointed out, but that does not stop them from creeping in. According to Neumann, the Gemini V spacecraft landed a hundred miles off course because of an error in the software. When working from the elapsed time since launch, the programmer treated the Earth’s reference point relative to the Sun as a fixed constant, not realising that the Earth’s position relative to the Sun does not come back to the same point 24 hours later. As a result, the error accumulated while the spacecraft was in space.

The Three Mile Island II nuclear accident, on 28th March 1979, was also blamed on assuming too much. The accident started in the cooling system when one of the pipes became blocked, causing the temperature of the fuel rods to rise from 600 degrees to over 4000 degrees. Instruments to measure the temperature of the reactor core were not standard equipment at the time; however, thermocouples had been installed and could measure high temperatures. Unfortunately, the thermocouples had been programmed to produce a string of question marks instead of displaying the temperature once it rose above 700 degrees. After the reactor started to overheat, the turbines shut down automatically. However, this did not stop the rods from overheating, as someone had left the valves for the secondary cooling system closed. There was no way of knowing this at the time because there was no reading of the temperature of the reactor core.

Operators testified to the commission that there were so many valves that sometimes they would get left in the wrong position, even though their positions are supposed to be recorded and the valves even padlocked. This is also a case of the designers blaming the operators and vice versa. In the end the operators had to concede reluctantly that large valves do not close themselves.
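The readout assumption described above, that the core would never exceed 700 degrees, can be shown in a minimal Python sketch. Only the 700-degree cut-off and the question-mark display come from the account; the function names, the four-character placeholder and the alternative behaviour are hypothetical illustrations, not the plant's actual software.

```python
DISPLAY_LIMIT = 700.0  # degrees; the cut-off the account attributes to the readout


def legacy_readout(temperature: float) -> str:
    """Behaviour as described: above the limit, show only question marks."""
    if temperature > DISPLAY_LIMIT:
        return "????"  # placeholder length is an assumption
    return f"{temperature:.0f}"


def explicit_readout(temperature: float) -> str:
    """Hypothetical alternative: keep the value visible and flag it as out of range."""
    if temperature > DISPLAY_LIMIT:
        return f"{temperature:.0f} (ABOVE EXPECTED RANGE)"
    return f"{temperature:.0f}"
```

With the first function an operator sees the same output at 701 degrees as at 4000 degrees; the second keeps the measurement available precisely when it matters most.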
Petroski says, "Contemporaneous explanations of what was going on during the accident at Three Mile Island were as changeable as the weather forecasts, and even as the accident was in progress, computer models of the plant were being examined to try to figure it out." Lots of assumptions had been made about how high the temperature of the reactor core could go and about the state of the valves in the secondary cooling system. This shows that even in an environment where safety is supposed to be the number one issue, people are still too busy to think about all the little things all the time, and high-pressure situations develop that compromise the safety of hundreds of thousands of people. It took until August 1993 for the site to be declared safe. Facts are taken from Neumann and Perrow.

2.4 Not properly tested software implemented in a high risk environment – the London Ambulance Service

The failure of the London Ambulance Service (LAS) on Monday and Tuesday, 26 and 27 October 1992, was, like all major failures, blamed on a number of factors. These include inadequate operator training, commercial pressures, the absence of a backup procedure, no consideration of system overload, a poor user interface, a poor fit between software and hardware, and insufficient system testing beforehand. Claims were later made in the press that up to 20-30 people might have died as a result of ambulances arriving too late on the scene.

According to Flowers, "The major objective of the London Ambulance Service Computer Aided Despatch (LASCAD) project was to automate many of the human-intensive processes of manual despatch systems associated with ambulance services in the UK. Such a manual system would typically consist of, among others, the following functions: Call taking. Emergency calls are received by ambulance control. Control assistants write down details of incidents on pre-printed forms."
The LAS offered a contract for this system and wanted it to be up and running by 8th January 1992. All the contractors raised concerns about the short amount of time available, but the LAS said that this was non-negotiable. A consortium consisting of Apricot, Systems Options and Datatrak won the contract. Questions were later asked about why their contract was significantly cheaper than those of their competitors. (They asked for £1.1 million to carry out the project, while their competitors asked for somewhere in the region of £8 million.)

The system was lightly loaded at start-up on 26 October 1992. Staff could manually correct any problems, caused particularly by the communications systems, such as ambulance crews pressing the wrong buttons. However, as the number of calls increased, a backlog of emergencies built up. This had a knock-on effect in that the system made incorrect allocations on the basis of the information it had. This led to more than one ambulance being sent to the same incident, or to the closest vehicle not being chosen for the emergency. As a consequence, the system had fewer ambulance resources to use.

With so many problems, the LASCAD generated exception messages for those incidents for which it had received incorrect status information. The number of exception messages appears to have increased to such an extent that the staff were not able to clear the queues. Operators later said this was because the messages scrolled off the screen and there was no way to scroll back through the list of calls to ensure that a vehicle had been dispatched. This all resulted in a vicious circle, with the waiting times for ambulances increasing. The operators also became bogged down in calls from frustrated patients, who started to fill the lines. This led to the operators becoming frustrated, which in turn led to an increased number of instances where crews failed to press the right buttons, or took a different vehicle to an incident than that suggested by the system.
Crew frustration also seems to have contributed to a greater volume of voice radio traffic. This in turn contributed to the rising radio communications bottleneck, which caused a general slowing down in radio communications which, in turn, fed back into increasing crew frustration. The system therefore appears to have been in a vicious circle of cause and effect. One distraught ambulance driver was interviewed and recounted that the police were saying "Nice of you to turn up" and other things. At 23:00 on October 28 the LAS eventually instigated a backup procedure, after the death of at least 20 patients.

An inquiry was carried out into this disaster at the LAS and a report was released in February 1993. Here is what the main summary of the report said: "What is clear from the Inquiry Team's investigations is that neither the Computer Aided Despatch (CAD) system itself, nor its users, were ready for full implementation on 26 October 1992. The CAD software was not complete, not properly tuned, and not fully tested. The resilience of the hardware under a full load had not been tested. The fall back option to the second file server had certainly not been tested. There were outstanding problems with data transmission to and from the mobile data terminals. … Staff, both within Central Ambulance Control (CAC) and ambulance crews, had no confidence in the system and were not all fully trained and there was no paper backup. There had been no attempt to foresee fully the effect of inaccurate or incomplete data available to the system (late status reporting/vehicle locations etc.). These imperfections led to an increase in the number of exception messages that would have to be dealt with and which in turn would lead to more call-backs and enquiries. In particular the decision on that day to use only the computer generated resource allocations (which were proven to be less than 100% reliable) was a high-risk move."
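The operators' complaint that exception messages scrolled off the screen with no way to scroll back can be illustrated with a minimal Python sketch. It contrasts a fixed-size display that silently drops older messages with a list that retains every unresolved one. All names and the screen size are hypothetical; this is an illustration of the failure mode, not the LASCAD design.

```python
from collections import deque

SCREEN_LINES = 3  # hypothetical number of messages visible at once


def scrolling_display(messages):
    """Keep only the newest messages; older ones are lost from view."""
    screen = deque(maxlen=SCREEN_LINES)
    for message in messages:
        screen.append(message)  # a full deque discards its oldest entry
    return list(screen)


def retained_queue(messages, resolved):
    """Hypothetical alternative: keep every message until it is marked resolved."""
    return [m for m in messages if m not in resolved]


incoming = [f"exception {n}" for n in range(1, 6)]
visible = scrolling_display(incoming)
pending = retained_queue(incoming, resolved={"exception 2"})
```

Here `visible` holds only the last three messages, so the first unresolved exception can no longer be seen, while `pending` still lists it until someone deals with it.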
In a 1994 report, Simpson claimed that the software for the system was written in Visual Basic and ran under a Windows operating system. This decision was itself a fundamental flaw in the design. "The result was an interface that was so slow in operation that users attempted to speed up the system by opening every application they would need at the start of their shift, and then using the Windows multi-tasking environment to move between them as required. This highly memory-intensive method of working would have had the effect of reducing system performance still further." The system was never tested properly, nor was there any feedback gathered from the operators beforehand. The report refers to the software as being incomplete and unstab...