Do the most complex testing with the simplest methods.

Software Testing - SE By Ron Patton - Infamous Software Error Case Studies

2010-09-29 10:59:46 / Category: Translation

Infamous Software Error Case Studies

It’s easy to take software for granted and not really appreciate how much it has infiltrated our daily lives. Back in 1947, the Mark II computer required legions of programmers to constantly maintain it. The average person never conceived of someday having his own computer in his home. Now there are free software CD-ROMs attached to cereal boxes and more software in our kids’ video games than on the space shuttle. What once were techie gadgets, such as pagers and cell phones, have become commonplace. Most of us now can’t go a day without logging on to the Internet and checking our email. We rely on overnight packages, long-distance phone service, and cutting-edge medical treatment.

Software is everywhere. However, it’s written by people---so it’s not perfect, as the following examples show.

臭名昭著的软件错误案例研究

我们理所当然的接受软件存在但却未曾真正意识到软件已经多大程度上渗透进我们的日常生活。回到1947年,用Mark第二代技术建造的计算机需要大量程序员时常去维护它。一般人从来没有设想过某一天在自己家中能拥有一台属于自己的电脑。现在到处都是附带在麦片盒的免费光盘,而且孩子电动游戏机里的软件比太空飞船上的软件还要多。那些曾经只有技术员使用的小装置现在都已变得很普遍了,如寻呼机手机。大多数人现在每天都要登陆互联网或者查收邮件。我们依赖隔夜包裹、长途电话服务,以及最先进的医疗服务。

软件无处不在。但是,它是人手写出来的,所以不是完美的,如下面的例子。

 

Disney’s Lion King, 1994-1995

In the fall of 1994, the Disney Company released its first multimedia CD-ROM game for children, The Lion King Animated Storybook. Although many other companies had been marketing children’s programs for years, this was Disney’s first venture into the market and it was highly promoted and advertised. Sales were huge. It was “the game to buy” for children that holiday season. What happened, however, was a huge debacle. On December 26, the day after Christmas, Disney’s customer support phones began to ring, and ring, and ring. Soon the phone support technicians were swamped with calls from angry parents with crying children who couldn’t get the software to work. Numerous stories appeared in newspapers and on TV news.

It turns out that Disney failed to test the software on a broad representation of the many different PC models available on the market. The software worked on a few systems---likely the ones that the Disney programmers used to create the game---but not on the most common systems that the general public had.

迪斯尼的狮子王,1994-1995

1994年秋天,迪斯尼公司发行了第一套儿童多媒体游戏光盘,《狮子王》动画故事书。虽然当时已有很多公司推销了多年的儿童程序,但这却是迪斯尼首次尝试进入这个市场,而且迪斯尼公司投入大量的广告和宣传。销售量非常的大。成了当时节假日为孩子“必买的游戏”。但是,却发生了严重的崩溃。同年12月26日,也就是圣诞节的第二天,迪斯尼客户支持热线开始不断响起,很快电话支持技术人员被大量电话所淹没,电话那头都是生气的父母和那些因为无法运行软件而哭泣的孩子的声音。大量的报道出现在报纸和电视新闻。

事实证明,迪斯尼公司忽视了将软件放在当时市场上现存的很多代表性的计算机模型上进行广泛测试。这个软件只能在少数计算机系统上运行,比如迪斯尼程序员用来开发这个游戏的系统,而不是普通公众使用的大多数计算机系统。

 

Intel Pentium Floating-Point Division Bug, 1994

Enter the following equation into your PC’s calculator:

(4195835/3145727)*3145727-4195835

If the answer is zero, your computer is just fine. If you get anything else, you have an old Intel Pentium CPU with a floating-point division bug---a software bug burned into a computer chip and reproduced over and over in the manufacturing process.
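The calculator check above can be sketched in a few lines of Python. This is only an illustration: the helper name `residual` is made up for the example, and the flawed quotient is the figure widely reported for affected Pentiums, not something stated in the text, so treat it as an assumption.

```python
from fractions import Fraction

X, Y = 4195835, 3145727

def residual(quotient):
    """Return quotient * Y - X, the value the calculator test displays."""
    return quotient * Y - X

# Exact rational arithmetic: the residual is zero by construction.
exact = residual(Fraction(X, Y))

# Ordinary floating-point division on whatever machine runs this script.
# On a correct FPU this is zero or within rounding error of zero.
observed = residual(X / Y)

# Quotient widely reported for flawed Pentiums (assumption, not from the text).
REPORTED_FLAWED_QUOTIENT = 1.333739068902037589
flawed = residual(REPORTED_FLAWED_QUOTIENT)

print("exact:", exact)
print("observed:", observed)
print("reported flawed:", flawed)
```

Under that assumption the flawed residual is far from zero, which is why such a simple expression was enough to expose the defect.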

On October 30, 1994, Dr. Thomas R. Nicely of Lynchburg (Virginia) College traced an unexpected result from one of his experiments to an incorrect answer by a division problem solved on his Pentium PC. He posted his find on the Internet and soon afterward a firestorm erupted as numerous other people duplicated his problem and found additional situations that resulted in wrong answers. Fortunately, these cases were rare and resulted in wrong answers only for extremely math-intensive, scientific, and engineering calculations. Most people would never encounter them doing their taxes or running their businesses.

What makes this story notable isn’t the bug, but the way Intel handled the situation:

• Their software test engineers had found the problem while performing their own tests before the chip was released. Intel’s management decided that the problem wasn’t severe enough or likely enough to warrant fixing it or even publicizing it.

• Once the bug was found, Intel attempted to diminish its perceived severity through press releases and public statements.

• When pressured, Intel offered to replace the faulty chips, but only if a user could prove that he was affected by the bug.

There was a public outcry. Internet newsgroups were jammed with irate customers demanding that Intel fix the problem. News stories painted the company as uncaring and incredulous. In the end, Intel apologized for the way it handled the bug and took a charge of more than $400 million to cover the costs of replacing bad chips. Intel now reports known problems on its website and carefully monitors customer feedback on Internet newsgroups.

 

NOTE

On August 28th, 2000, shortly before the first edition of this book went to press, Intel announced a recall of all the 1.13GHz Pentium III processors it had shipped after the chip had been in production for a month. A problem was discovered with the execution of certain instructions that could cause running applications to freeze. Computer manufacturers were creating plans for recalling the PCs already in customers’ hands and calculating the costs of replacing the defective chips. As the baseball legend Yogi Berra once said, “This is like déjà vu all over again.”

英特尔奔腾浮点数除法缺陷,1994

在电脑自带的计算器中输入下面的方程式:

(4195835/3145727)*3145727-4195835

如果得到的结果是零,说明你的计算机没问题。如果得到的结果是零以外的其他答案,说明你使用的是一款有浮点除法缺陷的旧版英特尔奔腾中央处理器---一个软件缺陷被烧进了计算机集成电路片,而且不断的在制造过程中被复制生产。

1994年10月30日,维吉尼亚州林奇堡学院的托马斯.R.奈斯利博士在他的实验中发现了一个错误答案,他追溯到这个意外结果的根源是他当时使用的电脑的奔腾处理器的一个除法问题。他把发现的问题发布到英特网上,很快引起了大骚动,许多人重现了他的问题,而且还发现了更多类似的问题。幸运的是,这些情形是极少数的,而且只有在运算量极大的数学、科学和工程计算中才会产生错误结果。大多数人在税务计算或商务计算的时候都不会遇到这种情况。

这个故事值得引起注意的不是这个缺陷的本身,而是英特尔处理这一事情的方式:

• 芯片发行之前,他们的软件测试工程师就已通过执行测试发现了这个问题。英特尔管理层认为这个问题不够严重,出现的可能性也不够大,不值得修复,甚至没必要公开。

• 当这个缺陷被发现后,英特尔试图通过新闻稿和公开声明淡化公众所感受到的问题严重性。

• 迫于压力,英特尔提出可以更换有缺陷的芯片,但前提是用户能够证明自己受到了这个缺陷的影响。

公众强烈抗议。英特网新闻组都是愤怒的客户,要求英特尔公司修复这个问题。新闻报道称这是一家不负责和不可靠的公司。最后,英特尔对处理这个缺陷的方式表示道歉,还拿出了四亿多美元来支付更换这些问题芯片的费用。现在英特尔在自己的网站上登记这些已经发现的问题,并且认真处理英特网上客户的反馈。

 

注意

2000年8月28日,就在这本书第一版印刷之前,英特尔宣布召回已发售的全部奔腾III 1.13GHz处理器,此时该芯片投产仅一个月。原因是发现处理器在执行特定指令时,可能引起运行中的程序停止响应。计算机制造商正在制定计划召回已在客户手中的电脑,并计算替换这些有缺陷芯片的成本。就像棒球传奇人物优吉.贝拉所说“这好像似曾相识的从头再来”。

 

NASA Mars Polar Lander, 1999

On December 3, 1999, NASA’s Mars Polar Lander disappeared during its landing attempt on the Martian surface. A Failure Review Board investigated the failure and determined that the most likely reason for the malfunction was the unexpected setting of a single data bit. Most alarming was why the problem wasn’t caught by internal tests.

In theory, the plan for landing was this: As the Lander fell to the surface, it was to deploy a parachute to slow its descent. A few seconds after the chute deployed, the probe’s three legs were to snap open and latch into position for landing. When the probe was about 1,800 meters from the surface, it was to release the parachute and ignite its landing thrusters to gently lower it the remaining distance to the ground.

To save money, NASA simplified the mechanism for determining when to shut off the thrusters. In lieu of costly radar used on other spacecraft, they put an inexpensive contact switch on the leg’s foot that set a bit in the computer commanding it to shut off the fuel. Simply, the engines would burn until the legs “touched down.”

Unfortunately, the Failure Review Board discovered in their tests that in most cases when the legs snapped open for landing, a mechanical vibration also tripped the touch-down switch, setting the fatal bit. It’s very probable that, thinking it had landed, the computer turned off the thrusters and the Mars Polar Lander smashed to pieces after falling 1,800 meters to the surface.

The result was catastrophic, but the reason behind it was simple. The Lander was tested by multiple teams. One team tested the leg fold-down procedure and another the landing process from that point on. The first team never looked to see if the touch-down bit was set---it wasn’t their area; the second team always reset the computer, clearing the bit, before it started its testing. Both pieces worked perfectly individually, but not when put together.
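The gap between the two test teams can be illustrated with a toy model. Everything below is hypothetical (the class `LanderComputer`, its methods, and the three check functions are invented for this sketch and are not NASA's software); it only shows how two checks can each pass while the combined sequence fails.

```python
class LanderComputer:
    """Toy flight computer with a latched touchdown bit (hypothetical model)."""

    def __init__(self):
        self.touchdown_bit = False
        self.legs_deployed = False
        self.thrusters_on = False

    def reset(self):
        # A reset clears the latched bit, as the second team's procedure did.
        self.touchdown_bit = False

    def deploy_legs(self):
        # Assumption: the snap-open vibration trips the contact switch,
        # and the bit stays set once tripped.
        self.legs_deployed = True
        self.touchdown_bit = True

    def start_descent(self):
        self.thrusters_on = True

    def step(self):
        """Run one control cycle; return True while the thrusters are firing."""
        if self.touchdown_bit:
            self.thrusters_on = False
        return self.thrusters_on


def check_leg_deployment():
    """Team 1: verifies the legs deploy, never inspects the touchdown bit."""
    computer = LanderComputer()
    computer.deploy_legs()
    return computer.legs_deployed


def check_descent_after_reset():
    """Team 2: resets the computer first, which hides the stale bit."""
    computer = LanderComputer()
    computer.deploy_legs()
    computer.reset()
    computer.start_descent()
    return computer.step()


def check_integrated_sequence():
    """End to end: legs deploy, then descent begins with no reset in between."""
    computer = LanderComputer()
    computer.deploy_legs()
    computer.start_descent()
    return computer.step()


print(check_leg_deployment(), check_descent_after_reset(), check_integrated_sequence())
```

In this model, only the end-to-end check reveals that the thrusters cut off immediately, which is the kind of integration test neither team ran.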

美国航天局火星极地登陆者号探测器,1999

1999年12月3日,当美国航天局火星极地登陆者号探测器尝试准备登陆火星表面的时候突然消失了。故障检验委员会调查了这次失败事件并确定了故障最可能的原因是设置了一个意外的单数据位。最让人惊讶的是,为什么这个问题在内部测试的时候没有被发现。

理论上,登陆计划是这样的:当登陆者下降到接近火星表面的时候,需要调动打开一个降落伞来减慢登陆者的下降速度。降落伞打开几秒钟后,探测器的三条腿会马上打开并锁定登陆的正确位置。当探测器距离火星表面1,800米高的时候,释放降落伞,并点燃推进器,在剩下的高度里缓慢地降落。

为了节约资金,美国航天局简化决定何时关闭推进器的机械装置。为了替代使用在其他航天飞船上的雷达,他们在探测器的腿上安装了一个相对便宜的接触开关,设置了一个计算机位来控制关闭燃料。不过,引擎将一直燃烧直到探测器的腿“已着陆”。

不幸的是,故障检验委员会发现在他们测试的绝大多数用例中,当探测器的腿打开并准备着陆时,一个机械振动同时也绊到了着陆开关,设置了一个致命的数据位。所以,最大的可能是,电脑认为探测器已经着陆,所以关闭了推进器,而火星极地登陆者探测器从1,800米的高空直接摔到火星表面被砸得粉碎。

其结果是灾难性的,但是隐藏在背后的原因是简单的。登陆者号由多个测试小组测试。其中一个测试小组负责测试腿的折叠步骤,而另一个测试小组负责测试登陆者号在正确位置着陆的过程。第一个测试小组从来不察看着陆数据位是否被正确设定,因为那不是他们负责的工作范围;第二个测试小组总是在开始测试工作前重启电脑,清除数据位。两个测试小组都各自很好地完成了自己的测试工作,但是结合两部分的测试工作却没有做好。

 

Patriot Missile Defense System, 1991

The U.S. Patriot Missile Defense System is a scaled-back version of the Strategic Defense Initiative (“Star Wars”) program proposed by President Ronald Reagan. It was first put to use in the Gulf War as a defense against Iraqi Scud missiles. Although there were many news stories touting the success of the system, it did fail to defend against several missiles, including one that killed 28 U.S. soldiers in Dhahran, Saudi Arabia. Analysis found that a software bug was the problem. A small timing error in the system’s clock accumulated to the point that after 14 hours, the tracking system was no longer accurate. In the Dhahran attack, the system had been operating for more than 100 hours.
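How a tiny per-tick error grows over a long uptime can be sketched as follows. The text above only says the timing error was small; the 0.1-second tick and the 23 fractional bits used here come from published post-incident analyses as best recalled, so treat both as assumptions, and the names `STORED_TICK` and `clock_drift_seconds` are invented for the example.

```python
from fractions import Fraction

FRACTION_BITS = 23       # assumed width of the fixed-point fraction
TICK = Fraction(1, 10)   # assumed clock tick of 0.1 second

# 0.1 has no finite binary expansion, so truncating it loses a little each tick.
STORED_TICK = Fraction(int(TICK * 2**FRACTION_BITS), 2**FRACTION_BITS)

def clock_drift_seconds(hours_running):
    """Accumulated difference between true time and the truncated clock."""
    ticks = hours_running * 3600 * 10
    return float(ticks * (TICK - STORED_TICK))

for hours in (1, 14, 100):
    print(hours, "hours ->", clock_drift_seconds(hours), "seconds of drift")
```

Under these assumptions the drift grows linearly with uptime, so a system left running for days ends up tracking with a clock that is off by a noticeable fraction of a second.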

爱国者导弹防御系统,1991

美国爱国者导弹防御系统是罗纳德.里根总统提出的战略防御计划(“星球大战”计划)的缩减版。在海湾战争中首次用来防御伊拉克飞毛腿导弹。尽管有很多新闻报道吹捧这个系统的成功,但它确实未能拦截几枚导弹,包括那枚在沙特阿拉伯达兰导致28名美国士兵牺牲的导弹。分析发现这个问题是一个软件缺陷造成的。系统时钟有一个微小的计时误差,累积14小时后,跟踪系统就不再精确了。在达兰遭到攻击时,这个系统已经连续运行了100多个小时。

 

The Y2K (Year 2000) Bug, circa 1974

Sometime in the early 1970s a computer programmer---let’s suppose his name was Dave---was working on a payroll system for his company. The computer he was using had very little memory for storage, forcing him to conserve every last byte he could. Dave was proud that he could pack his programs more tightly than any of his peers. One method he used was to shorten dates from their 4-digit format, such as 1973, to a 2-digit format, such as 73. Because his payroll program relied heavily on date processing, Dave could save lots of expensive memory space. He briefly considered the problems that might occur when the current year hit 2000 and his program began doing computations on years such as 00 and 01. He knew there would be problems but decided that his program would surely be replaced or updated in 25 years and his immediate tasks were more important than planning for something that far out in time. After all, he had a deadline to meet. In 1995, Dave’s program was still being used, Dave was retired, and no one knew how to get into the program to check if it was Y2K compliant, let alone how to fix it.
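A minimal sketch of what goes wrong with Dave's 2-digit years, and of one common style of repair (date windowing). The function names and the pivot value of 50 are hypothetical choices for illustration, not details from the story.

```python
def years_between_2digit(start_yy, end_yy):
    """Naive arithmetic on 2-digit years, as in Dave's payroll program."""
    return end_yy - start_yy

def expand_year(yy, pivot=50):
    """Windowing: treat yy >= pivot as 19yy and yy < pivot as 20yy."""
    return 1900 + yy if yy >= pivot else 2000 + yy

def years_between_windowed(start_yy, end_yy, pivot=50):
    """Same calculation after expanding both years to 4 digits."""
    return expand_year(end_yy, pivot) - expand_year(start_yy, pivot)

# An employee hired in 1973, with payroll run in 2000 (stored as 73 and 00).
print(years_between_2digit(73, 0))    # negative length of service
print(years_between_windowed(73, 0))  # 27 years
```

Windowing only postpones the ambiguity to the pivot year; storing full 4-digit years removes it.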

It’s estimated that several hundred billion dollars were spent, worldwide, to replace or update computer programs such as Dave’s, to fix potential Year 2000 failures.

千年虫问题,大概1974

20世纪70年代初的某一天,有一个计算机程序员,我们叫他戴夫,正在为他所在的公司开发一个薪资系统。他当时使用的电脑存储内存很小,迫使他只能节约每一个字节。戴夫很得意能把程序压缩的比其他任何一个同事都小。其中的一个方法就是,用更短的2位日期格式来替代4位日期格式,比如用73替代1973。因为薪资系统大量依赖日期数据处理,所以戴夫节约了很多昂贵的内存空间。他简单的考虑过,程序可能在2000年的时候出现类似用00年和01年计算的问题。他知道这是一个问题,但他认为25年后他的程序肯定已经被替换或更新过了,而且显然他当前的任务要比计划一些将来的事情更重要。毕竟,他需要遵守最后期限。在1995年,戴夫开发的程序还在使用,戴夫却已经退休了,没有人知道如何进入程序来检查是否有千年虫问题,更不用说如何去修复这个问题了。

据估计,在全球范围内,大概花费了数千亿美元来替换或更新像戴夫那样的计算机程序,以此来修复潜在的千年虫问题。

 

Dangerous Viewing Ahead, 2004

On April 1, 1994, a message was posted to several Internet user groups and then quickly circulated as an email that a virus was discovered embedded in several JPEG format pictures available on the Internet. The warning stated that simply opening and viewing the infected pictures would install the virus on your PC. Variations of the warning stated that the virus could damage your monitor and that Sony Trinitron monitors were “particularly susceptible.”

Many heeded the warning, purging their systems of JPEG files. Some system administrators even went so far as to block JPEG images from being received via email on their systems.

Eventually people realized that the original message was sent on “April Fools Day” and that the whole event was nothing but a joke taken too far. Experts chimed in that there was no possible way viewing a JPEG image could infect your PC with a virus. After all, a picture is just data, it’s not executable program code.

Ten years later, in the fall of 2004, a proof-of-concept virus was created, proving that a JPEG picture could be loaded with a virus that would infect the system used to view it. Software patches were quickly made and updates distributed to prevent such a virus from spreading. However, it may only be a matter of time until a means of transmission, such as an innocuous picture, succeeds in wreaking havoc over the Internet.

危险的预见,2004

1994年4月1日,一条消息被发布到英特网上的几个用户组,而且很快以邮件的形式散布开来,声称在英特网上发现了一个隐藏在JPEG图片格式中的病毒。警告声称,只要打开查看被感染的图片,病毒就会自动安装到你的电脑。警告演变声称,病毒可能会损坏你的显示器,尤其是索尼Trinitron显示器“更容易被感染”。

很多人听从警告,清除了系统内所有的JPEG文件。有些系统管理员竟然阻止通过邮件收到的JPEG图片。

最后人们意识到这条消息最初是在“愚人节”那天发布的,而且整个事件都不存在,只是一个开大了的玩笑罢了。而且专家指出,纯粹的查看JPEG图片是不可能感染电脑的。毕竟,图片只是数据而已,不是可执行的程序代码。

十年后,在2004年秋天,一个概念证明病毒出现了,它证明JPEG图片是可以携带病毒,而且可以感染查看该图片的系统的。很快推出并分发了软件补丁和更新,以便防止这类病毒的蔓延。然而,某种传播手段(比如一张看似无害的图片)最终在英特网上造成严重破坏,可能只是时间问题。

 

Study Notes:

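The state of the art  Fixed expression; can mean “the current state” or “the most advanced”

Put … through its paces  Fixed expression; can mean “test its performance” or “check what it can do”

Take … for granted  Fixed expression; can be understood as “regard as a matter of course”

We should take nothing for granted  We should not leave anything to luck

Cutting-edge  The newest (research/report)

Let alone  Not to mention; to say nothing of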

