The objective of this document is to present a proposal that Evaluation and Test become a Key Process Area (KPA) in the SEI Capability Maturity Model (CMM). The first section addresses the scope of what is meant by evaluation and test. The second section identifies the justifications for making this a separate KPA. The third section presents the proposed KPA, including its definition, goals, commitment to perform, activities performed, measurements and analysis, and verifying implementation. The final section addresses integrating this KPA with the existing KPAs, including identifying which level to assign it to and some repackaging suggestions for existing KPAs.
(Note: "verification" and "validation" are strictly defined in ISO 9000.)
2. DEFINING EVALUATION AND TEST
(Verification and Validation)
Evaluation is the activity of verifying the various system specifications and models produced during the software development process. Testing is the machine based activity of executing and validating tests against the code. Most software organizations define evaluation and test very narrowly. They use it to refer to just the activities of executing physical test cases against the code. In fact, many companies do not even assign testers to a project until coding is well under way. They further narrow the scope of this activity to just function testing and maybe performance testing.
This view is underscored in the description of evaluation and test in the current CMM. It is part of the Software Product Engineering KPA. The activities in this KPA, activities 5, 6, and 7, only use code based testing for examples and only explicitly mention function testing. Other types of testing are euphemistically referenced by the phrase “...ensure the software satisfies the software requirements”.
People who build skyscrapers, on the other hand, thoroughly integrate evaluation and test into the development process long before the first brick is laid. Evaluations are done via models to verify such things as stability, water pressure, lighting layouts, power requirements, etc. The software evaluation and test approach used by many organizations is equivalent to an architect waiting until a building is built before testing it and then only testing it to ensure that the plumbing and lighting work.
The CMM further compounds the limited view of evaluation and test by making a particular evaluation technique, peer reviews, its own KPA. This implies that prior to the delivery of code the only evaluation going on is via peer reviews and that this is sufficient. The steps in the evaluation and test of something are: define the completion/success criteria, design cases to cover these criteria, build the cases, perform/execute the cases, verify the results, and verify that everything has been covered. Peer reviews provide a means of executing a paper based test. They do not inherently provide the success criteria, nor do they provide any formal means for defining the cases, if any, to be used in the peer review. They are also fundamentally subjective. Therefore, the same misconceptions that lead a programmer to introduce a defect into the product may cause them to miss the defect in the peer review.
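The generic steps above (define criteria, design cases to cover them, execute, verify results, verify coverage) can be sketched as a minimal, generic workflow. This is purely illustrative; all names and the sample criteria are hypothetical:

```python
# Hypothetical sketch of the evaluation/test steps described above:
# define criteria, design a case per criterion, execute, verify results,
# and verify that every criterion was covered.

def evaluate_deliverable(deliverable, criteria):
    """Run one evaluation cycle and report pass/fail plus coverage."""
    # Design one case per success criterion.
    cases = list(criteria.items())
    # Execute the cases against the deliverable.
    results = {name: check(deliverable) for name, check in cases}
    # Verify the results.
    passed = all(results.values())
    # Verify that every defined criterion was actually exercised.
    covered = set(results) == set(criteria)
    return {"passed": passed, "covered": covered, "results": results}

# Example: evaluating a requirements document for two desired characteristics.
doc = {"rules": ["R1", "R2"], "glossary": ["term A"]}
criteria = {
    "has_rules": lambda d: len(d["rules"]) > 0,
    "has_glossary": lambda d: len(d["glossary"]) > 0,
}
report = evaluate_deliverable(doc, criteria)
```

Note that, unlike a peer review, the success criteria here are explicit and the coverage check is mechanical rather than subjective.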
A robust scope for evaluation and test must encompass every project deliverable at each phase in the development life cycle. It must also address each desired characteristic of each deliverable. It must address each of the evaluation/testing steps. Let's look at two examples: evaluating requirements and evaluating a design.
A requirements document should be complete, consistent, correct, and unambiguous. One step is to validate the requirements against the project/product objectives (i.e., the statement of “why” the project is being done). This ensures that the right set of functions are being defined. Another evaluation is to walk use-case scenarios through the functional rules, preferably aided by screen prototypes if appropriate. A third evaluation is a peer review of the document by domain experts. A fourth is to do a formal ambiguity review by non-domain experts. (They cannot read into the document assumed functional knowledge. It helps ensure that the rules are defined explicitly, not implicitly.) A fifth evaluation is to translate the requirements into a Boolean graph. This identifies issues concerning the precedence relationships between the rules as well as missing cases. A sixth is a logical consistency check with the aid of CASE tools. A seventh is the review, by domain experts, of the test scripts derived from the requirements. This “bite-size” review of the rules often uncovers functional defects missed in reviewing the requirements as a whole.
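A logical consistency check of the kind mentioned above can, at its simplest, be mechanized by expressing each rule as a Boolean constraint and searching for a satisfying assignment. This sketch is illustrative only; the rules are invented, and a real CASE tool would use far more scalable techniques:

```python
# Hypothetical sketch: check a small rule set for logical consistency by
# translating each rule into a Boolean constraint over named conditions
# and exhaustively searching truth assignments.
from itertools import product

def consistent(variables, rules):
    """Return True if some truth assignment satisfies every rule."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(rule(env) for rule in rules):
            return True
    return False

# Two rules from a fictitious specification:
#   R1: if order_valid then ship
#   R2: if ship then order_valid
rules = [
    lambda e: (not e["order_valid"]) or e["ship"],
    lambda e: (not e["ship"]) or e["order_valid"],
]
ok = consistent(["order_valid", "ship"], rules)
# Adding a self-contradictory rule R3 ("ship and not ship") makes the
# rule set unsatisfiable, i.e., logically inconsistent.
bad = consistent(["order_valid", "ship"],
                 rules + [lambda e: e["ship"] and not e["ship"]])
```

The same Boolean representation also exposes missing cases: any reachable assignment for which no rule prescribes an outcome is a gap in the requirements.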
Evaluating a design can also take a number of tacks. One is walking tests derived from the requirements through the design documents. Another is building a model to verify design integrity (e.g., a model built of the resource allocation scheme for an operating system to ensure that deadlock never occurs). A third is building a model to verify performance characteristics. A fourth is comparing the proposed design against existing systems at other companies to ensure that the expected transaction volumes and data volumes can be handled via the configuration proposed in the design.
Only some of the above evaluations were executed via peer reviews. None of the above were code based. Neither of the above examples of evaluation was exhaustive. There are other evaluations of requirements and designs that can be applied as necessary. The key point is that a deliverable has been produced (e.g., a requirements document); before we can say it is now complete and ready for use in the next development step we need to evaluate it for the desired/expected characteristics. Doing this requires more sophistication than just doing peer reviews.
That is the essence of evaluation and test. A pre-defined set of characteristics, defined as explicitly as possible, is validated against a deliverable. For example, when you were in school and took a math test the instructor compared your answers to the expected answers. The instructor did not just say they look reasonable or they're close enough. The answer was supposed to be 9.87652. Either it was or it was not. Also, the instructor did not wait until the end of the semester to review papers handed in early in the course. They were tested as they were produced. With the stakes so much higher in software development, can we be any less rigorous and timely?
Among the items which should be evaluated and tested are Requirements Specifications, Design Specifications, Data Conversion Specifications and Data Conversion code, Training Specifications and Training Materials, Hardware/Software Installation Specifications, Facilities Installation Specifications, Problem Management Support System Specifications, Product Distribution Support System Specifications, User Manuals, and the application code. Again this is not a complete list. The issue is that every deliverable called for in your project life cycle must be tested.
The evaluation and test of a given deliverable may span multiple phases of the project life cycle. More and more software organizations are moving away from the waterfall model of the life cycle to an iterative approach. For example, a Design Specification might be produced via three iterations. The first iteration defines the architecture - is it manual or automated, is it centralized or distributed, is it on-line or batch, is it flat files or a relational data base, etc. The second iteration might push the design down to identifying all of the modules and the inter-module data path mechanisms. The third iteration might define the intra-module pseudo-code. Each of these iterations would be evaluated for the appropriate characteristics.
The types of evaluation and test must be robust. This includes, but is not limited to, verifying functionality, performance, reliability-availability-serviceability, usability, portability, maintainability, and extendibility.
In summary, each deliverable at each phase in its development should be evaluated/tested for the appropriate characteristics via formal, disciplined techniques.
3. JUSTIFICATION FOR A SEPARATE KPA
There are five significant reasons which justify having a separate Evaluation and Test KPA: evaluation and test's role in accelerating the cultural change towards a disciplined software engineering process, the role of evaluation and test in project tracking, the portion of the development and maintenance budget spent on evaluation and test, the impact of evaluation and test disciplines on the time and costs to deliver software, and the impact of residual defects in software.
3.1 Accelerating Cultural Change
Electrical engineers and construction engineers are far more disciplined than software engineers. Electrical engineers produce large scale integrated circuits at near zero defect even though they contain millions of transistors. What is often lost in the widely discussed defect in the Pentium processor is that it was one defect in 3,100,000 transistors. When was the last time you saw software which had only one defect in 3,100,000 lines of code? The hardware engineers do not achieve better results because they are smarter than the software engineers. They achieve quality levels orders of magnitude higher than software because they are more disciplined and rigorous in their development and testing approach. They are willing to invest the time and effort required to ensure the integrity of their products. They recognize the impact that defects have, economic and otherwise.
Construction engineers face similar challenges in constructing skyscrapers. In their world a “system crash” means the building collapsed. In regions of the world which have and enforce strict building codes that just does not happen. Again, this can be traced to the discipline of their development and testing approach.
Software, on the other hand, is a different matter. Gerald Weinberg's statement that “if builders built buildings the way software people build software, the first woodpecker that came along would destroy civilization” is on the mark.
We have to recognize that the software industry is very young as compared to other engineering professions. You might say that it is fifty years old, if you start with Grace Hopper as the first programmer. (A bit older if you count Ada Lovelace as the first.) However, a more realistic starting date is about 1960. That is just over thirty five years. By contrast, the IEEE celebrated their 100th anniversary in 1984. That means that in 1884 there were enough electrical engineers around to form a professional society. In 1945, by contrast, Ms. Hopper would have been very lonely at a gathering of software engineers.
As a further contrast, construction engineering goes back over 5,000 years. The initial motivation for creating nations was not self defense; it was the necessity to manage large irrigation construction projects. We even know the names of some of these engineers. For example, in 2650 BC Imhotep was the chief engineer for the step pyramid of Djoser (aka Zoser) in Egypt. In fact, he did such a good job they made him a god.
The electrical engineers and construction engineers did not start out with inherently disciplined approaches to their jobs. The discipline evolved over many years. It evolved as they came to understand the need for discipline and the implications of defects in their work products. Unfortunately, we do not have thousands of years or even a hundred years to evolve the software profession. We are already building business critical and safety critical software systems. Failures in this software are causing major business disruptions and even deaths at an alarmingly increasing rate. (See “Risks to the Public” by Peter Neumann.)
Moving the software industry from a craftsman approach to a true engineering level of discipline is a major cultural shift. The objective of the CMM is, first and foremost, a mechanism for inducing this cultural change for software engineers. However, a culture does not change voluntarily unless it understands the necessity for change. It must fully understand the problems being solved by evolving to the new cultural paradigm. This, finally, brings us to the role of testing in accelerating the cultural change to a disciplined approach (I know you were beginning to wonder when I would tie this together).
In the late 1960's, IBM was one of the first major organizations to begin installing formal software engineering techniques. This began with the use of the techniques espoused by Edsger Dijkstra and others. Ironically, it was not the software developers who initiated this effort. It was the software testers. The initial efforts were started in the Poughkeepsie labs under a project called “Design for Testability” headed by Philip Carol.
Phil was a system tester in the Software Test Technology Group. This group was responsible for defining the software testing techniques and tools to be used across the entire corporation. Nearly thirty years ago they began to realize that you could not test quality into the code. You needed to address the analysis, design, and coding processes as well as the testing process. They achieved this insight because as testers they thoroughly understood the problem since testing touches all aspects of software development. Testers inherently look for what is wrong and try to understand why.
It was this understanding of the problem and the ability to articulate the problem to developers that allowed for a rapid change in the culture. As improved development and test techniques and tools were installed, the defect rate in IBM's OS operating system dropped by a factor of ten in just one release. This is a major cultural shift occurring in a very short time, especially given that it involved thousands of developers in multiple locations.
The rapidity of the change was aided by another factor related to testing in addition to the problem recognition. This was the focused feedback loop inherent in integrating the testing process with the development process. As the development process was refined, the evaluation and test process was concurrently refined to reflect the new success criteria. As developers tried new techniques they got immediate feedback from testers as to how well they did because the testers were specifically validating the deliverables against the new yardstick.
A specific example is the installation of improved techniques for writing requirements which are unambiguous, deterministic, logically consistent, complete, and correct. Analysts are taught how to write better requirements in courses on Structured Analysis and courses on Object-Oriented Analysis. If ambiguity reviews are done immediately after they write up their first functional descriptions, the next function they write is much clearer out of the box. The tight feedback loop of writing a function, then evaluating the function, accelerates their learning curve. Fairly quickly the process moves from defect detection to defect prevention - they are writing clear, unambiguous specifications.
Contrast this to the experience of the software industry as a whole. The structured techniques and the object oriented techniques have been available for over twenty-five years (yes, O-O is that old). Yet the state of the practice is far behind the state of the art. The issue is that an organization does not fully accept or understand a solution (e.g., the software engineering tools and techniques) unless it understands the problem being solved. Integrated evaluation and test is the key to problem comprehension. “Integrated evaluation and test” is defined here as integrating testing into every step in the software development process. It is thus the key to the necessary feedback loops required to master a technique. Any process without tight feedback loops is a fatally flawed process. Evaluation and test is then the key to accelerating the cultural change.
A project plan consists of tasks, dependencies, resources, schedules, budgets, and assumptions. Each task should result in a well defined deliverable. That deliverable needs to be verified that it is truly complete. If you do not evaluate/test the task deliverables for completeness you cannot accurately track the true status of the project.
For example, Requirements Specifications always seem to be “done” on schedule. This is because many organizations do not formally evaluate the Requirements Specification. Later in the project they find themselves completing the definition of the requirements during design, coding, testing, and even production. What, therefore, did it really mean to say that the task of writing the requirements was completed?
Incomplete “completed” tasks can also have a ripple effect on the completion status of subsequent tasks. In the above scenario, what is the impact of finding a requirements deficiency during code based testing? The “completed” Requirements Specification must be revised. The “completed” Design Specification must be revised. The “completed” code must be revised. The “completed” User Manuals must be revised. The “completed” Training Materials must be revised. The “completed” test cases must be revised.
The objective of project tracking is to give management and the project team a clear understanding of where the project stands. Without evaluation/testing integrated into every step in the project you can never be sure of what is and is not really completed. Given that Software Project Tracking and Oversight is a KPA and it depends on evaluation and test to perform the tracking, then evaluation and test as a KPA is a necessary preceding activity.
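The distinction between "claimed done" and "verified done" can be made concrete in a tracking tool. A minimal sketch (all field names are hypothetical):

```python
# Hypothetical sketch: a task counts as complete only when its deliverable
# has passed its evaluation, so project status reflects verified work
# rather than claimed work.

def project_status(tasks):
    """tasks: list of dicts with 'name', 'claimed_done', 'evaluation_passed'."""
    claimed = [t for t in tasks if t["claimed_done"]]
    verified = [t for t in tasks if t["claimed_done"] and t["evaluation_passed"]]
    return {"claimed_done": len(claimed), "verified_done": len(verified)}

tasks = [
    # Requirements "finished" on schedule, but never formally evaluated.
    {"name": "requirements", "claimed_done": True,  "evaluation_passed": False},
    {"name": "design",       "claimed_done": True,  "evaluation_passed": True},
    {"name": "code",         "claimed_done": False, "evaluation_passed": False},
]
status = project_status(tasks)
```

Reporting only the "claimed" count here would show the project two-thirds complete; counting verified deliverables shows the true status.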
A major pragmatic factor in determining what should and should not be a separate KPA is what portion of the software development budget and staff are involved in the activity. The more significant the activity is in these terms the more focus it should receive.
There have been numerous studies documenting how project costs are allocated across the various activities. In these studies just the code based testing accounts for 35% to 50% of the project costs. This is true for both software development and for software maintenance. Factor in the effort to perform evaluations and this number is higher.
Organizations using any level of discipline in their testing have a tester to developer ratio of at least 1:3. More and more software vendors are moving to a 1:1 ratio. At times the NASA Space Shuttle project has had a ratio of 3:1 and even 5:1!
Simply put, any activity which consumes a third to a half of the budget and a fourth to a half of the resources should definitely be addressed by its own KPA.
Numerous studies show that the majority of defects have their root cause in problems with the requirements definition. In one study quoted by James Martin, over 50% of all software defects are caused by incomplete, incorrect, inaccurate, and/or ambiguous requirements. Even more telling is that over 80% of the costs of defects have their roots in requirements based errors.
Other studies show that the earlier you find a defect the cheaper it is to fix. A defect found in production can cost 2,000 times more than the same defect found in an evaluation of the requirements.
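The escalation can be made explicit with a simple cost model. The intermediate multipliers below are invented for illustration; only the 2,000x requirements-to-production figure comes from the text above:

```python
# Hypothetical cost-escalation model. Only the 2,000x production figure is
# taken from the studies cited in the text; the phase multipliers in
# between are invented for illustration.
cost_multiplier = {
    "requirements": 1,
    "design": 5,
    "code": 10,
    "system_test": 50,
    "production": 2000,
}

def fix_cost(base_cost, phase_found):
    """Cost to fix a defect, given the phase in which it is found."""
    return base_cost * cost_multiplier[phase_found]

# A defect that costs $50 to fix during requirements evaluation:
early = fix_cost(50, "requirements")   # found in a requirements review
late = fix_cost(50, "production")      # the same defect, found in production
```

Under this model the same $50 requirements defect becomes a $100,000 production defect, which is why early evaluation pays for itself.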
The issue is scrap and rework. This is the primary cause of cost and schedule overruns on projects. The plan may have identified the initial set of tasks to be done. However, due to defects found later, “completed” tasks must now be redone. The “re-do” task was not in the original plan. As the number of tasks requiring rework grows, the cost and schedule overruns accumulate. Integrating evaluation and test throughout the project life cycle minimizes scrap and rework, bringing the costs and schedules back under control.
Integrated evaluation and test can further shorten schedules by allowing for more concurrent activities. When Requirements Specifications are not formally evaluated, the design and coding activities often result in numerous changes to the scope and definition of the functions being delivered. For this reason, work does not start on the User Manuals and Training Materials until code based testing is well underway. Until then no one is confident enough in the system definition.
Similarly, poorly defined requirements do not provide sufficient information from which to design test scripts. The design and building of test cases often does not start until coding is well underway.
These two scenarios force the development process to be linear: requirements, then design, then code, then test, then write manuals. If the Requirements Specification is written at a deterministic level of detail (i.e., given a set of inputs and an initial system state you should be able to determine the exact outputs and the new system state by following the rules in the specification), then test case design and the writing of the manuals can go on concurrently with the system design. This in turn shortens the elapsed time required to deliver the system. However, creating deterministic specifications requires formal evaluation of that specification.
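The concurrency argument hinges on the specification being deterministic, i.e., executable in principle. A minimal sketch of what "deterministic" means here; the withdrawal rules are invented for illustration:

```python
# Hypothetical sketch of a deterministic specification: given the inputs
# and the current state, the rules determine the exact output and the
# exact next state.

def spec(state, amount):
    """Fictitious withdrawal rule set for an account."""
    if amount <= 0:
        return state, "REJECT: amount must be positive"
    if amount > state["balance"]:
        return state, "REJECT: insufficient funds"
    new_state = {"balance": state["balance"] - amount}
    return new_state, f"DISPENSE {amount}"

# Because every (input, state) pair has one defined outcome, test cases
# and the user manual's worked examples can be written from the spec
# alone, before any design or code exists.
state = {"balance": 100}
state, out1 = spec(state, 30)
state, out2 = spec(state, 200)
```

If the outcome of some input cannot be determined from the rules alone, the specification is not yet deterministic, and that gap is exactly what a formal evaluation of the specification surfaces.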
In summary, integrated evaluation and test reduces schedules and project costs by minimizing scrap and rework and allowing more activities to be performed concurrently. These types of gains can not be accomplished without integrated evaluation and test. Since time to market and cost to market are key issues for any software organization and testing is the key to achieving improvements in this area, then evaluation and test should be a KPA.
The cost of defects is rising at an exponential rate. This has two causes. The first is that our dependence on software is greater than ever. When it fails its impact is proportionate to that dependence. The second cause is litigation. There is a significant increase in the number of lawsuits concerning software quality. These are usually multi-million dollar exercises.
The support costs for software vendors are a growing concern. Microsoft receives almost 25,000 calls per day at an average cost per call somewhere between $50 and $100. This number is pre-Windows 95, which was expected to increase the volume by 4X. Sending out incremental bug fix releases also costs millions of dollars for some vendors. You also have to factor in the costs for developers to fix the defects and the opportunity loss caused by efforts going into fixing defects instead of creating new functionality.
Quality and the lack thereof also moves market share. Ashton-Tate went from being the industry leader in PC based data base software to being out of business due to large numbers of defects in one release of their main product. Market share for dBase went from 90%+ to less than 45%. Their acquisition by Borland did not stop the slide. Furthermore, only one year after their acquisition only 2% of all the people who had worked for Ashton-Tate still had jobs at Borland.
The direct costs of defects can be staggering for the end users of the software. Both United Airlines and American Airlines estimate that they lose $20,000 a minute in unrecoverable income when their reservation system goes down. A large manufacturer estimates they lose $50,000 a minute when their assembly line goes down. A large credit card company estimates they lose over $160,000 a minute when their credit authorization system goes down. Million dollar defects are now commonplace. For example, if GM has a defect in the firmware that requires reloading the control program in an EPROM, it could affect 2.5 million automobiles at an average cost to GM of $100 per car. There has even been an instance of a BILLION dollar loss due to a single defect. It was caused by a round-off error.
Some estimates place the average cost of a severity one defect in production in the tens of thousands and even the hundreds of thousands on some applications. You can do a lot of evaluation and test for $100,000. You could add an additional senior tester to the organization and, counting their salary and overhead costs, the break-even point occurs when they find one or two defects that would have slipped through to production.
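The break-even arithmetic in the last sentence can be made explicit. The salary and defect-cost figures below are invented, chosen only to match the order of magnitude quoted above:

```python
# Hypothetical break-even arithmetic for adding one senior tester.
# Figures are illustrative, consistent with the ranges quoted in the text.
annual_cost = 150_000               # tester salary plus overhead
cost_per_escaped_defect = 100_000   # severity-one defect reaching production

# Ceiling division: how many escaped defects the tester must catch per
# year for the position to pay for itself.
defects_to_break_even = -(-annual_cost // cost_per_escaped_defect)
```

Under these assumptions the tester breaks even by catching two production-bound defects a year, which matches the "one or two defects" claim above.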
When you are dealing with safety critical systems how do you cost out the value of a human life? There have been hundreds and hundreds of deaths due to software defects. With software playing a bigger role in transportation and in the medical profession, the risk of deaths is rapidly increasing.
The legal profession is beginning to take note of these costs. Many feel we should be held to the same standards as other engineering professions. This leads to the exposure of software product liability and professional malpractice. The financial exposure in such suits is enormous. To date the issue of setting legal precedents in this area is still in a state of flux. However, the trend is clear. Software professionals and their products will be held to the same standards of care and professionalism as other engineers and their products.
Currently, most of the lawsuits related to software quality are being brought to court on the grounds of breach of contract. We (Bender & Associates) have been involved in a number of these as expert witnesses. We have never lost a case. This is because in each instance we have been on the side of the user of the software, not the producer.
Few software vendors can demonstrate that they have applied a reasonable level of due diligence in the evaluation and test of their software. The emphasis in most vendors is on dates and functionality, not quality. The result is that in half of the cases in which we have testified, the vendor has gone out of business as a direct result of the cost of litigation and the cost of the award to the customer.
If the CMM were addressing the medical profession, there is no doubt that the avoidance of malpractice suits would be a KPA. Well, this issue is now on our doorsteps as software professionals. It requires a disciplined approach to evaluation and test to minimize this exposure.
The net is that the direct and indirect cost of defects is already huge and rising dramatically. Defect detection and defect avoidance require fully integrated evaluation and test. This alone is sufficient to justify an evaluation and test KPA.
Evaluation is the activity of ensuring the integrity of the various system specifications and models produced during the software development process. Testing is the machine based activity of executing tests against the code. The purpose of Software Evaluation and Test is to validate (i.e., is this what we want) and verify (i.e., is this correct) each of the software project deliverables, identifying any defects in those deliverables in a timely manner.
Software Evaluation and Test involves identifying the deliverables to be evaluated/tested; determining the types of evaluations/tests to be performed; defining the success criteria for each evaluation/test; designing, building, and executing the necessary evaluations/tests; verifying the evaluation/test results; verifying that the set of tests fully cover the defined evaluation/test criteria; creating and executing regression libraries to re-verify deliverables that have been modified; and logging, reporting, and tracking defects identified.
The initial deliverable to be evaluated is the software requirements. Subsequently, the majority of the evaluation and test is based on the validated software requirements.
The software evaluation and test may be performed by the software engineering group and/or one or more independent test organizations, together with the end users and/or their representatives.
Goal 1 Quantitative and qualitative evaluation/test criteria are established for each of the software project deliverables.
Goal 2 Evaluations/tests are executed in a timely manner to verify that the success criteria have been met.
Goal 3 Evaluation/testing is sufficiently effective to minimize the impact of defects, such as scrap and rework during development and operational disruptions after implementation.
Goal 4 Defects and other variances identified are logged and tracked through to their successful closure.
Commitment 1 The project follows a written organizational policy for evaluating/testing the software project deliverables.
This policy typically specifies:
1. The organization identifies a standard set of software project deliverables to be evaluated/tested, the characteristics to be evaluated/tested, and the levels of verification criteria to be considered.
Examples of deliverables to be evaluated and tested include:
- requirements specifications,
- design specifications,
- user manuals,
- training materials,
- data conversion specifications and support systems.
Examples of characteristics to evaluate/test for are:
- functional integrity,
- performance.
Examples of levels of verification criteria are (using code based testing as the example):
- 100% of all statements and branch vectors;
- 100% of all predicate conditions;
- 100% of all first order simple set-use data flows; and
- 100% of all first order compound set-use data flows.
Examples of levels of verification criteria are (using requirements based testing as the example):
- 100% of all equivalence classes;
- 100% of all functional variations; and
- 100% of all functional variations, sensitized to guarantee the observability of defects.
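To illustrate what requirements based verification criteria mean in practice, the following is a minimal sketch of equivalence class test design. All names, the requirement, and the 50.00 threshold are invented for this example; they are not part of the proposed policy.

```python
# Hypothetical requirement: an order qualifies for free shipping when the
# order total is at least 50.00 and the destination is domestic.
def free_shipping(total, domestic):
    return total >= 50.00 and domestic

# One representative test per equivalence class, plus the boundary value.
cases = [
    (49.99, True,  False),  # class: total below threshold, domestic
    (50.00, True,  True),   # boundary: total exactly at the threshold
    (50.01, True,  True),   # class: total above threshold, domestic
    (75.00, False, False),  # class: qualifying total, non-domestic
]
for total, domestic, expected in cases:
    assert free_shipping(total, domestic) == expected
```

Covering "100% of all equivalence classes" means every such class contributes at least one test; sensitizing for observability additionally ensures a defect in the rule would change a visible output.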
2. The organization has a standard set of methods and tools for use in evaluation/testing and defect tracking.
3. Each project identifies the deliverables to be evaluated/tested, the phase(s) in which they will be evaluated/tested, and how they will be evaluated/tested in each phase.
4. Evaluations and tests are performed by trained testers.
5. Evaluation and testing focus on the software project deliverables, not on the producer.
Commitment 2 Senior management supports and enforces the requirement that projects meet their pre-defined success criteria before installation into production in the users'/customers' environment.
1. Senior management reviews and approves the overall evaluation and testing objectives for the software system.
2. Senior management reviews and approves that the system has met those criteria prior to installation.
Author's note: One of the biggest enemies of quality is unreasonable schedules. If the team is going to be measured solely on meeting dates, then the test plan will be bypassed. Management must measure functionality, resources, schedules, and quality in determining a project's success, not just dates.
Ability 1 Adequate resources and funding are provided for planning and executing the evaluation and testing tasks.
1. Sufficient numbers of skilled individuals are available for performing the evaluation and testing activities, including:
- overall evaluation/test planning,
- evaluation/test coordination,
- evaluation/test case design,
- evaluation/test case implementation,
- evaluation/test execution,
- evaluation/test results verification,
- evaluation/test coverage analysis, and
- defect logging and tracking.
2. Tools to support the testing effort are made available, including:
- test case design tools,
- test data generators,
- test drivers, and
- test coverage monitors.
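The role of a test driver and a results-compare utility can be sketched as follows. This is an illustrative skeleton only; the function names and the use of the built-in abs as the unit under test are invented for the example.

```python
def run_suite(function_under_test, cases):
    """Minimal test driver: execute each case and compare the actual
    result to the expected result, producing a pass/fail log."""
    log = []
    for case_id, args, expected in cases:
        actual = function_under_test(*args)
        if actual == expected:
            log.append((case_id, "PASS"))
        else:
            log.append((case_id, "FAIL: expected %r, got %r" % (expected, actual)))
    return log

# Usage: drive a unit under test (here, the built-in abs as a stand-in).
results = run_suite(abs, [("T1", (-3,), 3), ("T2", (4,), 4)])
```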
3. A test environment configuration is made available, including:
- hardware and software, dedicated to the testers, which mirrors the intended production configuration.
Ability 2 Members of the software testing staff receive the required training to perform their technical assignments.
Examples of training for evaluation and test include:
- evaluation and test planning;
- criteria for evaluation/test readiness and completion;
- use of the evaluation/testing methods and tools; and
- performing peer reviews.
Ability 3 Members of the software engineering staff whose deliverables will be evaluated and tested receive training on how to produce testable deliverables and orientation on the overall evaluation and testing disciplines to be applied to the project.
Refer to Ability 5 for an example of a testable deliverable.
Ability 4 The project manager and all of the software managers receive orientation in the technical aspects of the evaluation/testing criteria and disciplines to be applied to the project.
Examples of orientation include:
- the evaluation/testing methods and tools to be used;
- the entry and exit criteria for the various levels of evaluation/testing; and
- the defect resolution process.
Ability 5 The software engineers produce testable deliverables.
An example of a testable deliverable would be a requirements specification that had the following characteristics:
- the functional rules are written at a deterministic level of detail (i.e., given a set of inputs and an initial system state, you should be able to follow the rules in the specification and determine the outputs and the final system state);
- the specification is non-redundant;
- the specification is unambiguous; and
- the various requirements follow a consistent standard (e.g., standards for user interface definitions are followed which define function keys, intra-screen navigation, inter-screen navigation).
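A functional rule written at a deterministic level of detail can be expressed directly as executable logic: given the inputs, the outputs are fully determined. The late-fee rule below is hypothetical, invented purely to illustrate the characteristic.

```python
def late_fee(days_overdue, balance):
    """Hypothetical late-fee rule, stated deterministically: no fee when
    not overdue; a 5.00 flat fee up to 30 days overdue; beyond that,
    5.00 plus 1% of the balance for each additional day."""
    if days_overdue <= 0:
        return 0.00
    if days_overdue <= 30:
        return 5.00
    return 5.00 + 0.01 * balance * (days_overdue - 30)
```

A specification this precise is testable: every input falls into exactly one rule, so test designers can derive expected results without guessing.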
Activity 1 The overall evaluation and testing effort is planned and the plans are documented.
1. Identify the risks and exposures if defects propagate through the various project phases and into production. This information is used to determine how much evaluation and testing needs to be done.
Examples of risks to be evaluated are:
- the potential scrap and rework and resulting cost and schedule overruns which might be caused by defects in the requirements specifications;
- the potential cost per unit of time for system down time in production;
- the potential cost to customers and end users of inaccurate processing; and
- the potential risk to human lives in safety critical applications.
Note: The premise here is that testing is essentially an insurance policy. The overall evaluation and test strategy and its associated costs should be proportional to the potential bottom-line risks which defects could cause.
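The insurance analogy can be made concrete with a back-of-the-envelope expected-loss calculation. Every figure below is hypothetical, chosen only to show the shape of the reasoning.

```python
# All figures are hypothetical, for illustration only.
p_escape_untested = 0.30       # chance a critical defect reaches production
p_escape_tested   = 0.03       # chance with the planned evaluation/test effort
cost_of_failure   = 2_000_000.00   # downtime, scrap/rework, liability
test_budget       = 150_000.00     # the "premium"

# Expected loss avoided by the evaluation/test programme.
risk_reduction = (p_escape_untested - p_escape_tested) * cost_of_failure

# The premium is justified when the expected loss avoided exceeds it.
worthwhile = risk_reduction > test_budget
```

With these numbers the programme avoids roughly 540,000 in expected loss for a 150,000 premium; a project with trivial failure costs would justify a proportionally smaller effort.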
2. Identify the software project deliverables to be evaluated/tested.
Examples of software project deliverables to be evaluated/tested are:
- requirements specifications;
- design specifications;
- user manuals and built in help facilities;
- training manuals, courseware, and training support systems;
- data conversion procedures and data conversion support systems;
- hardware/software installation procedures and support systems;
- production cutover procedures and support systems (e.g., code that creates a temporary bridge between an existing system and its replacement, allowing some sites to run on the old and some on the new until full cutover is complete);
- production problem management procedures and support systems (e.g., the production help desk);
- product distribution procedures and support systems (i.e., the mechanisms for distributing updates and new releases, especially to widely distributed end users); and
- publications procedures and support systems (e.g., the mechanisms for physically publishing all of the copies of the manuals needed to support the system in production).
3. For each deliverable to be evaluated/tested determine the characteristics to be tested.
Examples of characteristics to be evaluated/tested are:
- functional integrity;
- reliability, availability, serviceability;
- portability (i.e., can this one code line be easily ported from one platform to another);
- maintainability (i.e., can fixes and minor incremental improvements be easily made); and
- extendibility (i.e., can major additions be made to the system without causing a major rewrite).
4. Determine the qualitative and quantitative success criteria for each deliverable and each characteristic evaluated and tested for the deliverable.
An example of the functional test criteria for code could be:
- the code is tested to verify that 100% of all functional variations derived from the requirements, fully sensitized for the observability of defects, have been run successfully; and
- 100% of the code's statements and branch vectors have been executed.
5. Determine the methods and tools required to evaluate/test each deliverable for each of its desired characteristics.
An example of evaluating a requirements specifications might involve:
- performing an ambiguity review;
- walking use-case scenarios through the requirements to validate completeness;
- building screen prototypes to validate the completeness;
- creating cause-effect graphs from the functional requirements to validate that the precedence rules are clear;
- doing a peer review with domain experts to validate completeness and accuracy;
- doing a logical consistency check of the rules via a CASE tool; and
- reviewing the test cases designed from the functional requirements with developers and end user / customers to validate the completeness and accuracy of the specifications from which they were derived.
Examples of testing tools include:
- test case design tools,
- test data generators,
- capture/playback tools,
- test drivers,
- test coverage monitors,
- test results compare utilities,
- memory leak detection tools,
- debuggers, and
- defect tracking tools.
6. Determine the stages (sometimes called levels) of testing and refine the quantitative and qualitative test criteria into entry and exit criteria for each phase of testing.
Examples of stages of code based testing include:
- unit testing with primary emphasis on white box structural testing, usually done by the coder;
- component testing with primary emphasis on black box functional testing and inter-unit interface testing, with some initial performance testing and initial usability testing;
- system testing with primary emphasis on inter-component interface testing, full thread functional testing, full performance testing, full usability testing, and full reliability/recoverability testing;
- inter-system integration testing with primary emphasis on inter-application interface testing and inter-application performance testing; and
- acceptance testing (a.k.a. beta testing) with emphasis on final validation of functional robustness, usability, and configuration testing.
An example of refining the success criteria by test stage is:
- the entry criteria into unit testing is a peer review of the code;
- the exit criteria from unit test is correct execution of 100% of the code statements and branch vectors;
- the entry criteria into component test is 100% execution of the "go right" statements and branches; and
- the exit criteria from component test is 100% execution of all functional variations derived from the requirements specification.
Note that the entry criteria into component test is less stringent than the exit criteria from unit test. This allows these activities to overlap in a controlled manner.
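The overlapping entry/exit criteria above can be expressed as simple gate checks. The metric names and values are invented for illustration; a real project would draw them from its coverage monitors.

```python
# Hypothetical coverage metrics for one unit at a point in time.
metrics = {
    "all_statement_branch_coverage": 0.92,  # unit-test exit requires 1.0
    "go_right_path_coverage":        1.00,  # component-test entry requires 1.0
    "functional_variation_coverage": 0.40,  # component-test exit requires 1.0
}

def may_exit_unit_test(m):
    # Exit criteria: 100% of all statements and branch vectors executed.
    return m["all_statement_branch_coverage"] >= 1.0

def may_enter_component_test(m):
    # Entry criteria: only the "go right" statements and branches, which is
    # deliberately less stringent, allowing a controlled overlap of stages.
    return m["go_right_path_coverage"] >= 1.0
```

Here the unit may enter component test even though it has not yet met the unit-test exit criteria, which is exactly the controlled overlap the note describes.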
7. For each deliverable, decompose it into units for evaluation and test and determine the optimal sequence for evaluating/testing the units.
For example, the unit testing of the code might be done in a sequence which minimizes the need for building scaffolding code to emulate interfaces to code not yet tested.
8. Define the methods and procedures for defect reporting and tracking to be used by the project.
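A defect reporting and tracking procedure, at minimum, records each defect and enforces its life cycle through to closure (per Goal 4). The sketch below is one possible shape; the states, fields, and transition table are hypothetical, not prescribed by this KPA.

```python
from dataclasses import dataclass, field

# Hypothetical defect life cycle; the states and transitions are illustrative.
VALID_TRANSITIONS = {
    "open":      {"in_repair", "rejected"},
    "in_repair": {"retest"},
    "retest":    {"closed", "in_repair"},  # re-open the repair if retest fails
}

@dataclass
class Defect:
    defect_id: str
    summary: str
    severity: int              # e.g. 1 (critical) .. 4 (cosmetic)
    status: str = "open"
    history: list = field(default_factory=list)

    def move_to(self, new_status):
        """Log and enforce the defect's progress toward closure."""
        if new_status not in VALID_TRANSITIONS.get(self.status, set()):
            raise ValueError(
                "illegal transition %s -> %s" % (self.status, new_status))
        self.history.append((self.status, new_status))
        self.status = new_status
```

The recorded history gives the audit trail needed to verify that every defect was tracked through to successful closure.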
Activity 2 Reconcile the evaluation/test plan with the overall development plan.
1. Verify the evaluation and test resources and schedules against the project schedules and constraints.
2. Reconcile the desired sequencing of units for evaluation and test against the availability of those units as defined in the development plan.
3. Get concurrence on the defect reporting and tracking mechanism from the developers.
Activity 3 Install the evaluation and testing infrastructure.
1. Acquire and install the testing tools needed for this project.
2. Acquire and install the test hardware and software configuration required to create and execute the tests.
3. Train management and staff on the evaluation and testing methods and tools to be used.
Activity 4 Perform the evaluation/testing for each deliverable, for each characteristic, at the designated test stages.
1. Design the evaluation/test cases using the identified methods and tools.
2. Physically implement the cases in their final “executable” form.
3. Perform the evaluations / execute the test cases.
4. Verify the evaluation/test results.