System performance¶
Learning Objectives: This material presents metrics for measuring and comparing the performance of computer systems.
It introduces well-known metrics, definitions, and standards used in benchmarking different computer systems and processors.
From the user's perspective, the performance of a computer system or processor is usually measured in time: how long it takes to execute a program or task. This naturally means examining what kind of computational task is run on different systems, to determine which processor or system handles such tasks fastest. For designers and developers, the focus is instead on identifying which hardware components, techniques, operating-system and firmware implementations, or software components prove to be the system's bottlenecks.
In general, a good metric for comparing computer systems or processors is how much performance has improved relative to previous solutions: for example, how many instructions a processor executes per unit of time, or the memory access time. The key is to select a metric appropriate for the parameters being compared, because processor implementations, control mechanisms, and so on vary greatly. For example, when a sequential processor is compared with a pipelined version implementing the same instruction set, the execution time of a single instruction increases slightly in the pipeline, yet overall performance improves.
Amdahl's Law¶
As noted, the most visible performance parameter (to users) is the execution time of programs in human time (Metric 1). A common way to enhance a computer system's performance is to upgrade system components, such as replacing the processor with a faster one, adding cores for parallel computing, or replacing memory chips with faster ones. While performance improves with such upgrades, measuring the achieved improvement isn't straightforward.
As early as the 1960s, Amdahl's Law (see figure below) was introduced. It describes how speeding up a system resource affects program execution time, based on how large a fraction of the overall time that resource is in use.
![](/media/images/tkj-amdahl.png)
Example:
- T_old = 100s: the original execution time is 100 seconds
- alpha = 0.6: the resource is in use 60% of the execution time
- k = 3: the resource promises a threefold speedup

The improved execution time is T_new = (1-0.6)*100s + (0.6*100s)/3 = 40s + 20s = 60s, and the relative speedup is T_old/T_new = 100s/60s ≈ 1.67.
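The calculation above can be sketched in Python; the values `t_old`, `alpha`, and `k` are those of the example.

```python
def amdahl_speedup(alpha, k):
    """Overall speedup when a fraction `alpha` of the execution time
    is accelerated by a factor of `k` (Amdahl's law)."""
    return 1.0 / ((1.0 - alpha) + alpha / k)

t_old = 100.0        # original execution time, seconds
alpha, k = 0.6, 3.0  # 60% of the time benefits from a 3x speedup

# Improved execution time: unaffected part + accelerated part.
t_new = (1.0 - alpha) * t_old + (alpha * t_old) / k
print(t_new)                               # 60.0
print(round(amdahl_speedup(alpha, k), 2))  # 1.67
```

Note that even as k grows without bound, the speedup is capped at 1/(1-alpha), here 2.5.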
Thus, Amdahl's Law reveals that speeding up an individual system resource does not proportionally speed up the entire system. This aligns with common sense, similar to upgrading parts of a mechanical system. If the new part doesn't address the bottleneck, performance improvements are marginal.
System Speed¶
Another common way to enhance performance is to increase system speed. When a computer system is controlled by a clock or multiple clocks, the clock cycle duration provides a unit for measuring instruction/program execution time and thus assessing speed (Metric 2).
Here,

clock cycle duration = 1 / clock frequency

(in seconds). The execution time of a program is

T = number of clock cycles for the program * clock cycle duration

(in seconds) or, equivalently, using the clock frequency:

T = number of clock cycles for the program / clock frequency

In processor microarchitectures, this could involve increasing the clock frequency (overclocking), changing how instructions are implemented (CISC/RISC), or adding stages to the pipeline and splitting instruction execution into smaller steps.
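As a small sketch of the two equivalent forms of the formula, with assumed illustrative values (a hypothetical 2 GHz clock and a program needing 4 billion cycles):

```python
clock_frequency = 2.0e9             # 2 GHz clock (assumed for illustration)
cycle_time = 1.0 / clock_frequency  # clock cycle duration in seconds
cycles = 4.0e9                      # clock cycles the program needs (assumed)

t1 = cycles * cycle_time       # T = cycles * cycle duration
t2 = cycles / clock_frequency  # T = cycles / clock frequency
print(t1, t2)  # 2.0 2.0 (seconds)
```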
Number of Instructions¶
A straightforward way to measure program execution time is to count the number of (machine language) instructions in the program (Metric 3).
This metric largely depends on the processor's instruction set architecture, including the number of registers, how memory addressing is performed, whether instructions are CISC or RISC, etc. Naturally, the efficiency with which developers and compilers optimize programs for the selected processor also plays a role.
Example: in AMD's K7 processor, a single machine language instruction requires between 1 and 260 microcode operations, and the delay caused by an instruction ranges from 1 to 200 clock cycles.
Instruction Execution Time¶
The previous metric can be refined by calculating the number of clock cycles per instruction (CPI), providing the general average time per instruction for a processor (Metric 4). Naturally, the number of instructions and CPI can vary significantly for the same program on different processors due to differences in microarchitecture implementations. CPI is a suitable metric for comparing processors that implement the same instruction set, such as Intel and AMD processors that implement the x86 instruction set. However, to make CPI meaningful, the average time must be calculated separately for different types of instructions, such as integer operations, floating-point operations, memory accesses, conditional instructions, etc.
Using CPI, the program execution time T can be calculated as:
T = number of instructions * CPI * clock cycle duration
Example: comparing programs A and B.

The processor's instruction set defines two CPIs for different instruction types:

- Arithmetic operations: CPI = 1
- Memory accesses: CPI = 8

The number of instructions in the programs:

- A: ALU 12 + memory 4 = 16 instructions
- B: ALU 6 + memory 6 = 12 instructions

Execution times in clock cycles:

- A: Ta = 12*1 + 4*8 = 44
- B: Tb = 6*1 + 6*8 = 54
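A minimal Python sketch of the same calculation; the CPI values and instruction counts are those of the example.

```python
CPI = {"alu": 1, "mem": 8}  # clock cycles per instruction type

def total_cycles(counts):
    """Total clock cycles: sum over instruction types of count * CPI."""
    return sum(n * CPI[kind] for kind, n in counts.items())

a = {"alu": 12, "mem": 4}  # program A: 16 instructions
b = {"alu": 6, "mem": 6}   # program B: 12 instructions

print(total_cycles(a), total_cycles(b))  # 44 54
```

Multiplying the cycle counts by the clock cycle duration would give the execution times in seconds.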
Program B has fewer instructions, yet A's execution time is shorter. The comparison result thus depends on whether we prioritize the number of instructions or the program's execution time. For this reason, CPI alone is not a comprehensive metric for evaluating performance.
Comparing Computer Systems¶
In general, no single one of the above metrics or parameters is sufficient for evaluating or comparing computer systems. Instead, all of them are used together, providing a more comprehensive view and enabling comparisons between systems:
1. Execution time of the program
2. Clock cycle duration
3. Number of instructions in the program
4. CPI
Optimizing Common Use Cases¶
Today, computer systems are designed as general-purpose workstations (PCs) with countless applications. Even so, design takes user needs into account: gaming computers focus on fast graphics, while computational servers, often GPU-based, are optimized for the specific operations required in tasks like neural-network-based deep learning. The guiding principle in design is to optimize common use cases ("make the common case fast").
But... what exactly constitutes a common use case? This is difficult, if not impossible, to define. Therefore, benchmarking uses a set of different (standardized) programs that collectively measure processor efficiency. For example, SPEC benchmarks include dozens of test programs, ranging from floating-point computations to C code compilation to chess games, and more.
Example: The SPEC programs used to measure the performance of the Intel Core i7-920 (2.66GHz) processor are shown in the image below.
![](/media/images/tkj-spec.png)
It is observed that for different tests, all four parameters mentioned above are reported: 1) number of instructions in the program, 2) CPI, 3) clock cycle duration, and 4) program execution time. The last parameter, SPECratio, is a computed (normalized) benchmark value derived by combining tests, allowing comparison of different processors within the specified tests.
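A SPECratio is computed per test as the reference machine's execution time divided by the measured time, and the per-test ratios are combined with a geometric mean. The sketch below illustrates this with made-up numbers (they are not actual SPEC reference times):

```python
import math

# Hypothetical per-benchmark execution times in seconds; the reference
# times stand in for SPEC's reference machine results.
reference_times = [9650.0, 8050.0, 10490.0]
measured_times = [637.0, 448.0, 724.0]

# SPECratio per test: reference time / measured time (higher is better).
ratios = [ref / t for ref, t in zip(reference_times, measured_times)]

# The overall score is the geometric mean of the ratios, so that
# no single benchmark dominates the combined result.
overall = math.prod(ratios) ** (1.0 / len(ratios))
print(round(overall, 1))
```

The geometric mean is used precisely because the ratios are normalized values: it gives the same relative ranking regardless of which machine is chosen as the reference.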
Common Benchmarks¶
Other commonly used benchmarks, past and present, include Whetstone, Dhrystone, and FLOPS (floating-point operations per second).
For supercomputers, benchmarks include LINPACK for vector processors and LAPACK, which takes the processor cache into account in vector calculations. The HPCG benchmark additionally evaluates memory I/O performance and distributed computation in "real-world" applications.
The June 2020 TOP-500 list of supercomputers can be found here. It shows that today's supercomputers already have millions of cores.
In Finland, CSC (IT Center for Science) reached the top hundred fastest supercomputers in 2017. CSC has recently (2021) built the Lumi supercomputer in Kajaani, which competes with the world's fastest supercomputers.
Principles of Computer System Design¶
David Patterson, one of the authors of this course's second textbook, has outlined eight principles for designing computer systems:
- Consider Moore's Law. That is, designers should plan for the future, as it is expected that microchip resources will continue to grow.
- Design hardware and software systems in layers. In this approach, abstraction layers hide the details of lower levels. For example, the same instruction set architecture can be implemented with multiple different microarchitectures.
- Optimize for the common case. This should already be taken into account when designing the instruction set architecture.
- Parallelism in computation increases performance.
- Pipeline implementation increases performance.
- Prediction improves performance. On average, predicting conditional execution in a program is faster than waiting to be certain about the outcome. This principle also affects how programs should be implemented in machine code.
- Memory hierarchy speeds up access to slower resources.
- Redundancy increases reliability. Computer system components, such as hard drives or memory chips, will eventually fail. Designing systems with redundant, mirrored resources improves fault tolerance. For example, RAID hard drives.
Bibliography¶
Please refer to the course books: Bryant & O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd edition, Chapter 1; and Patterson & Hennessy, Computer Organization and Design, 5th edition, Chapter 1.