WO2013103877A1

WO2013103877A1 - Method and system for testing a cache memory architecture

Info

Publication number: WO2013103877A1
Application number: PCT/US2013/020362
Authority: WO
Inventors: William Judge YOHN
Original assignee: Unisys Corporation
Priority date: 2012-01-05
Filing date: 2013-01-04
Publication date: 2013-07-11

Abstract

A method, system and computer device for testing a cache memory within a computer system. The method includes establishing a controlling CPU and at least one controlled CPU from the plurality of CPUs in the computer system, deactivating by the controlling CPU at least one controlled CPU by placing the controlled CPUs in a first idle state whereby the controlled CPU operates in an idle loop that is resident in a first memory level associated with the controlled CPUs, and activating by the controlling CPU at least one controlled CPU by placing the controlled CPU in a second activation state whereby the controlled CPU can access all memory levels.

Description

METHOD AND SYSTEM FOR TESTING

A CACHE MEMORY ARCHITECTURE

BACKGROUND

Field

[0001] The instant disclosure relates generally to computer system cache memory architectures, and more particularly, to methods and systems for testing computer system cache memory architectures.

Description of the Related Art

[0002] Within computer systems, cache memory is used to store the contents of a typically larger, slower memory component of the computer system. In the development of a large scale computer system cache/memory architecture that contains multiple units at each architectural level, there has been a persistent problem in how to provide comprehensive functional and stress tests of such a system. Desirably, the tests should meet several criteria, including providing the highest degree of functional, load, stress and volume coverage. The tests also should attain the highest degree of efficiency both in wall clock time and resource utilization, eliminate the need for users to parameterize and execute the test package, provide the ability to reliably reproduce a detected problem, provide diagnostics capable of allowing reliable fault analysis to take place, and provide the ability to direct the test package to perform specific operations in pursuit of fault isolation and diagnosis.

[0003] Conventional approaches to providing suitable system tests have not been successful. Conventional testing generally consists of one or two approaches in combination. The first approach is to execute a set of discrete, deterministic function tests either individually or in various combinations against cache memory configurations that could consist of multiple cache memory units existing at multiple architectural levels. The second approach is to execute a set of end user programs that are started simultaneously but otherwise are uncontrolled. Attorney Docket No.: 21004.0031 U1

Unisys Ref.: RA5910

[0004] These conventional approaches have a number of concerns. For example, with such conventional approaches, it is relatively difficult to provide a suitably high degree of functional and stress testing coverage, as well as a relatively high degree of path coverage. Also, with conventional approaches, it is relatively difficult to isolate and reproduce detected errors.

[0005] Deterministic tests typically concentrate on specific areas of functionality. A specific deterministic test may allocate and test (write/read) data functionality and integrity. However, to provide a suitable degree of functional and stress related test coverage, many deterministic test cases typically would be required. The need to conduct multiple deterministic test cases would involve considerable resource

requirements of both execution time and personnel involvement, often making a comprehensive deterministic test plan prohibitive. Multi-threaded test programs present additional problems in that it is difficult to predict, control and evaluate the degree of interaction between multiple threads of execution. This difficulty can significantly reduce the degree of determinism of a given set of test cases. A similar problem exists when executing a set of deterministic test cases simultaneously. Both predicting and reproducing the exact interaction between the test cases is relatively difficult if not impossible.

[0006] In computer systems with relatively complex cache memory architectures, there can be multiple paths from requestors to specific data. Deterministic test cases experience considerable difficulty in being able to test all paths to specific data from all requestors. This difficulty is exacerbated in systems that use random page allocation of memory, as it is not possible to programmatically determine in which cache memory unit data actually resides. In fact, programs typically are not aware of the actual

configuration details and are unable to determine the number of architectural cache memory levels and the number of units at each level.

[0007] A fundamental requirement of any test program is the ability to reproduce a detected failure for the purposes of diagnosis. The inability to reproduce such an error is a relatively severe limitation of conventional test techniques. A single deterministic test program consisting of a single thread of execution should be able to reproduce an error as long as the error can be attributed to a very basic and consistent functional failure. However, any error that is of a more complex nature or requires a combination of multiple circumstances to fail is relatively difficult to reproduce in an effective manner, especially if multiple deterministic test cases are executed simultaneously. The random interaction of these test cases cannot be effectively reproduced at the lowest cache memory levels.

[0008] The result of conventional test efforts is to produce a system load that either follows a specific set of characteristics (grooved activity) or is completely random in nature. The "grooved" activity produces a relatively limited set of test conditions while the random activity requires too great a time to detect even a limited number of combinatorial errors.

SUMMARY

[0009] Disclosed is a method, system and computer device for testing a cache memory within a computer system. The method includes establishing a controlling CPU and at least one controlled CPU from the plurality of CPUs in the computer system, deactivating by the controlling CPU at least one controlled CPU by placing the controlled CPUs in a first idle state whereby the controlled CPU operates in an idle loop that is resident in a first memory level associated with the controlled CPUs, and activating by the controlling CPU at least one controlled CPU by placing the controlled CPU in a second activation state whereby the controlled CPU can access all memory levels. The methods and systems described herein allow modern large scale computer system cache memory architectures to be tested with a greater degree of functional and load related coverage (compared to conventional testing methods) using both

deterministic and focused random (probabilistic) techniques, while being able to reproduce results in the event of an error having been detected. Such tested computer systems can contain multiple architectural levels of cache memory and multiple units at each level. By using the methods and systems described herein, all possible paths and combinations of paths to chosen data areas can be tested as desired. Also, the methods and systems described herein can be configured to generate all possible variations of tinning by which data in each of the available cache mennory units at each architectural level may be accessed.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] Fig. 1 is a schematic view of program absolute addressing within a computer system;

[0011] Fig. 2 is a schematic view of program page allocation within a computer system;

[0012] Fig. 3 is a schematic view of a very large scale mainframe computer cache memory hierarchy;

[0013] Fig. 4 is a schematic view of a portion of a general system memory hierarchy, showing the distinct levels of main memory with attendant difference in requestor to data timing;

[0014] Fig. 5 is a schematic view of a table built showing what data resides as what timing levels for each data requestor;

[0015] Fig. 6 is a schematic view of a portion of a general system memory hierarchy, showing the interaction between the controlling CPU and the controlled or tested CPUs, according to an embodiment;

[0016] Fig. 7A is a flow diagram of a portion of a method for testing a computer system cache memory unit, according to an embodiment;

[0017] Fig. 7B is a flow diagram of another portion of the method for testing a computer system cache memory unit, according to an embodiment;

[0018] Fig. 7C is a flow diagram of another portion of the method for testing a computer system cache memory unit, according to an embodiment;

[0019] Fig. 7D is a flow diagram of another portion of the method for testing a computer system cache memory unit, according to an embodiment;

[0020] Fig. 8 is a schematic view of a set of generated parameter tables, according to an embodiment;

[0021] Fig. 9 is a schematic view of generated function tables for the CPU, L1 , L2, L3 memories and main memory, according to an embodiment; [0022] Fig. 10 is a schematic view of a set of function tables, according to an embodiment;

[0023] Fig. 1 1 is a schematic view of memory allocation using a proprietary architecture;

[0024] Fig. 12 is a schematic view of a generated table relating the allocated memory area to the defined architectural levels, according to an embodiment;

[0025] Fig. 13 is a schematic view of a set of generated execution tables, according to an embodiment;

[0026] Fig. 14 is a schematic view of a set of generated execution history tables, according to an embodiment;

[0027] Fig. 15 is a schematic view of an apparatus configured to test a computer system cache memory unit, according to an embodiment.

DETAILED DESCRIPTION

[0028] In the following description, like reference numerals indicate like components to enhance the understanding of the disclosed methods and systems through the description of the drawings. Also, although specific features, configurations and arrangements are discussed hereinbelow, it should be understood that such is done for illustrative purposes only. A person skilled in the relevant art will recognize that other steps, configurations and arrangements are useful without departing from the spirit and scope of the disclosure.

[0029] As used in this description, the terms "component," "module," and "system," are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes, such as in accordance with a signal having one or more data packets, e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network, such as the Internet, with other systems by way of the signal.

[0030] An individual computer program typically consists of a number of segments containing data, instructions and possibly library elements. Each instruction or data location is given an address that corresponds to a relative location within the segment in which it is contained. A program segment that consists of 100,000 instructions will have program relative addresses beginning at 0 and extending to 99,999, perhaps biased with some offset. While the sequence of instruction execution may not be contiguous, the address of each instruction executed is relative to the start of the program.

[0031] A typical computer program may contain several code or data segments, each with the same relative address range. For such a program to execute correctly, the segments must be combined in such a way that no two segment locations have the same address. The process of combining these entities into an executable program typically is called linking. After being linked, the entire program will have a set of unique runtime addresses often called absolute addresses.

[0032] Modern computer systems must be capable of allowing multiple independent programs to be active simultaneously, while allowing each program to appear to execute continuously. If the computer system has only one processing unit, only one program can actually be active on the system at a given instant. To maintain the appearance of simultaneity, each active program is allowed to execute on a requestor for a fixed period of time, have its state saved and then be deactivated. The next active program then is allowed to execute in a similar fashion. This process is termed time sharing or multi-programming. If the computer system has more than one processing unit, the number of programs that can execute simultaneously (one per requestor) is directly proportional to the number of available processing units. The process of simultaneously executing programs on multiple processing units is called multiprocessing. [0033] Fig. 1 is a schematic view of an example of program absolute addressing within a computer system 1 having a memory or physical memory 2 and a mass storage 4. In this example, four (4) independent programs (Program 1 , Program 2, Program 3, Program 4) are shown, each program with a different number of code and data segments. During program execution, each segment is loaded into memory 2 by the operating system, as needed. Because each segment may have the same or similar range of absolute addresses as another segment, each segment will be biased by its location in memory such that each segment will have a unique set of addresses, thus allowing each program to operate correctly. These sets of addresses are termed real addresses.

[0034] During execution, each program is unaware of the location of its segments in system memory. Each program is presented with an image that appears as if its order of execution is from one contiguous set of locations in memory.

[0035] When a program becomes active, its segments are loaded into memory by the operating system in the most efficient manner determined by the operating system. If the total program requirement for memory is greater than the available system memory, the operating system will have to decide how to allocate both memory space and execution time among the requesting programs. Because each program segment may be of a different size, it is relatively difficult for an operating system to efficiently manage memory such that all of the memory is being used and such that program segment loading and unloading is minimized. Computer systems typically employ a combination of hardware and software architecture to efficiently manage memory usage and program execution. A computer system that employs this approach is called a paging system.

[0036] To efficiently manage the aforementioned segmentation, and loading and unloading of individual programs, every segment is required to be the same size.

Without such requirement, memory utilization can become fragmented, resulting in increased loading and unloading of program banks, relatively poor memory utilization and an increase in actual program execution time. This requirement of every segment being the same size necessitates using both hardware and software architecture components to manage where a given page is actually loaded into memory. When a program first begins to execute, a certain number of pages are loaded into memory. For example, if four programs are requesting to be executed simultaneously, each of those four programs will have a number of its pages initially loaded for execution.

Subsequent program execution will require that additional pages of each program be loaded into physical memory as well. Fig. 2 is a schematic view of program page allocation within the computer system 1 .

[0037] Should the requested number of pages exceed the available physical memory, the system will unload one or more pages from physical memory 2 and place them on secondary disk mass storage 4 in a file called the page file. The newest requested program pages then will be loaded into physical memory 2 and program execution will continue. The details of how this action takes place are system

architecture and operating system dependent. Which pages are chosen to be unloaded is system architecture dependent, but generally relies on a type of least recently used algorithm. However, it should be understood that each program is divided into segments of a fixed size dependent on the system architecture and moved into and out of physical memory based on number of programs executing, the size of each program, the number of processing units and the frequency of use. At any one time, a given page of a specific program may be allocate anywhere in physical memory.

[0038] In a paging system, where a given page of a program is allocated in physical memory is chosen from a free page table built from randomly allocated addresses of all unassigned pages and may well be different at different points of execution. While this allocation mechanism allows for efficient utilization of memory, it makes testing a computer's cache memory architecture very difficult. This is especially true of relatively large computer systems that have multiple layers and multiple units of cache memory.

[0039] When using a conventional test program, a particular data range may exist in one unit of the cache memory at one point in time but may reside in another unit of the cache memory at another point in the test program's execution. If an individual requestor modifies a particular piece of data, that data will then reside in that

requestor's cache memory. Subsequently, if a second requestor modifies the same piece of data, that data will then be unloaded from its current cache memory module and loaded into a cache memory module used by the second requestor. [0040] The relative difficulty in testing this type of computer architecture to date is partially the result of a number of related conditions becoming relevant more or less simultaneously as computer systems developed have developed. As more people began using computer systems, the requirement to be able to process larger computer programs became more important. Also, the need to process a number of such programs simultaneously became a relevant consideration. To facilitate this need, larger amounts of system memory began to be employed.

[0041] However, at the same time, it was observed that not all parts of a computer program were used at the same rate and for the same amount of time. This gave rise to the concept of having a smaller but faster memory structure that would contain the parts of a program that were used more often, thus facilitating an increase in the speed at which a given program would execute. As computer systems evolved, it was observed that multiple hierarchical layers of memory would be necessary to improve program execution and reduce implementation costs. The intermediate layers of computer memory became known as cache memory units, levels or layers. The cache memory layer or level that resides closest to the requestors, such as central processing units (CPU), is both the smallest and the fastest of the cache memory levels, and is generally known as Level 1 cache (L1 ). Each succeeding layer or level is both larger and slower than the preceding layer or level. This cache memory structure ultimately culminates in a memory layer or level that generally known as the main memory or system memory.

[0042] Many modern computer architectures contain 3 levels of cache memory (e.g., Level 1 or L1 , Level 2 or L2, and Level 3 or L3), as well as a level of main memory (MEM), in addition to internal CPU registers. In most architectures, the Level 1 cache memory is integrated into the CPU ASIC (application specific integrated circuit). The Level 1 cache memory often is subdivided into 2 sections: one section that contains program instructions and one section that contains program data referred to as instruction operands. Typically, the Level 2 and Level 3 cache memory levels are integrated into the CPU ASIC as well. In some other architectures, the Level 2 and/or Level 3 cache memory levels are contained in separate ASICs located near the requestor ASICs on a system motherboard. Also, the main or system memory layer or level may be located near the requestor ASIC on a system motherboard. Finally, many of the latest computer architectures have been developed in such a way that multiple CPUs can be contained on the same physical ASIC and share an integrated Level 2 cache memory unit. The tabular listing below summarizes a general system memory hierarchy:

[0043] Computer system memory hierarchy (fastest to slowest access times):

[0044] 1 . Internal CPU storage registers (on CPU ASIC) - 1 CPU clock cycle

[0045] 2. Level 1 (L1 ) cache memory - 1 -3 clock cycles latency - size: 10 KB+

[0046] 3. Level 2 (L2) cache memory - latency higher than L1 - size: 500 KB+

[0047] 4. Level 3 (L3) cache memory - latency higher than L2 - size: 1 MB+

[0048] 5. Main memory - many clock cycles size: 64 GB+

[0049] 6. Disk mass storage - millisecond access - size: capacity limited by disk number (many terabytes)

[0050] Fig. 3 is a schematic view of a very large scale mainframe computer cache memory hierarchy 10. The entire cache memory hierarchy 10 includes four (4) processor control module (PCM) cells 12, although only two (2) PCM cells 12 are shown in Fig. 3. Each PCM cell 12 has two (2) I/O modules (lOMs) 14 and four (4) processor modules (PMMs) 16. Each PMM 16 includes 2 central processing units or CPUs (IPs) 18, each with an integrated Level 1 (FLC - first level cache) cache memory unit 22. Each PMM 16 also includes a shared Level 2 (SLC - second level cache) cache memory unit 24, a shared Level 3 (TLC - third level cache) cache memory unit 26, and a main memory (MEM) unit 28.

[0051] As can be seen from the cache memory hierarchy 10, the number of paths a piece of data can take when being accessed by a set of requestors is incredibly large. In the case of a computer system that contains sixteen (16) instruction processors, the number of combinations of requests to manipulate a specific piece of data by sixteen processors at a time is 2¹⁶ - 1 . For a similar configuration containing thirty two (32) instruction processors, the number of requests to manipulate a particular piece of data rises to 2³² - 1 . If requested data is resident in one of the cache memory units (e.g., SLC 24 or TLC 26), the data will be retrieved from the particular cache memory unit there. If the data is subsequently modified by the requestor, the initial copy retrieved from the cache memory unit no longer will be valid and therefore will be declared invalid and removed from the cache memory unit. The new modified data will be made resident in the cache memory unit(s) of the modifying requestor and eventually will be written back to main memory unit 28. The next time a requestor asks for that piece of data, the data will be retrieved from the modifying requestor's cache memory unit or from the memory unit, depending on the architectural implementation of the MESI (Modified Exclusive Shared Invalid) cache protocol.

[0052] This type of cache memory architecture contains four (4) levels of cache memory, each with different capacities and data transfer times. The data transfer times for the cache memory levels are directly proportional to the path length from the requestor. As previously mentioned, the first level cache (FLC) memory unit 22 has the shortest transfer time, followed by the second level cache (SLC) memory unit 24, the third level cache (TLC) memory unit 26, and the main memory unit 28. Additional transfer time exists if data is contained in a cache memory unit that is non-local to the requestor. For example, for a request for data by IP1 1 to MEMO, the memory unit MEMO is not resident in the same PMM as the CPU IP1 1 , and therefore the memory unit is considered to be non-local to the requestor. If the requestor and the memory unit containing the requested data reside in the same PMM, the memory unit is considered to be local to the requestor.

[0053] How to test such an architecture with relatively complete functional coverage, high efficiency and repeatability requires that a number of factors be determined.

Initially, knowing the number of requestors is desired, and such information is readily available from the computer's operating system. Also, knowing the number of cache memory levels, the number of units at each level and their capacities is desired, but not normally available. The number of memory units and their respective capacities is information that is only partially available. The total memory capacity of the system can be obtained relatively easily, but there normally is no means for a computer program to directly determine the number of individual memory units. Also, typically it is not possible for a computer program to directly determine how many cache memory units exist at what levels and with what capacities, because these units typically are embedded in the system architecture and are transparent to the end user. A further complication is that most modern computer operating systems use a randomized paging algorithm, which makes it impossible for a user program to determine exactly the memory unit into which a page of data is initially loaded. For example, if four (4) consecutive pages of data are requested by references from IPO, each of these data pages might be initially loaded into a different memory module.

[0054] According to an embodiment, the methods and systems described herein allow modern large scale computer system cache memory architectures to be tested with a greater degree of functional and load related coverage (compared to

conventional testing methods) using both deterministic and focused random

(probabilistic) techniques, while being able to reproduce results in the event of an error having been detected. Such tested computer systems can contain multiple architectural levels of cache memory and multiple units at each level. By using the methods and systems described herein, all possible paths and combinations of paths to chosen data areas can be tested as desired. Also, the methods and systems described herein can be configured to generate all possible variations of timing by which data in each of the available cache memory units at each architectural level may be accessed.

[0055] It should be noted that the inventive methods, devices and systems described herein assume that the cache memory system configuration is symmetric, i.e., each of the PMMs have the same cache levels and capacities. However, if the cache memory system configuration is not symmetric, i.e., at least some of the PMMs have different cache levels and capacities, the inventive methods, devices and systems described herein can be modified to account for such differences. Also, it should be noted that the inventive methods, devices and systems described herein assume that all CPU (IP) requestors have the same internal characteristics, e.g., clock speed. The cache memory implementation is system dependent, with some cache memory cache units being inclusive and some cache memory units being exclusive. Also, the inventive methods, devices and systems described herein assume that the cache memory levels are inclusive, although the inventive methods, devices and systems described herein can be adapted to include exclusive cache memory architectures. It is possible to have one cache level be inclusive and another cache level be exclusive. For example, a third level cache (TLC) memory unit can be exclusive and an associated second level cache (SLC) memory unit can be inclusive. In such a system configuration, the write loop timing can be used to differentiate the cache unit characteristics.

[0056] To determine the number of cache levels and the capacities of each cache level, a table can be built such that the time to access each cache level and its capacity can be recorded. The table can be built in any suitable manner, using any suitable method or system, such as the method and system described in co-pending U.S. Patent Application Serial No. 12/962,767, entitled "Method and System for Determining a Cache Memory Configuration for Testing," which is hereby incorporated by reference.

[0057] Once the table of requestors, cache levels and capacities has been

constructed, the memory configuration can be determined. Referring again to the cache memory hierarchy 10 in Fig. 3, in a maximum configuration it can be seen that an individual data requestor (CPU/IP) can access data located in any one of 16 memory units.

[0058] As discussed hereinabove, computer operating systems place data in memory based on a random page allocation algorithm. Typically, it is not possible for an end-user computer program to determine in which unit of a multi-memory

configuration a data page will reside. Hence, when a requestor allocates an area of memory to test, it cannot be directly determined programmatically in which physical memory unit the requested data resides. To accurately test the entire cache memory complex, it is desirable to know which allocated areas of program memory reside in which physical memory units.

[0059] In a cache memory hierarchy like the cache memory hierarchy 10 in Fig. 3, it can be seen that there are three (3) distinct levels of main memory with attendant differences in requestor-to-data timing. Fig. 4 is a schematic view of a portion of the general system memory hierarchy 10, showing the distinct levels of main memory with attendant differences in requestor-to-data timing.

[0060] If a data requestor, such as CPU0 18, has a requirement to write data to memory or read data from memory, the time it takes the data requestor to access the requested data depends on the number of hierarchical levels the data request must traverse. If CPU0 wants to retrieve data that is resident in MEMO 28, the requested data has to pass only through a single second level cache unit 24 (i.e., SLC0) and a single third level cache unit 26 (i.e., TLCO) to travel from memory to the data requestor. However, if CPUO has a similar requirement to access data in a memory unit 28 in another PMM 16 (e.g., MEM3), the requested data must pass through two (2) second level cache units 24 (i.e., SLC3 and SLCO) and two (2) third level cache units 26 (i.e., TLC3 and TLCO) before reaching the data requestor. Finally, if CPUO requests data that is resident in a memory unit 28 in another PCM 12 (e.g., MEM4), the requested data must pass through two (2) second level cache units 24 (i.e., SLC4 and SLCO), two (2) third level cache units 26 (i.e., TLC4 and TLCO) and two (2) IOSIM units 14 (i.e., IOSIM2 and IOSIM0) before reaching the data requestor. Each of these distinct data paths has a different data timing associated therewith. The data path having only a single second level cache unit 24 clearly will access data more rapidly than a data path having two second level cache units 24. The data path that contains both second level cache units 24 and IOSIM units 14 has the longest data transfer timing.

[0061] Although the physical memory unit in which requested data resides cannot be determined directly, the path length to that requested data can be determined, and subsequently the relative level at which that requested data resides can be determined. If a sufficient number of data areas are allocated as part of the detection process activities, it can be assumed that at least one data area will reside in each physical memory unit.

[0062] Using this data, a table can be built that shows which data resides at what timing levels for each requestor. Fig. 5 is a schematic view of a table or set of configuration tables 50 built showing what data resides at what timing levels for each data requestor. The table 50 identifies the data access timing from each CPU to each allocated data area. The timing measurements for each CPU to its respective first, second and third level caches should be the same, assuming a symmetric cache configuration. The associated data access timings to each data area in memory from each CPU will be 1 of 3 timing values due to the extended path lengths within the system. The table 50 is constructed such that the individual data areas can be accessed by CPUs at a given timing level or, conversely, a CPU can access all data areas at a given timing level. As a result, data can be accessed as either CPU relative, timing level relative or architectural component relative. This type of structure allows processes to be constructed that allow testing of the architectural components in any manner chosen. An example would be to formulate a process that tests all possible paths that exist at the same timing level for a given data range.

[0063] Once the data tables have been constructed, the process of testing the system architecture according to an embodiment can be performed. The principal objectives for a desirable test package to meet are maximum test coverage, efficiency and effectiveness, and reproducibility. Also, the objective of maximum test coverage itself has two principal objectives: to produce the highest degree of deterministic testing, and to produce random test conditions that cannot be adequately predicted or anticipated.

[0064] The test activities should be deterministic in nature. Deterministic test cases have a specific objective, operate consistently from one execution to the next, and possess efficient execution for test cases not involving combinatorial conditions. A disadvantage of deterministic testing can be the sheer number of test cases that might have to be executed to comprehensively test a relatively large computer system, assuming one could actually formulate the number of cases needed. Also, it is relatively difficult to formulate deterministic test cases that produce the desired output when more than one event is involved. Hence, a method of producing random test conditions under controlled conditions should also be employed.

[0065] To meet the requirements for a test package mentioned above, a method of controlling requestors and how they create and reference data had to be developed. When developing a test system, two principal factors must be considered. The first principal factor is that to produce a test package with an optimal degree of deterministic functional coverage as well as reproducibility, synchronous control should be used. The second principal factor is that to be able to produce the multiplicity of various test conditions that cannot be adequately predicted or anticipated, avoid a "grooved" execution and still have some degree of reproducibility, asynchronous control should be used.

[0066] The principal consideration for synchronous control is the precise control of both requestor operation and data creation/access. It typically is necessary to control which requestors create and access data, in which manner and by which of the possible paths to that data. If a total of sixteen requestors exist, there are 2¹⁶ -1 combinations of requestors that could be active at any given time. For example, if CPUs 0, 3 and 4 are to have access to a particular piece of data, the remaining CPUs would have to be made idle. A cache memory hierarchy like the cache memory hierarchy 10 in Fig. 3 contains sixteen (16) physical memory units. Given that a piece of data could exist in any of the 16 memory units, there are (2¹⁶)(2³)-1 possible combinations of data requests to access a particular piece of data. In addition, a particular piece of data might have been modified by one of the requestors and is therefore resident in one of the

requestor's cache units. Since each requestor has access to 3 levels of cache, the total number of possible combinations of requestors to a specific piece of data is

(2¹⁶)(2³)(2²)-1 . It is impossible to be completely deterministic within a system of this complexity due to changes in the queuing arrival rate of a data request depending on where the data resides. However, it is possible to exert considerable control on which requestors are simultaneously active and to which data the active requestors make requests.

[0067] However, to do so, an efficient method of controlling requestors according to an embodiment had to be determined. In examining a cache memory configuration in which a plurality of CPUs (e.g., sixteen CPUs) exist, it is important that only selected CPUs are allowed to become simultaneously active. To accomplish this, a mechanism had to be developed for placing a CPU in an idle state without affecting the data stored in the various cache and memory units. Normally, the system work load is distributed evenly between the CPU requestors. Consequently, the programs they execute vary with time and, as a result, so do their cache contents. An additional complication exists when a CPU's workload is switched, thus causing the current program state to be decelerated to either memory or the system paging file and a new program's state to be accelerated into storage. This further alters existing cache memory contents, possibly for more than the CPU in question. This behavior is undesirable for a test system as it is necessary to keep the cache and memory contents as deterministic as possible.

[0068] If a CPU is placed in an "idle loop" waiting for a program to execute, the idle loop typically is kept as small as possible to avoid constantly altering the L1 instruction and possibly other cache memory units. Once the CPU is in the "idle loop," the CPU determines whether or not it has been given a program to execute and where to find that program. According to an embodiment, use is made of the architectural MESI (modified, exclusive, shared, invalid) cache protocol or structure that is used by many computer system cache memory architectures.

[0069] At the beginning of a test session, a data area is allocated with sufficient size to assign an individual cacheline to each CPU in the system configuration. Initially, these cachelines will reside only in the main system memory. As the test session begins, each CPU will be assigned a unique cacheline that can only be accessed by the particular CPU and the controlling CPU. When a CPU accesses its assigned cacheline, the cacheline will be read into its L1 cache. When a CPU has completed its existing workload, the CPU enters or is placed in a relatively short "idle loop." Once in the "idle loop," the CPU executes out of its instruction cache so that the CPU only reads a value that is resident in its L1 operand cache, tests its state and if the CPU finds that no work is assigned, continues to repeat that sequence. Until the state of the cacheline is changed, the CPU will make no memory requests and, as a result, no other cache or memory locations are changed. If the controlling CPU decides that the controlled (worker) CPU in question should perform some work, the controlling CPU changes the state of the cacheline of the idle controlled (worker) CPU.

[0070] Once a cacheline changes state, the existing data in the requesting CPU's L1 operand cache is no longer valid, is invalidated and the updated cacheline is retrieved from memory automatically. Again, as the "idle loop" operates only out of the CPU's L1 cache, no other cache units are affected. Consequently, a CPU can be active within the system in its "idle loop" without any adverse effect on the rest of the system. Cache contents can be preserved at each cache level, thus allowing testing to be considerably more deterministic compared to conventional testing methods. This form of control according to an embodiment affords the test program relatively considerable integrity with much greater efficiency than conventional methods. This form of control according to an embodiment also provides for a relatively efficient method of controlling multiple requestors (CPUs) with a minimum of system overhead.

[0071] Fig. 6 is a schematic view of a portion of a general system memory hierarchy 10, showing the interaction between the controlling CPU and the controlled or tested CPUs, according to an embodiment. The controlling CPU sets a cacheline status value in the dedicated cachelines for each of the CPUs being tested to inform the CPUs being tested as to whether or not there is further work available for them to do. When a tested CPU has completed executing its current process, the tested CPU interrogates its dedicated cacheline to see if additional work is available. If the cacheline status value has been set to a first deactivation value (e.g., a value of "0"), the CPU enters a relatively short "idle loop" that is resident in its L1 instruction cache. That "idle loop" tests the status of the dedicated cacheline that is resident in the CPU's L1 operand cache. The value in that cacheline will be valid, so no L2, L3 or memory reference is necessary. If the cacheline status value is 0, the CPU continues operating in its local L1 instruction loop. However, when the controlling CPU wants to activate the controlled CPU, the controlling CPU changes the cacheline status value in the cacheline dedicated to the controlled CPU to a second activation value (e.g., a value of "1 "), which will cause the associated L1 cacheline data to be invalidated and the corresponding main memory location to be written. When the controlled CPU next tests its dedicated cacheline, the controlled CPU will go to main memory for the data because the controlled CPU's local copy has been invalidated. Consequently, when the controlled CPU is operating in the idle loop, all instruction and data references are contained within the controlled CPU's L1 cacheline. Only when the controlled CPU is to resume process execution does the controlled CPU make references outside its L1 caches.

[0072] When the test program begins, the operating system decides what testing processes to be used, which CPUs will be used for each testing process, and how the testing processes should be sequenced. If sixteen CPUs exist within the computer system, there are 2¹⁶ -1 combinations of active CPUs that can be used in a particular testing process. If it is desired to test access to each area of data located in a particular memory unit using all combinations of active CPUs, the testing process might start by using only CPU0, then continue to enable all CPUs according to a master bit binary sequence until all combinations have been used. Once a CPU has completed

executing its workload, that CPU examines its cacheline to see if more work is available. If there is none, that CPU will enter its "idle loop," e.g., as described previously hereinabove. The controlling CPU can thus activate or deactivate worker CPUs as desired to meet the requirements of each testing process in a relatively effective and efficient manner, while also preserving the integrity of the existing cache memory.

[0073] To prevent the system migrating the data from one memory unit to another via the paging mechanism, the data is allocated in a controlled manner. To facilitate the testing methods according to an embodiment, after the number of physical memory units for the system and their capacities have been determined, a sufficient number of data areas are allocated such that the random assignment mechanism of the memory allocation system places one or more data areas in each physical memory unit. If the configuration determination process has determined that sixteen physical memory units exist within the cache memory configuration, the number of assigned data areas should be a multiple of sixteen times the basic random allocation strategy. Also, the number of assigned data areas should be less that the number that would cause the system to invoke paging and subsequent data migration.

[0074] Once the data areas have been assigned, the location of each data area relative to each requestor is determined. For example, referring to the cache memory hierarchy 10 in Fig. 3 and using CPU 0 as a reference, it can be observed that there is a physical memory unit (MEM 0) in the same physical processor memory module, three (3) physical memory units on the same motherboard (MEM 1 - 3) and twelve (12) memory units that are located on different motherboards. Hence, CPU 0 has one data access timing if its requested data was in MEM 0, a slightly longer data access timing if the requested data was in MEM 1 - 3, and a third but even longer data access timing if the requested data was located in MEM 4 - 15. This final detection stage is made for each CPU to each data area. The table shown in Fig. 5 then can be completed.

[0075] It should be noted that given the random allocation process used to locate data within each memory unit, it may not be possible for the testing program to determine exactly which of the memory units at a given timing level houses a particular data area. A data area at timing level 3 relative to CPU 0 could be resident in any memory in the range of MEM 4 - 15. However, this is not a significant issue, as a testing process tests each of the data areas at timing level 3 from CPU 0 and thus tests each of the level 3 paths from the CPU. [0076] If a failure occurs, there are several options to determine which level 2 or 3 path caused the error. One option that often is used is to invoke a system stop and perform a subsequent system dump analysis to determine the precise location of the failing data. Many test programs allocate a relatively large number of data areas without having an idea as to where the data resides. If a failure occurs, it is relatively difficult to replicate the data areas, because the same number of data areas would likely be assigned differently if the test was rerun. According to an embodiment, the test program creates an entry in a history file that identifies the process being used at the time of the error. The test program also creates the data configuration and timing table and which timing level(s) to which CPUs was being used at the time of the error.

Although data will be allocated differently if the test is rerun, it will make no difference because there still will be level 1 , 2 and 3 allocations relative to each CPU and the test actually can restart at the beginning of the process and failing process subset. For example, if CPUs 0 - 3 were testing level 2 timing paths when a failure occurred, the test would restart at the beginning of the level 2 timing process for the four CPUs.

Therefore, not only do the inventive methods described herein facilitate a greater degree of test coverage, they also facilitates a greater degree of reproducibility. Also, if it is determined that a particular failure occurred from a given CPU to a given timing level, the system is restarted with one or more architectural units in a "down" or unavailable state. The test then is restarted using the failing process with the smaller cache memory configuration to further isolate the failure. The significant aspect of this approach is the greater degree of reproducibility accomplished with a more

comprehensive test coverage.

[0077] The method of controlling the execution of the selected processes by the desired requestors requires both synchronous and asynchronous control to be implemented. To facilitate a relatively high degree of test coverage with a relatively high degree of determinism and reproducibility, synchronous control typically is implemented.

[0078] When executing under synchronous control mode, the activities created to run on each requestor are created prior to execution, made to go idle and then made to start executing the selected functions and processes under control of the main scheduling activity. Each activity executes the current function and related process until it either completes the process or the prescribed amount of time has elapsed. If the prescribed amount of time has elapsed, the executing activity goes idle and waits for a new process to be scheduled. If the activity has actually completed its selected process, the activity goes idle, waits until another execution process has been selected and all activities have been activated. In a synchronous control mode, all activities need to become idle prior to another process being selected and subsequently activating the selected requestors. This sequence of activity activation, process execution, activity deactivation, new process selection and activity activation continue until the end-of- session flag has been set or all activities have completed their assigned workload. For example, if six separate activities are scheduled to execute on CPUs 0 - 5, the activities are synchronized such that each activity executes a first process for a period of time, goes idle and then begins execution of a second process at the same time. As each activity begins a new process, the activity logs all relevant data so that in the event of an error, the test can be restarted with the set of failing parameters that were active at the time of the error.

[0079] The advantage of this type of synchronous control is that relatively complex sets of operating conditions can be created, monitored and largely reproduced, if necessary. Using the inventive methods described hereinabove, this synchronous control is achieved without corrupting the contents of cache and memory units, making the test package more comprehensive, efficient and reliable compared to conventional test packages. Also comprehensive sets of statistics can be built from each test session execution.

[0080] However, the synchronous control approach typically is not capable of generating random test conditions in a probabilistic manner while still possessing a degree of reproducibility. Also, the synchronous control approach is somewhat

"grooved," which is wholly desirable for a certain set of requirements but not for some other sets of requirements.

[0081] To facilitate the requirement to generate random sets of conditions with a degree of reproducibility, an asynchronous mode of operation can be employed. In an asynchronous mode, each activity is given a specific set of test processes to execute in a specific sequence over a designated amount of time. However, within those parameters, each activity is allowed to execute at its own pace. For example, if a first activity is given the set of processes 1 , 4, 6, 8 and a second activity is given the set of processes 1 , 2, 6, 9, when activated, the first activity runs through its set of processes under its own control. The first activity essentially is free-running until it either completes the execution of its set of processes or until the global timer for the sequence expires. All other scheduled activities follow the same execution pattern until the activities all complete execution of their assigned processes or until the global timer expires. In either case, each activity provides a comprehensive log of its execution so similar conditions are recreated in the event of an error. Given this type of logging, even largely asynchronous execution is recreated with a relatively high degree of accuracy and reproducibility.

[0082] Fig. 7A is a flow diagram of a portion of a method 300 for testing a computer system cache memory unit, according to an embodiment, and Figs. 7B, 7C and 7D are flow diagrams of other portions of the method 300 for testing a computer system cache memory unit, according to an embodiment. Reference to the method steps shown in Figs. 7A-7D will continue throughout the discussion hereinbelow.

[0083] The method 300 includes a step 302 of determining whether to start the testing from a history file. If the testing is to be started from a history file (Y), the method 300 proceeds to a series of steps for replaying a prior session, as will be discussed in greater detail hereinbelow. If the testing is not to be started from a history file (N), the method 300 proceeds to a step 304 of forming a selected parameter table.

[0084] Fig. 8 is a schematic view of a set 50 of generated parameter tables, according to an embodiment. To facilitate the requirements of both synchronous and asynchronous operation, a set of global device tables are constructed. At the start of a test session, the end user selects a set of parameters from a nested menu structure. These parameters are then merged with the parameters from a default parameter table 52 to form a selected parameter table 54. The selected parameter table 54 forms the basis from which a number of other tables are built, e.g., a global activity execution table 55, a current file table 56, an activity table 57, and an I/O device table 58. The formation of a default function table 74 and a global function table 72 will be discussed in greater detail hereinbelow with respect to a discussion of Fig. 10. [0085] Following the construction of the selected parameter table 54, the number of CPUs and the number of I/O devices are detected. Each of these detected numbers are entered in their respective tables to build the CPU table (step 306) and the I/O device table 58 (step 308).

[0086] Fig. 9 is a schematic view 60 of cache memory tables for the CPU, L1 , L2, L3 memories and main memory, according to an embodiment. The number and levels of the cache and memory units are detected in a suitable manner, e.g., according to the methods described in co-pending U.S. Patent Application Serial No. 12/962,767, entitled "Method and System for Determining a Cache Memory Configuration for Testing." This is shown as a step 310 of determining the system memory configuration.

[0087] Fig. 10 is a schematic view of a set 70 of function tables, according to an embodiment. At the completion of the construction of the global device tables, a global function table 72 is constructed (shown as a step 312) by merging a default function table 74, containing all default settings, with the user selected parameters. The global function table 72 contains all user selected functions, processes, memory allocation parameters and other parameters needed to start execution. The global function table 72 also is used to construct the execution tables. The step 312 also constructs a global execution table, which is discussed in greater detail hereinbelow with respect to Fig. 13.

[0088] Once all user parameters have been chosen and the device and global function tables have been constructed, the memory areas for all functions and processes are determined. This is shown as a step 314 of allocating memory according to global function table. This determination can be done in two distinct ways. The first determination approach is to calculate the maximum memory allocation needed by all processes prior to execution and then allocate subsets of that allocation for use by the individual processes. This approach has the advantage that all processes use a subset of a global allocation. However, because all processes do use a subset of the global allocation, limited allocation diversity is achieved. The second determination approach is to allocate memory at the start of each free-running process set. The advantages and disadvantages of the second determination approach are the opposite of the allocation strategy of the first determination approach. According to an embodiment, the first determination approach can be the default allocation strategy for the synchronous mode operation and the localized second determination approach can be the default allocation strategy for the asynchronous mode of operation. However, by careful selection of parameters, combinations of both determination approaches can be achieved in each mode of operation. Exactly how memory is allocated is also partially dependent on the system architecture in question.

[0089] Fig. 1 1 is a schematic view 80 of memory allocation using a proprietary architecture.

[0090] Fig. 12 is a schematic view of a generated table 90 relating the allocated memory area to the defined architectural levels, according to an embodiment. This is shown as a step 316. During the process of memory allocation, the table 90 is constructed that relates the memory area being allocated to the defined architectural levels, e.g., as shown in Fig. 5. Since the timings from a given CPU to its respective architectural cache memory levels have been determined using suitable techniques, e.g., as described in , co-pending U.S. Patent Application Serial No. 12/962,767, entitled "Method and System for Determining a Cache Memory Configuration for Testing," a subset of those timings are employed to determine the path lengths to a particular piece of data from each requestor under a number of different test conditions. This ability is useful because of the flexibility and determinism offered. For example, one could choose a process that exercises all requestor to timing level 4 conditions without having to spend inordinate quantities of time performing non-related executions such as timing level 1 , 2 or 3 accesses.

[0091] After memory allocation has taken place, the mass storage file allocation is made according to the user selected parameters. This is shown as a step 318. As the files are allocated, their data is placed in the file allocation table 56. While not shown in the diagram, additional data relating to the hardware and logical paths to a particular file can be inserted in the file allocation table 56 to facilitate a further refinement in the capabilities of the test program.

[0092] After the completion of the aforementioned activity, an execution history table is initialized. This is shown as a step 320. This table is used to analyze execution history and for fault isolation and diagnosis. [0093] Fig. 14 is a schematic view of a set 1 10 of generated execution history tables, according to an embodiment. In the synchronous mode of operation, the execution details are logged in an execution history table 1 12 as a global entry and are time- stamped accordingly. In the asynchronous mode of operation, each activity logs its execution details and status separately. At the end of a test session, if history logging has been enabled, the entries in the execution history table 1 12 are converted into an entry in a replay history table 1 14. This is shown as a step 322. These entries then can be used to establish the environment that was active at the time a particular error occurred. The test then can be restarted using this environment with a view to recreating the original error. Thus, any previous error that was logged has the potential to be recreated.

[0094] Referring again to Figs. 7A and 7B, the method 300 includes a step 324 of selecting synchronous control or an asynchronous mode of operation. If a user selects synchronous control (Y), the method 300 proceeds to a step 326 of creating activities, entering those activities into an activity table and setting the activity state to idle. If the user selects asynchronous control (N), the method 300 proceeds to a series of steps leading up to and including asynchronous activity, as will be discussed in greater detail hereinbelow.

[0095] Once the step 326 has been completed, the method 300 proceeds to a step 328 of setting the activity table pointer in the global execution table.

[0096] Once the activity table pointer in the global execution table has been set, the method 300 proceeds to a step 330 of setting the initial function and process pointers in the global execution table. It should be noted that if the synchronous mode of operation is being started in replay history mode, the function and process pointers are set to those from the history file.

[0097] Once the function and process pointers in the global execution table have been set, the method 300 proceeds to a step 332 of setting the timer for process execution per the current CPU combination. It should be noted that the timer can be made to be by process, function or total test time. Also, there could be a multi-level timing facility such that timers are nested with the highest level being total test time and the lowest level being CPU combination. [0098] Once the timer for process execution has been set, the method 300 proceeds to a step 334 of starting the synchronous control activities for the selected CPU combination. The steps of the synchronous control activities are shown in Fig. 7C and will be discussed in greater detail hereinbelow.

[0099] The method 300 also includes a step 336 of the selected CPU combination running the selected function/process pair.

[00100] The method 300 also includes a step 338 of determining whether an error termination has been set. If an error termination has been set (Y), the method 300 proceeds to a series of steps that will be described in greater detail hereinbelow.

[00101] If an error termination has not been set (N), the method 300 proceeds to a step 340 of setting all activities to idle when the timer expires.

[00102] Once all of the activities have been set to idle, the method 300 proceeds to a step 342 of determining whether there are more CPU combinations for the current process. If there are more CPU combinations for the current process (Y), the method 300 proceeds to a step 344 of updating the global activity table to point to the next CPU combination. The method 300 then returns to the step 332 of setting the timer for process execution per the next (now current) CPU combination.

[00103] If there are no more CPU combinations for the current process (N), the method 300 proceeds to a step 346 of determining whether there are more processes for the current function. If there are more processes for the current function (Y), the method 300 proceeds to a step 348 of updating the global execution table to point to t he next process. The method 300 then proceeds to a step 352 of resetting the CPU pointer to the start of the CPU selection. The method 300 then returns to the step 332 of setting the timer for process execution per the current CPU combination and the next process.

[00104] If there are no more processes for the current function (N), the method 300 proceeds to a step 354 of determining whether there are more functions. If there are more functions (Y), the method proceeds to a step 356 of updating the global execution table to point to the next function and process. The method 300 then proceeds to a step 358 of resetting the CPU pointer to the start of the CPU selection. The method 300 then returns to the step 332 of setting the timer for process execution per the current CPU combination and the next function.

[00105] If there are no more functions (N), the method 300 proceeds to a step 360 of updating the execution history and history replay table. The method 300 then proceeds to a step 362 of printing the final status and exiting the test.

[00106] Returning to the synchronous/asynchronous selecting step 324, if the user selects asynchronous control (N), the method 300 proceeds to a step 364 of creating a local activity execution table for each activity.

[00107] Once the local activity execution table for each activity has been created, the method 300 proceeds to a step 366 of creating activities, entering each activity into an activity table, setting its state to idle and establishing a pointer to the appropriate execution table. These tables are constructed by merging data from the global function table 72 and other tables to form the basis for actual process execution. In the synchronous mode of operation, all activities reference the global execution table because the activities are synchronized to execute in an identical fashion. However, in the asynchronous mode of operation, each activity will operate in a "free-running" mode with its uniquely defined set of processes. Therefore, a unique activity table is needed. This activity table is defined as a local execution table. Fig. 13 is a schematic view of a set 100 of activity tables, according to an embodiment.

[00108] The method 300 then proceeds to a step 368 of setting the initial function and process pointer in the local execution table.

[00109] Once the initial function and process pointer in the local execution table has been set, the method 300 proceeds to a step 370 of setting the initial function and process pointers in the local execution table. It should be noted that if the synchronous mode of operation is being started in replay history mode, the function and process pointers are set to those from the history file.

[00110] Once the function and process pointers in the local execution table have been set, the method 300 proceeds to a step 372 of setting the timer for all functions, processes and CPU combinations. It should be noted that the timer can be made to be by process, function or total test time. Also, there could be a multi-level timing facility such that tinners are nested with the highest level being total test time and the lowest level being CPU combination.

[00111] Once the timer for all functions, processes and CPU combinations has been set, the method 300 proceeds to a step 374 of starting all activities (asynchronous mode of operation). The steps of the asynchronous mode of operation are shown in Fig. 7D and will be discussed in greater detail hereinbelow.

[00112] The method 300 also includes a step 376 of determining whether an error termination has been set. If an error termination has been set (Y), the method 300 proceeds to a series of steps that will be described in greater detail hereinbelow.

[00113] If an error termination has not been set (N), the method 300 proceeds to a step 378 of determining whether the timer has expired. If the timer has not expired (N), the method 300 returns to the step 376 of determining whether an error termination has been set. If the timer has expired (Y), the method 300 proceeds to a step 380 of determining whether there are more CPU combinations.

[00114] If there are more CPU combinations (Y), the method 300 proceeds to a step 382 of updating the local execution table function and process. The method 300 then proceeds to a step 384 of updating the CPU pointers. The method 300 then returns to the step 374 of starting all activities (asynchronous mode of operation).

[00115] If there are no more CPU combinations (N), the method 300 proceeds to a step 386 of updating the execution history and history replay table. The method 300 then proceeds to a step 388 of printing the final status and exiting the test.

[00116] Returning to the step 302 of determining whether to start the testing from a history file, if the testing is to be started from a history file (Y), the method 300 proceeds to a step 390 of re-establishing the global function table. The method 300 then proceeds to a step 392 of verifying that the CPU, I/O and memory configuration has not changed. The method 300 then proceeds to a step 394 of determining the CPU to memory range timings and building the address range table.

[00117] The method 300 then proceeds to a step 396 of allocating the user mass storage files according to the user parameters. The method 300 then proceeds to a step 398 of initializing the global execution history table. The method 300 then proceeds to a step 400 of creating a new replay history entry. The method 300 then returns to the step 324 of selecting synchronous control or an asynchronous mode of operation.

[00118] Referring now to Fig. 7C, synchronous control now will be described. As discussed hereinabove, the method 300 includes a step 334 of starting the synchronous control activities for the selected CPU combination. The method 300 proceeds to a step 402 of informing the control activity of the current status.

[00119] The method 300 then proceeds to a step 404 of getting the current function and process pointers from the global execution table.

[00120] The method 300 then proceeds to a step 406 of executing the current process and a step 408 of continuing to execute the current process until the current process is done.

[00121] Once the current process is done, the method 300 proceeds to a step 410 of determining whether there are any errors. If there are no errors (N), the method 300 proceeds to a step 412 of determining whether the timer has expired or a terminate flag has been set. If the timer has not expired and no terminate flag has been set (N), the method 300 returns to the step 408 of executing the current process until the current process is done. If the timer has expired or if a terminate flag has been set (Y), the method 300 proceeds to a step 414 of updating the execution table.

[00122] The method 300 then proceeds to a step 416 in which the control activity is informed that it is going to become inactive and await further instructions.

[00123] The method 300 then proceeds to a step 418 in which the activity deactivates itself and waits to be reactivated.

[00124] Returning to the step 410 of determining whether there are any errors, if there are any errors, the method 300 proceeds to a step 420 of logging the error(s). The method 300 then proceeds to a step 422 of determining whether a stop-on-error setting has been set. If a stop-on-error setting has not been set (N), the method 300 returns to the step 412 of determining whether the timer has expired or a terminate flag has been set. If a stop-on-error setting has been set (Y), the method 300 proceeds to a step 424 of setting an error flag. The method 300 then proceeds to the deactivation step 418. [00125] The asynchronous mode of operation now will be described. As discussed hereinabove, the method 300 includes a step 374 of starting all activities (in the asynchronous mode of operation).

[00126] Referring now to Fig. 7D, asynchronous control now will be described. The method 300 proceeds to a step 426 of informing the control activity of the current status. The method 300 then proceeds to a step 428 of getting the current function and process pointers from the local execution table. The method 300 then proceeds to a step 430 of executing the current process and a step 432 of continuing to execute the current process until the current process is done.

[00127] Once the current process is done, the method 300 proceeds to a step 434 of determining whether there are any errors. If there are errors (Y), the method proceeds to a step 436 of logging the error information. The method then proceeds to a step 438 of updating the execution history file. The method then proceeds to a step 440 of updating the replay history file. The method then proceeds to a step 442 of terminating all activities. The method then proceeds to a step 444 of printing the final error status and exiting the test. Steps 436 through 444 also are performed if an error termination has been set (Y) from the step 338 (Fig. 7A).

[00128] Returning to the step 434 of determining whether there are any errors, if there are no errors (N), the method proceeds to a step 446 of determining whether the timer has expired or a terminate flag has been set. If the timer has not expired and no terminate flag has been set (N), the method 300 proceeds to a step 448 of updating the execution table.

[00129] The method then proceeds to a step 450 of determining whether there are more processes. If there are more processes (Y), the method proceeds to a step 452 of updating the local execution table to point to the next process. The method then returns to the step 432 of executing the next (now current) process until the current process is done.

[00130] If there are no more processes (N), the method 300 proceeds to a step 454 of determining whether there are more functions. If there are more functions (Y), the method proceeds to a step 456 of updating the global execution table to point to the next function and process. The method then returns to the step 432 of executing the next (now current) process until the current process is done.

[00131] If there are no more functions (N), the method proceeds to a step 458 of updating the execution history and history replay table. The method then proceeds to a step 462 of resetting the local execution table function and process pointers. The method then returns to the step 432 of executing the next (now current) process until the current process is done.

[00132] Returning to the step 446 of determining whether the timer has expired or a terminate flag has been set, if the timer has expired of if a terminate flag has been set (Y), the method 300 proceeds to a step 464 of updating the execution table. The method 300 then proceeds to a step 466 of updating the execution history file. The method 300 then proceeds to a step 468 of updating the replay history file.

[00133] The method 300 then proceeds to a step 470 in which the activity deactivates itself and begins waiting to be reactivated.

[00134] Fig. 15 is a schematic view of an apparatus 200 configured to test a computer system cache memory unit according to an embodiment. The apparatus 200 can be any apparatus, device or computing environment suitable for testing a computer system cache memory unit according to an embodiment. For example, the apparatus 200 can be or be contained within any suitable computer system, including a mainframe computer and/or a general or special purpose computer.

[00135] The apparatus 200 includes one or more general purpose (host) controllers or processors 202 that, in general, processes instructions, data and other information received by the apparatus 200. The processor 202 also manages the movement of various instructional or informational flows between various components within the apparatus 200. The processor 202 can include a cache memory configuration interrogation module (configuration module) 204 that is configured to execute and perform the cache memory unit configuration determining processes, e.g., as described in co-pending U.S. Patent Application Serial No. 12/962,767, entitled "Method and System for Determining a Cache Memory Configuration for Testing." Alternatively, the apparatus 200 can include a standalone cache memory configuration interrogation module 205 coupled to the processor 202. Also, the processor 202 includes a testing module 206 that is configured to execute and perform the cache memory unit testing processes described herein. Alternatively, the apparatus 200 can include a standalone testing module 207 coupled to the processor 202.

[00136] The apparatus 200 also can include a memory element or content storage element 208, coupled to the processor 202, for storing instructions, data and other information received and/or created by the apparatus 200. In addition to the memory element 208, the apparatus 200 can include at least one type of memory or memory unit (not shown) within the processor 202 for storing processing instructions and/or information received and/or created by the apparatus 200.

[00137] The apparatus 200 also can include one or more interfaces 212 for receiving instructions, data and other information. It should be understood that the interface 212 can be a single input/output interface, or the apparatus 200 can include separate input and output interfaces.

[00138] One or more of the processor 202, the configuration module 204, the configuration module 205, the testing module 206, the testing module 207, the memory element 208 and the interface 212 can be comprised partially or completely of any suitable structure or arrangement, e.g., one or more integrated circuits. Also, it should be understood that the apparatus 200 includes other components, hardware and software (not shown) that are used for the operation of other features and functions of the apparatus 200 not specifically described herein.

[00139] The apparatus 200 can be partially or completely configured in the form of hardware circuitry and/or other hardware components within a larger device or group of components. Alternatively, the apparatus 200 can be partially or completely configured in the form of software, e.g., as processing instructions and/or one or more sets of logic or computer code. In such configuration, the logic or processing instructions typically are stored in a data storage device, e.g., the memory element 208 or other suitable data storage device (not shown). The data storage device typically is coupled to a processor or controller, e.g., the processor 202. The processor accesses the necessary instructions from the data storage element and executes the instructions or transfers the instructions to the appropriate location within the apparatus 200. [00140] One or more of the configuration module 204, the configuration module 205, the testing module 206 and the testing module 207 can be implemented in software, hardware, firmware, or any combination thereof. In certain embodiments, the module(s) may be implemented in software or firmware that is stored in a memory and/or associated components and that are executed by the processor 202, or any other processor(s) or suitable instruction execution system. In software or firmware

embodiments, the logic may be written in any suitable computer language. One of ordinary skill in the art will appreciate that any process or method descriptions associated with the operation of the configuration module 204, the configuration module 205, the testing module 206 and/or the testing module 207 may represent modules, segments, logic or portions of code which include one or more executable instructions for implementing logical functions or steps in the process. It should be further appreciated that any logical functions may be executed out of order from that described, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art. Furthermore, the modules may be embodied in any non-transitory computer readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

[00141] It should be understood that the methods and systems described herein are not limited to testing a computer system cache memory unit. For example, many computer applications programs often consist of multiple threads of execution, each of which may access specifically allocated data areas. Each thread may execute on one or a set of requestors at the direction of the operating system. These systems often consist of multiple requestors (CPUs) that access data contained within multiple memory units at different hierarchical levels. In such systems, the path length from individual requestors to retrieve a specific piece of data is dependent on the memory level in which the data resides and its connection path to the specific requestor.

Referring again to Fig. 6, it can be seen that data residing in MEM 2 has a shorter path length to IP 4 and IP 5 than it does to all remaining IPs (CPUs). In systems that have only a single requestor, for a single memory unit or multiple requestors connected to multiple memory units at the same architectural level, the timing from any requestor to any piece of data is uniform.

[00142] In systems such as the one shown in Fig. 6, there are potential opportunities to optimize an application's performance by allocating execution threads to CPUs whose path length to the requested memory data is the shortest. For example, it would be desirable to allocate an execution thread that requests data residing MEM 2 to either or both IP 4 and IP 5. The viability of this approach is dependent on the modularity of such data and the operational characteristics of the system, particularly its memory management.

[00143] Many computer systems allocate real memory space based on dividing application data into pages and allocating those pages to randomly assigned areas in the different memory units. The size of an individual page is system dependent and may have more than one setting on a particular system, for example, the system shown in Fig. 6 typically uses 4096 word (36 bit) pages. If an application program has a data area that is 4096 words or less, determining in which memory unit the data is allocated allows a requesting execution thread to be allocated to the requester whose path length to the data is minimal. For example, if a page of data for an application program has been allocated in MEM 2, the requestors whose path length to the data is the shortest are IP 4 and IP 5. Hence, allocating the application execution thread to either IP 4 or IP 5 provides the fastest possible execution.

[00144] For example, assume an application execution thread allocates a data area that spans three pages and each page is allocated to a different memory unit, if the path length to each data area is determined and it is seen that two data pages have a shorter path length to specific requestors than the remaining page, it is possible to allocate the execution thread to the set of requestors that is closest to the first two data pages. This improves the execution time for that thread. For example, in the system shown in Fig. 6, assume that two data pages have been allocated to MEM 1 and MEM 2 and a third data page has been allocated to MEM 6. A referencing execution thread should ideally be allocated to run on either IP 4 and/or IP 5 for optimal performance. This approach can be extended to include multiple execution threads, each containing multiple data areas. The method of control can be tailored to meet the individual needs of each application program.

[00145] The methods and systems described herein provide a means by which this type of allocation and application execution control are achieved in conjunction with suitable techniques for determining the number and levels of the cache and memory units, such as described in co-pending U.S. Patent Application Serial No. 12/962,767, entitled "Method and System for Determining a Cache Memory Configuration for Testing." The control mechanisms described herein perform the allocation and execution control for an application in which a suitable technique for determining the number and levels of the cache and memory units has been integrated. A variation of those techniques will build memory timing tables based on the need of the application program and will be provided as input to the control techniques described in this submission. Hence, the methods and systems described herein can be extended past the needs of testing new computer system architecture into the realm of running commercial applications.

[00146] The functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted as one or more instructions or code on a non-transitory computer-readable medium. The methods illustrated in FIGs. 7A and 7B may be implemented in a general, multi-purpose or single purpose processor. Such a processor will execute instructions, either at the assembly, compiled or machine-level, to perform that process. Those instructions can be written by one of ordinary skill in the art following the description of FIGs. 7A and 7B and stored or transmitted on a non- transitory computer readable medium. The instructions may also be created using source code or any other known computer-aided design tool. A non-transitory computer readable medium may be any medium capable of carrying those instructions and includes random access memory (RAM), dynamic RAM (DRAM), flash memory, readonly memory (ROM), compact disk ROM (CD-ROM), digital video disks (DVDs), magnetic disks or tapes, optical disks or other disks, silicon memory (e.g., removable, non-removable, volatile or non-volatile), and the like. [00147] It will be apparent to those skilled in the art that many changes and substitutions can be made to the embodiments described herein without departing from the spirit and scope of the disclosure as defined by the appended claims and their full scope of equivalents.

Claims

1 . A method for operating a computer system having a plurality of central processing units (CPUs) and a cache memory architecture with a plurality of cache memory levels, the method comprising:

establishing a controlling CPU and at least one controlled CPU from the plurality of CPUs;

deactivating by the controlling CPU at least one controlled CPU by placing the controlled CPUs in a first idle state whereby the controlled CPU operates in an idle loop that is resident in a first memory level associated with the controlled CPUs; and

activating by the controlling CPU at least one controlled CPU by placing the controlled CPU in a second activation state whereby the controlled CPU can access all memory levels.

2. The method as recited in claim 1 , wherein each of the plurality of CPUs includes a unique cacheline in a memory location associated therewith, and wherein the controlled CPU operating in an idle loop includes the controlled CPU testing the unique cacheline in the memory location associated therewith.

3. The method as recited in claim 1 , wherein each of the plurality of CPUs includes a unique cacheline in a memory location associated therewith, wherein deactivating at least one controlled CPU includes setting a cacheline status value in the unique cacheline associated with the controlled CPU to a first deactivation value, and wherein activating at least one controlled CPU includes setting the cacheline status value in the unique cacheline associated with the controlled CPU to a second activation value.

4. The method as recited in claim 1 , wherein, when in the first idle state, the controlled CPU makes memory requests only within the first memory level associated with the controlled CPU.

5. The method as recited in claim 1 , further comprising the controlling CPU testing the cache memory of at least a portion of the cache memory architecture while the at least one controlled CPU is deactivated.

6. A method for testing a cache memory within a computer system having a plurality of central processing units (CPUs) and a cache memory architecture with a plurality of cache memory levels, the method comprising:

determining the cache memory configuration of the computer system;

deactivating by the controlling CPU at least one controlled CPU by placing the controlled CPUs in a first idle state whereby the controlled CPU operates in an idle loop that is resident in a first memory level associated with the controlled CPU;

allocating data areas to be tested based on the deactivated controlled CPUs and the activated controlled CPU;

testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under synchronous control; and

testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under an asynchronous mode of operation.

7. The method as recited in claim 6, wherein testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under synchronous control includes:

creating at least one testing activity to run on each activated controlled CPU, executing the testing activity until the testing activity is completed or an amount of time has elapsed, and executing the next testing activity until all testing activities have either been completed or have expired.

8. The method as recited in claim 6, wherein testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under an

asynchronous mode of operation includes:

creating a testing activity having a set of test processes and a sequence in which to execute the set of test processes, and

executing the set of processes in the testing activity until all processes have either been completed or an amount of time has elapsed.

9. The method as recited in claim 6, further comprising creating an entry in a history file if an error occurs during testing, wherein the entry created in the history file identifies the process, data configuration and data timings being used at the time of the error.

10. The method as recited in claim 6, wherein accessing data during testing is one of CPU relative, timing level relative and architectural component relative.

1 1 . The method as recited in claim 6, further comprising constructing at least one table based on a set of testing parameters, wherein at least one of the testing under synchronous control and testing under an asynchronous mode of operation is based on the at least one constructed table.

12. The method as recited in claim 1 1 , wherein testing functions and processes are determined based on the at least one constructed table.

13. The method as recited in claim 1 1 , wherein data areas to be tested are determined based on the at least one constructed table.

14. A computing device, comprising: a processor;

a memory element coupled to the processor, wherein the memory element includes cache memory having a plurality of cache memory levels;

a cache memory configuration interrogation module coupled to the processor and to the memory element, wherein the cache memory configuration interrogation module is configured to determine the cache memory configuration of the computer system; and a cache memory testing module coupled to the processor and to the memory element, wherein the cache memory testing module is configured to

determine the cache memory configuration of the computer system;

establish a controlling CPU and at least one controlled CPU from the plurality of

CPUs;

deactivate by the controlling CPU at least one controlled CPU by placing the controlled CPUs in a first idle state whereby the controlled CPU operates in an idle loop that is resident in a first memory level associated with the controlled CPU;

activate by the controlling CPU at least one controlled CPU by placing the controlled CPU in a second activation state whereby the controlled CPU can access all memory levels.

allocate data areas to be tested based on the deactivated controlled CPUs and the activated controlled CPU;

test at least a portion of the cache memory levels of at least one deactivated controlled CPU under synchronous control; and

test at least a portion of the cache memory levels of at least one deactivated controlled CPU under an asynchronous mode of operation.

15. The computing device as recited in claim 14, wherein testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under synchronous control includes:

16. The computing device as recited in claim 14, wherein testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under an asynchronous mode of operation includes:

17. The computing device as recited in claim 14, wherein the cache memory testing module is configured to create an entry in a history file if an error occurs during testing, wherein the entry created in the history file identifies the process, data configuration and data timings being used at the time of the error.

18. The computing device as recited in claim 14, wherein the cache memory testing module is configured to construct at least one table based on a set of testing parameters, wherein at least one of the testing under synchronous control and testing under an asynchronous mode of operation is based on the at least one constructed table.

19. The computing device as recited in claim 18, wherein the cache memory testing module is configured to determine testing functions and processes based on the at least one constructed table.

20. The computing device as recited in claim 18, wherein the cache memory testing module is configured to determine data areas to be tested based on the at least one constructed table.

21 . A non-transitory computer readable medium having instructions stored thereon which, when executed by a processor, carry out a method for testing a cache memory within a computer system having a plurality of central processing units (CPUs) and a cache memory architecture with a plurality of cache memory levels, the instructions comprising:

instructions for determining the cache memory configuration of the computer system;

instructions for establishing a controlling CPU and at least one controlled CPU from the plurality of CPUs;

instructions for deactivating by the controlling CPU at least one controlled CPU by placing the controlled CPUs in a first idle state whereby the controlled CPU operates in an idle loop that is resident in a first memory level associated with the controlled CPU;

instructions for activating by the controlling CPU at least one controlled CPU by placing the controlled CPU in a second activation state whereby the controlled CPU can access all memory levels.

instructions for allocating data areas to be tested based on the deactivated controlled CPUs and the activated controlled CPU;

instructions for testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under synchronous control; and

instructions for testing at least a portion of the cache memory levels of at least one deactivated controlled CPU under an asynchronous mode of operation.