CN113868109B

CN113868109B - Method, apparatus, device and readable medium for evaluating performance of multiprocessor interconnection

Info

Publication number: CN113868109B
Application number: CN202111158110.7A
Authority: CN
Inventors: 邹晓峰
Original assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Current assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2024-04-19
Anticipated expiration: 2041-09-30
Also published as: CN113868109A

Abstract

The invention provides a method, a device, equipment and a readable medium for evaluating the performance of multiprocessor interconnection access, wherein the method comprises the following steps: modeling multiprocessor interconnection access sharing a Cache consistency protocol, acquiring transmission path information of each access request, and quantifying the length of a path based on the transmission path information; acquiring delay data of each transmission stage in the transmission path according to process delay model data of each processing component in the transmission path; and evaluating the system performance according to the acquired delay data of each transmission stage in the transmission path. By using the scheme of the invention, the delay and the bandwidth of local and remote access of the multiprocessor under a specific consistency protocol can be measured and calculated, and the access delay and the bandwidth between nodes of the system can be quantitatively evaluated at the beginning of the collaborative chip design.

Description

Method, apparatus, device and readable medium for evaluating performance of multiprocessor interconnection

Technical Field

The present invention relates to the field of computers, and more particularly, to a method, apparatus, device, and readable medium for performance evaluation of multiprocessor interconnect access.

Background

The shared-memory multiprocessor is an important structure in computer architecture. Its processors are interconnected through dedicated point-to-point coherence protocol interfaces, complex topologies and interconnection networks, which provides coherent interconnection of the processors and sharing of the global memory. Depending on the number of processor interfaces and on whether those interfaces support coherence protocol forwarding, a shared-memory multiprocessor system can be built either by connecting the processors directly or by using a cooperative chip as an agent. For example, in IBM's Power minicomputers, the Power processor has enough direct-connection ports that a 16-way system can be built by connecting the processors directly. The Intel Xeon processor usually has only 3 processor interconnection interfaces (QPI or UPI); these interfaces support multi-hop coherence protocol forwarding, and a system of at most 8 ways can be built from Xeon processors alone. A 16-way system therefore uses a processor cooperative chip as an intermediate agent for interface expansion and message forwarding, and large-scale multiprocessor systems of more than 16 ways can be built on such cooperative chips. The processor cooperative chip mainly processes and forwards Cache coherence protocol messages among multiple processors in accordance with the processors' coherence protocol; it requires a processor interface on the side facing the processors and an interconnection network interface on the side facing the other processors. At the initial stage of processor and cooperative chip design, the chip has not yet been physically implemented, so the delay and bandwidth between the nodes of the multiprocessor system cannot be tested.

Disclosure of Invention

In view of the above, an object of the embodiments of the present invention is to provide a method, apparatus, device and readable medium for evaluating the performance of multiprocessor interconnection access. With the technical solution of the present invention, the delay and bandwidth of local and remote accesses of a multiprocessor under a specific coherence protocol can be measured and calculated, and the access delay and bandwidth between the nodes of the system can be quantitatively evaluated at the beginning of cooperative chip design.

In view of the foregoing, an aspect of an embodiment of the present invention provides a method for performance evaluation of multiprocessor interconnect accesses, comprising the steps of:

Modeling multiprocessor interconnection access sharing a Cache consistency protocol, acquiring transmission path information of each access request, and quantifying the length of a path based on the transmission path information;

Acquiring delay data of each transmission stage in the transmission path according to process delay model data of each processing component in the transmission path;

And evaluating the system performance according to the acquired delay data of each transmission stage in the transmission path.

According to one embodiment of the invention, the system performance includes total delay data and bandwidth data for remote access.

According to one embodiment of the present invention, evaluating system performance based on acquired delay data for each transmission stage in a transmission path includes:

Adding the delay data of each stage in the transmission path information of each access request to obtain the total delay data of each access request;

Obtaining the maximum number of requests and the data packet size for accesses between processors;

Calculating the bandwidth data using the following formula:

bandwidth = maximum number of requests sent × packet size / (total delay + maximum number of requests − 1).
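As a non-limiting illustration, the formula can be written with explicit symbols. The symbols B, N, S and T are introduced here for readability only and do not appear in the source; the worked example given later in the description is consistent with reading T in nanoseconds and the term N − 1 in the same unit.

```latex
B = \frac{N \cdot S}{T + (N - 1)}
% B: bandwidth
% N: maximum number of requests sent
% S: data packet size
% T: total delay of one access request
```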

In another aspect of an embodiment of the present invention, there is provided an apparatus for evaluating performance of multiprocessor interconnect access, the apparatus including:

the modeling module is configured to model the multiprocessor interconnection accesses sharing the Cache consistency protocol, acquire the transmission path information of each access request and quantify the length of the path based on the transmission path information;

The acquisition module is configured to acquire delay data of each transmission stage in the transmission path according to the process delay model data of each processing component in the transmission path;

and the evaluation module is configured to evaluate the system performance according to the acquired delay data of each transmission stage in the transmission path.

According to one embodiment of the invention, the evaluation module is further configured to:

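Bandwidth data is calculated using the following formula:

bandwidth = maximum number of requests sent × packet size / (total delay + maximum number of requests − 1).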

In another aspect of the embodiments of the present invention, there is also provided a computer apparatus including:

At least one processor; and

And a memory storing computer instructions executable on the processor, the instructions when executed by the processor performing the steps of any of the methods described above.

In another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the methods described above.

The invention has the following beneficial technical effects: the method for evaluating the performance of multiprocessor interconnection access provided by the embodiment of the invention models the multiprocessor interconnection access sharing the Cache consistency protocol, acquires the transmission path information of each access request, and quantifies the length of the path based on the transmission path information; acquires the delay data of each transmission stage in the transmission path according to the process delay model data of each processing component in the transmission path; and evaluates the system performance according to the acquired delay data of each transmission stage in the transmission path. With this technical solution, the delay and bandwidth of local and remote accesses of the multiprocessor under a specific consistency protocol can be measured and calculated, and the access delay and bandwidth between the nodes of the system can be quantitatively evaluated at the beginning of collaborative chip design.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart diagram of a method of performance evaluation of multiprocessor interconnect accesses, according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an apparatus for performance evaluation of multiprocessor interconnect accesses, according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a computer device according to one embodiment of the invention;

Fig. 4 is a schematic diagram of a computer-readable storage medium according to one embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

With the above object in view, in a first aspect, an embodiment of a method for performance evaluation of multiprocessor interconnect accesses is presented. Fig. 1 shows a schematic flow chart of the method.

As shown in fig. 1, the method may include the steps of:

S1, modeling multiprocessor interconnection access sharing a Cache consistency protocol, acquiring transmission path information of each access request, and quantifying the length of a path based on the transmission path information.

By modeling and analyzing the Cache consistency protocol shared by the multiple processors, the path information of the various memory access transmission processes can be obtained. For example, for a directory remote access request under a given consistency protocol, the path of that access can be derived from the protocol jump (hop) process. By analyzing the protocol jumps and the transmission interface paths, the lengths of the various access transmission paths are quantified in units of a single Flit transmission delay, where one Flit unit is the time taken to transmit and process one Flit message data packet.
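As a non-limiting illustration, the quantification of a path in Flit units can be sketched in Python. The `Hop` type, the hop endpoints and the per-hop Flit counts below are hypothetical placeholders chosen for the example; the source does not specify a data structure or concrete hop counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hop:
    """One protocol jump between two agents (hypothetical representation)."""
    source: str
    target: str
    flits: int  # number of single-Flit transmission delays charged to this hop


def path_length_in_flits(path: List[Hop]) -> int:
    """Quantify a transmission path as a count of single-Flit delay units."""
    return sum(hop.flits for hop in path)


# Hypothetical path for one remote access request; the values are illustrative only.
remote_request_path = [
    Hop("requesting CPU", "local co-chip", 1),
    Hop("local co-chip", "remote co-chip", 2),
    Hop("remote co-chip", "home CPU", 1),
]

print(path_length_in_flits(remote_request_path))
```

Keeping the path as an ordered list of hops makes it straightforward to attach a delay to each hop in the next step.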

S2, delay data of each transmission stage in the transmission path are obtained according to process delay model data of each processing component in the transmission path.

Each stage of the memory access process is assigned a value based on the actual delay data of the specific processor interface, the process delay model data of the implementation process adopted by the cooperative chip, the cooperative chip frequency, and the like. For example, in a specific cooperative chip implementation, delays are assumed for each module according to the module division of the chip, combining the operating frequency with the process delay model; the assumed data are obtained by comprehensive calculation from the design and a reference model, as shown in the following table:

Table 1. Delay of each processing component

Processing component	Delay
Physical layer	15 ns
Link layer	15 ns
Dispatch module	5 ns
Remote protocol processing	15 ns
On-chip switching	9 ns
Network interface	15 ns
Network interface link layer	25 ns
Local protocol processing	15 ns
Accessing CPU memory	60 ns
Cooperative chip directory Cache hit	5 ns
Snooping one node	50 ns
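As a non-limiting illustration, the per-component delays of Table 1 can be held in a lookup table and accumulated over the components that one transmission stage traverses. The function name and the traversal order in `outbound_stage` are assumptions made for this sketch; the source does not state which components make up which stage.

```python
# Assumed per-component delays in nanoseconds, taken from Table 1.
COMPONENT_DELAY_NS = {
    "physical layer": 15,
    "link layer": 15,
    "dispatch module": 5,               # the table's distributing device
    "remote protocol processing": 15,
    "on-chip switching": 9,
    "network interface": 15,
    "network interface link layer": 25,
    "local protocol processing": 15,    # the table's native protocol processing
    "accessing CPU memory": 60,
    "co-chip directory Cache hit": 5,
    "snooping one node": 50,            # the table's monitoring of one node
}


def stage_delay_ns(components):
    """Sum the delays of the components traversed by one transmission stage."""
    return sum(COMPONENT_DELAY_NS[name] for name in components)


# Hypothetical traversal order for an outbound request; illustrative only.
outbound_stage = [
    "physical layer",
    "link layer",
    "dispatch module",
    "remote protocol processing",
    "on-chip switching",
    "network interface",
    "network interface link layer",
]

print(stage_delay_ns(outbound_stage), "ns")
```

An unknown component name raises `KeyError`, which surfaces a mismatch between the path model and the delay table early.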

S3, evaluating the system performance according to the acquired delay data of each transmission stage in the transmission path.

The total path delay of each kind of memory access transaction, and the average local and remote delays, are calculated by accumulating the path delays. The bandwidth is the amount of data transferred divided by the time taken, that is, bandwidth = maximum number of requests sent × packet size / (total delay + maximum number of requests − 1). For example, under the delay assumptions for the co-chip, with a maximum number of requests of 1024, a packet size of 64 bytes and a total delay of 725 ns, the remote bandwidth of a single CPU through a single co-chip is approximately: 1024 × 64 bytes / (725 + 1023) ns ≈ 37 GB/s.
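As a non-limiting illustration, the bandwidth calculation can be sketched in Python. The function and parameter names are hypothetical. The sketch assumes, as the worked example implies, that the total delay is expressed in nanoseconds and that the term "maximum number of requests minus 1" is added in the same unit, so that bytes per nanosecond can be read directly as GB/s.

```python
def remote_bandwidth_gb_per_s(max_requests, packet_bytes, total_delay_ns):
    """bandwidth = N * S / (T + N - 1), returned in bytes per ns (i.e. GB/s)."""
    if max_requests < 1:
        raise ValueError("max_requests must be at least 1")
    # Time charged for the whole batch, in nanoseconds (assumed unit).
    elapsed_ns = total_delay_ns + (max_requests - 1)
    return max_requests * packet_bytes / elapsed_ns


# Figures from the worked example: 1024 requests, 64-byte packets, 725 ns total delay.
bandwidth = remote_bandwidth_gb_per_s(1024, 64, 725)
print(f"{bandwidth:.1f} GB/s")
```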

The processor co-chip is an interconnection chip whose multi-layer structure is designed around the Cache coherence protocol among multiple processors. A complete Cache coherence protocol generally comprises several layers of sub-protocols and must include a protocol layer, a link layer, a physical layer, a transport layer, a routing layer and the like for transmitting and forwarding protocol messages. The architecture of a processor co-chip generally includes: a physical layer (responsible for interconnection with a processor); a link layer (responsible for packetizing messages, flow control, and forwarding messages to or receiving them from the physical layer); a dispatch module (responsible for dispatching interface messages to the protocol processing modules or receiving messages from them); a protocol processing engine (typically divided into multiple protocol processing pipelines according to remote and local agents); on-chip switching (routing and switching of messages between the protocol processing modules and the network interfaces); and a network interface (the interconnection interface between system nodes, through which the co-chip interconnects with other co-chips).

The invention mainly addresses the lack of an effective method for evaluating multiprocessor performance at the initial stage of processor collaborative chip design. For a multiprocessor system, especially one that contains collaborative chips, the access delay and bandwidth between the nodes of the system can be quantitatively evaluated at the beginning of collaborative chip design, and the evaluation result can serve as an important basis for judging the feasibility of the design scheme.

According to the technical scheme, the delay and the bandwidth of local and remote access of the multiprocessor under the specific consistency protocol can be measured and calculated, and the access delay and the bandwidth between the nodes of the system can be quantitatively evaluated at the beginning of the collaborative chip design.

In a preferred embodiment of the invention, the system performance includes total delay data and bandwidth data for remote access.

In a preferred embodiment of the present invention, evaluating the system performance based on the acquired delay data for each transmission stage in the transmission path comprises:

The delay data of each stage in the transmission path information of each access request are added to obtain the total delay data of each access request. The total path delay of each kind of memory access transaction, and the average local and remote delays, are calculated by accumulating the path delays. For example, under the delay assumptions for the co-chip, the total delay for a remote access is 725 ns, which is the sum of the following delays: READCLEAN to RDEX: 135 ns; RDEX to READCLEAN: 200 ns; compdata_uc to PGER: 170 ns; PGER to compdata_uc: 135 ns; compack: 85 ns.
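As a non-limiting illustration, the accumulation of the stage delays listed above can be checked with a few lines of Python. The variable names are hypothetical; the stage labels and values are those of the example.

```python
# Stage delays of the example remote access, in nanoseconds.
STAGE_DELAY_NS = [
    ("READCLEAN to RDEX", 135),
    ("RDEX to READCLEAN", 200),
    ("compdata_uc to PGER", 170),
    ("PGER to compdata_uc", 135),
    ("compack", 85),
]

# Total delay of the access is the sum over all stages of its path.
total_delay_ns = sum(delay for _, delay in STAGE_DELAY_NS)

for stage, delay in STAGE_DELAY_NS:
    share = 100 * delay / total_delay_ns
    print(f"{stage:22s} {delay:4d} ns  {share:5.1f} %")
print(f"{'total':22s} {total_delay_ns:4d} ns")
```

Printing each stage's share of the total is an addition of this sketch; it shows which stage dominates the remote access delay.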

Bandwidth data is calculated using the following formula:

Bandwidth = maximum number of requests sent × packet size / (total delay + maximum number of requests − 1). The bandwidth is the amount of data transferred divided by the time taken. For example, under the delay assumptions for the co-chip, with a maximum number of requests of 1024, a packet size of 64 bytes and a total delay of 725 ns, the remote bandwidth of a single CPU through a single co-chip is approximately: 1024 × 64 bytes / (725 + 1023) ns ≈ 37 GB/s.
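As a non-limiting illustration, the same formula can be swept over the maximum number of requests to show how the estimate responds to that parameter. The function name, the default values and the chosen sweep points are assumptions of this sketch; only the formula and the 64-byte, 725 ns figures come from the example. With the delay read in nanoseconds, the formula approaches, but does not reach, the packet size per nanosecond as the number of requests grows.

```python
def estimated_bandwidth(n_requests, packet_bytes=64, total_delay_ns=725):
    """Bandwidth in bytes per ns for a given maximum number of requests."""
    return n_requests * packet_bytes / (total_delay_ns + n_requests - 1)


# Hypothetical sweep points; 1024 is the value used in the example.
sweep = [1, 16, 256, 1024, 4096]
results = [(n, estimated_bandwidth(n)) for n in sweep]

for n, bw in results:
    print(f"max requests = {n:5d}  ->  {bw:6.2f} bytes/ns")
```

A sweep of this kind can help judge how many outstanding requests a design needs before the total delay stops dominating the estimate.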

It should be noted that, it will be understood by those skilled in the art that all or part of the procedures in implementing the methods of the above embodiments may be implemented by a computer program to instruct related hardware, and the above program may be stored in a computer readable storage medium, and the program may include the procedures of the embodiments of the above methods when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (Random Access Memory, RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the method embodiments described above.

Furthermore, the method disclosed according to the embodiment of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. When executed by the CPU, the computer program performs the functions defined above in the methods disclosed in the embodiments of the present invention.

With the above object in view, in a second aspect, an apparatus for evaluating performance of multiprocessor interconnection accesses is provided, as shown in fig. 2, where an apparatus 200 includes:

The modeling module 201, the modeling module 201 is configured to model the multiprocessor interconnection access sharing the Cache consistency protocol, obtain the transmission path information of each access request, and quantify the length of the path based on the transmission path information;

an acquisition module 202, the acquisition module 202 being configured to acquire delay data for each transmission stage in the transmission path according to the process delay model data for each processing component in the transmission path;

And an evaluation module 203, wherein the evaluation module 203 is configured to evaluate the system performance according to the acquired delay data of each transmission stage in the transmission path.
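As a non-limiting illustration, the division of work among the three modules can be sketched in Python. The class names mirror modules 201 to 203, but the method names, the data layout (a request type mapped to a list of stages, each stage a list of component names) and the sample inputs are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ModelingModule:
    """Module 201: holds the modeled transmission path of each request type."""
    protocol_paths: Dict[str, List[List[str]]]  # request type -> stages -> components

    def transmission_path(self, request_type: str) -> List[List[str]]:
        return self.protocol_paths[request_type]


@dataclass
class AcquisitionModule:
    """Module 202: turns component delay model data into per-stage delays."""
    component_delay_ns: Dict[str, float]

    def stage_delays(self, path: List[List[str]]) -> List[float]:
        return [sum(self.component_delay_ns[c] for c in stage) for stage in path]


class EvaluationModule:
    """Module 203: derives total delay and bandwidth from the stage delays."""

    def evaluate(self, stage_delays_ns: List[float], max_requests: int,
                 packet_bytes: int) -> Tuple[float, float]:
        total_ns = sum(stage_delays_ns)
        bandwidth = max_requests * packet_bytes / (total_ns + max_requests - 1)
        return total_ns, bandwidth


# Hypothetical two-stage path and delay values, for illustration only.
modeling = ModelingModule(
    {"remote read": [["physical layer", "link layer"], ["network interface"]]}
)
acquisition = AcquisitionModule(
    {"physical layer": 15, "link layer": 15, "network interface": 15}
)
evaluation = EvaluationModule()

path = modeling.transmission_path("remote read")
delays = acquisition.stage_delays(path)
total_ns, bandwidth = evaluation.evaluate(delays, max_requests=1024, packet_bytes=64)
print(delays, total_ns, round(bandwidth, 2))
```

Keeping the path model, the delay data and the evaluation separate allows the delay table to be replaced when the process or the operating frequency changes, without touching the protocol model.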

In a preferred embodiment of the invention, the evaluation module is further configured to:

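Bandwidth data is calculated using the following formula:

bandwidth = maximum number of requests sent × packet size / (total delay + maximum number of requests − 1).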

Based on the above object, a third aspect of the embodiments of the present invention proposes a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in fig. 3, an embodiment of the present invention includes the following means: at least one processor 21; and a memory 22, the memory 22 storing computer instructions 23 executable on the processor, the instructions when executed by the processor performing the method of:

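Bandwidth data is calculated using the following formula:

bandwidth = maximum number of requests sent × packet size / (total delay + maximum number of requests − 1).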

Based on the above object, a fourth aspect of the embodiments of the present invention proposes a computer-readable storage medium. Fig. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer-readable storage medium S31 stores a computer program S32 that, when executed by a processor, performs the following method:

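Bandwidth data is calculated using the following formula:

bandwidth = maximum number of requests sent × packet size / (total delay + maximum number of requests − 1).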

Furthermore, the method disclosed according to the embodiment of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. The above-described functions defined in the methods disclosed in the embodiments of the present invention are performed when the computer program is executed by a processor.

Furthermore, the above-described method steps and system units may also be implemented using a controller and a computer-readable storage medium storing a computer program for causing the controller to implement the above-described steps or unit functions.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one location to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general purpose or special purpose computer or a general purpose or special purpose processor. Further, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The serial numbers of the foregoing embodiments of the present invention are for the purpose of description only and do not represent the advantages or disadvantages of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims

1. A method of performance evaluation for multiprocessor interconnect access, comprising the steps of:

Modeling multiprocessor interconnection access sharing a Cache consistency protocol, obtaining transmission path information of each access request and quantifying the length of a path based on the transmission path information, wherein obtaining the transmission path information of each access request and quantifying the length of the path based on the transmission path information comprises: obtaining path information of various access transmission processes by modeling and analyzing a multiprocessor shared Cache consistency protocol, obtaining paths of access requests according to protocol jump processes aiming at directory remote access requests under the Cache consistency protocol, and quantifying the lengths of various access transmission paths by taking single Flit transmission delay as a unit, wherein Flit refers to the time taken for transmitting and processing a Flit message data packet;

acquiring delay data of each transmission stage in the transmission path according to the process delay model data of each processing component in the transmission path, wherein acquiring the delay data of each transmission stage in the transmission path according to the process delay model data of each processing component in the transmission path comprises: assigning values to each stage of the memory access process based on actual delay data of a specific processor interface, process delay model data realized by a cooperative chip and cooperative chip frequency, dividing the memory access process according to a module of the chip, and carrying out delay assumption by combining an operation frequency and a process delay model, wherein the assumption data is comprehensively calculated through a design and reference model;

2. The method of claim 1, wherein the system performance comprises remote access to total delay data and bandwidth data.

3. The method of claim 2, wherein evaluating system performance based on the acquired delay data for each transmission stage in the transmission path comprises:

4. The method of claim 2, wherein evaluating system performance based on the acquired delay data for each transmission stage in the transmission path comprises:

Obtaining the maximum number of requests and the data packet size for accesses between processors;

Bandwidth data is calculated using the following formula:

bandwidth = maximum number of requests × packet size/(total delay + maximum number of requests minus 1).

5. An apparatus for performance evaluation of multiprocessor interconnect accesses, the apparatus comprising:

A modeling module configured to model multiprocessor interconnection accesses sharing a Cache consistency protocol, acquire transmission path information of each access request and quantify the length of a path based on the transmission path information, the modeling module being further configured to acquire the path information of each type of memory access transmission process by modeling and analyzing the multiprocessor shared Cache consistency protocol, acquire the path of the access request according to a protocol jump process for a directory remote access request under the Cache consistency protocol, and quantify the length of each memory access transmission path in units of a single Flit transmission delay, wherein Flit refers to the time taken to transmit and process one Flit message data packet;

The acquisition module is configured to acquire delay data of each transmission stage in the transmission path according to process delay model data of each processing component in the transmission path, and is further configured to assign values to each stage of the memory access process based on actual delay data of a specific processor interface, process delay model data realized by a cooperative chip and cooperative chip frequency, and perform delay assumption according to module division of the chip and combining the operation frequency and the process delay model, wherein the assumption data is obtained through comprehensive calculation of a design and reference model;

6. The apparatus of claim 5, wherein the system performance comprises remote access to total delay data and bandwidth data.

7. The apparatus of claim 6, wherein the evaluation module is further configured to:

8. The apparatus of claim 6, wherein the evaluation module is further configured to:

Bandwidth data is calculated using the following formula:

9. A computer device, comprising:

At least one processor; and

A memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method of any one of claims 1-4.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any of claims 1-4.