CN113128143B - AI processor simulation method, AI processor simulation device, computer equipment and storage medium


Publication number
CN113128143B
Authority
CN
China
Prior art keywords
target
slice
processor
slices
benchmark test
Prior art date
Legal status
Active
Application number
CN202110669107.5A
Other languages
Chinese (zh)
Other versions
CN113128143A (en)
Inventor
魏斌 (Wei Bin)
Current Assignee
Beijing Suiyuan Intelligent Technology Co ltd
Original Assignee
Beijing Suiyuan Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Suiyuan Intelligent Technology Co ltd filed Critical Beijing Suiyuan Intelligent Technology Co ltd
Priority to CN202110669107.5A priority Critical patent/CN113128143B/en
Publication of CN113128143A publication Critical patent/CN113128143A/en
Application granted granted Critical
Publication of CN113128143B publication Critical patent/CN113128143B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • G06F30/3308Design verification, e.g. functional simulation or model checking using simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The invention discloses an AI processor simulation method, an AI processor simulation device, computer equipment and a storage medium. The method comprises the following steps: acquiring a benchmark test operation flow matched with the AI processor to be tested; segmenting the benchmark test operation flow into a plurality of target operation slices according to the operation execution sequence among the operations, wherein each target operation slice can run independently in one processor core; deploying the target operation slices in the processor cores respectively, and obtaining the local performance overhead corresponding to each target operation slice through parallel simulation on the processor cores; and backtracking, according to the local performance overhead corresponding to each target operation slice, to obtain the total performance overhead of the AI processor to be tested for the benchmark test operation flow. The technical scheme of the embodiment of the invention can reduce the time consumed in simulating the AI processor and improve the simulation efficiency.

Description

AI processor simulation method, AI processor simulation device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of simulation processing, in particular to a simulation method and device of an AI (Artificial Intelligence) processor, computer equipment and a storage medium.
Background
As chip manufacturing processes continue to advance, the computing power density that an Artificial Intelligence (AI) processor deploys per unit wafer area grows higher and higher, and the benchmark tests that the AI processor must carry to evaluate its performance grow more and more complex. Meanwhile, with the development of AI applications, the logic design density per unit area of the AI processor keeps increasing, and the time required for the AI processor to run a benchmark test keeps increasing as well.
The current methods for simulating the performance of an AI processor mainly include the following three: performance simulation using an X86 multi-core server (Simulation), emulation using a hardware emulation accelerator (Emulation), and testing with a real AI processor (Silicon).
When Simulation is used to evaluate the performance of the AI processor, the simulation time can reach dozens of days or even longer, so the simulation efficiency is low. When Emulation is used, the simulation speed is high, but the hardware cost is high, making the method impractical for many AI processor developers. A real AI processor is generally only suitable for evaluating a post-tape-out design; although the evaluation speed is high and the hardware cost is low, this method leaves no opportunity to fix problems found in the AI processor, so its practicability is poor.
Disclosure of Invention
Embodiments of the present invention provide a simulation method and apparatus for an AI processor, a computer device, and a storage medium, which can reduce simulation time consumption for the AI processor and improve simulation efficiency.
In a first aspect, an embodiment of the present invention provides a simulation method for an AI processor, where the method includes:
acquiring a benchmark test operation flow matched with the AI processor to be tested; the benchmark test operation flow comprises a plurality of operations, and a set operation execution sequence is arranged among the operations;
dividing the benchmark test operation flow into a plurality of target operation slices according to the operation execution sequence among the operations, wherein each target operation slice can independently run in a processor core;
deploying each target operation slice in each processor core respectively, and obtaining local performance overhead corresponding to each target operation slice through parallel simulation of each processor core;
and backtracking to obtain the total performance cost of the AI processor to be tested for the benchmark test operation flow according to the local performance cost respectively corresponding to each target operation slice.
In a second aspect, an embodiment of the present invention further provides an apparatus for simulating an AI processor, where the apparatus includes:
the benchmark test operation flow acquisition module is used for acquiring a benchmark test operation flow matched with the AI processor to be tested; the benchmark test operation flow comprises a plurality of operations, and a set operation execution sequence is arranged among the operations;
the target operation slice segmentation module is used for segmenting the benchmark test operation flow into a plurality of target operation slices according to the operation execution sequence among the operations, and each target operation slice can independently run in one processor core;
the local operation simulation module is used for respectively deploying each target operation slice in each processor core in the multi-core processor and obtaining local performance expenditure respectively corresponding to each target operation slice through parallel simulation of each processor core;
and the total performance overhead backtracking module is used for backtracking to obtain the total performance overhead of the AI processor to be tested for the benchmark test operation flow according to the local performance overhead respectively corresponding to each target operation slice.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement a simulation method of an AI processor according to any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program, when executed by a processor, implements the simulation method for the AI processor according to any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, the benchmark test operation flow matched with the AI processor to be tested is acquired, the benchmark test operation flow is segmented into a plurality of target operation slices according to the operation execution sequence among the operations, the target operation slices are then deployed in the processor cores respectively, the local performance overhead corresponding to each target operation slice is obtained through parallel simulation on the processor cores, and finally the total performance overhead of the AI processor to be tested for the benchmark test operation flow is obtained by backtracking according to the local performance overhead corresponding to each target operation slice, so that the time consumed in simulating the AI processor can be reduced and the simulation efficiency improved.
Drawings
Fig. 1 is a flowchart of a simulation method of an AI processor according to a first embodiment of the present invention;
FIG. 2a is a flowchart illustrating a simulation method of an AI processor according to a second embodiment of the invention;
FIG. 2b is a diagram illustrating a benchmark test operation flow in the second embodiment of the present invention;
FIG. 2c is a diagram of a parent operation slice according to a second embodiment of the present invention;
FIG. 2d is a diagram of another parent operation slice in accordance with a second embodiment of the present invention;
FIG. 3 is a flowchart of a simulation method of an AI processor according to a third embodiment of the invention;
fig. 4 is a structural diagram of an AI processor simulation apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device in the fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a simulation method for an AI processor according to an embodiment of the present invention, where the present embodiment is applicable to a case of simulating performance of the AI processor, and the method may be executed by a simulation apparatus for the AI processor, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in a computer device, and specifically includes the following steps:
and step 110, acquiring a benchmark test operation flow matched with the AI processor to be tested.
In this embodiment, the AI processor to be tested is an AI processor waiting for performance simulation, and the benchmark test operation flow is an operation flow for evaluating the processing performance of the AI processor to be tested, and the benchmark test operation flow includes a plurality of operations, and a set operation execution sequence is provided among the operations.
The benchmark test operation flow may be an operation flow that needs to be executed when the AI processor to be tested realizes a specific function. The function may be a data handling function, a data calculation function, a synchronization message function, or a register configuration function, etc.
In this embodiment, the developer may input the configuration information of the AI processor under test and the benchmark test operation flow matched with the AI processor under test to the computer device together.
And step 120, segmenting the benchmark test operation flow into a plurality of target operation slices according to the operation execution sequence among the operations.
In this embodiment, optionally, if the operation execution sequence among the operations is a serial or parallel execution sequence, the benchmark test operation flow may be divided evenly according to a preset value to obtain a plurality of target operation slices. The number of operations in each target operation slice equals the preset value, and each target operation slice can run independently in one processor core. Here, a processor core may be an X86 server core on the Simulation platform.
In a specific embodiment, assuming that the benchmark test operation flow includes 100 operations, and the execution order of the operations among each operation is a serial execution order, the benchmark test operation flow may be equally divided into 10 target operation slices, each target operation slice includes 10 operations, and each target operation slice can independently run in one processor core.
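As a non-authoritative illustration, the even split described above can be sketched in a few lines of Python; the function name `split_into_slices` and the operation labels are invented for the example, not taken from the patent.

```python
# Hypothetical sketch of the even split: a serial stream of operations is cut
# into fixed-size target operation slices according to a preset value.
def split_into_slices(operations, slice_size):
    """Evenly divide an operation stream into target operation slices."""
    return [operations[i:i + slice_size]
            for i in range(0, len(operations), slice_size)]

# 100 serially executed operations, preset value 10 -> 10 slices of 10 each.
stream = [f"op_{n}" for n in range(100)]
slices = split_into_slices(stream, 10)
```

Each resulting slice could then be handed to its own processor core, as step 130 below describes.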
And step 130, respectively deploying the target operation slices in the processor cores, and obtaining local performance overhead respectively corresponding to each target operation slice through parallel simulation of the processor cores.
In this embodiment, after the benchmark test operation stream is divided into a plurality of target operation slices, each target operation slice may be deployed in a different processor core. Each processor core can run the deployed target operation slices in parallel according to the configuration information of the to-be-tested AI processor to obtain a performance simulation result, namely local performance overhead, of the to-be-tested AI processor for the target operation slices. The local performance overhead may be a simulation result of time consumption, resource consumption, and the like after the AI processor to be tested runs the target operation slice.
In a specific embodiment, assuming a total of 10 target operation slices, the 10 target operation slices can be deployed in 10 different processor cores respectively.
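A minimal sketch of the per-core deployment, with workers standing in for processor cores. This is illustrative only: a real Simulation platform would run each slice on a separate X86 server core, and `simulate_slice` with its flat three-cycles-per-operation cost model is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_slice(op_slice):
    # Stand-in cost model: pretend every operation costs 3 time units.
    return sum(3 for _ in op_slice)

# Three target operation slices, each deployed to its own worker ("core").
slices = [["op_a", "op_b"], ["op_c"], ["op_d", "op_e", "op_f"]]
with ThreadPoolExecutor(max_workers=len(slices)) as pool:
    # Parallel simulation yields one local performance overhead per slice.
    local_overheads = list(pool.map(simulate_slice, slices))
```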
And 140, backtracking to obtain the total performance cost of the to-be-tested AI processor for the benchmark test operation flow according to the local performance cost respectively corresponding to each target operation slice.
In this embodiment, the total performance overhead of the AI processor to be tested for the benchmark test operation flow may be obtained by backtracking according to the local performance overhead corresponding to each target operation slice and the execution relationship between the operations in the benchmark test operation flow.
In a specific embodiment, if the execution relationship among the operations of the benchmark test operation flow is serial execution, the local performance overheads can be added together to obtain the total performance overhead of the AI processor to be tested for the benchmark test operation flow; if the execution relationship among the operations is parallel execution, the largest local performance overhead can be taken as the total performance overhead of the AI processor to be tested for the benchmark test operation flow.
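The backtracking rule just described (sum for serial slices, maximum for parallel slices) can be sketched as follows; the function name and the numeric overheads are hypothetical.

```python
def total_overhead(local_overheads, relation):
    """Backtrack the total overhead from the per-slice local overheads."""
    if relation == "serial":     # slices run one after another
        return sum(local_overheads)
    if relation == "parallel":   # slices run at the same time
        return max(local_overheads)
    raise ValueError(f"unknown execution relation: {relation}")

serial_total = total_overhead([120, 95, 110], "serial")      # slices chained
parallel_total = total_overhead([120, 95, 110], "parallel")  # slices concurrent
```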
In the prior art, when the performance of the AI processor is simulated, the benchmark test operation flow is usually deployed in a single processor core, and that one core runs the entire flow to obtain the performance simulation result of the AI processor for the benchmark test operation flow. In this embodiment, the benchmark test operation flow is instead segmented into a plurality of target operation slices, and a plurality of processor cores run the target operation slices in parallel to obtain the performance simulation result of the AI processor for the benchmark test operation flow, which shortens the overall simulation time.
According to the technical scheme of the embodiment of the invention, the benchmark test operation flow matched with the AI processor to be tested is acquired, the benchmark test operation flow is segmented into a plurality of target operation slices according to the operation execution sequence among the operations, the target operation slices are then deployed in the processor cores respectively, the local performance overhead corresponding to each target operation slice is obtained through parallel simulation on the processor cores, and finally the total performance overhead of the AI processor to be tested for the benchmark test operation flow is obtained by backtracking according to the local performance overhead corresponding to each target operation slice, so that the time consumed in simulating the AI processor can be reduced and the simulation efficiency improved.
On the basis of the above embodiment, acquiring a benchmark test operation flow matched with the AI processor to be tested includes: and selecting a benchmark test in the benchmark test set, and generating a benchmark test operation flow corresponding to the selected benchmark test according to the processor architecture of the AI processor to be tested.
In this embodiment, the benchmark set includes a plurality of benchmarks (e.g., data handling test, data calculation test, synchronous message test, register configuration test, etc.), and each benchmark includes a plurality of operations corresponding to the test. The research and development personnel can input the name or other identification information of the benchmark test to be selected into the computer equipment, and the computer equipment selects the corresponding benchmark test in the benchmark test set according to the information.
After the benchmark test is selected from the benchmark test set, optionally, the execution sequence of the operations may be arranged according to the relationships among the operations included in the benchmark test and the deployment of the isomorphic processing units and the heterogeneous processing units in the AI processor to be tested, and the arranged operations together form the benchmark test operation flow.
In a specific embodiment, it is assumed that the benchmark test includes three operations, which are an operation a, an operation B, and an operation C, respectively, a data inheritance relationship does not exist between the operation a and the operation B, and a data inheritance relationship exists between the operation C and both the operation a and the operation B. Assuming that the AI processor to be tested includes two isomorphic processing units and one heterogeneous processing unit, it may be considered that the AI processor to be tested may perform parallel processing on operation a and operation B by using the two isomorphic processing units, and then perform serial processing on operation C by using the heterogeneous processing units. That is, the execution sequence between the operation a and the operation B is parallel execution, the execution sequence between the operation C and the operations a and B is serial execution, and after the operations a, B and C are arranged according to the execution sequence, a benchmark test operation flow corresponding to a benchmark test is obtained.
In this embodiment, the benchmark test operation flow corresponding to the selected benchmark test is generated according to the processor architecture of the AI processor to be tested, so that the performance simulation result of the obtained AI processor to be tested for the benchmark test operation flow can be ensured to be closer to the actual performance result of the AI processor for the benchmark test operation flow, and thus the accuracy of the simulation result of the AI processor can be improved.
In one implementation manner of the embodiment of the present invention, segmenting a benchmark test operation stream into a plurality of target operation slices according to an operation execution sequence among operations includes:
and step 121, pre-dividing the benchmark test operation stream into at least two parent operation slices with serial execution relation according to the operation execution sequence among the operations.
In this step, optionally, operations that have no parallel execution relation with any other operation may be screened out of all the operations as target operations, and the benchmark test operation stream may then be segmented according to the positions of these target operations in the benchmark test operation stream, so as to obtain a plurality of parent operation slices having a serial execution relationship.
And step 122, determining whether each parent operation slice can be divided into a plurality of child operation slices with parallel or serial execution relations according to the operation execution sequence among the operations, if so, executing step 123, and if not, executing step 124.
In this step, it may be determined whether each parent operation slice includes a plurality of child operation slices according to the operation execution sequence among the operations, and if so, it is determined whether each parent operation slice can be split into a plurality of child operation slices having a parallel or serial execution relationship.
And step 123, dividing each parent operation slice into a plurality of matched child operation slices, and determining each child operation slice as a new parent operation slice.
In a specific embodiment, if a parent operation slice includes a plurality of serially executed child operation slices, the parent operation slice may be divided into a plurality of child operation slices having a serial execution relationship; if a parent operation slice includes multiple child operation slices that execute in parallel, the parent operation slice may be sliced into multiple child operation slices that have a parallel execution relationship.
In this step, after each parent operation slice is divided into a plurality of matching child operation slices and each child operation slice is determined as a new parent operation slice, the process returns to step 122 until all the parent operation slices are processed.
And step 124, determining the parent operation slice as a target operation slice.
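The refinement loop of steps 121 to 124 can be summarized as a worklist algorithm: split each parent slice if possible, treat the children as new parents, and otherwise record the parent as a target operation slice. The sketch below is a hedged illustration; its splitting predicate (halve any slice longer than two operations) is a toy stand-in for the patent's real criterion, which is the parallel/serial structure recorded in the dependency relationship table.

```python
def refine_slices(parent_slices, try_split):
    """Repeatedly split parent slices until none can be divided further.

    `try_split(slice)` returns a list of child slices, or None when the
    slice cannot be split (step 124: it becomes a target operation slice).
    """
    worklist = list(parent_slices)
    targets = []
    while worklist:
        parent = worklist.pop()
        children = try_split(parent)
        if children:                  # step 123: children become new parents
            worklist.extend(children)
        else:                         # step 124: parent is a target slice
            targets.append(parent)
    return targets

# Toy predicate: halve anything longer than two operations.
halve = lambda s: [s[:len(s) // 2], s[len(s) // 2:]] if len(s) > 2 else None
targets = refine_slices([list(range(8))], halve)
```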
Example two
This embodiment is a further refinement of the above embodiment, and the same or corresponding terms as those of the above embodiment are explained, and this embodiment is not described again. Fig. 2a is a flowchart of a simulation method of an AI processor according to a second embodiment, in this embodiment, the technical solution of this embodiment may be combined with one or more methods in the solutions of the foregoing embodiments, as shown in fig. 2a, the method provided in this embodiment may further include:
step 210, selecting a benchmark test from the benchmark test set, and generating a benchmark test operation flow corresponding to the selected benchmark test according to the processor architecture of the AI processor to be tested.
The benchmark test operation flow comprises a plurality of operations, and a set operation execution sequence is arranged among the operations.
Step 220, establishing a dependency relationship table according to the operation execution sequence among the operations, and recording the dependency relationship among the operations in a grading manner in the dependency relationship table; the execution of each operation in the subsequent stage depends on the completion of the execution of at least one operation in the previous stage.
In this embodiment, a dependency relationship table may be established according to the execution order of the operations and the execution relationships (for example, serial or parallel execution relationships) among them; the table records the dependency relationships among the operations. If an adjacent operation follows the current operation in the execution sequence and has a serial execution relationship with it (that is, the adjacent operation is executed only after the current operation completes), the adjacent operation is determined to depend on the current operation.
In an implementation manner of the embodiment of the present invention, establishing a dependency relationship table according to an operation execution sequence among operations includes: determining a waiting list corresponding to each operation according to the operation execution sequence among the operations, wherein all operations needing to be executed before the operation is executed are recorded in the waiting list; and establishing the dependency relationship table according to the waiting list of each operation.
The waiting list corresponding to each operation can be determined according to the execution order among the operations. After the waiting list corresponding to each operation is determined, the operation that comes first in the execution order is screened out of all the waiting lists as the current operation, and the level corresponding to the current operation is determined as the first level; the subsequent operations that depend on the current operation are then identified in all the waiting lists, and the level corresponding to those subsequent operations is determined as the second level; each subsequent operation is then taken in turn as the current operation, and the process is repeated until all the operations have been processed.
If a plurality of operations that come first in the execution order are screened out of all the waiting lists, those operations are confirmed to have a parallel relationship, and the level corresponding to each of them is the first level.
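The level assignment described above amounts to topological leveling: an operation's level is one more than the deepest operation in its waiting list. A sketch under that reading, using the benchmark of Fig. 2b; function and variable names are invented for the example.

```python
def build_dependency_levels(waiting_lists):
    """Map each operation to its hierarchy level in the dependency table.

    `waiting_lists[op]` lists all operations that must finish before `op`;
    an operation with an empty waiting list sits at the first level.
    """
    levels = {}

    def level_of(op):
        if op not in levels:
            prereqs = waiting_lists.get(op, [])
            levels[op] = 1 if not prereqs else 1 + max(map(level_of, prereqs))
        return levels[op]

    for op in waiting_lists:
        level_of(op)
    return levels

# Fig. 2b: A and B wait on nothing, C waits on A and B, D/E/F each wait on C.
waiting = {"A": [], "B": [], "C": ["A", "B"],
           "D": ["C"], "E": ["C"], "F": ["C"]}
levels = build_dependency_levels(waiting)
```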
And step 230, inquiring the dependency relationship table, and pre-dividing the benchmark test operation stream into a plurality of parent operation slices with serial execution relationships.
In this embodiment, the benchmark test operation stream may be pre-divided into a plurality of parent operation slices having serial execution relations according to the operation execution order among the operations recorded in the dependency relation table and the dependency relation among the operations.
In a specific embodiment, fig. 2b is a schematic diagram of a benchmark test operation flow in the embodiment, and as shown in fig. 2b, it is assumed that the benchmark test operation flow includes six operations, which are respectively: operation A, operation B, operation C, operation D, operation E, and operation F. The operation C has a dependency relationship with the operation A and the operation B, and the operation D, the operation E, the operation F and the operation C have a dependency relationship. In the dependency relationship table, the hierarchy corresponding to the operation a and the operation B is a first hierarchy, the hierarchy corresponding to the operation C is a second hierarchy, and the hierarchy corresponding to the operation D, the operation E and the operation F is a third hierarchy.
In an implementation manner of the embodiment of the present invention, querying the dependency relationship table to pre-divide the benchmark test operation stream into a plurality of parent operation slices having serial execution relationships includes: querying at least one hierarchy comprising only a single operation in the dependency table; and segmenting the benchmark test operation flow according to the position of each single operation in the benchmark test operation flow to obtain a plurality of parent operation slices with serial execution relation.
In a particular embodiment, the flow of benchmarking operations may be sliced according to the location of a single operation in the flow of benchmarking operations, where the single operation and operations preceding the single operation may be sliced into one parent operation slice and operations following the single operation may be sliced into another parent operation slice.
In a specific embodiment, taking the benchmark test operation flow in fig. 2b as an example, if a single operation (i.e., operation C) is included in the second hierarchy, the benchmark test operation flow may be sliced according to the position of operation C in the benchmark test operation flow. Operation C, operation a, and operation B are split into one parent operation slice, and operation D, operation E, and operation F are split into another parent operation slice.
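Assuming the level table of Fig. 2b, the pre-split at single-operation hierarchies can be sketched as follows: the stream is cut after every level that holds exactly one operation, so the single operation and everything before it form one parent slice. The function name and the grouping-by-level representation are illustrative choices, not from the patent.

```python
def presplit_at_single_ops(levels):
    """Cut the operation stream after every hierarchy level with one operation."""
    # Group operations by hierarchy level, in level order.
    by_level = {}
    for op, lvl in levels.items():
        by_level.setdefault(lvl, []).append(op)

    parent_slices, current = [], []
    for lvl in sorted(by_level):
        current.extend(sorted(by_level[lvl]))
        if len(by_level[lvl]) == 1:      # single operation: serial cut point
            parent_slices.append(current)
            current = []
    if current:
        parent_slices.append(current)
    return parent_slices

# Fig. 2b: A, B at level 1; C alone at level 2; D, E, F at level 3.
levels = {"A": 1, "B": 1, "C": 2, "D": 3, "E": 3, "F": 3}
parents = presplit_at_single_ops(levels)
```

Operation C closes the first parent slice together with A and B, and D, E, F form the second, matching the example above.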
Step 240, querying the dependency relationship table, and determining whether each parent operation slice can be split into a plurality of child operation slices having a parallel or serial execution relationship, if yes, performing step 250, and if not, performing step 260.
In the embodiment of the present invention, querying the dependency relationship table to determine whether each parent operation slice can be split into a plurality of child operation slices having parallel or serial execution relationships includes: acquiring each target operation included in the currently processed target parent operation slice, and acquiring, from the dependency relationship table, the target hierarchy containing each target operation; counting the number of target hierarchies and/or the number of target operations included in each target hierarchy; and determining, according to the number of hierarchies and/or the number of operations, whether the target parent operation slice can be split into a plurality of child operation slices having parallel or serial execution relationships.
Specifically, if there are a plurality of target hierarchies, or at least one target hierarchy includes a plurality of target operations, it can be further determined whether the target parent operation slice can be split into a plurality of child operation slices having parallel or serial execution relationships.
In one implementation of the embodiment of the present invention, determining whether a target parent operation slice can be split into a plurality of child operation slices having parallel or serial execution relationships according to the number of hierarchies and/or the number of operations includes: if there is only one target hierarchy and that unique target hierarchy includes a plurality of target operations, determining that the target parent operation slice can be split into a plurality of child operation slices having a parallel execution relationship.
Accordingly, slicing the target parent operation slice into a plurality of matching child operation slices includes: and respectively dividing each target operation in the target parent operation slice into a sub operation slice corresponding to the target parent operation slice.
Taking the benchmark test operation flow in fig. 2b as an example, assuming that only one target hierarchy is included in the currently processed target parent operation slice, where the target hierarchy includes three target operations, namely operation D, operation E, and operation F, it is determined that the target parent operation slice can be split into three child operation slices having a parallel execution relationship.
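As an illustration only (not the patent's reference implementation), the hierarchy-count and operation-count decision described above can be sketched as follows, assuming a hypothetical representation in which a parent operation slice is a mapping from hierarchy level to the list of operations at that level:

```python
def can_split(levels):
    """Decide how a parent operation slice can be split.

    `levels` maps a hierarchy level (int) to the list of target
    operations at that level. Returns "parallel", "serial", or None
    (the slice cannot be split further).
    """
    if len(levels) == 1:
        (ops,) = levels.values()
        # A single hierarchy with several operations -> parallel children.
        return "parallel" if len(ops) > 1 else None
    if any(len(ops) == 1 for ops in levels.values()):
        # Some hierarchy holds a single operation -> serial cut point.
        return "serial"
    # Multiple hierarchies, each with several operations -> parallel.
    return "parallel"

# Example from fig. 2b: one hierarchy containing operations D, E, F.
print(can_split({0: ["D", "E", "F"]}))   # parallel
```

The three branches mirror the three implementations described in this embodiment: a single multi-operation hierarchy yields parallel children, a single-operation hierarchy among several yields serial cut points, and multiple multi-operation hierarchies yield parallel children.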
In another implementation of the embodiment of the present invention, determining, according to the number of hierarchies and/or the number of operations, whether the target parent operation slice can be split into a plurality of child operation slices having a parallel or serial execution relationship includes: if it is determined that there are multiple target hierarchies and at least one target hierarchy includes only a single target operation, determining that the target parent operation slice can be split into a plurality of child operation slices having a serial execution relationship.
Accordingly, slicing the target parent operation slice into a plurality of matching child operation slices includes: and segmenting the target parent operation slice to obtain a plurality of child operation slices according to the position of each unique target operation in the target parent operation slice.
In a specific embodiment, fig. 2c is a schematic diagram of a parent operation slice in this embodiment. As shown in fig. 2c, assume that the target parent operation slice includes three target hierarchies: the first target hierarchy includes operations D, E, and F; the second target hierarchy includes a single target operation (operation G); and the third target hierarchy includes operations H and I. In this case, the target parent operation slice may be split according to the position of operation G in the target parent operation slice, resulting in two child operation slices having a serial execution relationship: operations D, E, F, and G are split into one child operation slice, and operations H and I are split into another.
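A minimal sketch of this serial split, under the assumption that the parent slice is represented as an ordered list of per-hierarchy operation lists (a hypothetical representation, not the patent's):

```python
def split_serial(levels):
    """Split a parent slice into serial child slices.

    `levels` is an ordered list of operation lists, one per hierarchy.
    A cut is made after each hierarchy that contains a single operation,
    so every child slice ends at such a synchronization point.
    """
    children, current = [], []
    for ops in levels:
        current.extend(ops)
        if len(ops) == 1:          # unique operation -> cut point
            children.append(current)
            current = []
    if current:                    # trailing operations form the last child
        children.append(current)
    return children

# Example from fig. 2c: hierarchies [D,E,F], [G], [H,I].
print(split_serial([["D", "E", "F"], ["G"], ["H", "I"]]))
# -> [['D', 'E', 'F', 'G'], ['H', 'I']]
```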
In another implementation of the embodiment of the present invention, determining, according to the number of hierarchies and/or the number of operations, whether the target parent operation slice can be split into a plurality of child operation slices having a parallel or serial execution relationship includes: if it is determined that there are multiple target hierarchies and each target hierarchy includes a plurality of target operations, determining that the target parent operation slice can be split into a plurality of child operation slices having a parallel execution relationship.
Accordingly, splitting the target parent operation slice into a plurality of matching child operation slices includes: acquiring each target operation in the highest-level target hierarchy as a starting-point operation; and, according to the waiting lists of the target operations, acquiring, for each starting-point operation, the target operations in the lower-level target hierarchies that have a direct or indirect dependency on it, so as to form a plurality of child operation slices.
In a specific embodiment, fig. 2D is a schematic diagram of a parent operation slice in this embodiment, and as shown in fig. 2D, it is assumed that the target parent operation slice includes two target hierarchies, a target operation included in a first target hierarchy is operation D and operation E, and a target operation included in a second target hierarchy is operation F, operation G and operation H. In this case, operation D may be used as a starting point, and operation F having a dependency relationship with operation D is obtained to form a first sub-operation slice; then, operation E is used as a starting point to obtain operation G and operation H which have a dependency relationship with operation E, and a second sub-operation slice is formed.
The first sub-operation slice includes operation D and operation F, and the second sub-operation slice includes operation E, operation G, and operation H. The first sub-operation slice has a parallel execution relationship with the second sub-operation slice.
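The reachability-based grouping described above can be sketched as follows, assuming (hypothetically) that each operation's waiting list is a mapping from an operation to the operations it waits for; inverting it gives a successor map that is traversed from each starting-point operation:

```python
from collections import defaultdict

def split_parallel(top_ops, wait_lists):
    """Group operations into parallel child slices.

    `top_ops` are the operations in the highest-level hierarchy; each is
    used as a starting point. `wait_lists` maps an operation to the
    operations it waits for. Each lower-level operation is assigned to
    the slice of the starting point it depends on, directly or
    indirectly.
    """
    # Invert the waiting lists into a successor map.
    successors = defaultdict(list)
    for op, waits in wait_lists.items():
        for w in waits:
            successors[w].append(op)

    slices = []
    for start in top_ops:
        slice_ops, stack = [start], [start]
        while stack:                      # depth-first reachability
            for nxt in successors[stack.pop()]:
                if nxt not in slice_ops:
                    slice_ops.append(nxt)
                    stack.append(nxt)
        slices.append(slice_ops)
    return slices

# Example from fig. 2d: F waits on D; G and H wait on E.
print(split_parallel(["D", "E"], {"F": ["D"], "G": ["E"], "H": ["E"]}))
# -> [['D', 'F'], ['E', 'G', 'H']]
```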
In another implementation of the embodiment of the present invention, determining, according to the number of hierarchies and/or the number of operations, whether the target parent operation slice can be split into a plurality of child operation slices having a parallel or serial execution relationship includes: if it is determined that there is only one target hierarchy and that hierarchy includes only a single target operation, determining that the target parent operation slice cannot be split into a plurality of child operation slices having a parallel or serial execution relationship.
That is, if the target parent operation slice includes only one target hierarchy and that hierarchy includes only one target operation, it is determined that the target parent operation slice cannot be split into child operation slices.
Step 250, splitting each parent operation slice into a plurality of matched child operation slices, and determining each child operation slice as a new parent operation slice.
After this step, execution returns to step 240 until processing of all parent operation slices is completed.
Step 260, determining the parent operation slice as a target operation slice.
Step 270, deploying the target operation slices in the processor cores respectively, and obtaining the local performance overhead corresponding to each target operation slice through parallel simulation of the processor cores.
Step 280, backtracking to obtain the total performance overhead of the AI processor to be tested for the benchmark test operation flow according to the local performance overhead corresponding to each target operation slice.
In the technical solution of the embodiment of the present invention, a benchmark test is selected from a benchmark test set, and a benchmark test operation flow corresponding to the selected benchmark test is generated according to the processor architecture of the AI processor to be tested. A dependency relationship table is established according to the operation execution sequence among the operations, and the benchmark test operation flow is pre-split, by querying the dependency relationship table, into a plurality of parent operation slices having a serial execution relationship. The dependency relationship table is then queried to determine whether each parent operation slice can be split into a plurality of child operation slices having a parallel or serial execution relationship; if yes, each parent operation slice is split into a plurality of matched child operation slices and each child operation slice is determined as a new parent operation slice; if not, the parent operation slice is determined as a target operation slice. Each target operation slice is then deployed in a processor core, and the local performance overhead corresponding to each target operation slice is obtained through parallel simulation of the processor cores. Finally, the total performance overhead of the AI processor to be tested for the benchmark test operation flow is obtained by backtracking according to these local performance overheads. These technical means reduce the simulation time of the AI processor and improve simulation efficiency.
EXAMPLE III
This embodiment is a further refinement of the above embodiments; terms that are the same as or correspond to those of the above embodiments are explained there and are not described again here. Fig. 3 is a flowchart of a simulation method of an AI processor provided in the third embodiment. The technical solution of this embodiment may be combined with one or more of the solutions of the foregoing embodiments. As shown in fig. 3, the method provided in this embodiment may further include:
step 310, selecting a benchmark test from the benchmark test set, and generating a benchmark test operation flow corresponding to the selected benchmark test according to the processor architecture of the AI processor to be tested.
Step 320, splitting the benchmark test operation flow into a plurality of target operation slices according to the operation execution sequence among the operations.
Wherein each target operation slice can independently run in one processor core.
In this embodiment, after splitting the benchmark test operation flow into a plurality of target operation slices according to the operation execution sequence among the operations, the method further includes: recording, in the process of slicing the benchmark test operation flow, the parent-child slice relationships and the serial-parallel execution relationships among the target operation slices obtained by splitting.
Step 330, deploying the target operation slices in the processor cores respectively, and obtaining the local performance overhead corresponding to each target operation slice through parallel simulation of the processor cores.
Step 340, backtracking to obtain the total performance overhead of the AI processor to be tested for the benchmark test operation flow according to the local performance overhead corresponding to each target operation slice and the pre-recorded parent-child slice relationships and serial-parallel execution relationships among the slices.
In this embodiment, the total performance overhead of the AI processor to be tested for the benchmark test operation flow may be obtained by backtracking according to the local performance overhead corresponding to each target operation slice, and the parent-child slice relationship and the serial-parallel execution relationship between the target operation slices.
In an implementation manner of the embodiment of the present invention, the contribution to the total performance overhead of the child operation slices having a parallel execution relationship that are split from the same parent operation slice is the maximum of their local performance overheads; the contribution to the total performance overhead of the child operation slices having a serial execution relationship that are split from the same parent operation slice is the sum of their local performance overheads.
In a specific embodiment, assume that the benchmark stream of operations is sliced into four target operation slices, slice a, slice B, slice C, and slice D, respectively. The slice A and the slice B are obtained by slicing the same father operation slice, and the slice A and the slice B have a parallel execution relation; the slice C and the slice D are obtained by cutting the same father operation slice, and the slice C and the slice D have a serial execution relation; slice a and slice B have serial execution relationships with slice C, respectively. In this case, the local performance cost with a large value may be selected from the local performance costs corresponding to the slice a and the slice B as the target local performance cost, and then the target local performance cost is added to the local performance cost corresponding to the slice C and the local performance cost corresponding to the slice D to obtain the total performance cost.
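The max-for-parallel, sum-for-serial backtracking rule can be sketched over a hypothetical slice tree (the overhead values below are invented for illustration; the patent does not give concrete numbers):

```python
def total_overhead(node):
    """Backtrack the total performance overhead from a slice tree.

    `node` is either ("leaf", overhead), ("serial", children), or
    ("parallel", children): serial children contribute the sum of their
    overheads, parallel children contribute the maximum.
    """
    kind, payload = node
    if kind == "leaf":
        return payload
    child_costs = [total_overhead(c) for c in payload]
    return max(child_costs) if kind == "parallel" else sum(child_costs)

# Worked example mirroring the text: slices A and B run in parallel,
# then slices C and D run serially (hypothetical local overheads).
tree = ("serial", [
    ("parallel", [("leaf", 5), ("leaf", 8)]),   # slices A, B
    ("leaf", 3),                                 # slice C
    ("leaf", 4),                                 # slice D
])
print(total_overhead(tree))   # max(5, 8) + 3 + 4 = 15
```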
Step 350, evaluating the performance of the processor architecture of the AI processor to be tested according to the total performance overhead and the reference performance overhead matched with the benchmark test.
In this embodiment, each benchmark test is associated in advance with a reference performance overhead. The difference between the total performance overhead and the reference performance overhead may be calculated, and the performance of the processor architecture of the AI processor to be tested evaluated according to the difference.
In a specific embodiment, if the difference value is greater than a preset threshold, the processor architecture performance of the AI processor to be tested may be considered to be poor, and if the difference value is less than or equal to the preset threshold, the processor architecture performance of the AI processor to be tested may be considered to be good.
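A minimal sketch of this threshold comparison, assuming a signed difference and invented example values (the patent fixes neither the threshold nor the overheads):

```python
def evaluate_architecture(total, reference, threshold):
    """Judge the processor architecture from the overhead difference.

    Returns "poor" when (total - reference) exceeds the preset
    threshold, otherwise "good", as described in the text.
    """
    return "poor" if (total - reference) > threshold else "good"

print(evaluate_architecture(total=120, reference=100, threshold=15))  # poor
print(evaluate_architecture(total=110, reference=100, threshold=15))  # good
```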
In this embodiment, after the benchmark test operation flow is split into a plurality of target operation slices, since each target operation slice is minimal and cannot be split any further, the simulation time of the AI processor can be greatly shortened when the deployed target operation slices are simulated in parallel by the processor cores.
In a specific embodiment, the benchmark test operation flow corresponding to a micro benchmark test includes 16923 operations and can be split in the above manner into 2103 target operation slices, where the number of operations in each target operation slice falls within the interval [1, 30]. The simulation time of the simulation platform for one target operation slice can be kept within 3 days, and, given sufficient server resources, the performance simulation result of the AI processor can be obtained within 3 to 4 days when the deployed target operation slices are simulated in parallel by the processor cores.
In the technical solution of the embodiment of the present invention, a benchmark test is selected from a benchmark test set, and a benchmark test operation flow corresponding to the selected benchmark test is generated according to the processor architecture of the AI processor to be tested. The benchmark test operation flow is split into a plurality of target operation slices according to the operation execution sequence among the operations, and the target operation slices are deployed in the processor cores respectively. The local performance overhead corresponding to each target operation slice is obtained through parallel simulation of the processor cores, and the total performance overhead of the AI processor to be tested for the benchmark test operation flow is obtained by backtracking according to these local performance overheads and the pre-recorded parent-child slice relationships and serial-parallel execution relationships among the slices. Finally, the performance of the processor architecture of the AI processor to be tested is evaluated according to the total performance overhead and the reference performance overhead matched with the benchmark test. These technical means reduce the simulation time of the AI processor and improve simulation efficiency.
The technical solution of the embodiment of the present invention can also achieve the following technical effects. According to the processor architecture of the AI processor, the benchmark test operation flow running on the processor is split into independent operation slices that are executed in parallel or in series. By collecting statistics on the hardware resources occupied by each independent operation slice (such as memory, data transfer paths, and execution units) and on its working mode (such as processor configuration information and serial-parallel execution relationships), the scenarios that occur with high probability in the whole benchmark test operation flow can be identified, so that improvements to the AI processor can be targeted and the performance of the operation slices that occur with high probability during operation can be optimized.
In addition, the data storage capacity, computing capacity, interconnection capacity, and the like of the processor architecture of the AI processor can be balanced with the aid of the classification information (such as data transfer, computation, and synchronization) corresponding to the different operation slices.
The performance overhead of each parallel operation slice and the execution conflicts among parallel operation slices can also be analyzed, and the processor architecture optimized according to the analysis results, ensuring that the performance overheads of the parallel operation slices are balanced as far as possible and preventing some processor cores from idling during parallel processing. Secondly, by optimizing the processor architecture, the serial operation flows among the child operation slices within a parallel operation slice can be reduced as much as possible, making the performance overhead more stable and controllable during parallel execution. In addition, if memory or data channels are unavoidably shared among operation slices, hardware resources can be increased or the architecture deployment adjusted as appropriate to decouple the conflicts.
Example four
Fig. 4 is a structural diagram of a simulation apparatus of an AI processor according to a fourth embodiment of the present invention, including: a benchmark test operation flow acquisition module 410, a target operation slice slicing module 420, a local operation simulation module 430, and an overall performance overhead backtracking module 440.
The benchmark test operation flow acquiring module 410 is configured to acquire a benchmark test operation flow matched with the AI processor to be tested; the benchmark test operation flow comprises a plurality of operations, and a set operation execution sequence is arranged among the operations;
a target operation slice splitting module 420, configured to split the benchmark test operation stream into a plurality of target operation slices according to an operation execution sequence among the operations, where each target operation slice can independently run in one processor core;
the local operation simulation module 430 is configured to deploy each target operation slice in each processor core of the multi-core processor, and obtain a local performance overhead corresponding to each target operation slice through parallel simulation of each processor core;
the total performance overhead backtracking module 440 is configured to backtrack the total performance overhead of the AI processor to be tested for the benchmark test operation flow according to the local performance overhead respectively corresponding to each target operation slice.
According to the technical scheme of the embodiment of the invention, the reference test operation flow matched with the AI processor to be tested is obtained, the reference test operation flow is divided into a plurality of target operation slices according to the operation execution sequence among the operations, then the target operation slices are respectively deployed in the processor cores, the local performance overhead respectively corresponding to each target operation slice is obtained through parallel simulation of the processor cores, and finally the total performance overhead of the AI processor to be tested for the reference test operation flow is obtained through backtracking according to the local performance overhead respectively corresponding to each target operation slice, so that the simulation time consumption of the AI processor can be reduced, and the simulation efficiency is improved.
On the basis of the foregoing embodiments, the benchmark operation flow obtaining module 410 may include:
and the operation flow generation unit is used for selecting the benchmark test in the benchmark test set and generating the benchmark test operation flow corresponding to the selected benchmark test according to the processor architecture of the AI processor to be tested.
The target operational slice segmentation module 420 may include:
determining the segmentation of the parent operation slices, which is used for pre-segmenting the benchmark test operation stream into at least two parent operation slices with serial execution relation according to the operation execution sequence among the operations;
the segmentation capability determining unit is used for determining whether each father operation slice can be segmented into a plurality of child operation slices with parallel or serial execution relations according to the operation execution sequence among the operations;
and the parent operation slice segmentation unit is used for segmenting each parent operation slice into a plurality of matched child operation slices, and determining each child operation slice as a new parent operation slice until all the parent operation slices are processed.
A target operation slice determination unit for determining the parent operation slice as a target operation slice.
The relation table establishing unit is used for establishing a dependency relation table according to the operation execution sequence among the operations, and the dependency relation among the operations is recorded in the dependency relation table in a grading manner; the execution of each operation in the next level depends on the execution completion of at least one operation in the previous level;
the first relation table query unit is used for querying the dependency relation table and pre-dividing the benchmark test operation stream into a plurality of parent operation slices with serial execution relations;
the second relation table query unit is used for querying the dependency relation table and determining whether each father operation slice can be divided into a plurality of child operation slices with parallel or serial execution relations;
the waiting list determining unit is used for determining a waiting list corresponding to each operation according to the operation execution sequence among the operations, and all the operations needing to be executed before the operation is executed are recorded in the waiting list;
the dependency relationship table establishing unit is used for establishing the dependency relationship table according to the waiting list of each operation;
the single operation query unit is used for querying at least one hierarchy which only comprises a single operation in the dependency relationship table;
the single operation processing unit is used for segmenting the benchmark test operation flow according to the position of each single operation in the benchmark test operation flow to obtain a plurality of parent operation slices with serial execution relation;
the target grading acquisition unit is used for acquiring each target operation included in the currently processed target parent operation slice and acquiring the target grading including each target operation in the dependency relationship table;
a classification number counting unit for counting the classification number of each target classification and/or the operation number of target operations included in each target classification;
the quantity processing unit is used for determining whether the target parent operation slice can be divided into a plurality of sub operation slices with parallel or serial execution relations according to the grading quantity and/or the operation quantity;
a first parallel sub-operation slice determination unit configured to determine that a target parent operation slice can be split into a plurality of sub-operation slices having parallel execution relationships if it is determined that the number of hierarchies is unique and the unique target hierarchy includes a plurality of target operations;
the first target parent operation slice segmentation unit is used for segmenting each target operation in the target parent operation slice into a sub operation slice corresponding to the target parent operation slice;
a serial sub-operation slice determining unit, configured to determine that a target parent operation slice can be sliced into a plurality of sub-operation slices having serial execution relationships if it is determined that the number of hierarchies is multiple and only a unique target operation is included in at least one target hierarchy;
the second target parent operation slice segmentation unit is used for segmenting the target parent operation slice to obtain a plurality of child operation slices according to the position of each unique target operation in the target parent operation slice;
a second parallel suboperation slice determining unit configured to determine that the target parent operation slice can be split into a plurality of suboperation slices having parallel execution relations if the number of hierarchies is determined to be a plurality and each target hierarchy includes a plurality of target operations therein;
a starting point operation acquisition unit configured to acquire each target operation in a target hierarchy at a highest hierarchy level as a starting point operation;
the target operation acquisition unit is used for respectively acquiring target operations with direct or indirect dependency relation with each starting point operation in the target operations in the lower-level target hierarchy according to the waiting list of the target operations to form a plurality of sub-operation slices;
a target parent operation slice segmentation capability determination unit, configured to determine that a target parent operation slice cannot be segmented into a plurality of child operation slices having a parallel or serial execution relationship if it is determined that the number of hierarchies is unique and the unique target hierarchy includes only a unique target operation;
and the relation recording unit is used for recording the parent-child slice relation and the serial-parallel execution relation among the slices obtained by segmentation in the process of slicing the benchmark test operation stream.
The overall performance overhead backtracking module 440 may include:
the overall performance overhead calculation unit is used for backtracking to obtain the overall performance overhead of the AI processor to be tested for the benchmark test operation flow according to the local performance overhead corresponding to each target operation slice, and the pre-recorded parent-child slice relationship and the serial-parallel execution relationship among the slices;
the method comprises the following steps that a parent operation slice is divided into sub operation slices, wherein the contribution of each sub operation slice with a parallel execution relation to the total performance overhead is the maximum value in the local performance overhead of each sub operation slice;
the contribution of each sub-operation slice with serial execution relation to the total performance overhead, which is obtained by cutting the same father operation slice, is the sum of the local performance overhead of each sub-operation slice;
and the performance evaluation unit is used for evaluating the performance of the processor framework of the AI processor to be tested according to the total performance overhead and the reference performance overhead matched with the benchmark test.
The simulation device for an AI processor provided in the embodiments of the present invention can execute the simulation method for an AI processor provided in any embodiment of the present invention, and has functional modules and advantageous effects corresponding to the execution method.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a computer apparatus according to a fifth embodiment of the present invention, as shown in fig. 5, the computer apparatus includes a processor 510, a memory 520, an input device 530, and an output device 540; the number of the processors 510 in the computer device may be one or more, and one processor 510 is taken as an example in fig. 5; the processor 510, the memory 520, the input device 530 and the output device 540 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 5. The memory 520 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to a simulation method of an AI processor in the embodiments of the present invention (for example, the benchmark operation flow acquisition module 410, the target operation slice segmentation module 420, the local operation simulation module 430, and the overall performance cost backtracking module 440 in a simulation apparatus of an AI processor). The processor 510 executes various functional applications of the computer device and data processing by executing software programs, instructions, and modules stored in the memory 520, that is, implements one of the simulation methods of the AI processor described above. That is, the program when executed by the processor implements:
acquiring a benchmark test operation flow matched with the AI processor to be tested; the benchmark test operation flow comprises a plurality of operations, and a set operation execution sequence is arranged among the operations;
dividing the benchmark test operation flow into a plurality of target operation slices according to the operation execution sequence among the operations, wherein each target operation slice can independently run in a processor core;
deploying each target operation slice in each processor core respectively, and obtaining local performance overhead corresponding to each target operation slice through parallel simulation of each processor core;
and backtracking to obtain the total performance cost of the AI processor to be tested for the benchmark test operation flow according to the local performance cost respectively corresponding to each target operation slice.
The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 520 may further include memory located remotely from processor 510, which may be connected to a computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus, and may include a keyboard and a mouse, etc. The output device 540 may include a display device such as a display screen.
EXAMPLE six
The sixth embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method according to any embodiment of the present invention. Of course, the embodiment of the present invention provides a computer-readable storage medium, which can perform related operations in the simulation method of the AI processor according to any embodiment of the present invention. That is, the program when executed by the processor implements:
acquiring a benchmark test operation flow matched with the AI processor under test, wherein the benchmark test operation flow comprises a plurality of operations and a set execution sequence is defined among the operations;
dividing the benchmark test operation flow into a plurality of target operation slices according to the execution sequence among the operations, wherein each target operation slice can run independently in one processor core;
deploying the target operation slices in respective processor cores, and obtaining, through parallel simulation by the processor cores, the local performance overhead corresponding to each target operation slice; and
backtracking, according to the local performance overhead corresponding to each target operation slice, to obtain the total performance overhead of the AI processor under test for the benchmark test operation flow.
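The four steps above can be condensed into a short sketch. Everything here is illustrative rather than the patent's implementation: `simulate_slice` is a hypothetical per-core cost model, worker threads stand in for processor cores, and the max/sum combination anticipates the backtracking rule stated later in the claims.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_slice(op_slice):
    # Hypothetical per-core cost model: a slice's local overhead is the
    # sum of per-operation cycle estimates.
    return sum(cycles for _, cycles in op_slice)

def total_overhead(slices, parallel=True):
    # Step 3: every slice is dispatched to its own worker, standing in
    # for one processor core of the multi-core simulator.
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        local = list(pool.map(simulate_slice, slices))
    # Step 4: backtrack to the total overhead; slices that run in
    # parallel overlap in time (max), serial slices accumulate (sum).
    return max(local) if parallel else sum(local)

# A toy benchmark flow already cut into three slices of (op, cycles) pairs.
slices = [[("conv", 40), ("relu", 5)], [("matmul", 60)], [("pool", 10)]]
```

With these toy numbers, three parallel slices cost as much as the slowest one, while a serial arrangement costs their sum.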
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with necessary general-purpose hardware, and certainly also by hardware alone, although the former is the better implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above embodiment of the AI processor simulation apparatus, the included units and modules are divided only according to functional logic; the division is not limited thereto as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and do not limit the protection scope of the present invention.
It should be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions may be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the appended claims.

Claims (14)

1. A simulation method of an AI processor, comprising:
acquiring a benchmark test operation flow matched with the AI processor under test, wherein the benchmark test operation flow comprises a plurality of operations and a set execution sequence is defined among the operations;
establishing a dependency relationship table according to the execution sequence among the operations, wherein dependency relationships among the operations are recorded in the dependency relationship table hierarchically by level, and the execution of each operation in a next level depends on the completed execution of at least one operation in a previous level; querying the dependency relationship table, and pre-dividing the benchmark test operation flow into a plurality of parent operation slices having a serial execution relationship;
acquiring each target operation included in a currently processed target parent operation slice, and acquiring, from the dependency relationship table, the target level containing each target operation; counting the number of the target levels and/or the number of target operations included in each target level; and determining, according to the number of levels and/or the number of operations, whether the target parent operation slice can be divided into a plurality of child operation slices having a parallel or serial execution relationship;
if so, dividing the target parent operation slice into a plurality of matched child operation slices and taking each child operation slice as a new parent operation slice; otherwise, determining the target parent operation slice as a target operation slice; until all parent operation slices are processed, wherein each target operation slice can run independently in one processor core;
deploying the target operation slices in respective processor cores, and obtaining, through parallel simulation by the processor cores, the local performance overhead corresponding to each target operation slice; and
backtracking, according to the local performance overhead corresponding to each target operation slice, to obtain the total performance overhead of the AI processor under test for the benchmark test operation flow.
2. The method of claim 1, wherein acquiring the benchmark test operation flow matched with the AI processor under test comprises:
selecting a benchmark test from a benchmark test set, and generating a benchmark test operation flow corresponding to the selected benchmark test according to the processor architecture of the AI processor under test.
3. The method of claim 1, wherein establishing the dependency relationship table according to the execution sequence among the operations comprises:
determining a waiting list corresponding to each operation according to the execution sequence among the operations, wherein the waiting list records all operations that need to finish before the operation can be executed; and
establishing the dependency relationship table according to the waiting list of each operation.
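The construction in claim 3 can be sketched as a level assignment over the wait lists. The function name and dict shapes below are hypothetical; the claim fixes only the input (a wait list per operation) and the output (operations grouped by dependency level).

```python
def build_dependency_table(wait_lists):
    """wait_lists: {op: set of ops that must finish before op runs}.
    Returns {level: set of ops}; an op sits one level below its
    deepest prerequisite, and ops with no prerequisites sit at level 0."""
    level_of = {}
    remaining = dict(wait_lists)
    while remaining:
        # An op is ready once every op it waits on has been assigned a level.
        ready = [op for op, waits in remaining.items()
                 if all(w in level_of for w in waits)]
        if not ready:
            raise ValueError("cyclic dependency in the operation flow")
        for op in ready:
            waits = remaining.pop(op)
            level_of[op] = 1 + max((level_of[w] for w in waits), default=-1)
    table = {}
    for op, lvl in level_of.items():
        table.setdefault(lvl, set()).add(op)
    return table
```

For a diamond-shaped flow (B and C both wait on A; D waits on B and C), this yields three levels, with B and C sharing the middle one.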
4. The method of claim 3, wherein querying the dependency relationship table and pre-dividing the benchmark test operation flow into a plurality of parent operation slices having a serial execution relationship comprises:
querying the dependency relationship table for each level that comprises only a single operation; and
cutting the benchmark test operation flow at the position of each such single operation to obtain a plurality of parent operation slices having a serial execution relationship.
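A minimal sketch of this pre-division, assuming the flow is a list of operation names in execution order and the table maps levels to operation sets (both hypothetical shapes): the flow is cut immediately after every operation that occupies a level alone, since such an operation serializes everything around it.

```python
def pre_partition(flow, dependency_table):
    # Operations that alone occupy a level act as serialization barriers.
    singles = {next(iter(ops))
               for ops in dependency_table.values() if len(ops) == 1}
    parent_slices, current = [], []
    for op in flow:
        current.append(op)
        if op in singles:          # cut right after each barrier operation
            parent_slices.append(current)
            current = []
    if current:                    # trailing ops form the last parent slice
        parent_slices.append(current)
    return parent_slices
```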
5. The method of claim 3, wherein determining, according to the number of levels and/or the number of operations, whether the target parent operation slice can be divided into a plurality of child operation slices having a parallel or serial execution relationship comprises:
if the number of levels is one and the unique target level comprises a plurality of target operations, determining that the target parent operation slice can be divided into a plurality of child operation slices having a parallel execution relationship;
and dividing the target parent operation slice into a plurality of matched child operation slices comprises:
placing each target operation in the target parent operation slice into a separate corresponding child operation slice.
6. The method of claim 3, wherein determining, according to the number of levels and/or the number of operations, whether the target parent operation slice can be divided into a plurality of child operation slices having a parallel or serial execution relationship comprises:
if the number of levels is more than one and at least one target level comprises only one target operation, determining that the target parent operation slice can be divided into a plurality of child operation slices having a serial execution relationship;
and dividing the target parent operation slice into a plurality of matched child operation slices comprises:
cutting the target parent operation slice into a plurality of child operation slices according to the position of each such unique target operation in the target parent operation slice.
7. The method of claim 3, wherein determining, according to the number of levels and/or the number of operations, whether the target parent operation slice can be divided into a plurality of child operation slices having a parallel or serial execution relationship comprises:
if the number of levels is more than one and each target level comprises a plurality of target operations, determining that the target parent operation slice can be divided into a plurality of child operation slices having a parallel execution relationship;
and dividing the target parent operation slice into a plurality of matched child operation slices comprises:
acquiring each target operation in the target level of the highest hierarchy as a starting-point operation; and
acquiring, according to the waiting lists of the target operations, the target operations in lower target levels that have a direct or indirect dependency relationship with each starting-point operation, so as to form a plurality of child operation slices.
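The grouping in claim 7 amounts to reachability from each top-level start point through the wait lists. The function below is a hypothetical sketch, numbering levels so that the highest hierarchy level is 0.

```python
def split_by_start_points(table, wait_lists):
    """table: {level: set of ops}; wait_lists: {op: set of prerequisite ops}.
    Each op of the highest level seeds one child slice, which then absorbs
    every op that directly or indirectly waits on a member of the slice."""
    top_level = min(table)  # highest hierarchy level
    child_slices = []
    for start in sorted(table[top_level]):
        members = {start}
        grew = True
        while grew:  # transitive closure over the dependency edges
            grew = False
            for op, waits in wait_lists.items():
                if op not in members and waits & members:
                    members.add(op)
                    grew = True
        child_slices.append(members)
    return child_slices
```

For two independent chains a→c and b→d this produces two parallel child slices; an op depending on both start points would land in both slices under this sketch, which the patent's disjoint slicing would need to resolve.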
8. The method of claim 3, wherein determining, according to the number of levels and/or the number of operations, whether the target parent operation slice can be divided into a plurality of child operation slices having a parallel or serial execution relationship comprises:
if the number of levels is one and the unique target level comprises only a single target operation, determining that the target parent operation slice cannot be divided into a plurality of child operation slices having a parallel or serial execution relationship.
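Taken together, claims 5 through 8 define a four-way decision over the per-slice level table. A compact restatement, with a hypothetical `{level: ops}` dict and label strings of our choosing:

```python
def split_decision(level_table):
    """Classify a parent slice: 'parallel' / 'serial' split, or 'atomic'."""
    levels = sorted(level_table)
    if len(levels) == 1:
        # Claim 5: one level, many ops  -> parallel split.
        # Claim 8: one level, one op    -> cannot be split further.
        return "parallel" if len(level_table[levels[0]]) > 1 else "atomic"
    if any(len(level_table[lvl]) == 1 for lvl in levels):
        return "serial"    # claim 6: some level holds a single barrier op
    return "parallel"      # claim 7: every level holds several ops
```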
9. The method of claim 1, further comprising, after all parent operation slices are processed:
recording the parent-child slicing relationships and the serial/parallel execution relationships among all slices obtained during the slicing of the benchmark test operation flow;
wherein backtracking, according to the local performance overhead corresponding to each target operation slice, to obtain the total performance overhead of the AI processor under test for the benchmark test operation flow comprises:
backtracking to obtain the total performance overhead of the AI processor under test for the benchmark test operation flow according to the local performance overhead corresponding to each target operation slice and the pre-recorded parent-child slicing relationships and serial/parallel execution relationships among the slices.
10. The method of claim 9, wherein:
the contribution, to the total performance overhead, of the child operation slices having a parallel execution relationship obtained by splitting the same parent operation slice is the maximum of the local performance overheads of those child operation slices; and
the contribution, to the total performance overhead, of the child operation slices having a serial execution relationship obtained by splitting the same parent operation slice is the sum of the local performance overheads of those child operation slices.
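Claims 9 and 10 imply a simple recursion over the recorded parent-child slice tree. This sketch assumes a node is either `("leaf", cost)` or `(mode, children)` with mode `"serial"` or `"parallel"`, a representation the patent does not prescribe.

```python
def backtrack_total(node):
    if node[0] == "leaf":
        return node[1]                 # local overhead measured on one core
    mode, children = node
    costs = [backtrack_total(child) for child in children]
    # Claim 10: serial children accumulate, parallel children overlap.
    return sum(costs) if mode == "serial" else max(costs)
```

For a serial root holding a 10-cycle leaf, a parallel pair of 30 and 20 cycles, and a 5-cycle leaf, the total is 10 + 30 + 5.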
11. The method according to claim 2, further comprising, after obtaining the total performance overhead of the AI processor under test for the benchmark test operation flow by backtracking according to the local performance overhead corresponding to each target operation slice:
performing a performance evaluation of the processor architecture of the AI processor under test according to the total performance overhead and a reference performance overhead matched with the benchmark test.
12. An AI processor simulation apparatus, comprising:
a benchmark test operation flow acquisition module, configured to acquire a benchmark test operation flow matched with the AI processor under test, wherein the benchmark test operation flow comprises a plurality of operations and a set execution sequence is defined among the operations;
a target operation slicing module, configured to establish a dependency relationship table according to the execution sequence among the operations, wherein dependency relationships among the operations are recorded in the dependency relationship table hierarchically by level, and the execution of each operation in a next level depends on the completed execution of at least one operation in a previous level; query the dependency relationship table, and pre-divide the benchmark test operation flow into a plurality of parent operation slices having a serial execution relationship;
acquire each target operation included in a currently processed target parent operation slice, and acquire, from the dependency relationship table, the target level containing each target operation; count the number of the target levels and/or the number of target operations included in each target level; and determine, according to the number of levels and/or the number of operations, whether the target parent operation slice can be divided into a plurality of child operation slices having a parallel or serial execution relationship;
if so, divide the target parent operation slice into a plurality of matched child operation slices and take each child operation slice as a new parent operation slice; otherwise, determine the target parent operation slice as a target operation slice; until all parent operation slices are processed, wherein each target operation slice can run independently in one processor core;
a local operation simulation module, configured to deploy the target operation slices in respective processor cores of a multi-core processor, and obtain, through parallel simulation by the processor cores, the local performance overhead corresponding to each target operation slice; and
a total performance overhead backtracking module, configured to backtrack, according to the local performance overhead corresponding to each target operation slice, to obtain the total performance overhead of the AI processor under test for the benchmark test operation flow.
13. A computer device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-11.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-11.
CN202110669107.5A 2021-06-17 2021-06-17 AI processor simulation method, AI processor simulation device, computer equipment and storage medium Active CN113128143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110669107.5A CN113128143B (en) 2021-06-17 2021-06-17 AI processor simulation method, AI processor simulation device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110669107.5A CN113128143B (en) 2021-06-17 2021-06-17 AI processor simulation method, AI processor simulation device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113128143A CN113128143A (en) 2021-07-16
CN113128143B true CN113128143B (en) 2021-09-28

Family

ID=76783030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110669107.5A Active CN113128143B (en) 2021-06-17 2021-06-17 AI processor simulation method, AI processor simulation device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113128143B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501504B (en) * 2023-06-27 2023-09-12 上海燧原科技有限公司 Space-time mapping method and device for data stream, electronic equipment and storage medium
CN116501503B (en) * 2023-06-27 2023-09-15 上海燧原科技有限公司 Architecture mapping method and device for load task, computer equipment and medium
CN116501594B (en) * 2023-06-27 2023-09-08 上海燧原科技有限公司 System modeling evaluation method and device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007265097A (en) * 2006-03-29 2007-10-11 Fuji Xerox Co Ltd Particle behavior analysis method, particle behavior analysis device, and program
CN102681940A (en) * 2012-05-15 2012-09-19 兰雨晴 Method for carrying out performance test on memory management subsystem of Linux operation system
CN103049310A (en) * 2012-12-29 2013-04-17 中国科学院深圳先进技术研究院 Multi-core simulation parallel accelerating method based on sampling
CN103077006A (en) * 2012-12-27 2013-05-01 浙江工业大学 Multithreading-based parallel executing method for long transaction
CN104516770A (en) * 2014-12-31 2015-04-15 北京神舟航天软件技术有限公司 Program calculation cost estimation technology based on high speed simulation
CN106095654A (en) * 2015-04-28 2016-11-09 瑞萨电子株式会社 Performance verification device, the system with performance verification device and method
CN108595334A (en) * 2018-04-27 2018-09-28 刘尚国 A kind of method, apparatus and readable storage medium storing program for executing calculating java applet Dynamic Slicing
CN112015382A (en) * 2020-10-22 2020-12-01 北京燧原智能科技有限公司 Processor architecture analysis method, device, equipment and storage medium
CN112434061A (en) * 2020-08-25 2021-03-02 上海幻电信息科技有限公司 Task scheduling method and system supporting circular dependence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7370331B2 (en) * 2005-09-08 2008-05-06 International Business Machines Corporation Time slicing in a shared partition
US8117621B2 (en) * 2007-10-24 2012-02-14 International Business Machines Corporation Simulating a multi-queue scheduler using a single queue on a processor
US20180107510A1 (en) * 2016-10-19 2018-04-19 International Business Machines Corporation Operation of a multi-slice processor implementing instruction fusion
CN111914238A (en) * 2020-06-23 2020-11-10 北京迈格威科技有限公司 Image testing method and device, computer equipment and storage medium
CN111914017A (en) * 2020-08-13 2020-11-10 闻泰通讯股份有限公司 Visual chart operation method and device, electronic equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Adaptive Scheduling Framework for Multi-core Systems Based on Task-parallel Programming Model; Cao, Yangjie et al.; International Conference on Computer Science & Education; Dec. 31, 2015; pp. 145-148 *
pCOMPATS: Period-Compatible Task Allocation and Splitting on Multi-core Processors; Arvind Kandhalu et al.; 2012 IEEE 18th Real-Time and Embedded Technology and Applications Symposium; Dec. 31, 2012; pp. 307-316 *
Task decomposition algorithm for multi-Agent systems based on AND/OR dependency graphs; Xiao Zengliang et al.; Computer Engineering and Design; Jan. 28, 2009; vol. 30, no. 2, pp. 426-428 *
Research on worst-case execution time analysis of real-time systems; Ji Mengluo; China Doctoral Dissertations Full-text Database, Information Science and Technology; Jun. 15, 2008; vol. 2008, no. 6, pp. I138-12 *

Also Published As

Publication number Publication date
CN113128143A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN113128143B (en) AI processor simulation method, AI processor simulation device, computer equipment and storage medium
CN109902002B (en) Generation method and device of combined test case, storage medium and computer equipment
US11995438B2 (en) System and method for software architecture redesign
CN113255264B (en) Incremental segmentation processing method and device, computer equipment and storage medium
CN109298868A (en) Intelligent dynamic deployment and unloading method for mapping image data processing software
CN114124567A (en) Cloud service processing method based on big data vulnerability mining and artificial intelligence system
CN113868120A (en) Industrial software debugging method and device, computer equipment and storage medium
CN112799785A (en) Virtual machine cluster migration method, device, equipment and medium
CN114157507A (en) Cloud service vulnerability analysis method and artificial intelligence system adopting big data analysis
Alba Evolutionary algorithms for optimal placement of antennae in radio network design
CN114238135A (en) Test case determination method and device and electronic equipment
CN116523045B (en) Deep learning reasoning simulator oriented to multi-core chip
CN114780967B (en) Mining evaluation method based on big data vulnerability mining and AI vulnerability mining system
CN115629883A (en) Resource prediction method, resource prediction device, computer equipment and storage medium
CN115130043A (en) Database-based data processing method, device, equipment and storage medium
US20230021004A1 (en) Determining an improved technology environment for a software application
CN114143235A (en) NFV automatic test method, device, equipment and storage medium
Zhang et al. A heuristic approach to break cycles for the class integration test order generation
CN111198766A (en) Database access operation deployment method, database access method and device
CN114443205A (en) Fault analysis method, device and non-transitory computer readable storage medium
Xu et al. Efficient supernet training using path parallelism
Hussain et al. A new hierarchical clustering technique for restructuring software at the function level
CN117422031B (en) Method and device for generating and simplifying test vector of ATPG (automatic Teller machine) system
CN117827619B (en) Time-consuming prediction simulation method, device, equipment, medium and system for heterogeneous calculation force
US11734146B2 (en) Analyzing performance metrics for improving technology environment of a software application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant