CN117271249A - Method and equipment for analyzing pipeline performance of artificial intelligence accelerator


Info

Publication number
CN117271249A
Authority
CN
China
Prior art keywords
layer
performance counter
performance
event
events
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210667147.0A
Other languages
Chinese (zh)
Inventor
Request not to publish name
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202210667147.0A
Priority to PCT/CN2023/099399
Publication of CN117271249A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F 11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F 11/2236 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G06F 11/2242 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors in multi-processor systems, e.g. one processor becoming the test master
    • G06F 11/26 Functional testing
    • G06F 11/267 Reconfiguring circuits for testing, e.g. LSSD, partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The present invention relates to a method and apparatus for analyzing the pipeline performance of an artificial intelligence accelerator, wherein the computing device of the present invention is included in an integrated circuit device that includes a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The integrated circuit device may further comprise a storage device, connected to the computing device and the other processing devices respectively, for storing data of the computing device and the other processing devices.

Description

Method and equipment for analyzing pipeline performance of artificial intelligence accelerator
Technical Field
The present invention relates generally to the field of computers. More particularly, the present invention relates to an artificial intelligence accelerator pipeline performance analysis method and apparatus.
Background
As computer systems become increasingly complex, more and more performance events occur in processors, and these events are responsible for performance bottlenecks. To find performance bottlenecks, developers need to collect as much operational data of various kinds as possible in order to optimize the runtime performance of the computer system.
Patent CN106557400A provides a method of dynamic data collection in a device, comprising: collecting, by the device hardware, data of high-level performance events, wherein the high-level performance events indicate candidate causes of performance bottlenecks; determining, based on real-time analysis of the data, a first performance event among the high-level performance events that causes a performance bottleneck; and reconfiguring the device hardware to collect additional data for lower-level performance events below the first performance event, the lower-level performance events indicating additional candidate causes more specific than the candidate causes. The device performs the above steps of collecting, determining and reconfiguring while continuously executing the application.
The prior art described above requires the application to run continuously while performance bottlenecks are inferred, its high-level and low-level performance events are chosen empirically, and it is designed only for processors such as CPUs, GPUs and DSPs; it provides no efficient way to detect performance for the structures specific to artificial intelligence accelerators.
Disclosure of Invention
In order to at least partially solve the technical problems mentioned in the background art, the present invention provides an artificial intelligence accelerator pipeline performance analysis method and equipment.
In one aspect, the present invention discloses a processor core for analyzing pipeline performance, including a control module, an operation module and a storage module, where the control module, the operation module and the storage module correspond to a plurality of events of a pipeline. The events are combined according to different conditions, and the combinations are mutually exclusive and divided into front-end bottlenecks and back-end bottlenecks. The processor core further includes a plurality of performance counters respectively associated with the combinations; each performance counter counts when its combination satisfies the corresponding condition.
In another aspect, the present invention discloses an artificial intelligence accelerator comprising a processor core as described above.
In another aspect, the present invention discloses an integrated circuit device comprising the artificial intelligence accelerator described above.
In another aspect, the present invention discloses a board including the above-mentioned integrated circuit device.
In another aspect, the present invention discloses a method of performance analysis of an artificial intelligence accelerator, the artificial intelligence accelerator comprising a processor core. The method comprises the following steps: dividing a pipeline of processor cores into a plurality of events; forming a plurality of combinations based on a plurality of events, wherein the plurality of combinations are mutually exclusive and are divided into a front-end bottleneck and a back-end bottleneck; configuring performance counters associated with a plurality of combinations; judging whether the plurality of combinations meet corresponding conditions; if so, driving the corresponding performance counter to count; and evaluating the front-end bottleneck and the back-end bottleneck according to the count value of the performance counter.
In another aspect, the present invention discloses a computer readable storage medium having stored thereon computer program code for an artificial intelligence accelerator performance analysis method, which when executed by a processing device, performs the method described above.
The invention has the following technical effects:
1. The invention configures performance counters for the event combinations where performance bottlenecks may occur. Each time performance analysis is performed, the combinations of interest can be selected; the hardware does not need to be reconfigured for each analysis, and only specific performance counters need to be read, so the performance analysis scheme of the invention is light and flexible.
2. The performance analysis scheme of the invention is developed from top to bottom in multiple levels, and the events of each layer are mutually exclusive and divided into front-end bottlenecks and back-end bottlenecks, so performance bottlenecks can be located logically and rapidly.
3. The invention defines event combinations for the special structure of the artificial intelligence accelerator, so performance bottlenecks of the artificial intelligence accelerator can be found directly.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. In the drawings, several embodiments of the invention are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals refer to like or corresponding parts. Wherein:
FIG. 1 is a block diagram showing a board of an embodiment of the present invention;
FIG. 2 is a block diagram showing an integrated circuit device of an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the internal architecture of a computing device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the internal architecture of a processor core according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the top-to-bottom layering of a pipeline in accordance with an embodiment of the present invention;
FIG. 6 is a diagram showing the performance statistics time for layer 0 of an embodiment of the present invention;
FIG. 7 is a diagram showing the performance statistics time for a layer 1 back-end bottleneck in an embodiment of the present invention;
FIG. 8 is a flow chart illustrating a method of another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification and drawings of the present invention are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present invention are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining" or "in response to detecting", depending on the context.
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Most existing artificial intelligence accelerators adopt a parallel operation design and are configured with tens or even hundreds of processor cores, each of which can operate independently and has an independent pipeline. As artificial intelligence accelerators become increasingly complex, efficiency may be low when performing the operations of various algorithms because software and hardware are not well adapted to each other. In many cases the pipeline forms idle bubbles, which create performance bottlenecks so that hardware performance cannot be fully exploited.
The present invention is directed to the special structure of an artificial intelligence accelerator, and in particular to a scheme in which a processor core of the artificial intelligence accelerator analyzes the performance of its pipeline; such a processor core generally comprises a control module, an operation module and a storage module. The present invention configures a plurality of performance counters, each associated with a combination of events. When a combination meets its corresponding condition, the corresponding performance counter counts; a developer can choose to read the information of specific performance counters and evaluate performance bottlenecks hierarchically from top to bottom.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the invention. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing and data mining. Deep learning technology in particular is applied extensively in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large volume of input data, which places high requirements on the storage and computing capacity of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, on-chip storage and strong computing capacity.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. Data to be processed may be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory units 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a microcontroller unit (MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and an off-chip memory 204.
The computing device 201 is an artificial intelligence accelerator configured to perform user-specified operations. It is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computation, and may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general purpose processing device that performs basic control including, but not limited to, data handling, starting and/or stopping of the computing device 201, and the like. Depending on the implementation, the processing device 203 may be one or more types of processors, such as a central processing unit (central processing unit, CPU), a graphics processor (graphics processing unit, GPU), or other general purpose and/or special purpose processors.
The off-chip memory 204 is used to store data to be processed; it is typically DDR memory, typically 16 GB or larger in size, and stores data for the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is used to process input data such as computer vision, speech, natural language and data mining. The computing device 201 adopts a multi-core hierarchical structure: it is a system on chip (SoC) including a plurality of clusters, and each cluster includes a plurality of processor cores. In other words, the computing device 201 is organized in a system-on-chip/cluster/processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 3, computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be a plurality of external memory controllers 301 (two are shown by way of example) for accessing an external memory device, such as the off-chip memory 204 in fig. 2, in order to read data from or write data off-chip in response to access requests issued by the processor cores. The peripheral communication module 302 is configured to receive a control signal from the processing device 203 through the interface device 202 and start the computing device 201 to perform a task. The on-chip interconnect module 303 connects the external memory controllers 301, the peripheral communication module 302 and the plurality of clusters 305 to transfer data and control signals between the modules. The synchronization module 304 is a global barrier controller (GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201; 4 are shown by way of example, and as hardware evolves the computing device 201 of the present invention may also include 8, 16, 64 or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
At the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
The number of processor cores 306 is illustratively shown as 4; the present invention does not limit the number of processor cores 306. The internal architecture of a processor core is shown in fig. 4. Each processor core 306 includes three major modules: a control module 41, an operation module 42 and a storage module 43.
The control module 41 is used for coordinating and controlling the operation of the operation module 42 and the storage module 43 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 411, an instruction decode unit (IDU) 412, and a cache 413. The instruction fetch unit 411 fetches instructions from the processing device 203, and the instruction decode unit 412 decodes the fetched instructions and temporarily stores the decoded results (microinstructions) in the cache 413. At the appropriate time, the microinstructions in the cache 413 are sent as control information to the operation module 42 and the storage module 43.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 performs vector operations and can support complex operations such as vector multiplication, addition and nonlinear transformation; the matrix operation unit 422 is responsible for the core computations of deep learning algorithms, i.e., matrix multiplication and convolution.
The storage module 43 is used for storing or transferring related data, and includes a neuron storage unit (NRAM) 431, a weight storage unit (WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434. NRAM 431 stores the feature maps for the processor core 306 to compute and the intermediate results after computation; WRAM 432 stores the weights of the deep learning network; IODMA 433 controls access between NRAM 431/WRAM 432 and the off-chip memory 204 via the broadcast bus 309; MVDMA 434 controls access between NRAM 431/WRAM 432 and SRAM 308.
Returning to FIG. 3, the memory core 307 is mainly used for storage and communication, i.e., storing shared data or intermediate results between the processor cores 306, and carrying out communication between the cluster 305 and the off-chip memory 204, communication among the clusters 305, communication among the processor cores 306, and so on. In other embodiments, the memory core 307 has scalar operation capability to perform scalar operations.
The memory core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 306 in the same cluster 305 need not be fetched from the off-chip memory 204 by each processor core 306 individually, but is transferred among the processor cores 306 through the SRAM 308. The memory core 307 only needs to rapidly distribute the multiplexed data from the SRAM 308 to the plurality of processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, CDMA 310 and GDMA 311 are used to perform communication among the processor cores 306, communication among the clusters 305, and data transfer between the cluster 305 and the off-chip memory 204, respectively. Each is described below.
The broadcast bus 309 is used for high-speed communication among the processor cores 306 in the cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point data transmission (i.e., from a single processor core to a single processor core); multicast is a communication scheme that transfers a piece of data from the SRAM 308 to a specific number of processor cores 306; and broadcast, a special case of multicast, transfers a piece of data from the SRAM 308 to all processor cores 306.
CDMA 310 is used to control access to SRAM 308 between different clusters 305 within the same computing device 201.
The GDMA 311 cooperates with the external memory controller 301 to control access from the SRAM 308 of the cluster 305 to the off-chip memory 204, or to read data from the off-chip memory 204 into the SRAM 308. From the foregoing, it can be seen that communication between the off-chip memory 204 and NRAM 431 or WRAM 432 can be achieved via two channels. The first channel directly connects the off-chip memory 204 with NRAM 431 or WRAM 432 through IODMA 433; the second channel transfers data between the off-chip memory 204 and the SRAM 308 via GDMA 311, and then between the SRAM 308 and NRAM 431 or WRAM 432 via MVDMA 434. Although the second channel seemingly requires more elements and a longer data path, in practice in some embodiments its bandwidth is much greater than that of the first channel, so communication between the off-chip memory 204 and NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present invention may select a data transmission channel according to their hardware conditions.
In other embodiments, the functionality of GDMA 311 and that of IODMA 433 may be integrated in the same component. For convenience of description, GDMA 311 and IODMA 433 are regarded as different components; as long as the functions and technical effects achieved are similar to those of the present invention, they belong to the protection scope of the present invention. Further, the functions of GDMA 311, IODMA 433, CDMA 310 and MVDMA 434 may be implemented by the same component.
This embodiment analyzes the pipeline performance of the processor cores 306 in the computing device 201, i.e., it focuses on the individual and cooperative performance of the control module 41, the operation module 42 and the storage module 43. The pipeline of the processor core 306 is layered into events of multiple layers, which a developer can combine hierarchically according to different conditions based on the events of interest. The combinations are mutually exclusive and divided into front-end bottlenecks and back-end bottlenecks, corresponding to individual or cooperative behavior of the control module 41, the operation module 42 and the storage module 43; for example, a front-end bottleneck may be that an instruction of the control module 41 cannot be issued, while a back-end bottleneck may be that the operation module 42 cannot operate or the storage module 43 cannot access memory.
Fig. 5 shows a schematic diagram of how this embodiment layers the pipeline from top to bottom. The pipeline of the processor core 306 is divided from top to bottom into multiple layers, and each layer is divided into a front end and a back end. As shown, the uppermost layer of the pipeline is layer 0, which divides the entire pipeline into front end F0 and back end B0. Layer 1 is the next layer below layer 0: front end F0 of layer 0 is divided into front ends F1 and F2 of layer 1, and back end B0 of layer 0 is divided into back ends B1 and B2 of layer 1. Layer 2 is the next layer below layer 1, with front end F1 of layer 1 illustratively split into front ends F3 and F4 of layer 2. By analogy, this embodiment can divide downward into N layers, refining the front end and back end. Bottlenecks occurring at the front end and back end of each layer are front-end bottlenecks and back-end bottlenecks, respectively. The developer can define the front end and back end at will and plan the combinations of events based on those definitions; since the combinations are mutually exclusive, the cause of a bottleneck can be clearly identified.
There may be tens or even hundreds of events associated with the processor core 306. While these events are related to performance bottlenecks, the relationship is sometimes not obvious; this embodiment therefore generates combinations from the logical relationships of the events, each combination reflecting an actual performance bottleneck, and the combinations are mutually exclusive.
The processor core 306 further includes a plurality of performance counters (not shown) disposed directly within the control module 41, the operation module 42 and the storage module 43, and respectively associated with the combinations of pipeline events in those modules. In more detail, these performance counters are associated with the combinations of the pipeline hierarchy of the processor core 306: following the front ends and back ends of the pipeline, the performance counters of one layer are summarized into a performance counter of the previous layer and expanded into performance counters of the next layer. Taking fig. 5 as an example, the performance counters of front ends F1 and F2 of layer 1 are summarized into the performance counter of front end F0 of layer 0, and the performance counter of front end F1 of layer 1 is expanded into the performance counters of front ends F3 and F4 of layer 2.
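To make the tree structure and the roll-up concrete, the following Python sketch (segment names and counter values are illustrative assumptions, not taken from the patent) models the segments of fig. 5 and sums lower-layer counters into the layer above:

```python
# A minimal sketch, not the patent's implementation: the top-down pipeline
# layering of FIG. 5 as a tree, where each node is a pipeline segment and a
# parent's children partition it into mutually exclusive segments, plus the
# roll-up in which lower-layer counter values summarize to the layer above.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Segment:
    name: str                     # e.g. "F0" (front end) or "B0" (back end)
    children: List["Segment"] = field(default_factory=list)

def rollup(seg: Segment, counts: Dict[str, int]) -> int:
    """A leaf contributes its own counter; an inner segment is the sum of
    its children, mirroring how layer-1 counters summarize to layer 0."""
    if not seg.children:
        return counts.get(seg.name, 0)
    return sum(rollup(c, counts) for c in seg.children)

# Layer 0 splits the pipeline into F0/B0; layer 1 splits F0 into F1/F2 and
# B0 into B1/B2; layer 2 splits F1 into F3/F4, as in FIG. 5.
f0 = Segment("F0", [Segment("F1", [Segment("F3"), Segment("F4")]),
                    Segment("F2")])
b0 = Segment("B0", [Segment("B1"), Segment("B2")])

counts = {"F3": 12, "F4": 18, "F2": 10, "B1": 35, "B2": 25}  # assumed values
assert rollup(f0, counts) == 40   # F0 = F1 + F2 = (F3 + F4) + F2
assert rollup(b0, counts) == 60   # B0 = B1 + B2
```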
Events of this embodiment include run events and idle events caused by specific factors, for example control pipeline events occurring at the control module 41, subdivided by control pipeline time; operation pipeline events occurring at the operation module 42, subdivided by operation type; and memory pipeline events occurring at the storage module 43, subdivided by access type or access location.
This embodiment illustratively includes at least layer 0 to layer 2 performance counters. The layer 0 performance counters are associated with combinations of the instruction issue event of the control module 41 and the operation access event of the operation module 42 and the storage module 43. The layer 1 performance counters are lower-layer expansions of the layer 0 performance counters; they are associated with combinations of the instruction fetch blocking event and the downstream blocking event related to the instruction issue event, and of the operation events of the operation module 42 and the access events of the storage module 43. The layer 2 performance counters are lower-layer expansions of the layer 1 performance counters and are associated with combinations of at least one of the following events: a convolution operation event, a scalar operation event or a vector operation event in the operation module 42; an on-chip or off-chip memory access event in the storage module 43; or a data input/output event or data handling event in the storage module 43. The layer 2 performance counters are also associated with combinations of the instruction fetch blocking event and instruction downstream blocking event among the operation events; the instruction fetch blocking event and instruction downstream blocking event among the memory access events; the operation proportion event of the operation time of the operation module 42 to the total time; or the memory proportion event of the access time of the storage module 43 to the total time. The relationship of each layer's events to the performance counters is described in detail below.
Fig. 6 shows a performance statistics time diagram for layer 0. Since layer 0 is the uppermost layer, it reflects the combinations of performance events of the overall pipeline 60 of the processor core 306. In this embodiment, layer 0 is defined by the instruction issue event 601 of the control module 41 and the operation access event 602 of the operation module 42 or the storage module 43. As shown in fig. 6, the instruction issue event 601 and the operation access event 602 divide the whole pipeline 60 into 4 event combinations: the first combination means the instruction issue event 601 is running and the operation access event 602 is idle; the second combination means the instruction issue event 601 is idle and the operation access event 602 is idle; the third combination means the instruction issue event 601 is idle and the operation access event 602 is running; and the fourth combination means the instruction issue event 601 is running and the operation access event 602 is running, i.e., the two events overlap. After such grouping, the layer 0 combinations are mutually exclusive, i.e., the first, second, third and fourth combinations do not overlap each other.
Based on the above grouping, this embodiment defines the first and second combinations as the front-end bottleneck of layer 0, and the third and fourth combinations as the back-end bottleneck of layer 0. Each combination of layer 0 is configured with a performance counter for calculating the average occurrence rate of that combination. The layer 0 performance counters comprise first to fourth performance counters: the first performance counter counts the front-end bottleneck of layer 0 and counts when the condition of the first combination (the instruction issue event 601 is running and the operation access event 602 is idle) is met; the second performance counter counts the front-end bottleneck of layer 0 and counts when the condition of the second combination (the instruction issue event 601 is idle and the operation access event 602 is idle) is met; the third performance counter counts the back-end bottleneck of layer 0 and counts when the condition of the third combination (the instruction issue event 601 is idle and the operation access event 602 is running) is met; and the fourth performance counter counts the back-end bottleneck of layer 0 and counts when the condition of the fourth combination (the instruction issue event 601 is running and the operation access event 602 is running) is met.
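The mutual exclusivity of the four layer-0 combinations can be illustrated with a short sketch. The following Python is an assumption-laden illustration (the counter names c1 to c4 and the update function are not from the patent): exactly one of the four counters increments in any beat, so the four counts partition the statistics time.

```python
# A minimal sketch of how the four mutually exclusive layer-0 conditions of
# FIG. 6 drive the first to fourth performance counters each beat.
# issue_busy models instruction issue event 601; opmem_busy models
# operation access event 602.
def layer0_update(counters: dict, issue_busy: bool, opmem_busy: bool) -> None:
    if issue_busy and not opmem_busy:        # first combination (front end)
        counters["c1"] += 1
    elif not issue_busy and not opmem_busy:  # second combination (front end)
        counters["c2"] += 1
    elif not issue_busy and opmem_busy:      # third combination (back end)
        counters["c3"] += 1
    else:                                    # fourth combination (back end)
        counters["c4"] += 1

counters = {"c1": 0, "c2": 0, "c3": 0, "c4": 0}
for issue, opmem in [(True, False), (False, False), (False, True), (True, True)]:
    layer0_update(counters, issue, opmem)
# exactly one counter increments per beat, so the counts partition time
assert sum(counters.values()) == 4
```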
In summary, layer 0 of this embodiment utilizes 4 performance counters to count 4 mutually exclusive event combinations, respectively.
Layer 1 is the next layer below layer 0 and is likewise divided into a front-end bottleneck and a back-end bottleneck. Specifically, the front-end bottleneck of layer 1 is a further subdivision of the front-end bottleneck of layer 0: it covers instruction fetch blocking events and downstream blocking events, for example a cache miss when fetching an instruction causing an excessively long pipeline interrupt time, a branch prediction error, waiting for a register dependency to be released, an excessively long transmission delay of a synchronization instruction within a core or between cores, or an unreasonable synchronization instruction configuration. The back-end bottleneck of layer 1 is a further subdivision of the back-end bottleneck of layer 0: it covers combinations of the operation events of the operation module 42 and the access events of the storage module 43, i.e., the running and idle states of the operation module 42 and/or the storage module 43. Table 1 shows the combinations of layer 1.
TABLE 1
Fifth combination (layer 1 front end): the operation module 42 or the storage module 43 cannot receive instructions because the instruction queue is full
Sixth combination (layer 1 front end): the operation module 42 or the storage module 43 cannot receive instructions for synchronization reasons
Seventh combination (layer 1 front end): a cache miss or branch prediction error when fetching an instruction makes the pipeline interrupt time exceed a threshold
Eighth combination (layer 1 front end): an instruction waits for a register dependency to be released
Ninth combination (layer 1 back end): the operation event and the memory access event are both running
Tenth combination (layer 1 back end): the operation event is running and the memory access event is idle
Eleventh combination (layer 1 back end): the operation event is idle and the memory access event is running
This embodiment configures 7 performance counters for layer 1 combinations: the fifth to eleventh performance counters count fifth to eleventh combinations, respectively.
The fifth performance counter counts the front-end bottleneck of layer 1 and is associated with the fifth combination on the premise that the first performance counter counts, i.e., the operation module 42 and the storage module 43 cannot receive instructions because the instruction queue is full. Specifically, the fifth performance counter counts when the following conditions are met: a microinstruction is being issued from the cache 413, and the instruction queue of either the operation module 42 or the storage module 43 is full.
The sixth performance counter counts the front-end bottleneck of layer 1 and is associated with the sixth combination on the premise that the first performance counter counts, i.e., the operation module 42 and the storage module 43 cannot receive instructions for synchronization reasons. Specifically, the sixth performance counter counts when the following conditions are met: a microinstruction is being issued from the cache 413, and either the operation module 42 or the storage module 43 is in a synchronized state.
The seventh performance counter counts the front-end bottleneck of layer 1 and is associated with the seventh combination on the premise that the second performance counter counts, i.e., a cache miss when fetching an instruction makes the pipeline interrupt time too long. Specifically, the seventh performance counter counts when a cache miss or branch prediction error causes the pipeline interrupt time to exceed a threshold.
The eighth performance counter counts the front-end bottleneck of layer 1 and is associated with the eighth combination on the premise that the second performance counter counts, i.e., an instruction waits for a register dependency to be released. Specifically, the eighth performance counter counts when an instruction waiting for a register dependency release causes the pipeline interrupt time to exceed a threshold.
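The gating relationship between layers, where a layer-1 counter counts only on the premise that its parent layer-0 counter counts, can be sketched as follows. This is a minimal Python illustration with assumed names and signals, not the patent's hardware:

```python
# Sketch of the hierarchical gating just described (all names assumed):
# the fifth and sixth counters are gated by the first combination, the
# seventh and eighth by the second, and within each gate the conditions
# remain mutually exclusive.
def layer1_frontend_update(counters: dict, first: bool, second: bool,
                           issuing: bool, queue_full: bool, syncing: bool,
                           miss_stall: bool, dep_stall: bool) -> None:
    if first and issuing:
        if queue_full:
            counters["c5"] += 1   # fifth: instruction queue full
        elif syncing:
            counters["c6"] += 1   # sixth: module in a synchronized state
    elif second:
        if miss_stall:
            counters["c7"] += 1   # seventh: cache miss / branch misprediction
        elif dep_stall:
            counters["c8"] += 1   # eighth: waiting for dependency release

counters = {"c5": 0, "c6": 0, "c7": 0, "c8": 0}
layer1_frontend_update(counters, first=True, second=False,
                       issuing=True, queue_full=True, syncing=False,
                       miss_stall=False, dep_stall=False)
assert counters["c5"] == 1   # a full-queue beat under the first gate counts c5
```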
FIG. 7 shows a performance statistics time diagram of the back-end bottleneck of layer 1. The back end of layer 1 relates to the operation event 701 and the memory access event 702; as shown, the back-end bottleneck of layer 1 is divided, mutually exclusively, into a ninth combination, a tenth combination and an eleventh combination.
The ninth performance counter counts the back-end bottleneck of layer 1 and is associated with the ninth combination on the premise that the third or the fourth performance counter counts, i.e., the operation event and the memory access event run simultaneously. Specifically, the ninth performance counter counts when the operation module 42 is running and the storage module 43 is running.
The tenth performance counter counts the back-end bottleneck of layer 1 and is associated with the tenth combination on the premise that the third or the fourth performance counter counts, i.e., the operation event is running and the memory access event is idle. Specifically, the tenth performance counter counts when the operation module 42 is running and the storage module 43 is idle.
The eleventh performance counter counts the back-end bottleneck of layer 1 and is associated with the eleventh combination on the premise that the third or the fourth performance counter counts, i.e., the operation event is idle and the memory access event is running. Specifically, the eleventh performance counter counts when the operation module 42 is idle and the storage module 43 is running.
To sum up, layer 1 of this embodiment utilizes 7 performance counters to count 7 mutually exclusive event combinations, respectively.
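The three-way back-end split of FIG. 7 can be pictured with the same style of sketch as layer 0 (names and signals assumed, not the patent's hardware): gated by the layer-0 back end, a beat falls into exactly one of the ninth, tenth and eleventh combinations.

```python
# Sketch of the layer-1 back-end split: gated by the layer-0 back end
# (the third or fourth counter counting), the busy/idle states of the
# operation module and the storage module select one combination.
def layer1_backend_update(counters: dict, layer0_back: bool,
                          op_busy: bool, mem_busy: bool) -> None:
    if not layer0_back:
        return                    # premise: third or fourth counter counts
    if op_busy and mem_busy:
        counters["c9"] += 1       # ninth: operation and access in parallel
    elif op_busy:
        counters["c10"] += 1      # tenth: operation running, access idle
    elif mem_busy:
        counters["c11"] += 1      # eleventh: operation idle, access running

c = {"c9": 0, "c10": 0, "c11": 0}
layer1_backend_update(c, layer0_back=True, op_busy=True, mem_busy=False)
assert c["c10"] == 1              # operation running while access is idle
```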
Layer 2 is the next layer below layer 1 and is likewise divided into front-end bottlenecks and back-end bottlenecks; i.e., the front-end bottlenecks of layer 2 further subdivide the front-end bottlenecks of layer 1, and the back-end bottlenecks of layer 2 further subdivide the back-end bottlenecks of layer 1. For example, the operation event bottleneck at the back end of layer 1 can be subdivided into 2 mutually exclusive back-end bottlenecks of layer 2: whether the memory access event is blocked, and whether the operation event itself is efficient. Possible reasons for a blocked memory access event are an instruction issue bottleneck (e.g., insufficient instruction queue depth in the storage module 43) and instructions that cannot be issued due to synchronization. Likewise, the memory access event bottleneck at the back end of layer 1 can be subdivided into 2 mutually exclusive back-end bottlenecks of layer 2: whether the operation event is blocked, and whether the memory access event is efficient enough. Since the back-end bottlenecks of layer 2 focus on the details of the operation events and memory access events, the layer 2 back-end performance counters count on the premise that any one of the ninth, tenth and eleventh performance counters counts. Table 2 shows some of the layer 2 back-end combinations of this embodiment.
TABLE 2
Twelfth combination (premise: tenth counter): a cache miss when fetching an instruction makes the pipeline interrupt time too long
Thirteenth combination (premise: tenth counter): an instruction of the storage module 43 waits for a register dependency to be released
Fourteenth combination (premise: tenth counter): the storage module 43 cannot receive instructions because its instruction queue is full
Fifteenth combination (premise: tenth counter): the storage module 43 cannot receive instructions for synchronization reasons
Sixteenth combination (premise: first, second and tenth counters): beats in which the operation module 42 executes instructions / total beats
Seventeenth combination (premise: eleventh counter): a cache miss when fetching an instruction makes the pipeline interrupt time too long
Eighteenth combination (premise: eleventh counter): an instruction of the operation module 42 waits for a register dependency to be released
Nineteenth combination (premise: eleventh counter): the operation module 42 cannot receive instructions because its instruction queue is full
Twentieth combination (premise: eleventh counter): the operation module 42 cannot receive instructions for synchronization reasons
Twenty-first combination (premise: first, second and eleventh counters): beats in which the storage module 43 executes instructions / total beats
This embodiment configures 10 performance counters for the layer 2 back-end combinations of Table 2: the twelfth to twenty-first performance counters count the twelfth to twenty-first combinations, respectively.
The twelfth performance counter is associated with the twelfth combination, i.e., a cache miss when fetching an instruction makes the pipeline interrupt time too long. On the premise that the tenth performance counter counts, the twelfth performance counter counts when the idle time of the operation module 42 or the storage module 43 exceeds a threshold and a miss occurs in the cache 413.
The thirteenth performance counter is associated with the thirteenth combination, i.e., an instruction waits for a register dependency to be released. On the premise that the tenth performance counter counts, it counts when the idle time of the storage module 43 caused by a dependency release event exceeds a threshold.
The fourteenth performance counter is associated with the fourteenth combination, i.e., the storage module 43 cannot receive instructions because its instruction queue is full. On the premise that the tenth performance counter counts, it counts when a memory access microinstruction is being issued from the cache 413 and the queue of the storage module 43 is full.
The fifteenth performance counter is associated with the fifteenth combination, i.e., the storage module 43 cannot receive instructions for synchronization reasons. On the premise that the tenth performance counter counts, it counts when the following conditions are met: a memory access microinstruction is being issued from the cache 413; the storage module 43 cannot accept the microinstruction; and the storage module 43 is in a synchronized state.
The sixteenth performance counter is associated with the sixteenth combination, i.e., the number of beats in which the operation module 42 executes instructions divided by the total number of beats. On the premise that the first, second and tenth performance counters count simultaneously, it counts when the operation module 42 is running instructions for at least a specific proportion of the total time.
The seventeenth performance counter is associated with the seventeenth combination, i.e., a cache miss when fetching an instruction makes the pipeline interrupt too long. On the premise that the eleventh performance counter counts, it counts when the idle time of the operation module 42 exceeds a threshold and a cache miss occurs.
The eighteenth performance counter is associated with the eighteenth combination, i.e., an instruction of the operation module 42 waits for a register dependency to be released. On the premise that the eleventh performance counter counts, it counts when the idle time of the operation module 42 caused by a dependency release event exceeds a threshold.
The nineteenth performance counter is associated with the nineteenth combination, i.e., the operation module 42 cannot receive instructions because its instruction queue is full. On the premise that the eleventh performance counter counts, it counts when a microinstruction is being issued from the cache 413 and the queue of the operation module 42 is full.
The twentieth performance counter is associated with the twentieth combination, i.e., the operation module 42 cannot receive instructions for synchronization reasons. On the premise that the eleventh performance counter counts, it counts when the following conditions are met: a microinstruction is being issued from the cache 413; the operation module 42 cannot accept the operation microinstruction; and the operation module 42 is in a synchronized state.
The twenty-first performance counter is associated with the twenty-first combination, i.e., the number of beats in which the storage module 43 executes instructions divided by the total number of beats. On the premise that the first, second and eleventh performance counters count simultaneously, it counts when the storage module 43 is running instructions for at least a specific proportion of the total time.
In summary, the back-end of layer 2 of this embodiment illustratively utilizes 10 performance counters to count 10 mutually exclusive event combinations, respectively.
The counts of the performance counters of this embodiment are used to reflect the average occurrence rates of the pipeline of the processor core 306 at run time. For example, if the average occurrence rate of front end F0 of layer 0 is too low, the hardware usage of the layer below F0 can be further checked, i.e., the average occurrence rates of front ends F1 and F2 of layer 1; if the average occurrence rate of front end F2 of layer 1 is also found to be too low, the average occurrence rates of front ends F5 and F6 of layer 2 are checked in turn, so that by tracking downward the key cause of the performance bottleneck can be clearly locked.
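This drill-down can be written as a tiny traversal. The following Python sketch (the tree structure and occurrence values are assumptions for illustration) follows the most anomalous average occurrence rate down the layers, mirroring the F0 to F2 to F5/F6 path above:

```python
# Sketch of the top-down drill-down: starting from a layer-0 segment,
# descend into whichever child shows the lowest average occurrence
# until a leaf pinpoints the bottleneck.
def locate_bottleneck(tree: dict, occurrence: dict, root: str = "F0") -> str:
    node = root
    while tree.get(node):                     # descend while children exist
        node = min(tree[node], key=lambda c: occurrence[c])
    return node

tree = {"F0": ["F1", "F2"], "F2": ["F5", "F6"]}   # F2 expands into layer 2
occurrence = {"F1": 0.90, "F2": 0.30, "F5": 0.80, "F6": 0.20}
assert locate_bottleneck(tree, occurrence) == "F6"  # path F0 -> F2 -> F6
```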
In more detail, during a period (e.g., 100 beats), when a combination satisfies its condition, the corresponding performance counter counts; the count value represents the average occurrence of that event combination during the period. The developer can decide the number of performance counters according to the hardware resources, selectively read the information of specific performance counters, and evaluate performance bottlenecks from top to bottom in a layered manner. Since this embodiment layers the pipeline from top to bottom, the performance counters are likewise distributed from top to bottom to analyze performance bottlenecks hierarchically.
For example, the developer may expect the average occurrence of the ninth combination of FIG. 7 to be as high as possible, because the operation module 42 and the storage module 43 running simultaneously indicates that operation events run in parallel with memory access events, and the higher the degree of parallelism, the higher the benefit of the processor core 306. Accordingly, the developer may choose to read the count value of the ninth performance counter to determine whether its average occurrence meets expectations.
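As a usage illustration, such a check might look like the following sketch; the counter value, the 100-beat period and the 0.5 expectation threshold are assumptions for illustration, not figures from the patent:

```python
# Sketch of reading a selected counter and converting it into the average
# occurrence over a statistics period, as described above.
def average_occurrence(count: int, period_beats: int = 100) -> float:
    return count / period_beats

ninth_count = 63                       # hypothetical value read from hardware
parallelism = average_occurrence(ninth_count)
# the higher this ratio, the more often operation and memory access overlap
if parallelism < 0.5:                  # assumed expectation, tune per workload
    print(f"low operation/access parallelism: {parallelism:.0%}")
```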
The above combinations are merely examples; those skilled in the art can derive other mutually exclusive event combinations at the same layer, such as event combinations at the front end of layer 2, or can layer further, such as front-end and back-end bottlenecks of a layer 3. This embodiment limits neither the combinations nor the number of layers.
For the artificial intelligence accelerator, this embodiment arranges a plurality of performance counters in hardware and selects the event combinations of interest without reconfiguring the hardware; the pipeline is expanded from top to bottom in multiple levels, and the combinations of each layer are mutually exclusive and divided into front-end bottlenecks and back-end bottlenecks, so that performance bottlenecks can be located logically and rapidly.
Another embodiment of the present invention is a method for analyzing the performance of an artificial intelligence accelerator, specifically the pipeline performance of the processor core 306 in the computing device 201 of Figs. 1 to 4, i.e., focusing on the individual and cooperative performance of the control module 41, the operation module 42 and the storage module 43. This embodiment also layers the pipeline of the processor core 306 into events of multiple layers, which the developer can combine hierarchically according to different conditions based on the events of interest; the combinations are mutually exclusive and divided into front-end bottlenecks and back-end bottlenecks, corresponding to individual or cooperative behavior of the control module 41, the operation module 42 and the storage module 43. Fig. 8 shows a flowchart of this embodiment.
In step 801, the pipeline of the processor core 306 is divided into a plurality of events. Events of this embodiment include run events and idle events caused by specific factors, for example control pipeline events occurring at the control module 41, subdivided by control pipeline time; operation pipeline events occurring at the operation module 42, subdivided by operation type; and memory pipeline events occurring at the storage module 43, subdivided by access type or access location.
This embodiment divides the pipeline into at least 3 layers: layer 0, layer 1 and layer 2. Layer 0 is the uppermost layer and reflects the combinations of performance events of the overall pipeline of the processor core 306. In this embodiment, layer 0 is divided into the front-end bottleneck of the instruction issue event of the control module 41, and the back-end bottleneck of the operation access event of the operation module 42 or the storage module 43.
Layer 1 is the next layer below layer 0 and is likewise divided into a front-end bottleneck and a back-end bottleneck. Specifically, the front-end bottleneck of layer 1 is a further subdivision of the front-end bottleneck of layer 0: it covers instruction fetch blocking events and downstream blocking events, for example a cache miss when fetching an instruction causing an excessively long pipeline interrupt time, a branch prediction error, waiting for a register dependency to be released, an excessively long transmission delay of a synchronization instruction within a core or between cores, or an unreasonable synchronization instruction configuration.
Layer 2 is the next layer below layer 1. Its events include convolution operation events, scalar operation events and vector operation events in the operation module 42; on-chip or off-chip memory access events in the storage module 43; data input/output events or data handling events in the storage module 43; instruction fetch blocking events and instruction downstream blocking events among the operation events; instruction fetch blocking events and instruction downstream blocking events among the memory access events; an operation proportion event of the operation time of the operation module 42 to the total time; and a memory proportion event of the access time of the storage module 43 to the total time.
In step 802, a plurality of combinations are formed based on the events; the combinations are mutually exclusive and divided into front-end bottlenecks and back-end bottlenecks. Layer 0 combines the instruction issue event of the control module 41 with the operation access event of the operation module 42 and the storage module 43. Layer 1 combines the instruction fetch blocking event and the downstream blocking event of the instruction issue event, and combines the operation events of the operation module 42 with the memory access events of the storage module 43. Layer 2 combines at least one of the following events: a convolution operation event, a scalar operation event or a vector operation event in the operation module 42; an on-chip or off-chip memory access event in the storage module 43; a data input/output event or a data handling event in the storage module 43; an instruction fetch blocking event and an instruction downstream blocking event among the operation events; an instruction fetch blocking event and an instruction downstream blocking event among the memory access events; an operation proportion event of the operation time of the operation module 42 to the total time; or a memory proportion event of the access time of the storage module 43 to the total time.
Illustratively, this embodiment specifically divides the pipeline into the following combinations.
The first combination refers to the instruction issue event running and the operation access event idling, the second combination refers to the instruction issue event idling and the operation access event idling, the third combination refers to the instruction issue event idling and the operation access event running, and the fourth combination refers to the instruction issue event running and the operation access event running. After such grouping, the layer 0 combinations are mutually exclusive, i.e., the first, second, third, and fourth combinations do not overlap each other. Based on the above-mentioned grouping, this embodiment defines the first combination and the second combination as the front-end bottleneck of the layer 0, and the third combination and the fourth combination as the back-end bottleneck of the layer 0.
The front-end bottleneck of layer 1 includes the fifth combination, i.e., the operation module 42 and the storage module 43 cannot receive instructions because the instruction queue is full; the sixth combination, i.e., the operation module 42 and the storage module 43 cannot receive instructions for synchronization reasons; the seventh combination, i.e., a cache miss when fetching an instruction makes the pipeline interrupt time too long; and the eighth combination, i.e., an instruction waits for a register dependency to be released.
The back-end bottleneck of the layer 1 is a further subdivision of the back-end bottleneck of the layer 0, and the back-end bottleneck of the layer 1 is a combination of the operation event of the operation module 42 and the access event of the storage module 43, such as the running and idle states of the operation module 42 and/or the storage module 43. The back-end bottleneck of layer 1 includes a ninth combination, i.e., the operation event and the memory access event run simultaneously; a tenth combination, i.e., the operation event is running and the memory access event is idle; the eleventh combination is that the operation event is idle and the memory access event is running.
Layer 2 is the next layer below layer 1 and is also divided into front-end and back-end bottlenecks: the front-end bottlenecks of layer 2 further subdivide the front-end bottlenecks of layer 1, and the back-end bottlenecks of layer 2 further subdivide those of layer 1. For example, the operation event bottleneck at the back end of layer 1 can be subdivided into 2 mutually exclusive back-end bottlenecks of layer 2: whether the memory access event is blocked, and whether the operation event itself is efficient. Possible reasons for a blocked memory access event are an instruction issue bottleneck (e.g., insufficient instruction queue depth in the storage module 43) and instructions that cannot be issued due to synchronization. Likewise, the memory access event bottleneck at the back end of layer 1 can be subdivided into 2 mutually exclusive back-end bottlenecks of layer 2: whether the operation event is blocked, and whether the memory access event is efficient enough.
The back end of layer 2 includes: a twelfth combination, i.e., a cache miss on instruction fetch causes the pipeline interrupt time to be too long; a thirteenth combination, i.e., instructions wait for registers to resolve dependencies; a fourteenth combination, i.e., the storage module 43 cannot receive instructions because the instruction queue is full; a fifteenth combination, i.e., the storage module 43 cannot receive instructions for synchronization reasons; a sixteenth combination, i.e., the ratio of beats in which the operation module 42 executes instructions to the total beats; a seventeenth combination, i.e., a cache miss on instruction fetch causes the pipeline interrupt time to be too long; an eighteenth combination, i.e., instructions of the operation module 42 wait for registers to resolve dependencies; a nineteenth combination, i.e., the operation module 42 cannot receive instructions because the instruction queue is full; a twentieth combination, i.e., the operation module 42 cannot receive instructions for synchronization reasons; and a twenty-first combination, i.e., the ratio of beats in which the storage module 43 executes instructions to the total beats.
The above combinations are merely examples, and other mutually exclusive combinations of events at the same level can be derived by those skilled in the art.
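As a reading aid only, the layering described above can be summarized in the following sketch; the combination numbering follows this embodiment, and the point is the structure: each layer's combinations are mutually exclusive refinements of the layer above.

```python
# Summary of the layered combinations of this embodiment (numbers 1-21).
# Layer 1 refines layer 0, and layer 2 refines layer 1.
LAYERS = {
    "layer 0": {"front-end": [1, 2], "back-end": [3, 4]},
    "layer 1": {"front-end": [5, 6, 7, 8],          # refines layer 0 front end
                "back-end": [9, 10, 11]},           # refines layer 0 back end
    "layer 2": {"back-end": list(range(12, 22))},   # combinations 12 to 21
}
```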
In step 803, performance counters are configured to be associated with the plurality of combinations. The performance counters are arranged directly in the control module 41, the operation module 42, and the storage module 43, and are associated with the combinations respectively.
In more detail, the first through twenty-first performance counters are configured to be associated with the correspondingly numbered combinations: the first performance counter is associated with the first combination, the second performance counter with the second combination, and so on, through the twenty-first performance counter with the twenty-first combination.
The developer can decide the number of performance counters according to the available hardware resources, and can choose, as appropriate, to read the information of specific performance counters.
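A minimal software-side sketch of this selection follows; the CounterBank wrapper and its methods are assumptions of this sketch, not a register interface prescribed by this disclosure.

```python
# Hypothetical wrapper over the hardware performance counters:
# combination i is backed by counts[i - 1].
class CounterBank:
    def __init__(self, num_counters: int = 21):
        self.counts = [0] * num_counters

    def read(self, combo_ids):
        # Read only the counters of interest; the hardware is never
        # reconfigured between analyses.
        return {i: self.counts[i - 1] for i in combo_ids}

bank = CounterBank()
front_end_layer0 = bank.read([1, 2])  # e.g. inspect the layer 0 front end only
```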
In step 804, it is determined whether the plurality of combinations satisfy their respective conditions. Specifically, at layer 0 it is determined whether the first combination's condition (instruction issue event running and operation access event idle) is satisfied, whether the second combination's condition (instruction issue event idle and operation access event idle) is satisfied, whether the third combination's condition (instruction issue event idle and operation access event running) is satisfied, and whether the fourth combination's condition (instruction issue event running and operation access event running) is satisfied.
At layer 1, for the fifth combination it is determined whether microinstructions are being issued from the cache 413 and the instruction queue of the operation module 42 or the storage module 43 is full; for the sixth combination, whether microinstructions are being issued from the cache 413 and the operation module 42 or the storage module 43 is in a synchronous state; for the seventh combination, whether a cache miss or a branch prediction error causes the pipeline interrupt time to exceed a threshold; for the eighth combination, whether instructions waiting for registers to resolve dependencies cause the pipeline interrupt time to exceed a threshold; for the ninth combination, whether the operation module 42 is running and the storage module 43 is running; for the tenth combination, whether the operation module 42 is running and the storage module 43 is idle; and for the eleventh combination, whether the operation module 42 is idle and the storage module 43 is running.
At layer 2, for the twelfth combination it is determined whether the time for which the storage module 43 is idle exceeds a threshold and whether a cache miss occurs in the cache 413 when fetching instructions; for the thirteenth combination, whether the time for which the storage module 43 is idle waiting for registers to resolve dependencies exceeds a threshold; for the fourteenth combination, whether the instruction queue of the storage module 43 is full while memory-access microinstructions are being issued from the cache 413; for the fifteenth combination, whether the storage module 43 cannot accept microinstructions and is in a synchronous state; for the sixteenth combination, the ratio of the beats in which the operation module 42 executes instructions to the total beats is counted; for the seventeenth combination, whether the time for which the operation module 42 is idle exceeds a threshold and whether a cache miss occurs in the cache 413 when fetching instructions; for the eighteenth combination, whether the time for which the operation module 42 is idle waiting for registers to resolve dependencies exceeds a threshold; for the nineteenth combination, whether the instruction queue of the operation module 42 is full while operation microinstructions are being issued from the cache 413; for the twentieth combination, whether the operation module 42 cannot accept microinstructions and is in a synchronous state; and for the twenty-first combination, the ratio of the beats in which the storage module 43 executes instructions to the total beats is counted.
For any combination, if the corresponding condition is not satisfied, step 805 is performed: the corresponding performance counter is not driven, i.e., it does not count. If the corresponding condition is satisfied, step 806 is performed: the corresponding performance counter is driven to count.
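Per beat, steps 804 to 806 can be pictured as in the following sketch, which continues the CounterBank assumption above; conditions is an assumed mapping from combination numbers to predicates over a snapshot of the relevant hardware signals.

```python
# Sketch of steps 804-806 for a single beat.
def tick(bank, conditions, snapshot):
    for combo_id, condition in conditions.items():
        if condition(snapshot):               # step 804: is the condition met?
            bank.counts[combo_id - 1] += 1    # step 806: drive the counter
        # step 805: otherwise the counter is simply not driven this beat
```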
In step 807, the front-end bottlenecks and the back-end bottlenecks are evaluated based on the count values of the performance counters. The count value of a performance counter in this embodiment reflects the average occurrence, at run time, of its event combination in the pipeline of the processor core 306. In more detail, this embodiment may repeatedly perform steps 804-806 over a period (e.g., 100 beats); whenever a combination satisfies its corresponding condition, the corresponding performance counter counts, and the resulting count value represents the average occurrence of that event combination over the period. Because this embodiment layers the pipeline from top to bottom, the performance counters are likewise arranged from top to bottom, making it convenient for those skilled in the art to analyze the front-end and back-end bottlenecks layer by layer.
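Continuing the same sketch, step 807 then amounts to normalizing each count by the period and comparing sibling combinations layer by layer; the period of 100 beats below follows the example in this embodiment.

```python
# Sketch of step 807: counts over a period become occurrence ratios.
PERIOD = 100  # e.g. 100 beats

def evaluate(bank, combo_ids, period=PERIOD):
    return {i: bank.counts[i - 1] / period for i in combo_ids}

# Drill down layer by layer: if the layer 0 back end dominates
# (combinations 3 and 4), read the layer 1 back-end counters (9 to 11) next.
layer0_ratios = evaluate(bank, [1, 2, 3, 4])
```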
Aiming at the artificial intelligence accelerator, this embodiment uses performance counters to select the event combinations of interest, so hardware does not need to be reconfigured for each analysis; the pipeline is expanded from top to bottom in multiple layers, and the combinations of each layer are mutually exclusive and divided into a front-end bottleneck and a back-end bottleneck, so that performance bottlenecks can be located logically and rapidly.
Another embodiment of the present invention is a computer-readable storage medium having stored thereon computer program code for the artificial intelligence accelerator performance analysis method; when the code is executed by a processor, the method of the previous embodiments is performed. In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when the aspects of the present invention are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the method described in an embodiment of the present invention. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The present invention has the following technical effects. The invention configures performance counters for the event combinations that may constitute performance bottlenecks; at each performance analysis, the combinations of interest can be selected without reconfiguring any hardware, since only the specific performance counters need to be read, which makes the performance analysis scheme of the invention light and flexible. The performance analysis scheme unfolds from top to bottom in multiple layers, and the events of each layer are mutually exclusive and divided into a front-end bottleneck and a back-end bottleneck, so that performance bottlenecks can be located logically and rapidly. The invention defines the event combinations for the specific structure of the artificial intelligence accelerator, so the performance bottleneck of the artificial intelligence accelerator can be found directly.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present invention may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, etc. Furthermore, the electronic equipment or the electronic device can be used in cloud end, edge end, terminal and other application scenes related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the high-power electronic device or apparatus according to the present invention may be applied to a cloud device (e.g., a cloud server), and the low-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for simplicity, the present invention presents some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will understand that the aspects of the present invention are not limited by the order of the acts described. Accordingly, in light of the present disclosure or teachings, those skilled in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described herein may be regarded as alternative embodiments, i.e., the acts or modules involved are not necessarily required for the implementation of some or all aspects of the present invention. In addition, the descriptions of some embodiments of the present invention each have their own emphasis depending on the scheme concerned. In view of this, those skilled in the art will appreciate that for portions of one embodiment of the invention that are not described in detail, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, those skilled in the art will appreciate that several embodiments of the present disclosure may be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are split in consideration of the logic function, and there may be another splitting manner when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separated, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solution according to the embodiments of the present invention. In addition, in some scenarios, multiple units in embodiments of the invention may be integrated into one unit or each unit may physically reside separately.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage media or magneto-optical storage media, etc.), which may be, for example, resistive random access memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
The foregoing may be better understood in light of the following clauses:
Clause A1. A processor core for pipeline performance analysis, comprising a control module, an operation module, and a storage module, wherein the control module, the operation module, and the storage module correspond to a plurality of events of the pipeline, the events are combined according to different conditions, and the combinations are mutually exclusive and divided into a front-end bottleneck and a back-end bottleneck; the processor core is characterized by further comprising a plurality of performance counters respectively associated with the combinations, wherein when a combination satisfies the corresponding condition, the corresponding performance counter counts.
Clause a2. The processor core of clause A1, wherein the performance counter is associated with a combination of the plurality of events of the plurality of layers layered by the pipeline, the performance counter of a single layer generalizing to the performance counter of a previous layer and expanding the performance counter of a next layer according to the pipeline.
Clause a3 the processor core of clause A2, wherein the plurality of events includes an operational event and an idle event due to a specific factor.
Clause a4 the processor core of clause A2, wherein the plurality of events comprises a control pipeline event subdivided by control pipeline time, an operation pipeline event subdivided by operation type, and a memory pipeline event subdivided by memory type or memory location.
Clause a5 the processor core of clause A1, wherein the front-end bottleneck comprises instructions failing to issue and the back-end bottleneck comprises failing to operate or access.
Clause a6 the processor core of clause A1, wherein the plurality of performance counters comprises a layer 0 performance counter, the layer 0 performance counter being associated with a combination of an instruction issue event of the control module and an operation access event of the operation module and the storage module.
Clause A7. the processor core of clause A6, the plurality of performance counters comprising a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of an instruction fetch blocking event and a downstream blocking event of the instruction issue event.
Clause A8. the processor core of clause A6, the plurality of performance counters comprising a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of an operational event of the operational module and a memory access event of the memory module.
Clause A9. the processor core of clause A8, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of at least one of the following events: at least one of a convolution operation event, a scalar operation event and a vector operation event of convolution operation in the operation module; memory access events stored on-chip or off-chip in the memory module; and the data input/output event or the data handling event in the storage module.
Clause a10 the processor core of clause A7, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of instruction fetch blocking events and instruction downstream blocking events of the operational events.
Clause a11 the processor core of clause A7, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of an instruction fetch blocking event and an instruction downstream blocking event in the memory event.
Clause a12 the processor core of clause A7, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower tier expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with an operation proportion event of operation time of the operation module accounting for overall time, or a combination of access proportion events of access time of the storage module accounting for overall time.
Clause a13 the processor core of clause A6, wherein the plurality of performance counters comprises a first performance counter to count a front end bottleneck of the layer 0, the first performance counter counting when the instruction issue event is running and the condition that the operation access event is idle is satisfied.
Clause a14 the processor core of clause A6, wherein the plurality of performance counters comprises a second performance counter to count a front end bottleneck of the layer 0, the second performance counter counting when the instruction issue event is idle and the condition that the operation access event is idle is satisfied.
Clause a15 the processor core of clause A8, wherein the plurality of performance counters comprises a third performance counter to count a back-end bottleneck of the layer 0, the third performance counter counting when the instruction issue event is idle and the condition that the operation access event is running is satisfied.
Clause a16 the processor core of clause a15, wherein the plurality of performance counters comprises a fourth performance counter to count a back-end bottleneck of the layer 0, the fourth performance counter counting when the instruction issue event is running and the condition that the operation access event is running is satisfied.
Clause a17 the processor core of clause a13, wherein the control module comprises a cache for temporarily storing microinstructions, and the plurality of performance counters comprises a fifth performance counter for counting the front-end bottleneck of the layer 1, the fifth performance counter counting when, on the premise that the first performance counter counts, the following conditions are satisfied: microinstructions in the cache are being issued; and the instruction queue of the operation module or the storage module is full.
Clause a18 the processor core of clause a13, wherein the control module comprises a cache for temporarily storing microinstructions, and the plurality of performance counters comprises a sixth performance counter for counting the front-end bottleneck of the layer 1, the sixth performance counter counting when, on the premise that the first performance counter counts, the following conditions are satisfied: microinstructions in the cache are being issued; and the operation module or the storage module is in a synchronous state.
Clause a19 the processor core of clause a14, wherein the plurality of performance counters comprises a seventh performance counter to count a front end bottleneck of the layer 1, the seventh performance counter counting if the second performance counter counts when the following condition is satisfied: occurrence of a cache miss or branch prediction error causes the pipeline interrupt time to exceed a threshold.
Clause a20 the processor core of clause a14, wherein the plurality of performance counters comprises an eighth performance counter to count the front end bottleneck of the layer 1, the eighth performance counter counting when the following condition is satisfied on the premise that the second performance counter counts: instruction wait register de-dependencies cause the pipeline interrupt time to exceed a threshold.
Clause a21 the processor core of clause a16, wherein the plurality of performance counters comprises a ninth performance counter to count the back-end bottleneck of the layer 1, the ninth performance counter counting, if the third performance counter or the fourth performance counter counts, when both of the following conditions are met: the operation event is running; and the memory access event is running.
Clause a22 the processor core of clause a16, wherein the plurality of performance counters comprises a tenth performance counter to count the back-end bottleneck of the layer 1, the tenth performance counter counting, if the third performance counter counts, when both of the following conditions are met: the operation event is running; and the memory access event is idle.
Clause a23 the processor core of clause a16, wherein the plurality of performance counters comprises an eleventh performance counter to count the back-end bottleneck of the layer 1, the eleventh performance counter counting, if the third performance counter counts, when both of the following conditions are met: the operation event is idle; and the memory access event is running.
Clause a24 an artificial intelligence accelerator comprising a processor core according to any of clauses A1 to 23.
Clause a25 an integrated circuit device comprising the artificial intelligence accelerator of clause a24.
Clause a26 a board comprising the integrated circuit device of clause a25.
Clause a27. A method of artificial intelligence accelerator performance analysis, the artificial intelligence accelerator comprising a processor core, the method comprising: splitting a pipeline of the processor core into a plurality of events; forming a plurality of combinations based on the plurality of events, wherein the plurality of combinations are mutually exclusive and are divided into a front-end bottleneck and a back-end bottleneck; configuring a performance counter associated with the plurality of combinations; judging whether the combinations meet corresponding conditions or not; if so, driving the corresponding performance counter to count; and evaluating the front-end bottleneck and the back-end bottleneck according to the count value of the performance counter.
Clause a28 the method of clause a27, wherein the performance counters of a single layer are generalized to the performance counters of a previous layer and the performance counters of a next layer are expanded according to the pipeline.
Clause a29 the method of clause a28, wherein the plurality of events comprises an operational event and an idle event due to a specific factor.
Clause a30 the method of clause a28, wherein the plurality of events comprises a control pipeline event subdivided by control pipeline time, an operation pipeline event subdivided by operation type, and a memory pipeline event subdivided by memory type or memory location.
Clause a31 the method of clause a27, wherein the front end bottleneck comprises instructions failing to issue and the back end bottleneck comprises failing to operate or failing to access.
Clause a32 the method of clause a27, wherein the plurality of performance counters comprises a layer 0 performance counter, the layer 0 performance counter being associated with a combination of an instruction issue event of the control module, an operation access event of the operation module and the storage module.
Clause a33 the method of clause a32, the plurality of performance counters comprising a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of instruction fetch blocking events and downstream blocking events of the instruction issue events.
Clause a34 the method of clause a32, the plurality of performance counters comprises a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of an operational event of the operational module and a memory access event of the memory module.
Clause a35 the method of clause a34, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of at least one of the following events: at least one of a convolution operation event, a scalar operation event and a vector operation event of convolution operation in the operation module; memory access events stored on-chip or off-chip in the memory module; and the data input/output event or the data handling event in the storage module.
Clause a36 the method of clause a33, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of instruction fetch blocking events and instruction downstream blocking events of the operational events.
Clause a37 the method of clause a33, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of instruction fetch blocking events and instruction downstream blocking events in the memory event.
Clause a38 the method of clause a33, the plurality of performance counters comprises a layer 2 performance counter, the layer 2 performance counter being a lower tier expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with an operation proportion event of operation time of the operation module accounting for overall time, or a combination of access proportion events of access time of the storage module accounting for overall time.
Clause a39. A computer readable storage medium having stored thereon computer program code of an artificial intelligence accelerator performance analysis method, which, when executed by a processing device, performs the method of any of clauses a27 to 38.
The foregoing describes the embodiments of the present invention in detail, applying specific examples to explain the principles and implementations of the present invention; the above descriptions of the embodiments are provided solely to facilitate understanding of the method of the present invention and its core ideas. Meanwhile, since those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present invention, in view of the above, the contents of this description should not be construed as limiting the present invention.

Claims (39)

1. A processor core for pipeline performance analysis, comprising a control module, an operation module and a storage module, wherein the control module, the operation module and the storage module correspond to a plurality of events of the pipeline, the events are combined according to different conditions, and the combinations are mutually exclusive and divided into a front-end bottleneck and a back-end bottleneck; characterized in that the processor core further comprises a plurality of performance counters respectively associated with the combinations, wherein when a combination satisfies the corresponding condition, the corresponding performance counter counts.
2. The processor core of claim 1, wherein the performance counter is associated with a combination of the plurality of events of the plurality of layers layered by the pipeline, the performance counter of a single layer generalizing to the performance counter of a previous layer and expanding the performance counter of a next layer according to the pipeline.
3. The processor core of claim 2, wherein the plurality of events comprises a run event and an idle event due to a particular factor.
4. The processor core of claim 2, wherein the plurality of events comprises control pipeline events subdivided by control pipeline time, operation pipeline events subdivided by operation type, and memory pipeline events subdivided by memory type or memory location.
5. The processor core of claim 1, the front-end bottleneck comprising instructions failing to issue, the back-end bottleneck comprising failing to operate or failing to access.
6. The processor core of claim 1, the plurality of performance counters comprising a layer 0 performance counter, the layer 0 performance counter being associated to a combination of instruction issue events of the control module, operation access events of the operation module and the storage module.
7. The processor core of claim 6, the plurality of performance counters comprising a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of instruction fetch blocking events and downstream blocking events of the instruction issue events.
8. The processor core of claim 6, the plurality of performance counters comprising a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of an operational event of the operational module and a memory access event of the storage module.
9. The processor core of claim 8, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of at least one of:
at least one of a convolution operation event, a scalar operation event and a vector operation event of convolution operation in the operation module;
memory access events stored on-chip or off-chip in the memory module;
and the data input/output event or the data handling event in the storage module.
10. The processor core of claim 7, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of instruction fetch blocking events and instruction downstream blocking events of the operation events.
11. The processor core of claim 7, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated to a combination of instruction fetch blocking events and instruction downstream blocking events in the memory access event.
12. The processor core of claim 7, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-level expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with an operation proportion event of operation time of the operation module to an overall time, or a combination of access proportion events of access time of the storage module to an overall time.
13. The processor core of claim 6, wherein the plurality of performance counters includes a first performance counter to count a front end bottleneck of the layer 0, the first performance counter counting when the instruction issue event is running and a condition that the operation access event is idle is satisfied.
14. The processor core of claim 6, wherein the plurality of performance counters includes a second performance counter to count a front end bottleneck of the layer 0, the second performance counter counting when the instruction issue event is idle and a condition of the operation access event idle is satisfied.
15. The processor core of claim 8, wherein the plurality of performance counters includes a third performance counter to count a back-end bottleneck of the layer 0, the third performance counter counting when the instruction issue event is idle and the condition that the operation access event is running is satisfied.
16. The processor core of claim 15, wherein the plurality of performance counters includes a fourth performance counter to count a back end bottleneck of the layer 0, the fourth performance counter counting when the instruction issue event is running and the condition that the operation access event is running is satisfied.
17. The processor core of claim 13, wherein the control module includes a cache that buffers microinstructions, and the plurality of performance counters includes a fifth performance counter to count a front end bottleneck of the layer 1, the fifth performance counter counting, if the first performance counter counts, when the following conditions are met:
microinstructions in the cache are being issued;
the instruction queue of the operation module or the storage module is full.
18. The processor core of claim 13, wherein the control module includes a cache that buffers microinstructions, and the plurality of performance counters includes a sixth performance counter to count a front end bottleneck of the layer 1, the sixth performance counter counting, if the first performance counter counts, when the following conditions are met:
microinstructions in the cache are being issued;
the operation module or the storage module is in a synchronous state.
19. The processor core of claim 14, wherein the plurality of performance counters includes a seventh performance counter to count a front end bottleneck of the layer 1, the seventh performance counter to count if the second performance counter counts when the following condition is met:
occurrence of a cache miss or branch prediction error causes the pipeline interrupt time to exceed a threshold.
20. The processor core of claim 14, the plurality of performance counters comprising an eighth performance counter to count a front end bottleneck of the layer 1, the eighth performance counter to count if the second performance counter counts when the following condition is met:
instruction wait register de-dependencies cause the pipeline interrupt time to exceed a threshold.
21. The processor core of claim 16, wherein the plurality of performance counters includes a ninth performance counter to count a back end bottleneck of the layer 1, the ninth performance counter counting, if the third performance counter or the fourth performance counter counts, when both of the following conditions are met:
the operation event is running;
and the memory access event is running.
22. The processor core of claim 16, wherein the plurality of performance counters includes a tenth performance counter to count a back-end bottleneck of the layer 1, the tenth performance counter counting, if the third performance counter counts, when both of the following conditions are met:
the operation event is running;
and the memory access event is idle.
23. The processor core of claim 16, wherein the plurality of performance counters includes an eleventh performance counter to count a back-end bottleneck of the layer 1, the eleventh performance counter counting, if the third performance counter counts, when both of the following conditions are met:
the operation event is idle;
and the memory access event is running.
24. An artificial intelligence accelerator comprising a processor core according to any one of claims 1 to 23.
25. An integrated circuit device comprising the artificial intelligence accelerator of claim 24.
26. A board card comprising the integrated circuit device of claim 25.
27. A method of artificial intelligence accelerator performance analysis, the artificial intelligence accelerator comprising a processor core, the method comprising:
Splitting a pipeline of the processor core into a plurality of events;
forming a plurality of combinations based on the plurality of events, wherein the plurality of combinations are mutually exclusive and are divided into a front-end bottleneck and a back-end bottleneck;
configuring a performance counter associated with the plurality of combinations;
judging whether the combinations meet corresponding conditions or not;
if so, driving the corresponding performance counter to count;
and evaluating the front-end bottleneck and the back-end bottleneck according to the count value of the performance counter.
28. The method of claim 27, wherein a single layer of performance counters generalizes to a previous layer of performance counters and spreads out a next layer of performance counters according to the pipeline.
29. The method of claim 28, wherein the plurality of events comprises an operational event and an idle event due to a particular factor.
30. The method of claim 28, wherein the plurality of events comprises control pipeline events subdivided by control pipeline time, operation pipeline events subdivided by operation type, and memory pipeline events subdivided by memory type or memory location.
31. The method of claim 27, the front-end bottleneck comprising instructions failing to issue, the back-end bottleneck comprising failing to operate or failing to access.
32. The method of claim 27, the plurality of performance counters comprising a layer 0 performance counter, the layer 0 performance counter being associated with a combination of instruction issue events of the control module, operation access events of the operation module and the storage module.
33. The method of claim 32, the plurality of performance counters comprising a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of instruction fetch blocking events and downstream blocking events of the instruction issue events.
34. The method of claim 32, the plurality of performance counters comprising a layer 1 performance counter, the layer 1 performance counter being a lower-layer expanded performance counter of the layer 0 performance counter, the layer 1 performance counter being associated with a combination of an operational event of the operational module and a memory access event of the storage module.
35. The method of claim 34, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of at least one of:
At least one of a convolution operation event, a scalar operation event and a vector operation event of convolution operation in the operation module;
memory access events stored on-chip or off-chip in the memory module;
and the data input/output event or the data handling event in the storage module.
36. The method of claim 33, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of instruction fetch blocking events and instruction downstream blocking events of the operation events.
37. The method of claim 33, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower-layer expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with a combination of instruction fetch blocking events and instruction downstream blocking events in the memory event.
38. The method of claim 33, the plurality of performance counters comprising a layer 2 performance counter, the layer 2 performance counter being a lower tier expanded performance counter of the layer 1 performance counter, the layer 2 performance counter being associated with an operation proportion event of operation time of the operation module to total time, or a combination of access proportion events of access time of the storage module to total time.
39. A computer readable storage medium having stored thereon computer program code of an artificial intelligence accelerator performance analysis method, which when run by a processing device, performs the method of any of claims 27 to 38.
CN202210667147.0A 2022-06-13 2022-06-13 Method and equipment for analyzing pipeline performance of artificial intelligent accelerator Pending CN117271249A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210667147.0A CN117271249A (en) 2022-06-13 2022-06-13 Method and equipment for analyzing pipeline performance of artificial intelligent accelerator
PCT/CN2023/099399 WO2023241478A1 (en) 2022-06-13 2023-06-09 Artificial intelligence accelerator pipeline performance analysis method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210667147.0A CN117271249A (en) 2022-06-13 2022-06-13 Method and equipment for analyzing pipeline performance of artificial intelligent accelerator

Publications (1)

Publication Number Publication Date
CN117271249A (en) 2023-12-22

Family

ID=89192191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210667147.0A Pending CN117271249A (en) 2022-06-13 2022-06-13 Method and equipment for analyzing pipeline performance of artificial intelligent accelerator

Country Status (2)

Country Link
CN (1) CN117271249A (en)
WO (1) WO2023241478A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446029B1 (en) * 1999-06-30 2002-09-03 International Business Machines Corporation Method and system for providing temporal threshold support during performance monitoring of a pipelined processor
US7895382B2 (en) * 2004-01-14 2011-02-22 International Business Machines Corporation Method and apparatus for qualifying collection of performance monitoring events by types of interrupt when interrupt occurs
US20070139421A1 (en) * 2005-12-21 2007-06-21 Wen Chen Methods and systems for performance monitoring in a graphics processing unit
CN102262608A (en) * 2011-07-28 2011-11-30 中国人民解放军国防科学技术大学 Method and device for controlling read-write operation of processor core-based coprocessor
US10386410B2 (en) * 2016-12-12 2019-08-20 Samsung Electronics Co., Ltd. Highly flexible performance counter and system debug module
US11126721B2 (en) * 2018-06-28 2021-09-21 Intel Corporation Methods, systems and apparatus to detect polymorphic malware

Also Published As

Publication number Publication date
WO2023241478A1 (en) 2023-12-21

Similar Documents

Publication Publication Date Title
US11782710B2 (en) Execution or write mask generation for data selection in a multi-threaded, self-scheduling reconfigurable computing fabric
US11915057B2 (en) Computational partition for a multi-threaded, self-scheduling reconfigurable computing fabric
US11836524B2 (en) Memory interface for a multi-threaded, self-scheduling reconfigurable computing fabric
US20220188265A1 (en) Loop Thread Order Execution Control of a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric
US20230153258A1 (en) Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric
US9110708B2 (en) Region-weighted accounting of multi-threaded processor core according to dispatch state
CN101183315A (en) Paralleling multi-processor virtual machine system
CN105183698A (en) Control processing system and method based on multi-kernel DSP
CN1983164A (en) Scalable parallel pipeline floating-point unit for vector processing
CN111857669A (en) Software and hardware decoupling software radar system, real-time design method and server
CN101751373A (en) Configurable multi-core/many core system based on single instruction set microprocessor computing unit
CN111767995B (en) Operation method, device and related product
CN117271249A (en) Method and equipment for analyzing pipeline performance of artificial intelligent accelerator
CN105957131B (en) Graphic system and its method
CN111258732B (en) Data processing method, data processing device and electronic equipment
WO2023241479A1 (en) Method and equipment for analyzing timeline performance on assembly line
CN111767999B (en) Data processing method and device and related products
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113791996B (en) Integrated circuit device, electronic apparatus, board and computing method
CN111783954B (en) Method, electronic device and storage medium for determining performance of neural network
WO2023236479A1 (en) Method for executing task scheduling and related products thereof
US20230125149A1 (en) Fractional Force-Quit for Reconfigurable Processors
CN116302459A (en) Method for memory management of an artificial intelligence computing system and related products
WO2023016382A1 (en) Method for system on chip, and related product thereof
CN117667211A (en) Instruction synchronous control method, synchronous controller, processor, chip and board card

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination