CN111506384A - Simulation operation method and simulator

Simulation operation method and simulator

Info

Publication number
CN111506384A
Authority
CN
China
Prior art keywords
event
neural network
instruction
determining
sub
Prior art date
Legal status
Granted
Application number
CN201910097439.3A
Other languages
Chinese (zh)
Other versions
CN111506384B (en)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN201910097439.3A
Publication of CN111506384A
Application granted
Publication of CN111506384B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504 Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45508 Runtime interpretation or emulation, e.g. emulator loops, bytecode interpretation


Abstract

The present disclosure relates to a simulation operation method and a simulator. The method is used for simulating the execution of a neural network operation and comprises the following steps: receiving and storing operational data, wherein the operational data comprises a neural network operation instruction and data for executing the neural network operation instruction; parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event; and obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes. The method and the device can rapidly simulate the operation of a neural network.

Description

Simulation operation method and simulator
Technical Field
The present disclosure relates to the field of machine learning, and in particular, to a simulation operation method, a simulator, a machine learning chip, and an electronic device.
Background
Computer simulation refers to simulating a real-world process or system with a simulator developed in computer software. Computer system simulators have become an indispensable tool in the field of computer architecture research. With a simulator, researchers can configure and observe hardware and software efficiently and at low cost, and thereby guide the design and optimization of the hardware and software.
Neural networks have been applied with great success, and neural network accelerators are widely used to run neural network applications. Before a neural network accelerator is formally applied, developers need to evaluate the performance of the accelerator. Researchers typically employ cycle-accurate simulators for performance simulation of hardware. A cycle-accurate simulator requires each simulated hardware module to carry out, in every clock cycle, the various details of its operation, including state machine transitions, register changes, pipeline-stage operations and the like, so as to maintain consistency between the simulator and the hardware. Such accurate simulation incurs huge resource, energy and time overhead, which in turn means that cycle-accurate simulators cannot meet practical requirements. Therefore, how to increase the running speed of the simulator has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the disclosure provides a simulation operation method, a simulator, a machine learning chip, an electronic device and a storage medium, which can conveniently improve the simulation operation speed.
According to a first aspect of the present disclosure, there is provided a simulation operation method for simulating execution of a neural network operation, including:
receiving and storing operational data, wherein the operational data comprises a neural network operational instruction and data for executing the neural network operational instruction;
parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event;
obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes.
In some possible embodiments, the parsing of a plurality of operation sub-instructions from the neural network operation instruction includes:
performing a decoding operation on the received neural network operation instruction to obtain the plurality of operation sub-instructions.
In some possible embodiments, the determining the plurality of event processes for completing the plurality of operation sub-instructions comprises at least one of:
determining the number of the event processes according to the number of the operation sub-instructions;
determining each event process and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions.
In some possible embodiments, the determining the number of event processes according to the number of operation sub-instructions includes:
when N operation sub-instructions are parsed from the neural network operation instruction, determining the number of the event processes to be N+2, wherein N is a positive integer greater than or equal to 1.
In some possible embodiments, the determining each event process and the execution sequence of each event process according to the execution sequence of each operation sub-instruction includes:
determining a first event process, the first event process comprising a first load event;
determining a second event process, the second event process comprising a second load event and a first operation event;
determining a third event process, wherein the third event process comprises a third load event, a second operation event and a first storage event;
determining an ith event process, wherein the ith event process comprises an ith load event, an (i-1)th operation event and an (i-2)th storage event;
determining an (N+1)th event process, wherein the (N+1)th event process comprises an Nth operation event and an (N-1)th storage event;
determining an (N+2)th event process, wherein the (N+2)th event process comprises an Nth storage event, i is an integer greater than 3 and less than or equal to N, and N is the number of operation sub-instructions and is a positive integer greater than or equal to 1;
wherein the jth load event is used for loading the operation data of the jth operation sub-instruction, the jth storage event is used for storing the operation result of the jth operation sub-instruction, the jth operation event is used for executing the operation of the jth operation sub-instruction, and j is a positive integer greater than 0 and less than or equal to N.
In some possible embodiments, each event process further comprises a synchronization event, and the synchronization time of the synchronization event of each event process is determined according to the execution time of each event in that event process.
In some possible embodiments, the obtaining at least one of an operation time and an operation result for completing the operation instruction of the neural network based on the determined event processes includes:
determining the execution time of each event process according to the execution time of each event in each event process;
and acquiring the operation time for completing the neural network operation instruction according to the execution time of each event process.
In some possible embodiments, the obtaining at least one of an operation time and an operation result for completing the operation instruction of the neural network based on the determined event processes includes:
executing each of the determined event processes;
and obtaining the operation result of the neural network operation according to the operation result of each event process.
In some possible embodiments, the method further comprises:
when executing each event process, determining an operation program for executing the operation event according to the operation type of the operation sub-instruction corresponding to the operation event in the event process;
and executing the corresponding operation event according to the determined operation program.
In some possible embodiments, the determining, according to the operation type of the operation event in the event process, an operation program for executing the operation event includes:
when the operation sub-instruction corresponding to the operation event is a first type of operation, executing the operation event by using a first operation program;
when the operation sub-instruction corresponding to the operation event is a second type of operation, executing the operation event by using a second operation program;
wherein the first type of operation includes at least one of a vector operation, a scalar operation, and a nonlinear operation, and the second type of operation includes a matrix vector operation.
According to a second aspect of the present disclosure, there is provided a simulator comprising:
a processor for performing the method of any one of the first aspect.
According to a third aspect of the present disclosure, there is provided a machine learning chip for performing the method of any one of the first aspect.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the chip according to the third aspect.
According to a fifth aspect of the present disclosure, there is provided a computer readable storage medium having stored therein computer program instructions which, when executed by a processor, implement the method of any one of the first aspects.
According to the embodiments of the present disclosure, a plurality of matched event processes can be formed according to the neural network operation to be executed, and the events can be executed in the order of the event processes, so that each operation is simulated and executed in sequence; the operation process can be executed in an event-triggered manner, allowing fast performance simulation of the neural network operation. The simulation is performed in units of events: an event is usually a user-defined event, and each event can trigger one simulation operation, so that the neural network operation process can be executed effectively, the result of each process can be obtained, and the simulated execution process can be analyzed conveniently.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow diagram of a simulation operation method according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an apparatus for performing neural network operations according to an embodiment of the present disclosure;
FIG. 3 shows a block diagram of an operation module according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of step S300 of a simulation operation method according to an embodiment of the disclosure;
FIG. 5 shows a flowchart of step S300 of a simulation operation method according to an embodiment of the disclosure;
FIG. 6 shows a process diagram of a simulation operation method according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principle and logic; for brevity, the details are not repeated in the present disclosure.
The embodiments of the disclosure can simulate the execution of operations related to machine learning algorithms. For example, the simulation can be used to estimate the operation time of a neural network operation, and to simulate the execution of the neural network operation to obtain corresponding operation results, so that information such as the operation result, operation time and operation speed can be obtained by executing the corresponding machine learning operation, thereby enabling analysis of the machine learning algorithm. The embodiments of the disclosure determine, by parsing a received neural network operation instruction, a plurality of event processes for executing the instruction; the event processes can be executed in sequence, which reduces the time for analyzing a neural network algorithm and increases the running speed of the simulator.
FIG. 1 shows a flow diagram of a simulation operation method according to an embodiment of the present disclosure. As shown in FIG. 1, the simulation operation method of the embodiment of the present disclosure may include:
S100: receiving and storing operational data, wherein the operational data comprises a neural network operational instruction and data for executing the neural network operational instruction;
S200: parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event;
S300: obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes.
The simulation operation method of the embodiment of the present disclosure may be applied to a simulator, which may perform analysis of neural network operation through simulation of processes such as data storage, data operation, and the like, for example, may be used to analyze operation time or operation result of the neural network operation, and the like. Wherein the simulator may comprise a performance simulator, but the present disclosure is not limited thereto.
In step S100, operation data to be subjected to analog operation may be received first, where the operation data may include a neural network operation instruction and data required to execute the instruction, or may also include information such as a storage address and a data amount of each data to execute the neural network operation instruction.
Upon receiving the operation data, embodiments of the present disclosure may simulate the storage of the operation data, e.g., may simulate a storage operation that stores the operation data into a storage area. In practical applications, an apparatus performing the neural network operation may include a control module, an operation module, and a storage module. The simulator of the embodiment of the disclosure can simulate storing the operation data into a preset storage space of the storage module, so that the operation data can be read and called by subsequent operations.
After the operation data is obtained and stored, the simulator may, in step S200, simulate parsing the operation data and determine the event processes for performing the neural network operation. For example, it can simulate the parsing of the neural network operation instruction by the control module and the determination and control of the corresponding event processes. That is, the simulator according to the embodiment of the present disclosure may parse a plurality of operation sub-instructions from the received neural network operation instruction by means of the simulated control module, so as to simulate and analyze the operations of the plurality of operation sub-instructions respectively.
In some possible embodiments, the parsing of a plurality of operation sub-instructions from the neural network operation instruction may include: performing a decoding operation on the received neural network operation instruction to obtain the plurality of operation sub-instructions. That is, in order to improve data security and reduce data size, the neural network operation instruction received in the embodiment of the present disclosure may be an encoded instruction; after the instruction is received, a decoding operation may be performed to parse the plurality of operation sub-instructions from it. The present disclosure does not specifically limit the encoding and decoding of the neural network operation instruction, and those skilled in the art can select an appropriate mode to perform the encoding and decoding operations.
In addition, after the plurality of operation sub-instructions corresponding to the neural network operation instruction are obtained, the event processes for completing the plurality of operation sub-instructions may be determined, where each event process may include at least one event, and the type of an event may be at least one of a load event, an operation event, a store event, and a synchronization event.
The embodiment of the present disclosure may determine a plurality of event processes according to a plurality of operation sub-instructions corresponding to the received neural network operation instruction, where each event process of the plurality of event processes may include at least one event, and there is no dependency relationship between events included in each event process. Specifically, in step S200 of the embodiment of the present disclosure, determining a plurality of event processes for completing the plurality of operation sub-instructions may include at least one of the following:
a) determining the number of the event processes according to the number of the operation sub-instructions;
in the embodiment of the present disclosure, when N operation sub-instructions are resolved from the neural network operation instruction plan, it is determined that the number of the event processes is N +2, where N is a positive integer greater than or equal to 1.
In the embodiment of the disclosure, in order to enable the operation sub-instructions to be executed in order and to be simulated, run and analyzed quickly, a plurality of event processes are determined. The relationship to be satisfied among the event processes is that the events within each event process have no dependency relationship on one another, where events without a dependency relationship are events whose operations use data with no overlapping relationship. For example, there is no dependency relationship between a Load/Store event and a Compute event, no dependency relationship between a Load event and a Store event, and no dependency relationship among a Matrix Compute event, a Vector Compute event and a Scalar Compute event.
b) determining each event process and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions.
As described above, the relationships among the event processes of the embodiments of the present disclosure are: the data required by the next event process is loaded in the previous event process; after the operation sub-instruction in an event process is completed, the storage of the corresponding operation result is executed in a subsequent event process; and the event processes are synchronized by synchronization events, i.e., the next event process can be started through a synchronization event only after all events in the previous event process have completed. Specifically, determining the sequence of the event processes in the embodiment of the present disclosure may include:
determining a first event process, the first event process comprising a first load event;
determining a second event process, the second event process comprising a second load event and a first operation event;
determining a third event process, wherein the third event process comprises a third load event, a second operation event and a first storage event;
determining an ith event process, wherein the ith event process comprises an ith load event, an (i-1)th operation event and an (i-2)th storage event;
determining an (N+1)th event process, wherein the (N+1)th event process comprises an Nth operation event and an (N-1)th storage event;
determining an (N+2)th event process, wherein the (N+2)th event process comprises an Nth storage event, i is an integer greater than 3 and less than or equal to N, and N is the number of operation sub-instructions and is a positive integer greater than or equal to 1;
wherein the jth load event is used for loading the operation data of the jth operation sub-instruction, the jth storage event is used for storing the operation result of the jth operation sub-instruction, the jth operation event is used for executing the operation of the jth operation sub-instruction, and j is a positive integer greater than 0 and less than or equal to N.
In this way, the plurality of event processes corresponding to the operation sub-instructions can be determined, and each event process and the execution order of the event processes can be determined according to the execution order of the operation sub-instructions, as illustrated by the sketch below.
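As a concrete illustration, the schedule above can be generated mechanically from the number N of operation sub-instructions. The following Python sketch is illustrative only; the function and event names are assumptions made for this document, not an API defined by the disclosure:

```python
def build_event_processes(n):
    """Build the N+2 pipelined event processes for n operation sub-instructions.

    Event process p (1-indexed) may contain: Load p, Compute p-1 and
    Store p-2, whenever those indices fall within [1, n], plus a closing
    Synchronize event.
    """
    assert n >= 1
    processes = []
    for p in range(1, n + 3):                    # N + 2 event processes in total
        events = []
        if p <= n:
            events.append(("Load", p))           # load data of sub-instruction p
        if 1 <= p - 1 <= n:
            events.append(("Compute", p - 1))    # run sub-instruction p-1
        if 1 <= p - 2 <= n:
            events.append(("Store", p - 2))      # store result of sub-instruction p-2
        events.append(("Synchronize", p))        # barrier closing this process
        processes.append(events)
    return processes

# Example: n = 4 sub-instructions yields 6 event processes.
for i, events in enumerate(build_event_processes(4), start=1):
    print(f"event process {i}: {events}")
```

Note that no two events inside one process touch the same sub-instruction's data, which is exactly the no-dependency condition stated above.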
After the event processes are determined, the simulated control module can control the operation module and the storage module to execute the corresponding operations.
Fig. 2 shows a block diagram of an apparatus for performing neural network operations according to an embodiment of the present disclosure. In practical applications, the apparatus for computing a neural network operation may include a control module 300, a storage module 200 and an operation module 100.
The operation module 100 is configured to complete the received neural network operation and obtain an operation result. The operation module 100 may complete operations corresponding to each event process according to each event process determined by the control module 300, for example, load corresponding data from the storage module 200, complete corresponding operations, store operation results, and the like. The operation module 100 in the embodiment of the present disclosure may perform vector operation, scalar operation, nonlinear operation, matrix vector operation, and the like in the neural network operation. The operation module 100 may further include a buffer for buffering an intermediate result generated in the operation process and data applied in the operation process.
According to the above structure, the simulator can simulate the control operations of the control module 300 on the storage module 200 and the operation module 100, the read/write operations of the storage module 200, and the operations of the operation module 100.
Fig. 3 illustrates a block diagram of an operation module according to an embodiment of the present disclosure, wherein the operation module 100 of the embodiment of the present disclosure may include a master operation unit 101 and at least one slave operation unit 102.
The master operation unit 101 may be configured to load corresponding data from the storage module according to the control of the control module 300, execute the first type of operation in the operation sub-instruction corresponding to an operation event, and cache the operation result of the first type of operation; the embodiment of the disclosure can simulate the operation process of the master operation unit by using a first program.
The slave operation unit 102 may be configured to load corresponding data from the storage module and/or the master operation unit according to the control of the control module 300, execute the second type of operation in the operation sub-instruction corresponding to the operation event, and cache the operation result of the second type of operation. The disclosed embodiment may simulate the operation process of the slave operation unit 102 using a second program.
The first type of operation includes at least one of a vector operation, a scalar operation, and a nonlinear operation, and the second type of operation includes a matrix vector operation.
The master operation unit 101 can serve as the main data path and can simultaneously complete part of the vector operations and the scalar operations. The master operation unit 101 may cache corresponding data from the storage module according to the control of the control module, and may be configured to perform vector operations, scalar operations and nonlinear operations using the cached data. The vector operations include element-wise addition, element-wise multiplication and other operations; the scalar operations include the four basic scalar arithmetic operations; and the nonlinear operations mainly include transcendental functions such as exponential and hyperbolic functions and are used to support the operation of activation functions. The master operation unit 101 mainly performs the operations of the pooling layer, the BN (batch normalization) layer, the ROI pooling layer and the activation layer.
The slave operation unit 102 can serve as the main core operation module and mainly completes the matrix vector operations in the neural network algorithm; the operations of the convolutional layers, the fully connected layers and the LSTM layers in the neural network are all completed in the slave operation unit 102.
The master operation unit 101 and the slave operation unit 102 may each include a buffer for caching the data required by each sub-operation and for caching operation results. In the storage event according to the embodiment of the present disclosure, storing an operation result may mean caching the result of the corresponding operation sub-instruction in the buffer, or storing the result into the storage module.
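As an illustration of this master/slave split, the sketch below routes a Compute event to the first or second operation program according to its operation type. All names here are assumptions for illustration, and the two program bodies are placeholders rather than the disclosure's actual simulation programs:

```python
FIRST_TYPE = {"vector", "scalar", "nonlinear"}   # simulated by the first program
SECOND_TYPE = {"matrix_vector"}                  # simulated by the second program

def first_operation_program(data):
    # Placeholder for simulating the master operation unit 101.
    return sum(data)

def second_operation_program(data):
    # Placeholder for simulating the slave operation units 102.
    return [2.0 * x for x in data]

def run_compute_event(op_type, data):
    """Choose the simulation program for a Compute event by operation type."""
    if op_type in FIRST_TYPE:
        return first_operation_program(data)
    if op_type in SECOND_TYPE:
        return second_operation_program(data)
    raise ValueError(f"unknown operation type: {op_type}")

print(run_compute_event("scalar", [1.0, 2.0, 3.0]))  # 6.0
```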
As described above, the storage module 200 may be configured to store the data of the neural network operation; for example, the data may include the topology of the neural network, input parameters, weights, output partial sums and the like, and different types of data may be involved for different machine learning operations, which the embodiments of the present disclosure do not specifically limit. Since the capacity of the buffer on the operation module 100 is limited, operations such as storing, reading and writing of data can be performed through data exchange between the storage module and the buffers of the operation module.
In the embodiment of the present disclosure, the simulator may simulate a bidirectional data path between the master operation unit 101 and the slave operation unit 102: the path from the master operation unit 101 to the slave operation unit 102 is used to transmit input neurons, and the path from the slave operation unit 102 to the master operation unit 101 is used to transmit output neuron partial sums. The simulator may also simulate a bidirectional data path between the master operation unit 101 and the storage module 200, where the path from the storage module 200 to the master operation unit 101 is used for loading input neurons, input neuron indexes, output neuron partial sums and weight indexes, and the path from the master operation unit 101 to the storage module 200 is used for storing output neuron partial sums. The simulator may further simulate a unidirectional data path between the slave operation unit 102 and the storage module 200, where the path from the storage module 200 to the slave operation unit 102 is used for loading weights.
Specifically, the control module 300 according to the embodiment of the present disclosure may receive a neural network operation instruction to be executed, where the instruction may include the operation instruction that the neural network needs to execute and the storage addresses, in the storage module 200, of the data related to the operation; the control module 300 may parse the received neural network operation instruction into a plurality of operation sub-instructions, for example by instruction decoding, and correspondingly determine the storage address of the data needed by each operation sub-instruction.
The simulator in the embodiment of the disclosure may determine a plurality of event processes according to the determined plurality of operation sub-instructions and the execution sequence of each operation sub-instruction, where the plurality of event processes are used to complete the plurality of operation sub-instructions involved in the neural network operation. The event process of the embodiment of the present disclosure may include at least one event, and the type of the event may be a load event, an operation event, a storage event, and a synchronization event.
Each event can have corresponding descriptors, which can include parameters such as type, data, time and duration. The type descriptor describes the operations that the event needs to complete, including control (e.g. branch jumps), memory access (e.g. Load/Store to off-chip DDR, or Read/Write to the on-chip cache) and computation (e.g. matrix, vector, scalar and logic operations). The data descriptor describes the data related to the event, including information such as the data address, data size and data type. The time descriptor describes the time point at which the event is triggered. The duration descriptor describes the execution time of the event, which can be obtained by means of performance analysis (profiling) or modeling.
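One possible in-memory representation of these descriptors is sketched below; the field names are assumptions chosen for readability, not terminology from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class EventDescriptor:
    kind: str        # type descriptor: "control", "memory" or "compute"
    data_addr: int   # data descriptor: address of the data the event touches
    data_size: int   # data descriptor: amount of data involved
    data_type: str   # data descriptor: e.g. "fp16" or "int8"
    time: float      # time descriptor: point at which the event is triggered
    duration: float  # duration descriptor: execution time from profiling/modeling

# A hypothetical Load event touching 4 KB of fp16 data, triggered at t = 0:
load1 = EventDescriptor("memory", 0x1000, 4096, "fp16", 0.0, 2.5)
print(load1)
```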
As described above, events may include load events, compute events, store events and synchronization events, referred to below as Load events, Compute events, Store events and Synchronize events, in that order.
A Load event loads the data involved from the storage module 200 into the buffer of the operation module; the data may include input neurons, input neuron indexes, output partial sums, weights, weight indexes and the like.
A Compute event completes a corresponding operation of the neural network through the operation module. The operations of the neural network include matrix operations, vector operations and scalar operations. Matrix operations mainly comprise matrix vector operations and constitute the core operations of the neural network; most operations in the convolutional layers, fully connected layers and LSTM (long short-term memory) layers are matrix vector operations. Vector operations include vector inner product, element-wise addition and element-wise multiplication; the pooling layer, the BN (batch normalization) layer, the LSTM layer and the LRN (local response normalization) layer in the neural network all involve vector operations.
A Store event stores the data cached by the operation module 100 into the storage module 200.
A Synchronize event is a synchronization event. When a Synchronize event occurs, it must be ensured that all other events before the Synchronize event are completed before subsequent events are executed; i.e., the Synchronize event is equivalent to a synchronization signal used to synchronize the events that precede it.
For a synchronous event, the embodiments of the present disclosure may determine the time of the synchronous event according to the execution time of each event included in each event process.
In conjunction with the above process, step S300 of the embodiment of the present disclosure is described in detail below. FIG. 4 shows a flowchart of step S300 in the simulation operation method according to the embodiment of the present disclosure, where the obtaining of the operation time for completing the neural network operation instruction includes:
S301: determining the execution time of each event process according to the execution time of each event in each event process;
S302: acquiring the operation time for completing the neural network operation instruction according to the execution time of each event process.
The following describes a method for determining the execution time of each event of the simulator.
The execution time of a Load event is related to the data size and the bandwidth of the storage module, and can be calculated as t_Load = t_start + Data_Load / Bandwidth_Load, where t_Load is the Load event execution time, t_start is the start-up time of the storage module, Data_Load is the amount of data involved in the Load event (including all input neurons, input neuron indexes, output partial sums, weights and weight indexes), and Bandwidth_Load is the Load bandwidth of the off-chip memory.
The execution time of a Compute event is related to the amount of computation and the number of operation units, and can be calculated as t_Compute = Data_Compute / (FU × u%), where t_Compute is the Compute event execution time, Data_Compute is the amount of computation involved in the Compute event, FU is the number of operation units in the operation module, and u% is the utilization rate of the operation units. The neuron sparsity, weight sparsity, network topology (such as the scale of the convolutional layers and of the fully connected layers), data blocking and other factors of the neural network all affect the utilization rate of the operation units, so the utilization rate is a real-time value. The utilization rate of the operation units is predicted by modeling: for example, for the convolution operation of a neural network, the utilization rates under several groups of different network-scale configurations and sparsity configurations are actually measured, the utilization rate is then modeled on these data, and finally a quantitative relation between the utilization rate and the network configuration and sparsity is obtained. Note that the execution times of Matrix Compute events, Vector Compute events and Scalar Compute events can all be calculated using the above formula.
The execution time of a Store event is calculated similarly to the Load time: it is related to the data amount and the off-chip memory bandwidth, and can be calculated as t_Store = t_start + Data_Store / Bandwidth_Store, where t_Store is the Store event execution time, t_start is the start-up time of the off-chip memory, Data_Store is the amount of data involved in the event (the output partial sums), and Bandwidth_Store is the Store bandwidth of the off-chip memory.
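The three timing formulas translate directly into code. The sketch below restates them as Python functions; units are whatever the caller uses consistently (e.g. bytes and bytes per cycle), and the utilization is passed as a fraction in (0, 1]:

```python
def load_time(t_start, data_load, bandwidth_load):
    """t_Load = t_start + Data_Load / Bandwidth_Load."""
    return t_start + data_load / bandwidth_load

def compute_time(data_compute, fu, utilization):
    """t_Compute = Data_Compute / (FU * u), with utilization u in (0, 1]."""
    return data_compute / (fu * utilization)

def store_time(t_start, data_store, bandwidth_store):
    """t_Store = t_start + Data_Store / Bandwidth_Store."""
    return t_start + data_store / bandwidth_store

# Example with made-up numbers: a 64 KB load at 32 B/cycle after a
# 100-cycle storage-module start-up.
print(load_time(100, 64 * 1024, 32))  # 2148.0 cycles
```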
As described in the foregoing embodiments, the embodiments of the present disclosure may determine the time required for each event according to the events included in each event process; for example, the time of a Store event may be obtained from the amount of data stored, and the time of a Compute event may be obtained from the amount and type of computation. The execution time of each event process may then be determined as the maximum execution time of the events in that event process, and the sum of the execution times of the event processes is determined as the operation time for completing the neural network operation instruction.
Meanwhile, after the execution time of each event in each event process is determined, the execution time of the event with the longest execution time in an event process can be used as the synchronization time of that event process, so that the next event process is executed only after every event in it has finished.
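Put together, the estimated operation time is the sum over event processes of each process's slowest event. A minimal sketch, assuming the per-event times have already been obtained with the formulas above:

```python
def process_time(event_times):
    """Events in one process have no dependencies and run concurrently,
    so the process finishes when its slowest event finishes."""
    return max(event_times)

def total_operation_time(processes):
    """Synchronize events serialize the processes, so the total operation
    time is the sum of the per-process maxima."""
    return sum(process_time(times) for times in processes)

# Example: three event processes with the listed per-event execution times.
print(total_operation_time([[4.0], [4.0, 6.0], [3.0, 6.0, 2.0]]))  # 16.0
```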
In addition, in the embodiment of the present disclosure, the events in each event process are events without dependency relationships, where events without a dependency relationship are events whose operations use non-overlapping data in the storage module. For example, Load/Store events have no dependency relationship with Compute events, Load events have no dependency relationship with Store events, and Matrix Compute events, Vector Compute events and Scalar Compute events have no dependency relationships with one another; events without dependency relationships may be executed simultaneously.
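The no-dependency condition can be checked mechanically as non-overlap of the address ranges the two events touch; a small sketch under that assumption:

```python
def ranges_overlap(addr_a, size_a, addr_b, size_b):
    """True if the half-open ranges [addr_a, addr_a + size_a) and
    [addr_b, addr_b + size_b) overlap, i.e. the events are dependent."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a

# A Load over [0, 1024) and a Store over [4096, 5120) touch disjoint data,
# so they have no dependency and may be executed simultaneously.
print(ranges_overlap(0, 1024, 4096, 1024))  # False
```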
The following describes the process in which the simulator simulates the operation module and the storage module executing each event process according to the embodiment of the present disclosure. FIG. 5 is a flowchart of step S300 in the simulation operation method according to the embodiment of the disclosure, where the obtaining of the operation result of completing the neural network operation instruction includes:
S3001: executing each of the determined event processes;
S3002: obtaining the operation result of the neural network operation according to the operation results of the event processes.
According to the embodiments of the present disclosure, the event processes can be executed in the determined order, and the operation results corresponding to the operation sub-instructions can be stored, so that the operation result corresponding to the neural network operation instruction is obtained.
FIG. 6 shows a process diagram of the simulation operation method according to an embodiment of the present disclosure. The embodiment of the disclosure can divide the neural network operation into a plurality of sub-operations by using a cyclic blocking (loop tiling) strategy, i.e., a plurality of operation sub-instructions can be parsed correspondingly, and the data corresponding to each sub-operation can be loaded entirely into the buffer of the operation module. Suppose the operation of a certain layer of the neural network is divided into N sub-operations corresponding to N data blocks. FIG. 6 shows the process of simulating the execution while counting the running time: the execution of the neural network is divided into N+2 steps (event processes), every two adjacent steps are separated by a Synchronize event, and the execution time of the neural network is finally the sum of the execution times of the N+2 steps (event processes). The blocking step is sketched below.
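The blocking itself amounts to picking N so that each data block fits in the operation module's buffer; a sketch under assumed names:

```python
import math

def split_into_sub_operations(layer_size, buffer_size):
    """Cyclic blocking: split one layer's operation into N data blocks,
    each small enough to be loaded entirely into the on-chip buffer."""
    n = math.ceil(layer_size / buffer_size)
    blocks = []
    for i in range(n):
        start = i * buffer_size
        blocks.append((start, min(buffer_size, layer_size - start)))
    return blocks

# A 10,000-element layer with a 4,096-element buffer yields N = 3 blocks.
print(split_into_sub_operations(10_000, 4_096))
```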
Among the N+2 event processes determined according to the operation sub-instructions in the embodiment of the present disclosure: the first event process includes a first load event and a first synchronization event; the second event process includes a second load event, a first operation event and a second synchronization event; the third event process includes a third load event, a second operation event, a first storage event and a third synchronization event; the ith event process includes an ith load event, an (i-1)th operation event, an (i-2)th storage event and an ith synchronization event; the (N+1)th event process includes an Nth operation event, an (N-1)th storage event and an (N+1)th synchronization event; and the (N+2)th event process includes an Nth storage event and an (N+2)th synchronization event, where i is an integer greater than 3 and less than or equal to N.
The jth load event is used for loading the operation data of the jth operation sub-instruction from the storage module to the operation module; the jth storage event is used for caching the operation result of the jth operation sub-instruction obtained by the operation module and storing it into the storage module; the jth operation event is used for executing the operation of the jth operation sub-instruction; and the kth synchronization event is used for synchronizing all events in the kth event process, where j is an integer greater than or equal to 1 and less than or equal to N, and k is a positive integer greater than or equal to 1 and less than or equal to N+2.
The following describes an operation procedure of the embodiment of the present disclosure with reference to fig. 6.
Step 1: the simulated control module 300 triggers the Load 1 event, i.e., it controls the operation module to read the data corresponding to the first sub-operation from the storage module 200; the control instruction sent includes the data storage address corresponding to the first sub-operation, and the operation module caches the data from the corresponding storage address according to the received instruction. The synchronization time of the first event process is determined as the execution time of the first load event. The Load 1 event loads the data corresponding to the first sub-operation from the storage module into the SRAM (cache) of the master operation unit 101 and of the slave operation unit. At this time, the master operation unit 101 and the slave operation unit 102 cannot compute because no data is cached in the operation module 100, and synchronization is performed by the first synchronization event of the first event process. The execution time of Step 1 is the execution time of the Load 1 event.
Step 2: the simulated control module triggers the Load 2 event and the Compute 1 event, i.e., it controls the operation module to read the data corresponding to the second sub-operation from the storage module 200 (the control instruction sent includes the data storage address corresponding to the second sub-operation) and to execute the operation corresponding to the first operation sub-instruction. The synchronization time of the second event process is determined as the longer of the execution time of the second load event and that of the first operation event. The Load 2 event loads the data corresponding to the second sub-operation from the storage module into the SRAMs of the master operation unit and the slave operation unit. Because the on-chip caches of the master and slave operation units already hold the data corresponding to the first sub-operation (after the Load 1 event of Step 1), the Compute 1 event is triggered and the master and slave operation units can complete the first sub-operation. The execution time of Step 2 is the maximum of the execution times of the Load 2 event and the Compute 1 event.
Step 3: the simulated control module triggers the Load 3 event, the Compute 2 event and the Store 1 event, i.e., it controls the operation module to read the data corresponding to the third sub-operation from the storage module 200 (the control instruction sent includes the data storage address corresponding to the third sub-operation), to execute the operation corresponding to the second operation sub-instruction, and to store the execution result of the first operation sub-instruction to the storage module. The synchronization time of the third event process is determined as the longest of the execution time of the third load event, that of the second operation event and that of the first storage event. The Load 3 event loads the data corresponding to the third sub-operation from the storage module into the SRAMs of the master operation unit and the slave operation unit; the Compute 2 event executes the second sub-operation on the master and slave operation units, which is possible because their caches already hold the data corresponding to the second sub-operation; and the Store 1 event stores the result of the first sub-operation from the on-chip cache into the storage module. The execution time of Step 3 is the maximum of the execution times of these three events.
By analogy, in Step i+1 the simulated control module triggers the Load i+1 event, the Compute i event and the Store i-1 event; the control module 300 controls the operation module to load the data required by the (i+1)th sub-operation from the storage module, to compute the ith sub-operation, and to store the operation result of the (i-1)th sub-operation into the storage module. That is, the data corresponding to the (i+1)th sub-operation is loaded from the storage module into the on-chip cache, the ith sub-operation is computed, and the (i-1)th data block is stored from the on-chip cache into the storage module. The synchronization time of the synchronization event of Step i+1 is the maximum of the execution times of the Load i+1, Compute i and Store i-1 events.
Step N: the simulated control module triggers the Load N event, the Compute N-1 event and the Store N-2 event; the control module 300 controls the operation module to load the data required by the Nth sub-operation from the storage module, to compute the (N-1)th sub-operation, and to store the operation result of the (N-2)th sub-operation into the storage module. That is, the data corresponding to the Nth sub-operation is loaded from the storage module into the on-chip cache, the (N-1)th sub-operation is computed, and the (N-2)th data block is stored from the on-chip cache into the storage module. From Step 1 to Step N, all the data related to the N sub-operations of the neural network is loaded into the on-chip cache. The synchronization time of the Nth synchronization event is the longest of the execution times of the Load N, Compute N-1 and Store N-2 events.
Step N+1: the simulated control module triggers the Compute N event and the Store N-1 event. The control module 300 may control the operation module to compute the Nth sub-operation and store the operation result of the (N-1)th sub-operation into the storage module; that is, the Nth sub-operation is computed, and the (N-1)th data block is stored from the on-chip cache into the storage module. From Step 2 to Step N+1, the computation tasks of the N sub-operations of the neural network are completed. The synchronization time of the (N+1)th synchronization event is the longer of the execution times of the Compute N event and the Store N-1 event.
Step N+2: the simulated control module triggers the Store N event to store the computation result of the Nth sub-operation from the on-chip cache into the storage module; that is, the control module 300 may control the operation module to store the operation result of the Nth sub-operation into the storage module. From Step 3 to Step N+2, the computation results of the N sub-operations of the neural network are stored into the storage module, and all the operations of the neural network are completed. The synchronization time of the (N+2)th synchronization event is the execution time of the Store N event.
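The whole Step 1 to Step N+2 walkthrough can be condensed into a single loop that accumulates, for each step, the execution time of its slowest event. This sketch mirrors FIG. 6 with assumed inputs:

```python
def simulate_total_time(n, load_t, compute_t, store_t):
    """Accumulate the per-step maxima over Steps 1..N+2.

    load_t, compute_t and store_t are lists giving, for 1-based sub-operation
    index j, the execution times of its Load, Compute and Store events.
    """
    total = 0.0
    for step in range(1, n + 3):
        times = []
        if step <= n:
            times.append(load_t[step - 1])       # Load `step`
        if 1 <= step - 1 <= n:
            times.append(compute_t[step - 2])    # Compute `step - 1`
        if 1 <= step - 2 <= n:
            times.append(store_t[step - 3])      # Store `step - 2`
        total += max(times)                      # Synchronize: wait for slowest
    return total

# Three sub-operations with uniform event times: 2 + 5 + 5 + 5 + 1 = 18.
print(simulate_total_time(3, [2.0] * 3, [5.0] * 3, [1.0] * 3))
```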
Through the configuration, the simulation process of the operation of the neural network can be realized based on the event triggering mode, and the simulation operation of the neural network can be quickly carried out.
In summary, the embodiments of the present disclosure can form a plurality of matched event processes according to the neural network operation to be executed and execute the events in the order of the event processes, so that each operation is executed in order; the operation process can be executed in an event-triggered manner, so that performance simulation of the neural network operation can be performed quickly. The simulation is performed in units of events: an event is usually a user-defined event, and each event can trigger one simulation operation, so that the neural network operation process can be executed effectively, the result of each process can be obtained, and the simulated execution process can be analyzed conveniently.
It will be understood by those skilled in the art that, in the above method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a simulator, an electronic device, a machine learning chip, a computer-readable storage medium, and a program, all of which can be used to implement any one of the simulation operation methods provided by the present disclosure; the corresponding technical solutions can be found in the descriptions of the method section and are not repeated here.
The disclosed embodiment provides a simulator, which includes: a processor for performing the method of any one of the first aspect.
Embodiments of the present disclosure also provide a machine learning chip including instructions for performing the method according to any one of the first aspect.
In some possible implementations, the embodiment of the disclosure further provides a chip packaging structure, which includes the above chip.
In some possible embodiments, the present disclosure further provides a board card, which includes the above chip package structure and may further include other components besides the chip, including but not limited to: a memory device, an interface device and a control device.
The memory device is connected with the chip in the chip packaging structure through a bus and used for storing data. The memory device may include a plurality of groups of memory cells. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units, and each group of the storage units may include a plurality of DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, where 64 bits of each controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
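As a quick sanity check of that figure (an arithmetic note, not from the original text): DDR4-3200 performs 3200 million transfers per second on a 64-bit (8-byte) data path, and 3200 MT/s × 8 B = 25600 MB/s.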
In one embodiment, each group of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface: the data to be processed is transmitted to the chip by the server through the standard PCIe interface, realizing data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of the other interfaces, as long as the interface unit can implement the switching function. In addition, the calculation results of the chip are still transmitted back to the external device (e.g. the server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads; accordingly, the chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the plurality of processing chips, processing cores, and/or processing circuits in the chip.
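One way to picture this regulation is a simple polling loop on the MCU; the register read, the threshold, and all function names below are invented for illustration, since the patent only specifies that the control device monitors the chip state over SPI.

```python
import random  # stands in for a real SPI transport in this sketch

def spi_read_load() -> float:
    """Pretend SPI read of the chip's current load (0.0 to 1.0)."""
    return random.random()

def regulate_once(light_load_threshold: float = 0.3) -> str:
    # Switch the chip between working states based on the observed load.
    load = spi_read_load()
    return "light-load state" if load < light_load_threshold else "multi-load state"

print(regulate_once())
```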
In some embodiments, an electronic device is further provided, which includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, there is also provided a computer readable storage medium having stored therein computer program instructions which, when executed by a processor, implement the method of any one of the first aspects.
Having described embodiments of the present disclosure, the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over techniques in the marketplace, and to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A simulation operation method for simulating execution of a neural network operation, comprising:
receiving and storing operation data, wherein the operation data comprises a neural network operation instruction and data for executing the neural network operation instruction;
parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event;
and obtaining, based on the determined event processes, at least one of an operation time and an operation result of completing the neural network operation instruction.
2. The method of claim 1, wherein parsing a plurality of operation sub-instructions from the neural network operation instruction comprises:
executing a decoding operation on the received neural network operation instruction to obtain the plurality of operation sub-instructions.
3. The method of claim 1, wherein determining the plurality of event processes for executing the plurality of operation sub-instructions comprises at least one of:
determining the number of the event processes according to the number of the operation sub-instructions;
and determining each event process and the execution sequence of each event process according to the execution sequence of each operation sub-instruction.
4. The method of claim 3, wherein determining the number of the event processes according to the number of the operation sub-instructions comprises:
when N operation sub-instructions are parsed from the neural network operation instruction, determining the number of the event processes to be N+2, wherein N is a positive integer greater than 1.
5. The method of claim 3, wherein determining each of the event processes and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions comprises:
determining a first event process, the first event process comprising a first load event;
determining a second event process, the second event process comprising a second load event and a first operation event;
determining a third event process, wherein the third event process comprises a third load event, a second operation event and a first store event;
determining an ith event process, wherein the ith event process comprises an ith load event, an (i-1)th operation event and an (i-2)th store event;
determining an (N+1)th event process, wherein the (N+1)th event process comprises an Nth operation event and an (N-1)th store event;
determining an (N+2)th event process, wherein the (N+2)th event process comprises an Nth store event, i is an integer greater than 3 and less than or equal to N, and N is the number of the operation sub-instructions and is a positive integer greater than 1;
wherein a jth load event is used for loading operation data of a jth operation sub-instruction, a jth store event is used for storing an operation result of the jth operation sub-instruction, a jth operation event is used for executing the operation of the jth operation sub-instruction, and j is a positive integer greater than 0 and less than or equal to N.
6. The method of claim 5, wherein determining each of the event processes and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions further comprises:
wherein each event process further comprises a synchronization event, and the synchronization time of the synchronization event of each event process is determined according to the execution times of the events in that event process.
7. The method of claim 1, wherein obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes comprises:
determining the execution time of each event process according to the execution times of the events in that event process;
and obtaining the operation time for completing the neural network operation instruction according to the execution times of the event processes.
8. The method of claim 1, wherein obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes comprises:
executing each of the determined event processes;
and obtaining the operation result of the neural network operation according to the operation result of each event process.
9. The method of claim 8, further comprising:
when executing each event process, determining an operation program for executing the operation event according to the operation type of the operation sub-instruction corresponding to the operation event in the event process;
and executing the corresponding operation event according to the determined operation program.
10. The method of claim 9, wherein determining the operation program for executing the operation event according to the operation type comprises:
when the operation sub-instruction corresponding to the operation event is a first type of operation, executing the operation event by using a first operation program;
when the operation sub-instruction corresponding to the operation event is a second type of operation, executing the operation event by using a second operation program;
wherein the first type of operation includes at least one of a vector operation, a scalar operation, and a non-linear operation, and the second type of operation includes a matrix scalar operation.
11. A simulator, comprising:
a processor for performing the method of any one of claims 1-10.
12. A machine learning chip, wherein the machine learning chip is configured to perform the method of any one of claims 1-10.
13. An electronic device, characterized in that it comprises a chip according to claim 12.
14. A computer readable storage medium having computer program instructions stored therein, which when executed by a processor implement the method of any one of claims 1 to 10.
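To make the claimed flow concrete, the event pipeline of claims 1 to 7 above can be sketched in Python as follows. This is a minimal illustration only: every identifier (Event, EventProcess, decode, build_event_processes, operation_time), the ';'-separated stand-in for an instruction stream, and the unit event costs are assumptions of the sketch, not definitions from the patent.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of claims 1-7; all names and costs are assumptions.

@dataclass
class Event:
    kind: str           # "load", "operation", "store" or "synchronization"
    sub_instr: int      # index of the operation sub-instruction it serves
    time: float = 1.0   # assumed unit cost; a real simulator would model this

@dataclass
class EventProcess:
    events: List[Event] = field(default_factory=list)

def decode(nn_instruction: str) -> List[str]:
    # Claim 2: the sub-instructions come from decoding the neural network
    # operation instruction; a ';'-separated string stands in for real code.
    return [s for s in nn_instruction.split(";") if s]

def build_event_processes(n: int) -> List[EventProcess]:
    # Claims 4-6: N sub-instructions yield N+2 event processes, pipelined so
    # that process i loads data for sub-instruction i while executing the
    # operation of i-1 and storing the result of i-2.
    assert n > 1
    processes = []
    for i in range(1, n + 3):                          # processes 1 .. N+2
        p = EventProcess()
        if i <= n:
            p.events.append(Event("load", i))          # ith load event
        if 2 <= i <= n + 1:
            p.events.append(Event("operation", i - 1)) # (i-1)th operation event
        if 3 <= i <= n + 2:
            p.events.append(Event("store", i - 2))     # (i-2)th store event
        # Claim 6: every process also carries a synchronization event whose
        # time follows from the events above (unit cost assumed here).
        p.events.append(Event("synchronization", i))
        processes.append(p)
    return processes

def operation_time(nn_instruction: str) -> float:
    # Claim 7, under one plausible reading: events inside a process overlap,
    # so a process takes as long as its slowest event, and processes run
    # back to back, so their times add up.
    processes = build_event_processes(len(decode(nn_instruction)))
    return sum(max(e.time for e in p.events) for p in processes)

print(operation_time("conv;pool;fc;softmax"))   # N=4 -> 6 processes -> 6.0
```

For N = 4 this reproduces the schedule claim 5 spells out: load 1 | load 2 + operation 1 | load 3 + operation 2 + store 1 | load 4 + operation 3 + store 2 | operation 4 + store 3 | store 4, i.e. N+2 = 6 event processes.

Claims 9 and 10 select the operation program from the operation type of the sub-instruction behind each operation event. A minimal dispatch, again with invented type labels and placeholder programs, might look like:

```python
# Minimal dispatch for claims 9-10. The type labels and the two "operation
# programs" are placeholders; the patent only fixes the first class (vector,
# scalar, non-linear operations) and the second class (matrix scalar operations).

FIRST_CLASS = {"vector", "scalar", "nonlinear"}
SECOND_CLASS = {"matrix_scalar"}

def first_operation_program(sub_instr: int) -> str:
    return f"first program executed sub-instruction {sub_instr}"

def second_operation_program(sub_instr: int) -> str:
    return f"second program executed sub-instruction {sub_instr}"

def execute_operation_event(sub_instr: int, op_type: str) -> str:
    if op_type in FIRST_CLASS:
        return first_operation_program(sub_instr)
    if op_type in SECOND_CLASS:
        return second_operation_program(sub_instr)
    raise ValueError(f"unknown operation type: {op_type}")

print(execute_operation_event(1, "vector"))         # first program
print(execute_operation_event(2, "matrix_scalar"))  # second program
```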

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910097439.3A CN111506384B (en) 2019-01-31 2019-01-31 Simulation operation method and simulator

Publications (2)

Publication Number Publication Date
CN111506384A (en) 2020-08-07
CN111506384B (en) 2022-12-09

Family

ID=71875632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910097439.3A Active CN111506384B (en) 2019-01-31 2019-01-31 Simulation operation method and simulator

Country Status (1)

Country Link
CN (1) CN111506384B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104205064A (en) * 2012-03-16 2014-12-10 国际商业机器公司 Transformation of a program-event-recording event into a run-time instrumentation event
CN104335232A (en) * 2012-05-30 2015-02-04 高通股份有限公司 Continuous time spiking neural network event-based simulation
CN108369660A (en) * 2015-07-13 2018-08-03 索邦大学 The data processing equipment of numerical value is indicated with the time interval between event
US20180150757A1 (en) * 2016-11-29 2018-05-31 International Business Machines Corporation Accurate temporal event predictive modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHI Xiaoli et al., "Modeling of Data Parallel Computer Architecture", Computer Engineering *
LAI Xin et al., "Design of a Memory Architecture Supporting Thread-Level Speculation", Computer Engineering *

Similar Documents

Publication Publication Date Title
US11449745B2 (en) Operation apparatus and method for convolutional neural network
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
CN109284825B (en) Apparatus and method for performing LSTM operations
US5524175A (en) Neuro-computer system for executing a plurality of controlling algorithms
JP7078758B2 (en) Improving machine learning models to improve locality
US20060130029A1 (en) Programming language model generating apparatus for hardware verification, programming language model generating method for hardware verification, computer system, hardware simulation method, control program and computer-readable storage medium
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
US20210350230A1 (en) Data dividing method and processor for convolution operation
US20130013283A1 (en) Distributed multi-pass microarchitecture simulation
US8140314B2 (en) Optimal bus operation performance in a logic simulation environment
CN114626516A (en) Neural network acceleration system based on floating point quantization of logarithmic block
CN109446740B (en) System-on-chip architecture performance simulation platform
Cho et al. FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks
Galicia et al. Neurovp: A system-level virtual platform for integration of neuromorphic accelerators
US7110934B2 (en) Analysis of the performance of a portion of a data processing system
CN111506384B (en) Simulation operation method and simulator
US11176018B1 (en) Inline hardware compression subsystem for emulation trace data
Diamantopoulos et al. A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping
CN116149917A (en) Method and apparatus for evaluating processor performance, computing device, and readable storage medium
CN108846248B (en) Application modeling and performance prediction method
Hoefer et al. SiFI-AI: A Fast and Flexible RTL Fault Simulation Framework Tailored for AI Models and Accelerators
CN111143208B (en) Verification method for assisting FPGA to realize AI algorithm based on processor technology
CN111950219B (en) Method, apparatus, device and medium for realizing simulator
CN114021733A (en) Model training optimization method and device, computer equipment and storage medium
CN110135572B (en) SOC-based trainable flexible CNN system design method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant