CN111506384A - Simulation operation method and simulator

Simulation operation method and simulator

Info

Publication number
CN111506384A
Authority
CN
China
Prior art keywords
event
neural network
instruction
determining
sub
Prior art date
Legal status
Granted
Application number
CN201910097439.3A
Other languages
Chinese (zh)
Other versions
CN111506384B (en)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN201910097439.3A
Publication of CN111506384A
Application granted
Publication of CN111506384B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504 Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45508 Runtime interpretation or emulation, e.g. emulator loops, bytecode interpretation


Abstract

The present disclosure relates to a simulation operation method and a simulator. The method is used for simulating the execution of a neural network operation and comprises the following steps: receiving and storing operational data, wherein the operational data comprises a neural network operation instruction and data for executing the neural network operation instruction; parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event; and obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes. The method and the device can rapidly simulate the operation of a neural network.

Description

Simulation operation method and simulator
Technical Field
The present disclosure relates to the field of machine learning, and in particular, to a simulation operation method, a simulator, a machine learning chip, and an electronic device.
Background
Computer simulation refers to simulating a real-world process or system with a simulator developed in computer software. Computer system simulators have become an indispensable tool in the field of computer architecture research. With a simulator, researchers can configure and observe hardware and software efficiently and at low cost, and thereby guide the design and optimization of the hardware and software.
Neural networks have been applied with great success, and neural network accelerators are widely used to run neural network applications. Before a neural network accelerator is formally applied, developers need to evaluate the performance of the accelerator. Researchers typically employ cycle-accurate simulators for performance simulation of hardware. A cycle-accurate simulator requires each simulated hardware module to carry out, in every clock cycle, the various details of its operation, including state machine transitions, register changes, pipeline-stage operations and the like, so as to maintain consistency between the simulator and the hardware. Such accurate simulation incurs huge resource, energy and time overhead, which in turn means that cycle-accurate simulators cannot meet practical requirements. Therefore, how to increase the running speed of the simulator has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the disclosure provides a simulation operation method, a simulator, a machine learning chip, an electronic device and a storage medium, which can conveniently improve the simulation operation speed.
According to a first aspect of the present disclosure, there is provided a simulation operation method for simulating execution of a neural network operation, including:
receiving and storing operational data, wherein the operational data comprises a neural network operational instruction and data for executing the neural network operational instruction;
parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event;
obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes.
In some possible embodiments, the parsing of a plurality of operation sub-instructions from the neural network operation instruction includes:
performing a decoding operation on the received neural network operation instruction to obtain the plurality of operation sub-instructions.
In some possible embodiments, the determining the plurality of event processes for completing the plurality of operation sub-instructions comprises at least one of:
determining the number of the event processes according to the number of the operation sub-instructions;
determining each event process and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions.
In some possible embodiments, the determining the number of event processes according to the number of operation sub-instructions includes:
when N operation sub-instructions are parsed from the neural network operation instruction, determining the number of the event processes to be N+2, wherein N is a positive integer greater than or equal to 1.
In some possible embodiments, the determining each event process and the execution sequence of each event process according to the execution sequence of each operation sub-instruction includes:
determining a first event process, the first event process comprising a first load event;
determining a second event process, the second event process comprising a second load event and a first operation event;
determining a third event process, wherein the third event process comprises a third load event, a second operation event and a first storage event;
determining an ith event process, wherein the ith event process comprises an ith load event, an (i-1)th operation event and an (i-2)th storage event;
determining an (N+1)th event process, wherein the (N+1)th event process comprises an Nth operation event and an (N-1)th storage event;
determining an (N+2)th event process, wherein the (N+2)th event process comprises an Nth storage event, i is an integer greater than 3 and less than or equal to N, and N is the number of operation sub-instructions and is a positive integer greater than or equal to 1;
wherein the jth load event is used for loading the operation data of the jth operation sub-instruction, the jth storage event is used for storing the operation result of the jth operation sub-instruction, the jth operation event is used for executing the operation of the jth operation sub-instruction, and j is a positive integer greater than 0 and less than or equal to N.
In some possible embodiments, each event process further comprises a synchronization event, and the synchronization time of the synchronization event of each event process is determined according to the execution time of each event in that event process.
In some possible embodiments, the obtaining at least one of an operation time and an operation result for completing the operation instruction of the neural network based on the determined event processes includes:
determining the execution time of each event process according to the execution time of each event in each event process;
and acquiring the operation time for completing the neural network operation instruction according to the execution time of each event process.
In some possible embodiments, the obtaining at least one of an operation time and an operation result for completing the operation instruction of the neural network based on the determined event processes includes:
executing each of the determined event processes;
and obtaining the operation result of the neural network operation according to the operation result of each event process.
In some possible embodiments, the method further comprises:
when executing each event process, determining an operation program for executing the operation event according to the operation type of the operation sub-instruction corresponding to the operation event in the event process;
and executing the corresponding operation event according to the determined operation program.
In some possible embodiments, the determining, according to the operation type of the operation event in the event process, an operation program for executing the operation event includes:
when the operation sub-instruction corresponding to the operation event is a first type of operation, executing the operation event by using a first operation program;
when the operation sub-instruction corresponding to the operation event is a second type of operation, executing the operation event by using a second operation program;
wherein the first type of operation includes at least one of a vector operation, a scalar operation, and a nonlinear operation, and the second type of operation includes a matrix vector operation.
According to a second aspect of the present disclosure, there is provided a simulator comprising:
a processor for performing the method of any one of the first aspect.
According to a third aspect of the present disclosure, there is provided a machine learning chip for performing the method of any one of the first aspect.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the chip according to the third aspect.
According to a fifth aspect of the present disclosure, there is provided a computer readable storage medium having stored therein computer program instructions which, when executed by a processor, implement the method of any one of the first aspects.
According to the embodiments of the present disclosure, a plurality of matched event processes can be formed according to the neural network operation to be executed, and the events can be executed in the order of the event processes, so that each operation is simulated and executed in sequence; the operation process can be executed in an event-triggered manner, allowing fast performance simulation of the neural network operation. The simulation is performed in units of events: an event is usually a user-defined event, and each event can trigger one simulation operation, so that the neural network operation process can be executed effectively, the result of each process can be obtained, and the simulated execution process can be analyzed conveniently.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow diagram of a simulation operation method according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an apparatus for performing neural network operations according to an embodiment of the present disclosure;
FIG. 3 shows a block diagram of an operation module according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of step S300 of a simulation operation method according to an embodiment of the disclosure;
FIG. 5 shows a flowchart of step S300 of a simulation operation method according to an embodiment of the disclosure;
FIG. 6 shows a process diagram of a simulation operation method according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principle and logic; for brevity, the details are not repeated in the present disclosure.
The embodiments of the disclosure can simulate the execution of operations related to machine learning algorithms. For example, the simulation can be used to estimate the operation time of a neural network operation, and to simulate the execution of the neural network operation to obtain corresponding operation results, so that information such as the operation result, operation time and operation speed can be obtained by executing the corresponding machine learning operation, thereby enabling analysis of the machine learning algorithm. The embodiments of the disclosure determine, by parsing a received neural network operation instruction, a plurality of event processes for executing the instruction; the event processes can be executed in sequence, which reduces the time for analyzing a neural network algorithm and increases the running speed of the simulator.
FIG. 1 shows a flow diagram of a simulation operation method according to an embodiment of the present disclosure. As shown in FIG. 1, the simulation operation method of the embodiment of the present disclosure may include:
S100: receiving and storing operational data, wherein the operational data comprises a neural network operational instruction and data for executing the neural network operational instruction;
S200: parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event;
S300: obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes.
The simulation operation method of the embodiment of the present disclosure may be applied to a simulator, which may perform analysis of neural network operation through simulation of processes such as data storage, data operation, and the like, for example, may be used to analyze operation time or operation result of the neural network operation, and the like. Wherein the simulator may comprise a performance simulator, but the present disclosure is not limited thereto.
In step S100, operation data to be subjected to analog operation may be received first, where the operation data may include a neural network operation instruction and data required to execute the instruction, or may also include information such as a storage address and a data amount of each data to execute the neural network operation instruction.
Upon receiving the operation data, embodiments of the present disclosure may simulate the storage of the operation data, e.g., may simulate a storage operation that stores the operation data into a storage area. In practical applications, an apparatus performing the neural network operation may include a control module, an operation module, and a storage module. The simulator of the embodiment of the disclosure can simulate storing the operation data into a preset storage space of the storage module, so that the operation data can be read and called by subsequent operations.
After the operation data is obtained and stored, the simulator may, in step S200, simulate parsing the operation data and determine the event processes for performing the neural network operation. For example, it can simulate the parsing of the neural network operation instruction by the control module and the determination and control of the corresponding event processes. That is, the simulator according to the embodiment of the present disclosure may parse a plurality of operation sub-instructions from the received neural network operation instruction by means of the simulated control module, so as to simulate and analyze the operations of the plurality of operation sub-instructions respectively.
In some possible embodiments, the parsing of a plurality of operation sub-instructions from the neural network operation instruction may include: performing a decoding operation on the received neural network operation instruction to obtain the plurality of operation sub-instructions. That is, in order to improve data security and reduce data size, the neural network operation instruction received in the embodiment of the present disclosure may be an encoded instruction; after the instruction is received, a decoding operation may be performed to parse the plurality of operation sub-instructions from it. The present disclosure does not specifically limit the encoding and decoding of the neural network operation instruction, and those skilled in the art can select an appropriate mode to perform the encoding and decoding operations.
In addition, after the plurality of operation sub-instructions corresponding to the neural network operation instruction are obtained, the event processes for completing the plurality of operation sub-instructions may be determined, where each event process may include at least one event, and the type of an event may be at least one of a load event, an operation event, a store event, and a synchronization event.
The embodiment of the present disclosure may determine a plurality of event processes according to a plurality of operation sub-instructions corresponding to the received neural network operation instruction, where each event process of the plurality of event processes may include at least one event, and there is no dependency relationship between events included in each event process. Specifically, in step S200 of the embodiment of the present disclosure, determining a plurality of event processes for completing the plurality of operation sub-instructions may include at least one of the following:
a) determining the number of the event processes according to the number of the operation sub-instructions;
in the embodiment of the present disclosure, when N operation sub-instructions are resolved from the neural network operation instruction plan, it is determined that the number of the event processes is N +2, where N is a positive integer greater than or equal to 1.
In the embodiment of the disclosure, in order to enable the operation sub-instructions to be executed in order and to be simulated, run and analyzed quickly, a plurality of event processes are determined. The relationship to be satisfied among the event processes is that the events within each event process have no dependency relationship on one another, where events without a dependency relationship are events whose operations use data with no overlapping relationship. For example, there is no dependency relationship between a Load/Store event and a Compute event, no dependency relationship between a Load event and a Store event, and no dependency relationship among a Matrix Compute event, a Vector Compute event and a Scalar Compute event.
b) determining each event process and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions.
As described above, the relationships among the event processes of the embodiments of the present disclosure are: the data required by the next event process is loaded in the previous event process; after the operation sub-instruction in an event process is completed, the storage of the corresponding operation result is executed in a subsequent event process; and the event processes are synchronized by synchronization events, i.e., the next event process can be started through a synchronization event only after all events in the previous event process have completed. Specifically, determining the sequence of the event processes in the embodiment of the present disclosure may include:
determining a first event process, the first event process comprising a first load event;
determining a second event process, the second event process comprising a second load event and a first operation event;
determining a third event process, wherein the third event process comprises a third load event, a second operation event and a first storage event;
determining an ith event process, wherein the ith event process comprises an ith load event, an (i-1)th operation event and an (i-2)th storage event;
determining an (N+1)th event process, wherein the (N+1)th event process comprises an Nth operation event and an (N-1)th storage event;
determining an (N+2)th event process, wherein the (N+2)th event process comprises an Nth storage event, i is an integer greater than 3 and less than or equal to N, and N is the number of operation sub-instructions and is a positive integer greater than or equal to 1;
wherein the jth load event is used for loading the operation data of the jth operation sub-instruction, the jth storage event is used for storing the operation result of the jth operation sub-instruction, the jth operation event is used for executing the operation of the jth operation sub-instruction, and j is a positive integer greater than 0 and less than or equal to N.
In this way, the plurality of event processes corresponding to the operation sub-instructions can be determined, and each event process and the execution order of the event processes can be determined according to the execution order of the operation sub-instructions, as illustrated by the sketch below.
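As a concrete illustration, the schedule above can be generated mechanically from the number N of operation sub-instructions. The following Python sketch is illustrative only; the function and event names are assumptions made for this document, not an API defined by the disclosure:

```python
def build_event_processes(n):
    """Build the N+2 pipelined event processes for n operation sub-instructions.

    Event process p (1-indexed) may contain: Load p, Compute p-1 and
    Store p-2, whenever those indices fall within [1, n], plus a closing
    Synchronize event.
    """
    assert n >= 1
    processes = []
    for p in range(1, n + 3):                    # N + 2 event processes in total
        events = []
        if p <= n:
            events.append(("Load", p))           # load data of sub-instruction p
        if 1 <= p - 1 <= n:
            events.append(("Compute", p - 1))    # run sub-instruction p-1
        if 1 <= p - 2 <= n:
            events.append(("Store", p - 2))      # store result of sub-instruction p-2
        events.append(("Synchronize", p))        # barrier closing this process
        processes.append(events)
    return processes

# Example: n = 4 sub-instructions yields 6 event processes.
for i, events in enumerate(build_event_processes(4), start=1):
    print(f"event process {i}: {events}")
```

Note that no two events inside one process touch the same sub-instruction's data, which is exactly the no-dependency condition stated above.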
After the event processes are determined, the simulated control module can control the operation module and the storage module to execute the corresponding operations.
Fig. 2 shows a block diagram of an apparatus for performing neural network operations according to an embodiment of the present disclosure. In practical applications, the apparatus for computing a neural network operation may include a control module 300, a storage module 200 and an operation module 100.
The operation module 100 is configured to complete the received neural network operation and obtain an operation result. The operation module 100 may complete operations corresponding to each event process according to each event process determined by the control module 300, for example, load corresponding data from the storage module 200, complete corresponding operations, store operation results, and the like. The operation module 100 in the embodiment of the present disclosure may perform vector operation, scalar operation, nonlinear operation, matrix vector operation, and the like in the neural network operation. The operation module 100 may further include a buffer for buffering an intermediate result generated in the operation process and data applied in the operation process.
According to the above structure, the simulator can simulate the control operations of the control module 300 on the storage module 200 and the operation module 100, the read/write operations of the storage module 200, and the operations of the operation module 100.
Fig. 3 illustrates a block diagram of an operation module according to an embodiment of the present disclosure, wherein the operation module 100 of the embodiment of the present disclosure may include a master operation unit 101 and at least one slave operation unit 102.
The master operation unit 101 may be configured to load corresponding data from the storage module according to the control of the control module 300, execute the first type of operation in the operation sub-instruction corresponding to an operation event, and cache the operation result of the first type of operation; the embodiment of the disclosure can simulate the operation process of the master operation unit by using a first program.
The slave operation unit 102 may be configured to load corresponding data from the storage module and/or the master operation unit according to the control of the control module 300, execute the second type of operation in the operation sub-instruction corresponding to the operation event, and cache the operation result of the second type of operation. The disclosed embodiment may simulate the operation process of the slave operation unit 102 using a second program.
The first type of operation includes at least one of a vector operation, a scalar operation, and a nonlinear operation, and the second type of operation includes a matrix vector operation.
The master operation unit 101 can serve as the main data path and can simultaneously complete part of the vector operations and the scalar operations. The master operation unit 101 may cache corresponding data from the storage module according to the control of the control module, and may be configured to perform vector operations, scalar operations and nonlinear operations using the cached data. The vector operations include element-wise addition, element-wise multiplication and other operations; the scalar operations include the four basic scalar arithmetic operations; and the nonlinear operations mainly include transcendental functions such as exponential and hyperbolic functions and are used to support the operation of activation functions. The master operation unit 101 mainly performs the operations of the pooling layer, the BN (batch normalization) layer, the ROI pooling layer and the activation layer.
The slave operation unit 102 can serve as the main core operation module and mainly completes the matrix vector operations in the neural network algorithm; the operations of the convolutional layers, the fully connected layers and the LSTM layers in the neural network are all completed in the slave operation unit 102.
The master operation unit 101 and the slave operation unit 102 may each include a buffer for caching the data required by each sub-operation and for caching operation results. In the storage event according to the embodiment of the present disclosure, storing an operation result may mean caching the result of the corresponding operation sub-instruction in the buffer, or storing the result into the storage module.
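As an illustration of this master/slave split, the sketch below routes a Compute event to the first or second operation program according to its operation type. All names here are assumptions for illustration, and the two program bodies are placeholders rather than the disclosure's actual simulation programs:

```python
FIRST_TYPE = {"vector", "scalar", "nonlinear"}   # simulated by the first program
SECOND_TYPE = {"matrix_vector"}                  # simulated by the second program

def first_operation_program(data):
    # Placeholder for simulating the master operation unit 101.
    return sum(data)

def second_operation_program(data):
    # Placeholder for simulating the slave operation units 102.
    return [2.0 * x for x in data]

def run_compute_event(op_type, data):
    """Choose the simulation program for a Compute event by operation type."""
    if op_type in FIRST_TYPE:
        return first_operation_program(data)
    if op_type in SECOND_TYPE:
        return second_operation_program(data)
    raise ValueError(f"unknown operation type: {op_type}")

print(run_compute_event("scalar", [1.0, 2.0, 3.0]))  # 6.0
```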
As described above, the storage module 200 may be configured to store the data of the neural network operation; for example, the data may include the topology of the neural network, input parameters, weights, output partial sums and the like, and different types of data may be involved for different machine learning operations, which the embodiments of the present disclosure do not specifically limit. Since the capacity of the buffer on the operation module 100 is limited, operations such as storing, reading and writing of data can be performed through data exchange between the storage module and the buffers of the operation module.
In the embodiment of the present disclosure, the simulator may simulate a bidirectional data path between the master operation unit 101 and the slave operation unit 102: the path from the master operation unit 101 to the slave operation unit 102 is used to transmit input neurons, and the path from the slave operation unit 102 to the master operation unit 101 is used to transmit output neuron partial sums. The simulator may also simulate a bidirectional data path between the master operation unit 101 and the storage module 200, where the path from the storage module 200 to the master operation unit 101 is used for loading input neurons, input neuron indexes, output neuron partial sums and weight indexes, and the path from the master operation unit 101 to the storage module 200 is used for storing output neuron partial sums. The simulator may further simulate a unidirectional data path between the slave operation unit 102 and the storage module 200, where the path from the storage module 200 to the slave operation unit 102 is used for loading weights.
Specifically, the control module 300 according to the embodiment of the present disclosure may receive a neural network operation instruction to be executed, where the instruction may include the operation instruction that the neural network needs to execute and the storage addresses, in the storage module 200, of the data related to the operation; the control module 300 may parse the received neural network operation instruction into a plurality of operation sub-instructions, for example by instruction decoding, and correspondingly determine the storage address of the data needed by each operation sub-instruction.
The simulator in the embodiment of the disclosure may determine a plurality of event processes according to the determined plurality of operation sub-instructions and the execution sequence of each operation sub-instruction, where the plurality of event processes are used to complete the plurality of operation sub-instructions involved in the neural network operation. The event process of the embodiment of the present disclosure may include at least one event, and the type of the event may be a load event, an operation event, a storage event, and a synchronization event.
Each event can have corresponding descriptors, which can include parameters such as type, data, time and duration. The type descriptor describes the operations that the event needs to complete, including control (e.g. branch jumps), memory access (e.g. Load/Store to off-chip DDR, or Read/Write to the on-chip cache) and computation (e.g. matrix, vector, scalar and logic operations). The data descriptor describes the data related to the event, including information such as the data address, data size and data type. The time descriptor describes the time point at which the event is triggered. The duration descriptor describes the execution time of the event, which can be obtained by means of performance analysis (profiling) or modeling.
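One possible in-memory representation of these descriptors is sketched below; the field names are assumptions chosen for readability, not terminology from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class EventDescriptor:
    kind: str        # type descriptor: "control", "memory" or "compute"
    data_addr: int   # data descriptor: address of the data the event touches
    data_size: int   # data descriptor: amount of data involved
    data_type: str   # data descriptor: e.g. "fp16" or "int8"
    time: float      # time descriptor: point at which the event is triggered
    duration: float  # duration descriptor: execution time from profiling/modeling

# A hypothetical Load event touching 4 KB of fp16 data, triggered at t = 0:
load1 = EventDescriptor("memory", 0x1000, 4096, "fp16", 0.0, 2.5)
print(load1)
```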
As described above, events may include load events, compute events, store events and synchronization events, referred to below as Load events, Compute events, Store events and Synchronize events, in that order.
A Load event loads the data involved from the storage module 200 into the buffer of the operation module; the data may include input neurons, input neuron indexes, output partial sums, weights, weight indexes and the like.
A Compute event completes a corresponding operation of the neural network through the operation module. The operations of the neural network include matrix operations, vector operations and scalar operations. Matrix operations mainly comprise matrix vector operations and constitute the core operations of the neural network; most operations in the convolutional layers, fully connected layers and LSTM (long short-term memory) layers are matrix vector operations. Vector operations include vector inner product, element-wise addition and element-wise multiplication; the pooling layer, the BN (batch normalization) layer, the LSTM layer and the LRN (local response normalization) layer in the neural network all involve vector operations.
A Store event stores the data cached by the operation module 100 into the storage module 200.
A Synchronize event is a synchronization event. When a Synchronize event occurs, it must be ensured that all other events before the Synchronize event are completed before subsequent events are executed; i.e., the Synchronize event is equivalent to a synchronization signal used to synchronize the events that precede it.
For a synchronous event, the embodiments of the present disclosure may determine the time of the synchronous event according to the execution time of each event included in each event process.
In conjunction with the above process, step S300 of the embodiment of the present disclosure is described in detail below. FIG. 4 shows a flowchart of step S300 in the simulation operation method according to the embodiment of the present disclosure, where the obtaining of the operation time for completing the neural network operation instruction includes:
S301: determining the execution time of each event process according to the execution time of each event in each event process;
S302: acquiring the operation time for completing the neural network operation instruction according to the execution time of each event process.
The following describes a method for determining the execution time of each event of the simulator.
The execution time of a Load event is related to the data size and the bandwidth of the storage module, and can be calculated as t_Load = t_start + Data_Load / Bandwidth_Load, where t_Load is the Load event execution time, t_start is the start-up time of the storage module, Data_Load is the amount of data involved in the Load event (including all input neurons, input neuron indexes, output partial sums, weights and weight indexes), and Bandwidth_Load is the Load bandwidth of the off-chip memory.
The execution time of a Compute event is related to the amount of computation and the number of operation units, and can be calculated as t_Compute = Data_Compute / (FU × u%), where t_Compute is the Compute event execution time, Data_Compute is the amount of computation involved in the Compute event, FU is the number of operation units in the operation module, and u% is the utilization rate of the operation units. The neuron sparsity, weight sparsity, network topology (such as the scale of the convolutional layers and of the fully connected layers), data blocking and other factors of the neural network all affect the utilization rate of the operation units, so the utilization rate is a real-time value. The utilization rate of the operation units is predicted by modeling: for example, for the convolution operation of a neural network, the utilization rates under several groups of different network-scale configurations and sparsity configurations are actually measured, the utilization rate is then modeled on these data, and finally a quantitative relation between the utilization rate and the network configuration and sparsity is obtained. Note that the execution times of Matrix Compute events, Vector Compute events and Scalar Compute events can all be calculated using the above formula.
The execution time of a Store event is calculated similarly to the Load time: it is related to the data amount and the off-chip memory bandwidth, and can be calculated as t_Store = t_start + Data_Store / Bandwidth_Store, where t_Store is the Store event execution time, t_start is the start-up time of the off-chip memory, Data_Store is the amount of data involved in the event (the output partial sums), and Bandwidth_Store is the Store bandwidth of the off-chip memory.
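The three timing formulas translate directly into code. The sketch below restates them as Python functions; units are whatever the caller uses consistently (e.g. bytes and bytes per cycle), and the utilization is passed as a fraction in (0, 1]:

```python
def load_time(t_start, data_load, bandwidth_load):
    """t_Load = t_start + Data_Load / Bandwidth_Load."""
    return t_start + data_load / bandwidth_load

def compute_time(data_compute, fu, utilization):
    """t_Compute = Data_Compute / (FU * u), with utilization u in (0, 1]."""
    return data_compute / (fu * utilization)

def store_time(t_start, data_store, bandwidth_store):
    """t_Store = t_start + Data_Store / Bandwidth_Store."""
    return t_start + data_store / bandwidth_store

# Example with made-up numbers: a 64 KB load at 32 B/cycle after a
# 100-cycle storage-module start-up.
print(load_time(100, 64 * 1024, 32))  # 2148.0 cycles
```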
As described in the foregoing embodiments, the embodiments of the present disclosure may determine the time required for each event according to the events included in each event process; for example, the time of a Store event may be obtained from the amount of data stored, and the time of a Compute event may be obtained from the amount and type of computation. The execution time of each event process may then be determined as the maximum execution time of the events in that event process, and the sum of the execution times of the event processes is determined as the operation time for completing the neural network operation instruction.
Meanwhile, after the execution time of each event in each event process is determined, the execution time of the event with the longest execution time in an event process can be used as the synchronization time of that event process, so that the next event process is executed only after every event in it has finished.
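Put together, the estimated operation time is the sum over event processes of each process's slowest event. A minimal sketch, assuming the per-event times have already been obtained with the formulas above:

```python
def process_time(event_times):
    """Events in one process have no dependencies and run concurrently,
    so the process finishes when its slowest event finishes."""
    return max(event_times)

def total_operation_time(processes):
    """Synchronize events serialize the processes, so the total operation
    time is the sum of the per-process maxima."""
    return sum(process_time(times) for times in processes)

# Example: three event processes with the listed per-event execution times.
print(total_operation_time([[4.0], [4.0, 6.0], [3.0, 6.0, 2.0]]))  # 16.0
```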
In addition, in the embodiment of the present disclosure, the events in each event process are events without dependency relationships, where events without a dependency relationship are events whose operations use non-overlapping data in the storage module. For example, Load/Store events have no dependency relationship with Compute events, Load events have no dependency relationship with Store events, and Matrix Compute events, Vector Compute events and Scalar Compute events have no dependency relationships with one another; events without dependency relationships may be executed simultaneously.
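The no-dependency condition can be checked mechanically as non-overlap of the address ranges the two events touch; a small sketch under that assumption:

```python
def ranges_overlap(addr_a, size_a, addr_b, size_b):
    """True if the half-open ranges [addr_a, addr_a + size_a) and
    [addr_b, addr_b + size_b) overlap, i.e. the events are dependent."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a

# A Load over [0, 1024) and a Store over [4096, 5120) touch disjoint data,
# so they have no dependency and may be executed simultaneously.
print(ranges_overlap(0, 1024, 4096, 1024))  # False
```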
The following describes the process in which the simulator simulates the operation module and the storage module executing each event process according to the embodiment of the present disclosure. FIG. 5 is a flowchart of step S300 in the simulation operation method according to the embodiment of the disclosure, where the obtaining of the operation result of completing the neural network operation instruction includes:
S3001: executing each of the determined event processes;
S3002: obtaining the operation result of the neural network operation according to the operation results of the event processes.
According to the embodiments of the present disclosure, the event processes can be executed in the determined order, and the operation results corresponding to the operation sub-instructions can be stored, so that the operation result corresponding to the neural network operation instruction is obtained.
FIG. 6 shows a process diagram of the simulation operation method according to an embodiment of the present disclosure. The embodiment of the disclosure can divide the neural network operation into a plurality of sub-operations by using a cyclic blocking (loop tiling) strategy, i.e., a plurality of operation sub-instructions can be parsed correspondingly, and the data corresponding to each sub-operation can be loaded entirely into the buffer of the operation module. Suppose the operation of a certain layer of the neural network is divided into N sub-operations corresponding to N data blocks. FIG. 6 shows the process of simulating the execution while counting the running time: the execution of the neural network is divided into N+2 steps (event processes), every two adjacent steps are separated by a Synchronize event, and the execution time of the neural network is finally the sum of the execution times of the N+2 steps (event processes). The blocking step is sketched below.
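The blocking itself amounts to picking N so that each data block fits in the operation module's buffer; a sketch under assumed names:

```python
import math

def split_into_sub_operations(layer_size, buffer_size):
    """Cyclic blocking: split one layer's operation into N data blocks,
    each small enough to be loaded entirely into the on-chip buffer."""
    n = math.ceil(layer_size / buffer_size)
    blocks = []
    for i in range(n):
        start = i * buffer_size
        blocks.append((start, min(buffer_size, layer_size - start)))
    return blocks

# A 10,000-element layer with a 4,096-element buffer yields N = 3 blocks.
print(split_into_sub_operations(10_000, 4_096))
```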
Among the N+2 event processes determined according to the operation sub-instructions in the embodiment of the present disclosure: the first event process includes a first load event and a first synchronization event; the second event process includes a second load event, a first operation event and a second synchronization event; the third event process includes a third load event, a second operation event, a first storage event and a third synchronization event; the ith event process includes an ith load event, an (i-1)th operation event, an (i-2)th storage event and an ith synchronization event; the (N+1)th event process includes an Nth operation event, an (N-1)th storage event and an (N+1)th synchronization event; and the (N+2)th event process includes an Nth storage event and an (N+2)th synchronization event, where i is an integer greater than 3 and less than or equal to N.
The jth load event is used for loading the operation data of the jth operation sub-instruction from the storage module to the operation module; the jth storage event is used for caching the operation result of the jth operation sub-instruction obtained by the operation module and storing it into the storage module; the jth operation event is used for executing the operation of the jth operation sub-instruction; and the kth synchronization event is used for synchronizing all events in the kth event process, where j is an integer greater than or equal to 1 and less than or equal to N, and k is a positive integer greater than or equal to 1 and less than or equal to N+2.
The following describes an operation procedure of the embodiment of the present disclosure with reference to fig. 6.
Step 1: the simulated control module 300 triggers the Load 1 event, i.e., it controls the operation module to read the data corresponding to the first sub-operation from the storage module 200; the control instruction sent includes the data storage address corresponding to the first sub-operation, and the operation module caches the data from the corresponding storage address according to the received instruction. The synchronization time of the first event process is determined as the execution time of the first load event. The Load 1 event loads the data corresponding to the first sub-operation from the storage module into the SRAM (cache) of the master operation unit 101 and of the slave operation unit. At this time, the master operation unit 101 and the slave operation unit 102 cannot compute because no data is cached in the operation module 100, and synchronization is performed by the first synchronization event of the first event process. The execution time of Step 1 is the execution time of the Load 1 event.
Step 2: the simulated control module triggers the Load 2 event and the Compute 1 event, i.e., it controls the operation module to read the data corresponding to the second sub-operation from the storage module 200 (the control instruction sent includes the data storage address corresponding to the second sub-operation) and to execute the operation corresponding to the first operation sub-instruction. The synchronization time of the second event process is determined as the longer of the execution time of the second load event and that of the first operation event. The Load 2 event loads the data corresponding to the second sub-operation from the storage module into the SRAMs of the master operation unit and the slave operation unit. Because the on-chip caches of the master and slave operation units already hold the data corresponding to the first sub-operation (after the Load 1 event of Step 1), the Compute 1 event is triggered and the master and slave operation units can complete the first sub-operation. The execution time of Step 2 is the maximum of the execution times of the Load 2 event and the Compute 1 event.
Step 3: the simulated control module triggers the Load 3 event, the Compute 2 event and the Store 1 event, i.e., it controls the operation module to read the data corresponding to the third sub-operation from the storage module 200 (the control instruction sent includes the data storage address corresponding to the third sub-operation), to execute the operation corresponding to the second operation sub-instruction, and to store the execution result of the first operation sub-instruction to the storage module. The synchronization time of the third event process is determined as the longest of the execution time of the third load event, that of the second operation event and that of the first storage event. The Load 3 event loads the data corresponding to the third sub-operation from the storage module into the SRAMs of the master operation unit and the slave operation unit; the Compute 2 event executes the second sub-operation on the master and slave operation units, which is possible because their caches already hold the data corresponding to the second sub-operation; and the Store 1 event stores the result of the first sub-operation from the on-chip cache into the storage module. The execution time of Step 3 is the maximum of the execution times of these three events.
By analogy, in Step i+1 the simulated control module triggers the Load i+1 event, the Compute i event and the Store i-1 event; the control module 300 controls the operation module to load the data required by the (i+1)th sub-operation from the storage module, to compute the ith sub-operation, and to store the operation result of the (i-1)th sub-operation into the storage module. That is, the data corresponding to the (i+1)th sub-operation is loaded from the storage module into the on-chip cache, the ith sub-operation is computed, and the (i-1)th data block is stored from the on-chip cache into the storage module. The synchronization time of the synchronization event of Step i+1 is the maximum of the execution times of the Load i+1, Compute i and Store i-1 events.
Step N: the simulated control module triggers the Load N event, the Compute N-1 event and the Store N-2 event; the control module 300 controls the operation module to load the data required by the Nth sub-operation from the storage module, to compute the (N-1)th sub-operation, and to store the operation result of the (N-2)th sub-operation into the storage module. That is, the data corresponding to the Nth sub-operation is loaded from the storage module into the on-chip cache, the (N-1)th sub-operation is computed, and the (N-2)th data block is stored from the on-chip cache into the storage module. From Step 1 to Step N, all the data related to the N sub-operations of the neural network is loaded into the on-chip cache. The synchronization time of the Nth synchronization event is the longest of the execution times of the Load N, Compute N-1 and Store N-2 events.
Step N+1: the simulated control module triggers the Compute N event and the Store N-1 event. The control module 300 may control the operation module to compute the Nth sub-operation and store the operation result of the (N-1)th sub-operation into the storage module; that is, the Nth sub-operation is computed, and the (N-1)th data block is stored from the on-chip cache into the storage module. From Step 2 to Step N+1, the computation tasks of the N sub-operations of the neural network are completed. The synchronization time of the (N+1)th synchronization event is the longer of the execution times of the Compute N event and the Store N-1 event.
Step N+2: the simulated control module triggers the Store N event to store the computation result of the Nth sub-operation from the on-chip cache into the storage module; that is, the control module 300 may control the operation module to store the operation result of the Nth sub-operation into the storage module. From Step 3 to Step N+2, the computation results of the N sub-operations of the neural network are stored into the storage module, and all the operations of the neural network are completed. The synchronization time of the (N+2)th synchronization event is the execution time of the Store N event.
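The whole Step 1 to Step N+2 walkthrough can be condensed into a single loop that accumulates, for each step, the execution time of its slowest event. This sketch mirrors FIG. 6 with assumed inputs:

```python
def simulate_total_time(n, load_t, compute_t, store_t):
    """Accumulate the per-step maxima over Steps 1..N+2.

    load_t, compute_t and store_t are lists giving, for 1-based sub-operation
    index j, the execution times of its Load, Compute and Store events.
    """
    total = 0.0
    for step in range(1, n + 3):
        times = []
        if step <= n:
            times.append(load_t[step - 1])       # Load `step`
        if 1 <= step - 1 <= n:
            times.append(compute_t[step - 2])    # Compute `step - 1`
        if 1 <= step - 2 <= n:
            times.append(store_t[step - 3])      # Store `step - 2`
        total += max(times)                      # Synchronize: wait for slowest
    return total

# Three sub-operations with uniform event times: 2 + 5 + 5 + 5 + 1 = 18.
print(simulate_total_time(3, [2.0] * 3, [5.0] * 3, [1.0] * 3))
```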
Through the configuration, the simulation process of the operation of the neural network can be realized based on the event triggering mode, and the simulation operation of the neural network can be quickly carried out.
In summary, the embodiments of the present disclosure can form a plurality of matched event processes according to the neural network operation to be executed and execute the events in the order of the event processes, so that each operation is executed in order; the operation process can be executed in an event-triggered manner, so that performance simulation of the neural network operation can be performed quickly. The simulation is performed in units of events: an event is usually a user-defined event, and each event can trigger one simulation operation, so that the neural network operation process can be executed effectively, the result of each process can be obtained, and the simulated execution process can be analyzed conveniently.
It will be understood by those skilled in the art that, in the above method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a simulator, an electronic device, a machine learning chip, a computer-readable storage medium, and a program, all of which can be used to implement any one of the simulation operation methods provided by the present disclosure; the corresponding technical solutions can be found in the descriptions of the method section and are not repeated here.
The disclosed embodiment provides a simulator, which includes: a processor for performing the method of any one of the first aspect.
Embodiments of the present disclosure also provide a machine learning chip including instructions for performing the method according to any one of the first aspect.
In some possible implementations, the embodiment of the disclosure further provides a chip packaging structure, which includes the above chip.
In some possible embodiments, the present disclosure further provides a board card, which includes the above chip package structure and may further include other components besides the chip, including but not limited to: a memory device, an interface device and a control device.
The memory device is connected with the chip in the chip packaging structure through a bus and used for storing data. The memory device may include a plurality of groups of memory cells. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units, and each group of the storage units may include a plurality of DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, where 64 bits of each controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
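As a quick sanity check of that figure (an arithmetic note, not from the original text): DDR4-3200 performs 3200 million transfers per second on a 64-bit (8-byte) data path, and 3200 MT/s × 8 B = 25600 MB/s.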
In one embodiment, each group of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface: the data to be processed is transmitted to the chip by the server through the standard PCIe interface, realizing data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of the other interfaces, as long as the interface unit can implement the switching function. In addition, the calculation results of the chip are still transmitted back to the external device (e.g. the server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads; accordingly, the chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the plurality of processing chips, processing cores, and/or processing circuits in the chip.
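One way to picture this regulation is a simple polling loop on the MCU; the register read, the threshold, and all function names below are invented for illustration, since the patent only specifies that the control device monitors the chip state over SPI.

```python
import random  # stands in for a real SPI transport in this sketch

def spi_read_load() -> float:
    """Pretend SPI read of the chip's current load (0.0 to 1.0)."""
    return random.random()

def regulate_once(light_load_threshold: float = 0.3) -> str:
    # Switch the chip between working states based on the observed load.
    load = spi_read_load()
    return "light-load state" if load < light_load_threshold else "multi-load state"

print(regulate_once())
```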
In some embodiments, an electronic device is further provided, which includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, there is also provided a computer readable storage medium having stored therein computer program instructions which, when executed by a processor, implement the method of any one of the first aspects.
Having described embodiments of the present disclosure, the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over techniques in the marketplace, and to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A simulation operation method for simulating execution of a neural network operation, comprising:
receiving and storing operation data, wherein the operation data comprises a neural network operation instruction and data for executing the neural network operation instruction;
parsing a plurality of operation sub-instructions from the neural network operation instruction and determining a plurality of event processes for executing the plurality of operation sub-instructions, each event process comprising at least one of a load event, an operation event, a store event, and a synchronization event;
and obtaining, based on the determined event processes, at least one of an operation time and an operation result of completing the neural network operation instruction.
2. The method of claim 1, wherein parsing a plurality of operation sub-instructions from the neural network operation instruction comprises:
executing a decoding operation on the received neural network operation instruction to obtain the plurality of operation sub-instructions.
3. The method of claim 1, wherein determining the plurality of event processes for executing the plurality of operation sub-instructions comprises at least one of:
determining the number of the event processes according to the number of the operation sub-instructions;
and determining each event process and the execution sequence of each event process according to the execution sequence of each operation sub-instruction.
4. The method of claim 3, wherein determining the number of the event processes according to the number of the operation sub-instructions comprises:
when N operation sub-instructions are parsed from the neural network operation instruction, determining the number of the event processes to be N+2, wherein N is a positive integer greater than 1.
5. The method of claim 3, wherein determining each of the event processes and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions comprises:
determining a first event process, the first event process comprising a first load event;
determining a second event process, the second event process comprising a second load event and a first operation event;
determining a third event process, wherein the third event process comprises a third load event, a second operation event and a first store event;
determining an ith event process, wherein the ith event process comprises an ith load event, an (i-1)th operation event and an (i-2)th store event;
determining an (N+1)th event process, wherein the (N+1)th event process comprises an Nth operation event and an (N-1)th store event;
determining an (N+2)th event process, wherein the (N+2)th event process comprises an Nth store event, i is an integer greater than 3 and less than or equal to N, and N is the number of the operation sub-instructions and is a positive integer greater than 1;
wherein a jth load event is used for loading operation data of a jth operation sub-instruction, a jth store event is used for storing an operation result of the jth operation sub-instruction, a jth operation event is used for executing the operation of the jth operation sub-instruction, and j is a positive integer greater than 0 and less than or equal to N.
6. The method of claim 5, wherein determining each of the event processes and the execution sequence of the event processes according to the execution sequence of the operation sub-instructions further comprises:
wherein each event process further comprises a synchronization event, and the synchronization time of the synchronization event of each event process is determined according to the execution times of the events in that event process.
7. The method of claim 1, wherein obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes comprises:
determining the execution time of each event process according to the execution times of the events in that event process;
and obtaining the operation time for completing the neural network operation instruction according to the execution times of the event processes.
8. The method of claim 1, wherein obtaining at least one of an operation time and an operation result of completing the neural network operation instruction based on the determined event processes comprises:
executing each of the determined event processes;
and obtaining the operation result of the neural network operation according to the operation result of each event process.
9. The method of claim 8, further comprising:
when executing each event process, determining an operation program for executing the operation event according to the operation type of the operation sub-instruction corresponding to the operation event in the event process;
and executing the corresponding operation event according to the determined operation program.
10. The method of claim 9, wherein determining the operation program for executing the operation event according to the operation type comprises:
when the operation sub-instruction corresponding to the operation event is a first type of operation, executing the operation event by using a first operation program;
when the operation sub-instruction corresponding to the operation event is a second type of operation, executing the operation event by using a second operation program;
wherein the first type of operation includes at least one of a vector operation, a scalar operation, and a non-linear operation, and the second type of operation includes a matrix scalar operation.
11. A simulator, comprising:
a processor for performing the method of any one of claims 1-10.
12. A machine learning chip, wherein the machine learning chip is configured to perform the method of any one of claims 1-10.
13. An electronic device, characterized in that it comprises a chip according to claim 12.
14. A computer readable storage medium having computer program instructions stored therein, which when executed by a processor implement the method of any one of claims 1 to 10.
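To make the claimed flow concrete, the event pipeline of claims 1 to 7 above can be sketched in Python as follows. This is a minimal illustration only: every identifier (Event, EventProcess, decode, build_event_processes, operation_time), the ';'-separated stand-in for an instruction stream, and the unit event costs are assumptions of the sketch, not definitions from the patent.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of claims 1-7; all names and costs are assumptions.

@dataclass
class Event:
    kind: str           # "load", "operation", "store" or "synchronization"
    sub_instr: int      # index of the operation sub-instruction it serves
    time: float = 1.0   # assumed unit cost; a real simulator would model this

@dataclass
class EventProcess:
    events: List[Event] = field(default_factory=list)

def decode(nn_instruction: str) -> List[str]:
    # Claim 2: the sub-instructions come from decoding the neural network
    # operation instruction; a ';'-separated string stands in for real code.
    return [s for s in nn_instruction.split(";") if s]

def build_event_processes(n: int) -> List[EventProcess]:
    # Claims 4-6: N sub-instructions yield N+2 event processes, pipelined so
    # that process i loads data for sub-instruction i while executing the
    # operation of i-1 and storing the result of i-2.
    assert n > 1
    processes = []
    for i in range(1, n + 3):                          # processes 1 .. N+2
        p = EventProcess()
        if i <= n:
            p.events.append(Event("load", i))          # ith load event
        if 2 <= i <= n + 1:
            p.events.append(Event("operation", i - 1)) # (i-1)th operation event
        if 3 <= i <= n + 2:
            p.events.append(Event("store", i - 2))     # (i-2)th store event
        # Claim 6: every process also carries a synchronization event whose
        # time follows from the events above (unit cost assumed here).
        p.events.append(Event("synchronization", i))
        processes.append(p)
    return processes

def operation_time(nn_instruction: str) -> float:
    # Claim 7, under one plausible reading: events inside a process overlap,
    # so a process takes as long as its slowest event, and processes run
    # back to back, so their times add up.
    processes = build_event_processes(len(decode(nn_instruction)))
    return sum(max(e.time for e in p.events) for p in processes)

print(operation_time("conv;pool;fc;softmax"))   # N=4 -> 6 processes -> 6.0
```

For N = 4 this reproduces the schedule claim 5 spells out: load 1 | load 2 + operation 1 | load 3 + operation 2 + store 1 | load 4 + operation 3 + store 2 | operation 4 + store 3 | store 4, i.e. N+2 = 6 event processes.

Claims 9 and 10 select the operation program from the operation type of the sub-instruction behind each operation event. A minimal dispatch, again with invented type labels and placeholder programs, might look like:

```python
# Minimal dispatch for claims 9-10. The type labels and the two "operation
# programs" are placeholders; the patent only fixes the first class (vector,
# scalar, non-linear operations) and the second class (matrix scalar operations).

FIRST_CLASS = {"vector", "scalar", "nonlinear"}
SECOND_CLASS = {"matrix_scalar"}

def first_operation_program(sub_instr: int) -> str:
    return f"first program executed sub-instruction {sub_instr}"

def second_operation_program(sub_instr: int) -> str:
    return f"second program executed sub-instruction {sub_instr}"

def execute_operation_event(sub_instr: int, op_type: str) -> str:
    if op_type in FIRST_CLASS:
        return first_operation_program(sub_instr)
    if op_type in SECOND_CLASS:
        return second_operation_program(sub_instr)
    raise ValueError(f"unknown operation type: {op_type}")

print(execute_operation_event(1, "vector"))         # first program
print(execute_operation_event(2, "matrix_scalar"))  # second program
```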

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910097439.3A CN111506384B (en) 2019-01-31 2019-01-31 Simulation operation method and simulator

Publications (2)

Publication Number Publication Date
CN111506384A (en) 2020-08-07
CN111506384B (en) 2022-12-09

Family

ID=71875632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910097439.3A Active CN111506384B (en) 2019-01-31 2019-01-31 Simulation operation method and simulator

Country Status (1)

Country Link
CN (1) CN111506384B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104205064A (en) * 2012-03-16 2014-12-10 国际商业机器公司 Transformation of a program-event-recording event into a run-time instrumentation event
CN104335232A (en) * 2012-05-30 2015-02-04 高通股份有限公司 Continuous time spiking neural network event-based simulation
CN108369660A (en) * 2015-07-13 2018-08-03 索邦大学 The data processing equipment of numerical value is indicated with the time interval between event
US20180150757A1 (en) * 2016-11-29 2018-05-31 International Business Machines Corporation Accurate temporal event predictive modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHI Xiaoli et al., "Modeling of Data Parallel Computer Architecture", Computer Engineering *
LAI Xin et al., "Design of a Memory Architecture Supporting Thread-Level Speculation", Computer Engineering *

Similar Documents

Publication Publication Date Title
US11449745B2 (en) Operation apparatus and method for convolutional neural network
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
CN109284825B (en) Apparatus and method for performing LSTM operations
US5524175A (en) Neuro-computer system for executing a plurality of controlling algorithms
JP7078758B2 (en) Improving machine learning models to improve locality
US20060130029A1 (en) Programming language model generating apparatus for hardware verification, programming language model generating method for hardware verification, computer system, hardware simulation method, control program and computer-readable storage medium
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
US20210350230A1 (en) Data dividing method and processor for convolution operation
US20130013283A1 (en) Distributed multi-pass microarchitecture simulation
US8140314B2 (en) Optimal bus operation performance in a logic simulation environment
CN114626516A (en) Neural network acceleration system based on floating point quantization of logarithmic block
CN109446740B (en) System-on-chip architecture performance simulation platform
Cho et al. FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks
Galicia et al. Neurovp: A system-level virtual platform for integration of neuromorphic accelerators
US7110934B2 (en) Analysis of the performance of a portion of a data processing system
CN111506384B (en) Simulation operation method and simulator
US11176018B1 (en) Inline hardware compression subsystem for emulation trace data
Diamantopoulos et al. A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping
CN116149917A (en) Method and apparatus for evaluating processor performance, computing device, and readable storage medium
CN108846248B (en) Application modeling and performance prediction method
Hoefer et al. SiFI-AI: A Fast and Flexible RTL Fault Simulation Framework Tailored for AI Models and Accelerators
CN111143208B (en) Verification method for assisting FPGA to realize AI algorithm based on processor technology
CN111950219B (en) Method, apparatus, device and medium for realizing simulator
CN114021733A (en) Model training optimization method and device, computer equipment and storage medium
CN110135572B (en) SOC-based trainable flexible CNN system design method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant