CN116128016A - Time cost calculation method and system for neural network processor and readable storage medium - Google Patents


Info

Publication number: CN116128016A
Application number: CN202310096961.6A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 王东
Assignee (original and current): Shenzhen Intellifusion Technologies Co Ltd
Legal status: Pending
Prior art keywords: time, unit, carrying, instruction, execution


Classifications

    • G06N3/04 — Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
    • G06N3/08 — Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Learning methods
    • Y02D10/00 — Climate change mitigation technologies in information and communication technologies; Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a time cost calculation method and system for a neural network processor, and a readable storage medium. The execution time of the carrying unit and the execution time of the vector calculation unit are calculated on a behavior-level simulator. The execution time of the carrying unit comprises: the time the carrying unit spends executing a transfer, the carrying unit update time, and the time at which the carrying unit issues a release-wait instruction. The execution time of the vector calculation unit comprises: the time the vector calculation unit spends executing computation, the vector calculation unit update time, and the time at which the vector calculation unit issues a release-wait instruction. The two execution times are calculated in parallel. In the chip design stage, the behavior-level simulator yields instruction time costs equal to the execution times on a real chip, so that accurate performance data are obtained and a data basis is provided for chip architecture optimization.

Description

Time cost calculation method and system for neural network processor and readable storage medium
Technical Field
The invention relates to the field of simulators, and in particular to a time cost calculation method and system for a neural network processor, and a readable storage medium.
Background
Performance data for the vector units of a neural network processor are key to guiding performance tuning of the compiler framework that targets them. The chip designer performs architecture tuning based on the architectural defects that the performance data expose, so that the compiler framework gains more efficient processing capability. However, performance data for the vector units of a neural network processor can only be measured on a finished chip; this lag means they cannot guide architecture optimization during the chip design stage. Simulating the complex workflow of the neural network processor's vector unit at the design stage, without a finished chip, and obtaining accurate performance data for it, in particular the time cost within those data, is therefore a key factor constraining design-time optimization of the neural network processor architecture.
Disclosure of Invention
To solve the technical problem in the prior art that accurate performance data are difficult to obtain for the complex workflow of a neural network processor's vector unit, the invention provides a time cost calculation method and system for a neural network processor, and a readable storage medium, which obtain accurate performance data in the chip design stage to guide chip architecture optimization.
The invention provides a time cost calculation method for a neural network processor, in which the execution time of the carrying unit and the execution time of the vector calculation unit are calculated on a behavior-level simulator;
the execution time of the carrying unit comprises: the time the carrying unit spends executing a transfer, the carrying unit update time, and the time at which the carrying unit issues a release-wait instruction;
the execution time of the vector calculation unit comprises: the time the vector calculation unit spends executing computation, the vector calculation unit update time, and the time at which the vector calculation unit issues a release-wait instruction;
the execution time of the carrying unit and the execution time of the vector calculation unit are calculated in parallel.
Preferably, the transfer execution time of the carrying unit is calculated while the carrying unit is in the transfer-execution state, and the time cost calculation for that state comprises: pushing the number of cycles required for the carrying unit to complete the current transfer to the tail of the cache that stores transfer-completion time points, as the end time of the current transfer.
Preferably, the carrying unit update time is calculated while the carrying unit is in the updated state. The carrying unit is in the updated state when:
the carrying unit has not started a transfer or has completed its transfer, no transport instruction is currently configured for it, and it has completed at least one transfer;
or the carrying unit receives a release-wait instruction issued by the vector calculation unit, or by another carrying unit that transfers data in the opposite direction.
Preferably, the time cost calculation for the carrying unit in the updated state comprises: updating the value at the head of the cache that stores transfer-completion time points into the register that counts transfer execution time, and deleting that head value from the cache.
Preferably, the time at which the carrying unit issues the release-wait instruction is calculated while the carrying unit is in the release-wait-issuing state, and the calculation comprises: saving the actual cycle count at which the carrying unit issues the release-wait instruction into the cache that stores the data required for the handshake with the target execution unit.
Preferably, the vector calculation unit update time is calculated while the vector calculation unit is in the updated state, and the time cost calculation for that state comprises: updating the data in the cache that stores the data required for the handshake between the vector calculation unit and the carrying unit into the register that counts computation execution time.
Preferably, the time cost for the vector calculation unit to issue a release-wait instruction is calculated while the vector calculation unit is in the release-wait-issuing state, and comprises:
when the vector calculation unit's pipeline parses the current instruction as a release-wait instruction, recording the value at the current moment of the register that counts computation execution time.
Preferably, the carrying instructions include an instruction that transfers off-chip data into on-chip memory and an instruction that transfers on-chip data to off-chip memory.
The invention also provides a time cost calculation system of the neural network processor, which comprises:
a behavior-level simulator,
a first time cost calculation unit for calculating an execution time of the carrying unit,
a second time cost calculation unit for calculating an execution time of the vector calculation unit;
the first time cost calculation unit is configured to calculate the time the carrying unit spends executing a transfer, the carrying unit update time, and the time at which the carrying unit issues a release-wait instruction;
the second time cost calculation unit is configured to calculate the time the vector calculation unit spends executing computation, the vector calculation unit update time, and the time at which the vector calculation unit issues a release-wait instruction;
the first time cost calculation unit and the second time cost calculation unit operate in parallel.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described above.
Compared with the prior art, the time cost calculation method and system and the readable storage medium provided by the embodiments of the invention use a behavior-level simulator in the chip design stage to execute multiple instructions in parallel and to calculate the execution time of each instruction type separately. The simulator yields accurate time costs equal to the execution times of the corresponding instructions on the chip, so that chip performance parameters are obtained accurately during the design stage, providing a data basis for the chip designer to perform architecture optimization, at design time, against the architectural defects that the performance data expose.
The foregoing is only an overview of the technical solutions of the present invention. So that the technical means of the invention may be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features and advantages become more apparent, detailed embodiments are given below.
Drawings
One or more embodiments are illustrated by way of example, and not limitation, in the figures of the accompanying drawings, in which like references indicate similar elements; the figures are not to be taken as limiting unless otherwise indicated.
FIG. 1 is a flowchart of a method for calculating time cost of a neural network processor according to an embodiment of the present invention;
FIG. 2 is a flowchart of a neural network processor vector unit according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a data handling flow according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a vector calculation flow provided in an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that numerous technical details are set forth in the embodiments to give the reader a better understanding of the present application; the claimed technical solutions can, however, be implemented without these details, and with various changes and modifications based on the following embodiments. The embodiments are divided for convenience of description only, should not be construed as limiting the specific implementation of the invention, and may be combined and cross-referenced where no contradiction arises.
A first embodiment of the present invention provides a time cost calculating method of a neural network processor, please refer to fig. 1.
The instructions for the vector unit inside a neural network processor are divided into three types:
vector compute instructions, such as add (vadd), subtract (vsub), multiply (vmul), shift (vasr/vlsr/vasl/vlsl), compare (vcmp), move (vmovi2v/vmovicc2v, etc.), and memory (vld/vst) instructions;
transport instructions, such as edma_cfg, edma_ext_addr, edma_burst, and the like;
synchronization instructions, such as unblock and wait_on.
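The three-way split above can be sketched as a small classification table. This is an illustrative sketch only: the mnemonics are taken from the text, while the helper name and the category strings are assumptions, not part of the patent.

```python
# Instruction mnemonics as listed in the patent text; the sets and the
# classify() helper are illustrative assumptions.
COMPUTE = {"vadd", "vsub", "vmul", "vasr", "vlsr", "vasl", "vlsl",
           "vcmp", "vmovi2v", "vmovicc2v", "vld", "vst"}
TRANSPORT = {"edma_cfg", "edma_ext_addr", "edma_burst"}
SYNC = {"unblock", "wait_on"}

def classify(mnemonic: str) -> str:
    """Return the instruction category that later selects the simulation mode."""
    if mnemonic in COMPUTE:
        return "compute"      # simulated cycle-accurately
    if mnemonic in TRANSPORT:
        return "transport"    # simulated at behavior level
    if mnemonic in SYNC:
        return "sync"         # simulated cycle-accurately
    raise ValueError(f"unknown mnemonic: {mnemonic}")
```

The category then decides whether an instruction's time on the simulator equals its on-chip time (compute and sync) or must be accounted separately (transport), as described next.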
Based on the functions and performance of the behavior-level simulator, the vector compute instructions and the synchronization instructions are simulated cycle-accurately: the time they take on the behavior-level simulator equals the time they would take on the chip, so real time is obtained directly on the simulator. The transport instructions, by contrast, are simulated at behavior level: once a transport instruction is parsed, the bus transfer is modeled as completing immediately, and only the parsing itself is cycle-accurate.
Referring to fig. 2, the workflow of a neural network processor vector unit according to an embodiment includes two carrying units, eidma0 and eidma1, that transfer off-chip data into on-chip memory, and two carrying units, eodma0 and eodma1, that transfer on-chip data to off-chip memory. Synchronization instructions allow multiple instructions to run in parallel, including at least: an eidma transport instruction that transfers off-chip data into on-chip memory, a vector compute instruction, and an eodma transport instruction that transfers on-chip data off-chip. The transfer time of the eidma transport instruction is behavior-level time, i.e. it is not the real transfer time the instruction would take on the chip; the vector compute instruction's time is cycle-level, i.e. it is the real time of the instruction on the chip.
The time cost simulation of the neural network vector processing unit is based on a behavior-level simulator: in the chip design stage, chip instructions are simulated by the behavior-level simulator instead of a finished chip, and cycle counts close to finished-chip test data are obtained as the time cost, giving the chip designer a basis for architecture optimization, at design time, against the architectural defects that the performance data expose.
Specifically, calculating the time cost of the neural network vector processing unit means calculating, on the behavior-level simulator, the execution time of the carrying unit and the execution time of the vector calculation unit. The execution time of the carrying unit comprises: the time the carrying unit spends executing a transfer, the carrying unit update time, and the time at which the carrying unit issues a release-wait instruction. The execution time of the vector calculation unit comprises: the time the vector calculation unit spends executing computation, the vector calculation unit update time, and the time at which the vector calculation unit issues a release-wait instruction. The two execution times are calculated in parallel.
Specifically, the carrying unit's transfer execution time is calculated while the carrying unit is in the transfer-execution state; its update time is calculated while it is in the updated state; and the time at which it issues a release-wait instruction is calculated while it is in the release-wait-issuing state. Likewise, the vector calculation unit's computation execution time is calculated while it is in the computation state, its update time while it is in the updated state, and its release-wait issue time while it is in the release-wait-issuing state. The order in which the carrying unit executes transfers, updates, and issues release-wait instructions depends on how the software is programmed; as an alternative embodiment, they may execute in parallel. The same holds for the order of the vector calculation unit's computation, update, and release-wait issuance.
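The per-state accounting just described can be sketched as a dispatch on the unit's current state. All names here are illustrative assumptions; the three branches simply mirror the three time components named in the text.

```python
# Minimal sketch of the state dispatch: each simulator step, a unit's
# current state selects which of its three time components is accounted.
# State names and handler strings are illustrative, not from the patent.
def dispatch(state, handlers):
    """Invoke the time-accounting handler matching the unit's state."""
    if state not in handlers:
        raise ValueError(f"unknown state: {state}")
    return handlers[state]()

carrying_handlers = {
    "executing_transfer": lambda: "account transfer execution time",
    "updating":           lambda: "account update time",
    "issuing_release":    lambda: "account release-wait issue time",
}
```

An equivalent table would hold for the vector calculation unit's computation, update, and release-wait states.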
In some embodiments, the carrying instructions include an instruction that transfers off-chip data into an on-chip cache and an instruction that transfers on-chip data to an off-chip cache; the cycle counts of the two are calculated in the same way. With the transport-instruction registers and caches prepared, the time calculation specifically comprises: calculating the transfer time executed by the carrying unit, calculating the carrying unit update time, and calculating the time at which the carrying unit issues the release-wait instruction.
Referring to fig. 3, an embodiment of a specific transfer flow is shown. Parsing the first transport instruction eidma takes 10 cycles; the eidma carrying unit on the behavior-level simulator then starts the transfer at time 11 and completes it within that same simulator cycle. Parsing the second transport instruction eidma also takes 10 cycles, i.e. actual times 10 to 19; the eidma carrying unit on the simulator starts this transfer at time 21 and again completes it within the same simulator cycle. On the finished chip, executing the transport instructions performs two real transfers: the first takes 150 cycles, from real time 10 to 159, and the second takes 200 cycles, from real time 160 to 359.
In an actual neural network processor, vector compute instructions and transport instructions execute in parallel to improve the execution efficiency of the vector unit, which makes the actual execution time of the vector unit difficult to calculate. Under the complex vector-unit workflow of the embodiment above, some instructions are simulated cycle-accurately while others are simulated at behavior level, so the behavior-level simulator cannot directly determine the actual execution time on the chip.
In the workflow of the vector unit of the neural network processor provided in the above embodiment, calculating the time for which the carrying unit performs transfers includes the following steps:
s1, preparing a carrying instruction register and a plurality of caches, wherein the carrying instruction register comprises:
a register eidmacycle for counting the carrying execution time;
a buffer eidmatransfer end fifo for storing the conveyance completion time point;
the buffer memory eidmaub vu fifo and the buffer memory eodama ub vu fifo for storing the data required by the handshake between the handling unit and the vector computing unit are used for storing the buffer memory eidmaub eodama fifo of the data required by the handshake between the handling unit and the handling unit;
eidmacycle bk for storing the number of execution cycles when the handling unit is completed.
The above-mentioned caches are all first-in first-out principle.
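The registers and FIFO caches of step S1 can be sketched as a small state container. The field names follow the patent's identifiers; the container class itself and the use of Python deques for the FIFOs are illustrative assumptions.

```python
from collections import deque

# Sketch of the step-S1 state for the eidma carrying unit. Field names are
# the patent's identifiers; the class structure is an assumption.
class EidmaTimeState:
    def __init__(self):
        self.eidma_cycle = 0                    # register: counts transfer execution time
        self.eidma_cycle_bk = 0                 # register: cycle count when a transfer completes
        self.eidma_transfer_end_fifo = deque()  # FIFO: transfer-completion time points
        self.eidma_ub_vu_fifo = deque()         # FIFO: eidma <-> vector-unit handshake data
        self.eodma_ub_vu_fifo = deque()         # FIFO: eodma <-> vector-unit handshake data
        self.eidma_ub_eodma_fifo = deque()      # FIFO: eidma <-> eodma handshake data
```

A deque appended at the tail and popped at the head reproduces the first-in, first-out behavior the text requires of every cache.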
S2, judging the state of the carrying unit: if the carrying unit is in the transfer-execution state, calculate the transfer execution time; if it is in the updated state, calculate the update time; if it is in the release-wait-issuing state, calculate the time at which it issues the release-wait instruction. The three calculations proceed in parallel according to the carrying unit's state.
A. Calculation of the transfer execution time when the carrying unit is in the transfer-execution state
When an eidma transport instruction executes in the current transfer period, its transfer time is calculated from two data items configured by the instruction: the transfer data volume and the bus bandwidth. The specific formula for the transfer time does not affect the technical effect of this embodiment. The transfer end time is pushed to the tail of the cache eidma_transfer_end_fifo, and the value of eidma_cycle_bk at that moment is recorded into the register eidma_cycle.
Take the transfer flow shown in fig. 3 as an example. Transfer 0 starts at time point 10; if it requires 150 cycles, it ends at time 160, and 160 is stored at the tail of eidma_transfer_end_fifo. Transfer 1 starts at time point 160; assuming it requires 200 cycles, it ends at time 360, and 360 is in turn stored at the tail of eidma_transfer_end_fifo when the transfer ends.
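The Fig. 3 example above can be sketched directly: each transfer's end time is its start time plus its cycle count, pushed to the tail of the FIFO. The helper name is an illustrative assumption; the numbers are the ones from the example.

```python
from collections import deque

# Illustrative sketch of step A: push each transfer's end time to the tail
# of eidma_transfer_end_fifo. The cycle counts (150, 200) and start times
# (10, 160) follow the Fig. 3 example.
def record_transfer(end_fifo: deque, start: int, cycles: int) -> int:
    end = start + cycles
    end_fifo.append(end)  # push to the FIFO tail
    return end

fifo = deque()
record_transfer(fifo, 10, 150)   # transfer 0: 10 + 150 -> ends at 160
record_transfer(fifo, 160, 200)  # transfer 1: 160 + 200 -> ends at 360
```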
B. Calculation of the update time when the carrying unit is in the updated state
The carrying unit is judged to be in the updated state when: the carrying unit has not started a transfer or its transfer is complete (i.e. it is idle), no transport instruction (e.g. edma_cfg, edma_ext_addr, edma_burst) is currently configured for it, and it has completed at least one transfer (i.e. eidma_transfer_end_fifo holds at least one entry);
or the carrying unit eidma receives a release-wait instruction issued by the vector calculation unit or by the carrying unit eodma.
When the carrying unit satisfies the updated-state judgment, the value at the head of the cache that stores transfer-completion time points is updated into the register that counts transfer execution time, and that head value is deleted from the cache.
Taking the transfer flow shown in fig. 3 as an example: the configuration of carrying unit eidma0 completes at time point 9, and eidma0 actually starts transferring at time point 10, its transport instruction assumed to require 150 cycles; the configuration of carrying unit eidma1 completes at time point 19, and eidma1 actually starts transferring at time point 160, its transfer assumed to require 200 cycles. After eidma0 and eidma1 have been configured through cycle-level simulation, the carrying unit satisfies the updated-state conditions: it has completed a transfer, and no transport instruction is currently configured. The head value of eidma_transfer_end_fifo is therefore updated into the register eidma_cycle that counts transfer execution time, and that head value is deleted from the FIFO. In the next cycle the carrying unit again satisfies the updated-state conditions, so the new head value of eidma_transfer_end_fifo is again updated into eidma_cycle, and again deleted from the FIFO.
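The two successive updates just described (160 consumed first, then 360) can be sketched as popping the FIFO head into the time-counting register. The helper name is an illustrative assumption.

```python
from collections import deque

# Illustrative sketch of step B: in the updated state, move the head of
# eidma_transfer_end_fifo into the register eidma_cycle and delete it.
def apply_update(end_fifo: deque, eidma_cycle: int) -> int:
    if end_fifo:  # updated state requires at least one completed transfer
        eidma_cycle = end_fifo.popleft()  # head value becomes the new cycle count
    return eidma_cycle

fifo = deque([160, 360])           # end times from the Fig. 3 example
cycle = apply_update(fifo, 0)      # first update consumes 160
cycle = apply_update(fifo, cycle)  # next cycle's update consumes 360
```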
C. Calculation of the time at which the carrying unit issues a release-wait instruction
When the carrying unit is in the release-wait-issuing state, the actual cycle count at which it issues the release-wait instruction is saved into the cache that stores the data required for the handshake with the target execution unit.
Taking the transfer flow shown in fig. 3 as an example, when the carrying unit eidma executes the instruction eidma.ub.vu or eidma.ub.eodma, the cycle count at which the carrying unit issues the unblock instruction is saved into the cache eidma_ub_vu_fifo (handshake data between a carrying unit and the vector calculation unit), the cache eodma_ub_vu_fifo (handshake data between a carrying unit and the vector calculation unit), or the cache eidma_ub_eodma_fifo (handshake data between two carrying units), as appropriate.
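Step C amounts to appending the current cycle count to the handshake FIFO that matches the instruction's target. The helper name is an illustrative assumption; the cycle value 160 follows the figures' timeline.

```python
from collections import deque

# Illustrative sketch of step C: when a release-wait (unblock) instruction
# is issued, save the issuing unit's current cycle count into the handshake
# FIFO for the target unit.
def issue_unblock(target_fifo: deque, current_cycle: int) -> None:
    target_fifo.append(current_cycle)

eidma_ub_vu_fifo = deque()
issue_unblock(eidma_ub_vu_fifo, 160)  # e.g. eidma.ub.vu issued at cycle 160
```

The vector calculation unit later reads this value when it enters its updated state, as described in the vector-calculation flow below.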
By combining these three parallel time-cost calculations for the carrying units' different states with the time records kept in the different caches (i.e. eidma_transfer_end_fifo for transfer-completion time points, eidma_ub_vu_fifo and eodma_ub_vu_fifo for the data required for the handshake between a carrying unit and the vector calculation unit, and eidma_ub_eodma_fifo for the data required for the handshake between two carrying units) and in eidma_cycle_bk, which stores the cycle count when a transfer completes, this embodiment enables the behavior-level simulator to obtain an actual time cost close to that of the finished chip.
Referring to fig. 4, an embodiment of a vector calculation flow is shown. The eidma transport instruction completes at time point 160, and the vector calculation unit starts executing the compute instruction at time point 160. Assuming the compute instruction requires 100 cycles, the vector calculation unit issues the release-wait instruction vu.ub.eodma to the eodma carrying module at time point 260, once the computation is complete.
In the workflow of the vector unit of the neural network processor provided in the above embodiment, calculating the time for which the vector calculation unit performs its work includes the following steps:
s1, preparing a calculation instruction register and a plurality of caches, wherein the method comprises the following steps:
register vu_cycle for counting the execution time;
the buffer memory vu_ub_eidma_fifo for storing data required by handshake between the vector calculation unit and the carrying unit;
and the real-time period register is used for storing the vu_cycle_bk when the calculation of the vector calculation unit is completed.
The above-mentioned caches are all first-in first-out principle.
S2, judging the state of the vector calculation unit: if it is in the updated state, calculate its update time; if it is in the release-wait-issuing state, calculate the time at which it issues the release-wait instruction. The calculations proceed in parallel according to the vector calculation unit's state.
A. Calculation of the update time when the vector calculation unit is in the updated state
If the vector calculation unit receives a release-wait instruction from a carrying unit (eidma or eodma; see vu.wo.eidma and eodma.wo.vu in fig. 4), it is considered to be in the updated state at the current moment, and its update time is calculated: the time point is updated into the register vu_cycle that counts computation execution time.
B. Calculation of the time at which the vector calculation unit issues a release-wait instruction
Because the compute instructions of the vector calculation unit are simulated cycle-accurately on the behavior-level simulator, i.e. consistently with the finished chip, the time point is recorded into the register vu_cycle that counts computation execution time when the vector calculation pipeline parses the current instruction as the release-wait instruction vu.ub.eidma or vu.ub.eodma. Specifically, when the release-wait instruction is vu.ub.eidma, the value from the cache vu_ub_eidma_fifo, which stores the data required for the handshake between the vector calculation unit and the eidma carrying unit, is recorded into vu_cycle; when the release-wait instruction is vu.ub.eodma, the value from the cache vu_ub_eodma_fifo, which stores the data required for the handshake between the vector calculation unit and the eodma carrying unit, is recorded into vu_cycle.
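The vector unit's update and release-wait steps can be sketched with the Fig. 4 numbers: the eidma handshake arrives at 160, the cycle-accurate compute takes 100 cycles, and vu.ub.eodma is issued at 260. The helper names and the rule that vu_cycle advances to the later of its current value and the unblock time are illustrative assumptions, not the patent's wording.

```python
from collections import deque

# Illustrative sketch of the vector unit's two accounted events.
def vu_receive_unblock(vu_cycle: int, unblock_time: int) -> int:
    # Assumed combination rule: computation may only proceed once the
    # unblock has arrived, so vu_cycle advances to the later time.
    return max(vu_cycle, unblock_time)

def vu_issue_unblock(vu_cycle: int, handshake_fifo: deque) -> int:
    handshake_fifo.append(vu_cycle)  # hand the current time to the peer unit
    return vu_cycle

vu_cycle = vu_receive_unblock(0, 160)  # updated state: eidma handshake at 160
vu_cycle += 100                        # cycle-accurate compute takes 100 cycles
vu_ub_eodma_fifo = deque()
vu_issue_unblock(vu_cycle, vu_ub_eodma_fifo)  # vu.ub.eodma issued at 260
```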
After the behavior-level simulator finishes execution, the execution times of the carrying units and the vector calculation unit are determined from the values of three registers: vu_cycle, which counts computation execution time; eidma_cycle, which counts eidma transfer execution time; and eodma_cycle, which counts eodma transfer execution time. Together these determine an execution time close to the model's real execution time on a finished chip. In addition, the caches that were set up (eidma_transfer_end_fifo for transfer-completion time points; eidma_ub_vu_fifo and eodma_ub_vu_fifo for the handshake data between a carrying unit and the vector calculation unit; eidma_ub_eodma_fifo for the handshake data between two carrying units; and vu_ub_eidma_fifo and vu_ub_eodma_fifo for the handshake data between the vector calculation unit and a carrying unit) record the start and end times of the transport and vector compute instructions, which can also help the architect optimize the architecture and improve software and hardware performance.
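The final readout from the three registers can be sketched as follows. Taking the maximum of the three values as the end-to-end time is an assumption about how the parallel units' times combine; the patent only states that the three registers together determine the execution time.

```python
# Illustrative sketch of the final readout after the behavior-level run.
# Combining the three registers with max() is an assumption: the units run
# in parallel, so the slowest one bounds the end-to-end time.
def total_execution_time(vu_cycle: int, eidma_cycle: int, eodma_cycle: int) -> int:
    return max(vu_cycle, eidma_cycle, eodma_cycle)
```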
A second embodiment of the present invention provides a time cost calculation system of a neural network processor, including:
a behavior-level simulator,
a first time cost calculation unit for calculating an execution time of the carrying unit,
a second time cost calculation unit for calculating an execution time of the vector calculation unit;
the first time cost calculation unit is configured to calculate the carrying time executed by the carrying unit, the update time of the carrying unit, and the time at which the carrying unit issues a release waiting instruction;
the second time cost calculation unit is configured to calculate the calculation time of the vector calculation unit, the update time of the vector calculation unit, and the time at which the vector calculation unit issues a release waiting instruction;
the first time cost calculating unit and the second time cost calculating unit are executed in parallel.
The carrying time executed by the carrying unit, the update time of the carrying unit, and the time at which the carrying unit issues a release waiting instruction are calculated according to whether the carrying unit is in the execution-carrying state, the update state, or the state of issuing a release waiting instruction. The execution order of carrying, updating, and issuing the release waiting instruction depends on the software programming manner; as an alternative embodiment, they may be executed in parallel;
Likewise, the calculation time of the vector calculation unit, the update time of the vector calculation unit, and the time at which the vector calculation unit issues a release waiting instruction are calculated according to whether the vector calculation unit is in the calculation state, the update state, or the state of issuing a release waiting instruction. The execution order of calculating, updating, and issuing the release waiting instruction depends on the software programming manner; as an alternative embodiment, they may be executed in parallel.
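The state-driven dispatch described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `State` enum and the handler-table API are assumptions introduced for clarity; only the three state names come from the description.

```python
from enum import Enum, auto

# Hypothetical state names mirroring the three states named in the
# description; the patent defines the states but not a programming API.
class State(Enum):
    CARRYING = auto()       # unit executing a carry / calculation
    UPDATING = auto()       # unit synchronising its cycle counter
    RELEASE_WAIT = auto()   # unit issuing a "release waiting" instruction

def step_unit(state, handlers):
    """Dispatch the per-state time-cost calculation for one unit.

    `handlers` maps each State to a callable that accounts the
    corresponding time. Which calculation runs first depends on how
    the software was programmed, and the calculations for independent
    units may run in parallel.
    """
    handlers[state]()

# usage sketch: record which calculations were triggered
log = []
handlers = {
    State.CARRYING: lambda: log.append("carry-time"),
    State.UPDATING: lambda: log.append("update-time"),
    State.RELEASE_WAIT: lambda: log.append("release-wait-time"),
}
step_unit(State.CARRYING, handlers)
step_unit(State.RELEASE_WAIT, handlers)
```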
As an alternative embodiment, the behavior-level simulator comprises two handling units, eidma0 and eidma1, for carrying off-chip data to on-chip memory, and two handling units, eodma0 and eodma1, for carrying on-chip memory data off chip. Synchronization instructions allow a plurality of instructions to run in parallel, the plurality of instructions comprising at least: an eidma carrying instruction that carries off-chip data to on-chip memory, a vector calculation instruction, and an eodma carrying instruction that carries on-chip data off chip.
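The parallel instruction streams can be pictured as one stream per unit, coordinated only by synchronization instructions. The tuple encoding and stream grouping below are assumptions for illustration; the unit and instruction names follow the description.

```python
# Sketch of the instruction mix the simulator runs in parallel; the
# (unit, opcode, note) tuple encoding is assumed, not from the patent.
program = [
    ("eidma0", "eidma.carry", "off-chip -> on-chip"),
    ("vu",     "vu.compute",  "vector calculation"),
    ("eodma0", "eodma.carry", "on-chip -> off-chip"),
    ("vu",     "vu.ub.eidma", "release eidma0 from waiting"),
]

# each unit owns an independent instruction stream, so the three kinds
# of instructions advance in parallel, coordinated only by the
# wait / release-waiting synchronization instructions
streams = {}
for unit, op, note in program:
    streams.setdefault(unit, []).append(op)
```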
In some embodiments, the handling unit includes instructions for carrying off-chip data to on-chip storage and instructions for carrying on-chip stored data off chip, both of which use the same cycle-number calculation method. Specifically, the first time cost calculation unit is configured to: prepare a carrying instruction register and a plurality of caches, calculate the carrying time executed by the carrying unit, calculate the update time of the carrying unit, and calculate the time at which the carrying unit issues a release waiting instruction.
Specifically, the first time cost calculation unit includes a storage unit comprising a carrying instruction register and a plurality of caches; specifically, the carrying instruction register and the plurality of caches include:
a register eidma_cycle for counting the carrying execution time;
a buffer eidma_transfer_end_fifo for storing the carrying completion time point;
buffers eidma_ub_vu_fifo and eodma_ub_vu_fifo for storing the data required for the handshake between the handling units and the vector calculation unit, and a buffer eidma_ub_eodma_fifo for storing the data required for the handshake between the handling unit eidma and the handling unit eodma;
a real-time cycle register eidma_cycle_bk for storing the number of execution cycles when the handling unit completes carrying.
The above-mentioned caches all follow the first-in first-out principle.
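The registers and first-in first-out caches prepared above can be sketched as a small container. The dataclass wrapper is an assumption introduced here; the field names follow the patent, and `collections.deque` provides the FIFO push-to-tail / pop-from-head behavior the description requires.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class EidmaTimeState:
    """Storage prepared by the first time cost calculation unit (sketch)."""
    eidma_cycle: int = 0      # register counting the carrying execution time
    eidma_cycle_bk: int = 0   # real-time cycle register
    # all caches below behave as first-in first-out queues
    eidma_transfer_end_fifo: deque = field(default_factory=deque)
    eidma_ub_vu_fifo: deque = field(default_factory=deque)
    eodma_ub_vu_fifo: deque = field(default_factory=deque)
    eidma_ub_eodma_fifo: deque = field(default_factory=deque)

state = EidmaTimeState()
state.eidma_transfer_end_fifo.append(120)          # push to tail
head = state.eidma_transfer_end_fifo.popleft()     # pop from head: FIFO order
```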
The first time cost calculation unit further comprises a judgment module, which is used for judging the state of the carrying unit; if the carrying unit is in the execution-carrying state, the update state, or the state of issuing a release waiting instruction, the time calculation module is triggered;
The first time cost calculation unit further includes a time calculation module, which is configured to calculate the execution time of the carrying instruction. Specifically, if the carrying unit is in the execution-carrying state, the carrying time is calculated; if the carrying unit is in the update state, the update time is calculated; if the carrying unit is in the state of issuing a release waiting instruction, the time of issuing the release waiting instruction is calculated. The execution order of these calculations depends on the software programming manner; as an alternative embodiment, they may be executed in parallel.
The time calculation module is used for calculating the carrying time when the carrying unit is in the execution-carrying state:
when the eidma transfer instruction is executed in the current transfer period, calculating the transfer time of the eidma transfer instruction according to the transfer data amount configured by the eidma transfer instruction and the two groups of data of the bus bandwidth, wherein the specific calculation method of the transfer time does not affect the realization of the technical effect of the embodiment. The handling time is pushed to the end of the buffer eidmatransfer end fifo, while the data of the buffer eidmacycle bk at the moment is recorded to the register eidmacycle.
The time calculation module is further configured to calculate the update time when the carrying unit is in the update state:
The carrying unit is judged to be in the update state when: the carrying unit has not started carrying or has completed carrying (i.e., it is in the idle state), the carrying unit currently has no carrying instruction configured (e.g., instructions such as eidma_cfg, eidma_ext_addr, eidma_burst), and the carrying unit has completed carrying at least once (i.e., the buffer eidma_transfer_end_fifo is not empty);
or, the handling unit eidma receives a release waiting instruction issued by the vector calculation unit or by the handling unit eodma.
When the carrying unit is judged to be in the update state, the value at the head of the buffer for storing the carrying completion time point is updated to the register for counting the carrying execution time, and the value at the head of that buffer is deleted.
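The update step is a pop-from-head operation, sketched below. The function name is hypothetical; the behavior (read the FIFO head into the cycle register, then delete it) is taken from the description.

```python
from collections import deque

def update_carry_unit(transfer_end_fifo):
    """Update state of the handling unit (sketch): move the head value
    of the transfer-completion FIFO into the register counting carrying
    execution time, deleting that head entry in the process."""
    return transfer_end_fifo.popleft()   # popleft both reads and removes

fifo = deque([164, 300])     # two recorded completion times
eidma_cycle = update_carry_unit(fifo)
```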
The time calculation module is further configured to calculate the time of issuing a release waiting instruction when the carrying unit is in the state of issuing a release waiting instruction:
When the carrying unit is in the state of issuing a release waiting instruction, the actual cycle number at which the carrying unit issues the release waiting instruction is saved to the cache used for storing the data required for the handshake with the peer execution unit.
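This step reduces to pushing the issue cycle into the handshake FIFO that the peer unit consumes, for example:

```python
from collections import deque

def issue_release_wait(current_cycle, handshake_fifo):
    """Sketch: when the handling unit issues a release waiting
    instruction, save the actual cycle number of the issue into the
    handshake FIFO read by the peer unit (e.g. eidma_ub_vu_fifo)."""
    handshake_fifo.append(current_cycle)

eidma_ub_vu_fifo = deque()
issue_release_wait(250, eidma_ub_vu_fifo)
```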
As an alternative embodiment, the second time cost calculation unit includes a calculation instruction register and a plurality of caches, specifically including:
a register vu_cycle for counting the calculation execution time;
buffers vu_ub_eidma_fifo and vu_ub_eodma_fifo for storing the data required for the handshake between the vector calculation unit and the handling units;
a real-time cycle register vu_cycle_bk for storing the number of execution cycles when the vector calculation unit completes calculation.
The above-mentioned caches all follow the first-in first-out principle.
The second time cost calculation unit further comprises a vector calculation unit judgment module, which is used for judging the state of the vector calculation unit; if the vector calculation unit is in the update state or in the state of issuing a release waiting instruction, the vector calculation unit calculation module is triggered;
The second time cost calculation unit further includes a vector calculation unit calculation module for calculating the instruction execution time of the vector calculation unit, wherein: if the vector calculation unit is in the update state, the update time of the vector calculation unit is calculated; if the vector calculation unit is in the state of issuing a release waiting instruction, the time at which the release waiting instruction is issued is calculated. The execution order of these calculations depends on the software programming manner; as an alternative embodiment, they may be executed in parallel.
Specifically, the vector calculation unit calculation module is configured to calculate the update time of the vector calculation unit when it is in the update state:
If the vector calculation unit receives a release waiting instruction (see eidma.wo.vu or eodma.wo.vu in fig. 4) from a handling unit (the handling unit eidma or the handling unit eodma), the vector calculation unit is considered to be in the update state at the current moment, and the update time is calculated. The time point is updated to the register vu_cycle used for counting the calculation execution time.
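A sketch of this synchronisation follows. Taking the maximum of the vector unit's own cycle count and the handshake time point is an assumption (the waiter cannot resume before either its own time or the releaser's); the patent says only that the time point is updated into vu_cycle.

```python
from collections import deque

def vu_update_on_release(vu_cycle, handshake_fifo):
    """Update state of the vector calculation unit (sketch): consume
    the release time point from the handling unit's handshake FIFO and
    synchronise vu_cycle to it. The max() merge rule is assumed."""
    release_time = handshake_fifo.popleft()
    return max(vu_cycle, release_time)

eidma_ub_vu_fifo = deque([164])      # eidma finished carrying at cycle 164
vu_cycle = vu_update_on_release(90, eidma_ub_vu_fifo)   # vu had only reached 90
```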
Specifically, the vector calculation unit calculation module is configured to calculate the time at which the vector calculation unit issues a release waiting instruction when it is in that state:
Since the calculation instructions of the vector calculation unit are simulated at cycle level on the behavior-level simulator, that is, consistently with the finished chip, when the vector calculation pipeline parses that the current instruction is the release waiting instruction vu.ub.eidma or vu.ub.eodma, the time point is recorded to the register vu_cycle used for counting the calculation execution time. Specifically, when the release waiting instruction is vu.ub.eidma, the value of the register vu_cycle used for counting the calculation execution time is recorded into the buffer vu_ub_eidma_fifo used for storing the data required for the handshake between the vector calculation unit and the handling unit; when the release waiting instruction is vu.ub.eodma, the value of the register vu_cycle is recorded into the buffer vu_ub_eodma_fifo used for storing the data required for the handshake between the vector calculation unit and the handling unit.
As an alternative embodiment, the time cost calculation system of the neural network processor further includes: a comprehensive unit configured to, after the execution of the behavior-level simulator is completed, determine the execution times of the handling units and the vector calculation unit from the values in the register vu_cycle used for counting the calculation execution time, the register eidma_cycle used for counting the carrying execution time, and the register eodma_cycle used for counting the carrying execution time, and further determine the near-real execution time of the model on a finished chip. In addition, the established caches include the cache eidma_transfer_end_fifo for storing the carrying completion time point, the caches eidma_ub_vu_fifo and eodma_ub_vu_fifo for storing the data required for the handshake between the handling units and the vector calculation unit, the cache eidma_ub_eodma_fifo for storing the data required for the handshake between the handling unit eidma and the handling unit eodma, and the caches vu_ub_eidma_fifo and vu_ub_eodma_fifo for storing the data required for the handshake between the vector calculation unit and the handling units. The records of the start and end times of the carrying instructions and the vector calculation instructions kept in these caches can also help the architect optimize the architecture and improve software and hardware performance.
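The final synthesis step can be sketched as below. Taking the maximum of the three per-unit cycle registers is an assumption: the units run in parallel, so the slowest unit bounds the model's end-to-end time; the patent states only that the overall execution time is determined from the three register values.

```python
def total_execution_time(vu_cycle, eidma_cycle, eodma_cycle):
    """After the behavior-level simulation finishes, approximate the
    chip-like execution time of the model from the three per-unit
    cycle registers (sketch; max() merge rule assumed)."""
    return max(vu_cycle, eidma_cycle, eodma_cycle)

# usage sketch: the eidma carry path is the bottleneck in this run
total = total_execution_time(vu_cycle=900, eidma_cycle=1200, eodma_cycle=1100)
```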
A third embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method described in any of the previous embodiments.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A time cost calculation method of a neural network processor, characterized in that the execution time of a carrying unit and the execution time of a vector calculation unit are calculated based on a behavior-level simulator;
the calculating of the execution time of the carrying unit includes: the carrying time executed by the carrying unit, the update time of the carrying unit, and the time at which the carrying unit issues a release waiting instruction;
the calculating of the execution time of the vector calculation unit includes: the calculation time executed by the vector calculation unit, the update time of the vector calculation unit, and the time at which the vector calculation unit issues a release waiting instruction;
the calculation of the execution time of the carrying unit and the calculation of the execution time of the vector calculation unit are executed in parallel.
2. The time cost calculation method of a neural network processor according to claim 1, wherein the carrying time executed by the carrying unit is calculated based on the carrying unit being in the execution-carrying state, and the time cost calculation for the carrying unit in the execution-carrying state includes: pushing the number of cycles required for the current carrying by the carrying unit to the end of the buffer for storing the carrying completion time point, as the end time of the current carrying.
3. The time cost calculation method of a neural network processor according to claim 1, wherein the update time of the carrying unit is calculated based on the carrying unit being in the update state, the carrying unit being in the update state when:
the carrying unit has not started carrying or has completed carrying, the carrying unit currently has no carrying instruction configured, and the carrying unit has completed carrying at least once;
or, the carrying unit receives a release waiting instruction issued by the vector calculation unit or by another carrying unit that carries data in the opposite direction.
4. The time cost calculation method of a neural network processor according to claim 3, wherein the time cost calculation for the carrying unit in the update state includes: updating the value at the head of the buffer for storing the carrying completion time point to the register for counting the carrying execution time, and deleting the value at the head of that buffer.
5. The time cost calculation method of a neural network processor according to claim 1, wherein the time of issuing a release waiting instruction is calculated based on the carrying unit being in the state of issuing a release waiting instruction, and the time cost calculation for the carrying unit in that state includes: saving the actual cycle number at which the carrying unit issues the release waiting instruction to the cache for storing the data required for the handshake with the peer execution unit.
6. The time cost calculation method of a neural network processor according to claim 1, wherein the update time of the vector calculation unit is calculated based on the vector calculation unit being in the update state, and the time cost calculation for the vector calculation unit in the update state includes: updating the data in the buffer for storing the data required for the handshake between the vector calculation unit and the carrying unit to the register for counting the calculation execution time.
7. The time cost calculation method of a neural network processor according to claim 1, wherein the time at which the vector calculation unit issues a release waiting instruction is calculated based on the vector calculation unit being in the state of issuing a release waiting instruction, and the time cost calculation for the vector calculation unit in that state includes:
when pipeline parsing by the vector calculation unit determines that the current instruction is a release waiting instruction, recording the value, at the current moment, of the register used for counting the calculation execution time.
8. The time cost calculation method of a neural network processor according to any one of claims 1-7, wherein the carrying unit includes instructions for carrying off-chip data to on-chip storage and instructions for carrying on-chip stored data off chip.
9. A time cost computing system for a neural network processor, comprising:
a behavior-level simulator,
a first time cost calculation unit for calculating an execution time of the carrying unit,
a second time cost calculation unit for calculating an execution time of the vector calculation unit;
the first time cost calculation unit is configured to calculate the carrying time executed by the carrying unit, the update time of the carrying unit, and the time at which the carrying unit issues a release waiting instruction;
the second time cost calculation unit is configured to calculate the calculation time of the vector calculation unit, the update time of the vector calculation unit, and the time at which the vector calculation unit issues a release waiting instruction;
the first time cost calculating unit and the second time cost calculating unit are executed in parallel.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-8.
CN202310096961.6A 2022-12-30 2023-01-17 Time cost calculation method and system for neural network processor and readable storage medium Pending CN116128016A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022117348721 2022-12-30
CN202211734872 2022-12-30

Publications (1)

Publication Number Publication Date
CN116128016A true CN116128016A (en) 2023-05-16

Family

ID=86295313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310096961.6A Pending CN116128016A (en) 2022-12-30 2023-01-17 Time cost calculation method and system for neural network processor and readable storage medium

Country Status (1)

Country Link
CN (1) CN116128016A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination