CN109272109B - Instruction scheduling method and device of neural network model - Google Patents

Instruction scheduling method and device of neural network model

Info

Publication number
CN109272109B
CN109272109B (application CN201811276880.XA / CN201811276880A)
Authority
CN
China
Prior art keywords
instruction
sequence
neural network
instructions
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811276880.XA
Other languages
Chinese (zh)
Other versions
CN109272109A (en)
Inventor
李军
李建军
黄畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201811276880.XA priority Critical patent/CN109272109B/en
Publication of CN109272109A publication Critical patent/CN109272109A/en
Application granted
Publication of CN109272109B publication Critical patent/CN109272109B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Advance Control (AREA)

Abstract

An instruction scheduling method and device for a neural network model are disclosed, comprising the following steps: when a first instruction sequence corresponding to a first neural network model needs to be run, determining a second instruction sequence corresponding to a second neural network model to be run, wherein the first neural network model is run before the second neural network model; selecting at least one instruction in the second instruction sequence; inserting the at least one instruction into the first instruction sequence; and executing the first instruction sequence including the at least one instruction. The method and the device can improve the overall execution efficiency of multiple neural network models without adding hardware resources.

Description

Instruction scheduling method and device of neural network model
Technical Field
The present application relates to the field of artificial neural networks, and in particular, to a method and an apparatus for instruction scheduling of a neural network model.
Background
In some specific application scenarios (e.g., automatic driving, face recognition), multiple neural network models need to be run to obtain the desired result. For such situations, there is currently no effective solution for establishing a pipeline between the instruction sequences of the multiple neural network models so as to reduce idle time and waste of hardware resources, and thereby improve the overall execution efficiency of the multiple neural networks as much as possible without adding hardware resources.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. The embodiment of the application provides an instruction scheduling method and device of a neural network model.
According to an aspect of the present application, there is provided an instruction scheduling method of a neural network model, including:
When a first instruction sequence corresponding to a first neural network model needs to be run, determining a second instruction sequence corresponding to a second neural network model to be run, wherein the first neural network model is run before the second neural network model;
Selecting at least one instruction in the second sequence of instructions;
Inserting the at least one instruction into the first sequence of instructions; and
Executing the first instruction sequence including the at least one instruction.
According to another aspect of the present application, there is provided an electronic device including: one or more processors; and a memory storing computer instructions which, when executed by the processor, cause the processor to perform the instruction scheduling method of the neural network model.
According to another aspect of the present application, there is provided an instruction scheduling apparatus of a neural network model, including:
The instruction sequence determining unit is configured to determine a second instruction sequence corresponding to a second neural network model to be run when a first instruction sequence corresponding to a first neural network model needs to be run, wherein the first neural network model is run before the second neural network model;
An instruction selection unit configured to select at least one instruction in the second instruction sequence;
An instruction insertion unit configured to insert the at least one instruction into the first instruction sequence; and
An instruction execution unit configured to execute the first instruction sequence including the at least one instruction.
In addition, the present application also provides a computer readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to execute the instruction scheduling method of the neural network model as described above.
By the method and the device, a pipeline can be established among the instruction sequences of multiple neural network models, so that idle time and waste of hardware resources are reduced, the pipeline execution capability of the hardware resources is exploited more fully, and the overall execution efficiency of the multiple neural networks is improved without adding hardware resources.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic diagram of instruction scheduling inside an instruction sequence of a neural network model in the related art.
Fig. 2 is a schematic diagram of the overall operation flow of three neural network models in the related art.
Fig. 3 is a system configuration diagram to which the present application is applicable.
Fig. 4 is an exemplary diagram of a heterogeneous network architecture of the system shown in fig. 3.
Fig. 5 is a flowchart illustrating a neural network model instruction scheduling method according to an exemplary embodiment of the present application.
FIG. 6 is a diagram of an exemplary flow for inserting instructions in an instruction sequence into another instruction sequence in an exemplary embodiment of the present application.
FIG. 7 is a diagram illustrating an exemplary implementation of instruction scheduling for a neural network model provided by an exemplary embodiment of the present application.
Fig. 8 is an exemplary schematic diagram of an overall execution flow of three neural network models after an instruction scheduling method of the neural network models provided by an exemplary embodiment of the present application is used.
Fig. 9 is a diagram of an exemplary process for executing an instruction scheduling method of a neural network model by the system shown in fig. 4 according to an exemplary embodiment of the present application.
Fig. 10 is a schematic structural diagram of an instruction scheduling apparatus of a neural network model according to an exemplary embodiment of the present application.
Fig. 11 is another exemplary structural diagram of an instruction scheduling apparatus of a neural network model according to an exemplary embodiment of the present application.
Fig. 12 is a block diagram of an electronic device provided in an exemplary embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
In some application scenarios, multiple neural network models need to be run to achieve the desired results. For example, in face recognition, one neural network model is called to detect whether an image contains a person's face; if it does, another neural network model is scheduled to recognize the face in that image, and finally the required result is obtained. If there are multiple images, multiple neural network models need to be called simultaneously for detection; similarly, if faces are present in multiple images, multiple neural network models need to be called for recognition. As another example, in automatic driving, multiple neural network models are needed to perform detection, recognition and other processing on images acquired in real time before a final result can be obtained. There are thus many scenarios in which multiple neural network models need to be run to obtain the desired results, and how to improve the overall execution efficiency of the multiple neural network models in these scenarios is critical.
In pipeline technology, the pipeline execution capability of hardware resources strongly depends on the instruction sequence: an instruction sequence with more reasonable instruction scheduling can exploit more of the hardware's pipeline execution capability while guaranteeing correct operation results, thereby reducing idle time and waste of hardware resources. In the related art, the instruction sequence of each neural network model is generated before runtime, and the compiler can only compile in units of a single neural network model before runtime. This means that the compiler can only schedule instructions within the instruction sequence of a single neural network model to implement a pipeline inside that sequence; it cannot schedule instructions between the instruction sequences of multiple neural network models, and therefore cannot establish a pipeline between them.
When a neural network model starts to run, computation cannot begin immediately: data needs to be loaded first. Since pipelines cannot be established among the instruction sequences of multiple neural networks, the computing unit (e.g., the BPU) can only wait during the loading process at the beginning stage of each neural network model, which wastes hardware resources and reduces execution efficiency. Fig. 1 shows an example of instruction scheduling inside the instruction sequence of a neural network model in the related art. As can be seen from Fig. 1, although scheduling inside an instruction sequence can mask part of the running time through parallelism among the instructions within the sequence, the loading time at the beginning stage of each neural network model cannot be masked; the BPU can only wait during this loading time, which not only wastes hardware resources but also reduces the overall execution efficiency of the multiple neural network models.
Viewed as a whole, the execution process of each neural network model can be simplified into "load data in the starting stage -> compute in the intermediate stage -> store results in the final stage". When multiple neural network models need to be run, because the related art cannot establish a pipeline between their instruction sequences, the models must be executed one by one: the first neural network model is executed, the second neural network model is executed after the first finishes, and so on. Fig. 2 shows an example of the overall operation flow of three neural network models: the loading, computation and storage of neural network model 1 are executed first, then the loading, computation and storage of neural network model 2 after model 1 finishes, and then the loading, computation and storage of neural network model 3 after model 2 finishes. The whole operation process therefore needs 3 × 3 = 9 time periods, and in 3 of these 9 time periods (namely time period 1, time period 4 and time period 7) the BPU is in an idle state, so the overall execution efficiency is very low.
For some application scenarios, it is necessary to frequently call multiple neural network models with relatively simple structures (for example, models with only three or four convolution layers) to obtain the required result. Because a pipeline cannot be established between the instruction sequences of these models, the BPU must wait for loading every time a model is called, so the BPU is idle for a long time and the overall operation efficiency of the models is very low. According to statistics, the loading time of the starting stage of such simpler neural network models accounts for 30%-50% of their whole operation duration. In other words, every time such a model is called, the BPU is idle for 30%-50% of that model's operation duration, so in the process of obtaining the required result by calling multiple such models the BPU is idle for at least 30%-50% of the operation time, and the overall execution efficiency of the models is quite low.
Therefore, how to establish a pipeline between instruction sequences of a plurality of neural network models and reduce the idleness and waste of hardware resources under the condition that a plurality of neural network models need to be operated to obtain a required result is an urgent technical problem to be solved.
In view of the above technical problems, the basic concept of the present application is to provide an instruction scheduling method, an apparatus, an electronic device, and a computer-readable storage medium for a neural network model, in which, when a first instruction sequence corresponding to a first neural network model needs to be run, a second instruction sequence corresponding to a second neural network model to be run is determined, the first neural network model being run before the second neural network model; at least one instruction is selected in the second instruction sequence; the at least one instruction is inserted into the first instruction sequence; and the first instruction sequence including the at least one instruction is executed. According to the embodiments of the present application, at least one instruction of the second instruction sequence corresponding to the second neural network model is scheduled into the first instruction sequence corresponding to the first neural network model, so that this at least one instruction of the second neural network model can run in parallel with instructions of the first neural network model. A pipeline is thereby established between the first instruction sequence and the second instruction sequence, the pipeline execution capability of the hardware resources can be exploited more fully while the operation result remains correct, idle time and waste of hardware resources (such as the BPU) are further reduced, and the overall execution efficiency of multiple neural network models is improved without adding hardware resources. In particular, when multiple neural network models with relatively simple structures need to be called frequently, the embodiments of the present application can greatly improve the overall execution efficiency of the models without adding hardware resources.
It should be noted that, although the application scenarios of a plurality of neural network models with relatively simple structures are described above as examples, the application scope of the embodiments of the present application is not limited thereto. The embodiment of the application can be applied to any scene needing to operate two or more than two neural network models. For example, the embodiments of the present application are still applicable to a scenario that requires a plurality of neural network models with relatively complex structures.
Exemplary System
The embodiment of the application can be applied to any system supporting a plurality of neural network models to operate, and the system can be in a heterogeneous network structure or a homogeneous network structure.
FIG. 3 is an exemplary architecture 30 of the above-described system that supports the operation of multiple neural network models, including: a compiling device 301 and an operating device 302 which are connected or communicated with each other, wherein the compiling device 301 is responsible for compiling the instruction sequences of the neural network models before operation (namely, in an off-line state), and the operating device 302 is responsible for operating the instruction sequences of the neural network models provided by the compiling device. Here, the compiling apparatus 301 may be implemented by one or more processors, which run compilers, and in practical applications, may be implemented by a CPU with strong performance; the runtime apparatus 302 may include one or more processors, one or more of which may be used to implement neural network related computations, including but not limited to: convolution, calculation of activation functions, pooling, etc.
Fig. 4 is an example of a heterogeneous network structure 40 of the system shown in Fig. 3, in which a first processor 401 belongs to the compiling device 301, while a memory 403, a second processor 402 and a third processor 404 belong to the running device 302. The first processor 401 is responsible for compiling the instruction sequence of each neural network model before running; the second processor 402 is responsible for loading the instruction sequences of the neural network models into the memory 403 at runtime; the memory 403 stores the instruction sequences that are in a waiting state at runtime; and the third processor 404 reads the instruction sequences of the neural network models from the memory 403 at runtime and runs them. Here, the first processor 401 may be a high-performance CPU configured with a compiler; the second processor 402 may be a relatively low-power ARM processor; the third processor 404 may be a processor supporting neural-network-related computation, such as a Brain Processing Unit (BPU) or a Tensor Processing Unit (TPU); and the memory 403 may be a volatile memory (e.g., DDR) or a non-volatile memory (e.g., hard disk, SSD, Flash, EEPROM, and the like).
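As a rough illustration of this division of labor, the following sketch models the offline compiling role and the two runtime roles. All class, method and field names are assumptions made only for illustration; they are not defined by the patent.

```python
# Illustrative sketch of the roles in Fig. 4; names and structure are assumptions.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class InstructionSequence:
    model_name: str
    instructions: List[str]          # compiled instructions of one neural network model

class CompilingDevice:
    """Offline role of the first processor 401: compile each model into an instruction sequence."""
    def compile_model(self, model_name: str) -> InstructionSequence:
        # Real compilation is target specific; only the interface is modeled here.
        return InstructionSequence(model_name,
                                   [f"{model_name}:load", f"{model_name}:conv", f"{model_name}:store"])

class RunningDevice:
    """Runtime roles: processor 402 loads sequences into memory 403; processor 404 (BPU/TPU) runs them."""
    def __init__(self) -> None:
        self.memory: Dict[str, InstructionSequence] = {}      # stands in for memory 403

    def load_sequence(self, seq: InstructionSequence) -> None:  # role of the second processor 402
        self.memory[seq.model_name] = seq

    def run_sequence(self, model_name: str) -> None:            # role of the third processor 404
        for _ins in self.memory[model_name].instructions:
            pass   # each instruction would be dispatched to the compute hardware here
```

In this sketch the compiling device works offline and the running device only ever sees finished instruction sequences, which reflects why, as discussed in the background, the compiler alone cannot schedule instructions across the sequences of different models.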
It should be noted that fig. 3 and fig. 4 are only examples, and the system to which the embodiment of the present application is applied is not limited thereto. The embodiment of the application can be applied to any system supporting two or more than two neural network models to operate.
Exemplary method
FIG. 5 is an exemplary method 500 for instruction scheduling for a neural network model provided by an exemplary embodiment of the present application. As shown in fig. 5, the exemplary method 500 includes the steps of:
Step 501, when a first instruction sequence corresponding to a first neural network model needs to be operated, determining a second instruction sequence corresponding to a second neural network model to be operated, wherein the first neural network model is operated before the second neural network model;
Step 502, selecting at least one instruction in the second instruction sequence;
Step 503, inserting the at least one instruction into the first instruction sequence; and the number of the first and second groups,
Step 504, executing the first instruction sequence including the at least one instruction.
In the embodiment of the present application, when the first instruction sequence of the first neural network model needs to be run, at least one instruction of the second neural network model is inserted into it, establishing a pipeline between the first instruction sequence and the second instruction sequence so that the at least one instruction of the second neural network model can run in parallel with instructions of the first neural network model. The pipeline execution capability of the hardware resources is thus exploited to a greater extent while the operation result remains correct, the overall operation time of the multiple neural network models is reduced, idle time and waste of hardware resources (such as the BPU) are further reduced, and the overall execution efficiency of the multiple neural network models is improved without adding hardware resources. When multiple neural network models with relatively simple structures need to be called frequently, the embodiment of the present application can greatly improve their overall execution efficiency without adding hardware resources.
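A minimal sketch of the flow of steps 501-504 is given below. The function names and the dict-based instruction representation are assumptions made only for illustration, and the insertion step is simplified: the conflict checks and legal-position search described later are omitted here.

```python
from typing import Dict, List

def select_prefix_loads(seq: List[Dict]) -> List[Dict]:
    """Step 502 (one possible choice): the load instructions before the first compute instruction."""
    picked = []
    for ins in seq:
        if ins["kind"] != "load":
            break
        picked.append(ins)
    return picked

def schedule_across_models(first_seq: List[Dict], second_seq: List[Dict]) -> List[Dict]:
    # Step 501 is assumed already done: second_seq belongs to the model that runs next.
    for ins in select_prefix_loads(second_seq):
        second_seq.remove(ins)       # the instruction leaves sequence 2 ...
        first_seq.append(ins)        # step 503: ... and joins sequence 1 (a real scheduler would
                                     # insert it at a legal, conflict-checked position)
    return first_seq                 # step 504: this augmented sequence is then executed

# Toy example: the leading load of model 2 is pulled into model 1's sequence.
seq1 = [{"kind": "load", "id": "m1_w"}, {"kind": "conv", "id": "m1_c"}, {"kind": "store", "id": "m1_s"}]
seq2 = [{"kind": "load", "id": "m2_w"}, {"kind": "conv", "id": "m2_c"}]
print(schedule_across_models(seq1, seq2))
```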
In the embodiment of the present application, in step 501 the second instruction sequence to be executed next is determined when the first instruction sequence needs to be executed (i.e., while the first instruction sequence is in a waiting state), so that in steps 502 and 503 instruction scheduling between instruction sequences is performed on the first instruction sequence about to be executed and the second instruction sequence that immediately follows it. A pipeline between multiple instruction sequences is thus established according to the actual execution order of the multiple neural network models, so that this pipeline does not affect their overall operation result and the correctness of the operation result is ensured.
In the embodiment of the present application, the determination of the second instruction sequence in step 501 depends on the operation result of the first neural network model and/or the final required result in the current scenario.
In an implementation of the embodiment of the present application, in step 501, each time a neural network model is called it may be dynamically determined which neural network model is to be run next, and therefore which second instruction sequence is to be run. Specifically, which neural network model needs to be called next (i.e., which is the second neural network model) may be determined according to the operation result of the first neural network model and the final result required by the current scenario; determining which model to call next also determines which instruction sequence needs to be executed after the first instruction sequence. Taking face recognition as an example, neural network model 1 (an example of the first neural network model described in this application) is used to detect whether a face image is present in an original image, and neural network model 2 (an example of the second neural network model described in this application) is used to recognize the face image in the original image; neural network model 1 corresponds to instruction sequence 1 (an example of the first instruction sequence described in this application) and neural network model 2 corresponds to instruction sequence 2. If the detection result of neural network model 1 is that a face image is present in the original image, neural network model 2 must be called next for recognition, and it can then be confirmed that instruction sequence 2 needs to be run after instruction sequence 1. If the result of neural network model 1 is that no face image is present in the original image, neural network model 2 does not need to be called, i.e., instruction sequence 2 does not need to be run after instruction sequence 1, and processing of the original image can end directly. If 2000 original images are detected at the same time, instruction sequence 1 may need to be run 2000 times to detect them respectively; if 1000 of those original images contain face images, neural network model 2 may need to be called 1000 times at the same time to recognize the face images in those 1000 images respectively, and correspondingly instruction sequence 2 needs to be run 1000 times after the 2000 runs of instruction sequence 1.
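For the face-recognition example above, the determination in step 501 might look like the following sketch. The function and sequence names are purely illustrative; in practice the decision depends on the operation result of the first model and the final result required by the scenario.

```python
def next_instruction_sequence(model1_found_face: bool):
    """After instruction sequence 1 (detection) finishes, decide which sequence, if any, runs next."""
    if model1_found_face:
        return "instruction_sequence_2"   # recognition model 2 must be called, so sequence 2 is "to be run"
    return None                           # no face image: processing of this original image ends here
```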
In another implementation manner of the embodiment of the present application, in step 501, a calling order of a plurality of neural network models may be preset according to a requirement of an application scenario, where the calling order indicates a running order of instruction sequences of the plurality of neural network models.
It should be noted that the above two implementation manners are only examples, and how to specifically determine which instruction sequence is to be executed after the first instruction sequence in step 501 in this application embodiment may have various implementation manners, which is not limited in this application embodiment.
In some implementations of embodiments of the present application, the at least one instruction is a load instruction for loading data required by the operation of the second neural network model. The load instruction may include a load instruction for feature data and/or a load instruction for parameters such as weights and biases (offsets). In one implementation, the at least one instruction may be a load instruction at the beginning stage of the second neural network model, i.e., a load instruction preceding the first compute instruction in the second instruction sequence. In this way, the load instructions of the second neural network model's starting stage can be moved forward into the running process of the first neural network model, the loading time of the second neural network model's starting stage can be saved, the waiting time of the BPU at that starting stage is reduced, and execution efficiency is improved. In this implementation, the at least one instruction may be inserted between the computation instructions of the first instruction sequence in step 503. Of course, implementations such as "inserting a load instruction from the middle of the second instruction sequence into the first instruction sequence" and "inserting a computation instruction of the second instruction sequence into the first instruction sequence" may also be adopted. In a specific implementation, which instruction or instructions are selected from the second instruction sequence and inserted into the first instruction sequence depends on the specific application scenario and the instruction sequence length of each neural network model; the embodiments of the present application are not limited in this respect.
In some implementations of the embodiment of the application, the selecting at least one instruction in the second instruction sequence in step 502 may include: traversing the second instruction sequence to find a first calculation instruction in the second instruction sequence; and selecting an instruction in the second sequence of instructions that precedes the first calculating instruction. Therefore, the instruction of the starting stage of the second neural network model can be used as an object for scheduling among instruction sequences in a targeted manner, so that the starting stage of the second neural network model can be parallel to a part of stages (such as an intermediate stage) of the first neural network model, the execution time of the starting stage of the second neural network model is saved, the hardware resources (such as BPU) are prevented from waiting for a long time at the starting stage of the second neural network model, and the overall execution efficiency of a plurality of neural network models is improved on the premise of not increasing the hardware resources.
Because the runtime of an instruction depends on the size of the data it needs to operate on (the larger the data, the longer the runtime; the smaller the data, the shorter the runtime), in some implementations of the embodiment of the present application the at least one instruction may, in step 503, be inserted into the first instruction sequence one by one in descending order of the size of the data to be operated on. In this way, the instructions with longer running time in the second neural network model are inserted into the first instruction sequence first, so that they can run in parallel with one or more instructions of the first instruction sequence, thereby saving more running time (for example, the time for loading data at the beginning stage of the second neural network model) and further improving the overall execution efficiency of the multiple neural network models.
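Combining the two preceding paragraphs, a sketch of selecting the candidate instructions of step 502 and ordering them by decreasing operation-data size might look as follows. The Instruction fields (is_compute, data_size) are assumptions introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instruction:
    name: str
    is_compute: bool     # True for convolution/pooling etc., False for load instructions
    data_size: int       # size (e.g., in bytes) of the data the instruction operates on

def candidate_instructions(second_seq: List[Instruction]) -> List[Instruction]:
    """Instructions before the first compute instruction, largest operation data first."""
    prefix: List[Instruction] = []
    for ins in second_seq:
        if ins.is_compute:
            break
        prefix.append(ins)
    return sorted(prefix, key=lambda i: i.data_size, reverse=True)
```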
In this embodiment of the present application, step 503 may include: checking whether a data conflict (also referred to as a dependency conflict) and/or a hardware resource conflict (also referred to as a structural conflict) exists between the at least one instruction and each instruction in the first instruction sequence to determine a legal position of the at least one instruction in the first instruction sequence for inserting the at least one instruction into the first instruction sequence. Therefore, before at least one instruction is inserted into the first instruction sequence, the legal position of the at least one instruction in the first instruction sequence is determined through checking, the generation of conflict can be effectively avoided, a more reasonable pipeline is established among the instruction sequences of the plurality of neural network models, the running or the calculation of the neural network models is prevented from generating errors, and the overall execution efficiency of the plurality of neural network models is improved on the premise of ensuring the correct overall operation result of the plurality of neural network models.
In some implementations of embodiments of the present application, how to determine the legal position by checking data conflicts and/or hardware resource conflicts depends on whether legal problems in the execution of the instructions (such legal problems include, but are not limited to, running errors and/or computational errors) are caused. In one implementation, for a case where only a data conflict causes a legal problem in the instruction execution process (e.g., for an asynchronous blocking instruction), only the data conflict needs to be checked, and if no data conflict exists, the legal position can be determined, and if the data conflict exists, the illegal position can be determined. In another implementation, for a case where any one of the data conflict and the hardware resource conflict causes a legal problem in the instruction execution process (for example, for a synchronous blocking instruction), the data conflict and the hardware resource conflict need to be checked (the data conflict may be checked first and then the hardware resource conflict may be checked), if the data conflict and the hardware resource conflict do not exist, the legal position may be determined, and if the data conflict and the hardware resource conflict exist, the illegal position may be determined.
In this embodiment, an instruction may include an opcode indicating which operation the instruction performs (i.e., which task it accomplishes), which may include, but is not limited to: fetching data, writing data, convolution, pooling, etc.; whether an instruction is a load instruction or a compute instruction of the neural network model can be determined from its opcode. In addition, an instruction may include an address code and operands, where the address code indicates the memory address of the data operated on by the instruction, and the operands indicate which objects the instruction targets (such as feature data, or parameters such as weights and biases related to the neural network computation). In a specific application, the type of an instruction, the size of its operation data, its memory address, and the like can be determined by parsing the instruction.
In some implementation manners of the embodiment of the present application, checking whether a data conflict exists between the at least one instruction and each instruction in the first instruction sequence may include: judging whether the memory address accessed by the at least one instruction is overlapped with the memory address accessed by the instructions of the first instruction sequence, and whether the at least one instruction and/or the instructions of the first instruction sequence execute write operation on data in the memory address; and when the memory addresses are overlapped and the at least one instruction and/or the instruction in the first instruction sequence execute the write operation on the data in the memory addresses, determining that the at least one instruction has a data conflict with the instruction in the first instruction sequence. For example, if instruction a is an instruction in the first instruction sequence and instruction B is an instruction to be inserted into the first instruction sequence in the second instruction sequence, if instruction a reads data at address a and instruction B writes data at address a, it is determined that there is a data conflict between instruction a and instruction B; if instruction A reads data at address a and instruction B also reads data at address a, it is confirmed that there is no data conflict between instruction A and instruction B. That is, two instructions access data at the same address, and if the data content is not changed, it is determined that there is no data conflict between the two instructions, and if any instruction needs to change the data content, there is a data conflict between the two instructions. Here, the memory address accessed by the instruction and what operation it performs can be obtained by parsing the instruction.
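A sketch of the data-conflict check just described is given below, using half-open address ranges (start, end) and a writes flag as an assumed representation of what parsing an instruction would yield; the representation is illustrative, not the patent's encoding.

```python
from typing import Tuple

def ranges_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """Do the half-open memory ranges [start, end) overlap?"""
    return a[0] < b[1] and b[0] < a[1]

def data_conflict(addr_a: Tuple[int, int], writes_a: bool,
                  addr_b: Tuple[int, int], writes_b: bool) -> bool:
    """Conflict iff the accessed regions overlap and at least one instruction writes to them."""
    return ranges_overlap(addr_a, addr_b) and (writes_a or writes_b)

# Example from the text: A reads address a while B writes address a -> conflict;
# A reads address a and B also only reads address a -> no conflict.
print(data_conflict((0x1000, 0x1100), False, (0x1000, 0x1100), True))   # True
print(data_conflict((0x1000, 0x1100), False, (0x1000, 0x1100), False))  # False
```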
In an implementation manner of the embodiment of the present application, when it is checked whether a data conflict exists between one instruction in the second instruction sequence and one instruction in the first instruction sequence (for example, a last first instruction in the first instruction sequence), it is checked in units of bits, that is, each bit of data operated by one instruction in the second instruction sequence and one instruction in the first instruction sequence is checked to determine whether data conflict exists between the two instructions, and once there is a conflict between partial data (that is, memory addresses of partial data are the same and a write operation needs to be performed on the partial data), it is considered that there is a data conflict between the two instructions.
It should be noted that, the above-mentioned manner of checking data collision is only an example, and checking whether there is a data collision between at least one instruction in the second instruction sequence and each instruction in the first instruction sequence may also be implemented in other manners. The present application is not limited to the specific implementation of checking for data conflicts.
In some implementation manners of the embodiments of the application, mapping relationship information between the instructions and the hardware resources may be preconfigured in a static configuration manner, and whether a hardware resource conflict exists between the two instructions is determined by querying the mapping relationship information during verification. In an example, the checking whether there is a hardware resource conflict between at least one instruction in the second instruction sequence and each instruction in the first instruction sequence may include: inquiring mapping relation information between each pre-configured instruction and a hardware resource used by the instruction, and judging whether the at least one instruction and the instruction of the first instruction sequence need to use the same hardware resource or not based on the mapping relation information; determining that there is a hardware resource conflict of the at least one instruction with an instruction of the first sequence of instructions when use of the same hardware resource is required. Determining that there is no hardware resource conflict between the at least one instruction and the instructions of the first sequence of instructions when the same hardware resource need not be used. Here, the mapping relationship information between the instruction and the hardware resource used by the instruction is used to indicate which hardware resource is used by each instruction through a mapping relationship, which may be one-to-one, one-to-many, many-to-one, and the like, depending on the type of the instruction, the opcode of the instruction, and the specific configuration of the hardware resource (e.g., whether one processor or multiple processors, whether one memory or multiple memories, and the like). Through the mapping relation between the instruction and the hardware resource used by the instruction, the conflict of the hardware resource can be avoided. It should be noted that, for the specific checking manner of the hardware resource conflict, the embodiment of the present application is not limited thereto.
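A sketch of the mapping-table check described above follows. The table contents (opcodes and resource names) are invented purely for illustration; the patent only states that such a pre-configured mapping exists.

```python
from typing import Dict, Set

# Hypothetical pre-configured mapping: instruction opcode -> hardware resources it uses.
RESOURCE_MAP: Dict[str, Set[str]] = {
    "load_feature": {"dma0"},
    "load_weight":  {"dma1"},
    "conv":         {"mac_array"},
    "pooling":      {"pool_unit"},
}

def hardware_conflict(opcode_a: str, opcode_b: str) -> bool:
    """Two instructions conflict if they need at least one common hardware resource."""
    return bool(RESOURCE_MAP.get(opcode_a, set()) & RESOURCE_MAP.get(opcode_b, set()))

print(hardware_conflict("load_feature", "conv"))          # False: different resources, no conflict
print(hardware_conflict("load_feature", "load_feature"))  # True: both need dma0
```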
In one embodiment, an instruction sequence refers to a sequence including a plurality of instructions, and each instruction in the sequence is sorted according to its operation start time, the instruction with the operation start time before is sorted before, and the instruction with the operation start time after is sorted after. In some implementations of the embodiment of the present application, a legal position of an instruction in the second instruction sequence in the first instruction sequence means that an operation result of the first instruction sequence is not affected after an instruction in the second instruction sequence is inserted into the legal position in the first instruction sequence, and the instruction in the second instruction sequence does not have data collision with all instructions in the first instruction sequence after the legal position. In one example, the position of an instruction in an instruction sequence may be represented by a sequence number, which may indicate the ordering of the instruction in the instruction sequence. For example, if an instruction is ordered as 3 in an instruction sequence, the position of the instruction may be represented as 3. Of course, the position of an instruction in an instruction sequence may also be represented in other ways when implemented specifically, and embodiments of the present application are not limited thereto. Accordingly, the legal position is represented in a manner similar to that described above. It should be noted that the legal position of an instruction in the second instruction sequence in the first instruction sequence may be an insertable position of the instruction in the first instruction sequence, but is not necessarily an actual insertion position of the instruction in the first instruction sequence.
In this embodiment, the legal position of an instruction in the second instruction sequence in the first instruction sequence may not be present, or one or more of them may be present. If a plurality of legal positions of an instruction in the second instruction sequence exist in the first instruction sequence, one legal position can be selected as the insertion position of the instruction. If only one legal position of one instruction in the second instruction sequence is in the first instruction sequence, the legal position can be used as the insertion position of the instruction. If at least one instruction in the second instruction sequence does not have a legal position in the first instruction sequence, splitting the at least one instruction into two groups of instructions, and inserting one group of instructions which do not have data conflict with the instructions in the first instruction sequence into the first instruction sequence.
In this embodiment, whether to split at least one instruction in the second instruction sequence depends on the size of data where no conflict exists in the operation data of the at least one instruction. In some implementations, it may be determined whether a size of data, in which there is no conflict, in the at least one instruction operation data is greater than a preset threshold; and splitting the at least one instruction when the size of the data without conflict in the at least one instruction operation data is larger than the threshold value. The at least one instruction may not be split when a size of data of the at least one instruction operation data for which there is no conflict is less than or equal to the threshold. The threshold may be preset, and a value of the threshold may be an empirical value, and a specific value of the threshold is related to an actual application scenario, which is not limited in the present application. In one example, when the at least one instruction is a load instruction, the first N byte data of each load instruction operation occupies most of the runtime according to statistics, and at this time, the threshold may be set based on N, for example, the threshold may be a multiple of N (e.g., 2N), where N is an integer not less than 1. In one implementation, if an instruction in the second instruction sequence belongs to the beginning stage of the second neural network model (e.g., a load instruction before the first compute instruction) and the data of the operation data of the instruction is large (e.g., the size of the data of the operation of the instruction is larger than the threshold), the instruction may be split into two groups of instructions according to the data collision condition between the instruction and the instruction in the first instruction sequence, where the first group of instructions and the instruction in the first instruction sequence have data collision, and the second group of instructions and the instruction in the first instruction sequence have no data collision, and the second group of instructions is inserted into the first instruction sequence. Therefore, when one instruction in the second instruction sequence does not have a legal position, the instruction can be partially inserted into the first instruction sequence in a splitting mode, so that the running time of the second neural network model is further saved, and the overall execution efficiency of a plurality of neural network models is improved.
In an implementation manner of the embodiment of the present application, splitting an instruction in the second instruction sequence according to a data collision condition between the instruction in the second instruction sequence and an instruction in the first instruction sequence may specifically be: determining (using the results of the data collision check) whether there is data in a collision and data in no collision between an instruction in the second instruction sequence and an instruction in the first instruction sequence (e.g., the results of the data collision check can be used directly to determine which are data in a collision and which are data in no collision); and splitting the at least one instruction into two groups of instructions according to the data with conflict and the data without conflict, wherein a first group of instructions (possibly comprising one or more instructions) in the two groups of instructions is used for operating the data with conflict (such as instructions for loading the data with conflict), and a second group of instructions (possibly comprising one or more instructions) in the two groups of instructions is used for operating the data without conflict (such as instructions for loading the data without conflict). Optionally, the size of the data without conflict is compared with the threshold before splitting the instruction, and when the size of the data without conflict is greater than the threshold, the at least one instruction is split into two groups of instructions according to the data with conflict and the data without conflict. In this way, a data conflict exists between the first set of instructions and the instructions in the first instruction sequence, but no data conflict exists between the second set of instructions and the instructions in the first instruction sequence, so that the second set of instructions is inserted into the first instruction sequence, i.e. the one instruction in the second instruction sequence is partially inserted into the first instruction sequence. In the embodiment of the application, the instruction with longer running time in the second instruction sequence is inserted into the first instruction sequence by splitting the instruction in the second instruction sequence, so that more running time is saved, and the overall execution efficiency of a plurality of neural network models is improved to a greater extent on the premise of not increasing hardware resources.
In one example, the operation data of one instruction a in the second instruction sequence is "abcdef", the data collision between the instruction a and the instructions in the first instruction sequence is found through data collision check, the data with the collision is "b" and "d", at this time, the instruction a may be split into two groups of instructions, the first group of instructions includes instruction 1 of operation data "b" and instruction 2 of operation data "d", the second group of instructions includes instruction 3 of operation data "a", instruction 4 of operation data "c", and instruction 5 of operation data "ef", after the splitting, instruction 3, instruction 4, and instruction 5 in the second group of instructions may be respectively inserted into the first instruction sequence, and instruction 1 and instruction 2 in the first group of instructions are retained in the second instruction sequence.
In an implementation manner of the embodiment of the present application, if two or more legal positions exist in the first instruction sequence for the at least one instruction, the legal position with the longest parallel time may be determined as an insertion position of the at least one instruction, so as to save the running time as much as possible, thereby improving the overall execution efficiency of the plurality of neural network models more greatly.
In another implementation manner of the embodiment of the present application, if two or more legal positions of the at least one instruction exist in the first instruction sequence, the legal position which has the longest parallel time and is the most advanced in the first instruction sequence is determined as the insertion position of the at least one instruction, so as to save the runtime as much as possible, thereby further greatly improving the overall execution efficiency of the plurality of neural network models.
In the embodiment of the application, for one instruction in the second instruction sequence, the parallel time of the instruction at a legal position in the first instruction sequence depends on the running time of one or more instructions in the first instruction sequence parallel to the instruction. Since the running time of an instruction is directly related to the size of the data operated by the instruction, in an implementation manner of the embodiment of the present application, determining the legal location with the longest parallel time may include: determining one or more instructions in the first sequence of instructions that are parallel to the at least one instruction at each legal location, estimating a parallel time for each legal location based on a size of the one or more instruction operation data; and determining the legal position with the longest parallel time based on the estimated parallel time of each legal position. Therefore, the running time of one instruction in the second instruction sequence which can be really covered at each legal position can be estimated, the more the running time is really covered, the more the running time is saved, and the legal position which can save more running time is used as the actual insertion position of the instruction, so that the integral execution efficiency of a plurality of neural network models is greatly improved.
In an embodiment of the present application, the parallel time of an instruction in the second instruction sequence at a legal position in the first instruction sequence is equal to the sum of the runtimes of the one or more instructions in the first instruction sequence that are parallel to it at that position. That is, the runtimes of those instructions in the first instruction sequence are estimated in order to determine the parallel time of the instruction at that legal position. Here, for a load instruction, the runtime is estimated from the size of the data to be read; generally, the larger the data to be read, the longer the runtime, and the linear relationship between the size of the read data and the runtime can be obtained by collecting the actual runtime of each instruction and the size of the data actually read, and performing statistical analysis on these. For a computation instruction, the runtime can be estimated from the size of the data to be computed; generally, the larger the data to be computed, the longer the runtime of the corresponding computation instruction, and the relationship between the runtime of a computation instruction and the size of the data it actually computes is generally nonlinear; this relationship can likewise be obtained by collecting the actual runtimes of computation instructions and the sizes of the data they actually compute and performing statistical analysis. In general, the runtime of an instruction can be determined based on the size of the data it needs to operate on.
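The sketch below estimates the parallel time of each candidate legal position and picks the best one (longest parallel time, earliest position on ties, as described above). The runtime models used here (a linear model for loads and a placeholder nonlinear model for compute instructions) are assumptions standing in for the statistically fitted relationships; the coefficients are invented.

```python
from typing import List, Tuple

def estimated_runtime(kind: str, data_size: int) -> float:
    if kind == "load":
        return 0.5 * data_size             # assumed linear fit: cycles per byte loaded
    return 2.0 * data_size ** 1.2          # assumed nonlinear fit for compute instructions

def parallel_time(parallel_instructions: List[Tuple[str, int]]) -> float:
    """Sum of estimated runtimes of the first-sequence instructions that would run in parallel."""
    return sum(estimated_runtime(kind, size) for kind, size in parallel_instructions)

def best_legal_position(candidates: List[Tuple[int, List[Tuple[str, int]]]]) -> int:
    """candidates: (position, instructions parallel to the inserted instruction at that position).
    Returns the position with the longest parallel time, preferring the earliest position on ties;
    returns -1 if there is no candidate."""
    best_pos, best_time = -1, -1.0
    for pos, par in sorted(candidates, key=lambda c: c[0]):
        t = parallel_time(par)
        if t > best_time:
            best_pos, best_time = pos, t
    return best_pos
```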
In an implementation of the embodiment of the present application, the checking may be performed instruction by instruction starting from the end of the first instruction sequence. For an instruction in the second instruction sequence, performing the check from the end of the first instruction sequence helps find the legal position of that instruction that is earliest in the first instruction sequence, so that this earliest legal position can be used as the instruction's insertion position, saving as much running time as possible and improving the overall execution efficiency of the multiple neural network models.
The following example illustrates a specific implementation flow of the instruction scheduling method of the neural network model.
Neural network model 1 (an example of the first neural network model described herein) corresponds to instruction sequence 1 (an example of the first instruction sequence described herein), neural network model 2 (an example of the second neural network model described herein) corresponds to instruction sequence 2 (an example of the second instruction sequence described herein), and neural network model 1 is executed before neural network model 2. As shown in Fig. 6, an exemplary flow 600 for inserting the load instructions of the beginning stage of neural network model 2 (i.e., the load instructions preceding the first compute instruction in instruction sequence 2) into instruction sequence 1 may include the following steps (a code sketch of this flow is given after the steps):
Step 601, traversing the instruction sequence 2, finding a first calculation instruction (for example, a first convolution instruction) in the instruction sequence 2, and extracting all instructions (i.e., all loading instructions) before the first calculation instruction in the instruction sequence 2 to form a candidate instruction set;
Step 602, determining the size of the data operated by each instruction by analyzing each instruction in the candidate instruction set, and arranging each instruction in the candidate instruction set according to the sequence of the operated data from big to small;
Step 603, reading an instruction with the top rank from the candidate instruction set, if the reading fails, indicating that the candidate instruction set is empty, ending the current flow, if the reading succeeds, indicating that the candidate instruction set is not empty, and continuing to step 604;
Here, assume that the currently fetched instruction is instruction A;
Step 604, determining the insertion position of the instruction A in the instruction sequence 1, continuing to step 606 if the instruction A does not have a legal insertion position in the instruction sequence 1, and continuing to step 605 if the instruction A has a legal insertion position in the instruction sequence 1;
Step 605, inserting the instruction a at the insertion position in the instruction sequence 1, and returning to step 603 to continue fetching the next instruction in the candidate instruction set.
Step 606, placing instruction A back before the first calculation instruction in instruction sequence 2.
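The following Python sketch summarizes flow 600 under stated assumptions: each instruction object exposes a kind attribute ('load', 'compute', ...) and a data_size attribute, a find_insert_position helper (sketched further below) returns an index in instruction sequence 1 or None, and the relative order of the loads put back into instruction sequence 2 is assumed not to matter. All names are illustrative, not the embodiment's actual interfaces.

```python
def schedule_startup_loads(seq1, seq2, find_insert_position):
    """Sketch of flow 600: move startup load instructions of seq2 into seq1.

    seq1 and seq2 are lists of instruction objects with .kind and .data_size
    attributes (assumed); find_insert_position(instr, seq1) returns an index
    into seq1 or None when no legal position exists.
    """
    # Step 601: collect all loads that precede the first compute instruction.
    first_compute = next((i for i, ins in enumerate(seq2) if ins.kind == 'compute'),
                         len(seq2))
    candidates = seq2[:first_compute]
    # Step 602: sort candidates by operated data size, largest first.
    candidates.sort(key=lambda ins: ins.data_size, reverse=True)
    remaining = []
    # Steps 603-606: try to insert each candidate; put the rest back.
    for instr in candidates:
        pos = find_insert_position(instr, seq1)
        if pos is None:
            remaining.append(instr)      # step 606: keep before the first compute
        else:
            seq1.insert(pos, instr)      # step 605: insert at the legal position
    seq2[:first_compute] = remaining
    return seq1, seq2
```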
In one example, assuming that instruction sequence 1 includes three instructions, i.e., instruction B, instruction C, and instruction D, the exemplary process of determining the insertion position of instruction a in instruction sequence 1 in step 604 may include the following steps:
Step 1, initializing the legal position record of instruction A, and starting the search from the last instruction of instruction sequence 1 (i.e., the first instruction counted from the end);
Here, it is assumed that the instruction currently being checked is instruction B.
Here, initializing the legal position record of instruction A includes: generating a legal position record for instruction A, and setting the value of each parameter in the record to a default value, where the default value is an invalid value. The parameters in the legal position record may include: the latest legal position of instruction A in instruction sequence 1, the number of parallel clock cycles when instruction A is inserted at that latest legal position, and a first flag indicating whether the latest legal position has a hardware resource conflict.
Step 2, checking whether a data conflict exists between instruction B and instruction A; if so, determining that instruction A has no legal position in instruction sequence 1 and ending the current flow; otherwise, continuing to step 3;
Step 3, checking whether a hardware resource conflict exists between instruction B and instruction A; if so, continuing to step 5, otherwise continuing to step 4;
Step 4, determining the position before the instruction B in the instruction sequence 1 as the latest legal position of the instruction A, estimating the number m of parallel clock cycles when the instruction A is inserted into the latest legal position, updating the legal position record of the instruction A, and continuing to step 5;
Here, updating the legal position record of instruction A includes: updating the latest legal position of instruction A in instruction sequence 1 to the position before instruction B; updating the number of parallel clock cycles of instruction A to the parallel clock cycle number m; and updating the value of the first flag to indicate that the position before instruction B has no hardware resource conflict.
Here, the number of parallel clock cycles is an exemplary representation of the parallel time described above. In practice, other representations are not excluded. The representation manner of the parallel time in the embodiment of the present application is not limited to this.
Here, the process of estimating the parallel clock cycle number m may include: determining the one or more instructions in instruction sequence 1 that run in parallel with instruction A when instruction A is inserted at the position before instruction B; estimating the runtime (for example, the number of running clock cycles) of each of those instructions based on the data it operates on; and taking the sum of those runtimes as the parallel clock cycle number m.
Step 5, judging whether instruction B is the frontmost instruction in instruction sequence 1; if so, continuing to step 14, otherwise continuing the search with instruction C, the second instruction from the end of instruction sequence 1;
Step 6, checking whether a data conflict exists between instruction C and instruction A; if so, continuing to step 14, otherwise continuing to step 7;
Step 7, checking whether a hardware resource conflict exists between the instruction C and the instruction A, if so, continuing to step 14, otherwise, continuing to step 8;
Step 8, estimating the parallel clock cycle number n when instruction A is inserted at the position before instruction C in instruction sequence 1, and comparing n with the parallel clock cycle number m in the legal position record of instruction A; if n is greater than or equal to m, determining the position before instruction C as the latest legal position of instruction A and updating the legal position record of instruction A accordingly; if n is less than m, keeping the legal position record of instruction A unchanged; continuing to step 9;
Here, the process of estimating the number n of parallel clock cycles is similar to the process of estimating the number m of parallel clock cycles described above, and is not described again.
Here, updating the legal position record of instruction A includes: updating the latest legal position of instruction A in instruction sequence 1 to the position before instruction C; updating the number of parallel clock cycles of instruction A to the parallel clock cycle number n; and updating the value of the first flag to indicate that the position before instruction C has no hardware resource conflict.
Step 9, judging whether instruction C is the frontmost instruction in instruction sequence 1; if so, continuing to step 14, otherwise continuing the search with instruction D, the third instruction from the end of instruction sequence 1;
Step 10, checking whether a data conflict exists between the instruction D and the instruction A, and if so, continuing to step 14; otherwise, continuing to step 11;
Step 11, checking whether a hardware resource conflict exists between the instruction D and the instruction A, if so, continuing to step 13, otherwise, continuing to step 12;
Step 12, estimating the parallel clock cycle number k when instruction A is inserted at the position before instruction D in instruction sequence 1, and comparing k with the parallel clock cycle number currently in the legal position record of instruction A (which may be m or n); if k is greater than or equal to the recorded parallel clock cycle number, determining the position before instruction D as the latest legal position of instruction A and updating the legal position record of instruction A accordingly; if k is less than the recorded parallel clock cycle number, keeping the legal position record of instruction A unchanged; continuing to step 13;
Here, the process of estimating the number k of parallel clock cycles is similar to the above-described process of estimating the number m of parallel clock cycles, and is not described again.
Here, updating the legal position record of instruction A includes: updating the latest legal position of instruction A in instruction sequence 1 to the position before instruction D; updating the number of parallel clock cycles of instruction A to the parallel clock cycle number k; and updating the value of the first flag to indicate that the position before instruction D has no hardware resource conflict.
Step 13, determining that instruction D is the frontmost instruction in instruction sequence 1, and continuing to step 14;
Step 14, reading the legal position record of instruction A and judging whether the value of each parameter in the record is an invalid value; if not, determining the latest legal position in the legal position record of instruction A as the insertion position of instruction A in instruction sequence 1; if so, determining that instruction A has no legal insertion position in instruction sequence 1, and ending the current flow.
It should be noted that the above flow is only an exemplary implementation manner of determining the insertion position of the instruction a in the instruction sequence 1, and the embodiment of the present application is not limited thereto.
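As a non-authoritative illustration, the following Python sketch generalizes steps 1 to 14 into a single backward scan over instruction sequence 1. The helpers has_data_conflict, has_hw_conflict and estimate_parallel_cycles are assumed callables (a sketch of the two conflict checks is given in the exemplary apparatus section below), and this sketch keeps scanning past a hardware resource conflict instead of ending the search, which is a simplification of the example flow above.

```python
def find_insert_position(instr, seq1, has_data_conflict, has_hw_conflict,
                         estimate_parallel_cycles):
    """One reading of steps 1-14: scan instruction sequence 1 from its end and
    keep the legal position with the largest number of parallel clock cycles.

    The three helper callables are assumptions of this sketch; `pos` is an index
    such that seq1.insert(pos, instr) places instr before seq1[pos].
    Returns that index, or None when no legal position exists (step 14).
    """
    best_pos, best_cycles = None, -1              # step 1: record starts invalid
    for pos in range(len(seq1) - 1, -1, -1):      # check from the end of sequence 1
        checked = seq1[pos]
        if has_data_conflict(instr, checked):     # steps 2/6/10
            break                                 # no earlier position can be legal
        if has_hw_conflict(instr, checked):       # steps 3/7/11 (simplified: skip, don't abort)
            continue
        cycles = estimate_parallel_cycles(instr, seq1, pos)
        if cycles >= best_cycles:                 # '>=' prefers the earlier of equally good positions
            best_pos, best_cycles = pos, cycles   # steps 4/8/12: update the record
    return best_pos
```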
Optionally, when instruction A has no legal position in instruction sequence 1 and the data it operates on is large (for example, when the size of the portion of that data that does not conflict with the instructions in instruction sequence 1 exceeds a preset threshold), instruction A may be split into two groups of instructions: the first group has a data conflict with an instruction in instruction sequence 1 (which may be any instruction in instruction sequence 1), while the second group has no data conflict with the instructions in instruction sequence 1. The first group is put back into instruction sequence 2; for the second group, the insertion positions of its one or more instructions in instruction sequence 1 may be determined according to the above flow, and those instructions may then be inserted into instruction sequence 1. In this way, even when a load instruction in instruction sequence 2 has no legal position, part of it can still be inserted into instruction sequence 1 by splitting, which further saves loading time in the starting stage of the neural network model 2 and improves the overall execution efficiency of the plurality of neural network models.
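A minimal sketch of this splitting step is given below. It assumes each load instruction carries a list of address regions it reads, a hypothetical with_regions constructor for building a narrower load, and an assumed regions_conflict helper that reports whether a region conflicts with any instruction already in instruction sequence 1; none of these names come from the patent text.

```python
def split_load(instr, seq1, regions_conflict):
    """Split a load whose data partly conflicts with instruction sequence 1.

    instr.regions: list of address regions read by the load (assumption);
    instr.with_regions(regions): hypothetical constructor for a narrower load;
    regions_conflict(region, seq1): assumed helper reporting a data conflict.
    Returns (first_group, second_group): the first group keeps the conflicting
    data and goes back into instruction sequence 2, the second group is the
    candidate for insertion into instruction sequence 1.
    """
    conflicting, clean = [], []
    for region in instr.regions:
        (conflicting if regions_conflict(region, seq1) else clean).append(region)
    first_group = instr.with_regions(conflicting)
    second_group = instr.with_regions(clean)
    return first_group, second_group
```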
Fig. 7 shows an example of instruction scheduling between instruction sequences using the instruction scheduling method of the neural network model provided in the embodiment of the present application. In this example, an instruction sequence a (an example of the first instruction sequence in the present application) and an instruction sequence b (an example of the second instruction sequence in the present application) correspond to a neural network model a (an example of the first neural network model in the present application) and a neural network model b (an example of the second neural network model in the present application), respectively, and the neural network model a runs before the neural network model b. With the above method of the embodiment of the present application, the feature data loading instruction "LD Feature" and the weight loading instruction "LD Weight" of the starting stage of the neural network model b can be inserted between the convolution calculation instructions "CONV" of the instruction sequence a, so that these loading instructions of the instruction sequence b are executed in advance, in parallel with the convolution calculation instructions "CONV" of the instruction sequence a, while the instruction sequence a is running. In this way, the loading of the starting stage of the neural network model b is already completed before the neural network model b starts to run, the hardware does not need to wait during that starting stage, and running time is saved.
Fig. 8 shows the overall operation flow of three neural network models after the instruction scheduling method of the neural network model provided by the embodiment of the present application is applied. A pipeline is established among the instruction sequences of the neural network model 1, the neural network model 2 and the neural network model 3: the loading of the neural network model 2 can be executed while the calculation of the neural network model 1 is executed; the calculation of the neural network model 2 can be executed while the "storing of results" of the neural network model 1 is executed; the loading of the neural network model 3 can be executed while the calculation of the neural network model 2 is executed; and the calculation of the neural network model 3 can be executed while the "storing of results" of the neural network model 2 is executed. The whole operation process can be completed in 5 time periods, and compared with the related art shown in fig. 2, the overall execution efficiency is obviously improved without increasing hardware resources.
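For illustration only, one plausible 5-period timeline consistent with the description of fig. 8 is sketched below; the exact period boundaries and the single load/compute/store granularity per model are assumptions of the sketch, not a reproduction of the figure.

```python
# One plausible 5-period timeline consistent with the fig. 8 description
# (period boundaries and per-model granularity are assumptions).
timeline = {
    1: ["load model 1"],
    2: ["compute model 1", "load model 2"],
    3: ["store results 1", "compute model 2", "load model 3"],
    4: ["store results 2", "compute model 3"],
    5: ["store results 3"],
}
for period, work in timeline.items():
    print(f"period {period}: " + " | ".join(work))
```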
In the embodiment of the present application, the processing of the above exemplary method may be performed on the instruction sequences of multiple neural network models simultaneously, or one by one, depending on the configuration of hardware resources and the running order of the neural network models. For example, if there is only one processor (e.g., only one BPU) running the neural network models, only one instruction sequence can be run at a time; the processing of the exemplary method can then be performed on the instruction sequences of the plurality of neural network models one by one, in the running order of the neural network models, so as to establish a pipeline between the instruction sequences of the plurality of neural network models. When there are multiple processors (e.g., multiple BPUs) running the neural network models, multiple neural network models can be run simultaneously; accordingly, the processing of the above exemplary method can be performed on the instruction sequences of the multiple neural network models simultaneously, in the running order of the neural network models, so as to establish a pipeline between the instruction sequences of the multiple neural network models that run simultaneously.
In one implementation of the embodiment of the present application, the step 501 of determining a second instruction sequence corresponding to a second neural network model to be executed, the step 502 of selecting at least one instruction in the second instruction sequence, and the step 503 of inserting the at least one instruction into the first instruction sequence may be executed by a second processor in the system shown in fig. 4; the first and second instruction sequences may be compiled by a first processor in the system of fig. 4. Here, the first processor and the second processor are different types of processors. The step 504 of executing the first sequence of instructions comprising the at least one instruction may be performed by a third processor of the system shown in fig. 4.
Fig. 9 shows an exemplary process in which the system shown in fig. 4 executes the exemplary method of the embodiment of the present application. After the first processor compiles the instruction sequences of the respective neural network models (one neural network model corresponding to one instruction sequence) and provides them to the second processor, the second processor executes the processing of the exemplary method of the embodiment of the present application according to the running order of the neural network models and loads the result into the memory, and the third processor reads the instruction sequences, or the instructions in them, from the memory and runs them. Here, the loading process may include: writing the instruction sequence of the neural network model to be run into the memory (e.g., a DDR), and providing its starting address in the memory (i.e., the first address) and the length of the instruction sequence (i.e., how many instructions it includes) to the third processor, so that the third processor can read the instruction sequence from the memory. In one example, the second processor may perform the exemplary method of the embodiment of the present application as follows: the second processor determines the instruction sequence to be executed next (assumed to be instruction sequence E) according to the running order of the neural network models, extracts instruction sequence E from the instruction sequences compiled by the first processor, and puts it into a cache; while instruction sequence E waits in the cache, the second processor performs the processing of the exemplary method on instruction sequence E and the instruction sequence preceding it (assumed to be instruction sequence D), that is, it schedules at least one instruction of instruction sequence E into instruction sequence D, and then loads instruction sequence D into the memory so that the third processor reads instruction sequence D from the memory and runs it; in this way, part of the instructions of instruction sequence E are executed in advance, during the running of instruction sequence D. After instruction sequence D has been loaded, the second processor extracts an instruction sequence F whose running order follows instruction sequence E, puts it into the cache, schedules at least one instruction of instruction sequence F into instruction sequence E, and then loads instruction sequence E into the memory; the third processor reads instruction sequence E from the memory and runs it, so that part of the instructions of instruction sequence F are executed in advance, during the running of instruction sequence E. These operations are performed in turn until the last neural network model finishes running.
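The runtime behaviour of the second processor described above can be summarized by the following Python sketch; schedule_startup_loads (sketched earlier) and load_to_memory are assumed interfaces, and the handling of the very first and very last sequences is simplified.

```python
def run_models(compiled_sequences, schedule_startup_loads, load_to_memory):
    """Sketch of the second processor's loop in fig. 9.

    compiled_sequences: per-model instruction sequences in running order, as
    produced by the first processor; schedule_startup_loads(current, nxt) moves
    startup loads of the next sequence into the current one; load_to_memory
    writes a sequence to memory and passes its first address and length to the
    third processor. Both callables are assumed interfaces of this sketch.
    """
    for i, current in enumerate(compiled_sequences):
        nxt = compiled_sequences[i + 1] if i + 1 < len(compiled_sequences) else None
        if nxt is not None:
            schedule_startup_loads(current, nxt)   # e.g. schedule part of E into D
        load_to_memory(current)                    # third processor reads and runs it
```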
In practical applications, instruction scheduling within each sequence may be performed on the instruction sequence of each neural network model at the compiling stage (for example, by the first processor of the system shown in fig. 4), so that a pipeline is established inside each instruction sequence; and instruction scheduling between instruction sequences may be performed by the above exemplary method of the embodiment of the present application at the running stage of the neural network models (for example, by the second processor of the system shown in fig. 4), so that a pipeline is established between the instruction sequences. In this way, the pipeline inside each instruction sequence is combined with the pipeline between instruction sequences, and the pipelining capability of the hardware resources can be exploited to the maximum extent, so that the overall execution efficiency of the plurality of neural network models is improved to a greater extent without increasing hardware resources.
The above exemplary method of the embodiment of the application can be applied to various situations where a plurality of neural network models need to be operated, and can improve the overall execution efficiency of the plurality of neural network models on the premise of not increasing hardware resources and ensuring correct operation results.
According to statistics, for a neural network model with a simple structure (for example, a two-layer neural network), the time required for loading data and loading parameters accounts for 30%-50% of the total running time of the neural network. Therefore, for a plurality of neural network models with simple structures, the above method of the embodiment of the present application can save about 40% of the running time for the plurality of neural network models as a whole in the best case, and about 20% on average; accordingly, the overall execution efficiency of the plurality of neural network models can be improved by roughly 20% (on average) to 40% (in the best case).
According to statistics, for a neural network with a complex structure (for example, a structure with on the order of one thousand layers), the time required for loading data and loading parameters accounts for 3%-5% of the total running time of the neural network. For a plurality of neural network models with complex structures, the above method of the embodiment of the present application can save about 3% of the running time for the plurality of neural network models as a whole, and the overall execution efficiency of the plurality of neural network models can accordingly be improved by at least 3%.
Exemplary devices
FIG. 10 shows an exemplary apparatus 10 for instruction scheduling of a neural network model provided by an exemplary embodiment of the present application. As shown in fig. 10, the exemplary apparatus 10 for instruction scheduling of a neural network model includes:
An instruction sequence determining unit 101 configured to determine a second instruction sequence corresponding to a second neural network model to be executed when a first instruction sequence corresponding to a first neural network model needs to be executed, wherein the first neural network model is executed before the second neural network model;
An instruction selection unit 102 configured to select at least one instruction in the second instruction sequence;
An instruction insertion unit 103 configured to insert the at least one instruction into the first instruction sequence; and
An instruction execution unit 104 configured to execute the first instruction sequence including the at least one instruction.
Fig. 11 shows an exemplary apparatus 11 for instruction scheduling of a neural network model provided by an exemplary embodiment of the present application. In an implementation manner of the embodiment of the present application, the instruction selecting unit 102 may include a traversal module 1021 and a selection module 1022, where the traversal module 1021 is configured to traverse the second instruction sequence to find the first computation instruction in the second instruction sequence, and the selection module 1022 is configured to select an instruction in the second instruction sequence that precedes the first computation instruction. In this way, the instructions of the starting stage of the second neural network model are specifically targeted for scheduling between instruction sequences, so that the starting stage of the second neural network model can run in parallel with part of the stages (such as an intermediate stage) of the first neural network model. This saves the execution time of the starting stage of the second neural network model, prevents the hardware resources (such as the BPU) from waiting for a long time during that starting stage, and improves the overall execution efficiency of the plurality of neural network models without increasing hardware resources.
In an implementation manner of the embodiment of the present application, the instruction insertion unit 103 is configured to insert the at least one instruction into the first instruction sequence one by one, in descending order of the size of the data the instructions operate on. In this way, the instructions of the second neural network model with longer runtimes are preferentially inserted into the first instruction sequence, so that they can run in parallel with one or more instructions in the first instruction sequence; this saves more running time (for example, the time for loading data in the starting stage of the second neural network model) and further improves the overall execution efficiency of the plurality of neural network models.
In an implementation manner of the embodiment of the present application, the at least one instruction is a load instruction for loading data required by the operation of the second neural network model. Here, the data may include, but is not limited to, feature data, weight parameters, and the like. In this implementation, the instruction insertion unit 103 is configured to insert the at least one instruction between the computation instructions in the first instruction sequence. In this way, the load instructions of the starting stage of the second neural network model are advanced into the running process of the first neural network model, the loading time of that starting stage is saved, the waiting time of the BPU at the starting stage is reduced, and the execution efficiency is improved.
As shown in fig. 11, in an implementation manner of the embodiment of the present application, the instruction insertion unit 103 may include a first checking module 1031 and/or a second checking module 1032, where the first checking module 1031 is configured to check whether there is a data conflict between the at least one instruction and each instruction in the first instruction sequence, so as to determine a legal position of the at least one instruction in the first instruction sequence and insert the at least one instruction into the first instruction sequence; and the second checking module 1032 is configured to check whether there is a hardware resource conflict between the at least one instruction and each instruction in the first instruction sequence, so as to determine a legal position of the at least one instruction in the first instruction sequence and insert the at least one instruction into the first instruction sequence. In this way, before the at least one instruction is inserted into the first instruction sequence, its legal position in the first instruction sequence is determined through checking, conflicts can be effectively avoided, a more reasonable pipeline is established among the instruction sequences of the plurality of neural network models, and errors in the running or calculation of the neural network models are prevented, so that the overall execution efficiency of the plurality of neural network models is improved while the correctness of the overall operation result is guaranteed.
Here, the first checking module 1031 and/or the second checking module 1032 are configured to perform the checking on each instruction starting from the end of the first instruction sequence.
In this embodiment of the application, the first checking module 1031 is configured to: judge whether the memory address accessed by the at least one instruction overlaps with the memory address accessed by an instruction of the first instruction sequence, and whether the at least one instruction and/or the instruction of the first instruction sequence performs a write operation on data at the memory address; and when the memory addresses overlap and the at least one instruction and/or the instruction of the first instruction sequence performs a write operation on the data at the memory address, determine that there is a data conflict between the at least one instruction and the instruction of the first instruction sequence.
In this embodiment of the application, the second checking module 1032 is configured to query pre-configured mapping relationship information between each instruction and the hardware resources it uses, and determine, based on the mapping relationship information, whether the at least one instruction and an instruction of the first instruction sequence need to use the same hardware resource; when the same hardware resource is needed, it is determined that there is a hardware resource conflict between the at least one instruction and the instruction of the first instruction sequence.
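A hedged Python sketch of the two checks performed by these modules is given below. The .reads/.writes attributes, the (start, end) range representation, and the RESOURCE_MAP contents are illustrative assumptions, not the pre-configured mapping actually used by the embodiment.

```python
def has_data_conflict(a, b):
    """Data conflict check: overlapping memory ranges with at least one writer.

    Instructions are assumed to expose .reads and .writes as sets of half-open
    (start, end) address ranges; range overlap is checked pairwise.
    """
    def overlaps(r1, r2):
        return r1[0] < r2[1] and r2[0] < r1[1]
    for wa in a.writes:
        if any(overlaps(wa, r) for r in b.reads | b.writes):
            return True
    for wb in b.writes:
        if any(overlaps(wb, r) for r in a.reads | a.writes):
            return True
    return False

# Hypothetical pre-configured mapping from instruction kind to the hardware it uses.
RESOURCE_MAP = {"load": {"ddr_bus", "sram_port0"},
                "conv": {"mac_array", "sram_port1"},
                "store": {"ddr_bus", "sram_port0"}}

def has_hw_conflict(a, b, resource_map=RESOURCE_MAP):
    """Hardware resource conflict check: the two instructions need a common resource."""
    return bool(resource_map[a.kind] & resource_map[b.kind])
```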
As shown in fig. 11, in some implementations of the embodiment of the present application, the instruction insertion unit 103 may further include: a first determining module 1033 configured to determine one legal position as the insertion position of the at least one instruction when the at least one instruction has two or more legal positions in the first instruction sequence; and an insert operation module 1034 configured to insert the at least one instruction at that insertion position in the first instruction sequence. In one implementation manner of the embodiment of the present application, the first determining module 1033 may be configured to determine the legal position with the longest parallel time as the insertion position of the at least one instruction when the at least one instruction has two or more legal positions in the first instruction sequence; in another implementation manner, the first determining module 1033 may be configured to determine, as the insertion position, the legal position that has the longest parallel time and is the frontmost in the first instruction sequence. In this way, the running time can be saved as much as possible, and the overall execution efficiency of the plurality of neural network models can be greatly improved.
In one example, the instruction insertion unit 103 may further include: a second determining module 1035 configured to determine the one or more instructions in the first instruction sequence that are parallel to the at least one instruction at each legal position; and an estimation module 1036 configured to estimate the parallel time of each legal position based on the size of the data operated on by the one or more instructions determined by the second determining module. The first determining module 1033 is configured to determine the insertion position of the at least one instruction by using the parallel time obtained by the estimation module 1036. The two ways in which the first determining module determines the insertion position by using the parallel time have been described above and are not repeated here.
As shown in fig. 11, in some implementations of the embodiment of the present application, the instruction insertion unit 103 may further include: a splitting module 1037 configured to split the at least one instruction into two groups of instructions when the at least one instruction has no legal position in the first instruction sequence; and an insert operation module 1034 configured to insert, into the first instruction sequence, the group of instructions obtained by the splitting module 1037 that has no data conflict with the instructions in the first instruction sequence. In this way, by splitting instructions in the second instruction sequence, instructions with longer running times in the second instruction sequence can still be partially inserted into the first instruction sequence, so that more running time is saved and the overall execution efficiency of the plurality of neural network models is improved to a greater extent without increasing hardware resources.
As shown in fig. 11, in an implementation manner of the embodiment of the present application, the instruction insertion unit 103 may further include: a third determining module 1038 for determining that there is conflicting data and non-conflicting data between the at least one instruction and an instruction in the first sequence of instructions; and a splitting module 1037 configured to split the at least one instruction into two groups of instructions according to the data with conflict and the data without conflict, where a first group of instructions of the two groups of instructions is used for operating the data with conflict, and a second group of instructions of the two groups of instructions is used for operating the data without conflict.
The device provided by the embodiment of the application can be suitable for various scenes in which a plurality of neural network models need to be operated, and can improve the overall execution efficiency of the plurality of neural network models on the premise of not increasing hardware resources and ensuring correct operation results. It should be noted that the above-mentioned exemplary apparatus 10 and the exemplary apparatus 11 are both examples, and the specific structure of the instruction scheduling apparatus of the neural network model in the embodiment of the present application is not limited to these two ways.
In practical applications, the above exemplary apparatus 10 and the exemplary apparatus 11 of the embodiments of the present application may be implemented by an operating device in an "exemplary system". In one implementation, in the exemplary apparatus 10 and the exemplary apparatus 11, the instruction sequence determining unit 101, the instruction selecting unit 102, and the instruction inserting unit 103 may be implemented by a second processor in the system shown in fig. 4, and the instruction executing unit 104 may be implemented by a third processor in the system shown in fig. 4.
Exemplary electronic device
In addition to the above method, an embodiment of the present application may also be an electronic device including: one or more processors; and a memory storing computer instructions that, when executed by the processor, cause the processor to perform the steps in the instruction scheduling method of the neural network model according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The electronic equipment provided by the embodiment of the application can be suitable for various scenes in which a plurality of neural network models need to be operated, and can improve the overall execution efficiency of the plurality of neural network models on the premise of not increasing hardware resources and ensuring correct operation results. In practical applications, the electronic device according to the embodiment of the present application may be implemented by an operating device in an "exemplary system". In one implementation, the second processor and the third processor in the system shown in fig. 4 may be included in the electronic device.
FIG. 12 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 12, electronic device 12 includes one or more processors 121 and memory 122.
Processor 121 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in electronic device 12 to perform desired functions.
Memory 122 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 121 to implement the instruction scheduling methods of the neural network models of the various embodiments of the present application described above and/or other desired functions.
In one example, the electronic device 12 may further include: an input device 123 and an output device 124, which are interconnected by a bus system and/or other form of connection mechanism (not shown). The input means 123 may be, for example, a microphone or a microphone array. The input device 123 may also include, for example, a keyboard, a mouse, and the like. The output device 124 can output various information to the outside. The output devices 124 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device 12 relevant to the present application are shown in fig. 12, and components such as buses, input/output interfaces, and the like are omitted. In addition, electronic device 12 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of instruction scheduling of a neural network model according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in the instruction scheduling method of a neural network model according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising", "having", and the like are open-ended words that mean "including, but not limited to", and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or", unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (18)

1. An instruction scheduling method of a neural network model comprises the following steps:
When a first instruction sequence corresponding to a first neural network model needs to be operated, determining a second instruction sequence corresponding to a second neural network model to be operated, wherein the first neural network model is operated before the second neural network model;
Selecting at least one instruction in the second sequence of instructions;
Checking the at least one instruction to determine a legal position of the at least one instruction in the first sequence of instructions;
Inserting the at least one verified instruction into the first instruction sequence; and
Executing the first instruction sequence including the at least one instruction.
2. The instruction scheduling method of claim 1, wherein selecting at least one instruction in the second sequence of instructions comprises:
Traversing the second instruction sequence to find a first calculation instruction in the second instruction sequence; and
Selecting an instruction in the second sequence of instructions that precedes the first compute instruction.
3. The instruction scheduling method of claim 1, wherein inserting the at least one instruction into the first instruction sequence comprises:
And inserting the at least one instruction into the first instruction sequence one by one according to the descending order of the data of the instruction operation.
4. The instruction scheduling method according to claim 1, wherein said at least one instruction is a load instruction for loading data required for said second neural network model operation.
5. The instruction scheduling method of claim 1, wherein inserting the at least one instruction into the first instruction sequence comprises: inserting the at least one instruction between compute instructions in the first instruction sequence.
6. The instruction scheduling method of claim 1, wherein checking said at least one instruction to determine said legal position of said at least one instruction in said first sequence of instructions comprises:
And checking whether data conflict and/or hardware resource conflict exist between the at least one instruction and each instruction in the first instruction sequence to determine the legal position of the at least one instruction in the first instruction sequence so as to insert the at least one instruction into the first instruction sequence.
7. The instruction scheduling method of claim 6,
And if two or more legal positions of the at least one instruction exist in the first instruction sequence, determining the legal position with the longest parallel time as the insertion position of the at least one instruction.
8. The instruction scheduling method of claim 6, wherein if said at least one instruction has two or more legal positions in said first instruction sequence, determining the legal position which has the longest parallel time and is the first most in the first instruction sequence as the insertion position of said at least one instruction.
9. The instruction scheduling method of claim 6,
If the legal position of the at least one instruction in the first instruction sequence does not exist, splitting the at least one instruction into two groups of instructions, and inserting one group of instructions which do not have data conflict with the instructions in the first instruction sequence into the first instruction sequence.
10. The instruction scheduling method of claim 9, wherein said splitting said at least one instruction into two groups of instructions comprises:
Determining that there is conflicting data and non-conflicting data between the at least one instruction and an instruction in the first sequence of instructions; and
And splitting the at least one instruction into two groups of instructions according to the data with conflict and the data without conflict, wherein the first group of instructions in the two groups of instructions is used for operating the data with conflict, and the second group of instructions in the two groups of instructions is used for operating the data without conflict.
11. The instruction scheduling method of claim 7 or 8, wherein determining the legal position with the longest parallel time comprises:
Determining one or more instructions in the first sequence of instructions that are parallel to the at least one instruction at respective legal locations;
Estimating parallel times for the respective legal locations based on the size of the one or more instruction operation data;
And determining the legal position with the longest parallel time based on the estimated parallel time of each legal position.
12. The instruction scheduling method according to claim 6, wherein said checking is performed for each instruction starting from the end of said first instruction sequence.
13. The instruction scheduling method of claim 6, wherein checking whether there is a data conflict between the at least one instruction and each instruction in the first sequence of instructions comprises:
Judging whether the memory address accessed by the at least one instruction is overlapped with the memory address accessed by the instructions of the first instruction sequence, and whether the at least one instruction and/or the instructions of the first instruction sequence execute write operation on data in the memory address;
And when the memory addresses are overlapped and the at least one instruction and/or the instruction of the first instruction sequence execute the write operation on the data in the memory addresses, determining that the at least one instruction and the instruction of the first instruction sequence have data conflict.
14. The instruction scheduling method of claim 6, wherein checking whether there is a hardware resource conflict between the at least one instruction and each instruction in the first instruction sequence comprises:
Inquiring mapping relation information between each pre-configured instruction and a hardware resource used by the instruction, and judging whether the at least one instruction and the instruction of the first instruction sequence need to use the same hardware resource or not based on the mapping relation information;
Determining that there is a hardware resource conflict of the at least one instruction with an instruction of the first sequence of instructions when use of the same hardware resource is required.
15. The instruction scheduling method of claim 1,
The step of determining a second sequence of instructions corresponding to a second neural network model to be run, the step of selecting at least one instruction in the second sequence of instructions, the step of checking the at least one instruction to determine the legal position of the at least one instruction in the first sequence of instructions, and the step of inserting the checked at least one instruction into the first sequence of instructions are performed by a second processor;
The first and second sequences of instructions are compiled by a first processor, and the first and second processors are different types of processors.
16. An electronic device, comprising:
One or more processors; and
A memory storing computer instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 15.
17. An instruction scheduling apparatus of a neural network model, comprising:
The instruction sequence determining unit is configured to determine a second instruction sequence corresponding to a second neural network model to be operated when a first instruction sequence corresponding to a first neural network model needs to be operated, wherein the first neural network model is operated before the second neural network model;
An instruction selection unit configured to select at least one instruction in the second instruction sequence;
The instruction insertion unit is configured to verify the at least one instruction to determine a legal position of the at least one instruction in the first instruction sequence, and insert the verified at least one instruction into the first instruction sequence; and
An instruction execution unit configured to execute the first instruction sequence including the at least one instruction.
18. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 15.
CN201811276880.XA 2018-10-30 2018-10-30 Instruction scheduling method and device of neural network model Active CN109272109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811276880.XA CN109272109B (en) 2018-10-30 2018-10-30 Instruction scheduling method and device of neural network model

Publications (2)

Publication Number Publication Date
CN109272109A CN109272109A (en) 2019-01-25
CN109272109B true CN109272109B (en) 2020-07-17

Family

ID=65195560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811276880.XA Active CN109272109B (en) 2018-10-30 2018-10-30 Instruction scheduling method and device of neural network model

Country Status (1)

Country Link
CN (1) CN109272109B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767078B (en) * 2019-04-02 2024-08-06 上海寒武纪信息科技有限公司 Data operation method, device and related product
CN112148291A (en) * 2019-06-26 2020-12-29 中兴通讯股份有限公司 Instruction block processing method and device, storage medium and electronic device
CN112766470B (en) * 2019-10-21 2024-05-07 地平线(上海)人工智能技术有限公司 Feature data processing method, instruction sequence generating method, device and equipment
CN110908667B (en) * 2019-11-18 2021-11-16 北京迈格威科技有限公司 Method and device for joint compilation of neural network and electronic equipment
CN111190401A (en) * 2019-12-30 2020-05-22 讯飞智元信息科技有限公司 Instruction scheduling method, hydraulic engineering control method, equipment and medium
CN113496275B (en) 2020-04-08 2023-07-25 北京地平线机器人技术研发有限公司 Instruction execution method and device and electronic equipment
CN112348179B (en) * 2020-11-26 2023-04-07 湃方科技(天津)有限责任公司 Efficient convolutional neural network operation instruction set architecture construction method and device, and server
CN112540835B (en) * 2020-12-10 2023-09-08 北京奇艺世纪科技有限公司 Method and device for operating hybrid machine learning model and related equipment
CN112766478B (en) * 2021-01-21 2024-04-12 中国电子科技集团公司信息科学研究院 FPGA (field programmable Gate array) pipeline structure oriented to convolutional neural network
CN113221642B (en) * 2021-04-02 2024-04-05 哈尔滨鹏博普华科技发展有限责任公司 Violation snapshot image AI recognition system
CN118171711B (en) * 2024-05-15 2024-07-19 上海为旌科技有限公司 Instruction scheduling method, system, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105408859A (en) * 2013-09-12 2016-03-16 马维尔国际贸易有限公司 Method and system for instruction scheduling
CN105468335A (en) * 2015-11-24 2016-04-06 中国科学院计算技术研究所 Pipeline-level operation device, data processing method and network-on-chip chip
CN106485318A (en) * 2015-10-08 2017-03-08 上海兆芯集成电路有限公司 There is the processor of mixing coprocessor/performance element neutral net unit
CN107301455A (en) * 2017-05-05 2017-10-27 中国科学院计算技术研究所 Mixing cube storage system and speed-up computation method for convolutional neural networks
CN107491287A (en) * 2017-08-30 2017-12-19 苏州乐麟无线信息科技有限公司 The execution method and device of instruction
CN108416422A (en) * 2017-12-29 2018-08-17 国民技术股份有限公司 A kind of convolutional neural networks implementation method and device based on FPGA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018081592A (en) * 2016-11-17 2018-05-24 富士通株式会社 Compile program, compile method, and compiler

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105408859A (en) * 2013-09-12 2016-03-16 马维尔国际贸易有限公司 Method and system for instruction scheduling
CN106485318A (en) * 2015-10-08 2017-03-08 上海兆芯集成电路有限公司 There is the processor of mixing coprocessor/performance element neutral net unit
CN105468335A (en) * 2015-11-24 2016-04-06 中国科学院计算技术研究所 Pipeline-level operation device, data processing method and network-on-chip chip
CN107301455A (en) * 2017-05-05 2017-10-27 中国科学院计算技术研究所 Mixing cube storage system and speed-up computation method for convolutional neural networks
CN107491287A (en) * 2017-08-30 2017-12-19 苏州乐麟无线信息科技有限公司 The execution method and device of instruction
CN108416422A (en) * 2017-12-29 2018-08-17 国民技术股份有限公司 A kind of convolutional neural networks implementation method and device based on FPGA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an FPGA-based Deep Learning Accelerator; Yu Qi; 《Design and Implementation of an FPGA-based Deep Learning Accelerator》; 2016-09-15; chapters 3-4 of the main text *

Also Published As

Publication number Publication date
CN109272109A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109272109B (en) Instruction scheduling method and device of neural network model
US7543282B2 (en) Method and apparatus for selectively executing different executable code versions which are optimized in different ways
US7503039B2 (en) Preprocessor to improve the performance of message-passing-based parallel programs on virtualized multi-core processors
JP5611756B2 (en) Program flow control
US20190079805A1 (en) Execution node selection method and information processing apparatus
WO2010064260A1 (en) Method and system for parallelization of sequencial computer program codes
CN115829006A (en) Compiling method and device of neural network model, electronic equipment and storage medium
US7712091B2 (en) Method for predicate promotion in a software loop
US10684834B2 (en) Method and apparatus for detecting inter-instruction data dependency
US20160196156A1 (en) Simulation apparatus, simulation method, and computer product
CN112766470B (en) Feature data processing method, instruction sequence generating method, device and equipment
US11188315B1 (en) Method and apparatus for reusable and relative indexed register resource allocation in function calls
US10802854B2 (en) Method and apparatus for interpreting bytecode instruction stream
CN117251387A (en) Data prefetching method, compiling method and related devices
US9395962B2 (en) Apparatus and method for executing external operations in prologue or epilogue of a software-pipelined loop
US9396044B2 (en) Memory efficient thread-level speculation
JP2002014868A (en) Microprocessor having memory referring operation detecting mechanism and compile method
CN112445587A (en) Task processing method and task processing device
CN114327643B (en) Machine instruction preprocessing method, electronic device and computer-readable storage medium
CN116069464B (en) Optimization method and device based on distributed storage call data execution
CN116301874A (en) Code compiling method, electronic device and storage medium
CN118092887B (en) Wasm instruction set generation method, wasm instruction set generation device, terminal and storage medium
KR20130111027A (en) Dynamic memory managing methof in embedded system
CN113407240B (en) Simulation method of C64x + DSP software flow circulation buffer mechanism
EP4113283A1 (en) Compilation system and method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant