CN113255902A - Neural network circuit, system and method for controlling data flow - Google Patents

Neural network circuit, system and method for controlling data flow

Info

Publication number
CN113255902A
Authority
CN
China
Prior art keywords
engine
logic engine
logic
state
physical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010087266.XA
Other languages
Chinese (zh)
Inventor
刘哲
陈云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010087266.XA priority Critical patent/CN113255902A/en
Publication of CN113255902A publication Critical patent/CN113255902A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Combined Controls Of Internal Combustion Engines (AREA)

Abstract

The application discloses a neural network circuit, a system, and a method for controlling data flow, relating to the field of artificial intelligence. The solution can be applied to a recurrent neural network and helps achieve order-preserving execution of the data flow in the neural network. The neural network circuit includes a first physical engine that is virtualized as one or more logic engines. A first logic engine of the one or more logic engines is configured to obtain a first data frame and to compute the first data frame when the state of the first physical engine is a first state and the state of the first logic engine is an idle state. The first state of the first physical engine indicates that no logic engine on the first physical engine is performing a computation; the idle state of the first logic engine indicates that a second logic engine can receive data, the second logic engine being configured to compute the output data of the first logic engine.

Description

Neural network circuit, system and method for controlling data flow
Technical Field
The present application relates to the field of artificial intelligence technology, and more particularly, to neural network circuits, systems, and methods for controlling data flow.
Background
Natural language processing is one of the hardest problems in artificial intelligence. At present, speech recognition, voice wakeup, translation, and related applications are developing rapidly, and natural language processing technology is gradually shifting from traditional hidden Markov models to neural network models. Among neural network models, the recurrent neural network (RNN) is widely used in natural language processing. A recurrent neural network is a neural network with memory capability: its neurons can receive information from other neurons as well as from themselves, forming a network structure with loops, which makes it well suited to processing sequence data. In a recurrent neural network system, the computation performed in each loop process is the same. In addition, the recurrent neural network has a strict sequential dependency: successive loop processes cannot be executed in parallel, and within one loop process the computations also have a strict execution order. Therefore, a solution for controlling data flow in a recurrent neural network is needed.
Disclosure of Invention
The embodiments of the present application provide a neural network circuit, a system, and a method for controlling data flow, which can be applied to recurrent neural networks and other neural networks and help achieve order-preserving execution of the data flow in the neural network.
In order to achieve the above purpose, the embodiments of the present application provide the following technical solutions:
In a first aspect, a neural network circuit is provided that includes a first physical engine virtualized as one or more logic engines. A first logic engine of the one or more logic engines is configured to: obtain a first data frame; and compute the first data frame when the state of the first physical engine is a first state and the state of the first logic engine is an idle state. The first state of the first physical engine indicates that no logic engine on the first physical engine is performing a computation; the idle state of the first logic engine indicates that a second logic engine can receive data, the second logic engine being configured to compute the output data of the first logic engine. The first physical engine may be any physical engine in the neural network circuit, and the first logic engine may be any logic engine on the first physical engine. The first data frame is the data required by the first logic engine to start a computation. The second logic engine belongs either to the first physical engine or to a physical engine of the neural network circuit other than the first physical engine.
In this technical solution, a physical engine is virtualized into one or more logic engines, and the first logic engine is controlled to start its computation conditionally. Specifically, after the first logic engine obtains a data frame, it may start the computation only when, on one hand, no logic engine on the physical engine to which it belongs is performing a computation (that is, at most one logic engine on the first physical engine performs a computation at any given time) and, on the other hand, the second logic engine (i.e., the logic engine that receives the output data of the first logic engine) can receive data (i.e., the buffer space corresponding to the second logic engine is free), so that the computation result can be sent to the second logic engine immediately after the first logic engine finishes computing. Together, these conditions help the logic engines in the neural network circuit execute their computations in order, and therefore help the data flow in the neural network execute in order.
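The start condition described above amounts to a simple two-part check. The following is a minimal illustrative sketch; the constant and function names are assumptions for illustration and are not the implementation defined by the patent.

```python
# Illustrative sketch of the conditional start described above (names assumed).
AVAILABLE, OCCUPIED = 0, 1   # physical engine: "first state" / "second state"
IDLE, BUSY = 0, 1            # logic engine states

def may_start_computation(physical_engine_state: int, first_logic_engine_state: int) -> bool:
    """The first logic engine may compute a received data frame only when no logic
    engine on its physical engine is computing (AVAILABLE) and its own state shows
    that the second logic engine (its receiving end node) can accept data (IDLE)."""
    return physical_engine_state == AVAILABLE and first_logic_engine_state == IDLE
```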
In addition, because at most one logic engine on the first physical engine performs a computation at any given time, the physical resources of the physical engine can be time-division multiplexed among the logic engines deployed on it, which helps improve the utilization of the physical resources. Moreover, because the first logic engine can send its result to the second logic engine immediately after finishing its computation, the data produced by the first logic engine does not need to be cached and can be used directly for the computation of the second logic engine, which helps save buffer space.
In one possible design, the first logic engine and the second logic engine are logic engines that perform computations in one sub-execution process; that is, the first logic engine is not the logic engine that performs the last computation in that sub-execution process. For the logic engine that performs the last computation in a sub-execution process, its receiving end node can be considered able to receive data at all times, so no state needs to be set for it. This saves bit width in the state register and therefore resource overhead, and also helps reduce the complexity of maintaining the logic engines' states.
Optionally, for the steps performed by the logic engine that performs the last computation in a sub-execution process, reference may be made to the description of the fourth logic engine below.
For a recurrent neural network, each of the first m-1 loop processes of the loop layer is one sub-execution process, and the m-th loop process of the loop layer together with the non-loop process forms one sub-execution process of the neural network circuit, where m is the number of loops of the loop layer and m is an integer greater than or equal to 2. For a non-recurrent network, one sub-execution process is equivalent to one execution process.
In one possible design, the first logic engine and the second logic engine are logic engines that perform computations in one sub-execution process, and the second logic engine does not belong to the first physical engine. Because one physical engine corresponds to one buffer queue, when any logic engine on a physical engine is performing a computation, that buffer queue is already empty, so the receiving end node of that logic engine (i.e., the logic engine that receives its output data) can receive data. Therefore, when a logic engine and its receiving end node belong to the same physical engine, no state needs to be set for that logic engine, which saves bit width in the state register and resource overhead, and helps reduce the complexity of maintaining the logic engines' states. Optionally, for the steps performed in a sub-execution process by a logic engine whose receiving end node belongs to the same physical engine, reference may be made to the description of the sixth logic engine below.
In one possible design, the first logic engine is further configured to: when the number of computations of the first logic engine reaches a preset number, set the state of the first logic engine to a busy state, where the busy state of the first logic engine indicates that the second logic engine cannot receive data. The preset number is the total number of computations of the first logic engine in one sub-execution process. That is, as long as the number of computations in the current sub-execution process has not reached the preset number, the state of the first logic engine is not set to the busy state. This design takes into account that, if the state were set to busy at that point and no logic engine cleared the state of the first logic engine, the first logic engine could not start its next computation and the neural network circuit would deadlock.
In one possible design, the computation performed by the first logic engine on the first data frame belongs to one sub-execution process, and the first logic engine is the logic engine on the first physical engine that performs the last computation in that sub-execution process. In this case, the first logic engine is further configured to: when its number of computations reaches the preset number, send first indication information to a third logic engine. The third logic engine is the logic engine that sends data to a first target logic engine, where the first target logic engine is the logic engine on the first physical engine that performs the first computation in a sub-execution process. The first indication information instructs the third logic engine to set its own state to the idle state, and the idle state of the third logic engine indicates that the first target logic engine can receive data. This design ensures that, while a sub-execution process on a physical engine has not finished, the other logic engines on that physical engine do not start computations of the next sub-execution process, and that, once a sub-execution process on the physical engine has finished, the physical engine can start the next sub-execution process (i.e., the neural network circuit performs pipelined computation at the granularity of physical engines).
In one possible design, the computation performed by the first logic engine on the first data frame belongs to one sub-execution process, and the first logic engine is not the logic engine on the first physical engine that performs the last computation in that sub-execution process. In this case, the first logic engine is further configured to: when its number of computations reaches the preset number, send first indication information to a third logic engine, where the first indication information instructs the third logic engine to set its own state to the idle state. The third logic engine is the logic engine that sends data to a first target logic engine; the first target logic engine is located on the first physical engine and is the logic engine that performs the next computation after the first logic engine in the sub-execution process. In this way, the data required by the next computing logic engine on the first physical engine can arrive while the first logic engine is still computing; that is, the computation delay of the first logic engine masks the transmission delay for the next computing logic engine on the same physical engine, so that once the first logic engine finishes its computation, the next logic engine can start computing without waiting for data to arrive.
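The completion-time behavior described in the last two designs (withholding the busy state until the preset count is reached, then sending the clear-state indication) can be sketched roughly as follows. The class layout and method names are assumptions used only for illustration.

```python
# Sketch of the completion-time behavior described above (structure assumed).
IDLE, BUSY = 0, 1

class ThirdLogicEngine:
    def __init__(self):
        self.state = BUSY
    def set_state(self, new_state):
        self.state = new_state                  # receiving the "clear state" indication

class FirstLogicEngine:
    def __init__(self, preset_count, third_logic_engine):
        self.compute_count = 0
        self.preset_count = preset_count        # total computations in one sub-execution process
        self.third_logic_engine = third_logic_engine
        self.state = IDLE

    def on_computation_done(self):
        self.compute_count += 1
        if self.compute_count < self.preset_count:
            return                              # preset number not reached: do not set busy state
        self.state = BUSY                       # second logic engine cannot receive new data yet
        # Instruct the third logic engine to set itself idle so that the first target
        # logic engine on this physical engine can start computing.
        self.third_logic_engine.set_state(IDLE)
        self.compute_count = 0
```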
In one possible design, the neural network circuit further includes a fourth logic engine. The fourth logic engine and the first logic engine are logic engines in the same sub-execution process, and the fourth logic engine is the logic engine that performs the last computation in that sub-execution process. The fourth logic engine is configured to obtain a second data frame and to compute the second data frame when the state of the second physical engine to which it belongs is the first state, where the first state of the second physical engine indicates that no logic engine on the second physical engine is performing a computation. For the logic engine that performs the last computation in a sub-execution process, its receiving end node can be considered able to receive data at all times, so no state needs to be set for it, which saves bit width in the state register and resource overhead, and helps reduce the complexity of maintaining the logic engines' states.
In one possible design, the fourth logic engine is further configured to send second indication information to a fifth logic engine. The fifth logic engine is the logic engine that sends data to a second target logic engine, where the second target logic engine is the logic engine on the second physical engine that performs the first computation in the sub-execution process, and the second indication information indicates that the second target logic engine can receive data. In this way, the logic engine that performs the first computation on the second physical engine can begin the next sub-execution process, which helps the neural network circuit pipeline computations at the granularity of physical engines.
In one possible design, the neural network circuit further includes a sixth logic engine and a seventh logic engine, where the seventh logic engine is configured to receive the output data of the sixth logic engine. The sixth logic engine, the seventh logic engine, and the first logic engine are logic engines in the same sub-execution process, and the sixth logic engine and the seventh logic engine belong to a third physical engine. The sixth logic engine is configured to obtain a third data frame and to compute the third data frame when the state of the third physical engine is the first state, where the first state of the third physical engine indicates that no logic engine on the third physical engine is performing a computation. Because one physical engine corresponds to one buffer queue, when any logic engine on a physical engine is performing a computation, that buffer queue is already empty, so the receiving end node of that logic engine can receive data. Therefore, no state needs to be set for such a logic engine, which saves bit width in the state register and resource overhead, and helps reduce the complexity of maintaining the logic engines' states.
In one possible design, the first logic engine is further configured to: after it starts computing, set the state of the first physical engine to a second state, where the second state indicates that the physical resources of the first physical engine are occupied; and, after it sends the computation result to the second logic engine, set the state of the first physical engine back to the first state. This facilitates time-division multiplexing of the physical engine's resources among the logic engines on the same physical engine.
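The resulting occupy/release cycle of the physical engine can be pictured as in the sketch below; the names and the object model are assumptions, not the circuit's actual control logic.

```python
# Schematic sketch of the physical-engine state transitions described above (names assumed).
AVAILABLE, OCCUPIED = 0, 1   # "first state" / "second state" of a physical engine

def run_one_computation(physical_engine, first_logic_engine, data_frame, second_logic_engine):
    physical_engine.state = OCCUPIED            # physical resources occupied while computing
    result = first_logic_engine.compute(data_frame)
    second_logic_engine.receive(result)         # result forwarded without intermediate caching
    physical_engine.state = AVAILABLE           # another logic engine may now use the engine
```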
In a second aspect, a method of controlling data flow in a neural network circuit is provided. The neural network circuit includes a first physical engine virtualized as one or more logic engines, and the method is applied to a first logic engine of the one or more logic engines. The method includes: obtaining a first data frame; and computing the first data frame when the state of the first physical engine is a first state and the state of the first logic engine is an idle state, where the first state of the first physical engine indicates that no logic engine on the first physical engine is performing a computation, the idle state of the first logic engine indicates that a second logic engine can receive data, and the second logic engine is configured to compute the output data of the first logic engine.
In one possible design, the method further includes: when the number of computations of the first logic engine reaches a preset number, setting the state of the first logic engine to a busy state, where the busy state of the first logic engine indicates that the second logic engine cannot receive data.
In one possible design, the computation performed by the first logic engine on the first data frame belongs to one sub-execution process, and the first logic engine is the logic engine on the first physical engine that performs the last computation in that sub-execution process. The method further includes: when the number of computations of the first logic engine reaches the preset number, sending first indication information to a third logic engine, where the third logic engine is the logic engine that sends data to a first target logic engine, the first target logic engine is the logic engine on the first physical engine that performs the first computation in a sub-execution process, the first indication information instructs the third logic engine to set its own state to the idle state, and the idle state of the third logic engine indicates that the first target logic engine can receive data.
In one possible design, the computation performed by the first logic engine on the first data frame belongs to one sub-execution process, and the first logic engine is not the logic engine on the first physical engine that performs the last computation in that sub-execution process. The method further includes: when the number of computations of the first logic engine reaches the preset number, sending first indication information to a third logic engine, where the first indication information instructs the third logic engine to set its own state to the idle state, the third logic engine is the logic engine that sends data to a first target logic engine, the first target logic engine is located on the first physical engine, and the first target logic engine is the logic engine that performs the next computation after the first logic engine in the sub-execution process.
In one possible design, the method further includes: after the first logic engine starts computing, setting the state of the first physical engine to a second state, where the second state indicates that the physical resources of the first physical engine are occupied; and, after the first logic engine sends the computation result to the second logic engine, setting the state of the first physical engine back to the first state.
For explanations of the second aspect and its possible designs and their beneficial effects, reference may be made to the description of the first aspect or the corresponding designs, and details are not repeated here.
In a third aspect, a neural network system is provided, including a processor and the neural network circuit provided in the first aspect or any one of its possible designs, where the processor is configured to transmit one or more data frames to the neural network circuit, and the one or more data frames include the first data frame.
In a fourth aspect, there is provided a computer readable storage medium for storing a computer program which, when run on a computer, causes the computer to perform the method provided by the second aspect or any one of the possible designs of the second aspect.
In a fifth aspect, there is provided a computer program product which, when run on a computer, causes the method provided by the second aspect described above or any of its possible designs to be performed.
It can be understood that any of the methods, the neural network system, the computer-readable storage medium, or the computer program product provided above works with the corresponding neural network circuit provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding neural network circuit, and details are not repeated here.
Drawings
Fig. 1 is a schematic structural diagram of a neural network system according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a logical structure of a recurrent neural network system according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of the structure of a smart headset according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a neural network layer in a neural network circuit according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a deployment method of a neural network circuit according to an embodiment of the present disclosure;
fig. 6A is a schematic diagram illustrating a deployment result of a neural network circuit according to an embodiment of the present disclosure;
fig. 6B is a schematic diagram illustrating a deployment result of another neural network circuit according to an embodiment of the present disclosure;
fig. 7 is a flowchart illustrating a method for controlling data flow in a neural network circuit according to an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart illustrating another method for controlling data flow in a neural network circuit according to an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart illustrating another method for controlling data flow in a neural network circuit according to an embodiment of the present disclosure;
FIG. 10 is a schematic flow chart illustrating another method for controlling data flow in a neural network circuit according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a neural network circuit according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of another neural network circuit according to an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of another neural network circuit according to an embodiment of the present disclosure;
fig. 14 is a schematic structural diagram of another neural network circuit provided in the embodiment of the present application.
Detailed Description
The technical solutions provided in the embodiments of the present application can be applied to an artificial neural network (ANN). An artificial neural network, also called a neural network (NN) or neural-like network, is a mathematical or computational model in machine learning and cognitive science that mimics the structure and function of a biological neural network (e.g., the central nervous system of an animal, particularly the brain) and is used to estimate or approximate functions. Artificial neural networks include the convolutional neural network (CNN), the deep neural network (DNN), the time delay neural network (TDNN), the multi-layer perceptron (MLP), and the recurrent neural network (RNN).
Fig. 1 is a schematic structural diagram of a neural network system according to an embodiment of the present application. As shown in fig. 1, the neural network system 100 may include a processor 105 and a neural network circuit 110. The neural network circuit 110 is connected to the processor 105 through an interface, which may be a serial peripheral interface (SPI), a peripheral component interconnect express (PCIE) interface, or the like. As shown in fig. 1, the neural network circuit 110 may be connected to the processor 105 through a PCIE bus 106, so that data can be input into the neural network circuit 110 through the PCIE bus 106 and the data processed by the neural network circuit 110 can be received through the PCIE bus 106. The processor 105 may also monitor the operating state of the neural network circuit 110 through the interface.
The processor 105 may include multiple processor cores (cores) and may be a very large-scale integrated circuit. A processor core may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a digital signal processor (DSP).
The neural network circuit 110 includes an input/output interface (TxRx) 1252 and a switching device 1254. The input/output interface 1252 is configured to receive data transmitted to the neural network circuit 110 through the PCIE bus 106 and to send data processed by the neural network circuit 110 to the PCIE bus 106. The neural network circuit 110 may include one or more processing units (PEs) 1256, which are configured to perform neural network computations on the data input into the neural network circuit 110. The computation result of one processing unit 1256 may be transmitted to another processing unit 1256 through the switching device 1254. In practice, a processing unit 1256 may include modules that implement convolution, pooling, or other neural network operations; the specific circuitry or function of the processing unit 1256 is not limited here.
A processing unit 1256 may include one or more physical engines 1302, and a physical engine 1302 may include one or more logic engines 1304. A physical engine 1302 is a processing unit within a processing unit 1256 that has independent hardware resources; for example, a physical engine 1302 may be a matrix multiply-add unit (e.g., a crossbar), an activation function unit, or a dot-product unit. Different physical engines 1302 can compute in parallel. A logic engine 1304 is a processing unit that can execute logically independently, and the physical resources of a physical engine 1302 can be time-division multiplexed among the logic engines 1304 on that physical engine 1302.
In recent years, devices such as resistive random access memory (ReRAM) and NOR flash, which integrate storage and computation, have also been widely applied in neural network systems. For example, a resistive random access memory crossbar (ReRAM crossbar) composed of multiple memristor cells (ReRAM cells) may be used to perform the matrix multiply-add operations in a neural network system. The physical engine 1302 may be implemented by a ReRAM crossbar.
Optionally, the technical solutions provided in the embodiments of the present application may be applied to recurrent neural networks, for example, long short-term memory (LSTM) networks, gated recurrent unit (GRU) networks, convolutional recurrent neural networks (CRNN), and variants of these networks. The following description takes the application of the technical solutions to a recurrent neural network as an example. It can be understood that a non-recurrent neural network can be regarded as a special recurrent neural network that executes one loop, so the embodiments of the present application can also be applied to non-recurrent neural networks; this is stated here once and not repeated below.
Fig. 2 is a schematic diagram of the logical structure of a recurrent neural network system according to an embodiment of the present application. As shown in diagram a in fig. 2, the recurrent neural network system includes an input layer 201, a hidden layer 202, and an output layer 203. The hidden layer 202 is configured to perform loop computations on the data received from the input layer 201 and output the computation result to the output layer 203. As an example, with reference to fig. 1, the hidden layer 202 in fig. 2 may be implemented by the processor 105 and the neural network circuit 110 in fig. 1.
A recurrent neural network introduces the concept of "memory". Each loop process of the hidden layer 202 performs the same computation, and its output depends both on the input (i.e., the data frame x_t input to the current loop process) and on the "memory" (i.e., the computation result h_{t-1} of the previous loop process). Therefore, the recurrent neural network has a strict sequential dependency: successive loop processes cannot be executed in parallel, and within one loop process the computations also have a strict execution order.
Taking a GRU network as an example, the loop processes obtained after the hidden layer is unrolled in time order are shown in diagram b in fig. 2. Specifically, in the (t-1)-th loop process, the data frame x_{t-1} and the computation result h_{t-2} of the (t-2)-th loop process are computed (for example, multiplied by a weight matrix W) to obtain the result h_{t-1}. In the t-th loop process, the data frame x_t and the result h_{t-1} of the (t-1)-th loop process are computed (for example, multiplied by W) to obtain h_t. In the (t+1)-th loop process, the data frame x_{t+1} and the result h_t of the t-th loop process are computed (for example, multiplied by W) to obtain h_{t+1}. Here t is an integer greater than or equal to 2; it can be understood that when t is 2, the result of the 0th loop process may be a predefined value.
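In formula form, the recurrence described above can be written as follows. This is a simplified, single-matrix illustration (with f standing for a generic nonlinearity), not the full set of GRU gating equations.

```latex
h_{t} = f\left( W \cdot \left[ x_{t},\, h_{t-1} \right] \right), \qquad t \ge 1,
```

where h_0 is the predefined initial value mentioned above.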
The technical solutions provided in the embodiments of the present application can be applied to natural language processing scenarios. For example, the neural network system provided in the embodiments of the present application can be used for speech recognition, voice wakeup, speech translation, and the like. In speech recognition, it can be applied to scenarios such as speech-to-text conversion and voice control of automobiles. In voice wakeup, it can be applied to waking up robots, mobile phones, wearable devices, smart home devices, vehicles, and the like by voice. In speech translation, it can be applied to neural machine translation and the like.
Example 1: in practical applications, some tasks' output after the task is related to the previous content, such as filling in the blank. To fill in the void in "once there was a sense of truth, i did not go to ___ before i, and needed to know not only all the words in front but also the order between the words.
If the traditional language model is applied, the options are: 'Beijing', 'school', 'treasure', 'work', etc. Traditional language models are based on statistical models and have very limited access to information in front of the cross-line. The most common conventional language model currently makes possible the use of the preceding information: 'go …', 'do not go …', 'i do not go …', and take up a lot of storage space for a long time.
Obviously, even if the available information is 'i did not go …', the probability of selecting 'treasure' is not greater than 'beijing'; and the recurrent neural network will choose 'treasure' and the probability will be much larger than 'Beijing', 'go to work', etc. This is because the recurrent neural network uses not only the information that i did not go to … ', but also the information that i ' put in front of i ' and ' had a sense of truth '. When the information is used, the word such as ' work ' school ' and the like cannot be displayed on the horizontal line.
Example 2: the technical scheme provided by the embodiment of the application can be applied to voice awakening tasks aiming at end-side equipment (such as a mobile phone, an intelligent earphone, an intelligent sound box, an intelligent television, a tablet and the like). Taking voice wake-up of the smart headset as an example, fig. 3 is a schematic diagram of a hardware structure of the smart headset 70. This intelligent earphone includes: a Microphone (MIC) 701, a Digital Signal Processor (DSP) 702, and a Computer In Memory (CIM) chip 703. The microphone 701, the DSP702 and the CIM chip 703 may be connected through the SPI.
By way of example, in conjunction with fig. 1, the neural network system 100 in fig. 1 may be implemented by the DSP702 and the CIM chip 703 in fig. 3, wherein the processor 105 in fig. 1 may be implemented by the DSP702 in fig. 3, and the neural network circuit 110 in fig. 1 may be located in the CIM chip 703.
The microphone 701 may be used to collect voice signals and send the voice signals to the DSP 702. The DSP702 may be configured to process the received voice signal to obtain a plurality of voice frames, and input each voice frame of the plurality of voice frames to the CIM chip 703 as a data frame of a cycle process of the CIM chip 703. The CIM chip 703 is a storage-computation integrated chip, and is configured to perform multiple loop computations on a received speech frame to obtain a speech recognition result.
In one implementation, the CIM chip 703 may determine whether the speech recognition result matches a predefined wakeup word (if the result contains the predefined wakeup word, they are considered to match) and send the matching result to the DSP 702; the DSP 702 then sends the matching result to the device to be woken up (e.g., a mobile phone or speaker connected to the smart headset), and the device to be woken up performs the wakeup action. In another implementation, the CIM chip 703 may send the speech recognition result to the DSP 702, and the DSP 702 determines whether it matches the predefined wakeup word. The wakeup action is the action indicated by the speech recognition result; for example, if the speech recognition result is "open WeChat", the wakeup action may be the mobile phone launching the WeChat application.
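A rough sketch of this voice-wakeup data flow is given below. All function names and the placeholder processing are assumptions chosen for illustration; the real DSP and CIM chip obviously do more than concatenate frames.

```python
# Rough sketch of the voice-wakeup data flow described above (all names and stubs assumed).
from typing import List

def dsp_split_into_frames(voice_signal: bytes, frame_size: int = 40) -> List[bytes]:
    """Stand-in for the DSP 702: splits the voice signal into speech frames."""
    return [voice_signal[i:i + frame_size] for i in range(0, len(voice_signal), frame_size)]

def cim_loop_process(frame: bytes, memory: str) -> str:
    """Stand-in for one loop computation on the CIM chip 703 (placeholder, not real recognition)."""
    return memory + frame.decode(errors="ignore")

def wake_on_voice(voice_signal: bytes, wakeup_word: str) -> bool:
    result = ""
    for frame in dsp_split_into_frames(voice_signal):   # each speech frame drives one loop process
        result = cim_loop_process(frame, result)
    return wakeup_word in result                         # match against the predefined wakeup word
```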
The above description takes the application of a recurrent neural network to natural language processing as an example, but the application is not limited thereto. For example, a recurrent neural network can also be applied to machine translation, image caption generation, video tagging, and the like.
Hereinafter, techniques and terms related to the embodiments of the present application will be briefly described.
1) Data frame, speech frame
A data frame is the data required to start one computation. For example, if a logic engine needs 40-dimensional input data to start one convolution computation, then a data frame for that convolution computation is a piece of 40-dimensional data.
The speech frame is one of data frames, and specifically is a data frame obtained by processing a speech signal.
2) Neural network layer
The neural network system may include a plurality of neural network layers. The neural network layer is a logical layer concept, and one neural network layer means that a neural network operation is to be performed. Each layer of neural network calculation is realized by a calculation node. The neural network layer may include convolutional layers, pooling layers, and the like. As shown in fig. 4, 6 neural network layers (also referred to as 6-layer neural networks, or 6 layers) may be included in the neural network system 300: a first layer 302, a second layer 304, a third layer 306, a fourth layer 308, a fifth layer 310, and a sixth layer 312. Among them, the first layer 302 may perform a convolution operation, the second layer 304 may perform a pooling operation on the output data of the first layer 302, the third layer 306 may perform a convolution operation on the output data of the second layer 304, the fourth layer 308 may perform a convolution operation on the output result of the third layer 306, the fifth layer 310 may perform a summing operation on the output data of the second layer 304 and the output data of the fourth layer 308, and so on. Fig. 4 is only a simple example and illustration of the neural network layers in the neural network system, and does not limit the specific operation of each layer of the neural network, for example, the fourth layer 308 may also be a pooling operation, and the fifth layer 310 may also be another neural network operation such as a convolution operation or a pooling operation.
In a recurrent neural network system, the neural network layers can be divided into loop layers and non-loop layers. For example, in the neural network system shown in fig. 4, the neural network system obtains one output result by performing the computation of the loop layers m times and performing the computation of the non-loop layer once, where m is an integer greater than or equal to 2.
For example, in speech recognition, after obtaining the speech to be recognized, the neural network system performs the computations of "the first layer 302 to the fifth layer 310" m times to obtain the output result of the loop layers. In each of the first m-1 loop processes, the computation result of the fifth layer 310 together with a speech frame serves as the input data of the first layer 302 in the next loop process; after the m-th loop process is completed, the computation result of the fifth layer 310 serves as the output result of the loop layers. The output result of the loop layers is output to the sixth layer 312, and after the sixth layer 312 performs its computation based on that output result, the output result of the neural network circuit, i.e., the speech recognition result, is obtained.
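The overall control flow (m loop computations of layers 302-310, then one computation of layer 312) is summarized by the sketch below; the layer functions are placeholders, not the operators actually used by the patent.

```python
# Sketch of the execution flow described above (layer functions are placeholders).
from typing import List

def loop_layers_302_to_310(speech_frame: List[float], h_prev: List[float]) -> List[float]:
    """Stand-in for one loop computation of the loop layers (placeholder arithmetic)."""
    return [x + h for x, h in zip(speech_frame, h_prev)]

def non_loop_layer_312(h: List[float]) -> float:
    """Stand-in for the sixth layer 312, the non-loop layer (placeholder arithmetic)."""
    return sum(h)

def run_one_execution(speech_frames: List[List[float]], h_init: List[float]) -> float:
    h = h_init
    for frame in speech_frames:          # m loop processes; layer-310 output feeds the next loop
        h = loop_layers_302_to_310(frame, h)
    return non_loop_layer_312(h)         # output of the neural network circuit (recognition result)
```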
3) Sub-layer of neural network layer
A neural network layer may include one or more neural network sub-layers. As an example, one neural network sub-layer may be one operator, or several operators (e.g., several operators with dependencies), of the neural network layer. Taking a GRU layer as an example, one computation of the GRU layer is implemented based on several formulas, where each formula can be regarded as one operator and each operator is one neural network sub-layer.
It can be understood that if a neural network layer includes only one neural network sub-layer, the neural network layer and the neural network sub-layer are the same concept and can be used interchangeably; details are not repeated here.
4) Dependency relationship
Two neural network layers having a dependency relationship means that input data of one neural network layer includes output data of the other neural network layer. For example, as shown in fig. 4, the output data of the first layer 302 is the input data of the second layer 304, and thus, the first layer 302 and the second layer 304 have a dependency relationship. The output data of the second layer 304 is the input data of the third layer 306, and the input data of the fifth layer 310 includes the output data of the second layer 304, so that the second layer 304 and the third layer 306 have a dependency relationship, and the second layer 304 and the fifth layer 310 also have a dependency relationship. Two neural network sub-layers having a dependency relationship means that input data of one neural network sub-layer includes output data of the other neural network sub-layer.
5) An execution process, a cycle process, a sub-execution process
For convenience of description, in the embodiments of the present application, the process from the neural network system acquiring one piece of input data (for example, a piece of speech to be recognized) to obtaining the output result of the neural network system based on that input data (for example, the speech recognition result of that piece of speech) is referred to as one "execution process" of the neural network system.
In one execution process of the neural network system, the loop layers need to perform multiple loop processes, for example m loop processes in the example in fig. 4, where "the first layer 302, the second layer 304, the third layer 306, the fourth layer 308, and the fifth layer 310" perform computations in each loop process. The non-loop layer needs to perform the non-loop process once; in the example shown in fig. 4, the sixth layer 312 performs the non-loop process one time.
During a loop, each neural network layer in the loop layer may perform one or more computations. For example, based on the example in fig. 4, during a loop, the first layer 302 performs 5 computations and then sends the results of the computations to the second layer 304. After the second layer 304 performs 1 calculation, the calculation results are sent to the third layer 306 and the fifth layer 310, respectively. After the third layer 306 performs 2 calculations, the calculation result is sent to the fourth layer 308. After the fourth layer 308 performs 2 calculations, the calculation results are sent to the fifth layer 310. After the fifth layer 310 performs 1 calculation, the loop process is ended.
It can be understood that when m is 1, the neural network layers included in the neural network system are all non-loop layers.
For convenience of description, in the embodiment of the present application, one execution process of the neural network system is divided into a plurality of sub-execution processes. Each of the first m-1 circulation processes of the circulation layer is a sub-execution process, and the mth circulation process of the circulation layer is taken as a sub-execution process together with the non-circulation process. For example, based on the example in fig. 4, assuming that m is 20, then each of the first 19 loop processes may be taken as one sub-execution process in which the first to fifth layers participate in the calculation; the 20 th loop process and the non-loop process collectively serve as a sub-execution process in which the first to sixth layers participate in the calculation.
For a non-recurrent neural network, one sub-execution process is equivalent to one execution process.
6) Sending end node of logic engine and receiving end node of logic engine
The dependencies between the logic engines correspond to dependencies between sub-layers deployed on the logic engines. The dependency relationship between the logic engines can be characterized by the execution order between the logic engines. For example, if output data of one sub-layer can be used as input data of another sub-layer, output data of a logic engine where the sub-layer is located can be used as input data of a logic engine where the other sub-layer is located, the two logic engines have a dependency relationship, and an execution sequence of the logic engine where the sub-layer is located is before the logic engine where the other sub-layer is located.
Based on the execution order among the logic engines, it can be determined which logic engine performs the 1st computation, which performs the 2nd computation, and so on, up to which performs the last computation.
If the output data of the sub-layer deployed on the current logic engine can serve as the input data of another sub-layer, the current logic engine is the sending end node of the logic engine on which that other sub-layer is deployed, and the logic engine on which that other sub-layer is deployed is the receiving end node of the current logic engine. For example, if the neural network circuit includes logic engines 1 to 7 and the execution order runs from logic engine 1 to logic engine 7, then the logic engine that performs the 1st computation is logic engine 1 and the logic engine that performs the last computation is logic engine 7. The sending end node of logic engine i+1 is logic engine i, and the receiving end node of logic engine i is logic engine i+1, where 1 <= i <= 6 and i is an integer.
The sending end node of the logic engine that performs the first computation in the neural network circuit (logic engine 1 in this example) is the sending end node of the neural network circuit, i.e., the device/module that sends out the data frames. The receiving end node of the logic engine that performs the last computation in the neural network circuit (logic engine 7 in this example) is the receiving end node of the neural network circuit. Taking the application of the technical solutions provided in the embodiments of the present application to the smart headset shown in fig. 3 as an example, logic engines 1 to 7 may be located in the CIM chip 703, and both the sending end node of logic engine 1 and the receiving end node of logic engine 7 may be the DSP 702.
7) Indication information, clear status command
A sending indication information to B indicates that B's receiving end node can receive data, or in other words that B can send data to its receiving end node.
The clear-state command is a specific implementation of the indication information. For example, A sends B a clear-state command instructing B to set its own state to the idle state, to indicate that B's receiving end node can receive data. If B's state is already idle when the clear-state command is received, B keeps its state unchanged; if B's state is busy, B changes its state from busy to idle upon receiving the clear-state command.
For ease of understanding, the following embodiments take the clear-state command as an example of the indication information.
It can be understood that if the module/device that receives the clear-state command is not a logic engine (e.g., the DSP 702 in fig. 3), the module/device does not maintain a state, since it is not a logic engine, and therefore does not need to act on the clear-state command.
8) Other terms
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
In the embodiments of the present application, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
The following describes a deployment method of the neural network circuit provided in the embodiment of the present application. The execution subject of the deployment method of neural network circuitry provided below may be the processor 105 in fig. 1.
Fig. 5 is a schematic diagram of a deployment method of a neural network circuit according to an embodiment of the present disclosure. The method shown in fig. 5 may comprise the steps of:
s101: the processor obtains metadata information for the neural network circuit. Wherein the metadata information of the neural network system is used for characterizing the attributes of the neural network system. For example, the metadata information of the neural network circuit may include one or more of the following information: which layers the neural network system comprises; whether the layers have dependency relationship; whether each layer is a circulating layer or a non-circulating layer; in one implementation, several loop layers need to be implemented; in a cyclic process, each cyclic layer needs to perform several calculations; each layer includes which sub-layers; whether the sub-layers have dependency relationship with each other in the same layer.
S102: the processor deploys each sublayer in the neural network system on the logic engine according to the dependency relationship between each layer in the neural network circuit and the dependency relationship between each sublayer in the same layer, optionally, in combination with an optimization objective (such as an optimization objective of high utilization rate and/or small computation delay of the physical engine). The logic engines correspond to the sub-layers one by one.
In one implementation, for any loop layer or non-loop layer, the processor may deploy sub-layers belonging to different layers on different physical engines, deploy sub-layers of the same layer that have dependencies on the same physical engine, and deploy sub-layers of the same layer that have no dependencies on different physical engines. Such a deployment can improve the overall computational efficiency of the neural network circuit, since different physical engines can execute in parallel.
Referring to fig. 6A, a schematic diagram of a deployment result of a neural network circuit according to an embodiment of the present application is shown. The neural network circuit includes physical engines 1 to 4. Logic engines 1, 3, and 4 are deployed on physical engine 1; logic engines 2 and 5 are deployed on physical engine 2; logic engine 6 is deployed on physical engine 3; and logic engine 7 is deployed on physical engine 4. The layer in which the sub-layers deployed on logic engines 1 to 6 are located is a loop layer, and the layer in which the sub-layer deployed on logic engine 7 is located is a non-loop layer.
In another implementation, the non-loop layer and the loop layer may be deployed on the same physical engine, or may be deployed on different physical engines; and deploying the sub-layers with the dependency relationship in the same layer on the same physical engine, and deploying the sub-layers without the dependency relationship in the same layer on different physical engines. Such deployment may save physical engine resources, i.e., save hardware resources.
Referring to fig. 6B, a schematic diagram of a deployment result of another neural network circuit according to an embodiment of the present application is shown. The neural network circuit includes physical engines 1 to 3. Logic engines 1, 3, and 4 are deployed on physical engine 1; logic engines 2 and 5 are deployed on physical engine 2; and logic engines 6 and 7 are deployed on physical engine 3. The layer in which the sub-layers deployed on logic engines 1 to 6 are located is a loop layer, and the layer in which the sub-layer deployed on logic engine 7 is located is a non-loop layer.
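As a concrete illustration, the deployment of fig. 6B could be recorded as a simple mapping such as the one below; the data structure itself is an assumption and is not part of the patent.

```python
# Illustrative record of the deployment shown in fig. 6B (data structure assumed).
deployment_fig_6b = {
    "physical_engine_1": ["logic_engine_1", "logic_engine_3", "logic_engine_4"],
    "physical_engine_2": ["logic_engine_2", "logic_engine_5"],
    "physical_engine_3": ["logic_engine_6", "logic_engine_7"],
}
# Sub-layers on logic engines 1-6 belong to the loop layer; the sub-layer on
# logic engine 7 belongs to the non-loop layer.
```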
The above is merely an example, and it does not limit the deployment of each sub-layer in the neural network circuit to which the embodiments of the present application are applicable. The technical scheme provided by the embodiment of the application can be suitable for the deployment situation of any neural network system.
S103: the processor establishes a correspondence between the physical engines (e.g., each physical engine) and the status registers. For example, a correspondence between each physical engine and the first status register and the second status register is established.
The first status register corresponding to the physics engine is used to record whether the current status of the physics engine is an available status (i.e., a first status in this application) or an unavailable status (i.e., a second status in this application). The available state indicates that there are no logical engines on the physical engine that are performing computations. The unavailable state indicates that there is a logical engine on the physical engine that is performing a computation. For example, the available state is represented by a binary number "0", and the unavailable state is represented by a binary number "1".
The second status register corresponding to the physical engine is used for recording whether the current status of the logic engine deployed on the physical engine is busy status (busy) or idle status (idle). The busy state indicates that the receiving node of the logic engine is unable to receive data. The idle state indicates that the receiving node of the logic engine is able to receive data. For example, an idle state is represented by a binary number "0", and a busy state is represented by a binary number "1".
The status register is understood in software as a bitmap. As an example, the number of bits used to implement the bitmap of the first status register may be 1. As an example, if N logic engines are deployed on one physical engine, the number of bits used to implement the bitmap of the second status register allocated to the physical engine may be N, where each bit corresponds to one logic engine for representing the status of the logic engine.
In specific implementations, the first status register and the second status register corresponding to one physical engine may be managed separately or managed in a unified manner. Unified management means that the first status register and the second status register are the same register; for example, one bit of a bitmap represents the state of the physical engine, and each of the other bits of the bitmap represents the state of one logic engine on the physical engine.
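A possible bitmap layout for the unified-management case is sketched below; the bit ordering and helper functions are assumptions used only to make the idea concrete.

```python
# Sketch of a unified status register (bitmap) for one physical engine with N logic
# engines (bit layout assumed).
from typing import List

def make_status_word(physical_occupied: bool, logic_busy: List[bool]) -> int:
    """Bit 0: physical engine (0 = available/first state, 1 = occupied/second state).
    Bits 1..N: one bit per logic engine (0 = idle, 1 = busy)."""
    word = int(physical_occupied)
    for i, busy in enumerate(logic_busy, start=1):
        word |= int(busy) << i
    return word

def logic_engine_is_idle(word: int, index: int) -> bool:
    """index is 1-based: the position of the logic engine on the physical engine."""
    return (word >> index) & 1 == 0
```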
Alternatively, if a logic engine is deployed on the same physical engine as its receiving end node, the processor may not set a state for that logic engine. This is because the state of a logic engine indicates whether its receiving end node can receive data; one physical engine corresponds to one buffer queue, and when any logic engine on that physical engine starts a calculation, the buffer queue has already been emptied, so the receiving end node of the logic engine can receive data. Therefore, no state needs to be set for the logic engine. This saves bit width in the status register, reduces resource overhead, and helps reduce the complexity of maintaining the logic engine's state.
Alternatively, for the logic engine that performs the last computation in each sub-execution process, it may be considered that its receiving end node can always receive data; therefore, the processor may not set a state for that logic engine. This also saves bit width in the status register, reduces resource overhead, and helps reduce the complexity of maintaining the logic engine's state.
It should be noted that, the flow control mechanism between the logic engine (e.g., logic engine 7 in fig. 6A or fig. 6B) that performs the final computation in the neural network circuit and the receiving end node thereof may adopt other flow control mechanisms in the prior art, which is not limited in this embodiment of the present application.
S104: the processor determines the execution sequence (or dependency relationship) among the logic engines in the neural network circuit and configures execution logic for each logic engine based on the metadata information of the neural network circuit and the deployment situation of each sub-layer in the neural network circuit (for example, on which logic engine each sub-layer is deployed), and the setting situation of the status bit in the status register (for example, which physical engine corresponds to which status register, which logic engine corresponds to which status bit, etc.).
The execution logic of a logic engine may include: whether the logic engine checks its own state; if it does, which status bit to check and when to set its own state to the busy state; when and to whom to send clear-state commands; how many calculations the logic engine performs during one loop; and so on. The execution logic may be stored in a configuration register; during computation by the neural network circuit, each logic engine may read its own execution logic from the configuration register and perform calculations based on it. For the execution logic of the logic engine, reference may be made to fig. 7 to fig. 10 described below.
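As a non-limiting sketch, the execution logic configured for one logic engine could be represented by a record such as the following; the field names are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionLogicConfig:
    """Hypothetical contents of the configuration register of one logic engine."""
    check_own_state: bool              # whether the engine checks its own idle/busy bit
    own_state_bit: Optional[int]       # which status bit to check and set, if any
    calcs_per_loop: int                # number of calculations to perform in one loop
    clear_state_target: Optional[int]  # engine that should receive the clear-state command
```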
(optional) S105: the processor performs weight rearrangement on the loop layers which execute multiple calculations in one loop process according to the metadata information of the neural network circuit, so that each loop layer executes one calculation in one loop process.
Optionally, the technical solution including S105 may be applied to a scenario in which the neural network circuit includes a cyclic layer and a non-cyclic layer, and in one loop, some or all of the cyclic layers need to perform multiple computations. In this way, the calculation speed of the neural network system is improved and calculation time is saved.
Taking the example where the neural network system is a Convolutional Recurrent Neural Network (CRNN), the CRNN includes: a conv layer, a gru1 layer, a gru2 layer, an fc1 layer, and an fc2 layer. The conv, gru1, and gru2 layers are cyclic layers, and the fc1 and fc2 layers are non-cyclic layers. If, during one loop, the conv layer performs 7 computations and the gru1 and gru2 layers each perform 1 computation, the processor may duplicate the weights of the conv layer 7 times and concatenate the 7 copies into a large matrix, which is then deployed to a physical engine (e.g., a crossbar). Here, "large matrix" is relative to the weight matrix of the conv layer itself.
Those skilled in the art will appreciate that performing multiple calculations on a datum based on one weight matrix is equivalent to performing a calculation on the datum based on another weight matrix. The embodiment of the present application does not limit the specific obtaining manner of the another weight matrix (such as the above-mentioned large matrix), and for example, reference may be made to the prior art.
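As a rough numerical sketch of the weight rearrangement in S105, assuming (purely for illustration) that the conv layer's weights can be treated as a 2-D matrix and that the 7 copies are composed block-diagonally, the 7 per-loop calculations collapse into a single calculation on the large matrix:

```python
import numpy as np

def build_large_matrix(conv_weight, copies=7):
    """Hypothetical weight rearrangement: place `copies` duplicates of the conv
    layer's weight matrix on the diagonal of one large matrix, so that `copies`
    per-loop calculations become one matrix multiply. The block-diagonal layout
    is an assumption made only for this sketch."""
    rows, cols = conv_weight.shape
    large = np.zeros((rows * copies, cols * copies), dtype=conv_weight.dtype)
    for i in range(copies):
        large[i * rows:(i + 1) * rows, i * cols:(i + 1) * cols] = conv_weight
    return large

# 7 separate computations w @ x_i are equivalent to one computation on the large matrix
w = np.arange(6, dtype=float).reshape(2, 3)
xs = [np.random.rand(3) for _ in range(7)]
assert np.allclose(build_large_matrix(w) @ np.concatenate(xs),
                   np.concatenate([w @ x for x in xs]))
```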
The execution order of S103, S104, and S105 is not limited in the embodiment of the present application.
Hereinafter, a method for controlling data flow in a neural network circuit provided in an embodiment of the present application is described. For a logic engine in a sub-execution process, the following cases can be distinguished:
1) for the non-last logic engine in the sub-execution process:
if the logic engine is deployed on the same physical engine as its receiving end node, then in the deployment phase, the processor may not set a state for the logic engine, and in the execution phase, the logic engine may execute the method shown in fig. 7.
If the logic engine and its receiving end node are not deployed on the same physical engine, and the logic engine is not the last logic engine executed on the physical engine where the logic engine is located, the processor may set a state for the logic engine in the deployment phase, and the logic engine may execute the method shown in fig. 8 in the execution phase.
If the logic engine and its receiving end node are not deployed on the same physical engine, and the logic engine is the last logic engine executed on the physical engine where the logic engine is located, in the deployment phase, the processor may set a state for the logic engine, and in the execution phase, the logic engine may execute the method shown in fig. 9.
2) For the last logic engine in the sub-execution process, the processor may not set a state for the logic engine during the deployment phase. In the execution phase, the logic engine may perform the method as shown in FIG. 10.
The method of controlling data flow in the neural network circuit shown in fig. 7 may include the steps of:
S201: When it receives a data frame, the current logic engine checks the state of the physical engine to which it belongs (that is, the current physical engine). If the state of the current physical engine is the available state, S202 is executed. If the state of the current physical engine is the unavailable state, the current logic engine waits for that state to change to the available state; for example, it may return to S201 when a first preset time period has elapsed since the determination result of S201 was "no". The specific value and the manner of selecting the first preset time period are not limited in the embodiments of the present application.
S202: the current logical engine starts the computation and sets the state of the current physical engine to an unavailable state.
Executing S202 helps prevent other logic engines on the current physical engine from starting computation while the current logic engine is computing, thereby enabling different logic engines on the same physical engine to time-share the physical resources of that physical engine.
S203: and after the current logic engine sends the calculation result, setting the state of the current physical engine to be an available state.
The current logic engine that executes the method shown in fig. 7 is deployed on the same physical engine as its receiving end node. In this case, no state may be set for the current logic engine, so it does not need to query its own state. In addition, the sending end node of the receiving end node of the current logic engine is the current logic engine itself, and no state is set for the current logic engine; thus, the current logic engine may be configured not to send a clear-state command. This simplifies the control flow, reduces maintenance cost, and saves bit width in the status register.
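A minimal sketch of the fig. 7 flow, assuming the hypothetical status-register model above; wait_until, start_computation, and send_result are placeholders standing in for hardware behaviour rather than functions defined by this application:

```python
def run_fig7_step(status, engine, data_frame, wait_until):
    """Sketch of S201-S203 for a logic engine whose receiving end node is on the
    same physical engine (so no state bit is set for this engine)."""
    wait_until(status.physical_available)   # S201: wait for the available state
    status.set_physical_unavailable()       # S202: start computing, occupy the engine
    result = engine.start_computation(data_frame)
    engine.send_result(result)              # output goes to the local buffer queue
    status.set_physical_available()         # S203: release the physical engine
    return result
```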
The method of controlling data flow in the neural network circuit shown in fig. 8 may include the steps of:
S301: When it receives a data frame, the current logic engine checks the state of the physical engine to which it belongs (that is, the current physical engine). If the state of the current physical engine is the available state, S302 is executed. If the state of the current physical engine is the unavailable state, the current logic engine waits for it to become the available state.
S302: the current logic engine checks whether its own state is an idle state.
If yes, S303 is executed. If not, the current logic engine may wait for its own state to change to the idle state; for example, it may return to S302 when a second preset time period has elapsed since the determination result of S302 was "no". The specific value and the manner of selecting the second preset time period are not limited in the embodiments of the present application.
S303: the current logical engine starts the computation and sets the state of the current physical engine to an unavailable state.
S304: when the self calculation number of the current logic engine reaches the preset number, the state of the current logic engine is set to be a busy state, and a clear state command is sent to a sending end node of a logic engine which executes calculation next after the current logic engine executes calculation on the current physical engine in the sub-execution process.
It can be understood that, in one sub-execution process, if the current logic engine needs to perform multiple calculations, its state is not set to the busy state until the number of calculations reaches the preset number. This addresses the following problem: if the state of the current logic engine were set to the busy state at that point and no logic engine cleared the state for it, the current logic engine could not start the next calculation, and the neural network system would be locked up. For example, based on the example shown in fig. 6A, assume that the logic engine 4 generates 10 data per calculation and the logic engine 5 needs 50 data to start a calculation; then the logic engine 4 needs to execute 5 times before the logic engine 5 can start a calculation. If the logic engine 4 were set to the busy state after performing one calculation, the logic engine 5 could not start computing, so the data could not "flow down"; moreover, since the logic engine 4 must be in the idle state to start its next calculation, and no logic engine clears the state of the logic engine 4, the logic engine 4 could not start the next calculation. The neural network system would thus be locked up, that is, the data would be "stuck".
Alternatively, if the current logic engine performs only one calculation in one sub-execution process, the processor may configure the current logic engine to directly set its own state as a busy state without determining whether the number of calculations reaches a preset number of times when performing S304. For example, if S105 is executed in the deployment phase, each logic engine in one sub-execution process may not execute "determine whether the number of computations reaches the preset number".
It can be understood that the current logic engine determines that its own state is the idle state before starting the computation, and sets its own state to the busy state after starting the computation. In one sub-execution process, after the current logic engine has started the computation, it no longer checks whether its own state is the busy state, so it can send the computation result to its receiving end node after starting the computation.
It can be understood that, after starting the computation, the current logic engine sends a clear-state command to the sending end node of the logic engine that, on the current physical engine, performs the next calculation after the current logic engine in the current sub-execution process. In this way, the data required by that next logic engine can arrive while the current logic engine is still computing; that is, the computation delay of the current logic engine masks the transmission delay for the next computing logic engine on the same physical engine. After the current logic engine finishes its computation, the next logic engine can therefore start computing without waiting for data to arrive.
S305: and after the current logic engine sends the calculation result, setting the state of the current physical engine to be an available state.
The method of controlling data flow in the neural network circuit shown in fig. 9 may include the steps of:
S401: When it receives a data frame, the current logic engine checks the state of the physical engine to which it belongs (that is, the current physical engine). If the state of the current physical engine is the available state, S402 is executed. If the state of the current physical engine is the unavailable state, the current logic engine waits for it to become the available state.
S402: the current logic engine checks whether its own state is an idle state.
If yes, S403 is executed. If not, the current logic engine may wait for its own state to change to the idle state.
S403: the current logical engine starts the computation and sets the state of the current physical engine to an unavailable state.
S404: when the current logic engine determines that the calculation times of the current logic engine reach the preset times, the state of the current logic engine is set to be a busy state, and a clear state command is sent to a sending end node of the first logic engine for executing calculation in the current sub-execution process on the current physical engine.
Regarding "when the number of calculations of the current logic engine reaches the preset number, a clear-state command is sent to the sending end node of the logic engine that performs the first calculation in the current sub-execution process on the current physical engine": this design takes into account that, while one sub-execution process on a physical engine has not yet finished, the other logic engines on that physical engine should not start computations of the next sub-execution process, and it helps ensure pipelined execution between sub-execution processes.
If the logic engine that performs the first computation on the current physical engine is the first logic engine in the neural network system, its sending end node may be the module/device/apparatus that sends the data frame (e.g., the DSP 702 in fig. 3). In this case, the clear-state command indicates to that module/device/apparatus that the logic engine performing the first computation in the cyclic layer can receive data.
S405: and after the current logic engine sends the calculation result, setting the state of the current physical engine to be an available state.
The method of controlling data flow in the neural network circuit shown in fig. 10 may include the steps of:
S501: When it receives a data frame, the current logic engine checks the state of the physical engine to which it belongs (that is, the current physical engine). If the state of the current physical engine is the available state, S502 is executed. If the state of the current physical engine is the unavailable state, the current logic engine waits for it to become the available state.
S502: the current logical engine starts the computation and sets the state of the current physical engine to an unavailable state.
S503: and when the current logic engine determines that the calculation times of the current logic engine reach the preset times, sending a clear state command to a sending end node of the logic engine which executes the calculation first in the current sub-execution process on the current physical engine.
S504: and after the current logic engine sends the calculation result, setting the state of the current physical engine to be an available state.
In contrast to the method shown in fig. 9 described above, the logic engine executing the method shown in fig. 10 has no state set for it and therefore does not need to check its own state during execution. For the explanation of the other steps, reference may be made to the above description, which is not repeated here.
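Under the same assumptions, the fig. 9 and fig. 10 flows differ from the fig. 8 sketch above only in where the clear-state command is sent and in whether the engine has a state bit of its own; a compact sketch of the differing steps:

```python
def fig9_s404(status, self_idx, first_engine_sender, calc_count, preset_count,
              send_clear_state):
    """Fig. 9, S404: the engine computes last on its physical engine in this
    sub-execution process; the clear-state command goes to the sending end node
    of the engine that computes first on this physical engine."""
    if calc_count == preset_count:
        status.set_logic_busy(self_idx)
        send_clear_state(first_engine_sender)

def fig10_s503(first_engine_sender, calc_count, preset_count, send_clear_state):
    """Fig. 10, S503: the engine is the last one in the whole sub-execution
    process; it has no state bit, so it only sends the clear-state command."""
    if calc_count == preset_count:
        send_clear_state(first_engine_sender)
```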
It should be noted that, for the last logic engine in the loop layer, assuming that m loop processes need to be executed, in the first m-1 sub-execution processes (i.e., the first m-1 loop processes), the logic engine is the last logic engine in the sub-execution processes, and therefore, the method shown in fig. 10 is executed. In the mth sub-execution process (i.e., mth loop process + non-loop process), the logic engine is not the last logic engine in the sub-execution process, and it may execute the method shown in fig. 7 or fig. 8 or fig. 9 according to the deployment situation.
The technical solutions provided in fig. 7 to 10 will be described below by taking fig. 6A and 6B as examples.
Example one
In this embodiment, the technical solutions provided in fig. 7 to 10 are described by taking fig. 6A as an example. In this embodiment, assuming that the loop layer of the neural network system loops 20 times in total, and in the process of each loop, the logic engine 4 performs 5 times of calculation, and the other logic engines each perform 1 time of calculation, then:
For the 1st sub-execution process (i.e., the 1st loop process):
The logic engines involved in the 1st sub-execution process are logic engines 1-6; no state is set for logic engines 3 and 6. At the initial moment of the execution phase, physical engines 1-4 are in the available state, and logic engines 1, 2, 4, and 5 are in the idle state.
Logic engine 1 may perform the method illustrated in fig. 8. In S304, logic engine 1 sets its own state to the busy state and sends a clear-state command to the sending end node (i.e., logic engine 2) of the logic engine (i.e., logic engine 3) that performs the next calculation after logic engine 1 in the current sub-execution process on the current physical engine (i.e., physical engine 1). After the execution is finished, logic engine 1 is in the busy state, and logic engines 2, 4, and 5 are in the idle state.
Logic engine 2 may perform the method illustrated in fig. 8. In S304, logic engine 2 sets its own state to the busy state and sends a clear-state command to the sending end node (i.e., logic engine 4) of the logic engine (i.e., logic engine 5) that performs the next calculation after logic engine 2 in the current sub-execution process on the current physical engine (i.e., physical engine 2). After the execution is finished, logic engines 1 and 2 are in the busy state, and logic engines 4 and 5 are in the idle state.
The logic engine 3 may perform the method illustrated in fig. 7. After the execution is finished, the logic engines 1 and 2 are in a busy state, and the logic engines 4 and 5 are in an idle state.
The logic engine 4 may perform the method illustrated in fig. 9. In S404, after each of its first 4 computations, the logic engine 4 directly performs the next computation; at the 5th computation, the logic engine 4 sets its own state to the busy state and sends a clear-state command to the sending end node (e.g., the DSP 702 in fig. 3) of the logic engine (i.e., logic engine 1) that performs the first computation in the current sub-execution process on the current physical engine (i.e., physical engine 1). After the execution is finished, logic engines 1, 2, and 4 are in the busy state, and logic engine 5 is in the idle state.
The logic engine 5 may perform the method illustrated in fig. 9. In S404, the logic engine 5 sets its own status to a busy status, and sends a clear status command to the sending end node (logic engine 1) of the first logic engine (i.e., logic engine 2) performing the computation in the current sub-execution process on the current physical engine (i.e., physical engine 2). After the execution is finished, logic engines 2, 4, and 5 are in busy state, and logic engine 1 is in idle state.
The logic engine 6 may perform the method illustrated in fig. 10. In S503, the logic engine 6 sends a clear-state command to the sending end node (i.e., logic engine 5) of the logic engine (i.e., logic engine 6) that performs the first computation in the current sub-execution process on the current physical engine (i.e., physical engine 3). After the execution is finished, logic engines 2 and 4 are in the busy state, and logic engines 1 and 5 are in the idle state.
Optionally, when a logic engine is already in the idle state, other logic engines may choose not to send it a clear-state command. For example, in the above example, logic engine 1 may not send a clear-state command to logic engine 2. This helps save signaling overhead. In an actual implementation, a logic engine may instead send the clear-state command without checking whether the recipient is already in the idle state, which helps simplify the design.
For the 2nd sub-execution process (i.e., the 2nd loop process):
The logic engines involved in the 2nd sub-execution process are logic engines 1-6; no state is set for logic engines 3 and 6. At the initial moment of the execution phase, logic engines 2 and 4 are in the busy state, and logic engines 1 and 5 are in the idle state.
Logic engine 1 may perform the method illustrated in fig. 8. After the execution is finished, logic engines 1 and 4 are in busy state, and logic engines 2 and 5 are in idle state.
Logic engine 2 may perform the method illustrated in fig. 8. After the execution is finished, the logic engines 1 and 2 are in a busy state, and the logic engines 4 and 5 are in an idle state.
The execution processes and execution results of the logic engines 3 to 6 may refer to the execution processes and execution results of the logic engines 3 to 6 in the 1st sub-execution process, respectively, and are not described herein again.
For the 3rd to 19th sub-execution processes (i.e., the 3rd to 19th loop processes):
For the description of any one of the 3rd to 19th sub-execution processes, reference may be made to the description of the 2nd sub-execution process.
For the 20th sub-execution process (i.e., the 20th cyclic process + the non-cyclic process):
The logic engines involved in the 20th sub-execution process are logic engines 1-7; no state is set for logic engines 3 and 7. At the initial moment of the execution phase, logic engines 2 and 4 are in the busy state, and logic engines 1, 5, and 6 are in the idle state.
The execution processes and execution results of the logic engines 1 to 5 may refer to the execution processes and execution results of the logic engines 1 to 5 in the 2nd sub-execution process, respectively, and are not described herein again.
The logic engine 6 may perform the method illustrated in fig. 9. In S404, the logic engine 6 sets its own state to the busy state and sends a clear-state command to the sending end node (i.e., logic engine 5) of the logic engine (i.e., logic engine 6) that performs the first computation in the current sub-execution process on the current physical engine (i.e., physical engine 3). After the execution is finished, logic engines 2, 4, and 6 are in the busy state, and logic engines 1 and 5 are in the idle state.
The logic engine 7 may perform the method illustrated in fig. 10. In S503, the logic engine 7 sends a clear-state command to the sending end node (i.e., logic engine 6) of the logic engine (i.e., logic engine 7) that performs the first computation in the current sub-execution process on the current physical engine (i.e., physical engine 4). After the execution is finished, logic engines 2 and 4 are in the busy state, and logic engines 1, 5, and 6 are in the idle state.
It can be understood that, in this embodiment, for the last logic engine in the cyclic layer (i.e., logic engine 6), no state is set for logic engine 6 during the first 19 sub-execution processes; during the last sub-execution process, a state is set for logic engine 6 and is set to the busy state. This implementation is equivalent to the following: a state is set for logic engine 6 in the first 19 sub-execution processes, but it is not set to the busy state during execution; and its state is set to the busy state during the last sub-execution process.
That is, when the loop count of the logic engine that performs the last calculation in the cyclic layer has not reached the preset count, its state is not set to the busy state. This addresses the following problem: if the state of that logic engine were set to the busy state at that point, the logic engine that performs the first calculation in the acyclic layer could not start, no logic engine would clear the state for that logic engine, and it could not start its next calculation, so the neural network system would be locked up. For example, based on fig. 6A, if the logic engine 6 set its own state to the busy state in each of the first 19 executions of the cyclic layer, the logic engine 7 could not start a computation; then no logic engine would clear the state for the logic engine 6, so the logic engine 6 could not start a computation, causing the neural network system to be locked up.
Example two
In this embodiment, the technical solutions provided in fig. 7 to 10 are described by taking fig. 6B as an example. In this embodiment, assuming that the loop layer of the neural network system loops 20 times in total, and in the process of each loop, the logic engine 4 performs 5 times of calculation, and the other logic engines each perform 1 time of calculation, then:
the processes are performed for the 1 st to 19 th sub-processes (i.e., the 1 st to 19 th loop processes):
for the description of any sub-execution process in the 1 st to 19 th sub-execution processes, reference may be made to the sub-execution processes in the 1 st to 19 th sub-execution processes in the execution process described above based on fig. 6A, which is not described herein again.
For the 20th sub-execution process (i.e., the 20th cyclic process + the non-cyclic process):
The logic engines involved in the 20th sub-execution process are logic engines 1-7; no state is set for logic engines 3, 6, and 7. At the initial moment of the execution phase, logic engines 2 and 4 are in the busy state, and logic engines 1 and 5 are in the idle state.
The execution processes and execution results of the logic engines 1 to 5 may refer to the execution processes and execution results of the logic engines 1 to 5 in the 2nd sub-execution process, respectively, and are not described herein again.
The logic engine 6 may perform the method illustrated in fig. 7. After the execution is finished, logic engines 2 and 4 are in busy state, and logic engines 1 and 5 are in idle state.
The logic engine 7 may perform the method illustrated in fig. 10. In S503, the logic engine 7 sends a clear-state command to the sending end node (i.e., logic engine 5) of the logic engine (i.e., logic engine 6) that performs the first computation in the current sub-execution process on the current physical engine (i.e., physical engine 3).
The solution provided by the embodiments of the present application has been described above mainly from the perspective of a method. To implement the above functions, corresponding hardware structures and/or software modules for performing the respective functions are included. Those skilled in the art will readily appreciate that the exemplary method steps described in connection with the embodiments disclosed herein can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments of the present application, the neural network system may be divided into functional modules according to the above method examples; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and is merely a division by logical function; there may be other division manners in actual implementation.
Fig. 11 is a schematic structural diagram of a neural network circuit 11 according to an embodiment of the present disclosure. The neural network circuit 11 is configured to perform the method provided in any of the above embodiments. The neural network circuit 11 may include: a first physics engine 111, the first physics engine 111 virtualized as one or more logical engines. A first logic engine 121 of the one or more logic engines is to: obtaining a first data frame; when the state of the first physical engine 111 is a first state and the state of the first logic engine 121 is an idle state, the first data frame is calculated, where the first state of the first physical engine 111 is used to indicate that no logic engine on the first physical engine 111 is performing calculation, the state of the first logic engine 121 is an idle state to indicate that the second logic engine 122 can receive data, and the second logic engine 122 is used to calculate the output data of the first logic engine 121. In fig. 11, the second logic engine 122 is described as an example that does not belong to the first physical engine 111, but in actual implementation, the present invention is not limited thereto. For example, in conjunction with FIG. 8, first logic engine 121 may be the current logic engine in FIG. 8 and may be used to perform S301-S303. For another example, in conjunction with fig. 9, first logic engine 121 may be the current logic engine in fig. 9, and may be configured to perform S401-S403.
Optionally, the first logic engine 121 is further configured to: when the number of times of calculation of the first logic engine 121 reaches the preset number of times, the state of the first logic engine 121 is set to a busy state, wherein the busy state of the first logic engine 121 indicates that the second logic engine cannot receive data. For example, in conjunction with fig. 8 or 9, the first logic engine 121 may be used to perform the step of setting to busy state in S304 or S404.
Optionally, the calculation performed by the first logic engine 121 on the first data frame is a calculation of a sub-execution process, and the first logic engine 121 is a logic engine that performs a calculation on the first physical engine 111 in the sub-execution process last. The first logic engine 121 is further configured to: when the number of times of calculation of the first logic engine 121 reaches a preset number of times, sending first indication information to the third logic engine 123, where the third logic engine 123 is configured to send data to the first target logic engine 124, the first target logic engine 124 is a logic engine that performs the first calculation in the sub-execution process on the first physical engine 111, the first indication information is configured to indicate that the third logic engine 123 sets its own state to an idle state, and the state of the third logic engine 123 is an idle state and is configured to indicate that the first target logic engine 124 can receive data. Fig. 12 is a schematic structural diagram of the neural network circuit 11 applicable to this alternative implementation. The first logic engine 121 and the first target logic engine may be the same logic engine or different logic engines. Fig. 12 illustrates an example in which the two logic engines are different. For example, in connection with fig. 6A, first logic engine 121 may be logic engine 5, in which case the first target logic engine is logic engine 2 and the third logic engine is logic engine 1.
Optionally, the calculation performed by the first logic engine 121 on the first data frame belongs to a calculation of a sub-execution process, and the first logic engine 121 is not a logic engine that performs a calculation last on the first physical engine in the sub-execution process. The first logic engine 121 is further configured to: when the number of times of calculation of the first logic engine 121 reaches a preset number of times, sending first indication information to the third logic engine 123, where the first indication information is used to indicate that the third logic engine 123 sets its own state to an idle state, the third logic engine 123 is used to send data to the first target logic engine 124, the first target logic engine 124 is located on the first physical engine 111, and the first target logic engine is a logic engine that executes calculation next after the first logic engine 121 executes calculation in the sub-execution process. Fig. 13 is a schematic structural diagram of the neural network circuit 11 applicable to this alternative implementation. For example, in connection with fig. 6A, the first logic engine 121 may be logic engine 2, in which case the first target logic engine is logic engine 5 and the third logic engine is logic engine 4.
Optionally, the neural network circuit 11 further includes: a fourth logic engine 125; fourth logic engine 125 is the logic engine in the same sub-execution as first logic engine 121, and fourth logic engine 125 is the logic engine in the sub-execution that last performed the computation. The fourth logic engine 125 is configured to obtain a second data frame; the fourth logic engine 125 is configured to perform the calculation on the second data frame when the state of the second physical engine 112 to which the fourth logic engine 125 belongs is the first state, where the state of the second physical engine 112 is the first state and indicates that there is no logic engine on the second physical engine 112 that is performing the calculation. Fig. 14 is a schematic structural diagram of the neural network circuit 11 applicable to this alternative implementation. The second logic engine 122 and the fourth logic engine 125 may or may not be the same logic engine. Fig. 14 illustrates an example in which the two logic engines are not the same. For example, in connection with FIG. 6A, when the sub-execution process is not the last sub-execution process, the fourth logic engine 125 may be logic engine 6.
Optionally, the fourth logic engine 125 is further configured to: and sending second indication information to the fifth logic engine 126, wherein the fifth logic engine 126 is used for sending data to the second target logic engine 127, the second target logic engine 127 is a logic engine for performing calculation first in the sub-execution process of the second physical engine 112, and the second indication information is used for indicating that the second target logic engine 127 can receive data. Fig. 14 is a schematic structural diagram of the neural network circuit 11 applicable to this alternative implementation. For example, in conjunction with fig. 6A, when the sub-execution process is not the last sub-execution process, the fourth logic engine 125 may be logic engine 6, the second target logic engine 127 may be logic engine 6, and the fifth logic engine 126 may be logic engine 5.
Optionally, the first logic engine 121 is further configured to: after the first logic engine 121 starts computing, setting the state of the first physical engine 111 to a second state, where the second state is used to indicate that the physical resources of the first physical engine 111 are occupied; and setting the state of the first physics engine 111 to the first state after the first logic engine 121 transmits the calculation result to the second logic engine 122. For example, based on fig. 7-10, the current logic engine may be used to perform S203, S305, S405, and S504, respectively.
In one example, referring to fig. 1, the neural network circuit 11 may be the neural network circuit 110 in fig. 1. Any physical engine (e.g., the first physical engine 111) in the neural network circuit 11 may be the physical engine 1302 in fig. 1, and any logic engine (e.g., the first logic engine 121) in the neural network circuit 11 may be the logic engine 1304 in fig. 1.
For the detailed description of the above alternative modes, reference is made to the foregoing method embodiments, which are not described herein again. In addition, for any explanation and beneficial effect description of the neural network circuit 11, reference may be made to the corresponding method embodiments, and details are not repeated.
It should be noted that the actions correspondingly performed by the logic engines are merely specific examples, and the actions actually performed by the modules refer to the actions or steps mentioned in the above description based on the embodiments of fig. 7 to fig. 10.
Embodiments of the present application also provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the computer program causes the computer to execute the actions or steps mentioned in any of the embodiments provided above.
The embodiments of the present application also provide a chip. Integrated in the chip are circuitry and one or more interfaces for implementing the functions of the neural network circuit 11 described above. Optionally, the functions supported by the chip may include the processing actions in the embodiments described with reference to fig. 7 to fig. 10, which are not described here again. Those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments may be implemented by a program instructing the associated hardware to perform the steps. The program may be stored in a computer-readable storage medium, such as a read-only memory or a random access memory. The processing unit or processor may be a central processing unit, a general-purpose processor, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
The embodiments of the present application also provide a computer program product containing instructions which, when executed on a computer, cause the computer to execute any one of the methods in the above embodiments. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., an SSD), among others.
It should be noted that the above devices for storing computer instructions or computer programs provided in the embodiments of the present application, such as, but not limited to, the above memories, computer readable storage media, communication chips, and the like, are all nonvolatile (non-volatile).
Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Although the present application has been described in conjunction with specific features and embodiments thereof, various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application.

Claims (14)

1. A neural network circuit, comprising:
a first physics engine virtualized as one or more logical engines;
a first logic engine of the one or more logic engines to:
obtaining a first data frame;
when the state of the first physical engine is a first state and the state of the first logic engine is an idle state, calculating the first data frame, wherein the state of the first physical engine is the first state and is used for indicating that no logic engine performing calculation is arranged on the first physical engine, the state of the first logic engine is the idle state and is used for indicating that a second logic engine can receive data, and the second logic engine is used for calculating output data of the first logic engine.
2. The neural network circuit of claim 1,
the first logic engine is further to: when the calculation times of the first logic engine reach preset times, setting the state of the first logic engine to be a busy state, wherein the state of the first logic engine is the busy state and is used for indicating that the second logic engine cannot receive data.
3. The neural network circuit of claim 1 or 2, wherein the computation performed by the first logic engine on the first data frame is a computation of a sub-execution, and the first logic engine is a logic engine on the first physical engine that last performed a computation of the sub-execution;
the first logic engine is further to: when the number of times of calculation of the first logic engine reaches a preset number of times, sending first indication information to a third logic engine, where the third logic engine is configured to send data to a first target logic engine, the first target logic engine is a logic engine that performs first calculation in the sub-execution process on the first physical engine, the first indication information is configured to indicate the third logic engine to set a self state to an idle state, and the state of the third logic engine is the idle state and is configured to indicate that the first target logic engine can receive data.
4. The neural network circuit of claim 1 or 2, wherein the computation performed by the first logic engine on the first data frame belongs to a computation of a sub-execution, and the first logic engine is not a logic engine on the first physical engine that last performed a computation in the sub-execution;
the first logic engine is further to: when the number of times of calculation of the first logic engine reaches a preset number of times, sending first indication information to a third logic engine, where the first indication information is used to indicate that the third logic engine sets its own state to an idle state, the third logic engine is used to send data to a first target logic engine, the first target logic engine is located on the first physical engine, and the first target logic engine is a logic engine that executes calculation next after the first logic engine executes calculation in the sub-execution process.
5. The neural network circuit of any one of claims 1 to 4, further comprising: a fourth logic engine; the fourth logic engine and the first logic engine are logic engines in the same sub-execution process, and the fourth logic engine is a logic engine for performing calculation in the last sub-execution process;
the fourth logic engine is to obtain a second data frame;
the fourth logic engine is configured to perform computation on the second data frame when a state of a second physical engine to which the fourth logic engine belongs is a first state, where the state of the second physical engine is the first state and is used to indicate that no logic engine on the second physical engine is performing computation.
6. The neural network circuit of claim 5,
the fourth logic engine is further to: and sending second indication information to a fifth logic engine, wherein the fifth logic engine is used for sending data to a second target logic engine, the second target logic engine is a logic engine for executing calculation in the sub-execution process on the second physical engine, and the second indication information is used for indicating that the second target logic engine can receive data.
7. The neural network circuit of any one of claims 1-6, wherein the first logic engine is further configured to: after the first logic engine starts computing, setting the state of the first physical engine to be a second state, wherein the second state is used for indicating that physical resources of the first physical engine are occupied; and after the first logic engine sends the calculation result to the second logic engine, setting the state of the first physical engine to be the first state.
8. A method of controlling data flow in a neural network circuit, the neural network circuit comprising a first physics engine virtualized as one or more logic engines; the method is applied to a first logic engine of the one or more logic engines; the method comprises the following steps:
obtaining a first data frame;
when the state of the first physical engine is a first state and the state of the first logic engine is an idle state, calculating the first data frame, wherein the state of the first physical engine is the first state and is used for indicating that no logic engine performing calculation is arranged on the first physical engine, the state of the first logic engine is the idle state and is used for indicating that a second logic engine can receive data, and the second logic engine is used for calculating output data of the first logic engine.
9. The method of claim 8, further comprising:
when the calculation times of the first logic engine reach preset times, setting the state of the first logic engine to be a busy state, wherein the state of the first logic engine is the busy state and is used for indicating that the second logic engine cannot receive data.
10. The method of claim 8 or 9, wherein the computation performed by the first logical engine on the first data frame is a computation of a sub-execution, and wherein the first logical engine is a logical engine on the first physical engine that last performed a computation of the sub-execution; the method further comprises the following steps:
when the number of times of calculation of the first logic engine reaches a preset number of times, sending first indication information to a third logic engine, where the third logic engine is configured to send data to a first target logic engine, the first target logic engine is a logic engine that performs first calculation in the sub-execution process on the first physical engine, the first indication information is configured to indicate the third logic engine to set a self state to an idle state, and the state of the third logic engine is the idle state and is configured to indicate that the first target logic engine can receive data.
11. The method of claim 8 or 9, wherein the computation performed by the first logical engine on the first data frame belongs to a computation of a sub-execution, and the first logical engine is not a logical engine that last performed a computation on the first physical engine in the sub-execution; the method further comprises the following steps:
when the number of times of calculation of the first logic engine reaches a preset number of times, sending first indication information to a third logic engine, where the first indication information is used to indicate that the third logic engine sets its own state to an idle state, the third logic engine is used to send data to a first target logic engine, the first target logic engine is located on the first physical engine, and the first target logic engine is a logic engine that executes calculation next after the first logic engine executes calculation in the sub-execution process.
12. The method according to any one of claims 8 to 11, further comprising:
after the first logic engine starts computing, setting the state of the first physical engine to be a second state, wherein the second state is used for indicating that physical resources of the first physical engine are occupied; and after the first logic engine sends the calculation result to the second logic engine, setting the state of the first physical engine to be the first state.
13. A neural network system, comprising:
a processor and the neural network circuit of any one of claims 1-7,
the processor is configured to transmit one or more data frames to the neural network circuit, wherein the one or more data frames include the first data frame.
14. A computer-readable storage medium for storing a computer program which, when run on a computer, causes the computer to perform the method of any one of claims 8 to 12.
CN202010087266.XA 2020-02-11 2020-02-11 Neural network circuit, system and method for controlling data flow Pending CN113255902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010087266.XA CN113255902A (en) 2020-02-11 2020-02-11 Neural network circuit, system and method for controlling data flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010087266.XA CN113255902A (en) 2020-02-11 2020-02-11 Neural network circuit, system and method for controlling data flow

Publications (1)

Publication Number Publication Date
CN113255902A true CN113255902A (en) 2021-08-13

Family

ID=77219590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010087266.XA Pending CN113255902A (en) 2020-02-11 2020-02-11 Neural network circuit, system and method for controlling data flow

Country Status (1)

Country Link
CN (1) CN113255902A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113843796A (en) * 2021-09-30 2021-12-28 上海傅利叶智能科技有限公司 Data transmission method and device, control method and device of online robot and online robot

Similar Documents

Publication Publication Date Title
US20220391771A1 (en) Method, apparatus, and computer device and storage medium for distributed training of machine learning model
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
US20210342395A1 (en) Storage edge controller with a metadata computational engine
CN110389909A (en) Use the system and method for the performance of deep neural network optimization solid state drive
US20190303198A1 (en) Neural network processor
CN106503791A (en) System and method for the deployment of effective neutral net
CN110991632A (en) Method for designing heterogeneous neural network computing accelerator based on FPGA
CN115136123A (en) Tile subsystem and method for automated data flow and data processing within an integrated circuit architecture
EP3844620A1 (en) Method, apparatus, and system for an architecture for machine learning acceleration
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN111752879B (en) Acceleration system, method and storage medium based on convolutional neural network
CN110717574A (en) Neural network operation method and device and heterogeneous intelligent chip
CN115600676A (en) Deep learning model reasoning method, device, equipment and storage medium
CN112699074A (en) Techniques for configuring and operating neural networks
CN110825514A (en) Artificial intelligence chip and instruction execution method for artificial intelligence chip
CN113255902A (en) Neural network circuit, system and method for controlling data flow
CN114330686A (en) Configurable convolution processing device and convolution calculation method
WO2021244045A1 (en) Neural network data processing method and apparatus
US20230126978A1 (en) Artificial intelligence chip and artificial intelligence chip-based data processing method
CN113127179A (en) Resource scheduling method and device, electronic equipment and computer readable medium
CN111935026B (en) Data transmission method, device, processing equipment and medium
US20230229899A1 (en) Neural network processing method and device therefor
CN114253694A (en) Asynchronous processing method and device based on neural network accelerator
US11531578B1 (en) Profiling and debugging for remote neural network execution
WO2024001870A1 (en) Training method for artificial intelligence model, and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination