CN111767995A - Operation method, device and related product - Google Patents

Operation method, device and related product

Info

Publication number
CN111767995A
Authority
CN
China
Prior art keywords
task
executed
neural network
operator
tasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910263147.2A
Other languages
Chinese (zh)
Other versions
CN111767995B (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910263147.2A priority Critical patent/CN111767995B/en
Priority to PCT/CN2020/082831 priority patent/WO2020200250A1/en
Publication of CN111767995A publication Critical patent/CN111767995A/en
Application granted granted Critical
Publication of CN111767995B publication Critical patent/CN111767995B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065 Analogue means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Multi Processors (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an operation method, an operation device and a related product. The product comprises a control module, and the control module comprises an instruction cache unit, an instruction processing unit and a storage queue unit. The instruction cache unit is used for storing calculation instructions associated with an artificial neural network operation; the instruction processing unit is used for parsing the calculation instruction to obtain a plurality of operation instructions; and the storage queue unit is used for storing an instruction queue, the instruction queue comprising a plurality of operation instructions or calculation instructions to be executed in the front-to-back order of the queue. Through this method, the operation efficiency of the related product when running a neural network model can be improved.

Description

Operation method, device and related product
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to an operation method, an operation device, and a related product.
Background
In the field of deep learning techniques, concurrent synchronization of multiple tasks is typically achieved at the instruction level, using hardware instructions and corresponding hardware mechanisms. However, this approach is not applicable to all processors, so it has certain limitations and low universality.
Disclosure of Invention
In view of the above, the present disclosure provides an operation method, an operation device and a related product.
According to an aspect of the present disclosure, there is provided an arithmetic method applied to a processor including a plurality of processing units, the method including: determining a dependency relationship between tasks to be executed of a neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation; and adding a synchronous operator in the neural network model according to the dependency relationship so as to enable each task to be executed of the modified neural network model to be executed according to the dependency relationship.
In a possible implementation manner, the adding a synchronization operator in the neural network model according to the dependency relationship includes: and adding the synchronous operator between the network layer of the current task and the network layer of the previous task which run on different processing units according to the dependency relationship, wherein the current task is executed after the previous task is executed.
In a possible implementation manner, the dependency relationship is determined by whether the storage address intervals of the data required by the tasks to be executed have an overlapping area.
In one possible implementation, the synchronization operator includes a first operator and a second operator; the first operator is used for representing the running state of the preceding task; and the second operator is used for determining whether to run the current task according to the first operator.
In a possible implementation manner, the second operator is configured to read the first operator at a preset time interval, and determine whether to run the current task according to the read first operator.
In one possible implementation, the operation status includes operation incomplete or operation complete.
In one possible implementation, the synchronization operator is stored in a shared memory of the processor.
In one possible implementation, the method further includes: and generating a scheduling table according to the neural network model added with the synchronous operator, wherein the scheduling table comprises the tasks to be executed by each processing unit and the dependency relationship of each task to be executed.
In one possible implementation manner, the scheduling table is multiple, and each scheduling table corresponds to each processing unit of the processor.
According to another aspect of the present disclosure, there is provided an arithmetic device comprising: the dependency relationship determining module is used for determining the dependency relationship among tasks to be executed of the neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation; and the synchronous operator adding module is used for adding a synchronous operator in the neural network model according to the dependency relationship so as to enable each task to be executed of the modified neural network model to be executed according to the dependency relationship.
In one possible implementation, the synchronization operator adding module includes: and the first adding submodule is used for adding the synchronous operator between a network layer of a current task and a network layer of a previous task which run on different processing units according to the dependency relationship, wherein the current task is a task executed after the previous task is executed.
In a possible implementation manner, the dependency relationship is determined by whether the storage address intervals of the data required by the tasks to be executed have an overlapping area.
In one possible implementation, the synchronization operator includes a first operator and a second operator; the first operator is used for representing the running state of the preceding task; and the second operator is used for determining whether to run the current task according to the first operator.
In a possible implementation manner, the second operator is configured to read the first operator at a preset time interval, and determine whether to run the current task according to the read first operator.
In one possible implementation, the operation status includes operation incomplete or operation complete.
In one possible implementation, the synchronization operator is stored in a shared memory of the processor.
In one possible implementation, the apparatus further includes: and the scheduling table generating module is used for generating a scheduling table according to the neural network model added with the synchronous operator, and the scheduling table comprises the tasks to be executed by each processing unit and the dependency relationship of each task to be executed.
In one possible implementation manner, the scheduling table is multiple, and each scheduling table corresponds to each processing unit of the processor.
According to another aspect of the present disclosure, there is provided a computer device, including a memory, and a processor, where the memory stores thereon a computer program operable on the processor, and the processor implements the steps of any one of the above-mentioned operation methods when executing the computer program.
According to another aspect of the present disclosure, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the above-described operational methods.
According to another aspect of the present disclosure, there is provided a machine learning operation apparatus, including one or more of any one of the operation apparatuses described above, configured to acquire input data and control information to be operated from another processing apparatus, execute a specified machine learning operation, and transmit an execution result to the other processing apparatus through an I/O interface;
when the machine learning arithmetic device comprises a plurality of arithmetic devices, the arithmetic devices can be connected through a specific structure and transmit data;
the plurality of operation devices are interconnected through a PCIE bus and transmit data so as to support operation of larger-scale machine learning;
a plurality of the arithmetic devices share the same control system or have respective control systems;
the plurality of computing devices share a memory or own respective memories;
the plurality of arithmetic devices are connected in an arbitrary connection topology.
According to another aspect of the present disclosure, a combined processing device is provided, which includes the above machine learning arithmetic device, a universal interconnection interface and other processing devices; and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
In one possible implementation manner, the combined processing device further includes: a storage device; the storage device is respectively connected with the machine learning arithmetic device and the other processing devices and is used for storing data of the machine learning arithmetic device and the other processing devices.
According to another aspect of the present disclosure, a neural network chip is provided, the chip including the above machine learning arithmetic device or combined processing device.
According to another aspect of the present disclosure, there is provided an electronic device including the neural network chip described above.
According to another aspect of the present disclosure, a board card is provided, which includes: a memory device, an interface device, a control device and the above-mentioned neural network chip;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the neural network chip and external equipment;
and the control device is used for monitoring the state of the neural network chip.
In one possible implementation, the storage device includes: a plurality of groups of memory units, each group of memory units being connected with the neural network chip through a bus, wherein the memory units are DDR SDRAM; the chip includes a DDR controller for controlling data transmission and data storage of each memory unit; and the interface device is a standard PCIE interface.
According to the embodiments of the present disclosure, after the neural network model is split to determine a plurality of tasks to be executed and the tasks to be executed are issued to different processing units, a synchronization operator is added to the neural network model according to the dependency relationship determined between the tasks to be executed, so that the plurality of tasks to be executed of the modified neural network model are executed according to the dependency relationship. In this way, multi-task concurrent synchronization is achieved by adding the synchronization operator at the operator level (the network structure of the neural network model). Compared with the conventional way of realizing multi-task concurrent synchronization with hardware instructions and corresponding hardware mechanisms at the instruction level (hardware mechanism), the method is applicable to various processors, so that universality is effectively improved.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of a method of operation according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating that a plurality of tasks to be executed are issued to different processing units in an operation method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating that a plurality of tasks to be executed are issued to different processing units in an operation method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating that a plurality of tasks to be executed are issued to different processing units in an operation method according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a computing device according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a combined processing device according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of another combined processing device according to an embodiment of the present disclosure;
fig. 8 shows a block diagram of a board card according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
First, it should be noted that the operation method of the present disclosure can be applied to a general-purpose processor, such as a CPU (Central Processing Unit), and can also be applied to an artificial intelligence processor. An artificial intelligence processor (IPU) refers to a processor for performing artificial intelligence operations, such as one of, or a combination of, a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor) and an FPGA (Field-Programmable Gate Array) chip. The present disclosure is not limited to a particular type of artificial intelligence processor.
Meanwhile, a processor referred to in the present disclosure may include a plurality of processing units (also referred to as cores), each of which can independently execute the various tasks assigned to it, such as a convolution operation task, a pooling task or a fully-connected task. The processing unit and the tasks executed by the processing unit are not specifically limited herein.
Fig. 1 shows a flow diagram of a method of operation according to an embodiment of the present disclosure. Referring to fig. 1, the operation method of the present disclosure includes:
and S100, determining the dependency relationship among tasks to be executed of the neural network model.
It is noted here that the neural network model may be any of various artificial intelligence models. That is, the neural network model may be a CNN (Convolutional Neural Network), or may be one or more of neural networks such as an RNN (Recurrent Neural Network), a BiRNN (Bidirectional RNN), a GRU (Gated Recurrent Unit) network and an LSTM (Long Short-Term Memory) network. The specific type of the neural network model is not limited herein.
Meanwhile, a task to be executed refers to a task obtained by splitting the neural network model to be run according to the content of the neural network model. There are a plurality of tasks to be executed, and each task to be executed is issued to a different processing unit for operation. That is, by splitting the neural network model to be run, the neural network model can be run on different processing units in a model-parallel manner.
It should be noted here that, as will be understood by those skilled in the art, model parallelism means that different processing units in a distributed system are responsible for different parts of the neural network model; for example, different network layers of the neural network model are allocated to different processing units, or different parameters within the same layer are allocated to different processing units. That is, the operators and other characteristics inherent to the model are split and run on different processing units.
Step S200: adding a synchronization operator to the neural network model according to the dependency relationship, so that each task to be executed of the modified neural network model is executed according to the dependency relationship. The synchronization operator is used for representing the running state of the preceding task and determining the running state of the current task according to the running state of the preceding task. The preceding task and the current task are two tasks to be executed, among the plurality of tasks to be executed, that have a dependency relationship.
That is, when a neural network model is run in a model-parallel manner, the neural network model is split according to its content and the resulting tasks are issued to different processing units for operation, thereby scheduling the plurality of tasks to be executed. After the scheduling process is completed, the task pipeline on each processing unit (i.e., the tasks to be executed that need to run on that processing unit) is determined. A certain logical dependency may exist between tasks to be executed in different task pipelines (for example, task B cannot be executed until task A has been executed). Therefore, a synchronization operator can be added to the neural network model according to the dependency relationship among the tasks to be executed, so that the tasks to be executed run according to the dependency relationship.
Therefore, after the neural network model is split to determine a plurality of tasks to be executed and the tasks to be executed are issued to different processing units, a synchronization operator is added to the neural network model according to the dependency relationship determined between the tasks to be executed, so that the plurality of tasks to be executed of the modified neural network model are executed according to the dependency relationship. In this way, multi-task concurrent synchronization is achieved by adding the synchronization operator at the operator level (the network structure of the neural network model). Compared with the conventional way of realizing multi-task concurrent synchronization with hardware instructions and corresponding hardware mechanisms at the instruction level (hardware mechanism), the method is applicable to various processors, so that universality is effectively improved.
It should be noted that, in the present disclosure, synchronization operators are added according to the dependency relationships between the tasks to be executed. Therefore, there may be a plurality of synchronization operators; that is, the number of synchronization operators that can be added depends on the respective tasks to be executed.
For example, fig. 2 is a schematic diagram illustrating that a plurality of tasks to be executed are issued to different processing units in an operation method according to an embodiment of the present disclosure. Referring to fig. 2, by splitting a certain neural network model, 5 tasks to be executed are determined, namely: task A, task B, task C, task D and task E. Task A has a dependency relationship with task B, task C, task D and task E. The processor running the neural network model is correspondingly configured with four processing units, namely core 1, core 2, core 3 and core 4. Task A and task B are issued to core 1 for operation, task C is issued to core 2, task D is issued to core 3, and task E is issued to core 4. Therefore, when synchronization operators are added to the neural network model, a synchronization operator can be added between the network layers corresponding to task A and task B, between the network layers corresponding to task A and task C, between the network layers corresponding to task A and task D, and between the network layers corresponding to task A and task E, so that the execution topology of the neural network model after the synchronization operators are inserted remains the same as before insertion, and the running accuracy of the neural network model is ensured.
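As an illustration only, the dependency edges of the fig. 2 example and the places where synchronization operators would go can be recorded as in the following minimal Python sketch; the data structures and names are assumptions made for this sketch and are not prescribed by the disclosure.

def plan_sync_operators(dependencies):
    # One synchronization operator per dependency edge, to be inserted between the
    # network layer of the preceding task and that of the current task, so that the
    # execution topology before and after insertion stays the same.
    return [{"between": (pred, cur), "operator": f"sync_{pred}_{cur}"}
            for pred, cur in dependencies]

# Fig. 2 example: task A precedes tasks B, C, D and E; A and B run on core 1,
# C on core 2, D on core 3, E on core 4.
dependencies = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E")]
print(plan_sync_operators(dependencies))

The insertion step itself would then place each listed operator between the two corresponding network layers of the split model.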
In one possible implementation, adding a synchronization operator to the neural network model according to the dependency relationship may include: adding the synchronization operator, according to the dependency relationship, between the network layer of the current task and the network layer of the preceding task that run on different processing units. It should be noted that the current task is a task executed after the preceding task is executed.
That is, when a synchronization operator is added to the neural network model, it may be added between the network layers respectively corresponding to a preceding task and a current task that have a dependency relationship, where the current task is executed after the preceding task is executed, and the preceding task and the current task run on different processing units.
For example, fig. 3 is a schematic diagram illustrating that a plurality of tasks to be executed are issued to different processing units in an operation method according to an embodiment of the present disclosure. Referring to fig. 3, three tasks to be executed, namely task A, task B and task C, are determined by splitting a certain neural network model. Task A has a dependency relationship with task B (i.e., task A is the preceding task and task B is the current task), and task A also has a dependency relationship with task C (i.e., task A is the preceding task and task C is the current task). The processor running the neural network model is configured with two processing units, core 1 and core 2. Task A and task B are issued to core 1 for operation, and task C is issued to core 2 for operation. Therefore, when the synchronization operator is added to the neural network model, it only needs to be added between the network layer corresponding to task A and the network layer corresponding to task C. As for task A and task B, because these two tasks are issued to the same processing unit (core 1) for operation, that processing unit can directly run task A and task B according to the dependency relationship between them, and therefore no synchronization operator needs to be added between the network layers respectively corresponding to task A and task B.
Therefore, when synchronization operators are added between the network layers corresponding to a preceding task and a current task that have a dependency relationship, they are only added between two tasks to be executed that have a dependency relationship and run on different processing units. Unnecessary synchronization operators are thus avoided, which simplifies the operation of adding synchronization operators, saves the time needed to add them, and ultimately improves the efficiency of multi-task concurrent synchronization.
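Building on the sketch above, and again using assumed names purely for illustration, the refinement described in this implementation amounts to skipping any dependency edge whose two tasks were issued to the same processing unit:

def plan_sync_operators_cross_unit(dependencies, assignment):
    # Keep only edges whose endpoints run on different processing units; a single
    # unit already runs its own tasks in dependency order, so no operator is needed there.
    return [{"between": (pred, cur), "operator": f"sync_{pred}_{cur}"}
            for pred, cur in dependencies
            if assignment[pred] != assignment[cur]]

# Fig. 3 example: tasks A and B share core 1, task C runs on core 2.
fig3_dependencies = [("A", "B"), ("A", "C")]
fig3_assignment = {"A": 1, "B": 1, "C": 2}
print(plan_sync_operators_cross_unit(fig3_dependencies, fig3_assignment))
# Only the A -> C edge receives a synchronization operator.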
In a possible implementation manner, the dependency relationship may be determined by whether the storage address intervals of the data required by the tasks to be executed have an overlapping area. Among the plurality of tasks to be executed obtained after the neural network model is split, if the storage address intervals of the data (which may be matrices) required by two tasks to be executed have an overlapping area, it can be determined that the two tasks have a dependency relationship (for example, for a preceding task and a current task, the storage address interval of the data required by the preceding task and that of the data required by the current task have an overlapping area). If the storage address intervals of the data required by the two tasks to be executed have no overlapping area, it can be determined that the two tasks have no dependency relationship.
Such as: the plurality of tasks to be executed include a first task to be executed and a second task to be executed. When the storage address region of the data required by the first task to be executed and the storage address region of the data required by the second task to be executed have an overlapped region, determining that the first task to be executed and the second task to be executed have a dependency relationship. And when the storage address interval of the data required by the first task to be executed and the storage address interval of the data required by the second task to be executed do not have an overlapped area, determining that the first task to be executed and the second task to be executed do not have a dependency relationship.
In a possible implementation, the dependency relationship may further include an input-output relationship between tasks to be executed. That is, the preceding task and the current task may be in an input-output relationship: the output result of the preceding task is the input data of the current task. Thus, the current task cannot be run until the preceding task has been run.
When the dependency relationship includes an input-output relationship between tasks to be executed, adding a synchronization operator to the neural network model according to the dependency relationship may include: adding a synchronization operator between the network layers corresponding to the two tasks to be executed that have the input-output relationship.
Such as: by splitting a certain neural network model, two tasks to be executed are determined, namely a task A and a task B. Wherein, the output result of the task A is the input data of the task B. Meanwhile, the processor running the neural network model is configured with two processing units, core 1 and core 2 respectively. Wherein, the task A is issued to the core 1 for operation, and the task B is issued to the core 2 for operation. Therefore, when the synchronization operator is added to the neural network model, the synchronization operator can be added between the network layer corresponding to the task a and the network layer corresponding to the task B.
Therefore, by adding the synchronization operator between the network layers corresponding to two tasks to be executed that have an input-output relationship, the accuracy of running the neural network model is further ensured.
In one possible implementation, the dependency relationship may further include a running order of the plurality of tasks to be executed. That is, after the neural network model is split to determine a plurality of tasks to be executed and those tasks are issued to different processing units for processing, the structure of the neural network model is fixed, so the running order among the tasks to be executed is also determined. The plurality of tasks to be executed may include two tasks that are executed synchronously (that is, they may be executed in parallel). If, in practice, a running order needs to be set for two such parallel tasks (for example, by setting priorities), then, when a synchronization operator is added to the neural network model according to the dependency relationship, a synchronization operator can also be added between the network layers corresponding to two tasks to be executed that are adjacent in the running order.
Such as: fig. 4 is a schematic diagram illustrating that a plurality of tasks to be executed are issued to different processing units in an operation method according to an embodiment of the present disclosure. Referring to fig. 4, by splitting a certain neural network model, 5 tasks to be executed are determined, which are: task A, task B, task C, task D, and task E. Task A and task B are executed in parallel, and task A and task C are executed serially (task A is in front and task C is behind), task A and task D are executed serially (task A is in front and task D is behind), task B and task C are executed serially (task B is in front and task C is behind), and task B and task E are executed serially (task B is in front and task E is behind).
The processor running the neural network model is correspondingly configured with three processing units, namely core 1, core 2 and core 3. Task A and task B are issued to core 1 for operation, task B and task E are issued to core 2 for operation, and task C is issued to core 3 for operation. Meanwhile, according to the actual situation, a running order is set for task A and task B, which were initially executed in parallel, so that task B is executed after task A is executed.
Therefore, when the synchronization operator is added to the neural network model, a synchronization operator can be added between the network layers corresponding to task A and task B, between the network layers corresponding to task A and task C, and between the network layers corresponding to task B and task C.
When the dependency relationship includes the running order of the plurality of tasks to be executed, a synchronization operator is added between the network layers of two tasks to be executed that are adjacent in the running order, so that the accuracy of running the neural network model is further improved.
In one possible implementation, the synchronization operator may include a first operator and a second operator. The first operator is used for representing the running state of the preceding task, and the second operator is used for determining whether to run the current task according to the first operator. The preceding task and the current task run on different processing units, and the preceding task is executed before the current task.
That is, the synchronization operator may be implemented with a pair of operators (a first operator and a second operator). The first operator is added after the network layer corresponding to the preceding task and is used for representing the running state of the preceding task. The second operator is added as the counterpart of the first operator, is placed before the current task, and is used for determining whether to run the current task according to the first operator. Using such a pair of operators as the synchronization operator keeps the structure simple and easy to implement.
In a possible implementation manner, when the second operator determines whether to run the current task according to the first operator, the second operator may read the first operator at a preset time interval and determine whether to run the current task according to the value read.
That is, the second operator may read the first operator at a preset time interval and determine, from the currently read value, whether the preceding task has finished running. After it is determined that the preceding task has finished running, it can be determined that the current task may run. If it is determined that the preceding task has not finished running, or has not started running, it can be determined that the current task should not run yet.
Setting the second operator to read the first operator at the preset time interval avoids reading the first operator too frequently, reduces the number of reads of the first operator, and thus effectively reduces power consumption.
In one possible implementation, the running state includes either operation incomplete or operation complete. A flag bit may be set for the first operator, and different values of the flag bit represent the different running states of the preceding task.
Such as: the first operator may be: notify, the second operator is: sync. The synchronization operator does not compute, only locks at the operator level. The method is realized by adopting a flag bit mode. The usage of notify is that the calculation completion (operation completion) of the predecessor task is set to 1, and the default of the calculation completion (operation) is 0. Before operation (waiting for operation) 0 is defaulted. The usage of sync synchronization is: reading the value of the flag bit of the notify (first operator) at intervals, reading 1, and then moving backwards, so that the running state of the current task is running, and waiting if the current task is not read or is read to 0, so that the running state of the current task is waiting to run.
By setting the flag bit for the first operator and representing the running state of the preceding task through the value of the flag bit, the synchronization operator added to the neural network model exists as an operator-level lock and does not participate in the operation of the neural network model, so the dependency relationship is not changed by synchronization and the accuracy of the network topology of the neural network model is preserved.
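A minimal sketch of such a notify/sync pair is shown below; the shared memory is modelled as a plain Python dictionary and all names are assumptions made for illustration, not the disclosure's implementation.

import time

class Notify:
    """First operator: sets a flag bit when the preceding task finishes running."""
    def __init__(self, shared_memory, key):
        self.mem, self.key = shared_memory, key
        self.mem[key] = 0            # default 0: preceding task not yet complete
    def run(self):
        self.mem[self.key] = 1       # preceding task finished running

class Sync:
    """Second operator: polls the flag bit at a preset interval and returns only after reading 1."""
    def __init__(self, shared_memory, key, interval_s=0.001):
        self.mem, self.key, self.interval = shared_memory, key, interval_s
    def run(self):
        while self.mem.get(self.key, 0) != 1:
            time.sleep(self.interval)   # re-read only after the preset time interval
        # returning here lets the current task start running

shared = {}                            # stands in for the processor's shared memory
notify_a, sync_a = Notify(shared, "A_done"), Sync(shared, "A_done")
notify_a.run()   # preceding task A finishes on its processing unit
sync_a.run()     # the unit holding the current task may now proceed

In an actual processor the flag would live in the shared memory described in the next paragraph, and the two operators would run on different processing units.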
In one possible implementation, the synchronization operator may be stored in a shared memory of the processor. Therefore, when the processor runs the neural network model, each processing unit can directly read the corresponding synchronous operator from the shared memory to determine the running sequence of each task to be executed, and then the tasks are sequentially run according to the determined running sequence.
Storing the added synchronization operator in the shared memory simplifies the hardware configuration of the processor and requires no hardware changes to the processor, thereby effectively saving hardware cost.
In a possible implementation manner, after the synchronization operator is added to the neural network model according to the dependency relationship, the method may further include: generating a scheduling table according to the neural network model to which the synchronization operator has been added. It should be noted that the generated scheduling table may be stored in the shared memory of the processor. The scheduling table includes the tasks to be executed by each processing unit and the dependency relationship of each task to be executed.
The scheduling table is generated according to the neural network model to which the synchronization operator has been added, and the generated scheduling table is stored in the shared memory of the processor. When each processing unit runs its part of the tasks of the neural network model, it can directly read the scheduling table and the corresponding synchronization operators from the shared memory, and then run its tasks to be executed in order according to what it has read, which effectively increases the data-reading speed and thus the running speed.
In one possible implementation, there may be a plurality of scheduling tables, each corresponding to one processing unit of the processor. In this way, when a processing unit reads a scheduling table from the shared memory and runs its tasks to be executed according to it, the processing unit can directly read its own scheduling table without reading all the data, which effectively reduces the amount of data read and further increases the running speed.
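The per-unit scheduling tables described above might be assembled as in the following sketch; the data layout (task-to-core assignment plus (preceding, current) dependency pairs) is assumed for illustration only.

from collections import defaultdict

def build_schedules(tasks, assignment, dependencies):
    # Returns one scheduling table per processing unit, listing the tasks the unit
    # must run together with the preceding tasks each one waits on.
    schedules = defaultdict(list)
    for task in tasks:
        preds = [p for (p, c) in dependencies if c == task]
        schedules[assignment[task]].append({"task": task, "depends_on": preds})
    return dict(schedules)   # e.g. one entry per core, stored in shared memory

print(build_schedules(["A", "B", "C"], {"A": 1, "B": 1, "C": 2}, [("A", "B"), ("A", "C")]))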
In summary, according to the operation method of the present disclosure, when the neural network model runs on the processor in a model-parallel manner, a synchronization operator is added to the neural network model according to the dependency relationship determined between the tasks to be executed, so that the plurality of tasks to be executed of the modified neural network model are executed according to the dependency relationship, and multi-task concurrent synchronization is achieved by adding the synchronization operator at the operator level (the network structure of the neural network model). Compared with the conventional way of realizing multi-task concurrent synchronization with hardware instructions and corresponding hardware mechanisms at the instruction level (hardware mechanism), the method is applicable to various processors, thereby effectively improving universality, ensuring the correctness of data reading and improving data-reading efficiency.
Referring to fig. 5, the present disclosure further provides an arithmetic device 100. The arithmetic device 100 includes: the dependency relationship determining module 110 is configured to determine a dependency relationship between to-be-executed tasks of the neural network model, where the to-be-executed tasks are obtained by splitting tasks of the neural network model, and the to-be-executed tasks are issued to different processing units for operation. And a synchronous operator adding module 120, configured to add a synchronous operator to the neural network model according to the dependency relationship, so that each task to be executed of the modified neural network model is executed according to the dependency relationship.
In one possible implementation, the synchronization operator adding module 120 includes:
and the first adding submodule is used for adding the synchronous operator between a network layer of a current task and a network layer of a previous task which run on different processing units according to the dependency relationship, wherein the current task is a task executed after the previous task is executed.
In a possible implementation, the dependency relationship is determined by whether the storage address intervals of the data required by each task to be executed have an overlapping area.
In one possible implementation, the synchronization operator includes a first operator and a second operator;
the first operator is used for representing the running state of the preceding task;
and the second operator is used for determining whether to run the current task according to the first operator.
In a possible implementation manner, the second operator is configured to read the first operator at a preset time interval, and determine whether to run the current task according to the read first operator.
In one possible implementation, the operation status includes operation incomplete or operation complete.
In one possible implementation, the synchronization operator is stored in a shared memory of the processor.
In one possible implementation, the apparatus further includes:
and the scheduling table generating module is used for generating a scheduling table according to the neural network model added with the synchronous operator, and the scheduling table comprises the tasks to be executed by each processing unit and the dependency relationship of each task to be executed.
In one possible implementation manner, the scheduling table is multiple, and each scheduling table corresponds to each processing unit of the processor.
According to another aspect of the present disclosure, there is provided a computer device, including a memory and a processor, where the memory stores thereon a computer program operable on the processor, and the processor implements the steps of any one of the operation methods when executing the computer program.
According to another aspect of the present disclosure, there is also provided a readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any one of the above operational methods.
According to an aspect of the present disclosure, there is provided a machine learning arithmetic device, which includes one or more of any of the arithmetic devices described above, and is used for acquiring data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to other processing devices through an I/O interface. The other processing devices include, for example: a camera, a display, a mouse, a keyboard, a network card, a Wi-Fi interface and a server. When more than one arithmetic device is included, the arithmetic devices can be linked and transmit data through a specific structure, for example, interconnected and transmitting data through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the arithmetic devices may share the same control system or have their own control systems; they may share a memory or each accelerator may have its own memory. In addition, the interconnection may be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
FIG. 6 shows a block diagram of a combined processing device 200a according to an embodiment of the present disclosure. Referring to fig. 6, the present disclosure also provides a combined processing device 200a, which includes the above machine learning computing device (neural network computing device 210), the universal interconnection interface 220 and the other processing device 230. The machine learning arithmetic unit 210 interacts with the other processing unit 230 to complete the operation designated by the user.
The other processing device 230 includes one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, and the like. The number of processors included in the other processing device 230 is not limited. The other processing device 230 serves as the interface between the machine learning arithmetic device and external data and control, performs data transfer, and completes basic control of the machine learning arithmetic device such as starting and stopping; the other processing device may also cooperate with the machine learning arithmetic device to complete computing tasks.
The universal interconnection interface 220 is used for transmitting data and control commands between the machine learning arithmetic device 210 and the other processing device 230. The machine learning arithmetic device 210 acquires the required input data from the other processing device 230 and writes it into a storage device on the machine learning arithmetic device; it can obtain control instructions from the other processing device 230 and write them into a control cache on the machine learning arithmetic device chip; and it can also read the data in its storage module and transmit it to the other processing device.
Fig. 7 shows a block diagram of a combined processing device 200b according to another embodiment of the present disclosure. Referring to fig. 7, the combined processing device 200b of the present disclosure may further include a storage device 240, which is connected to the machine learning arithmetic device 210 and the other processing device 230, respectively. The storage device 240 is used to store the data of the machine learning arithmetic device 210 and the other processing device 230, and is particularly suitable for data to be computed that cannot be entirely held in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device 200b can serve as a system-on-chip (SoC) for devices such as mobile phones, robots, drones and video surveillance equipment, effectively reducing the core area of the control portion, increasing the processing speed and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface.
In some embodiments, a neural network chip is also disclosed, which includes the above machine learning arithmetic device or combined processing device.
In some embodiments, a chip packaging structure is disclosed, which includes the neural network chip.
In some embodiments, a board card is disclosed, which includes the above chip package structure. Referring to fig. 8, the board card may include other supporting components in addition to the chip 389, including but not limited to: a memory device 390, an interface device 391 and a control device 392.
The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of memory units 393, each group being connected to the chip through a bus. It can be understood that each group of memory units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be transferred on both the rising and the falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of memory units, and each group may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of memory units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
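For reference, the 25600 MB/s figure follows directly from the assumed DDR4-3200 configuration (3200 MT/s on a 64-bit data path), as this back-of-the-envelope check shows:

transfer_rate = 3200 * 10**6            # DDR4-3200: 3200 mega-transfers per second
bus_width_bytes = 64 // 8               # 64 data bits = 8 bytes per transfer
print(transfer_rate * bus_width_bytes)  # 25_600_000_000 B/s, i.e. 25600 MB/s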
In one embodiment, each group of memory units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip and is used for controlling the data transmission and data storage of each memory unit.
The interface device is electrically connected with the chip in the chip package structure and is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transmitted from the server to the chip through the standard PCIE interface to realize data transfer. Preferably, when a PCIE 3.0 x 16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the concrete form of such other interfaces, as long as the interface unit can realize the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., the server) by the interface device.
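Similarly, the 16000 MB/s figure corresponds to the raw rate of a PCIE 3.0 x 16 link (8 GT/s per lane across 16 lanes); the usable payload is slightly lower once 128b/130b encoding overhead is taken into account:

raw_bytes_per_second = 8 * 10**9 * 16 / 8   # 8 GT/s per lane x 16 lanes, in bytes per second
print(raw_bytes_per_second)                 # about 16_000_000_000 B/s, i.e. 16000 MB/s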
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include a plurality of processing chips, a plurality of processing cores or a plurality of processing circuits, and may drive a plurality of loads; therefore, the chip can be in different working states such as multi-load and light load. The control device can regulate and control the working states of the plurality of processing chips, the plurality of processing cores and/or the plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of operation, applied to a processor comprising a plurality of processing units, the method comprising:
determining a dependency relationship between tasks to be executed of a neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation;
and adding a synchronous operator in the neural network model according to the dependency relationship so as to enable each task to be executed of the modified neural network model to be executed according to the dependency relationship.
2. The method of claim 1, wherein adding a synchronization operator to the neural network model according to the dependency comprises:
and adding the synchronous operator between the network layer of the current task and the network layer of the previous task which run on different processing units according to the dependency relationship, wherein the current task is executed after the previous task is executed.
3. The method according to claim 1 or 2, wherein the dependency relationship is determined by whether the storage address intervals of the data required by each task to be executed have overlapping regions.
4. The method of claim 2 or 3, wherein the synchronization operator comprises a first operator and a second operator;
the first operator is used for representing the running state of the preceding task;
and the second operator is used for determining whether to run the current task according to the first operator.
5. An arithmetic device, comprising:
the dependency relationship determining module is used for determining the dependency relationship among tasks to be executed of the neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation;
and the synchronous operator adding module is used for adding a synchronous operator in the neural network model according to the dependency relationship so as to enable each task to be executed of the modified neural network model to be executed according to the dependency relationship.
6. A machine learning arithmetic device, characterized in that the machine learning arithmetic device comprises one or more arithmetic devices according to claim 5, and is used for acquiring data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of arithmetic devices, the arithmetic devices can be connected to one another through a specific structure and transmit data;
wherein the plurality of arithmetic devices are interconnected through a PCIE bus and transmit data, so as to support larger-scale machine learning operations;
the plurality of arithmetic devices share the same control system or have respective control systems;
the plurality of arithmetic devices share a memory or have respective memories;
and the plurality of arithmetic devices are connected in an arbitrary connection topology.
7. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning arithmetic device of claim 6, a universal interconnection interface and other processing devices;
the machine learning arithmetic device interacts with the other processing devices to jointly complete a computing operation specified by the user.
8. A neural network chip, comprising the machine learning arithmetic device of claim 6 or the combined processing apparatus of claim 7.
9. An electronic device, characterized in that the electronic device comprises the neural network chip of claim 8.
10. A board card, characterized in that the board card comprises: a storage device, an interface apparatus, a control device, and the neural network chip of claim 8;
wherein the neural network chip is connected to the storage device, the control device and the interface apparatus respectively;
the storage device is used for storing data;
the interface apparatus is used for realizing data transmission between the neural network chip and an external device;
and the control device is used for monitoring the state of the neural network chip.
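To make the mechanism recited in claims 1-4 concrete, the following is a minimal Python sketch; every name in it (Task, SyncOperator, depends_on, the example layers conv1 and pool1) is an assumption made for illustration and is not taken from the disclosure. It shows tasks obtained by splitting a neural network model, each carrying the storage address intervals of its data; a dependency is derived where those intervals overlap; and a synchronization-operator pair (a first operator recording the running state of the preceding task, a second operator gating the current task on that state) is inserted between dependent tasks issued to different processing units.

# Illustrative sketch only: names are hypothetical, not from the patent text.
import threading
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) storage address range

@dataclass
class Task:
    name: str
    unit: int                 # processing unit the task is issued to
    reads: List[Interval]     # address intervals of the data the task reads
    writes: List[Interval]    # address intervals of the data the task writes

def overlaps(a: Interval, b: Interval) -> bool:
    # Two half-open intervals overlap iff each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def depends_on(cur: Task, prev: Task) -> bool:
    # The current task depends on the preceding task if any address interval
    # it requires overlaps an interval written by the preceding task.
    return any(overlaps(r, w) for r in cur.reads + cur.writes for w in prev.writes)

class SyncOperator:
    """First operator: records that the preceding task has finished running.
    Second operator: blocks the current task until that state is observed."""
    def __init__(self) -> None:
        self._done = threading.Event()

    def first_op(self) -> None:   # appended after the preceding task
        self._done.set()

    def second_op(self) -> None:  # prepended before the current task
        self._done.wait()

def insert_sync_operators(tasks: List[Task]) -> Dict[Tuple[str, str], SyncOperator]:
    # Add a synchronization operator between every dependent pair of tasks that
    # are issued to different processing units (same-unit tasks run in order).
    sync: Dict[Tuple[str, str], SyncOperator] = {}
    for i, cur in enumerate(tasks):
        for prev in tasks[:i]:
            if cur.unit != prev.unit and depends_on(cur, prev):
                sync[(prev.name, cur.name)] = SyncOperator()
    return sync

# Example: a convolution on unit 0 writes a buffer that a pooling task on unit 1
# reads, so one synchronization operator is inserted between the two layers.
conv = Task("conv1", unit=0, reads=[(0, 1024)], writes=[(1024, 2048)])
pool = Task("pool1", unit=1, reads=[(1024, 2048)], writes=[(2048, 2560)])
print(list(insert_sync_operators([conv, pool])))  # [('conv1', 'pool1')]

In this sketch the second operator simply blocks on a flag set by the first operator; a real device would more likely use hardware semaphores or event registers, which is left here as an assumption.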
CN201910263147.2A 2019-04-02 2019-04-02 Operation method, device and related product Active CN111767995B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910263147.2A CN111767995B (en) 2019-04-02 2019-04-02 Operation method, device and related product
PCT/CN2020/082831 WO2020200250A1 (en) 2019-04-02 2020-04-01 Operation method and apparatus, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910263147.2A CN111767995B (en) 2019-04-02 2019-04-02 Operation method, device and related product

Publications (2)

Publication Number Publication Date
CN111767995A true CN111767995A (en) 2020-10-13
CN111767995B CN111767995B (en) 2023-12-05

Family

ID=72718498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910263147.2A Active CN111767995B (en) 2019-04-02 2019-04-02 Operation method, device and related product

Country Status (1)

Country Link
CN (1) CN111767995B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102624870A (en) * 2012-02-01 2012-08-01 北京航空航天大学 Intelligent optimization algorithm based cloud manufacturing computing resource reconfigurable collocation method
US20180032336A1 (en) * 2016-08-01 2018-02-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Processor and method for executing instructions on processor
WO2018058427A1 (en) * 2016-09-29 2018-04-05 北京中科寒武纪科技有限公司 Neural network computation apparatus and method
CN107886167A (en) * 2016-09-29 2018-04-06 北京中科寒武纪科技有限公司 Neural network computing device and method
WO2018113553A1 (en) * 2016-12-21 2018-06-28 杭州海康威视数字技术股份有限公司 Image analysis method and device
US20180365512A1 (en) * 2017-06-20 2018-12-20 Nvidia Corporation Equivariant landmark transformation for landmark localization
US10127494B1 (en) * 2017-08-02 2018-11-13 Google Llc Neural network crossbar stack
CN109189474A (en) * 2018-02-05 2019-01-11 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing vector adduction instruction
CN109345377A (en) * 2018-09-28 2019-02-15 北京九章云极科技有限公司 A kind of generating date system and Real-time Data Processing Method
CN109376017A (en) * 2019-01-07 2019-02-22 人和未来生物科技(长沙)有限公司 Cloud computing platform task processing method, system and its application method based on container

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Weining; LIU Bo; SUN Dihua: "Multi-task-oriented manufacturing cloud service composition", Computer Integrated Manufacturing Systems, No. 01 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329834A (en) * 2020-10-29 2021-02-05 北京百度网讯科技有限公司 Video memory space distribution method and device during cyclic network model training
CN112329834B (en) * 2020-10-29 2023-08-01 北京百度网讯科技有限公司 Method and device for distributing video memory space during training of cyclic network model
CN112559054A (en) * 2020-12-22 2021-03-26 上海壁仞智能科技有限公司 Method and computing system for synchronizing instructions
CN112559054B (en) * 2020-12-22 2022-02-01 上海壁仞智能科技有限公司 Method and computing system for synchronizing instructions
CN114048030A (en) * 2021-11-09 2022-02-15 北京百度网讯科技有限公司 Method and device for scheduling operator

Also Published As

Publication number Publication date
CN111767995B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
US11809360B2 (en) Network-on-chip data processing method and device
CN112799726B (en) Data processing device, method and related product
CN111767995B (en) Operation method, device and related product
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
CN111767121B (en) Operation method, device and related product
CN109726800B (en) Operation method, device and related product
CN109740746B (en) Operation method, device and related product
CN113556242B (en) Method and equipment for performing inter-node communication based on multi-processing nodes
CN111767078B (en) Data operation method, device and related product
CN111209230B (en) Data processing device, method and related product
CN111767999B (en) Data processing method and device and related products
CN111340202B (en) Operation method, device and related product
CN114281558A (en) Multi-core processor, method for multi-core processor and corresponding product
CN111210011B (en) Data processing device and related product
CN111338694B (en) Operation method, device, computer equipment and storage medium
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
WO2020192587A1 (en) Artificial intelligence computing device and related product
WO2023236479A1 (en) Method for executing task scheduling and related products thereof
CN111384944B (en) Full adder, half adder, data processing method, chip and electronic equipment
CN111209245A (en) Data processing device, method and related product
CN111813537A (en) Operation method, device and related product
CN117908959A (en) Method for performing atomic operations and related products
CN118363754A (en) Splitting method of single operator on multi-core processor and related product
CN111047027A (en) Operation method, device and related product
CN117311812A (en) Method for reordering buffer and related products thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TG01 Patent term adjustment