CN111767995B - Operation method, device and related product - Google Patents

Operation method, device and related product

Info

Publication number
CN111767995B
CN111767995B (application CN201910263147.2A)
Authority
CN
China
Prior art keywords
operator
task
executed
neural network
tasks
Prior art date
Legal status
Active
Application number
CN201910263147.2A
Other languages
Chinese (zh)
Other versions
CN111767995A (en)
Inventor
Name withheld upon request
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910263147.2A priority Critical patent/CN111767995B/en
Priority to PCT/CN2020/082831 priority patent/WO2020200250A1/en
Publication of CN111767995A publication Critical patent/CN111767995A/en
Application granted granted Critical
Publication of CN111767995B publication Critical patent/CN111767995B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065 Analogue means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Multi Processors (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an operation method, an operation device, and a related product. The product includes a control module, and the control module includes an instruction cache unit, an instruction processing unit, and a storage queue unit. The instruction cache unit stores computation instructions related to the artificial neural network operation; the instruction processing unit parses the computation instructions to obtain a plurality of operation instructions; and the storage queue unit stores an instruction queue that includes a plurality of operation instructions or computation instructions to be executed in queue order. By this means, the operation efficiency of the related product when running a neural network model can be improved.

Description

Operation method, device and related product
Technical Field
The disclosure relates to the technical field of deep learning, in particular to an operation method, an operation device and related products.
Background
In the field of deep learning, concurrent synchronization of multiple tasks is typically achieved at the instruction level by means of hardware instructions and corresponding hardware mechanisms. However, this approach is not applicable to all processors, which gives it certain limitations and low versatility.
Disclosure of Invention
In view of this, the disclosure provides an operation method, an operation device and a related product.
According to an aspect of the present disclosure, there is provided an operation method applied to a processor including a plurality of processing units, the method including: determining the dependency relationship among all tasks to be executed of a neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation; and adding a synchronization operator into the neural network model according to the dependency relationship so that each task to be executed of the modified neural network model is executed according to the dependency relationship.
In a possible implementation manner, adding a synchronization operator in the neural network model according to the dependency relationship includes: and adding the synchronization operator between a network layer of a current task and a network layer of a preceding task running on different processing units according to the dependency relationship, wherein the current task is a task executed after the execution of the preceding task is finished.
In one possible implementation, the dependency relationship is determined by whether the storage address intervals of the data required by each task to be performed have overlapping regions.
In one possible implementation, the synchronization operator includes a first operator and a second operator; the first operator is used for representing the running state of the previous task; and the second operator is used for determining whether to run the current task according to the first operator.
In a possible implementation manner, the second operator is configured to read the first operator at preset time intervals and determine whether to run the current task according to the read first operator.
In one possible implementation, the running state includes running incomplete or running complete.
In one possible implementation, the synchronization operator is stored in a shared memory of the processor.
In one possible implementation, the method further includes: generating a schedule according to the neural network model added with the synchronization operator, wherein the schedule comprises tasks to be executed by the processing units and the dependency relationship of the tasks to be executed.
In one possible implementation, there are a plurality of schedules, and each schedule corresponds to a respective processing unit of the processor.
According to another aspect of the present disclosure, there is provided an arithmetic device, including: the dependency relationship determining module is used for determining the dependency relationship among the tasks to be executed of the neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation; and the synchronous operator adding module is used for adding a synchronous operator into the neural network model according to the dependency relationship so as to enable each task to be executed of the modified neural network model to be executed according to the dependency relationship.
In one possible implementation, the synchronization operator adding module includes: and the first adding sub-module is used for adding the synchronous operator between the network layer of the current task and the network layer of the previous task running on different processing units according to the dependency relationship, wherein the current task is the task executed after the execution of the previous task is finished.
In one possible implementation, the dependency relationship is determined by whether the storage address intervals of the data required by each task to be performed have overlapping regions.
In one possible implementation, the synchronization operator includes a first operator and a second operator; the first operator is used for representing the running state of the previous task; and the second operator is used for determining whether to run the current task according to the first operator.
In a possible implementation manner, the second operator is configured to read the first operator at preset time intervals and determine whether to run the current task according to the read first operator.
In one possible implementation, the running state includes running incomplete or running complete.
In one possible implementation, the synchronization operator is stored in a shared memory of the processor.
In one possible implementation, the apparatus further includes: the schedule generating module is used for generating a schedule according to the neural network model added with the synchronization operator, wherein the schedule comprises tasks to be executed, which are to be executed by the processing units, and the dependency relationship of the tasks to be executed.
In one possible implementation, there are a plurality of schedules, and each schedule corresponds to a respective processing unit of the processor.
According to another aspect of the present disclosure, there is provided a computer device including a memory, a processor, the memory having stored thereon a computer program executable on the processor, the processor implementing the steps of any one of the above-mentioned operation methods when executing the computer program.
According to another aspect of the present disclosure, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the above-described operation methods.
According to another aspect of the present disclosure, there is provided a machine learning operation device, including one or more of any of the above operation devices, configured to acquire data to be operated on and control information from other processing devices, perform the specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
When the machine learning computing device comprises a plurality of computing devices, the computing devices can be connected through a specific structure and transmit data;
the plurality of computing devices are interconnected through a PCIE bus and transmit data, so as to support larger-scale machine learning operations;
the plurality of arithmetic devices share the same control system or have respective control systems;
a plurality of computing devices share a memory or have respective memories;
the interconnection mode of the plurality of arithmetic devices is any interconnection topology.
According to another aspect of the present disclosure, there is provided a combination processing apparatus including the above machine learning arithmetic apparatus, a general-purpose interconnect interface, and other processing apparatuses; the machine learning operation device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
In one possible implementation manner, the combination processing device further includes: a storage device; the storage device is respectively connected with the machine learning operation device and the other processing device and is used for storing data of the machine learning operation device and the other processing device.
According to another aspect of the present disclosure, there is provided a neural network chip including the above machine learning arithmetic device or the combination processing device.
According to another aspect of the present disclosure, there is provided an electronic device including the above neural network chip.
According to another aspect of the present disclosure, there is provided a board including: a memory device, an interface device, and a control device, and the neural network chip;
the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the neural network chip and external equipment;
the control device is used for monitoring the state of the neural network chip.
In one possible implementation, the memory device includes a plurality of groups of storage units, each group of storage units is connected with the neural network chip through a bus, and the storage units are DDR SDRAM; the chip includes a DDR controller for controlling data transmission and data storage of each storage unit; and the interface device is a standard PCIE interface.
According to the embodiments of the disclosure, after the neural network model is split into a plurality of tasks to be executed and the tasks to be executed are issued to different processing units, a synchronization operator is added to the neural network model according to the dependency relationships determined among the tasks to be executed, so that the tasks to be executed of the modified neural network model are executed according to those dependency relationships. This achieves concurrent synchronization of multiple tasks by adding a synchronization operator at the operator level (the network structure of the neural network model). Compared with the conventional approach of concurrent synchronization using hardware instructions and corresponding hardware mechanisms at the instruction level (hardware mechanisms), this method can be applied to various processors, so the versatility is effectively improved.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow chart of an operation method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a plurality of tasks to be performed issued to different processing units in an operation method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a plurality of tasks to be performed issued to different processing units in an operation method according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram of a plurality of tasks to be performed issued to different processing units in an operation method according to an embodiment of the disclosure;
FIG. 5 illustrates a block diagram of a computing device according to an embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of a combined processing apparatus according to an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of another combined processing apparatus according to an embodiment of the disclosure;
FIG. 8 shows a block diagram of a board card according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
First, it should be noted that the operation method of the present disclosure may be applied to a general-purpose processor, such as a CPU (Central Processing Unit), and may also be applied to an artificial intelligence processor. An artificial intelligence processor (IPU) refers to a processor for performing artificial intelligence operations, such as one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-network Processing Unit), a DSP (Digital Signal Processing unit), and an FPGA (Field-Programmable Gate Array) chip. The present disclosure does not limit the specific type of artificial intelligence processor.
Meanwhile, the processor mentioned in the present disclosure may include a plurality of processing units (which may also simply be referred to as cores), and each processing unit can independently run the tasks assigned to it, for example, a convolution operation task, a pooling task, or a fully connected task. The tasks run by a processing unit are not specifically limited here.
Fig. 1 shows a flowchart of an operation method according to an embodiment of the present disclosure. Referring to fig. 1, the operation method of the present disclosure includes:
Step S100, determining the dependency relationships among the tasks to be executed of the neural network model.
It should be noted here that the neural network model may be any of various artificial intelligence models. That is, the neural network model may be one or more of a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a BiRNN (Bidirectional RNN), a GRU (Gated Recurrent Unit), an LSTM (Long Short-Term Memory network), and the like. The specific type of neural network model is not limited here.
Meanwhile, a task to be executed refers to a task determined after the neural network model to be run is split into tasks according to its content. There are a plurality of tasks to be executed, and each task to be executed is issued to a different processing unit for running. That is, by splitting the neural network model to be run, the neural network model can be run on different processing units in a model-parallel manner.
It should be noted here that, as will be understood by those skilled in the art, model parallelism is that different processing units in a distributed system are responsible for different parts of the neural network model, for example, different network layers of the neural network model are assigned to different processing units, or different parameters within the same layer are assigned to different processing units, that is, the characteristics of operators inherent to the model are split to operate on different processing units.
Step S200, adding a synchronization operator to the neural network model according to the dependency relationship, so that each task to be executed of the modified neural network model is executed according to the dependency relationship. The synchronization operator is used to represent the running state of the preceding task and to determine whether to run the current task according to that running state. The preceding task and the current task are two tasks to be executed, among the plurality of tasks to be executed, that have a dependency relationship.
When a neural network model is run in a model-parallel manner, the neural network model is split according to its content and run concurrently on different processing units, which requires scheduling of the plurality of tasks to be executed. When scheduling is complete, the tasks to run on the various processing units (i.e., the tasks to be executed on each processing unit) are determined. The tasks to be executed in the different task streams have certain logical dependency relationships (for example, task B cannot be executed until task A has been executed). Therefore, according to the dependency relationships among the tasks to be executed, synchronization operators can be added to the neural network model so that the tasks to be executed run according to those dependency relationships.
Therefore, after the neural network model is split into a plurality of tasks to be executed and the tasks to be executed are issued to different processing units, a synchronization operator is added to the neural network model according to the dependency relationships determined among the tasks to be executed, so that the tasks to be executed of the modified neural network model are executed according to those dependency relationships. This achieves concurrent synchronization of multiple tasks by adding a synchronization operator at the operator level (the network structure of the neural network model). Compared with the conventional approach of concurrent synchronization using hardware instructions and corresponding hardware mechanisms at the instruction level (hardware mechanisms), this method can be applied to various processors, so the versatility is effectively improved.
It should be noted that, in the present disclosure, synchronization operators are added according to the dependency relationships among the tasks to be executed. Thus, the number of synchronization operators may be more than one: as many synchronization operators can be added as there are dependency relationships among the tasks to be executed.
For example, fig. 2 is a schematic diagram illustrating a plurality of tasks to be executed issued to different processing units in an operation method according to an embodiment of the present disclosure. Referring to fig. 2, by splitting a neural network model into tasks, five tasks to be executed are determined: task A, task B, task C, task D, and task E. Task A has a dependency relationship with task B, task C, task D, and task E. The processor running the neural network model is correspondingly configured with four processing units, namely core 1, core 2, core 3, and core 4. Task A and task B are issued to core 1 for running, task C is issued to core 2, task D is issued to core 3, and task E is issued to core 4. Therefore, when synchronization operators are added to the neural network model, they can be added between the network layers corresponding to task A and task B, between those corresponding to task A and task C, between those corresponding to task A and task D, and between those corresponding to task A and task E, so that the execution topology of the neural network model is the same before and after the synchronization operators are inserted, ensuring the correctness of the neural network model's operation.
In one possible implementation, adding a synchronization operator to the neural network model according to the dependency relationship may include: according to the dependency relationship, adding a synchronization operator between the network layer of the current task and the network layer of the preceding task when they run on different processing units. It should be noted that the current task is a task executed after the execution of the preceding task is completed.
That is, when adding a synchronization operator to the neural network model, it may be added between the network layers corresponding to a preceding task and a current task that have a dependency relationship. The current task is a task executed after the execution of the preceding task is completed. Meanwhile, the preceding task and the current task may run on different processing units.
For example, fig. 3 is a schematic diagram illustrating a plurality of tasks to be executed issued to different processing units in an operation method according to an embodiment of the present disclosure. Referring to fig. 3, three tasks to be executed, namely task A, task B, and task C, are determined by splitting a neural network model into tasks. Task A has a dependency relationship with task B (i.e., task A is a preceding task of task B, and task B is a current task), and task A also has a dependency relationship with task C (i.e., task A is a preceding task of task C, and task C is a current task). The processor running the neural network model is configured with two processing units, core 1 and core 2. Task A and task B are issued to core 1 for running, and task C is issued to core 2 for running. Therefore, when adding synchronization operators to the neural network model, a synchronization operator may be added only between the network layer corresponding to task A and the network layer corresponding to task C. As for task A and task B, since both are issued to the same processing unit (core 1), that processing unit can directly run task A and task B according to the dependency relationship between them, so no synchronization operator needs to be added between the network layers corresponding to task A and task B.
Therefore, when the synchronization operator is added between the network layers corresponding to the previous task and the current task with the dependency relationship, the synchronization operator is only added between the two tasks to be executed with the dependency relationship and respectively running on different processing units, so that the unnecessary addition of the synchronization operator is avoided, the addition operation of the synchronization operator is simplified, the addition time of the synchronization operator is saved, and finally the efficiency of concurrent synchronization of multiple tasks is effectively improved.
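As a minimal sketch of where synchronization operators would be inserted under this rule, consider the following Python illustration; the Task structure and the function names below are assumptions made for illustration, not the disclosure's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Task:
    name: str
    unit: int                                             # processing unit (core) the task is issued to
    depends_on: List[str] = field(default_factory=list)   # names of preceding tasks

def sync_operator_sites(tasks: List[Task]) -> List[Tuple[str, str]]:
    """Return (preceding, current) pairs whose network layers need a synchronization operator.

    An operator is needed only when two dependent tasks run on different processing
    units; tasks issued to the same unit already run there in dependency order.
    """
    by_name: Dict[str, Task] = {t.name: t for t in tasks}
    sites: List[Tuple[str, str]] = []
    for current in tasks:
        for preceding in (by_name[n] for n in current.depends_on):
            if preceding.unit != current.unit:
                sites.append((preceding.name, current.name))
    return sites

# FIG. 3 example: A and B are issued to core 1, C to core 2; B and C depend on A.
tasks = [Task("A", 1), Task("B", 1, ["A"]), Task("C", 2, ["A"])]
print(sync_operator_sites(tasks))  # [('A', 'C')] - no operator needed between A and B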
In one possible implementation, the dependency relationship may be determined by whether the storage address intervals of the data required for each task to be performed have overlapping regions. Among the multiple tasks to be executed obtained after splitting the neural network model, if the storage address intervals of data (which may be a matrix) required by each of the two tasks to be executed have overlapping areas, it may be determined that the two tasks to be executed have a dependency relationship (for example, a storage address interval of data required by a previous task and a storage address interval of data required by a current task have overlapping areas). If the storage address intervals of the data required by the two tasks to be executed do not have overlapping areas, it can be determined that the two tasks to be executed do not have a dependency relationship.
Such as: the plurality of tasks to be executed comprise a first task to be executed and a second task to be executed. When the storage address region of the data required by the first task to be executed and the storage address region of the data required by the second task to be executed have overlapping regions, determining that the first task to be executed and the second task to be executed have a dependency relationship. And when the storage address interval of the data required by the first task to be executed and the storage address interval of the data required by the second task to be executed do not have an overlapping area, determining that the first task to be executed and the second task to be executed do not have a dependency relationship.
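A minimal sketch of this overlap test, assuming tasks declare their required data as half-open (start, end) address intervals (the representation and names are illustrative assumptions only):

```python
def intervals_overlap(a, b):
    """True if two half-open address intervals [start, end) share any address."""
    return a[0] < b[1] and b[0] < a[1]

def have_dependency(intervals_a, intervals_b):
    """Two tasks to be executed have a dependency relationship if any storage
    address interval required by one overlaps an interval required by the other."""
    return any(intervals_overlap(a, b) for a in intervals_a for b in intervals_b)

# Preceding task uses [0x1000, 0x2000); current task uses [0x1800, 0x2800) -> dependency.
print(have_dependency([(0x1000, 0x2000)], [(0x1800, 0x2800)]))  # True
print(have_dependency([(0x1000, 0x2000)], [(0x3000, 0x4000)]))  # False
```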
In one possible implementation, the dependency relationship may also include an input-output relationship between tasks to be executed. That is, the preceding task and the current task may have an input-output relationship, in which the output result of the preceding task is the input data of the current task. Thus, the current task cannot be run without running the preceding task; in other words, the current task can be run only after the preceding task has been run.
When the dependency relationship includes an input-output relationship among a plurality of tasks to be executed, adding a synchronization operator to the neural network model according to the dependency relationship may include: adding a synchronization operator between the network layers corresponding to the two tasks to be executed that have the input-output relationship.
For example, by splitting a certain neural network model into tasks, two tasks to be executed are determined, namely task A and task B, and the output result of task A is the input data of task B. Meanwhile, the processor running the neural network model is configured with two processing units, core 1 and core 2, with task A issued to core 1 for running and task B issued to core 2 for running. Therefore, when adding a synchronization operator to the neural network model, the synchronization operator may be added between the network layer corresponding to task A and the network layer corresponding to task B.
Therefore, according to the input-output relation among the tasks to be executed, a synchronization operator is added between the network layers corresponding to the two tasks to be executed with the input-output relation, and the accuracy of the neural network model operation is further ensured.
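A brief sketch of deriving such a dependency from declared producer and consumer buffers follows; the buffer-name representation is an assumption made for illustration:

```python
def io_dependency(preceding_outputs, current_inputs):
    """The current task depends on the preceding task when it consumes any
    buffer that the preceding task produces."""
    return bool(set(preceding_outputs) & set(current_inputs))

# Task A produces "feat0"; task B consumes "feat0" and "weights1" -> B depends on A.
print(io_dependency(["feat0"], ["feat0", "weights1"]))  # True
```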
In one possible implementation, the dependency relationship may also include the running order of the plurality of tasks to be executed. That is, after the neural network model is split into a plurality of tasks to be executed and those tasks are issued to different processing units, the running order among the tasks to be executed is also determined, since the structure of the neural network model is fixed. The plurality of tasks to be executed may include two tasks whose operations are concurrent (i.e., two tasks to be executed that may run in parallel). If, in practice, a running order needs to be set between two tasks to be executed that would otherwise run in parallel (for example, by setting priorities), then when adding synchronization operators to the neural network model according to the dependency relationship, a synchronization operator can also be added between the network layers corresponding to two tasks to be executed that are adjacent in the running order.
For example, fig. 4 is a schematic diagram illustrating a plurality of tasks to be executed issued to different processing units in an operation method according to an embodiment of the present disclosure. Referring to fig. 4, by splitting a neural network model into tasks, five tasks to be executed are determined: task A, task B, task C, task D, and task E. Task A and task B are executed in parallel, while task A and task C are executed serially (task A first, task C after), task A and task D are executed serially (task A first, task D after), task B and task C are executed serially (task B first, task C after), and task B and task E are executed serially (task B first, task E after).
The processor running the neural network model is correspondingly configured with three processing units, namely core 1, core 2, and core 3. Task A and task D are issued to core 1 for running, task B and task E are issued to core 2 for running, and task C is issued to core 3 for running. Meanwhile, according to the actual situation, a running order is set for task A and task B, which would otherwise execute in parallel, such that task B is executed after task A is executed.
Thus, when adding the synchronization operators to the neural network model, the synchronization operators can be added between the network layers corresponding to the task A and the task B, between the network layers corresponding to the task A and the task C, and between the network layers corresponding to the task B and the task C.
By adding a synchronization operator between two tasks to be executed that are adjacent in running order when the dependency relationship includes the running order of the tasks to be executed, the correctness of the neural network model's operation is further ensured.
In one possible implementation, the synchronization operator may include a first operator and a second operator. The first operator is used to represent the running state of the preceding task, and the second operator is used to determine whether to run the current task according to the first operator. The preceding task and the current task run on different processing units, and the preceding task is the task executed before the current task.
That is, the synchronization operator may be implemented as a pair of operators (a first operator and a second operator). The first operator is added after the network layer corresponding to the preceding task and is used to represent the running state of the preceding task. The second operator is added in a pair with the first operator and is placed before the network layer corresponding to the current task; it is used to determine whether to run the current task according to the first operator. Implementing the synchronization operator as a pair of operators is structurally simple and easy to realize.
In one possible implementation manner, when determining whether to run the current task according to the first operator, the second operator may read the first operator at preset time intervals and determine whether to run the current task according to the value it reads.
That is, the second operator may read the first operator at preset time intervals and determine, from the currently read value, whether the preceding task has finished running. After it is determined that the preceding task has finished running, the current task can be run. If it is determined that the preceding task has not run or has not finished running, the current task is not run for the time being.
By having the second operator read the first operator at preset time intervals, frequent reads of the first operator are avoided and the number of reads is reduced, which effectively reduces power consumption.
In one possible implementation, the running state includes either running incomplete or running complete. The first operator can be provided with a flag bit, and the different running states of the preceding task are represented by different values of the flag bit.
For example, the first operator may be notify and the second operator may be sync. The synchronization operator performs no computation; it is only an operator-level lock, implemented by means of a flag bit. notify is used as follows: the flag is set to 1 when the preceding task's computation (running) is complete, and defaults to 0 while the computation is incomplete and before the task runs (waiting to run). The rule for sync is: read the value of the notify (first operator) flag bit at intervals; if 1 is read, proceed, so the running state of the current task becomes running; if nothing is read or 0 is read, wait, so the running state of the current task remains waiting to run.
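A minimal sketch of this notify/sync pair, using Python threads and a plain shared flag to stand in for the flag bit in shared memory (the names and the threading model are illustrative assumptions, not the disclosure's hardware behavior):

```python
import threading
import time

class SyncFlag:
    """Stand-in for the flag bit kept in shared memory; 0 = running incomplete, 1 = running complete."""
    def __init__(self):
        self.value = 0

def notify(flag):
    """First operator: set the flag bit to 1 once the preceding task finishes running."""
    flag.value = 1

def sync(flag, poll_interval_s=0.001):
    """Second operator: read the flag bit at preset time intervals; return (so the
    current task may run) only after a 1 is read."""
    while flag.value != 1:
        time.sleep(poll_interval_s)

flag = SyncFlag()

def preceding_task():
    time.sleep(0.01)   # stand-in for the preceding task's computation
    notify(flag)       # running complete -> flag bit set to 1

def current_task():
    sync(flag)         # wait until the preceding task reports completion
    print("current task running")

t1 = threading.Thread(target=preceding_task)
t2 = threading.Thread(target=current_task)
t2.start(); t1.start()
t1.join(); t2.join()
```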
By providing the first operator with a flag bit and representing the running state of the preceding task with the value of that flag bit, the synchronization operator added to the neural network model exists only as an operator-level lock and does not participate in the operation of the neural network model. Therefore, synchronization does not change the dependencies, and the correctness of the network topology of the neural network model is also ensured.
In one possible implementation, the synchronization operator may be stored in a shared memory of the processor. Therefore, when the processor runs the neural network model, each processing unit can directly read the corresponding synchronous operator from the shared memory to determine the running sequence of each task to be executed, and then the processing units run sequentially according to the determined running sequence.
By storing the added synchronization operator in the shared memory, the hardware configuration of the processor is kept simple and no hardware changes to the processor are required, which effectively saves hardware cost.
In one possible implementation, after adding the synchronization operator in the neural network according to the dependency relationship, the method may further include: and generating a schedule according to the neural network model added with the synchronization operator. It should be noted that, the generated schedule may be stored in the shared memory of the processor. In addition, the schedule includes tasks to be executed by each processing unit, and the dependency relationship of each task to be executed.
By generating a schedule according to the neural network model with the synchronization operator added, and storing the generated schedule in the shared memory of the processor, each processing unit can directly read the schedule and the corresponding synchronization operators from the shared memory when running its portion of the neural network model's tasks, and then run each task to be executed in order according to the schedule and synchronization operators it has read. This effectively improves the data reading speed and also speeds up the operation.
In one possible implementation, there may be a plurality of schedules, each corresponding to a respective processing unit of the processor. In this way, when a processing unit reads a schedule from the shared memory and runs its tasks to be executed accordingly, it can read only its own schedule rather than all of the data, which effectively reduces the amount of data read and further improves the running speed.
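One possible way to derive per-unit schedules from the modified model is sketched below; the schedule format and names are assumptions made for illustration:

```python
from collections import defaultdict

def build_schedules(tasks):
    """tasks: iterable of (name, unit, depends_on) tuples.
    Group the tasks to be executed by processing unit, keeping each task's
    dependency list so the unit knows which synchronization operators to wait on."""
    schedules = defaultdict(list)
    for name, unit, depends_on in tasks:
        schedules[unit].append({"task": name, "depends_on": list(depends_on)})
    return dict(schedules)  # one schedule per processing unit

# FIG. 3 example: A and B on core 1, C on core 2; B and C depend on A.
print(build_schedules([("A", 1, []), ("B", 1, ["A"]), ("C", 2, ["A"])]))
# {1: [{'task': 'A', 'depends_on': []}, {'task': 'B', 'depends_on': ['A']}],
#  2: [{'task': 'C', 'depends_on': ['A']}]}
```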
In summary, according to the operation method of the present disclosure, when a neural network model is run on a processor in a model-parallel manner, a synchronization operator is added to the neural network model according to the dependency relationships determined among the tasks to be executed, so that the plurality of tasks to be executed of the modified neural network model are executed according to those dependency relationships. This achieves concurrent synchronization of multiple tasks by adding a synchronization operator at the operator level (the network structure of the neural network model). Compared with the conventional approach of concurrent synchronization using hardware instructions and corresponding hardware mechanisms at the instruction level (hardware mechanisms), this method can be applied to various processors, which effectively improves versatility, ensures the correctness of data reading, and improves data reading efficiency.
Referring to fig. 5, the present disclosure further provides an arithmetic device 100. The arithmetic device 100 includes: the dependency relationship determining module 110 is configured to determine a dependency relationship between tasks to be executed of the neural network model, where the tasks to be executed are obtained by splitting tasks of the neural network model, and the tasks to be executed are issued to different processing units for running. And the synchronization operator adding module 120 is configured to add a synchronization operator to the neural network model according to the dependency relationship, so that each task to be executed of the modified neural network model is executed according to the dependency relationship.
In one possible implementation, the synchronization operator adding module 120 includes:
and the first adding sub-module is used for adding the synchronous operator between the network layer of the current task and the network layer of the previous task running on different processing units according to the dependency relationship, wherein the current task is the task executed after the execution of the previous task is finished.
In one possible implementation, the dependency relationship is determined by whether the storage address intervals of the data required for each task to be performed have overlapping regions.
In one possible implementation, the synchronization operator includes a first operator and a second operator;
The first operator is used for representing the running state of the previous task;
and the second operator is used for determining whether to run the current task according to the first operator.
In a possible implementation manner, the second operator is configured to read the first operator at preset time intervals and determine whether to run the current task according to the read first operator.
In one possible implementation, the running state includes running incomplete or running complete.
In one possible implementation, the synchronization operator is stored in a shared memory of the processor.
In one possible implementation, the apparatus further includes:
the schedule generating module is used for generating a schedule according to the neural network model added with the synchronization operator, wherein the schedule comprises tasks to be executed, which are to be executed by the processing units, and the dependency relationship of the tasks to be executed.
In one possible implementation, there are a plurality of schedules, and each schedule corresponds to a respective processing unit of the processor.
According to another aspect of the present disclosure, there is provided a computer device including a memory, a processor, the memory having stored thereon a computer program executable on the processor, the processor implementing the steps of any of the above-described operation methods when the computer program is executed.
According to another aspect of the present disclosure, there is also provided a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the above-described operation methods.
According to an aspect of the present disclosure, there is provided a machine learning computing device including one or more of any of the computing devices described above, for acquiring data to be operated on and control information from other processing devices, performing specified machine learning operations, and transmitting the execution results to the other processing devices through I/O interfaces. Other processing devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one computing device is included, the computing devices may be linked and transfer data through a specific structure, for example, interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the computing devices may share the same control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection mode may be any interconnection topology.
The machine learning operation device has higher compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 6 shows a block diagram of a combination processing device 200a according to an embodiment of the disclosure. Referring to fig. 6, the present disclosure also provides a combination processing device 200a, which includes the machine learning computing device 210, the universal interconnect interface 220, and the other processing device 230. The machine learning computing device 210 interacts with other processing devices 230 to collectively perform the user-specified operations.
The other processing device 230 includes one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing device 230 is not limited. The other processing device 230 serves as the interface between the machine learning computing device and external data and control, performing basic control such as data transfer and starting and stopping the machine learning computing device; the other processing device may also cooperate with the machine learning computing device to complete computing tasks.
The universal interconnect interface 220 is used for transferring data and control instructions between the machine learning computing device 210 and the other processing device 230. The machine learning computing device 210 acquires the necessary input data from the other processing device 230 and writes it to a memory device on the machine learning computing device's chip; it may obtain control instructions from the other processing device 230 and write them to a control cache on the machine learning computing device's chip; it may also read the data in the storage module of the machine learning computing device and transmit it to the other processing device.
Fig. 7 shows a block diagram of a combination processing device 200b according to another embodiment of the present disclosure. Referring to fig. 7, the combination processing device 200b of the present disclosure may further include a storage device 240, which is connected to the machine learning computing device 210 and the other processing device 230, respectively. The storage device 240 is used for storing data of the machine learning computing device 210 and the other processing device 230, and is particularly suitable for data that, among the data required for the operation, cannot be entirely stored in the internal storage of the machine learning computing device or the other processing device.
The combined processing device 200b can be used as an SOC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing the processing speed, and reducing overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the device, such as cameras, displays, mice, keyboards, network cards, and WiFi interfaces.
In some embodiments, a neural network chip is also disclosed, which includes the machine learning computing device or the combination processing device.
In some embodiments, a chip package structure is disclosed that includes the neural network chip described above.
In some embodiments, a board card is disclosed that includes the above-described chip package structure. Referring to fig. 8, fig. 8 provides a board card that may include, in addition to the chip 389, other supporting components, including but not limited to: a memory device 390, an interface device 391, and a control device 392.
The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include multiple groups of storage units 393. Each group of storage units is connected with the chip through a bus. It is understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the storage units. Each group of storage units may include a plurality of DDR4 particles (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, where 64 of the 72 bits are used to transfer data and 8 bits are used for ECC checking. It is understood that the theoretical bandwidth of data transfer can reach 25600 MB/s when DDR4-3200 particles are used in each group of storage units.
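For reference, this figure follows directly from the data rate and data bus width, assuming the 64 data bits of each controller carry the payload: 3200 MT/s × 64 bit ÷ 8 bit per byte = 3200 × 8 MB/s = 25600 MB/s.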
In one embodiment, each set of memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each storage unit.
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for enabling data transmission between the chip and an external device, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transferred from the server to the chip through the standard PCIE interface to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the computation results of the chip are transmitted back to the external device (for example, a server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads; therefore, the chip can be in different working states such as multi-load and light load. The control device can regulate and control the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement of the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (27)

1. A method of operation for a processor, the processor comprising a plurality of processing units, the method comprising:
determining the dependency relationship among all tasks to be executed of a neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation;
and adding a synchronization operator into the neural network model according to the dependency relationship so that each task to be executed of the modified neural network model is executed according to the dependency relationship.
2. The method according to claim 1, wherein adding a synchronization operator in the neural network model according to the dependency relationship comprises:
and adding the synchronization operator between a network layer of a current task and a network layer of a preceding task running on different processing units according to the dependency relationship, wherein the current task is a task executed after the execution of the preceding task is finished.
3. A method according to claim 1 or 2, wherein the dependency relationship is determined by whether the storage address intervals of the data required for each of the tasks to be performed have overlapping areas.
4. The method of claim 2, wherein the synchronization operator comprises a first operator and a second operator;
the first operator is used for representing the running state of the previous task;
and the second operator is used for determining whether to run the current task according to the first operator.
5. The method of claim 4, wherein the second operator is configured to read a first operator at a preset time interval, and determine whether to run the current task according to the read first operator.
6. The method of claim 4, wherein the operational status comprises operational incompletion or operational completion.
7. The method of claim 1, wherein the synchronization operator is stored in a shared memory of the processor.
8. The method according to claim 1, wherein the method further comprises:
generating a schedule according to the neural network model added with the synchronization operator, wherein the schedule comprises tasks to be executed by the processing units and the dependency relationship of the tasks to be executed.
9. The method of claim 8, wherein the plurality of schedules each correspond to a respective processing unit of the processor.
10. An arithmetic device, the device comprising a processor, the processor comprising:
the dependency relationship determining module is used for determining the dependency relationship among the tasks to be executed of the neural network model, wherein the tasks to be executed are obtained by splitting the tasks of the neural network model, and the tasks to be executed are issued to different processing units for operation;
and the synchronous operator adding module is used for adding a synchronous operator into the neural network model according to the dependency relationship so as to enable each task to be executed of the modified neural network model to be executed according to the dependency relationship.
11. The operation device of claim 10, wherein the synchronization operator adding module comprises:
a first adding sub-module, configured to add the synchronization operator between the network layer of a current task and the network layer of a preceding task that run on different processing units, according to the dependency relationship, wherein the current task is a task executed after the execution of the preceding task is finished.
12. The operation device according to claim 10 or 11, wherein the dependency relationship is determined according to whether the storage address intervals of the data required by the tasks to be executed have an overlapping area.
13. The operation device of claim 11, wherein the synchronization operator comprises a first operator and a second operator;
the first operator is used for representing the running state of the preceding task; and
the second operator is used for determining whether to run the current task according to the first operator.
14. The operation device of claim 13, wherein the second operator is configured to read the first operator at a preset time interval, and determine whether to run the current task according to the read first operator.
15. The operation device of claim 14, wherein the running state comprises running not completed or running completed.
16. The operation device of claim 10, wherein the synchronization operator is stored in a shared memory of the processor.
17. The operation device according to claim 10, further comprising:
a schedule generating module, configured to generate a schedule according to the neural network model to which the synchronization operator has been added, wherein the schedule comprises the tasks to be executed by the processing units and the dependency relationship of the tasks to be executed.
18. The operation device of claim 17, wherein a plurality of schedules are generated, each schedule corresponding to a respective processing unit of the processor.
19. A computer device, comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 9 when executing the computer program.
20. A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
21. A machine learning computing device, characterized in that the machine learning computing device comprises one or more operation devices according to any one of claims 10 to 18, and is configured to obtain input data and control information to be computed from other processing devices, perform the specified machine learning computation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning computing device comprises a plurality of operation devices, the operation devices are connected through a PCIE bus and transmit data;
the plurality of operation devices share the same control system or have their own control systems;
the plurality of operation devices share a memory or have their own memories; and
the interconnection mode of the plurality of operation devices is any interconnection topology.
22. A combination processing device, comprising the machine learning computing device of claim 21, a universal interconnect interface, and other processing devices;
the machine learning computing device interacts with the other processing devices to jointly complete the computation operation specified by the user.
23. The combination processing device of claim 22, further comprising: a storage device;
the storage device is connected to the machine learning computing device and the other processing devices, respectively, and is configured to store data of the machine learning computing device or of the combination processing device according to claim 22.
24. A neural network chip, characterized in that the chip comprises the machine learning computing device according to claim 21, or the combination processing device according to claim 22, or the combination processing device according to claim 23.
25. An electronic device comprising the neural network chip of claim 24.
26. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the neural network chip according to claim 24;
the neural network chip is connected to the storage device, the control device, and the interface device, respectively;
the storage device is used for storing data;
the interface device is used for implementing data transmission between the neural network chip and an external device; and
the control device is used for monitoring the state of the neural network chip.
27. The board card of claim 26, wherein
the storage device comprises a plurality of groups of storage units, each group of storage units being connected to the neural network chip through a bus, and each storage unit being a DDR SDRAM;
the chip comprises a DDR controller configured to control data transmission to and data storage in each storage unit; and
the interface device is a standard PCIE interface.
CN201910263147.2A 2019-04-02 2019-04-02 Operation method, device and related product Active CN111767995B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910263147.2A CN111767995B (en) 2019-04-02 2019-04-02 Operation method, device and related product
PCT/CN2020/082831 WO2020200250A1 (en) 2019-04-02 2020-04-01 Operation method and apparatus, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910263147.2A CN111767995B (en) 2019-04-02 2019-04-02 Operation method, device and related product

Publications (2)

Publication Number Publication Date
CN111767995A CN111767995A (en) 2020-10-13
CN111767995B (en) 2023-12-05

Family

ID=72718498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910263147.2A Active CN111767995B (en) 2019-04-02 2019-04-02 Operation method, device and related product

Country Status (1)

Country Link
CN (1) CN111767995B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329834B (en) * 2020-10-29 2023-08-01 北京百度网讯科技有限公司 Method and device for distributing video memory space during training of cyclic network model
CN112559054B (en) * 2020-12-22 2022-02-01 上海壁仞智能科技有限公司 Method and computing system for synchronizing instructions
CN114048030B (en) * 2021-11-09 2022-07-26 北京百度网讯科技有限公司 Method and device for scheduling operator

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678781B (en) * 2016-08-01 2021-02-26 北京百度网讯科技有限公司 Processor and method for executing instructions on processor
US10783393B2 (en) * 2017-06-20 2020-09-22 Nvidia Corporation Semi-supervised learning for landmark localization

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102624870A (en) * 2012-02-01 2012-08-01 北京航空航天大学 Intelligent optimization algorithm based cloud manufacturing computing resource reconfigurable collocation method
WO2018058427A1 (en) * 2016-09-29 2018-04-05 北京中科寒武纪科技有限公司 Neural network computation apparatus and method
CN107886167A (en) * 2016-09-29 2018-04-06 北京中科寒武纪科技有限公司 Neural network computing device and method
WO2018113553A1 (en) * 2016-12-21 2018-06-28 杭州海康威视数字技术股份有限公司 Image analysis method and device
US10127494B1 (en) * 2017-08-02 2018-11-13 Google Llc Neural network crossbar stack
CN109189474A (en) * 2018-02-05 2019-01-11 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing vector adduction instruction
CN109345377A (en) * 2018-09-28 2019-02-15 北京九章云极科技有限公司 A kind of generating date system and Real-time Data Processing Method
CN109376017A (en) * 2019-01-07 2019-02-22 人和未来生物科技(长沙)有限公司 Cloud computing platform task processing method, system and its application method based on container

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-task oriented manufacturing cloud service composition; Liu Weining; Liu Bo; Sun Dihua; Computer Integrated Manufacturing Systems (01); full text *

Also Published As

Publication number Publication date
CN111767995A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN110096310B (en) Operation method, operation device, computer equipment and storage medium
CN111767995B (en) Operation method, device and related product
CN110119807B (en) Operation method, operation device, computer equipment and storage medium
CN109726800B (en) Operation method, device and related product
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN111767121B (en) Operation method, device and related product
CN109740746B (en) Operation method, device and related product
CN111047005A (en) Operation method, operation device, computer equipment and storage medium
CN111723920B (en) Artificial intelligence computing device and related products
CN111767999B (en) Data processing method and device and related products
CN111340202B (en) Operation method, device and related product
CN111209230B (en) Data processing device, method and related product
CN111767078A (en) Data operation method and device and related product
CN114281558A (en) Multi-core processor, method for multi-core processor and corresponding product
CN111258732A (en) Data processing method, data processing device and electronic equipment
CN111047030A (en) Operation method, operation device, computer equipment and storage medium
CN111353595A (en) Operation method, device and related product
CN111723921B (en) Artificial intelligence computing device and related products
CN111210011B (en) Data processing device and related product
CN111209245B (en) Data processing device, method and related product
CN111338694B (en) Operation method, device, computer equipment and storage medium
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111275197B (en) Operation method, device, computer equipment and storage medium
CN111339060B (en) Operation method, device, computer equipment and storage medium
WO2020192587A1 (en) Artificial intelligence computing device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant