CN111767078A - Data operation method and device and related product


Info

Publication number
CN111767078A
Authority
CN
China
Prior art keywords
neural network
processor
data
instruction
processors
Prior art date
Legal status
Pending
Application number
CN201910263151.9A
Other languages
Chinese (zh)
Inventor
Inventor not announced
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910263151.9A
Priority to PCT/CN2020/082831 (WO2020200250A1)
Publication of CN111767078A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F 9/30087 Synchronisation or serialisation instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1668 Details of memory controller
    • G06F 13/1673 Details of memory controller using buffers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The present disclosure relates to a data operation method, apparatus, and related product. The product includes a control module, and the control module includes an instruction cache unit, an instruction processing unit, and a storage queue unit. The instruction cache unit is used to store computation instructions associated with artificial neural network operations; the instruction processing unit is used to parse a computation instruction to obtain a plurality of operation instructions; and the storage queue unit is used to store an instruction queue that includes a plurality of operation instructions or computation instructions to be executed in queue order. Through this method, the operation efficiency of the related product when running a neural network model can be improved.

Description

Data operation method and device and related product
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a data operation method and apparatus, and a related product.
Background
In the field of deep learning, a neural network model generally runs in model-parallel or data-parallel mode. When a neural network model runs in parallel, the same neural network model (i.e., the same weight data) must be used to process multiple streams of input data. That is, the multiple streams of input data need to access weight data at the same memory address. If the multiple input streams are not synchronized, the same weight data cannot be read simultaneously, which slows down the running of the neural network model.
Disclosure of Invention
In view of this, the present disclosure provides a data operation method, an apparatus, and a related product, which enable multiple different input streams to read shared weight data simultaneously and thereby effectively accelerate the operation of a neural network model.
According to an aspect of the present disclosure, there is provided a data operation method applied to an artificial intelligence processor including a plurality of processors, the method including:
after a first processor has run a synchronization instruction in a neural network model, determining the current running state of each of the other processors that run the synchronization instruction;
wherein the first processor is any one of the plurality of processors running the neural network model;
and when the current running state indicates that the other processors running the synchronization instruction have all run the synchronization instruction, the processors that have run the synchronization instruction read the shared data simultaneously.
In one possible implementation, the method further includes:
and when the current running state indicates that at least one of the other processors running the synchronization instruction has not started running or is still running the synchronization instruction, suspending the running processes of the first processor and the other processors that have already run the synchronization instruction.
In one possible implementation, determining the current running state of each of the other processors that run the synchronization instruction includes:
acquiring the number of processors that have currently run the synchronization instruction;
and when the number of processors that have currently run the synchronization instruction is the same as the number specified in the synchronization instruction, determining that the other processors running the synchronization instruction have all run the synchronization instruction.
In one possible implementation, the synchronization instruction is added to the network topology of the neural network model based on the manner in which the neural network model operates.
In one possible implementation, the operation manner is a data-parallel manner;
when the neural network model runs on a plurality of processors in a data-parallel manner, the amount of input/output data generated when the neural network model runs to the current instruction on each processor is acquired layer by layer according to the forward propagation order of the neural network;
and after the value of the input/output data amount reaches a first value, the synchronization instruction is inserted before the current instruction.
In one possible implementation, the operation manner is either a model-parallel manner or a hybrid operation manner;
the hybrid operation manner is a mode in which data parallelism and model parallelism are mixed;
when the neural network model runs in the model-parallel manner or the hybrid operation manner, each download-weight instruction in the neural network model currently run by each processor is acquired;
the data amount of the weight data corresponding to each download-weight instruction in the neural network model currently run by each processor is determined;
and the synchronization instruction is inserted into the neural network model based on the size of the data amount.
In one possible implementation, inserting the synchronization instruction into the network topology of the neural network model based on the size of the data amount includes:
when the data amount is greater than or equal to a second value, inserting the synchronization instruction before the download-weight instruction in the neural network model currently run by each processor.
In one possible implementation, the synchronization instruction includes first information and second information;
the first information is used to distinguish each synchronization instruction in the neural network model;
and the second information specifies the number of processors that run each synchronization instruction.
According to an aspect of the present disclosure, there is also provided a data operation apparatus, including:
a state determination module configured to determine, after a first processor has run a synchronization instruction in a neural network model, the current running state of each of the other processors that run the synchronization instruction;
wherein the first processor is any one of the plurality of processors running the neural network model;
and a data reading module configured to, when the state determination module determines that the other processors running the synchronization instruction have all run the synchronization instruction, cause the processors that have run the synchronization instruction to read the shared data simultaneously.
In one possible implementation, the apparatus further includes:
a suspension module configured to suspend the running processes of the first processor and the other processors that have already run the synchronization instruction when the state determination module determines that at least one of the other processors running the synchronization instruction has not started running or is still running the synchronization instruction.
In one possible implementation, the state determination module includes:
a number acquisition submodule configured to acquire the number of processors that have currently run the synchronization instruction;
and a determination submodule configured to determine that the other processors running the synchronization instruction have all run the synchronization instruction when the number of processors that have currently run the synchronization instruction is the same as the number specified in the synchronization instruction.
According to an aspect of the present disclosure, there is also provided a computer device including a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of any of the foregoing methods when executing the computer program.
According to an aspect of the present disclosure, there is also provided a readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
According to an aspect of the present disclosure, there is also provided a machine learning operation device, including one or more data operation devices as described above, configured to acquire input data and control information to be operated on from other processing devices, execute a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of data operation devices, the data operation devices may be connected through a specific structure and transmit data;
for example, the plurality of data operation devices are interconnected through a PCIE bus and transmit data to support larger-scale machine learning operations;
the plurality of data operation devices may share the same control system or have their own control systems;
the plurality of data operation devices may share a memory or have their own memories;
and the plurality of data operation devices may be connected in any interconnection topology.
According to an aspect of the present disclosure, there is also provided a combined processing device, including the machine learning operation device as described above, a universal interconnection interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete the computation operation specified by the user.
In one possible implementation, the combined processing device further includes a storage device;
the storage device is connected to the machine learning operation device and the other processing devices, respectively, and is configured to store data of the machine learning operation device and the other processing devices.
According to an aspect of the present disclosure, there is also provided a neural network chip, the chip including the machine learning arithmetic device as described above, or the combined processing device as described in any one of the above.
According to an aspect of the present disclosure, there is also provided an electronic device including the neural network chip as described above.
According to an aspect of the present disclosure, there is also provided a board card, including: a storage device, an interface device, a control device, and the neural network chip as described above;
wherein the neural network chip is connected to the storage device, the control device, and the interface device, respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the neural network chip and external equipment;
and the control device is used for monitoring the state of the neural network chip.
In one possible implementation, the storage device includes a plurality of groups of memory cells, each group of memory cells being connected to the neural network chip through a bus, and the memory cells being DDR SDRAM;
the chip includes a DDR controller used to control data transmission to and data storage in each memory cell;
and the interface device is a standard PCIE interface.
When the neural network model runs in at least one of the data-parallel and model-parallel modes, synchronization instructions are added to the neural network model. After the first processor has run a synchronization instruction in the neural network model, the current running state of the other processors that run the same synchronization instruction as the first processor is determined. When it is determined that all of the other processors running the same synchronization instruction as the first processor have run that instruction, the first processor and those processors read the shared weight data in the cache simultaneously. This effectively improves the cache hit rate, enables multiple different input streams to read the shared weight data at the same time, and ultimately accelerates the operation of the neural network model.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flowchart of a data operation method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the operation logic between operators of a convolutional layer in a neural network model in the data operation method according to an embodiment of the present disclosure;
FIG. 3 shows a block diagram of a data operation apparatus according to an embodiment of the present disclosure;
FIG. 4 shows a block diagram of a combined processing device according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of another combined processing device according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a board card according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
First, it should be noted that, in the data operation method of the present disclosure, the neural network model may be any of various network models, such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a BiRNN (Bidirectional RNN), a GRU (Gated Recurrent Unit), an LSTM (Long Short-Term Memory network), and the like, and the present disclosure is not particularly limited in this respect.
The data operation method of the present disclosure may be applied to an artificial intelligence processor. An artificial intelligence processor refers to a processor (IPU) for performing artificial intelligence operations, such as one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field Programmable Gate Array) chip. The present disclosure is not limited to a particular type of artificial intelligence processor.
It should also be noted that, when the data operation method of the embodiments of the present disclosure is applied to an artificial intelligence processor, the artificial intelligence processor includes a plurality of processors. When multiple processors run the neural network model, three modes of operation are possible. The first is data parallelism: different processors hold copies of the same neural network model, each copy is assigned different data (e.g., different sample data sets), and the results computed by all processors are then combined in some manner. The second is model parallelism: different processors in a distributed system are responsible for different parts of the neural network model; for example, different network layers of the neural network model are assigned to different processors, or different parameters within the same layer are assigned to different processors. The third is a hybrid mode that mixes data parallelism and model parallelism: some processors in the distributed system run the neural network model in a model-parallel manner, while the other processors run it in a data-parallel manner.
FIG. 1 illustrates a flowchart of a data operation method according to an embodiment of the present disclosure. Referring to FIG. 1, the method may include:
Step S100: after the first processor has run a synchronization instruction in the neural network model, determining the current running state of each of the other processors that run the synchronization instruction. Here, it should be noted that the first processor refers to any one of the plurality of processors that run the neural network model. It should also be noted that the synchronization instruction is an instruction added to the neural network model, and the number and positions of the added synchronization instructions differ for different neural network models; that is, a plurality of synchronization instructions may be added to the neural network model. As those skilled in the art will understand, the other processors that run the synchronization instruction refer to the processors that run the same synchronization instruction as the one currently run by the first processor.
For example, suppose three synchronization instructions are added to the neural network model: a first synchronization instruction, a second synchronization instruction, and a third synchronization instruction. After the first processor has run the first synchronization instruction, only the current running states of the processors to which the first synchronization instruction is assigned need to be determined; processors that are not assigned to run the first synchronization instruction are outside the scope of this determination.
Step S200: when the current running state indicates that the other processors running the synchronization instruction have all run it, the processors that have run the synchronization instruction read the shared data simultaneously. That is, when all of the other processors running the first synchronization instruction have run the first synchronization instruction, the first processor and all of those processors read the data stored in the cache at the same time.
Here, it should be noted that the data read simultaneously by each processor may be weight data in the neural network model, and may also include various parameters required for operator computation in the neural network model. The weight data that is read is data currently shared by the processors (i.e., the weight data required by the next operator to be executed after each processor runs the first synchronization instruction).
Therefore, in the data operation method of the embodiments of the present disclosure, when the neural network model runs in at least one of the data-parallel and model-parallel modes, synchronization instructions are added to the neural network model. After the first processor has run a synchronization instruction in the neural network model, the current running state of each of the processors currently running the same synchronization instruction as the first processor is determined. When it is determined that all of those processors have run the synchronization instruction, the first processor and those processors read the shared weight data in the cache simultaneously. This effectively improves the cache hit rate, enables multiple different input streams to read the shared weight data at the same time, and ultimately accelerates the operation of the neural network model.
In one possible implementation, the current running state may also be that at least one of the other processors running the synchronization instruction has not started running or is still running the synchronization instruction. That is, after the first processor has run the first synchronization instruction, at least one of the processors assigned the first synchronization instruction has not started running it or is still running it. In this case, the running processes of the first processor and the other processors that have already run the synchronization instruction may be suspended.
In other words, in the embodiments of the present disclosure, after the first processor has run the first synchronization instruction, if at least one of the other processors assigned the first synchronization instruction has not started running it or is still running it (i.e., has not finished running the first synchronization instruction), the running processes of the first processor and the other processors that have run the first synchronization instruction are suspended. After all remaining processors have run the first synchronization instruction, the weight data in the cache for the operator to be executed after the first synchronization instruction is read simultaneously.
Suspending the running processes of the first processor and the other processors that have already run the synchronization instruction when at least one of the other processors has not started running or is still running it further ensures that the processors running the same synchronization instruction can read the shared weight data synchronously.
It should be noted that determining the current running state of each of the other processors that run the synchronization instruction may be implemented by counting the number of processors that have run the synchronization instruction. As will be understood from the foregoing description, the acquired number of processors that have currently run the synchronization instruction refers to the number of processors that have run the same instruction as the synchronization instruction currently run by the first processor. The counting of processors may be implemented by a counter.
That is, taking the first processor running the first synchronization instruction as an example, the acquired number is the number of processors that have run the first synchronization instruction, where the first synchronization instruction is any one of the plurality of synchronization instructions added to the neural network model.
Based on this, in one possible implementation, when determining the current running state of each of the other processors that run the synchronization instruction, the number of processors that have currently run the synchronization instruction is acquired first, and the current running state is then determined from that number. When the number of processors that have currently run the synchronization instruction is the same as the number of running processors specified in the synchronization instruction (for example, the number of processors that have currently run the first synchronization instruction is the same as the number specified in the first synchronization instruction), every processor assigned the same instruction as the synchronization instruction currently run by the first processor has run it. It can therefore be determined that all processors running the synchronization instruction have run it, which indicates that the weight data can be read synchronously at this point.
When the number of processors that have currently run the synchronization instruction differs from the number specified in the synchronization instruction (that is, the number that have run it is smaller than the number specified), at least one of the processors assigned the same instruction as the synchronization instruction currently run by the first processor has not yet run it, so it can be determined that at least one of the other processors running the synchronization instruction has not started running or is still running it.
Determining the current running state of each processor running the synchronization instruction by counting the processors that have run it uses simple computation logic, is easy to implement, effectively simplifies the computation process, reduces computation difficulty, and at the same time effectively reduces energy consumption during the running of the neural network model.
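As a minimal sketch of this determination (a software analogy only; the function and counter names are assumptions, and a real multi-processor implementation would use hardware or atomic counters), the check might look as follows:

    import time

    # Hypothetical counter table: one counter per synchronization-instruction id,
    # incremented once each time a processor finishes running that instruction.
    sync_counters = {}

    def run_sync_instruction(sync_id, specified_count, poll_interval_s=0.001):
        """Called by a processor right after it has run synchronization
        instruction `sync_id`; returns only when the instruction is satisfied."""
        sync_counters[sync_id] = sync_counters.get(sync_id, 0) + 1

        # Determine the current running state of the other processors: if fewer
        # processors have run this instruction than the number specified in it,
        # suspend and re-read the count at a preset interval.
        while sync_counters[sync_id] < specified_count:
            time.sleep(poll_interval_s)

        # All processors assigned this synchronization instruction have run it;
        # the shared weight data can now be read simultaneously.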
To illustrate the process of the embodiments of the present disclosure more clearly, the following description takes the data-parallel mode as a more detailed example.
In this embodiment, the distributed system (e.g., an artificial intelligence processor) includes four processors: a first processor, a second processor, a third processor, and a fourth processor. When the neural network model runs in data-parallel mode, each processor runs the same neural network model; that is, the network topology of the neural network model run by each processor is the same. The difference is that each processor processes a different set of sample data.
Meanwhile, three synchronization instructions are added to the neural network model run by each processor: a first synchronization instruction, a second synchronization instruction, and a third synchronization instruction, added at different positions. For example, in this embodiment, an intermediate network layer of the neural network model includes a convolutional layer in which three operators are set. The first synchronization instruction is added before the first operator in the convolutional layer, the second synchronization instruction before the second operator, and the third synchronization instruction before the third operator, where the first, second, and third operators are executed in sequence in the network topology of the neural network model.
In addition, in this embodiment the four processors run the neural network model in a data-parallel manner, so each processor needs to run all three synchronization instructions. That is, when the three synchronization instructions are added to the neural network model, the number of running processors specified in each synchronization instruction is 4, as sketched below.
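For illustration only (the list form and the field names are assumptions, not the disclosed instruction encoding), the instruction stream of this convolutional layer on each processor could be pictured as:

    # Hypothetical per-processor instruction stream for the convolutional layer in
    # this example: each synchronization instruction precedes the operator whose
    # shared weights it guards, and the specified processor count is 4 because all
    # four processors run the same model in data-parallel mode.
    instruction_stream = [
        ("sync", {"id": 1, "processors": 4}),     # first synchronization instruction
        ("operator", {"name": "first_operator"}),
        ("sync", {"id": 2, "processors": 4}),     # second synchronization instruction
        ("operator", {"name": "second_operator"}),
        ("sync", {"id": 3, "processors": 4}),     # third synchronization instruction
        ("operator", {"name": "third_operator"}),
    ]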
During the running of the neural network model in this embodiment, the speed at which data is processed differs between processors because their hardware configurations differ. The times at which the processors reach the synchronization instructions and the operators in the neural network model therefore also differ.
After the first processor has run the first synchronization instruction in the neural network model, it reads the count value of the counter at that moment to obtain the number of processors that have currently run the first synchronization instruction. When the number of processors is counted by counters, one counter may be provided for each synchronization instruction; for example, each time a processor runs the first synchronization instruction, the counter corresponding to the first synchronization instruction counts once.
When the first processor reads a current counter value of 1, only one of the four processors (i.e., the first processor itself) has run the first synchronization instruction, so the running process of the first processor may be suspended. While suspended, the first processor may read the count value of the counter at a preset interval.
Meanwhile, the counter continues to count the processors that have run the first synchronization instruction. That is, during the running of the four processors, the counter counts once each time a processor runs the first synchronization instruction. Each processor suspends its running process after running the first synchronization instruction and may periodically read the current count value of the counter at the preset interval until the value reaches 4.
When the count value read by each processor (or any one processor) equals the number specified in the first synchronization instruction (that is, 4), the four processors simultaneously read the weight data of the first operator stored in the cache, so that each processor can perform the computation of the first operator based on the weight data it has read.
It should further be noted that, when the four processors simultaneously read the weight data of the first operator stored in the cache, if the weight data of the first operator is not currently stored in the cache, the four processors directly read the weight data of the first operator stored in memory (e.g., DDR, which stores all the weight data in the neural network model). One of the four processors (i.e., the one with the highest reading speed) reads the weight data of the first operator from the DDR and caches it in the cache, and the other three processors then read the weight data of the first operator directly from the cache.
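A sketch of this read path, assuming a dictionary-like cache and a simple DDR read interface (none of these names come from the disclosure):

    def read_shared_weights(operator_name, cache, ddr):
        """Called by all processors simultaneously once the synchronization
        instruction is satisfied. On a cache miss, the fastest reader fills the
        cache from DDR; the remaining processors then read the cached copy."""
        weights = cache.get(operator_name)
        if weights is None:
            # Cache miss: read the operator's weight data from DDR (which stores
            # all the weight data of the neural network model) and cache a copy.
            weights = ddr.read(operator_name)
            cache[operator_name] = weights
        return weights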
Therefore, the four processors can simultaneously read the corresponding weight data when executing the operation of the first operator. When the second synchronization instruction and the third synchronization instruction are executed, the processes executed by the four processors are the same as or similar to the process executed by the first synchronization instruction, and therefore, the description is omitted here.
It should be noted that, when counters are used to count the processors that have run a synchronization instruction, the number of counters may correspond to the number of synchronization instructions. As the neural network model becomes deeper (the number of layers may increase) and the number of added synchronization instructions grows, a fixed number of counters can be set and used in rotation to count the processors, so as to avoid configuring so many counters that the hardware becomes overly complex and power consumption increases. For example, the number of counters may be set to 32: after the first 32 added synchronization instructions have been counted by the 32 counters, counting for the 33rd synchronization instruction continues by reusing the 32 counters, as sketched below.
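A minimal sketch of this counter reuse, assuming consecutively numbered synchronization instructions (the modulo mapping is an assumption):

    NUM_COUNTERS = 32              # fixed counter pool configured in hardware
    counters = [0] * NUM_COUNTERS

    def counter_index(sync_id):
        """Map a synchronization-instruction id onto the fixed counter pool, so
        the 33rd instruction reuses the counter of the 1st, the 34th that of the
        2nd, and so on (ids are assumed here to start at 1)."""
        return (sync_id - 1) % NUM_COUNTERS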
In addition, it will be understood by those skilled in the art that when the neural network model runs in model-parallel mode or hybrid mode (i.e., a mode mixing data parallelism and model parallelism), the data-running process is the same as or similar to that described above, except that the number of running processors specified in each synchronization instruction differs; it is therefore not described in detail here.
Further, according to the foregoing, when the neural network model runs in any of the data-parallel, model-parallel, or hybrid operation modes, the weight data needs to be read simultaneously based on the synchronization instructions in the neural network model. In one possible implementation, the synchronization instructions are added based on the operation mode of the neural network model. Synchronization instructions may be added to the network topology of the neural network model as instruction-level entries in the neural network model, and there may be a plurality of them.
In addition, according to the foregoing, the determination of the current operation state of each processor may be implemented based on whether the number of processors currently running the same synchronization instruction is the same as the number of processors specified in the synchronization instruction.
Thus, in one possible implementation, the synchronization instruction may include first information and second information. The first information is used to distinguish each synchronization instruction in the neural network model, and the second information specifies the number of processors that run each synchronization instruction.
For example, when the synchronization instructions are represented as barriers, each synchronization instruction can be expressed in the form barrier(id, cnt), where id characterizes the identity of each synchronization instruction (i.e., it distinguishes the synchronization instructions) and cnt indicates the number of processors that run the synchronization instruction. After a processor runs the current synchronization instruction, it checks whether the number of processors that have run that synchronization instruction equals cnt; if not, it waits until cnt processors have run the barrier with the same id.
Therefore, when a counter is used to count the number of processors that have currently run a certain synchronization instruction, the synchronization instruction can be identified by its id, and the required number of processors running the synchronization instruction with that id can be determined from its cnt. The method is simple and easy to implement.
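As a sketch of this representation (the record form and field types are assumptions), barrier(id, cnt) could be modelled as:

    from dataclasses import dataclass

    @dataclass
    class Barrier:
        id: int    # first information: distinguishes this synchronization instruction
        cnt: int   # second information: number of processors that must run it

    # A barrier that is satisfied once 4 processors have run it.
    barrier_example = Barrier(id=1, cnt=4)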
It should be noted that synchronization instructions may be inserted into the network topology of the neural network model before the neural network model is run, and the insertion may be performed based on the mode in which the neural network model will operate.
In one possible implementation, when the neural network model is intended to run on multiple processors in a data-parallel manner, adding a synchronization instruction to the network topology of the neural network model may include:
acquiring, layer by layer according to the forward propagation order of the neural network, the amount of input/output (IO) data generated when the neural network model runs to the current instruction on each processor. As those skilled in the art will understand, layer-by-layer acquisition means that, based on the network topology of the neural network model, the data amount generated when each operator runs on each processor is computed step by step in the forward propagation order. The IO amount generated when each processor runs to the current instruction means the sum, computed step by step in the forward propagation order, of the IO amounts generated by all the operators that each processor has run so far.
For example, take the convolutional layer in the preceding neural network model again. In this embodiment, the convolutional layer is configured with four operators: a first operator, a second operator, a third operator, and a fourth operator. Referring to FIG. 2, the first, third, and fourth operators are serial operators whose running order is: first operator, third operator, fourth operator; the second and third operators are parallel operators.
Based on the above, in the process of acquiring, layer by layer, the IO amount generated when the neural network model runs to the current instruction on each processor, by the time the computation reaches the fourth operator on each processor, the first, second, and third operators have all been run. Therefore, the amount of input/output data generated when each processor runs to the current instruction (i.e., the fourth operator) in the neural network model refers to the sum of the IO amounts generated when each processor runs the first, second, third, and fourth operators.
After the IO amount for each processor running to the current instruction is obtained in this way, the obtained value is evaluated, and once the IO amount reaches a first value, a synchronization instruction is inserted before the current instruction. The size or range of the first value may be determined as needed. It should be noted that, when the neural network model runs in a data-parallel manner, the IO amounts generated by different processors when running to the same operator (or the same instruction in the same operator) may differ because the processors have different hardware configurations. Thus, in the data operation method of the embodiments of the present disclosure, when the neural network model runs in a data-parallel manner, the positions of the synchronization instructions added to the neural network model differ from processor to processor.
It should further be noted that, since data parallelism means that each processor in the distributed system is assigned the same neural network model, when synchronization instructions of the aforementioned form are added, the cnt value in each synchronization instruction is the number of processors currently running the neural network model, as sketched below.
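A sketch of the data-parallel insertion pass, assuming the per-instruction IO amounts are already known and that the accumulation restarts after each insertion (both assumptions; the helper names are not from the disclosure):

    def insert_sync_data_parallel(instructions, io_bytes, first_value, num_processors):
        """Walk a processor's instruction list in forward-propagation order,
        accumulating the input/output data amount generated so far; once the
        accumulated amount reaches `first_value`, insert a synchronization
        instruction before the current instruction and restart the accumulation."""
        result, accumulated, next_id = [], 0, 1
        for instr in instructions:
            accumulated += io_bytes[instr]
            if accumulated >= first_value:
                # cnt is the number of processors running the model in data parallel.
                result.append(("sync", next_id, num_processors))
                next_id += 1
                accumulated = 0
            result.append(instr)
        return result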
Further, when the neural network model is intended to run in a model-parallel manner, adding synchronization instructions to the network topology of the neural network model may, as one possible implementation of the embodiments of the present disclosure, be carried out as follows.
First, each download-weight instruction (i.e., each load-weight instruction) in the neural network model currently run by each processor is acquired. In model parallelism, since each processor executes only some of the operators in the neural network model, the download-weight instructions assigned to each processor can be obtained from the scheduling information of the neural network model. As those skilled in the art will understand, the scheduling information of the neural network model refers to the assignment information that assigns the operators in the neural network model to the processors.
In addition, in one possible implementation, the download-weight instructions run by each processor may be determined according to the forward propagation order of the neural network, or in any other order, as long as all download-weight instructions in each processor can be obtained.
Next, the data amount of the weight data corresponding to each download-weight instruction in the neural network model currently run by each processor is determined. The weight data corresponding to each download-weight instruction refers to the weights to be downloaded when the processor executes that instruction. That is, different download-weight instructions download different weights, which may therefore have different data amounts.
It should be noted that, when determining the data amount of the weight data corresponding to each download-weight instruction in the neural network model currently run by each processor (that is, a model that has been assigned but has not yet started running), the data amount may be computed based on parameters such as the specific computation logic of each operator set in the neural network model and the hardware configuration of each processor, which is not described further here.
After the data amount of the weight data corresponding to each download-weight instruction of each processor is determined, synchronization instructions can be inserted into the network topology of the neural network model based on that data amount.
In one possible implementation, the insertion is performed by determining whether the data amount of the weight data corresponding to each download-weight instruction is greater than or equal to a second value. When the data amount is greater than or equal to the second value, a synchronization instruction is inserted before that download-weight instruction in the neural network model currently run by each processor.
The second value may be determined according to the formula: second value = ln(number of forward dependencies of the operator corresponding to the current weight) × 512K. As those skilled in the art will understand, the number of forward dependencies of the operator corresponding to the current weight refers to the number of other operators that the current operator depends on. It should further be noted that this formula is only a basic formula and may also include other hardware parameters; since these hardware parameters differ between processors, they are not enumerated here. That is, the second values set for different processors differ, so synchronization instructions may be inserted at different positions for different processors.
In addition, inserting synchronization instructions into the neural network model currently run by each processor in this way involves only a simple computation; the synchronization instructions can be obtained without complex operations, which effectively reduces the amount of computation and, at the same time, reduces energy consumption. A sketch of this insertion pass follows.
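A sketch of the model-parallel case under the threshold formula above (the natural-logarithm reading, the byte unit assumed for 512K, and all names are assumptions):

    import math

    def insert_sync_model_parallel(instructions, weight_bytes, forward_deps, shared_count):
        """Insert a synchronization instruction before each download-weight
        instruction whose weight data amount reaches the second value."""
        result, next_id = [], 1
        for instr in instructions:
            if instr["kind"] == "load_weight":
                op = instr["operator"]
                # Second value: ln(number of forward dependencies of the operator
                # corresponding to the current weight) * 512K
                # (512K is read here as 512 * 1024 bytes, which is an assumption).
                second_value = math.log(forward_deps[op]) * 512 * 1024
                if weight_bytes[op] >= second_value:
                    # cnt comes from the scheduling information: how many
                    # processors share this operator's weight data.
                    result.append(("sync", next_id, shared_count[op]))
                    next_id += 1
            result.append(instr)
        return result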
It should be noted that the process of determining each download-weight instruction in each processor and the process of determining the data amount of its corresponding weight data may be performed together: once a download-weight instruction is determined, the data amount of its corresponding weight data is computed, proceeding step by step in the forward propagation order of the neural network. Alternatively, the two steps may be performed separately: all download-weight instructions in each processor are determined first, and the data amount of the corresponding weight data is then computed for each download-weight instruction, which improves the accuracy of synchronization-instruction insertion.
It should further be noted that, after it is determined that a synchronization instruction is to be inserted, the number of running processors for each inserted synchronization instruction can be determined from the scheduling information of the neural network model. As those skilled in the art will understand, the scheduling information of the neural network model is the assignment information for each part of the network after the neural network model is split. Based on the scheduling information, the operators run by each processor can be obtained, and from this information it can be determined which processors the same operator is assigned to (i.e., the number of processors running the same operator). The cnt value of the synchronization instruction inserted before the download-weight instruction corresponding to an operator can therefore be obtained directly from the number of processors running that operator.
For example, take the convolutional layer in the neural network model again. In this embodiment, five operators are set in the convolutional layer: a first operator, a second operator, a third operator, a fourth operator, and a fifth operator. The first and second operators are assigned to the first processor, the second and third operators to the second processor, and the third, fourth, and fifth operators to the third processor. The first and second processors share the weight data of the second operator, and the second and third processors share the weight data of the third operator.
Thus, the number of running processors specified in the synchronization instruction inserted before the second operator is set to 2 (i.e., the first processor and the second processor), and the number of processors specified in the synchronization instruction inserted before the third operator is set to 2 (i.e., the second processor and the third processor). A sketch of this derivation follows.
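A sketch of deriving the cnt values from the scheduling information in this example (the mapping format and names are assumptions):

    from collections import defaultdict

    # Hypothetical scheduling information for this example: which operators each
    # processor is assigned to run.
    schedule = {
        "first_processor":  ["first_op", "second_op"],
        "second_processor": ["second_op", "third_op"],
        "third_processor":  ["third_op", "fourth_op", "fifth_op"],
    }

    def processors_per_operator(schedule):
        """Count how many processors run each operator; for a shared operator
        this count becomes the cnt of the synchronization instruction inserted
        before its download-weight instruction (2 for second_op and third_op here)."""
        counts = defaultdict(int)
        for ops in schedule.values():
            for op in ops:
                counts[op] += 1
        return dict(counts)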
Therefore, based on the above steps, synchronization instructions can be inserted into the neural network model when it runs in model-parallel mode; the insertion is simple and can be realized without complex operations.
In addition, it should be noted that when the neural network model runs in the hybrid operation mode, the manner and principle of adding synchronization instructions are the same as or similar to those of the model-parallel mode, and are therefore not repeated here. Those skilled in the art can understand how data parallelism and model parallelism are mixed in the hybrid operation mode.
For example, the first and second processors run a neural network model in a model-parallel manner, and the third and fourth processors also run the model in a model-parallel manner. The first and second processors then act as one group of processors, and the third and fourth processors as another group; the two groups can be regarded as running the neural network model in a data-parallel manner.
Referring to FIG. 3, the present disclosure also provides a data operation apparatus 100. The data operation apparatus 100 of the embodiments of the present disclosure includes a state determination module 110 and a data reading module 120.
The state determination module 110 is configured to determine, after the first processor has run a synchronization instruction in the neural network model, the current running state of each of the other processors that run the synchronization instruction, where the first processor is any one of the plurality of processors running the neural network model.
The data reading module 120 is configured to, when the state determination module determines that the other processors running the synchronization instruction have all run it, cause the processors that have run the synchronization instruction to read the shared data simultaneously.
In one possible implementation, the apparatus further includes:
a suspension module (not shown in the figure) configured to suspend the running processes of the first processor and the other processors that have already run the synchronization instruction when the state determination module determines that at least one of the other processors running the synchronization instruction has not started running or is still running it.
In one possible implementation, the state determination module 110 includes:
a number acquisition submodule (not shown in the figure) configured to acquire the number of processors that have run the synchronization instruction;
and a determination submodule (not shown in the figure) configured to determine that the other processors running the synchronization instruction have all run it when the number of processors that have run the synchronization instruction is the same as the number specified in the synchronization instruction.
According to another aspect of the present disclosure, there is provided a computer device, including a memory and a processor, where the memory stores thereon a computer program operable on the processor, and the processor implements the steps of any one of the operation methods when executing the computer program.
According to another aspect of the present disclosure, there is also provided a readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any one of the above operational methods.
According to an aspect of the present disclosure, there is provided a machine learning arithmetic device including one or more arithmetic devices as any one of the above, for acquiring input data and control information to be operated from other processing devices, executing a specified machine learning operation, and transmitting an execution result to the other processing devices through an I/O interface. Other processing devices such as: the device comprises a camera, a display, a mouse, a keyboard, a network card, a wifi interface and a server. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example, the computing devices are interconnected and transmit data through a PCIE bus, so as to support larger-scale machine learning operations. At this time, the same control system may be shared, or there may be separate control systems; the memory may be shared or there may be separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
FIG. 4 shows a block diagram of a combined processing device 200a according to an embodiment of the present disclosure. Referring to fig. 4, the present disclosure also provides a combined processing device 200a, which includes the above machine learning computing device (neural network computing device 210), the universal interconnection interface 220 and the other processing device 230. The machine learning arithmetic unit 210 interacts with the other processing unit 230 to complete the operation designated by the user.
The other processing device 230 includes one or more types of general-purpose/special-purpose processors, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a neural network processor, and the like. The number of processors included in the other processing device 230 is not limited. The other processing device 230 serves as the interface between the machine learning operation device and external data and control, performs data transfer, and completes basic control of the machine learning operation device such as starting and stopping; the other processing device may also cooperate with the machine learning operation device to complete computation tasks.
The universal interconnect interface 220 is used for transmitting data and control instructions between the machine learning computing device 210 and the other processing device 230. The machine learning computing device 210 acquires the required input data from the other processing device 230 and writes it into a storage device on the machine learning computing device; control instructions can be obtained from the other processing device 230 and written into a control cache on the machine learning computing device chip; the data in the storage module of the machine learning computing device can also be read and transmitted to the other processing device.
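For illustration only, the following minimal Python sketch models the data and control flow just described; all names (MachineLearningDevice, UniversalInterconnect, write_input, and so on) are hypothetical and do not correspond to any interface defined in this disclosure.

```python
class MachineLearningDevice:
    """Hypothetical model of the machine learning computing device's storage and control cache."""

    def __init__(self):
        self.storage = {}        # on-device storage for input data
        self.control_cache = []  # control cache for instructions received over the interconnect

    def write_input(self, name, data):
        self.storage[name] = data

    def write_control(self, instruction):
        self.control_cache.append(instruction)

    def run(self):
        # Execute the cached control instructions on the stored data (toy operation).
        results = {}
        for instruction in self.control_cache:
            if instruction == "sum_inputs":
                results["sum"] = sum(sum(v) for v in self.storage.values())
        return results


class UniversalInterconnect:
    """Hypothetical interconnect: forwards data and control instructions, reads results back."""

    def __init__(self, device):
        self.device = device

    def transfer(self, inputs, instructions):
        for name, data in inputs.items():
            self.device.write_input(name, data)     # write required input data to device storage
        for instruction in instructions:
            self.device.write_control(instruction)  # write control instructions to the control cache
        return self.device.run()                    # read the result back to the other processing device


# Usage sketch: the "other processing device" (e.g. a CPU) drives the transfer.
device = MachineLearningDevice()
interconnect = UniversalInterconnect(device)
result = interconnect.transfer({"a": [1, 2, 3], "b": [4, 5]}, ["sum_inputs"])
print(result)  # {'sum': 15}
```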
Fig. 5 shows a block diagram of a combined processing device 200b according to another embodiment of the present disclosure. Referring to fig. 5, the combined processing device 200b of the present disclosure may further include a storage device 240, and the storage device 240 is connected to the machine learning computing device 210 and the other processing device 230, respectively. The storage device 240 is used to store data of the machine learning computing device 210 and the other processing device 230, and is particularly suitable for data that needs to be operated on but cannot be entirely stored in the internal storage of the machine learning computing device or the other processing device.
The combined processing device 200b can serve as a system-on-chip (SoC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In some embodiments, a neural network chip is also disclosed, which includes the above machine learning arithmetic device or combined processing device.
In some embodiments, a chip packaging structure is disclosed, which includes the neural network chip.
In some embodiments, a board card is disclosed, which includes the above chip package structure. Referring to fig. 6, fig. 6 provides a board card that may include, in addition to the chip 389, other supporting components, including but not limited to: a memory device 390, an interface device 391, and a control device 392.
The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of memory cells is connected with the chip through a bus. It is understood that each group of memory cells may be a DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be transferred on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of memory cells. Each group of memory cells may include a plurality of DDR4 chips (devices). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 devices are used in each group of memory cells, the theoretical bandwidth of data transmission can reach 25600 MB/s.
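The 25600 MB/s figure follows from the DDR4-3200 transfer rate and the 64-bit data width; the arithmetic is sketched below, assuming the conventional 3200 MT/s rate and an 8-byte data path, with the 8 ECC bits not counted as payload.

```python
# Theoretical bandwidth of one 64-bit DDR4-3200 channel (ECC bits not counted as payload).
transfers_per_second = 3200 * 10**6   # DDR4-3200: 3200 mega-transfers per second
data_width_bytes = 64 // 8            # 64-bit data path = 8 bytes per transfer

bandwidth_mb_per_s = transfers_per_second * data_width_bytes / 10**6
print(bandwidth_mb_per_s)  # 25600.0 MB/s, matching the figure above
```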
In one embodiment, each group of memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel, so that data can be transferred twice in one clock cycle. A controller for controlling the DDR is arranged in the chip and is used for controlling the data transmission and data storage of each memory cell.
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transmitted from the server to the chip through the standard PCIE interface to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the concrete form of the other interface, as long as the interface unit can implement the switching function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., the server) by the interface device.
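The 16000 MB/s figure is consistent with the commonly cited PCIe 3.0 numbers; the arithmetic is sketched below, assuming the standard 8 GT/s per-lane rate and 128b/130b encoding, which are background figures rather than values stated in this disclosure.

```python
# Approximate theoretical bandwidth of a PCIe 3.0 x16 link.
raw_rate_gt_per_s = 8            # PCIe 3.0: 8 giga-transfers per second per lane
encoding_efficiency = 128 / 130  # 128b/130b line encoding
lanes = 16

bytes_per_second = raw_rate_gt_per_s * 10**9 * encoding_efficiency / 8 * lanes
print(round(bytes_per_second / 10**6))  # ~15754 MB/s, i.e. roughly the 16000 MB/s cited above
```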
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads; therefore, the chip can be in different working states such as multi-load and light load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Having described embodiments of the present disclosure, the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or technical improvements over techniques available in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A data operation method, the method being applied to an artificial intelligence processor, the artificial intelligence processor comprising a plurality of processors, the method comprising:
after a first processor runs a synchronization instruction in a neural network model, determining the current running state of each of the other processors that run the synchronization instruction;
wherein the first processor is any one of the plurality of processors running the neural network model;
and when the current running state indicates that the other processors running the synchronization instruction have run the synchronization instruction, reading shared data simultaneously by the processors that have run the synchronization instruction.
2. The method of claim 1, wherein the synchronization instruction is added to a network topology of the neural network model based on an operation mode of the neural network model.
3. The method of claim 2, wherein the operation mode is a data parallel mode;
when the neural network model runs on the plurality of processors in the data parallel mode, acquiring, layer by layer according to a forward propagation sequence of the neural network, the input and output data volume generated when the neural network model runs to a current instruction on each processor;
and after the input and output data volume reaches a first value, inserting the synchronization instruction before the current instruction.
4. The method of claim 2, wherein the operation mode is either a model parallel mode or a hybrid operation mode;
the hybrid operation mode is a mode in which data parallelism and model parallelism run together;
when the neural network model runs in the model parallel mode or the hybrid operation mode, acquiring each download-weight instruction in the neural network model currently run by each processor;
determining the data volume of the weight data corresponding to each download-weight instruction in the neural network model currently run by each processor;
and inserting the synchronization instruction into the neural network model based on the size of the data volume.
5. A data operation apparatus, comprising:
a state determination module configured to, after a first processor runs a synchronization instruction in a neural network model, determine the current running state of each of the other processors, among a plurality of processors, that run the synchronization instruction;
wherein the first processor is any one of the plurality of processors running the neural network model;
and a data reading module configured to, when the state determination module determines that the current running state indicates that the other processors running the synchronization instruction have run the synchronization instruction, cause the processors that have run the synchronization instruction to read shared data simultaneously.
6. A machine learning arithmetic device, characterized in that the machine learning arithmetic device comprises one or more data operation apparatuses according to claim 5, configured to acquire input data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of the data operation apparatuses, the data operation apparatuses can be connected through a specific structure and transmit data;
wherein the plurality of data operation apparatuses are interconnected through a PCIE bus and transmit data so as to support larger-scale machine learning operations;
the plurality of data operation apparatuses share the same control system or have respective control systems;
the plurality of data operation apparatuses share a memory or have respective memories;
and the plurality of data operation apparatuses are connected in any interconnection topology.
7. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning arithmetic apparatus of claim 6, a universal interconnect interface and other processing apparatus;
and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
8. A neural network chip, comprising the machine learning computation apparatus of claim 6, or the combined processing apparatus of claim 7.
9. An electronic device, characterized in that the electronic device comprises the neural network chip of claim 8.
10. A board card, characterized in that the board card comprises: a memory device, an interface device, a control device, and the neural network chip of claim 8;
wherein the neural network chip is connected with the memory device, the control device, and the interface device, respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the neural network chip and external equipment;
and the control device is used for monitoring the state of the neural network chip.
CN201910263151.9A 2019-04-02 2019-04-02 Data operation method and device and related product Pending CN111767078A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910263151.9A CN111767078A (en) 2019-04-02 2019-04-02 Data operation method and device and related product
PCT/CN2020/082831 WO2020200250A1 (en) 2019-04-02 2020-04-01 Operation method and apparatus, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910263151.9A CN111767078A (en) 2019-04-02 2019-04-02 Data operation method and device and related product

Publications (1)

Publication Number Publication Date
CN111767078A true CN111767078A (en) 2020-10-13

Family

ID=72718776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910263151.9A Pending CN111767078A (en) 2019-04-02 2019-04-02 Data operation method and device and related product

Country Status (1)

Country Link
CN (1) CN111767078A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103677756A (en) * 2012-09-14 2014-03-26 通用电气公司 System and method for synchronizing processor instruction execution
CN107688853A (en) * 2016-08-05 2018-02-13 北京中科寒武纪科技有限公司 A kind of device and method for being used to perform neural network computing
US20180300607A1 (en) * 2017-04-17 2018-10-18 Microsoft Technology Licensing, Llc Minimizing memory reads and increasing performance by leveraging aligned blob data in a processing unit of a neural network environment
CN108734285A (en) * 2017-04-24 2018-11-02 英特尔公司 The calculation optimization of neural network
US20180322385A1 (en) * 2017-05-05 2018-11-08 Intel Corporation Efficient learning and using of topologies of neural networks in machine learning
CN109272109A (en) * 2018-10-30 2019-01-25 北京地平线机器人技术研发有限公司 The instruction dispatching method and device of neural network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Tao et al., "Design and Implementation of Processing Units in a Polymorphic Parallel Array Machine", Journal of Xi'an University of Posts and Telecommunications (《西安邮电大学学报》), vol. 20, no. 03, 31 May 2015 (2015-05-31), pages 21-28 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117472553A (en) * 2023-12-28 2024-01-30 中移(苏州)软件技术有限公司 Workflow processing method, device, processing equipment and readable storage medium
CN117472553B (en) * 2023-12-28 2024-05-03 中移(苏州)软件技术有限公司 Workflow processing method, device, processing equipment and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination