CN113780541A - Neural network accelerator, data processing device and neural network acceleration method


Info

Publication number
CN113780541A
Authority
CN
China
Prior art keywords
data
module
processing unit
processing
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111020115.3A
Other languages
Chinese (zh)
Inventor
祝叶华
孙炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111020115.3A
Publication of CN113780541A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The application provides a neural network accelerator, which comprises a first processing unit and a second processing unit; a first data transfer module in the first processing unit is connected with a second data transfer module in the second processing unit; the first processing unit sends first target data to the second data transfer module through the first data transfer module, and/or the first processing unit receives second target data sent by the second data transfer module through the first data transfer module. The application also provides a data processing device and a neural network acceleration method.

Description

Neural network accelerator, data processing device and neural network acceleration method
Technical Field
The present disclosure relates to the field of electronic technologies, and in particular, to a neural network accelerator, a data processing apparatus, and a neural network acceleration method.
Background
Deep learning algorithms are widely applied in fields such as computer vision, speech recognition, natural language processing, and bioinformatics owing to their superior performance. Neural network accelerators used for deep learning computations can process data sets, such as image data, to extract information of interest to a user, for example in face recognition, intrusion detection, living object detection, orientation detection, object classification, behavior counting, and the like.
At present, neural network accelerators for deep learning computation adopt a distributed storage scheme. Distributed storage has the advantage of high processing parallelism, but it also limits the flexibility with which deep learning algorithms can be implemented.
Disclosure of Invention
The embodiments of the application provide a neural network accelerator, a data processing device, and a neural network acceleration method.
The technical scheme of the application is realized as follows:
the application provides a neural network accelerator, comprising: a first processing unit and a second processing unit;
a first data transfer module in the first processing unit is connected with a second data transfer module in the second processing unit;
the first processing unit sends first target data to the second data transfer module through the first data transfer module, and/or the first processing unit receives second target data sent by the second data transfer module through the first data transfer module.
Optionally, the first processing unit further includes a first data storage module and a first data processing module;
the first data transfer module is respectively connected with the first data processing module and the first data storage module;
the first data storage module is used for storing data to be processed;
the first data processing module is used for processing the data to be processed to obtain a processing result.
Optionally, the first data transfer module includes a forwarding sub-module;
the forwarding sub-module is respectively connected with the first data storage module, the first data processing module and the second data transfer module;
the forwarding submodule is used for reading the processing result and/or the data to be processed; and sending the processing result and/or the data to be processed to the second data transfer module as the first target data.
Optionally, the first data transfer module includes an operation sub-module and a forwarding sub-module;
the operation submodule is respectively connected with the first data storage module, the first data processing module and the forwarding submodule;
the forwarding sub-module is also connected with the second data transfer module;
the operation submodule is used for performing operation processing on the processing result and/or the data to be processed to obtain a first operation result;
the forwarding sub-module is configured to send the first operation result to the second data transfer module as the first target data.
Optionally, the forwarding sub-module is further configured to receive the second target data sent by the second data transfer module;
the operation submodule is further configured to perform operation processing on at least one of the second target data, the data to be processed, and the processing result to obtain a second operation result.
Optionally, the neural network accelerator further comprises a third processing unit;
a third data transfer module of the third processing unit is connected with the forwarding sub-module;
the forwarding sub-module is further configured to send the second operation result to the third data transfer module.
Optionally, the operation submodule comprises at least one unary operation circuit and/or at least one binary operation circuit;
the operation type of the at least one unary operation circuit, and/or the at least one binary operation circuit is determined based on the configuration information.
Optionally, the second target data comprises at least one of:
data to be processed stored in a second data storage module in the second processing unit;
a processing result obtained by a second data processing module in the second processing unit;
fourth target data sent by a fourth processing unit, where a fourth data transfer module of the fourth processing unit is connected with the second data transfer module.
Optionally, the first processing unit and the second processing unit are any two adjacent processing units among a plurality of processing units included in the neural network accelerator.
The embodiment of the present application further provides a data processing apparatus, including the neural network accelerator provided in the above embodiment.
The embodiment of the application also provides a neural network acceleration method, which is applied to a neural network accelerator, wherein the neural network accelerator comprises a first processing unit and a second processing unit; a first data transfer module in the first processing unit is connected with a second data transfer module in the second processing unit;
the method comprises the following steps:
controlling the first processing unit to send first target data to the second data transfer module through the first data transfer module, so that the second processing unit processes the first target data;
and/or controlling the first processing unit to receive the second target data sent by the second data transfer module through the first data transfer module and process the second target data.
The neural network accelerator provided by the embodiment of the application can comprise a first processing unit and a second processing unit; the first data transfer module in the first processing unit is connected with the second data transfer module in the second processing unit; the first processing unit sends the first target data to the second data transfer module through the first data transfer module, and/or the first processing unit receives the second target data sent by the second data transfer module through the first data transfer module. That is to say, the processing unit in the embodiment of the present application may include a data relay module, and the data of the current processing unit is forwarded to other processing units through the data relay module, so that information sharing between different processing units is realized, and flexibility of a deep learning algorithm is improved.
Drawings
Fig. 1 is a schematic diagram illustrating an architecture of a neural network accelerator in the related art according to an embodiment of the present disclosure;
fig. 2 is a first schematic structural diagram of a neural network accelerator according to an embodiment of the present disclosure;
fig. 3 is a second schematic structural diagram of a neural network accelerator according to an embodiment of the present application;
fig. 4 is a third schematic structural diagram of a neural network accelerator according to an embodiment of the present application;
fig. 5 is a fourth schematic structural diagram of a neural network accelerator according to an embodiment of the present application;
fig. 6 is a fifth schematic structural diagram of a neural network accelerator according to an embodiment of the present application;
fig. 7 is a schematic diagram of data transmission provided in an embodiment of the present application;
fig. 8 is a sixth schematic structural diagram of a neural network accelerator according to an embodiment of the present application;
fig. 9 is a seventh schematic structural diagram of a neural network accelerator according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an exemplary shuffle operation provided by an embodiment of the present application;
fig. 11 is a flowchart illustrating a neural network acceleration method according to an embodiment of the present disclosure.
Detailed Description
So that the manner in which the features and elements of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
It should be noted that the terms "first", "second", and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
To facilitate understanding of the technical solutions of the embodiments of the present application, the related technologies are described below. These related technologies may be optionally combined with the technical solutions of the embodiments of the present application, and all such combinations fall within the protection scope of the embodiments of the present application.
A neural network accelerator for deep learning computations may include a convolution processing module, a vector processing module, and a storage module. The convolution processing module may include a multiply-accumulate array and complete convolution operations by means of this array; that is, the convolution processing module handles data processing with high arithmetic density. The vector processing module mainly handles data processing with low arithmetic density, such as pooling, size scaling, feature map splicing, and feature map data rearrangement. The storage module may adopt a hierarchical design; for example, it may be divided into working registers, Static Random-Access Memory (SRAM), Dynamic Random-Access Memory (DRAM), and the like.
In practical applications, a neural network accelerator can be optimized in three aspects: power consumption, performance, and area. When the modules in the neural network accelerator are partitioned, the correspondence between each processing module and each storage module needs to be as simple and clear as possible, which facilitates later placement and routing between modules and benefits the timing of data processing.
Referring to an architecture diagram of a neural network accelerator in the related art shown in fig. 1, the neural network accelerator for deep learning computation may include N independent processing units: processing unit 1 to processing unit N. Each processing unit may include a convolution processing module, a vector processing module, and a storage module. That is, the neural network accelerator in the related art adopts a distributed storage mode, and each operation module exclusively occupies a storage module. However, in this architecture, there is an obstacle to information sharing between the individual storage modules, limiting the flexibility of the deep learning algorithm.
Based on this, the embodiment of the present application provides a neural network accelerator, and in particular, the neural network accelerator may include a first processing unit and a second processing unit; the first data transfer module in the first processing unit is connected with the second data transfer module in the second processing unit; the first processing unit sends the first target data to the second data transfer module through the first data transfer module, and/or the first processing unit receives the second target data sent by the second data transfer module through the first data transfer module. That is to say, the processing unit in the embodiment of the present application may include a data relay module, and the data of the current processing unit is forwarded to other processing units through the data relay module, so that information sharing between different processing units is realized, and flexibility of a deep learning algorithm is improved.
In order to facilitate understanding of the technical solutions of the embodiments of the present application, the technical solutions of the present application are described in detail below with specific embodiments. The above related art can be arbitrarily combined with the technical solutions of the embodiments of the present application as alternatives, which all belong to the scope of protection of the embodiments of the present application. The embodiment of the present application includes at least part of the following contents.
Referring to a first structural schematic diagram of the neural network accelerator shown in fig. 2, the neural network accelerator provided in an embodiment of the present application may include: a first processing unit 21 and a second processing unit 22.
In some embodiments, the neural network accelerator may be an accelerator for deep learning computation, an accelerator for neural network computation, or the like, which is not limited in this application. The neural network accelerator in the embodiment of the present application may be disposed in a data processing device, a data processing chip, or an electronic device, and the electronic device may be a smart phone, a tablet computer, a personal computer, a server, or an industrial computer, and the like, which is not limited in the embodiment of the present application.
It should be noted that, a plurality of independent processing units may be included in the neural network accelerator, and the plurality of processing units may process data in parallel. In this embodiment, each processing unit may correspond to data of one channel group, that is, each processing unit may process data of a corresponding channel group. Here, one channel group may include one channel or a plurality of channels, which is not limited in this embodiment of the present application. The data of the channel group may be understood as feature information, and for example, in an image processing application, the output of the channel may be a feature map of an image to be processed.
In practical applications, data transmission and operations are mostly concentrated within a single channel group, and data interaction between channel groups is infrequent, that is, there is little data interaction between different processing units. Therefore, having multiple processing units process the data of different channel groups can improve the parallelism of data processing.
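As a behavioral illustration of this channel-group parallelism (a sketch only; the data layout, worker model, and names below are assumptions for illustration and are not part of the present application), each processing unit can be modeled as an independent worker that only touches its own channel group:

```python
# Sketch (assumed layout): channels are split into channel groups, and each
# processing unit works on its own group with no cross-unit traffic.
from concurrent.futures import ThreadPoolExecutor

def process_group(unit_id, channel_group):
    # Stand-in for the per-unit convolution/vector processing of one channel group.
    return unit_id, [sum(channel) for channel in channel_group]

channels = [[c + r for r in range(4)] for c in range(8)]          # 8 toy channels
groups = [channels[i:i + 2] for i in range(0, len(channels), 2)]  # 4 channel groups

with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    results = list(pool.map(process_group, range(len(groups)), groups))
print(results)  # each processing unit produced its group's result independently
```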
In some embodiments, the first processing unit 21 and the second processing unit 22 may be two different processing units adjacent to each other in the plurality of processing units. Here, the first processing unit 21 and the second processing unit 22 are not used to limit that only two processing units are included in the neural network accelerator.
In some embodiments, each of the plurality of processing units in the neural network accelerator may include a data relay module, and different processing units may share data through the data relay module.
Referring to fig. 2, the first processing unit 21 may include a first data relay module 211 therein, and the second processing unit 22 may include a second data relay module 221 therein.
In some embodiments, the first data relay module 211 in the first processing unit 21 is connected with the second data relay module 221 in the second processing unit 22.
In this embodiment, the first processing unit 21 sends the first target data to the second data relay module 221 through the first data relay module 211, and/or the first processing unit 21 receives the second target data sent by the second data relay module 221 through the first data relay module 211.
In some embodiments, the first target data may be data generated by the first processing unit 21, or the first processing unit 21 receives data sent by other processing units through the first data relay module 211. The second target data may be data generated by the second processing unit 22, or the second processing unit 22 receives data sent by other processing units through the second data relay module 221. The source of the first target data and the second target data is not limited in the embodiment of the present application.
Therefore, the processing unit in the embodiment of the application may include a data transfer module, and the data of the current processing unit is transferred to other processing units through the data transfer module, so that information sharing between different processing units is realized, and the flexibility of the deep learning algorithm is improved.
In some embodiments, referring to the second schematic structural diagram of the neural network accelerator shown in fig. 3, the first processing unit may further include a first data storage module 212 and a first data processing module 213;
the first data relay module 211 is connected with the first data storage module 212 and the first data processing module 213, respectively;
a first data storage module 212 for storing data to be processed;
the first data processing module 213 is configured to process the data to be processed to obtain a processing result.
In this embodiment, the first data relay module 211 may be connected to the first data storage module 212 through a first input/output path, the first data relay module 211 may be connected to the first data processing module 213 through a second input/output path, and the first data relay module 211 may also be connected to the second data relay module 221 in the second processing unit 22 through a third input/output path.
Based on this, the first data relay module 211 may read the to-be-processed data stored in the first data storage module 212 through the first input/output path, or write the obtained data into the first data storage module 212. The first data relay module 211 may read the processing result of the first data processing module 213 through the second input/output path, or write data to the first data processing module 213, so that the first data processing module 213 processes the written data. Further, the first data relay module 211 may send data to another processing unit (i.e., the second processing unit) through the third input/output path, or receive data sent by another processing unit.
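A minimal Python sketch of these three pairs of input/output paths follows; the class and method names are invented for illustration only, since the patent describes hardware paths rather than software objects:

```python
# Minimal sketch of a data relay module with its three pairs of I/O paths.
# Names and interfaces are assumptions made for illustration only.
class DataRelayModule:
    def __init__(self, storage, processor):
        self.storage = storage      # path 1: local data storage module (dict-like)
        self.processor = processor  # path 2: local data processing module (callable)
        self.neighbor = None        # path 3: relay module of the adjacent unit
        self.inbox = []             # data received from the adjacent unit

    def read_pending(self, key):
        return self.storage[key]                  # path 1, input direction

    def read_result(self, key):
        return self.processor(self.storage[key])  # path 2, input direction

    def send_to_neighbor(self, payload):
        self.neighbor.inbox.append(payload)       # path 3, output direction


unit1 = DataRelayModule({"x": [1, 2, 3]}, processor=sum)
unit2 = DataRelayModule({}, processor=sum)
unit1.neighbor = unit2
unit1.send_to_neighbor(unit1.read_result("x"))  # forward a processing result
print(unit2.inbox)  # [6] -- the second unit now holds the first unit's result
```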
It is understood that the data to be processed stored in the first data storage module 212 and the processing result processed by the first data processing module 213 can not only be transmitted in the current first processing unit 21, but also be transmitted to the second processing unit 22 by the first data relay module 211 for processing.
In the embodiment of the present application, the second processing unit 22 is similar to the first processing unit 21, and the second processing unit 22 may also include a second data storage module 222 and a second data processing module 223. The second data relay module 221 in the second processing unit 22 may be connected to the second data storage module 222 and the second data processing module 223, respectively. In this way, the data to be processed and the processing result corresponding to the second processing unit may not only be transmitted in the current second processing unit 22, but also be transmitted to the first processing unit 21 by the second data relay module 221 for processing.
That is to say, the first processing unit 21 and the second processing unit 22 can be connected by the first data relay module 211 and the second data relay module 221, and each processing unit has a memory module exclusively, so that the data of the processing unit can be processed in the current processing unit, thereby improving the operating efficiency of the data and facilitating the layout and wiring on the hardware. In addition, the data of each processing unit can be transmitted to other processing units through the data transfer module for processing, so that information sharing among different processing units is realized, and the flexibility of the deep learning algorithm is improved.
In some embodiments, the first data processing module 213 and the second data processing module 223 may each include a convolution processing sub-module and a vector processing sub-module. The vector processing sub-module can perform data operations with low arithmetic density on the data to be processed, such as pooling, size scaling, feature map splicing, and feature map data rearrangement. The convolution processing sub-module can comprise a multiply-accumulate array and is mainly used for performing convolution operations on the result of the vector processing sub-module to obtain a processing result corresponding to the data to be processed.
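The two stages of such a data processing module can be sketched as follows; the specific operations, sizes, and weights are assumptions chosen for illustration, not taken from the patent:

```python
# Illustrative sketch of the two stages inside a data processing module:
# a vector stage (low arithmetic density, here 2x downscaling by average pooling)
# followed by a multiply-accumulate stage (high arithmetic density).
def vector_stage(row):
    # Average-pool pairs of elements (a simple low-density vector operation).
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]

def mac_stage(row, weights):
    # One multiply-accumulate lane: the core primitive of a convolution array.
    return sum(x * w for x, w in zip(row, weights))

data_to_process = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
pooled = vector_stage(data_to_process)         # [2.0, 6.0, 10.0]
result = mac_stage(pooled, [0.5, 0.25, 0.25])  # 1.0 + 1.5 + 2.5 = 5.0
print(result)
```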
It should be noted that the first processing unit 21 and the second processing unit 22 may be any two adjacent processing units among the plurality of processing units in the neural network accelerator provided in the embodiment of the present application. That is to say, each processing unit in the neural network accelerator may be connected through its data relay module to the processing unit adjacent to it, so that the data of the plurality of processing units can form a data loop through the data relay modules, and the data corresponding to each processing unit can be transmitted to any processing unit for processing. Therefore, information sharing among different processing units is realized, and the flexibility of the deep learning algorithm is improved.
In some embodiments, the plurality of processing units may be all processing units in the neural network accelerator, or may be a part of processing units in the neural network accelerator. That is, the neural network accelerator may connect all the processing units included therein through the data relay module to form a complete data loop. In addition, the neural network accelerator may also connect some of the processing units included therein through the data relay module, and connect the other processing units of the neural network accelerator through the data relay module, so as to form two data loops, so as to reduce the delay of data transmission.
In some embodiments, referring to the third structural schematic diagram of the neural network accelerator shown in fig. 4, the first data relay module 211 in the neural network accelerator provided in the embodiment of the present application may include a forwarding sub-module 2111.
The forwarding sub-module 2111 is connected to the first data storage module 212, the first data processing module 213, and the second data relay module 221 respectively;
a forwarding sub-module 2111 for reading the processing result and/or the data to be processed; and sending the processing result and/or the data to be processed as the first target data to the second data relay module 221.
In this embodiment, the forwarding sub-module 2111 may be connected to the first data storage module 212 through the first input/output path. In this way, the forwarding sub-module 2111 may read the data to be processed from the first data storage module 212 by using the input path of the first input/output path, and write the data to the first data storage module 212 by using the output path of the first input/output path.
The forwarding sub-module 2111 may be connected to the first data processing module 213 through the second input/output path, so that the forwarding sub-module 2111 may write the data to be processed into the first data processing module 213 through an output path in the second input/output path, and read a processing result corresponding to the data to be processed from the first data processing module 213 through an input path in the second input/output path.
The forwarding sub-module 2111 may be connected to the second data relay module 221 in the second processing unit 22 through a third input/output path. In this way, the forwarding sub-module 2111 may send the first target data to the second data relay module 221 by using the output path in the third input/output path, and receive the second target data sent by the second data relay module 221 by using the input path in the third input/output path.
In some embodiments, the second data transit module 221 may also include a forwarding sub-module 2211. The third input/output path is specifically connected to the forwarding sub-module 2211 in the second data relay module 221.
In this embodiment, the forwarding sub-module 2111 may be controlled to connect at least some of the first input/output path, the second input/output path, and the third input/output path, so that the data to be processed and/or the processing result in the first processing unit 21 may be transmitted not only in the first processing unit 21, but also in the second processing unit 22 for processing.
In some embodiments, a central processing unit (e.g., CPU) may control the turning on and off of multiple input/output paths in the forwarding sub-module 2111 according to a specific deep learning algorithm to implement different functions of the deep learning algorithm.
In an example, the forwarding sub-module 2111 may turn on an input path of the first input-output path, and an output path of the third input-output path under the control of the CPU. In this way, the forwarding sub-module 2111 may read the to-be-processed data in the first data storage module 212 through the input path of the first input/output path, and send the read to-be-processed data as the first target data to the second data relay module 221 of the second processing unit 22 through the output path of the third input/output path.
In another example, the forwarding sub-module 2111 may turn on an input path of the second input-output path, and an output path of the third input-output path under the control of the CPU. In this way, the forwarding sub-module 2111 may read the processing result of the first data processing module 213 through the input path of the second input/output path, and send the read processing result as the first target data to the second data relay module 221 of the second processing unit 22 through the output path of the third input/output path.
In yet another example, the forwarding sub-module 2111 may turn on an input path of the first input-output path, an input path of the second input-output path, and an output path of the third input-output path under the control of the CPU. In this way, the forwarding sub-module 2111 can read the data to be processed in the first data storage module 212 through the input path of the first input/output path, and read the processing result of the first data processing module 213 through the input path of the second input/output path. Further, the forwarding sub-module 2111 may send the read data to be processed and the processing result as the first target data to the second data relay module 221 of the second processing unit 22 through the output path of the third input/output path.
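The three path configurations above can be summarized with a small enable-mask sketch; the flag names and bit layout are illustrative assumptions and are not specified by the patent:

```python
# Sketch of path gating in the forwarding sub-module under CPU control.
FROM_STORAGE   = 0b001  # input path of the first input/output path
FROM_PROCESSOR = 0b010  # input path of the second input/output path
TO_NEIGHBOR    = 0b100  # output path of the third input/output path

def gather_first_target_data(cfg, stored_data, processing_result):
    """Collect data on whichever input paths are enabled; forward only if path 3 is open."""
    if not cfg & TO_NEIGHBOR:
        return None
    payload = []
    if cfg & FROM_STORAGE:
        payload.append(stored_data)
    if cfg & FROM_PROCESSOR:
        payload.append(processing_result)
    return payload

# The third example in the text: forward both the stored data and the processing result.
print(gather_first_target_data(FROM_STORAGE | FROM_PROCESSOR | TO_NEIGHBOR,
                               "data-to-be-processed", "processing-result"))
```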
Therefore, the processing unit in the neural network accelerator provided by the embodiment of the application can forward the data of the current processing unit to other processing units through the data transfer module, so that information sharing among different processing units is realized, and the flexibility of a deep learning algorithm is improved.
In some embodiments, referring to the fourth structural diagram of the neural network accelerator shown in fig. 5, the first data relay module 211 may include a forwarding sub-module 2111 and an operation sub-module 2112;
the operation sub-module 2112 may be connected to the first data storage module 212, the first data processing module 213, and the forwarding sub-module 2111, respectively; the forwarding sub-module 2111 may also be connected to the second data relay module 221.
That is to say, in this embodiment of the application, the first data relay module 211 may be provided with an operation sub-module 2112 in addition to the forwarding sub-module 2111, which is used for implementing operations on data from different processing units.
In some embodiments, the operation sub-module 2112 may be connected to the first data storage module 212 through a fourth input/output path, and the input path of the fourth input/output path is used to read the data to be processed from the first data storage module 212, and the output path of the fourth input/output path is used to write the data to the first data storage module 212.
The arithmetic sub-module 2112 may be connected to the first data processing module 213 through a fifth input/output path, and read a processing result from the first data processing module 213 using an input path of the fifth input/output path, and write data to the first data processing module 213 using an output path of the fifth input/output path.
The operation sub-module 2112 may be connected to the forwarding sub-module 2111 through a sixth input/output path, the second target data forwarded by the second processing unit is read from the forwarding sub-module 2111 by using an input path of the sixth input/output path, and the operation result is transmitted to the forwarding sub-module 2111 by using an output path of the sixth input/output path.
Correspondingly, the forwarding sub-module 2111 may be connected to the second data relay module 221 in the second processing unit 22 through a seventh input/output path, the second target data is read from the second data relay module 221 by using the input path of the seventh input/output path, and the operation result sent by the operation sub-module 2112 is forwarded to the second data relay module 221 by using the output path of the seventh input/output path.
In some embodiments, the second data relay module 221 is similar to the first data relay module 211, and the second data relay module 221 may also include a forwarding sub-module and an operation sub-module, where the forwarding sub-module forwards data, and the operation sub-module performs operation processing on different data. The seventh input/output path is specifically connected to the forwarding sub-module in the second data relay module 221.
In some embodiments, the operation sub-module 2112 may perform operation processing on the processing result and/or the data to be processed to obtain a first operation result; correspondingly, the forwarding sub-module 2111 may send the first operation result to the second data relay module 221 as the first target data.
That is to say, the first data relay module 211 may perform operation processing on data in the current first processing unit, and forward the processing result to other processing units for processing.
Specifically, the operation submodule 2112 may be controlled to switch on at least part of the fourth input/output path to the seventh input/output path, so that the data to be processed and/or the processing result in the first processing unit 21 and the second target data sent by the second processing unit 22 can be operated in the operation submodule 2112.
In one example, the operation sub-module 2112 may turn on an input path of the fourth input/output path, and output paths corresponding to the sixth input/output path and the seventh input/output path, under the control of the CPU. In this way, the operation sub-module 2112 may read the to-be-processed data in the first data storage module 212 through the input path of the fourth input/output path, and perform an operation on the read to-be-processed data to obtain a first operation result. Further, the operation sub-module 2112 sends the obtained first operation result to the forwarding sub-module 2111 through the output path of the sixth input/output path, so that the forwarding sub-module 2111 can forward the first operation result as the first target data to the second data relay module 221 through the output path of the seventh input/output path.
In another example, the operation sub-module 2112 may turn on an input path of the fifth input/output path, and output paths corresponding to the sixth input/output path and the seventh input/output path, under the control of the CPU. In this way, the operation sub-module 2112 may read the processing result in the first data processing module 213 through the input path of the fifth input/output path, and perform an operation on the read processing result to obtain the first operation result. Further, the operation sub-module 2112 sends the obtained first operation result to the forwarding sub-module 2111 through the output path of the sixth input/output path, so that the forwarding sub-module 2111 can forward the first operation result as the first target data to the second data relay module 221 through the output path of the seventh input/output path.
In yet another example, the operation sub-module 2112 may turn on the corresponding input paths of the fourth and fifth input/output paths and the output paths corresponding to the sixth and seventh input/output paths under the control of the CPU. In this way, the operation sub-module 2112 may read the to-be-processed data stored in the first data storage module 212 through the input path of the fourth input/output path, read the processing result in the first data processing module 213 through the input path of the fifth input/output path, and perform an operation on the read to-be-processed data and the read processing result to obtain the first operation result. Further, the operation sub-module 2112 sends the obtained first operation result to the forwarding sub-module 2111 through the output path of the sixth input/output path, so that the forwarding sub-module 2111 can forward the first operation result as the first target data to the second data relay module 221 through the output path of the seventh input/output path.
In some embodiments, the forwarding sub-module 2111 is further configured to receive second target data sent by the second data transit module 221; correspondingly, the operation submodule 2112 is further configured to perform operation processing on at least one of the second target data, the data to be processed, and the processing result to obtain a second operation result.
That is, the first data relay module 211 may receive the second target data sent by the second processing unit, and perform operation processing on the second target data and the current data in the first processing unit by using the operation processing sub-module 2112, so as to obtain a second operation result.
In one example, the operation sub-module 2112 may turn on the input paths corresponding to the sixth input/output path and the seventh input/output path under the control of the CPU. In this way, the forwarding sub-module 2111 may read the second target data sent by the second data relay module 221 through the input path of the seventh input/output path, and the operation sub-module 2112 may read the second target data through the input path of the sixth input/output path and perform an operation on the second target data to obtain a second operation result.
In another example, the operation sub-module 2112 may turn on the corresponding input paths of the fourth input/output path, the sixth input/output path, and the seventh input/output path under the control of the CPU. In this way, the forwarding sub-module 2111 may read the second target data sent by the second data relay module 221 through the input path of the seventh input/output path. The operation sub-module 2112 may read the second target data through the input path of the sixth input/output path, read the data to be processed in the first data storage module 212 through the input path of the fourth input/output path, and perform operation processing on the second target data and the data to be processed to obtain a second operation result.
In yet another example, the operation sub-module 2112 may turn on the corresponding input paths of the fifth input/output path, the sixth input/output path, and the seventh input/output path under the control of the CPU. In this way, the forwarding sub-module 2111 may read the second target data sent by the second data relay module 221 through the input path of the seventh input/output path. The operation sub-module 2112 may read the second target data through the input path of the sixth input/output path, read the processing result in the first data processing module 213 through the input path of the fifth input/output path, and perform operation processing on the second target data and the processing result to obtain a second operation result.
It can be understood that, the neural network accelerator provided in the embodiment of the present application may add a data transfer module to each processing unit, so that data sharing may be performed between the processing units, and meanwhile, data of different processing units may be operated in the data transfer module, thereby further improving flexibility of algorithm implementation.
In some embodiments, after the operation submodule 2112 obtains the second operation result, the second operation result may be written into the first data storage module 212 through an output path of the fourth input/output path, or the second operation result may be transmitted to the first data processing module 213 for processing through an output path of the fifth input/output path, or the second operation result may be sent to the forwarding submodule 2111 through an output path of the sixth input/output path, so that the forwarding submodule 2111 forwards the obtained second operation result to another processing unit, which is not limited in this embodiment of the present application.
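A rough sketch of this "combine remote and local data, then route the second operation result" behavior is given below; the destination labels and the choice of an addition operation are illustrative assumptions, since the patent only states that the result may go to local storage, to the processing module, or onward to another unit:

```python
# Sketch of the operation sub-module combining remote and local data and routing
# the second operation result to one of its three possible destinations.
def compute_and_route(second_target_data, local_data, op, destination,
                      storage, processor_queue, forward):
    second_result = op(second_target_data, local_data)   # operation sub-module
    if destination == "storage":        # output path of the fourth I/O path
        storage.append(second_result)
    elif destination == "processor":    # output path of the fifth I/O path
        processor_queue.append(second_result)
    else:                               # via the forwarding sub-module to another unit
        forward(second_result)
    return second_result

storage, queue, sent = [], [], []
compute_and_route(5, 7, lambda a, b: a + b, "storage", storage, queue, sent.append)
print(storage)  # [12] -- the result was written back to the local data storage module
```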
Referring to the schematic structural diagram five of the neural network accelerator shown in fig. 6, the neural network accelerator provided in the embodiment of the present application may further include a third processing unit 23.
Here, the third processing unit 23 may be a processing unit different from the first processing unit 21 and the second processing unit 22 among a plurality of processing units in the neural network accelerator.
In some embodiments, the first data relay module 211 in the first processing unit 21 may be connected with the third data relay module 231 in the third processing unit 23. Specifically, the forwarding sub-module 2111 in the first data relay module 211 may be connected to the third data relay module 231 through the eighth input/output path. The forwarding sub-module 2111 may read the data in the third data relay module 231 through the input path of the eighth input/output path, and may send the data to the third data relay module through the output path of the eighth input/output path.
In some embodiments, the third processing unit 23 may be located adjacent to the first processing unit 21, while the third processing unit 23 is not located adjacent to the second processing unit 22. That is, the third processing unit 23, the first processing unit 21, and the second processing unit 22 are arranged in sequence, and the three processing units may be connected by respective data relay modules to form a data loop.
Based on the above structure, in some embodiments, the forwarding sub-module 2111 is further configured to send the second operation result to the third data relay module 231.
Specifically, after the operation submodule 2112 in the first data relay module 211 obtains the second operation result, the second operation result may be sent to the forwarding submodule 2111 through the output path of the sixth input/output path. In this way, the forwarding sub-module 2111 may send the obtained second operation result to the third data relay module 231 through the output path of the eighth input/output path.
Illustratively, a certain deep learning algorithm requires that data A of channel 0 be bit-inverted and then added to data B of channel 1, and that the obtained result be stored in the storage module corresponding to channel 2, that is, the operation Ā + B is implemented (where Ā denotes the bitwise inversion of A). If channel 0 corresponds to processing unit 0, channel 1 corresponds to processing unit 1, and channel 2 corresponds to processing unit 2, then, referring to the data transmission diagram shown in fig. 7, the operation sub-module in processing unit 0 can read data A from data storage module 0 and perform a bitwise inversion operation to obtain data Ā. Next, the forwarding sub-module of processing unit 0 forwards the data Ā to the forwarding sub-module in processing unit 1. The operation sub-module of processing unit 1 can read the data Ā held by its forwarding sub-module and the data B currently stored in processing unit 1, and add the received data Ā to data B to obtain data Ā + B. Further, the forwarding sub-module in processing unit 1 may forward the data Ā + B to the forwarding sub-module of processing unit 2, which can store the received data Ā + B in the data storage module of the current processing unit 2. Thus, the operation Ā + B is realized.
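A compact simulation of this three-unit dataflow is shown below; Python is used purely to illustrate the hardware dataflow, and the unit/method names and the 8-bit data width are assumptions made for the sketch:

```python
# Simulation sketch of the example above: unit 0 inverts A bitwise, unit 1 adds B,
# unit 2 stores the result.
WIDTH = 8  # assume 8-bit data so that bitwise inversion is well defined

class Unit:
    def __init__(self, stored=None):
        self.storage = stored          # local data storage module
        self.inbox = None              # data received via the data relay module

    def forward(self, dst, value):     # forwarding sub-module
        dst.inbox = value

unit0, unit1, unit2 = Unit(stored=0b00001111), Unit(stored=0b00000001), Unit()

inverted = ~unit0.storage & ((1 << WIDTH) - 1)   # unit 0: bitwise inversion of A
unit0.forward(unit1, inverted)
summed = unit1.inbox + unit1.storage             # unit 1: add B to the received data
unit1.forward(unit2, summed)
unit2.storage = unit2.inbox                      # unit 2: store the final result
print(bin(unit2.storage))  # 0b11110001  (= inverted A plus B for A=0b00001111, B=1)
```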
Therefore, the neural network accelerator provided by the embodiment of the application adds the data transfer module to each processing unit, so that data of each processing unit can be shared, and the problem that a distributed storage unit in the neural network accelerator for deep learning calculation cannot share data is solved. In addition, an operation submodule is further arranged in the data transfer module, data of different processing units can be operated, and support of a hardware architecture to an algorithm is expanded.
In some embodiments, the operation sub-module 2112 may include at least one unary operation circuit and/or at least one binary operation circuit. The operation sub-module 2112 may perform a unary operation on data, such as bitwise inversion, power, or square, using a unary operation circuit. The operation sub-module 2112 may implement a binary operation between two pieces of data, such as addition, subtraction, multiplication, or division, using a binary operation circuit. For example, the operation sub-module 2112 may perform an addition operation on the second target data from the second processing unit 22 and the current data to be processed in the first processing unit 21, and store the result of the addition in the first data storage module 212 of the first processing unit 21.
It should be noted that the operation sub-module 2112 may include only a unary operation circuit, only a binary operation circuit, and may also include both the unary operation circuit and the binary operation circuit, which is not limited in this embodiment of the present application. In addition, the number of the unary arithmetic circuits and the binary arithmetic circuits is not limited in the embodiments of the present application.
In some embodiments, when the operation sub-module 2112 includes a plurality of operation circuits, the central processing unit may control which operation circuits in the operation sub-module 2112 operate according to the requirements of the specific deep learning algorithm, thereby controlling the operation sub-module 2112 to implement different operations.
In some embodiments, the operation type of the at least one unary operation circuit, and/or the at least one binary operation circuit may be determined based on the configuration information. That is, the central processing unit may receive configuration information of a user, and the central processing unit may configure the operation type implemented by the operation circuit in the operation sub-module 2112 according to the received configuration information. For example, a binary operation circuit is configured by the configuration information to realize a subtraction operation or an addition operation. Thus, the operation function of the operation submodule 2112 can be expanded, and the flexibility of the neural network accelerator is improved.
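As a rough software analogue of these configurable operation circuits (the available operations and the configuration format below are assumptions for illustration), the selection by configuration information can be modeled as a table lookup:

```python
# Sketch of configuration-selected operation circuits.
UNARY_OPS = {
    "bitwise_not": lambda a: ~a,
    "square": lambda a: a * a,
}
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def configure_operation(config):
    """Return the operation selected by the configuration information."""
    kind, name = config["kind"], config["name"]
    return UNARY_OPS[name] if kind == "unary" else BINARY_OPS[name]

op = configure_operation({"kind": "binary", "name": "add"})
print(op(3, 4))  # 7 -- the binary circuit is now configured as an adder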
In some embodiments, the first target data sent by the first processing unit 21 to the second data relay module 221 of the second processing unit 22 through the first data relay module 211 may include at least one of the following:
the data to be processed stored in the first data storage module;
the processing result obtained by the first data processing module;
and third target data sent by the third processing unit.
Correspondingly, in some embodiments, the second target data sent by the second processing unit 22 to the first data relay module 211 of the first processing unit 21 through the second data relay module 221 may include at least one of the following:
the data to be processed stored in a second data storage module of the second processing unit;
the processing result obtained by a second data processing module of the second processing unit;
and fourth target data sent by the fourth processing unit.
Referring to fig. 8, which is a schematic diagram of a neural network accelerator, the fourth processing unit 24 may be a processing unit located adjacent to the second processing unit 22. The fourth processing unit 24 may include a fourth data relay module 241, a fourth data storage module 242, and a fourth data processing module 243, similar to the other processing units. The fourth data relay module 241 may be connected to the second data relay module 221 of the second processing unit 22. Based on this, the first processing unit 21, the second processing unit 22, the third processing unit 23, and the fourth processing unit 24 are connected through the data relay module to form a data loop. Therefore, each processing unit can acquire the data of any processing unit in the data loop, and the problem that the data cannot be shared among different processing units is solved. In addition, when the neural network accelerator provided by the embodiment of the application needs to expand the processing unit, data sharing can be performed with other processing units only by adding one data transfer module to the expanded processing unit, the design of the existing data loop does not need to be changed, and the flexibility of hardware expansion is improved.
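The data loop formed by the data relay modules can be sketched as a ring; the hop-by-hop forwarding routine below is an assumption made for illustration, since the patent only requires that adjacent relay modules be connected:

```python
# Sketch of a data loop: N units, each connected to the next via its relay module,
# so data from any unit can reach any other unit by hop-by-hop forwarding.
class RingUnit:
    def __init__(self, uid):
        self.uid = uid
        self.next = None     # relay connection to the adjacent unit
        self.received = []

def build_ring(n):
    units = [RingUnit(i) for i in range(n)]
    for i, u in enumerate(units):
        u.next = units[(i + 1) % n]   # adding a unit only adds one more link
    return units

def send(units, src, dst, payload):
    u = units[src]
    while u.uid != dst:               # forward around the loop until it arrives
        u = u.next
    u.received.append(payload)

ring = build_ring(4)                  # units 21, 22, 23, 24 in the text map to 0..3
send(ring, 0, 2, "second operation result")
print(ring[2].received)               # ['second operation result']
```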
The neural network accelerator provided in the embodiments of the present application is further described below with reference to specific application scenarios.
Referring to a seventh structural schematic diagram of the neural network accelerator shown in fig. 9, the neural network accelerator provided in the embodiment of the present application may include N processing units, and each processing unit may include a convolution processing module, a vector processing module, a data relay module, and a data storage module.
The data relay module in each processing unit is provided with four pairs of input and output paths. The first pair of input/output paths may be used to read data from, and write data to, the data storage module in the current processing unit. The second pair of input/output paths may be used to read data from, or write data to, the data processing module in the current processing unit. The third and fourth pairs of input/output paths are used by the data relay module to forward data to the two adjacent processing units. In this way, the processing units are connected in a ring through their data relay modules to form a data loop, and the data in each processing unit can be used by the data processing module of that processing unit or transmitted to other processing units for processing.
That is to say, each processing unit can independently occupy one data storage module, and the architecture of distributed storage and distributed operation is utilized, so that the data of each processing unit can be processed in parallel, the efficiency of data processing is improved, the overall layout of hardware is regular, and the later-stage layout and wiring are facilitated. In addition, each processing unit is connected through the data transfer module, so that information sharing among different processing units is realized, and the flexibility of the deep learning algorithm is improved.
Based on the schematic structural diagram of the neural network accelerator shown in fig. 9, when the deep learning algorithm needs to perform a shuffle operation on the data of channels 0 to 4 (corresponding to processing units 0 to 4), processing units 0 to 3 may read the data in their respective data storage modules through their respective data relay modules, and forward the data to the data relay module corresponding to processing unit 4. In this way, processing unit 4 can combine the data of processing units 0 to 3 with the data of the current processing unit 4 and write the combined data into the data storage module of processing unit 4. Thus, referring to the shuffle operation diagram shown in fig. 10, the central processing unit may perform a shuffle operation on the data of each channel, that is, rearrange the data of each channel.
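A toy sketch of this shuffle follows (channel data gathered at one unit and rearranged); the interleaving pattern chosen here is only one possible shuffle and the data labels are invented for illustration:

```python
# Sketch of the shuffle example: units 0-3 forward their channel data to unit 4,
# which combines it with its own data and rearranges (shuffles) the whole set.
channel_data = {u: [f"u{u}_d{i}" for i in range(2)] for u in range(5)}  # units 0..4

gathered = []
for unit in range(4):                       # units 0-3 forward via relay modules
    gathered.append(channel_data[unit])
gathered.append(channel_data[4])            # unit 4 adds its own local data

# Rearrange: interleave element 0 of every channel group, then element 1, etc.
shuffled = [group[i] for i in range(2) for group in gathered]
print(shuffled)
# ['u0_d0', 'u1_d0', 'u2_d0', 'u3_d0', 'u4_d0', 'u0_d1', 'u1_d1', ...]
```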
To sum up, in the neural network accelerator provided in the embodiment of the present application, each processing unit monopolizes one data storage module, so that the data processing efficiency is improved, and the layout and the wiring are also facilitated. In addition, each processing unit is connected through the data transfer module to share data, so that the flexibility of the deep learning algorithm is improved.
The embodiment of the present application further provides a data processing apparatus, and the data processing apparatus may include the neural network accelerator provided in the above embodiment.
The data processing apparatus can implement the functions of the neural network accelerator in each of the above structures in the embodiments of the present application; for brevity, details are not described here again.
In the embodiment of the present application, the data processing device may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, and a microprocessor. The embodiment of the present application does not limit this.
The embodiment of the application also provides a neural network acceleration method, which can be applied to the neural network accelerator provided by the embodiment. Referring to the flowchart of the neural network acceleration method shown in fig. 11, the method may include the following steps:
Step 1110, controlling the first processing unit to send the first target data to the second data transfer module through the first data transfer module, so that the second processing unit processes the first target data; and/or controlling the first processing unit to receive the second target data sent by the second data transfer module through the first data transfer module and process the second target data.
It can be understood that, in the neural network acceleration method provided in the embodiment of the present application, the first processing unit may receive the second target data of the second processing unit through the first data relay module for processing, and meanwhile, the second processing unit may also receive the first target data of the first processing unit through the second data relay module for processing. Therefore, information sharing among different processing units is realized, and the flexibility of the deep learning algorithm is improved.
In some embodiments, the first processing unit in the neural network accelerator further comprises a first data storage module and a first data processing module;
the first data transfer module is respectively connected with the first data processing module and the first data storage module;
the first data storage module is used for storing data to be processed;
the first data processing module is used for processing the data to be processed to obtain a processing result.
In some embodiments, the first data transit module includes a forwarding sub-module;
the forwarding sub-module is respectively connected with the first data storage module, the first data processing module and the second data transfer module;
in step 1110, the first processing unit is controlled to send the first target data to the second data forwarding module through the first data forwarding module, and the following steps may be implemented:
controlling a forwarding submodule in the first data forwarding module to read the processing result and/or the data to be processed; and sending the processing result and/or the data to be processed to the second data transfer module as the first target data.
In some embodiments, the first data transit module includes an operation sub-module and a forwarding sub-module;
the operation submodule is respectively connected with the first data storage module, the first data processing module and the forwarding submodule;
the forwarding sub-module is also connected with the second data transfer module;
in step 1110, controlling the first processing unit to send the first target data to the second data transfer module through the first data transfer module may be implemented through the following steps:
controlling the operation sub-module in the first data transfer module to perform operation processing on the processing result and/or the data to be processed to obtain a first operation result;
and controlling the forwarding sub-module in the first data transfer module to send the first operation result to the second data transfer module as the first target data.
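The following sketch illustrates this compute-then-forward path under the assumption that the operation is an elementwise addition followed by a ReLU; the embodiments do not fix the operation, so both the operation choice and the names are illustrative.

```python
def operation_submodule(data_to_process=None, processing_result=None):
    """Combine and/or transform local data into a first operation result."""
    operands = [v for v in (data_to_process, processing_result) if v is not None]
    combined = ([sum(vals) for vals in zip(*operands)]   # binary op: elementwise add
                if len(operands) > 1 else operands[0])
    return [max(0, x) for x in combined]                  # unary op: ReLU


def forward_to_second_unit(first_operation_result, second_inbox):
    # the forwarding sub-module ships the operation result, not the raw data
    second_inbox.append(first_operation_result)


inbox_pu2 = []
first_operation_result = operation_submodule([1, -2, 3], [-4, 5, 6])
forward_to_second_unit(first_operation_result, inbox_pu2)
print(inbox_pu2)   # [[0, 3, 9]]
```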
In some embodiments, in step 1110, controlling the first processing unit to receive, through the first data transfer module, the second target data sent by the second data transfer module, and processing the second target data may be implemented through the following steps:
controlling the forwarding sub-module to receive the second target data sent by the second data transfer module;
controlling the operation sub-module to perform operation processing on at least one of the second target data, the data to be processed, and the processing result to obtain a second operation result;
and controlling the first data processing module to process the second operation result.
In some embodiments, the neural network accelerator further comprises a third processing unit; a third data transfer module of the third processing unit is connected with the forwarding sub-module;
the method further comprises the following steps:
and controlling the forwarding sub-module in the first data transfer module to send the second operation result to the third data transfer module.
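A hedged sketch of this receive-fuse-relay flow, assuming elementwise addition as the fusion operation and a simple list as the third data transfer module's inbox; all names and operations are illustrative.

```python
def receive_fuse_and_relay(second_target_data, local_data, third_inbox=None):
    """Fuse remote data with local data, process it, and optionally relay it onward."""
    # operation sub-module: elementwise fusion of the remote and local operands
    second_operation_result = [a + b for a, b in zip(second_target_data, local_data)]
    # first data processing module: stand-in processing of the fused result
    processed = [x * 10 for x in second_operation_result]
    # forwarding sub-module may also pass the second operation result to a third unit
    if third_inbox is not None:
        third_inbox.append(second_operation_result)
    return processed


inbox_pu3 = []
out = receive_fuse_and_relay([1, 2, 3], [10, 20, 30], inbox_pu3)
print(out, inbox_pu3)   # [110, 220, 330] [[11, 22, 33]]
```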
In some embodiments, the operation sub-module comprises at least one unary operation circuit and/or at least one binary operation circuit;
the operation type of the at least one unary operation circuit and/or the at least one binary operation circuit is determined based on the configuration information.
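One way such configuration-driven selection could look in a behavioural model is sketched below; the operation tables and configuration keys are assumptions for illustration, not part of the embodiments.

```python
# Assumed operation tables; real circuits would be fixed at configuration time.
UNARY_OPS = {"relu": lambda x: max(0, x), "negate": lambda x: -x}
BINARY_OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def build_operation_submodule(config):
    """Select the unary/binary circuits according to the configuration information."""
    unary = UNARY_OPS[config.get("unary", "relu")]
    binary = BINARY_OPS[config.get("binary", "add")]

    def run(lhs, rhs):
        return [unary(binary(a, b)) for a, b in zip(lhs, rhs)]

    return run


op = build_operation_submodule({"unary": "relu", "binary": "mul"})
print(op([1, -2, 3], [4, 5, -6]))   # [4, 0, 0]
```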
In some embodiments, the second target data comprises at least one of:
data to be processed stored in a second data storage module of the second processing unit;
a processing result obtained by a second data processing module of the second processing unit;
fourth target data sent by a fourth processing unit, where a fourth data transfer module of the fourth processing unit is connected with the second data transfer module.
In some embodiments, the first processing unit and the second processing unit are any two adjacent processing units in a plurality of processing units included in the neural network accelerator.
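A small sketch of this adjacency, assuming the processing units form a chain in which each data transfer module is wired only to its immediate neighbours, so that data from a non-adjacent unit reaches its destination by hopping through the units in between; the chain length and the hop routine are illustrative assumptions.

```python
def build_chain(n):
    # one inbox per processing unit; unit i is adjacent to units i - 1 and i + 1
    return [[] for _ in range(n)]


def hop(inboxes, src, dst, data):
    """Move data one adjacent unit at a time until it reaches the destination."""
    pos = src
    while pos != dst:
        pos += 1 if dst > pos else -1
        inboxes[pos].append(data)     # data lands in the next unit's inbox
    return inboxes[dst][-1]


units = build_chain(4)                # e.g. PU1, PU2, PU3, PU4
print(hop(units, 3, 0, "fourth target data"))   # relayed PU4 -> PU3 -> PU2 -> PU1
```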
It should be understood that the neural network accelerator in the embodiment of the present application is the same as the neural network accelerator in the above embodiments, and for brevity, the description thereof is omitted here.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or at least two units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, if the integrated units described above in the present application are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence, or the portions thereof contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
It should be noted that: the technical solutions described in the embodiments of the present application can be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A neural network accelerator, comprising: a first processing unit and a second processing unit;
a first data transfer module in the first processing unit is connected with a second data transfer module in the second processing unit;
the first processing unit sends first target data to the second data transfer module through the first data transfer module, and/or the first processing unit receives second target data sent by the second data transfer module through the first data transfer module.
2. The neural network accelerator of claim 1, wherein the first processing unit further comprises a first data storage module and a first data processing module;
the first data transfer module is respectively connected with the first data processing module and the first data storage module;
the first data storage module is used for storing data to be processed;
the first data processing module is used for processing the data to be processed to obtain a processing result.
3. The neural network accelerator of claim 2, wherein the first data transfer module comprises a forwarding sub-module;
the forwarding sub-module is respectively connected with the first data storage module, the first data processing module and the second data transfer module;
the forwarding submodule is used for reading the processing result and/or the data to be processed; and sending the processing result and/or the data to be processed to the second data transfer module as the first target data.
4. The neural network accelerator according to claim 2, wherein the first data transfer module comprises an operation sub-module and a forwarding sub-module;
the operation submodule is respectively connected with the first data storage module, the first data processing module and the forwarding submodule;
the forwarding sub-module is also connected with the second data transfer module;
the operation submodule is used for performing operation processing on the processing result and/or the data to be processed to obtain a first operation result;
the forwarding sub-module is configured to send the first operation result to the second data transfer module as the first target data.
5. The neural network accelerator of claim 4,
the forwarding sub-module is further configured to receive the second target data sent by the second data transfer module;
the operation submodule is further configured to perform operation processing on at least one of the second target data, the data to be processed, and the processing result to obtain a second operation result.
6. The neural network accelerator of claim 5, further comprising a third processing unit;
a third data transfer module of the third processing unit is connected with the forwarding sub-module;
the forwarding sub-module is further configured to send the second operation result to the third data transfer module.
7. The neural network accelerator according to any one of claims 4 to 6, wherein the operation sub-module comprises at least one unary operation circuit and/or at least one binary operation circuit;
the operation type of the at least one unary operation circuit and/or the at least one binary operation circuit is determined based on the configuration information.
8. The neural network accelerator of any one of claims 1-5, wherein the second target data comprises at least one of:
data to be processed stored in a second data storage module of the second processing unit;
a processing result obtained by a second data processing module of the second processing unit;
fourth target data sent by a fourth processing unit, wherein a fourth data transfer module of the fourth processing unit is connected with a second data transfer module of the second processing unit.
9. The neural network accelerator of claim 1, wherein the first processing unit and the second processing unit are any two adjacent processing units of a plurality of processing units included in the neural network accelerator.
10. A data processing apparatus comprising a neural network accelerator as claimed in any one of claims 1 to 9.
11. A neural network acceleration method, applied to a neural network accelerator, wherein the neural network accelerator comprises a first processing unit and a second processing unit, and a first data transfer module in the first processing unit is connected with a second data transfer module in the second processing unit;
the method comprises the following steps:
controlling the first processing unit to send first target data to the second data transfer module through the first data transfer module, so that the second processing unit processes the first target data;
and/or controlling the first processing unit to receive the second target data sent by the second data transfer module through the first data transfer module and process the second target data.
CN202111020115.3A 2021-09-01 2021-09-01 Neural network accelerator, data processing device and neural network acceleration method Pending CN113780541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111020115.3A CN113780541A (en) 2021-09-01 2021-09-01 Neural network accelerator, data processing device and neural network acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111020115.3A CN113780541A (en) 2021-09-01 2021-09-01 Neural network accelerator, data processing device and neural network acceleration method

Publications (1)

Publication Number Publication Date
CN113780541A true CN113780541A (en) 2021-12-10

Family

ID=78840615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111020115.3A Pending CN113780541A (en) 2021-09-01 2021-09-01 Neural network accelerator, data processing device and neural network acceleration method

Country Status (1)

Country Link
CN (1) CN113780541A (en)

Similar Documents

Publication Publication Date Title
EP3579152B1 (en) Computing apparatus and related product
EP0086052B1 (en) Segregator functional plane for use in a modular array processor
CN109522052B (en) Computing device and board card
CN109657782B (en) Operation method, device and related product
CN110163361B (en) Computing device and method
CN108170640B (en) Neural network operation device and operation method using same
EP0112885B1 (en) Interconnecting plane for modular array processor
CN108681773B (en) Data operation acceleration method, device, terminal and readable storage medium
US4524428A (en) Modular input-programmable logic circuits for use in a modular array processor
CN111626413A (en) Computing device and method
CN110163349B (en) Network model calculation method and device
JP2007522571A (en) A reconfigurable switching device for parallel computation of arbitrary algorithms
CN110059797B (en) Computing device and related product
CN110059809B (en) Computing device and related product
CN111079908A (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN113780541A (en) Neural network accelerator, data processing device and neural network acceleration method
NL1001663C2 (en) Device for two-dimensional discrete cosine transformation.
CN109711538B (en) Operation method, device and related product
WO2016109571A1 (en) Devices for time division multiplexing of state machine engine signals
CN114168106A (en) Data processing method, device and equipment based on convolutional neural network
EP3506197B1 (en) Circuit unit, circuit module and apparatus for data statistics
CN111078625B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078624B (en) Network-on-chip processing system and network-on-chip data processing method
CN111178373B (en) Operation method, device and related product
CN111078623B (en) Network-on-chip processing system and network-on-chip data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination