CN114997392A - Architecture and architectural methods for neural network computing - Google Patents

Architecture and architectural methods for neural network computing

Info

Publication number
CN114997392A
CN114997392A (application CN202210926707.XA)
Authority
CN
China
Prior art keywords
data
calculation
neural network
receiving
architecture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210926707.XA
Other languages
Chinese (zh)
Other versions
CN114997392B (en)
Inventor
贺新 (He Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Picture Film And Television Technology Co ltd
Original Assignee
Chengdu Picture Film And Television Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Picture Film And Television Technology Co ltd filed Critical Chengdu Picture Film And Television Technology Co ltd
Priority to CN202210926707.XA priority Critical patent/CN114997392B/en
Publication of CN114997392A publication Critical patent/CN114997392A/en
Application granted granted Critical
Publication of CN114997392B publication Critical patent/CN114997392B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an architecture and an architecture method for neural network computing, and relates to the technical field of artificial intelligence. The neural network architecture method comprises the following steps: forming a plurality of granular computing units into calculation groups; configuring a parameter set for each of the calculation groups according to computing function requirements; setting the calculation groups having the same parameter set configuration as one calculation matrix; and having each calculation matrix retrieve data from the memory according to its corresponding parameter set, perform the computation on the data together with the relevant parameters in the parameter set, and return the computation result to the memory.

Description

Architecture and architectural methods for neural network computing
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an architecture and an architecture method for neural network computing.
Background
Neural network model computation analyzes and processes data such as images, sounds, and text by simulating the neuron processing mechanism of the human brain, and is an important part of the Artificial Intelligence (AI) field.
The Convolutional Neural Network (CNN) algorithm used in neural network model computation is widely applied in many neural network fields because of its simple structure, strong adaptability, and high robustness. However, because convolutional neural network data calculation is complex, how to perform high-speed operations on convolutional neural network data is a key research focus in the industry. In existing neural network application platforms, general-purpose processors such as the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), and the Field Programmable Gate Array (FPGA) can flexibly control the whole data calculation process through instructions, but many instructions are needed to control a single neural unit to perform an operation, which results in a huge number of instructions in the whole architecture, low data transmission efficiency, and low calculation speed. Dedicated neural network processors, by contrast, can multiplex data efficiently and are therefore highly efficient, but they are difficult to adjust after being designed.
Therefore, it is desirable to provide an architecture design and method for neural network data computation that are both efficient and flexible.
Disclosure of Invention
To address one or more problems in the prior art, embodiments of the present application provide an architecture and an architectural method for neural network computing.
One aspect of the present application provides an architecture for neural network computing. The architecture includes: a neural network control unit for receiving instructions from an external data memory and parsing the instructions to generate control parameters representing the specific execution behaviors of the instructions; a task allocation processing unit for receiving the control parameters, parsing, disassembling, and grouping the control parameters to generate multiple groups of configuration parameters, and for receiving operation data from the external data memory; and a calculation module coupled to the task allocation processing unit, for receiving the multiple groups of configuration parameters and the operation data, performing operation processing on them, and returning the result data and a status report after the operation processing to the task allocation processing unit, wherein the calculation module includes a plurality of compute arrays, each compute array receiving one of the multiple groups of configuration parameters.
Another aspect of the present application provides an architecture for neural network computing, comprising: an interface circuit for receiving user information and processing the user information; an external processor for receiving the processed user signal transmitted by the interface circuit and generating instructions and data; an external memory for receiving and storing the instructions and data from the external processor; and the architecture as described above.
Yet another aspect of the present application provides an architectural method for neural network computing, comprising: forming every i × j granular computing units into a calculation group, wherein i is an integer greater than or equal to 1, j is an integer greater than or equal to 1, and i × j is an integer greater than or equal to 2; configuring a parameter set for each of m calculation groups according to computing function requirements, wherein m is an integer greater than 1; setting the calculation groups having the same parameter set configuration among the m calculation groups as one calculation matrix; and having each calculation matrix retrieve data from the memory according to a first parameter in its corresponding parameter set, perform an operation on the data and a second parameter in the corresponding parameter set, and return the operation result to the memory.
In the architecture and architectural method disclosed in the present application, the architecture comprises: a neural network control unit for receiving instructions from an external data memory and parsing the instructions to generate control parameters representing the specific execution behaviors of the instructions; a task allocation processing unit for receiving the control parameters, parsing, disassembling, and grouping the control parameters to generate multiple groups of configuration parameters, and for receiving operation data from the external data memory; and a calculation module coupled to the task allocation processing unit, for receiving the multiple groups of configuration parameters and the operation data, performing operation processing on them, and returning the result data and a status report to the task allocation processing unit, wherein the calculation module includes a plurality of compute arrays, each compute array receiving one of the multiple groups of configuration parameters. When this architecture is applied, the number of instructions and the number of configuration parameters are greatly reduced, so that during transmission the configuration parameters occupy very little register space and more space is left for data. Meanwhile, the architecture can dynamically allocate computing groups in batches to realize different computing functions, so that the data computing efficiency of the whole architecture system is greatly improved.
Drawings
FIG. 1 is a block diagram of a neural network system architecture for neural network computing in accordance with an embodiment of the present application;
FIG. 2 is a schematic block diagram of a neural network compute group provided in an embodiment of the present application;
FIG. 3 is a schematic block diagram of a granular computing unit for performing convolution calculations according to an embodiment of the present application;
FIG. 4 is a schematic block diagram illustrating a task allocation processing unit provided in accordance with an embodiment of the present application;
FIG. 5 illustrates a high-speed neural network computation method provided in accordance with an embodiment of the present application.
Reference numerals: 100, neural network system architecture; 12, task allocation processing unit.
Detailed Description
In order to make the technical solution and advantages of the present application more clear, specific embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments described herein are only for illustration and are not intended to limit the present application. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. It will be apparent, however, to one skilled in the art that these specific details need not be employed to practice the present application. In other instances, well-known circuits, materials, or methods have not been described in detail in order to avoid obscuring the present application.
Throughout the specification, reference to "one embodiment," "an embodiment," "one example," or "an example" means: the particular features, structures, or characteristics described in connection with the embodiment or examples are included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment," "in an embodiment," "one example" or "an example" in various places throughout this specification are not necessarily all referring to the same embodiment or example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale. It should be understood that like reference numerals refer to like elements. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
FIG. 1 illustrates a neural network system architecture 100 for neural network computing, according to an embodiment of the present application. As shown in fig. 1, the neural network system architecture 100 includes a neural network control unit 11, a task allocation processing unit 12, a calculation module 13, and an external data memory 14.
In the embodiment shown in fig. 1, the neural-network control unit 11 receives instructions COMD from an external data memory 14. In one embodiment, the instructions include, but are not limited to, data handling instructions (e.g., addresses of data in an external memory, addresses of parameters in the external memory, etc.), arithmetic instructions (e.g., data reading amount, calculation data amount, vector and scalar in matrix operation, etc.), control instructions (e.g., functions to be performed by a neural network computing unit, etc.), and logical operation instructions.
In one embodiment, the neural network control unit 11 parses and translates instructions into actions that require specific execution and generates control parameters representing the specific execution actions. These control parameters include, but are not limited to, a predefined parameter format for the neural network, parameters for each layer of the neural network, and the like. In particular, the parameters relate to the weights of the convolution kernels of the convolutional layers and the offset of each channel.
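For illustration only, the following Python sketch models an instruction being parsed into such a control-parameter record; the field names (layer_index, data_address, kernel_weights, channel_bias, and so on) are assumptions made for this example and are not taken from the patent.

    # Minimal behavioural sketch (assumed field names) of control parameters
    # produced by the neural network control unit after parsing an instruction.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ControlParameters:
        layer_index: int                # which neural network layer the instruction targets
        data_address: int               # address of the operation data in external memory
        param_address: int              # address of the layer parameters in external memory
        kernel_weights: List[float] = field(default_factory=list)  # convolution kernel weights
        channel_bias: List[float] = field(default_factory=list)    # offset of each channel

    def parse_instruction(instruction: dict) -> ControlParameters:
        # Translate a raw instruction (represented here as a plain dict) into control parameters.
        return ControlParameters(
            layer_index=instruction["layer"],
            data_address=instruction["data_addr"],
            param_address=instruction["param_addr"],
            kernel_weights=instruction.get("weights", []),
            channel_bias=instruction.get("bias", []),
        )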
In one embodiment, the task allocation processing unit 12 receives the control parameters, parses, disassembles, and groups the control parameters, generates configuration parameters applicable to the computing module 13, and sends the configuration parameters to the computing module 13.
In one embodiment, the configuration parameters include, but are not limited to: 1. the definition of the specific function of a particular high-efficiency neural network compute array (used when multiple sets of input data to be processed are handled in parallel); 2. the calculation parameters of the neural network (hyper-parameters such as the learning rate, regularization parameters, the number of neural network layers, the number of neurons in each hidden layer, the number of learning rounds, the mini-batch size, the encoding mode of the output neurons, the cost function selection, the weight initialization method, the type of neuron activation function, and the scale of the data participating in model training) and the memory addresses of the calculation results of the neural network layers; 3. specific parameters such as the number of input channels and output channels of the neural network layer, the type of activation function, and the filter size.
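A hypothetical grouping of such configuration parameters is sketched below in Python; the field names and example values are illustrative assumptions, not the patent's actual parameter format.

    # Sketch (assumed names and values) of one group of configuration parameters
    # sent to a single high-efficiency neural network compute array.
    from dataclasses import dataclass

    @dataclass
    class ArrayConfiguration:
        function: str          # e.g. "matrix", "pre_process", "algorithm", "post_process"
        in_channels: int       # number of input channels of the neural network layer
        out_channels: int      # number of output channels of the neural network layer
        filter_size: int       # spatial size of the filter
        activation: str        # type of activation function, e.g. "relu"
        result_address: int    # memory address where the layer result is written back

    # Example configuration group for a convolution array (hypothetical values).
    conv_config = ArrayConfiguration(
        function="matrix", in_channels=3, out_channels=16,
        filter_size=3, activation="relu", result_address=0x4000,
    )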
Meanwhile, the task allocation processing unit 12 also serves as a bridging module between the calculation module 13 and the external data memory 14: it receives the operation data from the external data memory 14 as well as the initial or intermediate result data that the calculation module 13 needs to read from or write to the external data memory. In one embodiment, the operation data and the result data include various data sets, such as a one-dimensional array representing time or frequency spectrum sampling, or a multi-dimensional, multi-channel array, such as a three-dimensional input array representing two-dimensional pixel points on a plane together with RGB channels.
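As a small illustration, the array shapes below (sizes chosen arbitrarily for the example) correspond to the kinds of data just described.

    # Illustrative shapes (assumed sizes) for the operation data mentioned above.
    import numpy as np

    spectrum_samples = np.zeros(1024)            # one-dimensional time or spectrum sampling
    rgb_image        = np.zeros((224, 224, 3))   # two-dimensional pixel plane with three RGB channels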
In addition, based on the initial or intermediate calculation results that the calculation module 13 needs to read from or write to the external data memory, the task allocation processing unit 12 also uses a hardware arbitration mode to analyze the running or idle state of each multi-channel high-efficiency neural network compute array and performs new task allocation.
In one embodiment, the calculation module 13 retrieves data in the external data storage 14 through the task allocation processing unit 12 according to the requirement of the configuration parameter, performs an operation on the data and the corresponding configuration parameter, returns a calculation result to the task allocation processing unit 12 after a calculation cycle is finished, and sends a status report to the task allocation processing unit 12. In one embodiment, status reports include, but are not limited to, a compute end report, a ready to receive next instruction report, and the like.
In one embodiment, the calculation module 13 includes multiple high-efficiency neural network compute arrays 131, 132, …, 13n operating in multi-channel parallel, where n is an integer greater than or equal to 1 and an appropriate value of n may be selected depending on the application. Each high-efficiency neural network compute array 13n may perform the same and/or different computational functions and receives the corresponding configuration parameters according to the function it executes. For example, in some embodiments, a high-efficiency neural network compute array may perform matrix computation (supporting the matrix operations of each neural network layer in the neural network model, including matrix addition, multiplication, transposition, convolution, deconvolution, etc.), data pre-processing (format conversion, filtering, windowing, and other operations on the data), algorithm processing (supporting specified digital signal processing algorithms and custom algorithm operations, such as Fourier transforms, Laplace transforms, and quantization operations), and data post-processing (performed after the neural network model computation is complete, including output result conversion, non-linear operations, etc.).
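A behavioural sketch of this function selection is given below; it is an illustrative software model rather than the hardware implementation, and the function names simply mirror the categories listed above.

    # Behavioural sketch of a compute array dispatching on its configured function.
    import numpy as np

    def run_compute_array(function: str, data: np.ndarray, weights: np.ndarray) -> np.ndarray:
        if function == "matrix":          # matrix computation, e.g. a fully connected layer
            return data @ weights
        if function == "pre_process":     # simple data pre-processing: normalisation
            return (data - data.mean()) / (data.std() + 1e-8)
        if function == "algorithm":       # example custom algorithm: Fourier transform magnitude
            return np.abs(np.fft.fft(data, axis=-1))
        if function == "post_process":    # example post-processing: non-linear operation (ReLU)
            return np.maximum(data, 0.0)
        raise ValueError(f"unknown function: {function}")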
In one embodiment, each high-efficiency neural network compute array 13n in turn includes a plurality of compute groups operating in parallel. The number of compute groups included in each high-efficiency neural network compute array 13n may be the same or different. In an embodiment of the present disclosure, each compute group in turn includes the same number of granular computing units arranged in a matrix. The task allocation processing unit 12 allocates an appropriate number of compute groups to form a high-efficiency neural network compute array 13n according to the instruction received by the neural network control unit 11. Every compute group in a given high-efficiency neural network compute array 13n receives the same parameter configuration and implements the same operational rules. In one embodiment, the data received by the compute groups in a high-efficiency neural network compute array 13n may be the same or different. That is to say, each compute array 13n receives only one set of configuration parameters according to the function to be performed, and every compute group (comprising multiple computing units) in that compute array executes that function. Compared with a conventional general-purpose processor (e.g., a CPU), which issues individual configuration parameters to each granular computing unit for every data operation, the neural network system architecture 100 separates the configuration parameters from the data transmission: each high-efficiency neural network compute array 13n in the calculation module 13 receives configuration parameters and data in batch form, and the compute array 13n itself only performs, on the received data, the computation required by the functions specified in the configuration parameters. The whole architecture reduces the number of instructions and the number of configuration parameters, so that during transmission the configuration parameters occupy very little register space and more space is left for data. Meanwhile, the neural network system architecture 100 can dynamically allocate compute groups in batches to realize different computing functions, and compute groups realizing the same computing function form a compute array 13n, so that the data computing efficiency of the whole system is greatly improved.
Fig. 2 is a schematic block diagram of a neural network compute group according to an embodiment of the present application. As shown in fig. 2, each compute group includes i × j granular computing units, wherein i is an integer greater than or equal to 1, j is an integer greater than or equal to 1, i × j is an integer greater than or equal to 2, and the values of i and j can be selected according to the practical application. In one embodiment, each compute group includes i × j = 32 granular computing units operating in parallel. Each granular computing unit receives the same configuration parameters and computes the same or different data according to those configuration parameters.
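The following sketch illustrates the idea of a compute group in software, assuming 32 granular computing units that each hold a 9-element data window and share a single 9-element parameter set; the shapes are assumptions made for the example.

    # Sketch of a compute group: 32 granular computing units receiving the same
    # parameters but (possibly) different data, reduced to one multiply-accumulate each.
    import numpy as np

    def compute_group(data_per_unit: np.ndarray, shared_params: np.ndarray) -> np.ndarray:
        # data_per_unit: shape (32, 9), one 9-element window per granular computing unit
        # shared_params: shape (9,), a single parameter set shared by all 32 units
        assert data_per_unit.shape == (32, 9) and shared_params.shape == (9,)
        # Every unit applies the identical parameters to its own data slice.
        return data_per_unit @ shared_params   # shape (32,): one accumulated result per unit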
According to embodiments of the present application, each granular computing unit can have a number of implementation structures, depending on the computing functions that the compute array can execute. Fig. 3 is a schematic block diagram of a granular computing unit for performing convolution calculation according to an embodiment of the present application. As shown in fig. 3, the granular computing unit includes an input data buffer, an input parameter buffer, a multiplier array, an accumulator, and an output data buffer.
The input data buffer receives data provided by the task allocation processing unit 12.
The input parameter buffer receives the configuration parameters provided by the task allocation processing unit 12.
Each multiplier in the multiplier array reads data and parameters from the input data buffer and the input parameter buffer and performs multiplication.
The accumulator sums the calculation results of the multipliers and transmits the final data to the output data buffer. The data in the output data buffer is finally sent back to the task allocation processing unit 12. In other embodiments where an addition operation is not required, the accumulator may be omitted.
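A behavioural sketch of such a granular computing unit is shown below; it models the buffers, the nine-multiplier array, and the accumulator of Fig. 3 in software and is not a hardware description.

    # Behavioural sketch (assumed interface, not RTL) of one granular computing unit
    # used for convolution: nine multipliers feeding an accumulator.
    class GranularComputingUnit:
        def __init__(self, num_multipliers: int = 9):
            self.num_multipliers = num_multipliers
            self.input_data = []      # input data buffer
            self.input_params = []    # input parameter buffer
            self.output_data = None   # output data buffer

        def load(self, data, params):
            assert len(data) == len(params) == self.num_multipliers
            self.input_data = list(data)
            self.input_params = list(params)

        def compute(self):
            # Multiplier array: element-wise products of data and parameters.
            products = [d * p for d, p in zip(self.input_data, self.input_params)]
            # Accumulator: sum the products (omitted in embodiments without addition).
            self.output_data = sum(products)
            return self.output_data

    # Example: one 3 x 3 data window against a 3 x 3 kernel of identical weights.
    unit = GranularComputingUnit()
    unit.load([1, 2, 3, 4, 5, 6, 7, 8, 9], [0.5] * 9)
    print(unit.compute())   # prints 22.5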
In an alternative embodiment, the granular computing unit comprises 9 multipliers. Suppose again that 32 granular computing units form a compute group. Since the granular computing units in the compute group are uniformly programmed, the number of configuration parameters is small: all 32 granular computing units are equipped with the same parameters. If a conventional processor were used for this neural network computation, 32 × 9 instructions would be required to complete the multiplications, and if the granular computing units also perform addition, the number of instructions to be issued doubles; issuing these instructions consumes a great deal of time. The present architecture only needs to issue a single instruction conveying the configuration parameters, all the register units obtain the same parameters, and the whole compute group can complete all of its calculations in one usage cycle.
In addition, when a conventional processor handles neural network computation, each multiplier must first send a control command to the memory when reading a piece of data, transmit the data only after the memory responds, and then acknowledge the received data back to the memory. The instructions therefore occupy a large share of the transfers: for example, if three instructions are needed to transfer one piece of data, the data transfer efficiency is only 25%. The present architecture transmits instructions and data separately and processes instructions in batches, which reduces the instruction traffic, leaves more of the transfers for data, and achieves high efficiency.
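The counts implied by this comparison can be reproduced with the small back-of-the-envelope calculation below; the figures simply restate the example in the two preceding paragraphs.

    # Back-of-the-envelope restatement of the instruction-count example above.
    units_per_group = 32
    mults_per_unit  = 9

    conventional_instructions = units_per_group * mults_per_unit   # 288 multiply instructions
    with_additions            = conventional_instructions * 2      # 576 if each unit also accumulates
    batched_instructions      = 1                                  # one configuration instruction here

    # Transfer efficiency when three control instructions accompany each datum:
    transfer_efficiency = 1 / (3 + 1)                              # 0.25, i.e. 25 %

    print(conventional_instructions, with_additions, batched_instructions, transfer_efficiency)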
The granular computing units in each compute group can be uniformly assigned and reprogrammed by the neural network control unit 11 and the task allocation processing unit 12. For example, when a compute group in a high-efficiency neural network compute array 13n is idle, the task allocation processing unit 12 may reassign tasks to that compute group; the compute group may then form a new compute array together with other compute groups that have the same operation task.
Fig. 4 is a schematic block diagram illustrating the task allocation processing unit 12 according to an embodiment of the present application. As shown in fig. 4, the task allocation processing unit 12 includes a parsing module, a function allocation module, and a data bridging module.
The parsing module receives the control parameters output by the neural network control unit 11, and parses and disassembles them to generate a configuration parameter set.
The function allocation module groups the configuration parameter sets produced by the parsing module, receives the status report of the calculation module 13 and the data result of the data bridging module, generates a plurality of groups of configuration parameters, and sends each group of configuration parameters to the corresponding high-efficiency neural network compute array 13n in the calculation module 13 to implement the corresponding operation.
The data bridging module receives the data from the external data memory 14 and sends it to the calculation module 13, and receives the initial or intermediate result data that the calculation module 13 needs to read from or write to the external data memory 14.
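The division of labour among the three sub-modules can be sketched as follows; the class and method names, and the idea of selecting idle arrays inside the allocation step, are assumptions made for the example rather than the patent's exact design.

    # Structural sketch (assumed interfaces) of the task allocation processing unit.
    class TaskAllocationProcessingUnit:
        def __init__(self, compute_arrays):
            self.compute_arrays = compute_arrays        # handles to the compute arrays 131..13n

        def parse(self, control_parameters: dict) -> list:
            # Parsing module: disassemble the control parameters into a configuration parameter set.
            return [control_parameters[key] for key in sorted(control_parameters)]

        def allocate(self, configuration_groups: list, status_reports: list) -> None:
            # Function allocation module: send each configuration group to an idle compute array.
            idle_arrays = [a for a, s in zip(self.compute_arrays, status_reports) if s == "idle"]
            for array, config in zip(idle_arrays, configuration_groups):
                array.configure(config)                 # assumed method on the array handle

        def bridge(self, external_memory: dict, address: int, result=None):
            # Data bridging module: read operation data in, or write result data back.
            if result is None:
                return external_memory[address]
            external_memory[address] = result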
Returning to FIG. 1: in one embodiment, the neural network control unit 11, the task allocation processing unit 12, and the calculation module 13 in the neural network system architecture 100 are integrated in an integrated circuit 101. The external data memory 14 is located outside the integrated circuit 101.
Further, the neural network system architecture 100, in some examples, also includes interface circuits and an external processor, as required by the actual application.
In one embodiment, the interface circuit receives user information and processes it for transmission to the external processor. In one embodiment, the interface circuit includes, but is not limited to, a USB interface, an optical module, a camera capture module, an Ethernet interface, Bluetooth, etc.
The external processor receives the signals transmitted by the interface circuit, generates instructions and data, and stores the instructions and data in the external memory 14. Integrated circuit 101 is used to perform neural network operations on instructions and data in external memory 14.
In one embodiment, the interface circuit, the external processor, the integrated circuit 101, and the external memory 14 are packaged in the same module.
Fig. 5 illustrates a high-speed neural network computing method according to an embodiment of the present application. The calculation method can be applied to the neural network architecture 100, and includes the following steps S1-S4.
Step S1: every i × j granular computing units form a calculation group, wherein i is an integer greater than or equal to 1, j is an integer greater than or equal to 1, i × j is an integer greater than or equal to 2, and the values of i and j can be selected according to the practical application. In one embodiment, i × j equals 32; that is, every 32 granular computing units form one calculation group.
Step S2: configuring a parameter set for each of m calculation groups according to calculation function requirements, wherein m is an integer greater than 1.
Step S3: setting the calculation groups having the same parameter set configuration among the m calculation groups as one calculation matrix.
Step S4: each calculation matrix retrieves data from the memory according to a first parameter in its corresponding parameter set, performs an operation on the data and a second parameter in the corresponding parameter set, and returns the operation result to the memory. In one embodiment, the first parameter includes parameters such as the addressing address and the amount of data to fetch. The second parameter includes, for example, the weights of the convolution kernels of the convolutional layer and the offset of each channel.
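The four steps can be illustrated end to end with the Python sketch below; the dictionary-based memory, the parameter keys (data_addr, weight, result_addr), and the example values are all assumptions made for the illustration.

    # End-to-end sketch (assumed data layout) of steps S1 to S4.
    import numpy as np

    def build_groups(num_units: int, i: int, j: int) -> list:
        # Step S1: every i * j granular computing units form one calculation group.
        group_size = i * j
        return [list(range(g, g + group_size)) for g in range(0, num_units, group_size)]

    def run_method(memory: dict, parameter_sets: list, groups: list) -> dict:
        # Steps S2 and S3: groups configured with the same parameter set form one calculation matrix.
        matrices = {}
        for group, params in zip(groups, parameter_sets):
            matrices.setdefault(id(params), (params, []))[1].append(group)

        # Step S4: each matrix fetches data via the first parameter (an address),
        # combines it with the second parameter (e.g. kernel weights), and writes back.
        for params, _member_groups in matrices.values():
            data = np.asarray(memory[params["data_addr"]])
            result = data * params["weight"]
            memory[params["result_addr"]] = result
        return memory

    # Usage: 64 units in groups of 4 x 8 = 32, two groups sharing one parameter set.
    groups = build_groups(64, 4, 8)
    shared = {"data_addr": 0, "weight": 0.5, "result_addr": 1}
    memory = {0: [2.0, 4.0, 6.0]}
    print(run_method(memory, [shared, shared], groups)[1])   # prints [1. 2. 3.]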
While the present application has been described with reference to several exemplary embodiments, it is understood that the terminology used is intended to be in the nature of words of description and illustration rather than of limitation. As the present application may be embodied in several forms without departing from its spirit or essential characteristics, it should also be understood that the above-described embodiments are not limited by any of the details of the foregoing description, but rather should be construed broadly within the spirit and scope defined in the appended claims; therefore, all changes and modifications that fall within the metes and bounds of the claims, or equivalents of such metes and bounds, are intended to be embraced by the appended claims.

Claims (10)

1. An architecture for neural network computing, comprising:
a neural network control unit for receiving instructions from an external data memory and parsing the instructions to generate control parameters representing specific execution behaviors of the instructions;
a task allocation processing unit for receiving the control parameters and for parsing, disassembling, and grouping the control parameters to generate multiple groups of configuration parameters, and for receiving operation data from the external data memory; and
a calculation module coupled to the task allocation processing unit, for receiving the multiple groups of configuration parameters and the operation data, performing operation processing on them, and returning result data and a status report after the operation processing to the task allocation processing unit; wherein the calculation module includes a plurality of compute arrays, each compute array receiving one of the multiple groups of configuration parameters.
2. The architecture of claim 1, wherein each compute array includes a plurality of compute groups connected in parallel, each compute group receiving the same configuration parameters.
3. The architecture of claim 2, wherein each compute group includes i × j granular computing units, where i is an integer greater than or equal to 1, j is an integer greater than or equal to 1, and i × j is an integer greater than or equal to 2.
4. The architecture of claim 3, wherein each granular computing unit comprises:
an input data cache for receiving the operation data;
an input parameter cache for receiving the configuration parameters;
a multiplier array comprising a plurality of multipliers, each multiplier for reading the operation data and the configuration parameters from the input data cache and the input parameter cache and performing multiplication;
an accumulator for adding the calculation results of the multipliers and generating the result data; and
an output parameter cache for receiving the result data and returning the result data to the task allocation processing unit.
5. The architecture of claim 3, wherein each of the compute groups includes 32 granular computing units.
6. The architecture of claim 1, wherein the task allocation processing unit comprises:
a parsing module for receiving the control parameters output by the neural network control unit and for parsing and disassembling the control parameters to generate a configuration parameter set;
a function allocation module for grouping the configuration parameter set to generate the multiple groups of configuration parameters and sending each group of configuration parameters to the corresponding compute array according to the status report; and
a data bridging module for receiving the operation data and the result data.
7. An architecture for neural network computing, comprising:
an interface circuit for receiving user information and processing the user information;
an external processor for receiving the processed user signal transmitted by the interface circuit and generating instructions and data;
an external memory for receiving and storing the instructions and data from the external processor; and
the architecture of any one of claims 1-6.
8. An architectural method for neural network computing, comprising:
forming every i × j granular computing units into a calculation group, wherein i is an integer greater than or equal to 1, j is an integer greater than or equal to 1, and i × j is an integer greater than or equal to 2;
configuring a parameter set for each of m calculation groups according to calculation function requirements, wherein m is an integer greater than 1;
setting the calculation groups with the same configuration of parameter sets in the m calculation groups as a calculation matrix; and
each calculation matrix retrieving data from the memory according to a first parameter in its corresponding parameter set, performing an operation on the data and a second parameter in the corresponding parameter set, and returning the operation result to the memory.
9. The architectural method of claim 8, wherein each granular computing unit comprises 9 multipliers.
10. The architectural method of claim 8, wherein i x j equals 32.
CN202210926707.XA 2022-08-03 2022-08-03 Architecture and architectural methods for neural network computing Active CN114997392B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210926707.XA CN114997392B (en) 2022-08-03 2022-08-03 Architecture and architectural methods for neural network computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210926707.XA CN114997392B (en) 2022-08-03 2022-08-03 Architecture and architectural methods for neural network computing

Publications (2)

Publication Number Publication Date
CN114997392A true CN114997392A (en) 2022-09-02
CN114997392B CN114997392B (en) 2022-10-21

Family

ID=83020991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210926707.XA Active CN114997392B (en) 2022-08-03 2022-08-03 Architecture and architectural methods for neural network computing

Country Status (1)

Country Link
CN (1) CN114997392B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022468A (en) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 Artificial neural network processor integrated circuit and design method therefor
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA
US20200272162A1 (en) * 2019-02-21 2020-08-27 Nvidia Corporation Quantizing autoencoders in a neural network
CN110188871A (en) * 2019-05-31 2019-08-30 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN114492723A (en) * 2020-11-13 2022-05-13 华为技术有限公司 Neural network model training method, image processing method and device
CN112465110A (en) * 2020-11-16 2021-03-09 中国电子科技集团公司第五十二研究所 Hardware accelerator for convolution neural network calculation optimization
CN113268269A (en) * 2021-06-07 2021-08-17 中科计算技术西部研究院 Acceleration method, system and device for dynamic programming algorithm
CN113313243A (en) * 2021-06-11 2021-08-27 海宁奕斯伟集成电路设计有限公司 Method, device and equipment for determining neural network accelerator and storage medium
CN113807509A (en) * 2021-09-14 2021-12-17 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment
CN114330686A (en) * 2021-12-14 2022-04-12 上海埃瓦智能科技有限公司 Configurable convolution processing device and convolution calculation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIJUN WANG et al.: "Fully Learnable Group Convolution for Acceleration of Deep Neural Networks", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
沈泽君 (SHEN, Zejun): "粒计算思维在神经网络上的应用" (Application of Granular Computing Thinking to Neural Networks), 中国优秀硕士学位论文全文数据库 信息科技辑 (China Master's Theses Full-text Database, Information Science and Technology) *

Also Published As

Publication number Publication date
CN114997392B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
US10896369B2 (en) Power conversion in neural networks
CN110050267B (en) System and method for data management
US11740898B2 (en) Computing device and method
US11106598B2 (en) Computing device and method
US20200117453A1 (en) Computing device and method
JP6905573B2 (en) Arithmetic logic unit and calculation method
KR20200000480A (en) Processing apparatus and processing method
CN111897579A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN110163359B (en) Computing device and method
CN112840356A (en) Operation accelerator, processing method and related equipment
US11663491B2 (en) Allocation system, method and apparatus for machine learning, and computer device
CN108681773B (en) Data operation acceleration method, device, terminal and readable storage medium
CN110163350B (en) Computing device and method
CN110580519B (en) Convolution operation device and method thereof
CN111353591A (en) Computing device and related product
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
WO2022112739A1 (en) Activation compression method for deep learning acceleration
EP4024283A1 (en) Method and apparatus for processing data, and related product
CN109711540B (en) Computing device and board card
WO2020052265A1 (en) System and method for cascaded dynamic max pooling in neural networks
CN114997392B (en) Architecture and architectural methods for neural network computing
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN111985628B (en) Computing device and neural network processor comprising same
CN111198714B (en) Retraining method and related product
US20240054330A1 (en) Exploitation of low data density or nonzero weights in a weighted sum computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant