CN117289897A - Adaptive dual-frequency multiply-add array for a neural network accelerator

Adaptive dual-frequency multiply-add array for a neural network accelerator

Info

Publication number: CN117289897A (application CN202311576340.4A; granted as CN117289897B)
Authority: CN (China)
Prior art keywords: calculation, weight, input image, module, frequency
Legal status: Granted; active
Other languages: Chinese (zh)
Inventors: 张�浩, 汪粲星, 李泽钜
Original and current assignee: Nanjing Magnichip Microelectronics Co., Ltd.
Application filed by Nanjing Magnichip Microelectronics Co., Ltd.; priority to CN202311576340.4A
Filing date: 2023-11-24; publication of CN117289897A: 2023-12-26; grant of CN117289897B: 2024-04-02

Classifications

    • G06F7/5443 Sum of products (evaluation of functions by calculation)
    • G06N3/063 Physical realisation of neural networks, i.e. hardware implementation, using electronic means
    • G06T1/20 Processor architectures; processor configuration, e.g. pipelining
    • G06T1/60 Memory management
    • G06T2200/28 Indexing scheme for image data processing involving image processing hardware
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an adaptive dual-frequency multiply-add array for a neural network accelerator. It introduces a dual-stationary dataflow scheduling strategy (output-stationary plus partially weight-stationary), so that weight data can be reused further, the bandwidth pressure on the weight SRAM is relieved, the array frequency can exceed the system frequency, and higher computational energy efficiency is achieved. The adaptive dual-frequency array greatly reduces repeated memory reads and writes: the frequency of the computing array is adjusted dynamically through the input image and weight buffer and the frequency adjustment module, while a state machine controls the computation of the whole array, greatly increasing array throughput with logic that is clear, simple, and effective. The design is highly modular, with clear dependencies between modules, a simple structure, and good feasibility; it achieves a better pressure balance between memory and the computing array and improves the energy efficiency of the array.

Description

Adaptive dual-frequency multiply-add array for a neural network accelerator
Technical Field
The invention relates to an adaptive dual-frequency multiply-add array for a neural network accelerator, and belongs to the technical field of neural networks.
Background
In recent years, neural networks have been widely used in fields such as image recognition, speech recognition, and natural language processing. With the rapid development of neural network technology, the limited frequency of the computing array and the limited memory bandwidth in a neural network accelerator constrain its operating efficiency and resource utilization.
Unlike conventional computer architectures, neural network computing arrays typically use a large number of parallel computing units to process input data, and their operating speed is limited by the hardware design and the way the data streams are scheduled. Meanwhile, the storage units that hold the weights, biases, and other parameters of the neural network are usually implemented with caches or memory, whose read/write speed is limited by device physics and hardware design, so their bandwidth is bounded. Training and inference require a large number of parameter reads and writes, and if the bandwidth of the storage unit is insufficient, the computation of the neural network slows down. Moreover, as neural networks keep growing in size, the bandwidth bottleneck of the storage unit becomes increasingly significant.
In neural network accelerators, data in a conventional dataflow moves between computing units in a fixed pattern. Typically, the computing units demand more data than the storage units can deliver, creating a load imbalance between computation and storage: the actual working frequency of the computing array falls far below its theoretical maximum, resource redundancy in the front-end design becomes severe, and chip efficiency drops.
To solve this problem, researchers have proposed many optimization methods, such as rearranging the dataflow, optimizing memory access patterns, distributed computing, and in-memory/near-memory computing. Dataflow rearrangement and memory-access optimization work in the spatial dimension: they usually adopt larger input image and weight buffers and place as much data as possible in faster storage media, so that the storage bandwidth satisfies the amount of data the array needs per computation; or they adopt a better dataflow so that the array achieves higher utilization, relieving the frequency-induced computational bottleneck through the spatial utilization of the array. Distributed computing and in-memory/near-memory computing optimize in the time dimension, shortening the data read-in path as much as possible by placing computing units directly next to storage units so that the array can run at a higher frequency. However, these schemes yield suboptimal solutions: optimizing the spatial dimension often degrades the time dimension, so the array cannot reach its theoretical top speed, while optimizing the time dimension neglects the utilization of the array in space.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide an adaptive dual-frequency multiply-add array for a neural network accelerator that improves on the output-stationary dataflow to form a dual-stationary dataflow scheduling strategy, applicable to any neural network accelerator that supports an output-stationary dataflow, thereby further raising the frequency of the computing array and reducing the bottleneck caused by the bandwidth of the weight memory.
The invention adopts the following technical scheme for solving the technical problems:
An adaptive dual-frequency multiply-add array for a neural network accelerator comprises a state machine control module, an address generation module, a data multiplexing calculation module, an input image and weight buffer, a frequency adjustment module, a calculation module, an output data buffer, and an output data shaping module. The address generation module, the data multiplexing calculation module, the input image and weight buffer, and the frequency adjustment module are each connected to the state machine control module; the data multiplexing calculation module, the frequency adjustment module, the calculation module, the output data buffer, and the output data shaping module are connected in sequence; the data multiplexing calculation module is also connected to the input image and weight buffer; the input image and weight buffer is connected to the calculation module; and the address generation module is connected to the data multiplexing calculation module.
The state machine control module receives external instructions, reads the current neural network parameters, decodes them, and sends the decoded parameters to the address generation module and the data multiplexing calculation module.
The address generation module receives the neural network parameters sent by the state machine control module, generates the storage addresses of the original input image of the neural network, receives the pre-trained weights and writes them to the corresponding locations, calculates the input image addresses and weight addresses currently participating in the calculation, generates the corresponding read/write signals, and reads the data into the data multiplexing calculation module.
The data multiplexing calculation module receives the neural network parameters sent by the state machine control module and, in combination with the hardware parameters of the neural network, calculates the data multiplexing counts, i.e., the numbers of times the input image and the weights currently participating in the calculation must be reused; it sends these input image and weight multiplexing counts to the frequency adjustment module and writes the input image and weights currently participating in the calculation into the input image and weight buffer.
The frequency adjustment module receives the input image and weight multiplexing counts sent by the data multiplexing calculation module and, upon receiving the frequency configuration instruction sent by the state machine control module, dynamically adjusts the frequency of the calculation module; after the frequency adjustment is completed, it sends an adjustment-completion flag to the state machine control module, which, upon receiving the flag, directs the input image and weight buffer to send the input image and weights currently participating in the calculation to the calculation module.
The calculation module receives the input image and weights currently participating in the calculation, completes the two-dimensional convolution or fully connected calculation, and outputs the result to the output data buffer; the result is output after being shaped by the output data shaping module.
As a preferred scheme of the invention, the working states of the state machine control module comprise an idle mode and a working mode, where the working mode includes a neural network parameter reading stage, an input image and weight loading stage, a frequency configuration stage, and a calculation stage; the state machine control module toggles between the idle mode and the working mode.
As a preferred scheme of the invention, the address generation module comprises an input image read-address unit, an input image write-address unit, a weight-data read-address unit, and a weight-data write-address unit. The input image read-address unit calculates the input image addresses currently participating in the calculation; the input image write-address unit generates the storage addresses of the original input image of the neural network; the weight-data read-address unit calculates the weight addresses currently participating in the calculation; and the weight-data write-address unit receives the pre-trained weights and writes them to the corresponding locations.
As a preferred scheme of the invention, the data multiplexing calculation module comprises a neural network parameter receiving unit, a multiplexing-count calculation unit, an input image and weight shaping unit, and an input image and weight writing unit, where the multiplexing-count calculation unit comprises an input image multiplexing-count calculation unit and a weight multiplexing-count calculation unit.
The neural network parameter receiving unit receives the neural network parameters sent by the state machine control module and decodes them to obtain the horizontal stride, vertical stride, convolution kernel height, convolution kernel width, input image height, input image width, number of input image channels, and number of output image channels of the neural network. The input image multiplexing-count calculation unit and the weight multiplexing-count calculation unit respectively calculate, in combination with the hardware parameters of the neural network, the multiplexing counts required by the input image and the weights currently participating in the calculation. The input image and weight shaping unit receives the results of the two multiplexing-count calculation units, calculates the specific addresses at which the input image and weights participating in the calculation should be written into the input image and weight buffer, and sends these addresses to the input image and weight writing unit. The input image and weight writing unit judges whether the input image and weight buffer is in a writable state and, when it is, writes the input image and weights to the specific addresses in the buffer.
As a preferred scheme of the invention, the frequency adjustment module dynamically adjusts the frequency of the calculation module, specifically as follows:
If the single-pass parallelism of the calculation module cannot cover the whole input image and the convolution kernel size and stride produce overlap between convolution regions, that is, the multiplexing counts required by the input image and weights currently participating in the calculation are both nonzero, the frequency of the calculation module is adjusted to twice the frequency of the input image and weight buffer. If the single-pass parallelism of the calculation module can cover the whole input image, or the kernel size and stride leave adjacent convolution regions without overlap, that is, the multiplexing count required by the input image or the weights currently participating in the calculation is 0, the frequency of the calculation module is kept consistent with the frequency of the input image and weight buffer.
As a preferred scheme of the invention, the calculation module comprises a three-dimensional vector processing array composed of 4 rows and 16 columns of intelligent computing vector processing units; through this array the calculation module performs vector multiply-add operations in parallel.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. The computing array and the on-chip memory operate at different frequencies, greatly relieving the computational bottleneck caused by on-chip storage and improving chip performance and energy efficiency, while balancing the utilization of the array in the time dimension without sacrificing the spatial utilization of the neural network accelerator.
2. The hardware parameters are highly adjustable and general: the hardware size can be fully tuned to the complexity of the actual application scenario without affecting functionality.
3. The design is highly modular, with clear dependencies between the different modules and a simple scheduling scheme, reducing the area overhead of the control module.
Drawings
FIG. 1 is a block diagram of the adaptive dual-frequency multiply-add array for a neural network accelerator according to the invention;
FIG. 2 shows the operating states of the state machine control module of the invention;
FIG. 3 is a block diagram of the data multiplexing calculation module and the frequency adjustment module of the invention;
FIG. 4 is a schematic diagram of the four-row, sixteen-column array.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
The invention provides an adaptive dual-frequency multiply-add array for neural networks that determines the dataflow scheduling scheme of the current network by analyzing the current input characteristics and automatically selects the frequency relationship between the computing array and the other modules, reducing the limit that storage in the neural network imposes on the frequency of the computing array and thereby further improving its performance.
As shown in FIG. 1, the adaptive dual-frequency multiply-add array for a neural network accelerator comprises a state machine control module, an address generation module, a data multiplexing calculation module, an input image and weight buffer, a frequency adjustment module, a calculation module, an output data buffer, and an output data shaping module.
The state machine control module receives and decodes instructions input from outside; the instruction content specifically comprises the base address of the input image, the base address of the weight data, and the neural network parameters. The state machine control module is connected to all the remaining modules. The address generation module receives instructions from the state machine control module, calculates the addresses of the neural network weight parameters and of the input image, and generates read/write signals. The data multiplexing calculation module receives instructions from the state machine and calculates how many more times the current input image and weights will be used; the input image and weight buffer stores the input image and weights prepared by the data multiplexing calculation module, and the state machine decides when they are taken out. The frequency adjustment module receives signals from the state machine control module and the data multiplexing calculation module and dynamically adjusts the frequency of the calculation module. The calculation module performs two-dimensional convolution and fully connected calculations on the input image and weights received from the input image and weight buffer and, under the control of the state machine control module, selects whether to output data to the output data buffer; the output data shaping module further arranges the data and outputs it to subsequent modules for further operations.
As shown in fig. 2, the operating states of the state machine control module are divided into an idle mode and a working mode. The working mode comprises a neural network parameter reading stage, an input image and weight loading stage, a frequency configuration stage, and a calculation stage; the calculation stage comprises a calculation-in-progress phase and a calculation-idle phase. The frequency configuration stage comprises a neural network parameter pre-reading phase, a multiplexing-degree calculation phase, and a frequency configuration phase. The calculation stage covers convolution calculation and fully connected calculation; the idle mode of the state machine control module is a wait-for-command state.
The state machine control module toggles between the idle mode and the working mode. The idle mode is the standby state of the state machine control module; the working mode is its state while a two-dimensional convolution or fully connected calculation is being performed.
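For illustration, this mode and stage structure can be modeled as a small state machine in software. The following Python sketch uses the state names from the description, while the transition signals (cmd_pending, freq_ready, layer_done) are assumed stand-ins for the actual control signals, which the text does not name:

```python
from enum import Enum, auto

class CtrlState(Enum):
    """States of the state machine control module, as named in the description."""
    IDLE = auto()         # standby, waiting for an external command
    READ_PARAMS = auto()  # neural network parameter reading stage
    LOAD_DATA = auto()    # input image and weight loading stage
    CONFIG_FREQ = auto()  # frequency configuration stage
    COMPUTE = auto()      # calculation stage (convolution / fully connected)

def next_state(state: CtrlState, cmd_pending: bool,
               freq_ready: bool, layer_done: bool) -> CtrlState:
    """A plausible transition function; the exact trigger signals are assumptions."""
    if state is CtrlState.IDLE:
        return CtrlState.READ_PARAMS if cmd_pending else CtrlState.IDLE
    if state is CtrlState.READ_PARAMS:
        return CtrlState.LOAD_DATA
    if state is CtrlState.LOAD_DATA:
        return CtrlState.CONFIG_FREQ
    if state is CtrlState.CONFIG_FREQ:
        # wait for the frequency adjustment module's completion flag
        return CtrlState.COMPUTE if freq_ready else CtrlState.CONFIG_FREQ
    # COMPUTE: return to IDLE once the current layer finishes
    return CtrlState.IDLE if layer_done else CtrlState.COMPUTE
```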
The address generation module is divided, according to the objects it serves and preset logical relations, into an input image read-address unit, an input image write-address unit, a weight-data read-address unit, and a weight-data write-address unit. These subunits first obtain the base addresses of the input image and weight data and the network parameters from the state machine control module, then perform address conversion by calculation and read the currently required data from the input image and weight buffer.
The subunits of the address generation module function as follows. Each receives the base address and the various neural network parameters given by the state machine control module, namely the length, width, and number of channels of the input image. The input image read-address unit uses these parameters to calculate the addresses of the input image data that must participate in each calculation. The input image write-address unit generates the addresses where the original input image is stored. The weight-data read-address unit generates the weight addresses needed for each calculation. The weight-data write-address unit receives the trained weights and writes them to the corresponding locations. When the state machine control module is in the data and weight preloading stage or the calculation stage of the working mode, the write units stand by and the read units work; when it enters the calculation-ending stage, the write units work and the read units stand by. These units cooperate to manage the access addresses for the state machine control module, avoiding mismatches between data fetch and storage during calculation.
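As an illustration of the address conversion performed by the read-address units, the following sketch computes the read addresses of the pixels feeding one convolution window. It assumes a row-major, single-channel layout addressed in pixel units, which the text does not specify; the function name is hypothetical:

```python
def input_read_addresses(base, img_h, img_w, kernel_h, kernel_w,
                         stride_h, stride_w, out_row, out_col):
    """Read addresses of the input pixels feeding one convolution window.

    A sketch only: assumes a row-major layout for a single channel,
    addressed in units of one pixel, starting from the base address
    delivered by the state machine control module.
    """
    row0 = out_row * stride_h   # top-left corner of the window
    col0 = out_col * stride_w
    return [base + (row0 + kr) * img_w + (col0 + kc)
            for kr in range(kernel_h) for kc in range(kernel_w)]

# e.g. the 3x3 window at output position (1, 2) of an 8x8 image, stride 1:
# input_read_addresses(0, 8, 8, 3, 3, 1, 1, 1, 2)
```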
The data multiplexing calculation module and the frequency adjustment module, shown in fig. 3, are the core configuration modules that realize dynamic frequency adjustment in the invention, i.e., the modules that directly adjust the frequency of the calculation module.
Since the data multiplexing calculation and the decisions of the frequency adjustment module depend on the size of the computing array, the array itself is described first; it is depicted in fig. 4.
Here the computing array under the calculation module is set to 4 rows and 16 columns; each point in the array is a neuron. Each intelligent computing vector processing unit consists of a number of multipliers of a given bit width; here each unit is set to consist of 16 multipliers of 8-bit width, meaning that one neuron can directly process 16 pieces of 8-bit data. The 16 products are summed by an adder tree to obtain a partial sum, which is stored inside the unit to await output or the next addition.
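As a concrete illustration, the following sketch models one cycle of such a unit in Python with NumPy; the function name and the 32-bit accumulation are assumptions, since the text does not specify the accumulator width:

```python
import numpy as np

def vector_pe_step(acc: int, pixels: np.ndarray, weights: np.ndarray) -> int:
    """One cycle of an intelligent computing vector processing unit (a model).

    16 signed 8-bit pixel/weight pairs are multiplied in parallel, the 16
    products are reduced by an adder tree, and the result is accumulated
    into the unit's partial sum, which awaits output or the next addition.
    The int32 accumulator here is an assumption about the hardware width.
    """
    assert pixels.shape == weights.shape == (16,)
    products = pixels.astype(np.int32) * weights.astype(np.int32)  # 16 multipliers
    return acc + int(products.sum())                               # adder tree + accumulate

# 16 int8 inputs, e.g. one pixel position across 16 input channels:
pixels  = np.random.randint(-128, 128, 16, dtype=np.int8)
weights = np.random.randint(-128, 128, 16, dtype=np.int8)
partial_sum = vector_pe_step(0, pixels, weights)
```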
The data multiplexing calculation module comprises a neural network parameter receiving unit, a multiplexing-count calculation unit, an input image and weight shaping unit, and an input image and weight writing unit; the multiplexing-count calculation unit comprises an input image multiplexing-count calculation unit and a weight multiplexing-count calculation unit.
The data multiplexing calculation module calculates the multiplexing counts of the current input image and weights. The neural network parameter receiving unit decodes the neural network instruction sent by the state machine to obtain the corresponding parameters, specifically: the horizontal stride, vertical stride, convolution kernel height, convolution kernel width, input image height, input image width, number of input image channels, and number of output image channels of the neural network. After this data is decoded, the multiplexing-count calculation unit calculates, from the current array size and the data taken from the input image and weight buffer, how many times the current input image and weight data will be reused. The input image and weight shaping unit receives the result of the multiplexing-count calculation unit and calculates the specific addresses at which the current input image and weights should be written into the input image and weight buffer; these addresses are received by the input image and weight writing unit, which judges whether the buffer is currently writable and, if so, writes the data into it.
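The text does not give the exact multiplexing-count formulas, so the following sketch only illustrates the two conditions it describes: input reuse when convolution windows overlap and the image exceeds one pass of the array, and weight reuse when the same weights serve several batches of convolution regions. All formulas and names here are illustrative assumptions, not the patented calculation:

```python
def reuse_counts(img_h, img_w, k_h, k_w, s_h, s_w, array_regions=4):
    """Rough model of the multiplexing-count calculation (assumed formulas).

    Input-image reuse exists when adjacent convolution windows overlap
    (kernel larger than stride) and one pass of the array cannot cover
    the whole image; weight reuse exists when the resident weights serve
    more convolution regions than the array processes at once.
    """
    out_h = (img_h - k_h) // s_h + 1        # output rows (no padding assumed)
    out_w = (img_w - k_w) // s_w + 1        # output columns
    regions = out_h * out_w                 # total convolution regions

    windows_overlap = k_h > s_h or k_w > s_w
    image_covered = regions <= array_regions

    # extra times a fetched input pixel is reused across overlapping windows
    input_reuse = 0 if (image_covered or not windows_overlap) \
        else (k_h // s_h) * (k_w // s_w) - 1
    # extra times the resident weights are reused across region batches
    weight_reuse = max(0, -(-regions // array_regions) - 1)  # ceil div - 1
    return input_reuse, weight_reuse
```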
The input image and weight buffer consists of two autonomously designed register files, each able to write four data and read four data simultaneously, plus several single-depth, multi-byte-wide registers. The two register files form a ping-pong structure, which hides the time the computing array would otherwise spend waiting for a register file to load data. The single-depth, multi-byte-wide registers are each responsible for buffering a weight. Unlike the input buffers, these registers trade a larger area for a maximum clock equal to that of the overall system, so reads from them are not subject to a lower maximum-frequency limit.
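A minimal software model of the ping-pong structure is sketched below; the class, method names, and unbounded depth are placeholders, not the autonomously designed register file itself:

```python
class PingPongBuffer:
    """Two banks alternate roles: the computing array reads one bank
    while the other is being refilled, hiding the load latency."""

    def __init__(self):
        self.banks = [[], []]
        self.write_bank = 0  # index of the bank currently being filled

    def write4(self, four_words):
        """Write four data words at once into the filling bank."""
        self.banks[self.write_bank].extend(four_words)

    def swap(self):
        """Flip roles once the filling bank has been loaded."""
        self.write_bank ^= 1
        self.banks[self.write_bank].clear()

    def read4(self, offset):
        """Read four data words from the bank facing the computing array."""
        bank = self.banks[self.write_bank ^ 1]
        return bank[offset:offset + 4]
```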
The frequency adjustment module receives the calculated multiplexing counts from the data multiplexing calculation module and dynamically adjusts the frequency for the current layer. It has two selectable gears. If the current input image is large enough that the single-pass parallelism of the calculation module cannot cover the whole image, and the convolution kernel size and stride produce overlap between convolution regions, the data does not need to be replaced frequently: data fetched from the input image and weight buffer can be used by the calculation module many times, so the computing array can run at twice the frequency of the buffer. If the current input image is small enough that the single-pass parallelism of the calculation module already covers it, or the kernel size and stride leave adjacent convolution regions without overlap, the data must be replaced frequently, and the computing array runs at the same frequency as the input image and weight buffer so that the array receives valid data every cycle.
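The two gears translate directly into a small selection rule; the following sketch restates it (the function name and the MHz unit are illustrative):

```python
def select_array_clock(input_reuse: int, weight_reuse: int,
                       buffer_freq_mhz: float) -> float:
    """The two gears of the frequency adjustment module: when both the
    input image and the weights will be reused (both counts nonzero), a
    fetched operand feeds more than one cycle of computation, so the
    computing array can run at twice the buffer frequency; otherwise it
    must stay at the buffer frequency so every cycle sees valid data."""
    if input_reuse != 0 and weight_reuse != 0:
        return 2.0 * buffer_freq_mhz
    return buffer_freq_mhz
```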
The calculation module comprises a three-dimensional vector processing array built from intelligent computing vector processing units. Each unit contains a number of basic multipliers and an adder tree; both the basic bit width of the multipliers and their number are configurable. In the invention, the basic internal bit width of each intelligent computing vector processing unit is set to 8 and the number of multipliers to 16, so 16 multiply-adds on 8-bit data can be processed simultaneously. With this configuration, a unit processes the multiplications of 16 input channels at a time, sums the 16 products through the adder tree, and stores the result in the data buffer inside the unit.
When the calculation module performs convolution or fully connected calculation, each intelligent computing vector processing unit represents an actual neuron, and the units can be arranged differently according to the scale and characteristics of the network to form three-dimensional vector processing arrays of different sizes. The hardware circuit of the invention takes the intelligent computing vector processing unit as the basic unit and is provided with a three-dimensional vector processing array of 4 rows and 16 columns, supporting at most 4 different convolution regions and 16 different output channels.
The calculation module performs basic two-dimensional neural network convolution by invoking the three-dimensional vector processing array. Each intelligent computing vector processing unit handles the multiplication of two elementary vectors, and the array as a whole realizes a large number of parallel vector multiply-add operations. Results produced by the array are placed in the output data buffer to await addition with the next results; the output data buffer receives signals from the state machine, which decides when its contents are output.
After the output data shaping module receives the signals output by the output data buffer, it further shapes the data, ensuring that the signals are output to subsequent modules in the correct order and that consecutive data remain consistent.
The specific working process of the adaptive dual-frequency multiply-add array for a neural network accelerator provided by the invention is as follows (an illustrative software sketch of the whole flow is given after the steps):
101: The state machine control module reads the neural network parameters, decodes them, and sends them to the address generation module and the data multiplexing calculation module.
102: After receiving the network parameters, the address generation module first generates the storage addresses of the trained weights and the addresses where the original input image is stored. It then calculates the input image data participating in the current calculation, generates the corresponding read/write addresses and signals, reads the data into the data multiplexing calculation module, and likewise generates the read/write addresses and signals of the weight data participating in the current calculation, ready for the next operation.
103: After obtaining the neural network parameter information, the data multiplexing calculation module calculates the data multiplexing counts in combination with the current hardware parameters of the neural network. The multiplexing counts required by the input image and by the weights are determined separately and sent to the frequency adjustment module, and the corresponding input image and weights are written into the input image and weight buffer by the data multiplexing calculation module.
104: After receiving the instruction from the state machine and the input image and weight multiplexing counts calculated by the data multiplexing calculation module, the frequency adjustment module selects the appropriate clock through its internal clock selector: when both the data and weight multiplexing counts are nonzero, the clock frequency of the calculation module is set to twice that of the input image and weight buffer; when either of them is 0, the computing array keeps the same frequency as the input image and weight buffer.
105: After a short wait the clock stabilizes, and the frequency adjustment module returns a completion flag to the state machine; the input image and weights are then read from the input image and weight buffer into the computing array, which completes the corresponding convolution or fully connected calculation.
106: After the calculation is completed, the resulting data is output to the output data buffer and passed on to subsequent modules through the output data shaping module.
107: Steps 101-106 are repeated to complete the calculation of one layer of the neural network.
108: After the calculation of the complete neural network is finished, the array enters a standby state and waits for the next neural network calculation.
The invention provides a novel adaptive dual-frequency multiply-add array that further improves on the output-stationary dataflow, forming a dual-stationary dataflow scheduling strategy applicable to any neural network accelerator that supports an output-stationary dataflow, thereby further raising the frequency of the computing array and reducing the bottleneck caused by the bandwidth of the weight memory. The hardware realizes good modularity and a highly reusable design, keeping the inter-module dependencies and the complexity of the scheduling algorithm well under control in the adaptive configurable dual-frequency array; compared with the direct inter-module communication of traditional dataflows and the hardware sizes their scheduling algorithms require, the invention has good extensibility and is prepared for emerging network structures and hardware scales.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (6)

1. An adaptive dual-frequency multiply-add array for a neural network accelerator, characterized by comprising a state machine control module, an address generation module, a data multiplexing calculation module, an input image and weight buffer, a frequency adjustment module, a calculation module, an output data buffer, and an output data shaping module; wherein the address generation module, the data multiplexing calculation module, the input image and weight buffer, and the frequency adjustment module are each connected to the state machine control module; the data multiplexing calculation module, the frequency adjustment module, the calculation module, the output data buffer, and the output data shaping module are connected in sequence; the data multiplexing calculation module is also connected to the input image and weight buffer; the input image and weight buffer is connected to the calculation module; and the address generation module is connected to the data multiplexing calculation module;
the state machine control module is configured to receive external instructions, read the current neural network parameters, decode them, and send the decoded parameters to the address generation module and the data multiplexing calculation module;
the address generation module is configured to receive the neural network parameters sent by the state machine control module, generate the storage addresses of the original input image of the neural network, receive the pre-trained weights and write them to the corresponding locations, calculate the input image addresses and weight addresses currently participating in the calculation, generate the corresponding read/write signals, and read the data into the data multiplexing calculation module;
the data multiplexing calculation module is configured to receive the neural network parameters sent by the state machine control module and, in combination with the hardware parameters of the neural network, calculate the data multiplexing counts, including the multiplexing counts required by the input image and weights currently participating in the calculation, send the input image and weight multiplexing counts to the frequency adjustment module, and write the input image and weights currently participating in the calculation into the input image and weight buffer;
the frequency adjustment module is configured to receive the input image and weight multiplexing counts sent by the data multiplexing calculation module, dynamically adjust the frequency of the calculation module after receiving the frequency configuration instruction sent by the state machine control module, and send an adjustment-completion flag to the state machine control module once the frequency adjustment is completed, whereupon the state machine control module directs the input image and weight buffer to send the input image and weights currently participating in the calculation to the calculation module;
the calculation module is configured to receive the input image and weights currently participating in the calculation, complete the two-dimensional convolution or fully connected calculation, and output the result to the output data buffer, the result being output after being shaped by the output data shaping module.
2. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the working states of the state machine control module comprise an idle mode and a working mode, the working mode comprising a neural network parameter reading stage, an input image and weight loading stage, a frequency configuration stage, and a calculation stage; the state machine control module toggles between the idle mode and the working mode.
3. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the address generation module comprises an input image read-address unit, an input image write-address unit, a weight-data read-address unit, and a weight-data write-address unit; the input image read-address unit calculates the input image addresses currently participating in the calculation; the input image write-address unit generates the storage addresses of the original input image of the neural network; the weight-data read-address unit calculates the weight addresses currently participating in the calculation; and the weight-data write-address unit receives the pre-trained weights and writes them to the corresponding locations.
4. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the data multiplexing calculation module comprises a neural network parameter receiving unit, a multiplexing-count calculation unit, an input image and weight shaping unit, and an input image and weight writing unit, the multiplexing-count calculation unit comprising an input image multiplexing-count calculation unit and a weight multiplexing-count calculation unit;
the neural network parameter receiving unit receives the neural network parameters sent by the state machine control module and decodes them to obtain the horizontal stride, vertical stride, convolution kernel height, convolution kernel width, input image height, input image width, number of input image channels, and number of output image channels of the neural network; the input image multiplexing-count calculation unit and the weight multiplexing-count calculation unit respectively calculate, in combination with the hardware parameters of the neural network, the multiplexing counts required by the input image and the weights currently participating in the calculation; the input image and weight shaping unit receives the results calculated by the two multiplexing-count calculation units, calculates the specific addresses at which the input image and weights participating in the calculation should be written into the input image and weight buffer, and sends these addresses to the input image and weight writing unit; and the input image and weight writing unit judges whether the input image and weight buffer is in a writable state and, when it is, writes the input image and weights to the specific addresses in the buffer.
5. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the frequency adjustment module dynamically adjusts the frequency of the calculation module, specifically as follows:
if the single-pass parallelism of the calculation module cannot cover the whole input image and the convolution kernel size and stride produce overlap between convolution regions, that is, the multiplexing counts required by the input image and weights currently participating in the calculation are both nonzero, the frequency of the calculation module is adjusted to twice the frequency of the input image and weight buffer; if the single-pass parallelism of the calculation module can cover the whole input image, or the kernel size and stride leave adjacent convolution regions without overlap, that is, the multiplexing count required by the input image or the weights currently participating in the calculation is 0, the frequency of the calculation module is kept consistent with the frequency of the input image and weight buffer.
6. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the calculation module comprises a three-dimensional vector processing array composed of 4 rows and 16 columns of intelligent computing vector processing units, through which the calculation module performs vector multiply-add operations in parallel.
CN202311576340.4A (priority 2023-11-24, filed 2023-11-24) Adaptive dual-frequency multiply-add array for a neural network accelerator; Active; granted as CN117289897B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311576340.4A | 2023-11-24 | 2023-11-24 | Adaptive dual-frequency multiply-add array for a neural network accelerator

Publications (2)

Publication Number | Publication Date
CN117289897A (en) | 2023-12-26
CN117289897B (en) | 2024-04-02

Family

ID=89241004

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311576340.4A (Active, granted as CN117289897B) | Adaptive dual-frequency multiply-add array for a neural network accelerator | 2023-11-24 | 2023-11-24

Country Status (1)

Country | Link
CN (1) | CN117289897B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR20210064044A * | 2019-11-25 | 2021-06-02 | 울산과학기술원 | Apparatus and method for performing artificial neural network inference in mobile terminal
US20220188613A1 * | 2020-12-15 | 2022-06-16 | The George Washington University | SGCNAX: a scalable graph convolutional neural network accelerator with workload balancing
CN116702851A * | 2023-06-27 | 2023-09-05 | 中科南京智能技术研究院 | Systolic array unit and systolic array structure suitable for weight-multiplexing neural networks
CN116840821A * | 2023-09-01 | 2023-10-03 | 无锡市海鹰加科海洋技术有限责任公司 | Dual-frequency sounding control system based on data analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周超; 王跃科; 乔纯捷; 戴卫华: "Adaptive notch interference suppression for complex signals in GNSS receivers" (全球导航卫星系统接收机的复信号自适应陷波干扰抑制), Journal of National University of Defense Technology (国防科技大学学报), no. 05 *
王娟; 杜雄; 周雒维: "Influence of dual-frequency Buck converter parameters on system performance" (双频Buck变换器参数对系统性能的影响), Power Supply World (电源世界), no. 05 *

Also Published As

Publication number | Publication date
CN117289897B (en) | 2024-04-02

Similar Documents

Publication Publication Date Title
US20210224125A1 (en) Operation Accelerator, Processing Method, and Related Device
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
CN101331464A (en) Storage region allocation system, method, and control device
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN110738308B (en) Neural network accelerator
CN111832718B (en) Chip architecture
CN113220630B (en) Reconfigurable array optimization method and automatic optimization method for hardware accelerator
KR102545658B1 (en) Apparatus and Method for Convolutional Neural Network Quantization Inference
CN111931909B (en) Lightweight convolutional neural network reconfigurable deployment method based on FPGA
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN111353586A (en) System for realizing CNN acceleration based on FPGA
JP2023178385A (en) Information processing apparatus and information processing method
CN113065643A (en) Apparatus and method for performing multi-task convolutional neural network prediction
CN113157638B (en) Low-power-consumption in-memory calculation processor and processing operation method
CN117289897B (en) Self-adaptive double-frequency multiply-add array suitable for neural network accelerator
Voss et al. Convolutional neural networks on dataflow engines
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
Xiao et al. FGPA: Fine-grained pipelined acceleration for depthwise separable CNN in resource constraint scenarios
CN111191780A (en) Average value pooling accumulation circuit, device and method
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
US11886973B2 (en) Neural processing unit including variable internal memory
CN115965067B (en) Neural network accelerator for ReRAM
CN115312095B (en) In-memory computation running water multiplication and addition circuit supporting internal data updating
US20230385622A1 (en) Neural processing unit including internal memory having scalable bandwidth and driving method thereof
US20230033179A1 (en) Accumulator and processing-in-memory (pim) device including the accumulator

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant