CN108171317B - Data multiplexing convolution neural network accelerator based on SOC

Info

Publication number
CN108171317B
CN108171317B
Authority
CN
China
Prior art keywords
image
address
data
weight
state machine
Prior art date
Legal status
Active
Application number
CN201711207259.3A
Other languages
Chinese (zh)
Other versions
CN108171317A (en)
Inventor
秦智勇
陈雷
于立新
庄伟
彭和平
倪玮琳
张世远
Current Assignee
Beijing Microelectronic Technology Institute
Mxtronics Corp
Original Assignee
Beijing Microelectronic Technology Institute
Mxtronics Corp
Priority date
Filing date
Publication date
Application filed by Beijing Microelectronic Technology Institute, Mxtronics Corp
Priority to CN201711207259.3A
Publication of CN108171317A
Application granted
Publication of CN108171317B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Input (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an SOC (system-on-chip)-based data multiplexing convolutional neural network accelerator that groups the input data of a convolutional neural network, such as the image input, weight parameters and bias parameters, dividing the large volume of input data into reusable data blocks and reading each multiplexed block under state machine control. Convolutional neural networks have many parameters and demand substantial computing power, so an accelerator must provide high data bandwidth and computing capacity. The invention partitions this heavy workload into multiplexable segments and realizes data multiplexing through the control unit and the address generation units, reducing the latency and bandwidth required by convolutional neural network operation and improving operating efficiency.

Description

Data multiplexing convolution neural network accelerator based on SOC
Technical Field
The invention relates to an SOC (system-on-chip)-based data multiplexing convolutional neural network accelerator, in particular to convolutional neural network acceleration for embedded devices, and belongs to the field of embedded applications.
Background
With continuous development and optimization, the convolutional neural network (CNN) has become widely used in pattern recognition, including image recognition, object recognition, image segmentation and object tracking, with remarkable results, establishing the dominant position of convolutional neural networks among pattern recognition algorithms.
However, deep convolutional neural networks consume considerable computational and storage resources and cannot be applied directly on embedded terminals. In the image recognition network AlexNet, the convolution and fully-connected operations together comprise 1.45G operations and the parameter count is 58M; at 4 bytes per parameter, the model parameters require 232M bytes, which is very large for on-chip storage, while storing these parameters in external memory significantly reduces the processing rate. Meanwhile, processing one image requires 1.5G operations, again counting only convolution and fully-connected operations, not pooling and regularization. Although a convolutional neural network has a large number of parameters, its operation is regular and much of its data must be reused, so data multiplexing can improve the operating efficiency of the convolutional neural network and reduce the energy consumption of the whole system.
Disclosure of Invention
The technical problem solved by the invention is as follows: to overcome the latency and wasted power caused by the low operating efficiency and large-scale external memory accesses of conventional convolutional neural networks, an SOC-based data multiplexing convolutional neural network accelerator is provided that fully exploits the reusability of input data and convolution kernel data and improves the convolutional neural network performance of embedded devices.
The technical solution of the invention is as follows: an SOC-based data multiplexing convolutional neural network accelerator comprises an image address generator, an image buffer, shift logic, a weight address generator, a weight buffer, an offset address generator, an offset buffer, a control unit and a computing unit array. The control unit receives an externally input start control signal and then, according to a preset timing, controls the offset address generator, the weight address generator and the image address generator to generate an offset write control signal, a weight write control signal and an image write control signal, storing the offset, weight and image data block by block into the corresponding buffers. The control unit then controls the offset address generator, the weight address generator and the image address generator to generate the read/write addresses of the corresponding buffers; the weight buffer and the offset buffer output the weight and offset data at the corresponding addresses to the computing unit array, while the image buffer outputs the image data at the corresponding address to the shift logic. The shift logic shifts the image data according to the shift control signal and layer operation sequence number sent by the control unit and outputs it to the computing unit array, and the computing unit array performs multilayer convolution, pooling and multilayer fully-connected operations on the image data by a block operation method according to the weight data, offset data and image data.
The control unit comprises a main control module, a weight control state machine, a bias control state machine, an image control state machine and a write control state machine, wherein:
the main control module receives the externally input start control signal and then, following a preset timing based on the per-layer convolution, pooling and fully-connected operation times of the convolutional neural network accelerator, divides the convolution, pooling and fully-connected processing into a sequence of layer operations and divides each layer operation into multiple block operations; it sends a write-control start instruction to the write control state machine before a layer operation begins, and sends a write-control stop signal to the write control state machine after all data required by the current layer operation has been written into the corresponding buffers; at the start of a layer operation it sends read-control start signals to the weight read control state machine, the offset read control state machine and the image read control state machine, and sends the shift control signal and layer operation sequence number to the shift logic; at the end of each layer operation and each block operation it sends layer-operation-end and block-operation-end flag signals to the weight, offset and image read control state machines, and sends the shift control signal and layer operation sequence number to the shift logic;
under the control of the main control module, the weight control state machine, the bias control state machine, the image control state machine and the write control state machine respectively output corresponding read enable signals, write enable signals and chip select signals to the weight buffer area, the bias buffer area and the image buffer area, output corresponding address control signals to the weight address generator, the bias address generator and the image address generator, and the weight address generator, the bias address generator and the image address generator generate corresponding read-write addresses according to the address control signals.
The image buffer and the weight buffer both use a grouped storage structure: each is divided into M sub-buffers whose chip-select and read/write-enable terminals are connected in parallel and whose address lines are independent of one another. Corresponding addresses across the sub-buffers store the image data or weight data required by one block operation and are written or read simultaneously, where M is the maximum amount of image data used by one block operation.
The weight address generator and the offset address generator comprise counters, the count values of the counters are output to corresponding buffer areas as addresses, and when the address reset signals are effective, the count values of the counters are cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
The image address generator comprises a read address generating module, a write address generating module and a read-write address gating module;
the write address generation module comprises a counter, the count value of the counter is output to the read-write address gating module as an image write address, and when the address reset signal is effective, the count value of the counter is cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
The read address generation module comprises R read address generation sub-modules, where R is the number of layers. Each sub-module generates the addresses required by all block operations within one layer operation, and the sub-module of the current layer is gated by the layer sequence number. For the processing of a given layer, three-dimensional image data is input and the address first increments along the image channel direction; when data reading along the channel direction is complete, the address steps in the column changing direction of the image two-dimensional plane and then continues to increment along the channel direction; when data reading in both the column changing direction and the channel direction is complete, the address steps in the row changing direction of the image two-dimensional plane and again continues along the channel direction, until the processing of the whole block data is complete.
The image data comprises X × Y × N blocks, where X is the number of blocks in the row direction, Y the number of blocks in the column direction, and N the number of blocks in the channel direction; one piece of three-dimensional image data is read from outside at a time and stored into the M sub-buffers, each sub-buffer storing one element of the three-dimensional data block, and the data blocks are extracted in the following order:
(1) initialize the data block row index i = 1, column index j = 1 and channel index k = 1;
(2) read the data block with row index i, column index j and channel index k;
(3) increment k by 1; repeat steps (2) to (3) until k ≥ N, then go to step (4);
(4) increment j by 1 and set k = 1; repeat steps (2) to (4) until j ≥ Y and k ≥ N, then go to step (5);
(5) increment i by 1 and set j = 1, k = 1; repeat steps (2) to (5) until i ≥ X, j ≥ Y and k ≥ N, then end.
During convolution, the shift logic determines the two in-plane dimensions of the convolution from the convolutional layer sequence number, reorders the image data sequence according to those dimensions, and determines the shift amount of the sequence from the shift control signal sent by the control unit, so that the image data of each block operation entering the computing unit array is aligned with the weight data. During pooling or fully-connected processing, the shift logic outputs the image data in the buffer directly to the computing unit array.
The computing unit array comprises a multiplier array, an adder tree, an accumulator, a nonlinear unit and a gating output unit, wherein:
a multiplier array multiplying the image data by the weight;
the adder tree adds all product terms of the multiplier array, and the result is output to the accumulator;
an accumulator for resetting when the block operation is finished, accumulating the result output by the adder tree, and outputting the accumulated result to the nonlinear unit as the convolution result
And a nonlinear unit, which performs pooling processing on the convolution result and outputs, for example: comparing the accumulated result with 0, and taking the larger value to output;
and the gating output unit is used for receiving the output gating signal sent by the control unit and outputting a gating convolution result or a pooling result.
Compared with the prior art, the invention has the beneficial effects that:
(1) by reading the smaller convolution weight and offset parameters from external memory multiple times, the method ensures that the far larger image input data needs to be read from external memory only once, reducing the total latency of external memory accesses and the corresponding power consumption, and improving the operating efficiency of the convolutional neural network;
(2) the invention separates the main control unit from the address generation units: the main control unit tracks the state of the data block on which the convolutional neural network is currently operating, while the address generation units produce the specific address of each datum within that block. By splitting the overall address control state machine into these two parts, the state machines are simpler and the area overhead and power consumption are lower than with a single address control unit.
Drawings
FIG. 1 is a block diagram of the overall architecture of the convolutional neural network accelerator of the present invention;
FIG. 2 is a timing diagram of the control unit of the present invention;
FIG. 3 is a block diagram of the architecture of an array of compute units of the present invention;
FIG. 4 is an image read address generation state machine of the present invention;
FIG. 5 illustrates a block-based storage of image data according to the present invention;
FIG. 6 is a detailed diagram of y-direction address increment in image read address generation according to the present invention;
FIG. 7 is a detailed diagram of the increment of the x-direction address in the image read address generation of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
As shown in FIG. 1, the present invention provides an SOC (system-on-chip)-based data multiplexing convolutional neural network accelerator, which includes an image address generator, an image buffer, shift logic, a weight address generator, a weight buffer, an offset address generator, an offset buffer, a control unit and a computing unit array. The control unit receives an externally input start control signal and then, according to a preset timing, controls the offset address generator, the weight address generator and the image address generator to generate an offset write control signal, a weight write control signal and an image write control signal, storing the offset, weight and image data block by block into the corresponding buffers. The control unit then controls the offset address generator, the weight address generator and the image address generator to generate the read/write addresses of the corresponding buffers; the weight buffer and the offset buffer output the weight and offset data at the corresponding addresses to the computing unit array, while the image buffer outputs the image data at the corresponding address to the shift logic. The shift logic shifts the image data according to the shift control signal and layer operation sequence number sent by the control unit and outputs it to the computing unit array, and the computing unit array performs multilayer convolution, pooling and multilayer fully-connected operations on the image data by a block operation method according to the weight data, offset data and image data. The preset timing is shown in FIG. 2.
The following components are described separately:
1. Control unit
The control unit comprises a main control module, a weight control state machine, a bias control state machine, an image control state machine and a write control state machine.
1.1 Main control module
The main control module receives the externally input start control signal and then, following a preset timing based on the per-layer convolution, pooling and fully-connected operation times of the convolutional neural network accelerator, divides the convolution, pooling and fully-connected processing into a sequence of layer operations and divides each layer operation into multiple block operations; it sends a write-control start instruction to the write control state machine before a layer operation begins, and sends a write-control stop signal to the write control state machine after all data required by the current layer operation has been written into the corresponding buffers; at the start of a layer operation it sends read-control start signals to the weight read control state machine, the offset read control state machine and the image read control state machine, and sends the shift control signal and layer operation sequence number to the shift logic; at the end of each layer operation and each block operation it sends layer-operation-end and block-operation-end flag signals to the weight, offset and image read control state machines, and sends the shift control signal and layer operation sequence number to the shift logic.
Under the control of the main control module, the weight control state machine, the bias control state machine, the image control state machine and the write control state machine respectively output corresponding read enable signals, write enable signals and chip select signals to the weight buffer area, the bias buffer area and the image buffer area, output corresponding address control signals to the weight address generator, the bias address generator and the image address generator, and the weight address generator, the bias address generator and the image address generator generate corresponding read-write addresses according to the address control signals.
The accelerator operation comprises several stages: an idle stage, the 1st through N1-th convolutional layer stages, a pooling layer stage, and the 1st through N2-th fully-connected layer stages. The control unit starts in the idle stage, enters the first convolutional layer stage when the external accelerator start signal is asserted, proceeds to the second convolutional layer stage when the first completes, and so on until processing finishes. In every stage other than idle, the control unit manages the four main sub-state machines (the write control state machine and the weight, offset and image read control state machines) in a similar manner, controlling address generation and the reading and writing of the corresponding data.
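For illustration, the stage sequencing can be sketched as a simple Python generator (a behavioral sketch only; the stage names are descriptive, with N1 and N2 as defined above):

```python
def accelerator_stages(n1, n2):
    """Yield the accelerator stages in order: idle, N1 convolutional layer
    stages, one pooling stage, then N2 fully-connected layer stages."""
    yield "idle"
    for i in range(1, n1 + 1):
        yield f"conv_layer_{i}"
    yield "pooling_layer"
    for i in range(1, n2 + 1):
        yield f"fc_layer_{i}"
```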
1.2 Weight read control state machine
The weight address control signals comprise a weight read-address reset signal, a weight read-address hold signal and a weight read-address increment signal. The weight read control state machine has 3 states, RW_state0, RW_state1 and RW_state2, and is initialized to RW_state0. The state machine operates as follows:
RW_state0: read the read-control start signal; when it is valid, issue a valid weight read-address reset signal, and one clock cycle later send a valid weight chip-select signal to the weight buffer with the weight read/write enable signal set to the "read enable" state, then enter RW_state1; otherwise remain in RW_state0;
RW_state1: set the weight read-address reset signal to invalid and generate a valid weight address hold signal; read the block-operation-end flag signal and the layer-operation-end flag signal; when the block-operation-end flag signal is valid, enter RW_state2; when the layer-operation-end flag signal is valid, return to RW_state0;
RW_state2: deassert the weight address hold signal and generate a valid weight address increment signal; one clock cycle later, jump back to RW_state1.
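For illustration only, this three-state machine can be modeled behaviorally in Python as follows (a sketch assuming boolean, active-high signals; the one-cycle delay of the chip-select and read-enable outputs is abbreviated to a comment, and the offset and image read control machines of sections 1.3 and 1.4 follow the same pattern with their own signals):

```python
class WeightReadFSM:
    """Behavioral sketch of the weight read control state machine
    (RW_state0/1/2). Signal names are descriptive, not taken from RTL."""
    def __init__(self):
        self.state = "RW_state0"

    def step(self, read_start, block_end, layer_end):
        reset = hold = incr = False
        if self.state == "RW_state0":
            if read_start:                  # valid start: reset read address;
                reset = True                # chip-select and 'read enable' are
                self.state = "RW_state1"    # issued one clock cycle later
        elif self.state == "RW_state1":
            hold = True                     # hold the address within a block
            if layer_end:
                self.state = "RW_state0"    # layer done: back to initial state
            elif block_end:
                self.state = "RW_state2"    # block done: advance the address
        else:                               # RW_state2
            incr = True                     # one increment, then back
            self.state = "RW_state1"
        return reset, hold, incr            # drives the weight address counter
```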
1.3 Offset read control state machine
The offset read control signals comprise an offset read-address reset signal, an offset read-address hold signal and an offset read-address increment signal. The offset read control state machine is designed like the weight read control state machine: it has 3 states, RB_state0, RB_state1 and RB_state2, initialized to RB_state0, and operates as follows:
RB_state0: read the read-control start signal; when it is valid, issue a valid offset read-address reset signal, and one clock cycle later send a valid offset chip-select signal to the offset buffer with the offset read/write enable signal set to the "read enable" state, then enter RB_state1; otherwise remain in RB_state0;
RB_state1: set the offset read-address reset signal to invalid and generate a valid offset address hold signal; read the block-operation-end flag signal and the layer-operation-end flag signal; when the block-operation-end flag signal is valid, enter RB_state2; when the layer-operation-end flag signal is valid, return to RB_state0;
RB_state2: deassert the offset address hold signal and generate a valid offset address increment signal; one clock cycle later, jump back to RB_state1.
1.4 Image read control state machine
The image read address control signals comprise an image read-address reset signal, an image read-address hold signal and an image read-address increment signal. The image read control state machine has 3 states, RP_state0, RP_state1 and RP_state2, and operates as follows:
RP_state0: read the read-control start signal; when it is valid, issue a valid image read-address reset signal, and one clock cycle later send a valid image chip-select signal to the image buffer with the image read/write enable signal set to the "read enable" state, then enter RP_state1; otherwise remain in RP_state0;
RP_state1: set the image read-address reset signal to invalid and generate a valid image read-address increment signal; read the block-operation-end flag signal and the layer-operation-end flag signal; when the block-operation-end flag signal is valid, jump to RP_state2; when the layer-operation-end flag signal is valid, return to RP_state0;
RP_state2: issue a valid image read-address reset signal, and one clock cycle later jump back to RP_state1.
1.5 Write control state machine
The write control signals comprise a write-address reset signal, a write-address hold signal, a write-address increment signal and a chip-select signal. The write control state machine has two states, wr_state0 and wr_state1, and operates as follows:
wr_state0: read the write-control start signal; when it is valid, issue a valid write-address reset signal, and one clock cycle later generate a valid chip-select signal with the read/write enable signal set to the write-enable state, then enter wr_state1; otherwise remain in wr_state0;
wr_state1: set the write-address reset signal to invalid and generate a valid write-address increment signal; read the write-control stop signal, and when it is valid, jump back to wr_state0.
2. Image buffer, weight buffer and offset buffer
To save storage space and increase read speed, the image buffer and the weight buffer both use a grouped storage structure: each is divided into M sub-buffers whose chip-select and read/write-enable terminals are connected in parallel and whose address lines are independent of one another. Corresponding addresses across the sub-buffers store the image data or weight data required by one block operation and are written or read simultaneously. M is the maximum amount of image data used by one block operation.
The offset buffer is an ordinary SRAM whose storage depth is greater than the number of channels the accelerator operates on.
3. Address generator
To facilitate data access, assume the three-dimensional image comprises X × Y × N blocks, where X is the number of blocks in the row direction, Y the number of blocks in the column direction, and N the number of blocks in the channel direction; one piece of three-dimensional image data is read from outside at a time and stored into the M sub-buffers, each sub-buffer storing one element of the three-dimensional data block, and the data blocks are extracted in the following order (a behavioral sketch follows the list):
(1) initialize the data block row index i = 1, column index j = 1 and channel index k = 1;
(2) read the data block with row index i, column index j and channel index k;
(3) increment k by 1; repeat steps (2) to (3) until k ≥ N, then go to step (4);
(4) increment j by 1 and set k = 1; repeat steps (2) to (4) until j ≥ Y and k ≥ N, then go to step (5);
(5) increment i by 1 and set j = 1, k = 1; repeat steps (2) to (5) until i ≥ X, j ≥ Y and k ≥ N, then end.
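Steps (1) to (5) amount to three nested loops with the channel index innermost; a minimal Python sketch of the resulting block order (indices 1-based, as in the steps above):

```python
def block_order(X, Y, N):
    """Yield (row i, column j, channel k) block indices in the order of
    steps (1) to (5): channel fastest, then column, then row."""
    for i in range(1, X + 1):            # step (5): row direction, slowest
        for j in range(1, Y + 1):        # step (4): column direction
            for k in range(1, N + 1):    # steps (2)-(3): channel, fastest
                yield (i, j, k)

# e.g. X = Y = 2, N = 3 reads (1,1,1), (1,1,2), (1,1,3), (1,2,1), ...
print(list(block_order(2, 2, 3)))
```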
The weight address generator and the offset address generator comprise counters, the count values of the counters are output to corresponding buffer areas as addresses, and when the address reset signals are effective, the count values of the counters are cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
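As a rough behavioral model (not the RTL), the counter inside each of these generators can be sketched in Python as follows; the three control signals are assumed to be mutually exclusive, as in the state machines above:

```python
class AddressCounter:
    """Sketch of the counter inside the weight/offset address generators."""
    def __init__(self):
        self.count = 0                 # count value used as the buffer address

    def tick(self, reset, hold, increment):
        if reset:                      # address reset signal valid: clear
            self.count = 0
        elif increment:                # address increment signal valid: +1
            self.count += 1
        # hold valid: count value unchanged
        return self.count              # output to the corresponding buffer
```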
4. Image address generator
The image address generator comprises a read address generation module, a write address generation module and a read/write address gating module;
the write address generation module comprises a counter, the count value of the counter is output to the read-write address gating module as an image write address, and when the address reset signal is effective, the count value of the counter is cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
The read address generation module comprises R read address generation sub-modules, where R is the number of layers. As shown in FIG. 4, each sub-module generates the addresses required by all block operations within one layer operation, and the sub-module of the current layer is gated by the layer sequence number. For the processing of a given layer, three-dimensional image data is input and the address first increments along the image channel direction; when data reading along the channel direction is complete, the address steps in the column changing direction of the image two-dimensional plane and then continues to increment along the channel direction; when data reading in both the column changing direction and the channel direction is complete, the address steps in the row changing direction of the image two-dimensional plane and again continues along the channel direction, until the processing of the whole data block is complete. FIG. 5 shows the block storage layout of the image data; FIG. 6 details the y-direction address increment and FIG. 7 the x-direction address increment in image read address generation. Each read address generation sub-module is implemented as follows:
When the layer sequence number is r, define the block operation index within the layer as y, starting from 0 with step 1, and let the image data block computed by each block operation in the layer have size a_r × b_r × c_r, where a_r is the number of data of the image data block in the row direction, b_r the number in the column direction, and c_r the number in the channel direction; the data of the image data block are stored in order, row by row, column by column and channel by channel, at the same address in the 1st through M-th sub-buffers.
When y = 0, all M sub-buffer addresses are the initial address, with value 1; otherwise the M sub-buffer addresses are determined according to the following rules:
when y is not exactly divisible by c_r, the addresses of all M sub-buffers are incremented;
when y is exactly divisible by the product of c_r and a_r, let rem be the remainder of y divided by the product a_r × b_r × c_r; then, when w % (a_r × b_r) ∈ [1 + (rem - 1) × a_r, a_r × rem], the address of the w-th sub-buffer is its previous address plus 1; otherwise the address of the w-th sub-buffer is its previous address minus c_r × a_r plus 1;
when y is exactly divisible by c_r (but not by c_r × a_r), let re = (y / c_r) % a_r; when the remainder of w divided by a_r equals re, the address of the w-th sub-buffer is its previous address plus 1; otherwise the address of the w-th sub-buffer is its previous address minus c_r × a_r plus 1.
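These three update rules can be restated directly in Python as a sketch (1-based sub-buffer index w and 1-based addresses, as above; note that the c_r × a_r case must be tested before the plain c_r case, since it is the more specific condition):

```python
def update_addresses(addr, y, a_r, b_r, c_r, M):
    """Per-block update of the M sub-buffer read addresses for layer r.
    addr: list of the M previous addresses; y: block operation index."""
    if y == 0:
        return [1] * M                           # initial block: all addresses 1
    if y % c_r != 0:                             # y not divisible by c_r
        return [a + 1 for a in addr]             # every address increments
    if y % (c_r * a_r) == 0:                     # y divisible by c_r * a_r
        rem = y % (a_r * b_r * c_r)
        lo, hi = 1 + (rem - 1) * a_r, a_r * rem
        return [addr[w - 1] + 1 if lo <= w % (a_r * b_r) <= hi
                else addr[w - 1] - c_r * a_r + 1
                for w in range(1, M + 1)]
    re = (y // c_r) % a_r                        # y divisible by c_r only
    return [addr[w - 1] + 1 if w % a_r == re
            else addr[w - 1] - c_r * a_r + 1
            for w in range(1, M + 1)]
```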
5. Shift logic
The convolution operation slides the convolution kernel over a window of image data, so the shift logic must apply a corresponding shift to the image data. The shift logic reads the sequence number of the current convolutional layer to determine the two in-plane dimensions of the convolution, reorders the image data sequence according to those dimensions, and determines the number of positions to shift the sequence from the shift control signal sent by the control unit, so that the image data of each block operation entering the computing unit array is aligned with the weight data. During pooling or fully-connected processing, the shift logic outputs the image data in the buffer directly to the computing unit array.
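As a rough illustration, the alignment amounts to a rotation of the (already reordered) image data sequence; a Python sketch under that assumption:

```python
def align_to_weights(seq, shift):
    """Rotate the image data sequence by `shift` positions so each block
    operation lines up with the stationary weights in the array.
    A behavioral sketch; the hardware does this with multiplexers."""
    s = shift % len(seq)
    return seq[s:] + seq[:s]

# pooling / fully-connected processing: shift of 0, data passes through
assert align_to_weights([1, 2, 3, 4], 0) == [1, 2, 3, 4]
```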
6. Array of computing units
As shown in fig. 3, the calculation unit array includes a multiplier array, an adder tree, an accumulator, a nonlinear unit, and a gated output unit, wherein:
a multiplier array multiplying the image data by the weight;
the adder tree adds all product terms of the multiplier array, and the result is output to the accumulator;
an accumulator for resetting when the block operation is finished, accumulating the result output by the adder tree, and outputting the accumulated result to the nonlinear unit as the convolution result
And a nonlinear unit, which performs pooling processing on the convolution result and outputs, for example: comparing the accumulated result with 0, and taking the larger value to output;
the gating output unit receives the output gating signal sent by the control unit and outputs a gating convolution result or a pooling result; since the convolution operation and the full join operation are identical in operation form, the multiplier array, the adder tree and the accumulator together complete the convolution operation and the full join operation.
Example:
the method mainly comprises image input, weight parameters of a convolutional neural network model and bias parameters, wherein the image input is characterized in that two dimensions in a two-dimensional plane direction are large, the range is 1-107, the number of channels is gradually increased from 3 to 512 as the layer number of the convolutional neural network is deepened, the weight parameters are general convolutional kernel data, the dimensions in the two-dimensional plane direction are 7 × 7, 5 × 5, 3 × 3 and 1 × 1, the number of the channels is 3-512, only one channel is arranged in each bias parameter, and therefore only 3-512 parameters are arranged in each layer.
The grouped, block-stored data must be multiplexed through flexible address control, which is performed jointly by the control unit and the address generation units. The control unit outputs the state of the data block currently being computed, and the address generation units generate the specific addresses of the 150 data groups according to that state. As shown in fig. 2, the control unit directs the accelerator to read all inputs, including weights, offsets and images, in turn, and computation and output begin once the inputs are ready. When reading weights, the data in DRAM are read into the weight buffer: the control unit first asserts the chip-select signal of the weight buffer and puts the buffer in the write state. Writing the weight buffer requires addresses; since the weight buffer is a single block and there is no multiplexing of multi-channel convolution kernels, the control unit only needs to issue address-ascending signals, and the weight address generator produces sequentially increasing addresses. Meanwhile, the control unit sends invalid signals to the other modules, whose outputs are invalid during this time. The whole weight-reading process remains in this control state until all required weights have been read, then moves to the next state. Reading offsets is similar, except that the valid control signals are sent to the offset buffer and the offset address generator.
Reading image data is more complicated than reading weights and offsets. The chip-select and read/write status signals are handled identically; only address generation differs. When the image data divide exactly into small blocks of 150 data, the block boundaries are clean: every small data block contains exactly 150 data, with no incomplete blocks, and the write addresses increase sequentially, just as for the weight and offset addresses.
Grouping complete means that the input parameters and weight parameters divide exactly into the 150 groups. As shown in fig. 5, with an input image of size 25 × 25 × 96 and a convolution kernel of size 5 × 5 × 96, the image is first partitioned by kernel size; but one kernel-sized region holds 2400 data, while the image buffer has only 150 groups, so the image data under each kernel is further split into small 5 × 5 × 6 blocks. The input image is thus partitioned into 5 × 5 × 16 = 400 small blocks, and each 5 × 5 × 6 block is stored across the 150 groups of the image buffer. In practice the image data are fetched from DRAM into SRAM, typically 32 bits per DRAM access, so the data of each small block are fetched in order and placed into the 150 buffer groups in turn: the first image data block occupies address 1 of every group, the second block is then stored at address 2 of every group, and so on.
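The block counts quoted in this example can be checked with a few lines (a worked sketch of the arithmetic above):

```python
elements_per_sub = 5 * 5 * 6                          # 150: one buffer group each
subs_per_image = (25 // 5) * (25 // 5) * (96 // 6)    # 5 * 5 * 16 = 400 sub-blocks
data_per_kernel = 5 * 5 * 96                          # 2400 > 150 buffer groups,
                                                      # hence the further split
print(elements_per_sub, subs_per_image, data_per_kernel)   # 150 400 2400
```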
Suppose instead that the second block were chosen along the x direction, processing the first block and then its x-neighbor. This creates a problem: one convolution kernel covers 16 data blocks in total, and processing each block yields an intermediate rather than a final result, so additional storage is needed. Proceeding in the x direction first requires 11 registers for intermediate results, and taking x first and y next, with z last, requires 121 registers. While 121 registers is not very large, for a 224 × 224 image input with a 3 × 3 convolution kernel such a direction choice would require 224 × 224 = 50176 registers; so large a register array consumes substantial resources, and the control logic becomes far more complex if some 50k registers must each be controlled individually. Choosing the channel (z) direction first instead means the accumulation for each output is completed into a final result before moving to the next position, so no large intermediate-result register array is needed; this is why the data blocks are extracted along the channel direction first.
In general, by partitioning the load the invention caches the easily multiplexed portion on chip, improves the utilization of on-chip stored data, reduces the latency and power consumption of frequent external memory reads and writes, and improves the operating efficiency of the convolutional neural network.
Parts of the present invention not described in detail in the specification are common general knowledge of those skilled in the art.

Claims (7)

1. A data multiplexing convolution neural network accelerator based on SOC, characterized by comprising an image address generator, an image buffer, shift logic, a weight address generator, a weight buffer, an offset address generator, an offset buffer, a control unit and a computing unit array, wherein the control unit receives an externally input start control signal and then, according to a preset timing, controls the offset address generator, the weight address generator and the image address generator to generate an offset write control signal, a weight write control signal and an image write control signal, storing the offset, weight and image data block by block into the corresponding buffers; the control unit then controls the offset address generator, the weight address generator and the image address generator to generate the read/write addresses of the corresponding buffers, and the weight buffer and the offset buffer output the weight and offset data at the corresponding addresses to the computing unit array; the image buffer outputs the image data at the corresponding address to the shift logic, the shift logic shifts the image data according to the shift control signal and layer operation sequence number sent by the control unit and outputs it to the computing unit array, and the computing unit array performs multilayer convolution, pooling and multilayer fully-connected operations on the image data by a block operation method according to the weight data, offset data and image data;
the control unit comprises a main control module, a weight control state machine, a bias control state machine, an image control state machine and a write control state machine, wherein:
the main control module receives the externally input start control signal and then, following a preset timing based on the per-layer convolution, pooling and fully-connected operation times of the convolutional neural network accelerator, divides the convolution, pooling and fully-connected processing into a sequence of layer operations and divides each layer operation into multiple block operations; it sends a write-control start instruction to the write control state machine before a layer operation begins, and sends a write-control stop signal to the write control state machine after all data required by the current layer operation has been written into the corresponding buffers; at the start of a layer operation it sends read-control start signals to the weight read control state machine, the offset read control state machine and the image read control state machine, and sends the shift control signal and layer operation sequence number to the shift logic; at the end of each layer operation and each block operation it sends layer-operation-end and block-operation-end flag signals to the weight, offset and image read control state machines, and sends the shift control signal and layer operation sequence number to the shift logic;
under the control of the main control module, the weight control state machine, the bias control state machine, the image control state machine and the write control state machine respectively output corresponding read enable signals, write enable signals and chip select signals to the weight buffer area, the bias buffer area and the image buffer area, output corresponding address control signals to the weight address generator, the bias address generator and the image address generator, and the weight address generator, the bias address generator and the image address generator generate corresponding read-write addresses according to the address control signals.
2. The SOC-based data multiplexing convolutional neural network accelerator according to claim 1, wherein the image buffer and the weight buffer both use a grouped storage structure: each is divided into M sub-buffers whose chip-select and read/write-enable terminals are connected in parallel and whose address lines are independent of one another; corresponding addresses across the sub-buffers store the image data or weight data required by one block operation and are written or read simultaneously, and M is the maximum amount of image data used by one block operation.
3. The SOC-based data multiplexing convolutional neural network accelerator of claim 1, wherein the weight address generator and the offset address generator comprise counters, the count values of the counters are output to the corresponding buffers as addresses, and when the address reset signal is "valid", the count values of the counters are cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
4. The SOC-based data multiplexing convolutional neural network accelerator of claim 1, wherein the image address generator comprises a read address generation module, a write address generation module, and a read/write address gating module;
the write address generation module comprises a counter, the count value of the counter is output to the read-write address gating module as an image write address, and when the address reset signal is effective, the count value of the counter is cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is effective, the count value of the counter is increased by 1;
the read address generation module comprises R read address generation sub-modules, where R is the number of layers; each sub-module generates the addresses required by all block operations within one layer operation, and the sub-module of the current layer is gated by the layer sequence number; for the processing of a given layer, three-dimensional image data is input and the address first increments along the image channel direction; when data reading along the channel direction is complete, the address steps in the column changing direction of the image two-dimensional plane and then continues to increment along the channel direction; when data reading in both the column changing direction and the channel direction is complete, the address steps in the row changing direction of the image two-dimensional plane and again continues along the channel direction, until the processing of the whole block data is complete.
5. The SOC-based data multiplexing convolutional neural network accelerator according to claim 1, wherein the image data comprises X × Y × N blocks, where X is the number of blocks in the row direction, Y the number of blocks in the column direction, and N the number of blocks in the channel direction; one piece of three-dimensional image data is read from outside at a time and stored into the M sub-buffers, each sub-buffer storing one element of the three-dimensional data block, and the data blocks are extracted in the following order:
(1) initialize the data block row index i = 1, column index j = 1 and channel index k = 1;
(2) read the data block with row index i, column index j and channel index k;
(3) increment k by 1; repeat steps (2) to (3) until k ≥ N, then go to step (4);
(4) increment j by 1 and set k = 1; repeat steps (2) to (4) until j ≥ Y and k ≥ N, then go to step (5);
(5) increment i by 1 and set j = 1, k = 1; repeat steps (2) to (5) until i ≥ X, j ≥ Y and k ≥ N, then end.
6. The SOC-based data multiplexing convolutional neural network accelerator according to claim 1, wherein during convolution the shift logic determines the two in-plane dimensions of the convolution from the convolutional layer sequence number, reorders the image data sequence according to those dimensions, and determines the shift amount of the sequence from the shift control signal sent by the control unit, so that the image data of each block operation entering the computing unit array is aligned with the weight data; and during pooling or fully-connected processing, the shift logic outputs the image data in the buffer directly to the computing unit array.
7. The SOC-based data multiplexing convolutional neural network accelerator of claim 1, wherein the compute unit array comprises a multiplier array, an adder tree, an accumulator, a nonlinear unit, and a gated output unit, wherein:
a multiplier array multiplying the image data by the weight;
the adder tree adds all product terms of the multiplier array, and the result is output to the accumulator;
an accumulator for resetting when the block operation is finished, accumulating the result output by the adder tree, and outputting the accumulated result to the nonlinear unit as the convolution result
The nonlinear unit is used for performing pooling processing on the convolution result and outputting the result;
and the gating output unit is used for receiving the output gating signal sent by the control unit and outputting a gating convolution result or a pooling result.
CN201711207259.3A 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC Active CN108171317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711207259.3A CN108171317B (en) 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711207259.3A CN108171317B (en) 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC

Publications (2)

Publication Number Publication Date
CN108171317A CN108171317A (en) 2018-06-15
CN108171317B 2020-08-04

Family

ID=62524477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711207259.3A Active CN108171317B (en) 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC

Country Status (1)

Country Link
CN (1) CN108171317B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985449B (en) * 2018-06-28 2021-03-09 中国科学院计算技术研究所 Control method and device for convolutional neural network processor
WO2020019174A1 (en) * 2018-07-24 2020-01-30 深圳市大疆创新科技有限公司 Data access method, processor, computer system and movable device
CN108681984B (en) * 2018-07-26 2023-08-15 珠海一微半导体股份有限公司 Acceleration circuit of 3*3 convolution algorithm
CN109141403B (en) * 2018-08-01 2021-02-02 上海航天控制技术研究所 Image processing system and method for small window access of star sensor
CN109146072B (en) * 2018-08-01 2021-03-23 上海天数智芯半导体有限公司 Data reuse method based on convolutional neural network accelerator
CN109086875A (en) * 2018-08-16 2018-12-25 郑州云海信息技术有限公司 A kind of convolutional network accelerating method and device based on macroinstruction set
CN109284824B (en) * 2018-09-04 2021-07-23 复旦大学 Reconfigurable technology-based device for accelerating convolution and pooling operation
CN109460813B (en) * 2018-09-10 2022-02-15 中国科学院深圳先进技术研究院 Acceleration method, device and equipment for convolutional neural network calculation and storage medium
CN112970037B (en) * 2018-11-06 2024-02-02 创惟科技股份有限公司 Multi-chip system for implementing neural network applications, data processing method suitable for multi-chip system, and non-transitory computer readable medium
CN109581185B (en) * 2018-11-16 2021-11-09 北京时代民芯科技有限公司 SoC chip laser simulation single particle irradiation detection and fault positioning method and system
CN109359735B (en) * 2018-11-23 2020-12-04 浙江大学 Data input device and method for accelerating deep neural network hardware
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN109740732B (en) * 2018-12-27 2021-05-11 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data multiplexing method and related equipment
CN111382094B (en) * 2018-12-29 2021-11-30 深圳云天励飞技术有限公司 Data processing method and device
CN109886400B (en) * 2019-02-19 2020-11-27 合肥工业大学 Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN109886395B (en) * 2019-03-06 2020-11-24 上海熠知电子科技有限公司 Data reading method for multi-core image processing convolutional neural network
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal
CN111832585B (en) * 2019-04-16 2023-04-18 杭州海康威视数字技术股份有限公司 Image processing method and device
CN110222819B (en) * 2019-05-13 2021-04-20 西安交通大学 Multilayer data partition combined calculation method for convolutional neural network acceleration
CN111985628B (en) * 2019-05-24 2024-04-30 澜起科技股份有限公司 Computing device and neural network processor comprising same
CN110390383B (en) * 2019-06-25 2021-04-06 东南大学 Deep neural network hardware accelerator based on power exponent quantization
CN110598858A (en) * 2019-08-02 2019-12-20 北京航空航天大学 Chip and method for realizing binary neural network based on nonvolatile memory calculation
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110458285B (en) * 2019-08-14 2021-05-14 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110533177B (en) * 2019-08-22 2023-12-26 安谋科技(中国)有限公司 Data read-write device, method, equipment, medium and convolution accelerator
CN110956258B (en) * 2019-12-17 2023-05-16 深圳鲲云信息科技有限公司 Neural network acceleration circuit and method
CN111340224B (en) * 2020-02-27 2023-11-21 浙江芯劢微电子股份有限公司 Accelerated design method of CNN (computer network) suitable for low-resource embedded chip
WO2021179289A1 (en) * 2020-03-13 2021-09-16 深圳市大疆创新科技有限公司 Operational method and apparatus of convolutional neural network, device, and storage medium
CN111753962B (en) * 2020-06-24 2023-07-11 国汽(北京)智能网联汽车研究院有限公司 Adder, multiplier, convolution layer structure, processor and accelerator
CN111651378B (en) * 2020-07-06 2023-09-19 Oppo广东移动通信有限公司 Data storage method, soC chip and computer equipment
CN111915001B (en) * 2020-08-18 2024-04-12 腾讯科技(深圳)有限公司 Convolution calculation engine, artificial intelligent chip and data processing method
CN112070217B (en) * 2020-10-15 2023-06-06 天津大学 Internal storage bandwidth optimization method of convolutional neural network accelerator
CN112950656A (en) * 2021-03-09 2021-06-11 北京工业大学 Block convolution method for pre-reading data according to channel based on FPGA platform
CN113128688B (en) * 2021-04-14 2022-10-21 北京航空航天大学 General AI parallel reasoning acceleration structure and reasoning equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator

Also Published As

Publication number Publication date
CN108171317A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
WO2020258528A1 (en) Configurable universal convolutional neural network accelerator
CN109598338B (en) Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN108985450B (en) Vector processor-oriented convolution neural network operation vectorization method
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
TWI634489B (en) Multi-layer artificial neural network
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN110222818A (en) A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110580519B (en) Convolution operation device and method thereof
CN113222130A (en) Reconfigurable convolution neural network accelerator based on FPGA
CN112487750A (en) Convolution acceleration computing system and method based on memory computing
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
US20230047364A1 (en) Partial sum management and reconfigurable systolic flow architectures for in-memory computation
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN108920097A (en) A kind of three-dimensional data processing method based on Laden Balance
CN113610221A (en) Variable expansion convolution operation hardware system based on FPGA
CN113627587A (en) Multichannel convolutional neural network acceleration method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant