CN108171317B - Data multiplexing convolution neural network accelerator based on SOC

Info

Publication number
CN108171317B
CN108171317B
Authority
CN
China
Prior art keywords
image
address
data
weight
state machine
Prior art date
Legal status
Active
Application number
CN201711207259.3A
Other languages
Chinese (zh)
Other versions
CN108171317A (en)
Inventor
秦智勇
陈雷
于立新
庄伟
彭和平
倪玮琳
张世远
Current Assignee
Beijing Microelectronic Technology Institute
Mxtronics Corp
Original Assignee
Beijing Microelectronic Technology Institute
Mxtronics Corp
Priority date
Filing date
Publication date
Application filed by Beijing Microelectronic Technology Institute, Mxtronics Corp
Priority to CN201711207259.3A
Publication of CN108171317A
Application granted
Publication of CN108171317B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Input (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an SOC (system-on-chip)-based data multiplexing convolutional neural network accelerator that groups the input data of a convolutional neural network, such as the image input, weight parameters and bias parameters, dividing the large volume of input data into reusable data blocks and reading each multiplexed block under state machine control. Convolutional neural networks have many parameters and demand substantial computing power, so an accelerator must provide high data bandwidth and computing capacity. The invention partitions this heavy workload into multiplexable segments and realizes data multiplexing through the control unit and the address generation units, reducing the latency and bandwidth required by convolutional neural network operation and improving operating efficiency.

Description

Data multiplexing convolution neural network accelerator based on SOC
Technical Field
The invention relates to an SOC (system-on-chip)-based data multiplexing convolutional neural network accelerator, in particular to convolutional neural network acceleration for embedded devices, and belongs to the field of embedded applications.
Background
With continuous development and optimization, the convolutional neural network (CNN) has become widely used in pattern recognition, including image recognition, object recognition, image segmentation and object tracking, with remarkable results, establishing the dominant position of convolutional neural networks among pattern recognition algorithms.
However, deep convolutional neural networks consume considerable computational and storage resources and cannot be applied directly on embedded terminals. In the image recognition network AlexNet, the convolution and fully-connected operations together comprise 1.45G operations and the parameter count is 58M; at 4 bytes per parameter, the model parameters require 232M bytes, which is very large for on-chip storage, while storing these parameters in external memory significantly reduces the processing rate. Meanwhile, processing one image requires 1.5G operations, again counting only convolution and fully-connected operations, not pooling and regularization. Although a convolutional neural network has a large number of parameters, its operation is regular and much of its data must be reused, so data multiplexing can improve the operating efficiency of the convolutional neural network and reduce the energy consumption of the whole system.
Disclosure of Invention
The technical problem solved by the invention is as follows: to overcome the latency and wasted power caused by the low operating efficiency and large-scale external memory accesses of conventional convolutional neural networks, an SOC-based data multiplexing convolutional neural network accelerator is provided that fully exploits the reusability of input data and convolution kernel data and improves the convolutional neural network performance of embedded devices.
The technical solution of the invention is as follows: an SOC-based data multiplexing convolutional neural network accelerator comprises an image address generator, an image buffer, shift logic, a weight address generator, a weight buffer, an offset address generator, an offset buffer, a control unit and a computing unit array. The control unit receives an externally input start control signal and then, according to a preset timing, controls the offset address generator, the weight address generator and the image address generator to generate an offset write control signal, a weight write control signal and an image write control signal, storing the offset, weight and image data block by block into the corresponding buffers. The control unit then controls the offset address generator, the weight address generator and the image address generator to generate the read/write addresses of the corresponding buffers; the weight buffer and the offset buffer output the weight and offset data at the corresponding addresses to the computing unit array, while the image buffer outputs the image data at the corresponding address to the shift logic. The shift logic shifts the image data according to the shift control signal and layer operation sequence number sent by the control unit and outputs it to the computing unit array, and the computing unit array performs multilayer convolution, pooling and multilayer fully-connected operations on the image data by a block operation method according to the weight data, offset data and image data.
The control unit comprises a main control module, a weight control state machine, a bias control state machine, an image control state machine and a write control state machine, wherein:
the main control module receives the externally input start control signal and then, following a preset timing based on the per-layer convolution, pooling and fully-connected operation times of the convolutional neural network accelerator, divides the convolution, pooling and fully-connected processing into a sequence of layer operations and divides each layer operation into multiple block operations; it sends a write-control start instruction to the write control state machine before a layer operation begins, and sends a write-control stop signal to the write control state machine after all data required by the current layer operation has been written into the corresponding buffers; at the start of a layer operation it sends read-control start signals to the weight read control state machine, the offset read control state machine and the image read control state machine, and sends the shift control signal and layer operation sequence number to the shift logic; at the end of each layer operation and each block operation it sends layer-operation-end and block-operation-end flag signals to the weight, offset and image read control state machines, and sends the shift control signal and layer operation sequence number to the shift logic;
under the control of the main control module, the weight control state machine, the bias control state machine, the image control state machine and the write control state machine respectively output corresponding read enable signals, write enable signals and chip select signals to the weight buffer area, the bias buffer area and the image buffer area, output corresponding address control signals to the weight address generator, the bias address generator and the image address generator, and the weight address generator, the bias address generator and the image address generator generate corresponding read-write addresses according to the address control signals.
The image buffer and the weight buffer both use a grouped storage structure: each is divided into M sub-buffers whose chip-select and read/write-enable terminals are connected in parallel and whose address lines are independent of one another. Corresponding addresses across the sub-buffers store the image data or weight data required by one block operation and are written or read simultaneously, where M is the maximum amount of image data used by one block operation.
The weight address generator and the offset address generator comprise counters, the count values of the counters are output to corresponding buffer areas as addresses, and when the address reset signals are effective, the count values of the counters are cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
The image address generator comprises a read address generating module, a write address generating module and a read-write address gating module;
the write address generation module comprises a counter, the count value of the counter is output to the read-write address gating module as an image write address, and when the address reset signal is effective, the count value of the counter is cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
The read address generation module comprises R read address generation sub-modules, where R is the number of layers. Each sub-module generates the addresses required by all block operations within one layer operation, and the sub-module of the current layer is gated by the layer sequence number. For the processing of a given layer, three-dimensional image data is input and the address first increments along the image channel direction; when data reading along the channel direction is complete, the address steps in the column changing direction of the image two-dimensional plane and then continues to increment along the channel direction; when data reading in both the column changing direction and the channel direction is complete, the address steps in the row changing direction of the image two-dimensional plane and again continues along the channel direction, until the processing of the whole block data is complete.
The image data comprises X × Y × N blocks, where X is the number of blocks in the row direction, Y the number of blocks in the column direction, and N the number of blocks in the channel direction; one piece of three-dimensional image data is read from outside at a time and stored into the M sub-buffers, each sub-buffer storing one element of the three-dimensional data block, and the data blocks are extracted in the following order:
(1) initialize the data block row index i = 1, column index j = 1 and channel index k = 1;
(2) read the data block with row index i, column index j and channel index k;
(3) increment k by 1; repeat steps (2) to (3) until k ≥ N, then go to step (4);
(4) increment j by 1 and set k = 1; repeat steps (2) to (4) until j ≥ Y and k ≥ N, then go to step (5);
(5) increment i by 1 and set j = 1, k = 1; repeat steps (2) to (5) until i ≥ X, j ≥ Y and k ≥ N, then end.
During convolution, the shift logic determines the two in-plane dimensions of the convolution from the convolutional layer sequence number, reorders the image data sequence according to those dimensions, and determines the shift amount of the sequence from the shift control signal sent by the control unit, so that the image data of each block operation entering the computing unit array is aligned with the weight data. During pooling or fully-connected processing, the shift logic outputs the image data in the buffer directly to the computing unit array.
The computing unit array comprises a multiplier array, an adder tree, an accumulator, a nonlinear unit and a gating output unit, wherein:
a multiplier array multiplying the image data by the weight;
the adder tree adds all product terms of the multiplier array, and the result is output to the accumulator;
an accumulator for resetting when the block operation is finished, accumulating the result output by the adder tree, and outputting the accumulated result to the nonlinear unit as the convolution result
And a nonlinear unit, which performs pooling processing on the convolution result and outputs, for example: comparing the accumulated result with 0, and taking the larger value to output;
and the gating output unit is used for receiving the output gating signal sent by the control unit and outputting a gating convolution result or a pooling result.
Compared with the prior art, the invention has the beneficial effects that:
(1) by reading the smaller convolution weight and offset parameters from external memory multiple times, the method ensures that the far larger image input data needs to be read from external memory only once, reducing the total latency of external memory accesses and the corresponding power consumption, and improving the operating efficiency of the convolutional neural network;
(2) the invention separates the main control unit from the address generation units: the main control unit tracks the state of the data block on which the convolutional neural network is currently operating, while the address generation units produce the specific address of each datum within that block. By splitting the overall address control state machine into these two parts, the state machines are simpler and the area overhead and power consumption are lower than with a single address control unit.
Drawings
FIG. 1 is a block diagram of the overall architecture of the convolutional neural network accelerator of the present invention;
FIG. 2 is a timing diagram of the control unit of the present invention;
FIG. 3 is a block diagram of the architecture of an array of compute units of the present invention;
FIG. 4 is an image read address generation state machine of the present invention;
FIG. 5 illustrates a block-based storage of image data according to the present invention;
FIG. 6 is a detailed diagram of y-direction address increment in image read address generation according to the present invention;
FIG. 7 is a detailed diagram of the increment of the x-direction address in the image read address generation of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
As shown in FIG. 1, the present invention provides an SOC (system-on-chip)-based data multiplexing convolutional neural network accelerator, which includes an image address generator, an image buffer, shift logic, a weight address generator, a weight buffer, an offset address generator, an offset buffer, a control unit and a computing unit array. The control unit receives an externally input start control signal and then, according to a preset timing, controls the offset address generator, the weight address generator and the image address generator to generate an offset write control signal, a weight write control signal and an image write control signal, storing the offset, weight and image data block by block into the corresponding buffers. The control unit then controls the offset address generator, the weight address generator and the image address generator to generate the read/write addresses of the corresponding buffers; the weight buffer and the offset buffer output the weight and offset data at the corresponding addresses to the computing unit array, while the image buffer outputs the image data at the corresponding address to the shift logic. The shift logic shifts the image data according to the shift control signal and layer operation sequence number sent by the control unit and outputs it to the computing unit array, and the computing unit array performs multilayer convolution, pooling and multilayer fully-connected operations on the image data by a block operation method according to the weight data, offset data and image data. The preset timing is shown in FIG. 2.
The following components are described separately:
1. Control unit
The control unit comprises a main control module, a weight control state machine, a bias control state machine, an image control state machine and a write control state machine.
1.1 Main control module
The main control module receives the externally input start control signal and then, following a preset timing based on the per-layer convolution, pooling and fully-connected operation times of the convolutional neural network accelerator, divides the convolution, pooling and fully-connected processing into a sequence of layer operations and divides each layer operation into multiple block operations; it sends a write-control start instruction to the write control state machine before a layer operation begins, and sends a write-control stop signal to the write control state machine after all data required by the current layer operation has been written into the corresponding buffers; at the start of a layer operation it sends read-control start signals to the weight read control state machine, the offset read control state machine and the image read control state machine, and sends the shift control signal and layer operation sequence number to the shift logic; at the end of each layer operation and each block operation it sends layer-operation-end and block-operation-end flag signals to the weight, offset and image read control state machines, and sends the shift control signal and layer operation sequence number to the shift logic.
Under the control of the main control module, the weight control state machine, the bias control state machine, the image control state machine and the write control state machine respectively output corresponding read enable signals, write enable signals and chip select signals to the weight buffer area, the bias buffer area and the image buffer area, output corresponding address control signals to the weight address generator, the bias address generator and the image address generator, and the weight address generator, the bias address generator and the image address generator generate corresponding read-write addresses according to the address control signals.
The accelerator operation comprises several stages: an idle stage, the 1st through N1-th convolutional layer stages, a pooling layer stage, and the 1st through N2-th fully-connected layer stages. The control unit starts in the idle stage, enters the first convolutional layer stage when the external accelerator start signal is asserted, proceeds to the second convolutional layer stage when the first completes, and so on until processing finishes. In every stage other than idle, the control unit manages the four main sub-state machines (the write control state machine and the weight, offset and image read control state machines) in a similar manner, controlling address generation and the reading and writing of the corresponding data.
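For illustration, the stage sequencing can be sketched as a simple Python generator (a behavioral sketch only; the stage names are descriptive, with N1 and N2 as defined above):

```python
def accelerator_stages(n1, n2):
    """Yield the accelerator stages in order: idle, N1 convolutional layer
    stages, one pooling stage, then N2 fully-connected layer stages."""
    yield "idle"
    for i in range(1, n1 + 1):
        yield f"conv_layer_{i}"
    yield "pooling_layer"
    for i in range(1, n2 + 1):
        yield f"fc_layer_{i}"
```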
1.2 Weight read control state machine
The weight address control signals comprise a weight read-address reset signal, a weight read-address hold signal and a weight read-address increment signal. The weight read control state machine has 3 states, RW_state0, RW_state1 and RW_state2, and is initialized to RW_state0. The state machine operates as follows:
RW_state0: read the read-control start signal; when it is valid, issue a valid weight read-address reset signal, and one clock cycle later send a valid weight chip-select signal to the weight buffer with the weight read/write enable signal set to the "read enable" state, then enter RW_state1; otherwise remain in RW_state0;
RW_state1: set the weight read-address reset signal to invalid and generate a valid weight address hold signal; read the block-operation-end flag signal and the layer-operation-end flag signal; when the block-operation-end flag signal is valid, enter RW_state2; when the layer-operation-end flag signal is valid, return to RW_state0;
RW_state2: deassert the weight address hold signal and generate a valid weight address increment signal; one clock cycle later, jump back to RW_state1.
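For illustration only, this three-state machine can be modeled behaviorally in Python as follows (a sketch assuming boolean, active-high signals; the one-cycle delay of the chip-select and read-enable outputs is abbreviated to a comment, and the offset and image read control machines of sections 1.3 and 1.4 follow the same pattern with their own signals):

```python
class WeightReadFSM:
    """Behavioral sketch of the weight read control state machine
    (RW_state0/1/2). Signal names are descriptive, not taken from RTL."""
    def __init__(self):
        self.state = "RW_state0"

    def step(self, read_start, block_end, layer_end):
        reset = hold = incr = False
        if self.state == "RW_state0":
            if read_start:                  # valid start: reset read address;
                reset = True                # chip-select and 'read enable' are
                self.state = "RW_state1"    # issued one clock cycle later
        elif self.state == "RW_state1":
            hold = True                     # hold the address within a block
            if layer_end:
                self.state = "RW_state0"    # layer done: back to initial state
            elif block_end:
                self.state = "RW_state2"    # block done: advance the address
        else:                               # RW_state2
            incr = True                     # one increment, then back
            self.state = "RW_state1"
        return reset, hold, incr            # drives the weight address counter
```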
1.3 Offset read control state machine
The offset read control signals comprise an offset read-address reset signal, an offset read-address hold signal and an offset read-address increment signal. The offset read control state machine is designed like the weight read control state machine: it has 3 states, RB_state0, RB_state1 and RB_state2, initialized to RB_state0, and operates as follows:
RB_state0: read the read-control start signal; when it is valid, issue a valid offset read-address reset signal, and one clock cycle later send a valid offset chip-select signal to the offset buffer with the offset read/write enable signal set to the "read enable" state, then enter RB_state1; otherwise remain in RB_state0;
RB_state1: set the offset read-address reset signal to invalid and generate a valid offset address hold signal; read the block-operation-end flag signal and the layer-operation-end flag signal; when the block-operation-end flag signal is valid, enter RB_state2; when the layer-operation-end flag signal is valid, return to RB_state0;
RB_state2: deassert the offset address hold signal and generate a valid offset address increment signal; one clock cycle later, jump back to RB_state1.
1.4 Image read control state machine
The image read address control signals comprise an image read-address reset signal, an image read-address hold signal and an image read-address increment signal. The image read control state machine has 3 states, RP_state0, RP_state1 and RP_state2, and operates as follows:
RP_state0: read the read-control start signal; when it is valid, issue a valid image read-address reset signal, and one clock cycle later send a valid image chip-select signal to the image buffer with the image read/write enable signal set to the "read enable" state, then enter RP_state1; otherwise remain in RP_state0;
RP_state1: set the image read-address reset signal to invalid and generate a valid image read-address increment signal; read the block-operation-end flag signal and the layer-operation-end flag signal; when the block-operation-end flag signal is valid, jump to RP_state2; when the layer-operation-end flag signal is valid, return to RP_state0;
RP_state2: issue a valid image read-address reset signal, and one clock cycle later jump back to RP_state1.
1.5 Write control state machine
The write control signals comprise a write-address reset signal, a write-address hold signal, a write-address increment signal and a chip-select signal. The write control state machine has two states, wr_state0 and wr_state1, and operates as follows:
wr_state0: read the write-control start signal; when it is valid, issue a valid write-address reset signal, and one clock cycle later generate a valid chip-select signal with the read/write enable signal set to the write-enable state, then enter wr_state1; otherwise remain in wr_state0;
wr_state1: set the write-address reset signal to invalid and generate a valid write-address increment signal; read the write-control stop signal, and when it is valid, jump back to wr_state0.
2. Image buffer, weight buffer and offset buffer
To save storage space and increase read speed, the image buffer and the weight buffer both use a grouped storage structure: each is divided into M sub-buffers whose chip-select and read/write-enable terminals are connected in parallel and whose address lines are independent of one another. Corresponding addresses across the sub-buffers store the image data or weight data required by one block operation and are written or read simultaneously. M is the maximum amount of image data used by one block operation.
The offset buffer is an ordinary SRAM whose storage depth is greater than the number of channels the accelerator operates on.
3. Address generator
To facilitate data access, assume the three-dimensional image comprises X × Y × N blocks, where X is the number of blocks in the row direction, Y the number of blocks in the column direction, and N the number of blocks in the channel direction; one piece of three-dimensional image data is read from outside at a time and stored into the M sub-buffers, each sub-buffer storing one element of the three-dimensional data block, and the data blocks are extracted in the following order (a behavioral sketch follows the list):
(1) initialize the data block row index i = 1, column index j = 1 and channel index k = 1;
(2) read the data block with row index i, column index j and channel index k;
(3) increment k by 1; repeat steps (2) to (3) until k ≥ N, then go to step (4);
(4) increment j by 1 and set k = 1; repeat steps (2) to (4) until j ≥ Y and k ≥ N, then go to step (5);
(5) increment i by 1 and set j = 1, k = 1; repeat steps (2) to (5) until i ≥ X, j ≥ Y and k ≥ N, then end.
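Steps (1) to (5) amount to three nested loops with the channel index innermost; a minimal Python sketch of the resulting block order (indices 1-based, as in the steps above):

```python
def block_order(X, Y, N):
    """Yield (row i, column j, channel k) block indices in the order of
    steps (1) to (5): channel fastest, then column, then row."""
    for i in range(1, X + 1):            # step (5): row direction, slowest
        for j in range(1, Y + 1):        # step (4): column direction
            for k in range(1, N + 1):    # steps (2)-(3): channel, fastest
                yield (i, j, k)

# e.g. X = Y = 2, N = 3 reads (1,1,1), (1,1,2), (1,1,3), (1,2,1), ...
print(list(block_order(2, 2, 3)))
```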
The weight address generator and the offset address generator comprise counters, the count values of the counters are output to corresponding buffer areas as addresses, and when the address reset signals are effective, the count values of the counters are cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
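As a rough behavioral model (not the RTL), the counter inside each of these generators can be sketched in Python as follows; the three control signals are assumed to be mutually exclusive, as in the state machines above:

```python
class AddressCounter:
    """Sketch of the counter inside the weight/offset address generators."""
    def __init__(self):
        self.count = 0                 # count value used as the buffer address

    def tick(self, reset, hold, increment):
        if reset:                      # address reset signal valid: clear
            self.count = 0
        elif increment:                # address increment signal valid: +1
            self.count += 1
        # hold valid: count value unchanged
        return self.count              # output to the corresponding buffer
```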
4. Image address generator
The image address generator comprises a read address generation module, a write address generation module and a read/write address gating module;
the write address generation module comprises a counter, the count value of the counter is output to the read-write address gating module as an image write address, and when the address reset signal is effective, the count value of the counter is cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
The read address generation module comprises R read address generation sub-modules, where R is the number of layers. As shown in FIG. 4, each sub-module generates the addresses required by all block operations within one layer operation, and the sub-module of the current layer is gated by the layer sequence number. For the processing of a given layer, three-dimensional image data is input and the address first increments along the image channel direction; when data reading along the channel direction is complete, the address steps in the column changing direction of the image two-dimensional plane and then continues to increment along the channel direction; when data reading in both the column changing direction and the channel direction is complete, the address steps in the row changing direction of the image two-dimensional plane and again continues along the channel direction, until the processing of the whole data block is complete. FIG. 5 shows the block storage layout of the image data; FIG. 6 details the y-direction address increment and FIG. 7 the x-direction address increment in image read address generation. Each read address generation sub-module is implemented as follows:
When the layer sequence number is r, define the block operation index within the layer as y, starting from 0 with step 1, and let the image data block computed by each block operation in the layer have size a_r × b_r × c_r, where a_r is the number of data of the image data block in the row direction, b_r the number in the column direction, and c_r the number in the channel direction; the data of the image data block are stored in order, row by row, column by column and channel by channel, at the same address in the 1st through M-th sub-buffers.
When y = 0, all M sub-buffer addresses are the initial address, with value 1; otherwise the M sub-buffer addresses are determined according to the following rules:
when y is not exactly divisible by c_r, the addresses of all M sub-buffers are incremented;
when y is exactly divisible by the product of c_r and a_r, let rem be the remainder of y divided by the product a_r × b_r × c_r; then, when w % (a_r × b_r) ∈ [1 + (rem - 1) × a_r, a_r × rem], the address of the w-th sub-buffer is its previous address plus 1; otherwise the address of the w-th sub-buffer is its previous address minus c_r × a_r plus 1;
when y is exactly divisible by c_r (but not by c_r × a_r), let re = (y / c_r) % a_r; when the remainder of w divided by a_r equals re, the address of the w-th sub-buffer is its previous address plus 1; otherwise the address of the w-th sub-buffer is its previous address minus c_r × a_r plus 1.
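These three update rules can be restated directly in Python as a sketch (1-based sub-buffer index w and 1-based addresses, as above; note that the c_r × a_r case must be tested before the plain c_r case, since it is the more specific condition):

```python
def update_addresses(addr, y, a_r, b_r, c_r, M):
    """Per-block update of the M sub-buffer read addresses for layer r.
    addr: list of the M previous addresses; y: block operation index."""
    if y == 0:
        return [1] * M                           # initial block: all addresses 1
    if y % c_r != 0:                             # y not divisible by c_r
        return [a + 1 for a in addr]             # every address increments
    if y % (c_r * a_r) == 0:                     # y divisible by c_r * a_r
        rem = y % (a_r * b_r * c_r)
        lo, hi = 1 + (rem - 1) * a_r, a_r * rem
        return [addr[w - 1] + 1 if lo <= w % (a_r * b_r) <= hi
                else addr[w - 1] - c_r * a_r + 1
                for w in range(1, M + 1)]
    re = (y // c_r) % a_r                        # y divisible by c_r only
    return [addr[w - 1] + 1 if w % a_r == re
            else addr[w - 1] - c_r * a_r + 1
            for w in range(1, M + 1)]
```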
5. Shift logic
The convolution operation slides the convolution kernel over a window of image data, so the shift logic must apply a corresponding shift to the image data. The shift logic reads the sequence number of the current convolutional layer to determine the two in-plane dimensions of the convolution, reorders the image data sequence according to those dimensions, and determines the number of positions to shift the sequence from the shift control signal sent by the control unit, so that the image data of each block operation entering the computing unit array is aligned with the weight data. During pooling or fully-connected processing, the shift logic outputs the image data in the buffer directly to the computing unit array.
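As a rough illustration, the alignment amounts to a rotation of the (already reordered) image data sequence; a Python sketch under that assumption:

```python
def align_to_weights(seq, shift):
    """Rotate the image data sequence by `shift` positions so each block
    operation lines up with the stationary weights in the array.
    A behavioral sketch; the hardware does this with multiplexers."""
    s = shift % len(seq)
    return seq[s:] + seq[:s]

# pooling / fully-connected processing: shift of 0, data passes through
assert align_to_weights([1, 2, 3, 4], 0) == [1, 2, 3, 4]
```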
6. Array of computing units
As shown in fig. 3, the calculation unit array includes a multiplier array, an adder tree, an accumulator, a nonlinear unit, and a gated output unit, wherein:
a multiplier array multiplying the image data by the weight;
the adder tree adds all product terms of the multiplier array, and the result is output to the accumulator;
an accumulator for resetting when the block operation is finished, accumulating the result output by the adder tree, and outputting the accumulated result to the nonlinear unit as the convolution result
And a nonlinear unit, which performs pooling processing on the convolution result and outputs, for example: comparing the accumulated result with 0, and taking the larger value to output;
the gating output unit receives the output gating signal sent by the control unit and outputs a gating convolution result or a pooling result; since the convolution operation and the full join operation are identical in operation form, the multiplier array, the adder tree and the accumulator together complete the convolution operation and the full join operation.
Example:
the method mainly comprises image input, weight parameters of a convolutional neural network model and bias parameters, wherein the image input is characterized in that two dimensions in a two-dimensional plane direction are large, the range is 1-107, the number of channels is gradually increased from 3 to 512 as the layer number of the convolutional neural network is deepened, the weight parameters are general convolutional kernel data, the dimensions in the two-dimensional plane direction are 7 × 7, 5 × 5, 3 × 3 and 1 × 1, the number of the channels is 3-512, only one channel is arranged in each bias parameter, and therefore only 3-512 parameters are arranged in each layer.
The grouped, block-stored data must be multiplexed through flexible address control, which is performed jointly by the control unit and the address generation units. The control unit outputs the state of the data block currently being computed, and the address generation units generate the specific addresses of the 150 data groups according to that state. As shown in fig. 2, the control unit directs the accelerator to read all inputs, including weights, offsets and images, in turn, and computation and output begin once the inputs are ready. When reading weights, the data in DRAM are read into the weight buffer: the control unit first asserts the chip-select signal of the weight buffer and puts the buffer in the write state. Writing the weight buffer requires addresses; since the weight buffer is a single block and there is no multiplexing of multi-channel convolution kernels, the control unit only needs to issue address-ascending signals, and the weight address generator produces sequentially increasing addresses. Meanwhile, the control unit sends invalid signals to the other modules, whose outputs are invalid during this time. The whole weight-reading process remains in this control state until all required weights have been read, then moves to the next state. Reading offsets is similar, except that the valid control signals are sent to the offset buffer and the offset address generator.
Reading image data is more complicated than reading weights and offsets. The chip-select and read/write status signals are handled identically; only address generation differs. When the image data divide exactly into small blocks of 150 data, the block boundaries are clean: every small data block contains exactly 150 data, with no incomplete blocks, and the write addresses increase sequentially, just as for the weight and offset addresses.
Grouping complete means that the input parameters and weight parameters divide exactly into the 150 groups. As shown in fig. 5, with an input image of size 25 × 25 × 96 and a convolution kernel of size 5 × 5 × 96, the image is first partitioned by kernel size; but one kernel-sized region holds 2400 data, while the image buffer has only 150 groups, so the image data under each kernel is further split into small 5 × 5 × 6 blocks. The input image is thus partitioned into 5 × 5 × 16 = 400 small blocks, and each 5 × 5 × 6 block is stored across the 150 groups of the image buffer. In practice the image data are fetched from DRAM into SRAM, typically 32 bits per DRAM access, so the data of each small block are fetched in order and placed into the 150 buffer groups in turn: the first image data block occupies address 1 of every group, the second block is then stored at address 2 of every group, and so on.
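The block counts quoted in this example can be checked with a few lines (a worked sketch of the arithmetic above):

```python
elements_per_sub = 5 * 5 * 6                          # 150: one buffer group each
subs_per_image = (25 // 5) * (25 // 5) * (96 // 6)    # 5 * 5 * 16 = 400 sub-blocks
data_per_kernel = 5 * 5 * 96                          # 2400 > 150 buffer groups,
                                                      # hence the further split
print(elements_per_sub, subs_per_image, data_per_kernel)   # 150 400 2400
```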
Suppose instead that the second block were chosen along the x direction, processing the first block and then its x-neighbor. This creates a problem: one convolution kernel covers 16 data blocks in total, and processing each block yields an intermediate rather than a final result, so additional storage is needed. Proceeding in the x direction first requires 11 registers for intermediate results, and taking x first and y next, with z last, requires 121 registers. While 121 registers is not very large, for a 224 × 224 image input with a 3 × 3 convolution kernel such a direction choice would require 224 × 224 = 50176 registers; so large a register array consumes substantial resources, and the control logic becomes far more complex if some 50k registers must each be controlled individually. Choosing the channel (z) direction first instead means the accumulation for each output is completed into a final result before moving to the next position, so no large intermediate-result register array is needed; this is why the data blocks are extracted along the channel direction first.
In general, by partitioning the load the invention caches the easily multiplexed portion on chip, improves the utilization of on-chip stored data, reduces the latency and power consumption of frequent external memory reads and writes, and improves the operating efficiency of the convolutional neural network.
Parts of the present invention not described in detail in the specification are common general knowledge of those skilled in the art.

Claims (7)

1. A data multiplexing convolution neural network accelerator based on SOC, characterized by comprising an image address generator, an image buffer, shift logic, a weight address generator, a weight buffer, an offset address generator, an offset buffer, a control unit and a computing unit array, wherein the control unit receives an externally input start control signal and then, according to a preset timing, controls the offset address generator, the weight address generator and the image address generator to generate an offset write control signal, a weight write control signal and an image write control signal, storing the offset, weight and image data block by block into the corresponding buffers; the control unit then controls the offset address generator, the weight address generator and the image address generator to generate the read/write addresses of the corresponding buffers, and the weight buffer and the offset buffer output the weight and offset data at the corresponding addresses to the computing unit array; the image buffer outputs the image data at the corresponding address to the shift logic, the shift logic shifts the image data according to the shift control signal and layer operation sequence number sent by the control unit and outputs it to the computing unit array, and the computing unit array performs multilayer convolution, pooling and multilayer fully-connected operations on the image data by a block operation method according to the weight data, offset data and image data;
the control unit comprises a main control module, a weight control state machine, a bias control state machine, an image control state machine and a write control state machine, wherein:
the main control module receives the externally input start control signal and then, following a preset timing based on the per-layer convolution, pooling and fully-connected operation times of the convolutional neural network accelerator, divides the convolution, pooling and fully-connected processing into a sequence of layer operations and divides each layer operation into multiple block operations; it sends a write-control start instruction to the write control state machine before a layer operation begins, and sends a write-control stop signal to the write control state machine after all data required by the current layer operation has been written into the corresponding buffers; at the start of a layer operation it sends read-control start signals to the weight read control state machine, the offset read control state machine and the image read control state machine, and sends the shift control signal and layer operation sequence number to the shift logic; at the end of each layer operation and each block operation it sends layer-operation-end and block-operation-end flag signals to the weight, offset and image read control state machines, and sends the shift control signal and layer operation sequence number to the shift logic;
under the control of the main control module, the weight control state machine, the bias control state machine, the image control state machine and the write control state machine respectively output corresponding read enable signals, write enable signals and chip select signals to the weight buffer area, the bias buffer area and the image buffer area, output corresponding address control signals to the weight address generator, the bias address generator and the image address generator, and the weight address generator, the bias address generator and the image address generator generate corresponding read-write addresses according to the address control signals.
2. The SOC-based data multiplexing convolutional neural network accelerator according to claim 1, wherein the image buffer and the weight buffer both use a grouped storage structure: each is divided into M sub-buffers whose chip-select and read/write-enable terminals are connected in parallel and whose address lines are independent of one another; corresponding addresses across the sub-buffers store the image data or weight data required by one block operation and are written or read simultaneously, and M is the maximum amount of image data used by one block operation.
3. The SOC-based data multiplexing convolutional neural network accelerator of claim 1, wherein the weight address generator and the offset address generator comprise counters, the count values of the counters are output to the corresponding buffers as addresses, and when the address reset signal is "valid", the count values of the counters are cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is active, the counter value of the counter is incremented by 1.
4. The SOC-based data multiplexing convolutional neural network accelerator of claim 1, wherein the image address generator comprises a read address generation module, a write address generation module, and a read/write address gating module;
the write address generation module comprises a counter, the count value of the counter is output to the read-write address gating module as an image write address, and when the address reset signal is effective, the count value of the counter is cleared; when the address holding signal is valid, the count value of the counter is unchanged; when the address increment signal is effective, the count value of the counter is increased by 1;
the read address generation module comprises R read address generation sub-modules, where R is the number of layers; each sub-module generates the addresses required by all block operations within one layer operation, and the sub-module of the current layer is gated by the layer sequence number; for the processing of a given layer, three-dimensional image data is input and the address first increments along the image channel direction; when data reading along the channel direction is complete, the address steps in the column changing direction of the image two-dimensional plane and then continues to increment along the channel direction; when data reading in both the column changing direction and the channel direction is complete, the address steps in the row changing direction of the image two-dimensional plane and again continues along the channel direction, until the processing of the whole block data is complete.
5. The SOC-based data multiplexing convolutional neural network accelerator according to claim 1, wherein the image data comprises X × Y × N blocks, where X is the number of blocks in the row direction, Y the number of blocks in the column direction, and N the number of blocks in the channel direction; one piece of three-dimensional image data is read from outside at a time and stored into the M sub-buffers, each sub-buffer storing one element of the three-dimensional data block, and the data blocks are extracted in the following order:
(1) initialize the data block row index i = 1, column index j = 1 and channel index k = 1;
(2) read the data block with row index i, column index j and channel index k;
(3) increment k by 1; repeat steps (2) to (3) until k ≥ N, then go to step (4);
(4) increment j by 1 and set k = 1; repeat steps (2) to (4) until j ≥ Y and k ≥ N, then go to step (5);
(5) increment i by 1 and set j = 1, k = 1; repeat steps (2) to (5) until i ≥ X, j ≥ Y and k ≥ N, then end.
6. The SOC-based data multiplexing convolutional neural network accelerator according to claim 1, wherein during convolution the shift logic determines the two in-plane dimensions of the convolution from the convolutional layer sequence number, reorders the image data sequence according to those dimensions, and determines the shift amount of the sequence from the shift control signal sent by the control unit, so that the image data of each block operation entering the computing unit array is aligned with the weight data; and during pooling or fully-connected processing, the shift logic outputs the image data in the buffer directly to the computing unit array.
7. The SOC-based data multiplexing convolutional neural network accelerator of claim 1, wherein the compute unit array comprises a multiplier array, an adder tree, an accumulator, a nonlinear unit, and a gated output unit, wherein:
a multiplier array multiplying the image data by the weight;
the adder tree adds all product terms of the multiplier array, and the result is output to the accumulator;
an accumulator for resetting when the block operation is finished, accumulating the result output by the adder tree, and outputting the accumulated result to the nonlinear unit as the convolution result
The nonlinear unit is used for performing pooling processing on the convolution result and outputting the result;
and the gating output unit is used for receiving the output gating signal sent by the control unit and outputting a gating convolution result or a pooling result.
CN201711207259.3A 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC Active CN108171317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711207259.3A CN108171317B (en) 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711207259.3A CN108171317B (en) 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC

Publications (2)

Publication Number Publication Date
CN108171317A CN108171317A (en) 2018-06-15
CN108171317B 2020-08-04

Family

ID=62524477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711207259.3A Active CN108171317B (en) 2017-11-27 2017-11-27 Data multiplexing convolution neural network accelerator based on SOC

Country Status (1)

Country Link
CN (1) CN108171317B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985449B (en) * 2018-06-28 2021-03-09 中国科学院计算技术研究所 Control method and device for convolutional neural network processor
WO2020019174A1 (en) * 2018-07-24 2020-01-30 深圳市大疆创新科技有限公司 Data access method, processor, computer system and movable device
CN108681984B (en) * 2018-07-26 2023-08-15 珠海一微半导体股份有限公司 Acceleration circuit of 3*3 convolution algorithm
CN109141403B (en) * 2018-08-01 2021-02-02 上海航天控制技术研究所 Image processing system and method for small window access of star sensor
CN109146072B (en) * 2018-08-01 2021-03-23 上海天数智芯半导体有限公司 Data reuse method based on convolutional neural network accelerator
CN109086875A (en) * 2018-08-16 2018-12-25 郑州云海信息技术有限公司 A kind of convolutional network accelerating method and device based on macroinstruction set
CN109284824B (en) * 2018-09-04 2021-07-23 复旦大学 Reconfigurable technology-based device for accelerating convolution and pooling operation
CN109460813B (en) * 2018-09-10 2022-02-15 中国科学院深圳先进技术研究院 Acceleration method, device and equipment for convolutional neural network calculation and storage medium
CN112970037B (en) * 2018-11-06 2024-02-02 创惟科技股份有限公司 Multi-chip system for implementing neural network applications, data processing method suitable for multi-chip system, and non-transitory computer readable medium
CN109581185B (en) * 2018-11-16 2021-11-09 北京时代民芯科技有限公司 SoC chip laser simulation single particle irradiation detection and fault positioning method and system
CN109359735B (en) * 2018-11-23 2020-12-04 浙江大学 Data input device and method for accelerating deep neural network hardware
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN109740732B (en) * 2018-12-27 2021-05-11 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data multiplexing method and related equipment
CN111382094B (en) * 2018-12-29 2021-11-30 深圳云天励飞技术有限公司 Data processing method and device
CN109886400B (en) * 2019-02-19 2020-11-27 合肥工业大学 Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN109886395B (en) * 2019-03-06 2020-11-24 上海熠知电子科技有限公司 Data reading method for multi-core image processing convolutional neural network
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal
CN111832585B (en) * 2019-04-16 2023-04-18 杭州海康威视数字技术股份有限公司 Image processing method and device
CN110222819B (en) * 2019-05-13 2021-04-20 西安交通大学 Multilayer data partition combined calculation method for convolutional neural network acceleration
CN111985628B (en) * 2019-05-24 2024-04-30 澜起科技股份有限公司 Computing device and neural network processor comprising same
CN110390383B (en) * 2019-06-25 2021-04-06 东南大学 Deep neural network hardware accelerator based on power exponent quantization
CN110598858A (en) * 2019-08-02 2019-12-20 北京航空航天大学 Chip and method for realizing binary neural network based on nonvolatile memory calculation
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110458285B (en) * 2019-08-14 2021-05-14 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110533177B (en) * 2019-08-22 2023-12-26 安谋科技(中国)有限公司 Data read-write device, method, equipment, medium and convolution accelerator
CN110956258B (en) * 2019-12-17 2023-05-16 深圳鲲云信息科技有限公司 Neural network acceleration circuit and method
CN111340224B (en) * 2020-02-27 2023-11-21 浙江芯劢微电子股份有限公司 Accelerated design method of CNN (computer network) suitable for low-resource embedded chip
WO2021179289A1 (en) * 2020-03-13 2021-09-16 深圳市大疆创新科技有限公司 Operational method and apparatus of convolutional neural network, device, and storage medium
CN111753962B (en) * 2020-06-24 2023-07-11 国汽(北京)智能网联汽车研究院有限公司 Adder, multiplier, convolution layer structure, processor and accelerator
CN111651378B (en) * 2020-07-06 2023-09-19 Oppo广东移动通信有限公司 Data storage method, soC chip and computer equipment
CN111915001B (en) * 2020-08-18 2024-04-12 腾讯科技(深圳)有限公司 Convolution calculation engine, artificial intelligent chip and data processing method
CN112070217B (en) * 2020-10-15 2023-06-06 天津大学 Internal storage bandwidth optimization method of convolutional neural network accelerator
CN112950656A (en) * 2021-03-09 2021-06-11 北京工业大学 Block convolution method for pre-reading data according to channel based on FPGA platform
CN113128688B (en) * 2021-04-14 2022-10-21 北京航空航天大学 General AI parallel reasoning acceleration structure and reasoning equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator

Also Published As

Publication number Publication date
CN108171317A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
WO2020258528A1 (en) Configurable universal convolutional neural network accelerator
CN109598338B (en) Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN108985450B (en) Vector processor-oriented convolution neural network operation vectorization method
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
TWI634489B (en) Multi-layer artificial neural network
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN110222818A (en) A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110580519B (en) Convolution operation device and method thereof
CN113222130A (en) Reconfigurable convolution neural network accelerator based on FPGA
CN112487750A (en) Convolution acceleration computing system and method based on memory computing
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
US20230047364A1 (en) Partial sum management and reconfigurable systolic flow architectures for in-memory computation
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN108920097A (en) A kind of three-dimensional data processing method based on Laden Balance
CN113610221A (en) Variable expansion convolution operation hardware system based on FPGA
CN113627587A (en) Multichannel convolutional neural network acceleration method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant