CN109948775B - Configurable neural convolution network chip system and configuration method thereof - Google Patents

Configurable neural convolution network chip system and configuration method thereof

Info

Publication number
CN109948775B
CN109948775B (application CN201910128679.5A)
Authority
CN
China
Prior art keywords
weight coefficient
convolution
local pixel
chip system
configurable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910128679.5A
Other languages
Chinese (zh)
Other versions
CN109948775A (en
Inventor
孙建辉
蔡阳健
虞刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN201910128679.5A priority Critical patent/CN109948775B/en
Publication of CN109948775A publication Critical patent/CN109948775A/en
Application granted granted Critical
Publication of CN109948775B publication Critical patent/CN109948775B/en


Abstract

The present disclosure provides a configurable neural convolutional network chip system and a configuration method thereof. The chip system comprises at least one neural network configuration unit, and each configuration unit comprises: a sparse unit, which sparsely configures each local pixel and its corresponding weight coefficient so as to adapt to changes in convolution kernel size; a filter multiply-accumulate array, which convolves each sparsely configured local pixel with a preset convolution kernel, multiplies each convolution result by the weight coefficient of the corresponding local pixel, and accumulates the products; an accumulation unit, which adds a preset bias coefficient to the convolution accumulation result output by the filter multiply-accumulate array so as to adjust how easily the hidden neurons activate; and a max pooling unit, which max-pools the hidden-layer neurons to reduce the number of neurons entering subsequent convolutions.

Description

Configurable neural convolution network chip system and configuration method thereof
Technical Field
The disclosure belongs to the field of chip design, and particularly relates to a configurable neural convolution network chip system and a configuration method thereof.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The neural convolutional network is a feedforward neural network whose artificial neurons respond only to units within a local receptive field, which gives it excellent performance on large-scale image processing. A convolutional network includes convolutional layers and pooling layers.
The inventor finds that current neural network chips and circuit structures have the following problems: the architecture of the neural convolutional network is fixed and cannot adapt to different convolution kernel sizes, which wastes hardware resources; and the neural convolution process as a whole involves a large amount of computation, leading to high power consumption.
Disclosure of Invention
According to one aspect of one or more embodiments of the present disclosure, a configurable neural convolutional network chip system is provided, which achieves low power consumption and configurable resource reuse, and is applied to the identification of image features and image edge contour features.
The configurable neural convolutional network chip system comprises at least one neural network configuration unit, wherein each neural network configuration unit comprises:
the sparse unit, which is used for sparsely configuring each local pixel and its corresponding weight coefficient so as to adapt to changes in convolution kernel size;
the filter multiply-accumulate array, which is used for convolving each sparsely configured local pixel with a preset convolution kernel, multiplying each convolution result by the weight coefficient of the corresponding local pixel, and accumulating;
the accumulation unit, which is used for adding a preset bias coefficient to the convolution accumulation result output by the filter multiply-accumulate array so as to adjust how easily the hidden neurons activate;
the max pooling unit, which is used for max-pooling the hidden-layer neurons to reduce the number of neurons for subsequent convolution.
According to another aspect of one or more embodiments of the present disclosure, there is provided a configuration method for a configurable neural convolutional network chip system, which achieves low power consumption and configurable resource reuse.
The configuration method of the configurable neural convolution network chip system comprises the following steps:
each local pixel and its corresponding weight coefficient are sparsely configured to adapt to changes in convolution kernel size;
each sparsely configured local pixel is convolved with a preset convolution kernel, each convolution result is multiplied by the weight coefficient of the corresponding local pixel, and the products are accumulated;
a preset bias coefficient is added to the convolution accumulation result to adjust how easily the hidden neurons activate;
hidden-layer neurons are max-pooled to reduce the number of neurons for subsequent convolutions.
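The four steps above can be sketched end to end as follows. This is a behavioral sketch only, assuming a single channel, a scalar bias, and non-overlapping pooling windows; the function and parameter names are illustrative and do not come from the patent:

```python
import numpy as np

def configure_and_run(pixels, weights, bias, sparse_mask, pool=2):
    """Behavioral sketch of one configuration unit's dataflow."""
    # Step 1: sparse configuration -- zero out unused connections so each
    # disabled multiply contributes nothing to the accumulation.
    w = weights * sparse_mask

    # Step 2: multiply-accumulate (valid cross-correlation over the image).
    kh, kw = w.shape
    H, W = pixels.shape
    conv = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(conv.shape[0]):
        for j in range(conv.shape[1]):
            conv[i, j] = np.sum(pixels[i:i + kh, j:j + kw] * w)

    # Step 3: add the bias, shifting how easily each hidden neuron activates.
    hidden = conv + bias

    # Step 4: max pooling, reducing the neuron count for the next convolution.
    Hp, Wp = hidden.shape[0] // pool, hidden.shape[1] // pool
    pooled = hidden[:Hp * pool, :Wp * pool].reshape(Hp, pool, Wp, pool).max(axis=(1, 3))
    return pooled
```

For a 4x4 all-ones image convolved with a 3x3 all-ones kernel and zero bias, the 2x2 hidden layer of 9s pools down to a single neuron.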
The beneficial effects of this disclosure are:
(1) In the configurable neural convolutional network chip system of the present disclosure, the connection lines for pixel data, weight coefficients, the activation layer, and hidden-neuron bias coefficients can be pre-warped and rerouted, so that the network structure can be readjusted for different image feature recognition tasks; the architecture of the neural convolutional network can be reconfigured while hardware resource utilization is maximized, making it suitable for different convolution kernel sizes and different edge feature extractions.
(2) The filter multiply-accumulate array employs a low-power management mechanism based on Power Gating (PG) and Clock Gating (CG). Power gating shuts down filter multiply-accumulate arrays that are not working, reducing both dynamic and static power consumption. Clock gating inhibits clock toggling in a filter multiply-accumulate array, reducing dynamic power; the array then enters a hold stage and maintains the convolution data from the previous clock. The connectivity of a variable number of local input-layer neurons to a single hidden-layer neuron can be routed to suit different feature recognition tasks or different convolution kernel operations.
(3) The weight coefficients stored in the weight coefficient memory are broadcast synchronously to the filter multiply-accumulate array in multicast mode, realizing data sharing and fast synchronous loading between the weight coefficient memory and the filter multiply-accumulate array.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
Fig. 1 is a schematic structural diagram of a configurable neural convolutional network chip system according to an embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Interpretation of terms:
The English labels in fig. 1 are interpreted as follows:
MACs: a multiply-add accumulator array.
Filter_MACs_1: the 1st filter multiply-accumulate array, which completes the computation of one hidden neuron in the hidden layer;
Filter_MACs_k: the k-th filter multiply-accumulate array, which completes the computation of the k-th hidden neuron in the hidden layer;
VDD: the working voltage;
Ena_On_Off: the enable switch for the coefficient multicast of a filter multiply-accumulate array; when it is set to 1, i.e., at high level, multicast data can be input to the module's interface through the sub-bus;
PG: power-supply enable; when the PG unit is off, the corresponding module is not powered and enters a shut-down state;
CG: clock enable; when the CG unit is enabled, the clock reaches the module that needs to perform convolution; when the CG unit is disabled, clock toggling is inhibited, and the corresponding convolution module does not update its data but only holds the old data;
Clock: the single synchronous clock signal on which the filter multiply-accumulate arrays operate in this synchronous system.
Example 1
As shown in fig. 1, a configurable neural convolutional network chip system of this embodiment includes at least one neural network configuration unit, where each neural network configuration unit includes: sparse unit, filter multiply accumulate array, accumulate unit and max pooling unit.
In a specific implementation, the sparse unit sparsely configures each local pixel and its corresponding weight coefficient so as to adapt to changes in convolution kernel size.
Specifically, the sparse unit sparsely configures each local pixel and its corresponding weight coefficient as follows:
(1) Pixel-data sparsification: unused pixel points are set to 0, so that in the subsequent multiplication of pixel data with the corresponding coefficient, the multiplicand is 0 and the product is directly 0;
(2) Weight-coefficient sparsification: if the pixel data have been sparsified, the weight coefficients remain unchanged; if the pixel data have not been sparsified, the weight coefficients are sparsified instead, i.e., the corresponding coefficient multiplier is set to 0, so that the product of pixel data and coefficient is likewise 0. If the weight coefficient kernel shrinks, as shown in fig. 1, the connection switch (weights_spark_configure) from the weight coefficient to the multiplication unit is disabled, eliminating that multiplication of multiplicand and multiplier.
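The either/or sparsification rule in (1) and (2) can be sketched as follows. The names are illustrative; the patent specifies only that one operand of each disabled multiply is forced to 0 so the product vanishes:

```python
import numpy as np

def sparsify(pixels, weights, keep_mask, thin_pixels=True):
    """Apply the either/or thinning rule: zeroing either operand of a
    multiply forces that product -- and its contribution to the
    accumulation -- to 0.

    keep_mask: 1 for connections inside the (possibly shrunken) kernel,
    0 for connections to disable.
    """
    if thin_pixels:
        # Rule (1): unused pixel points become 0; weights stay unchanged.
        return pixels * keep_mask, weights
    else:
        # Rule (2): pixels untouched; the corresponding weights become 0.
        return pixels, weights * keep_mask
```

Either path yields the same multiply-accumulate result, which is why only one of the two operands needs to be thinned.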
In a specific implementation, the filter multiply-accumulate array convolves each sparsely configured local pixel with a preset convolution kernel, then multiplies the corresponding convolution results by the weight coefficient of each local pixel and accumulates them.
In one embodiment, the accumulation unit is configured to adjust how easily the hidden neurons activate.
In a specific implementation, the max pooling unit max-pools the hidden-layer neurons to reduce the number of neurons for subsequent convolution.
The max pooling unit down-samples the convolved hidden-layer neurons, reducing their number and hence the number of subsequent convolution operations, thereby achieving dimensionality reduction.
The pooled neurons reuse the configurable neural network of the present disclosure for subsequent convolution operations.
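The down-sampling performed by the max pooling unit can be sketched as follows (non-overlapping k-by-k windows are assumed; the names are illustrative):

```python
import numpy as np

def max_pool(hidden, k=2):
    """k-by-k down-sampling: keep the strongest response in each window,
    shrinking the hidden layer by a factor of k*k before the next convolution."""
    H, W = hidden.shape
    Hp, Wp = H // k, W // k
    return hidden[:Hp * k, :Wp * k].reshape(Hp, k, Wp, k).max(axis=(1, 3))
```

A 4x4 hidden layer of 16 neurons is reduced to 2x2 = 4 neurons, a 4x reduction in subsequent convolution work.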
Example 2
The configurable neural convolutional network chip system of this embodiment, on the basis of embodiment 1, further includes:
and the pixel data memory is used for storing all local pixel data in the image.
Specifically, the image pixel data are pre-compressed: since the goal is edge and contour feature extraction, only single-channel grayscale data are retained and the remaining chrominance channel data are removed, greatly reducing the subsequent computation and the pixel storage overhead;
the image is divided into different local areas, and different edge features are extracted based on the input pixels of each local area, i.e., the local pixel data are convolved with the convolution kernel in preparation for extracting different edge features.
Example 3
The configurable neural convolutional network chip system of this embodiment, on the basis of embodiment 1, further includes:
and the weight coefficient initialization unit is used for initializing, performing integer and aligning the weight coefficient corresponding to each local pixel into data with preset digits.
For example: and initializing the weight coefficient corresponding to each local pixel, carrying out integer transformation and aligning to 16 bits.
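The integer conversion and 16-bit alignment might look as follows. The patent does not specify the exact quantization rule, so a symmetric scaling scheme is assumed here purely for illustration:

```python
import numpy as np

def quantize_weights(w, bits=16):
    """Scale floating-point weights to signed integers of the preset width
    (16 bits here, matching the example in the text).  A symmetric scheme
    is an assumption; the patent only says 'integer' and '16-bit aligned'."""
    qmax = 2 ** (bits - 1) - 1          # 32767 for 16 bits
    scale = qmax / np.max(np.abs(w))    # map the largest magnitude to qmax
    return np.round(w * scale).astype(np.int16), scale
```

The returned scale would let downstream logic recover the real-valued accumulation result after the integer multiply-accumulate.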
Example 4
The configurable neural convolutional network chip system of this embodiment, on the basis of embodiment 1, further includes:
the clock gating unit is connected with the filter multiply-accumulate array; the clock gating unit is used for realizing whether to update the calculation data according to the output clock signal.
The filter multiply accumulate array is composed of several filter MAC modules. If the data calculated by a certain filter MAC module in the filter multiply-accumulate array only needs to be kept and does not need to be updated, the clock input by the filter MAC module is forbidden to be a monotone level through a gating clock technology, so that the new calculation data updating is avoided, only old data after convolution operation is reserved, the dynamic energy consumption of the filter MAC module is reduced, when the gating unit is forbidden, the clock passes through again, the calculation data is updated, and meanwhile, the clock passes through after a period of time, so that the influence of charge leakage caused by electric leakage can be reduced.
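The hold-versus-update behavior of a clock-gated filter MAC module can be modelled as follows (a behavioral sketch; the class and attribute names are illustrative, not from the patent):

```python
class ClockGatedMac:
    """Behavioral model of one filter MAC behind a clock gate: when the
    gate is disabled the module sees no clock edge, so it holds its old
    result instead of latching a new one."""
    def __init__(self):
        self.result = 0          # held convolution result
        self.cg_enabled = True   # CG unit state

    def clock_edge(self, new_value):
        if self.cg_enabled:      # clock reaches the module: latch new data
            self.result = new_value
        # else: clock toggling inhibited -> old data is maintained
        return self.result
```

With the gate closed, repeated clock edges leave the stored result untouched, which is exactly the hold stage described above.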
Example 5
The configurable neural convolutional network chip system of this embodiment, on the basis of embodiment 3, further includes:
and the weight coefficient memory is used for storing the weight coefficient processed by the weight coefficient initialization unit.
In a specific implementation, the weight coefficient memory broadcasts its stored weight coefficients synchronously to the filter multiply-accumulate array in multicast mode to realize weight coefficient sharing.
As shown in fig. 1, multicast (multi-broadcast) means that data from the fixed shared-coefficient memory (the 16-bit fixed shared weights memory) travel over the multicast data bus to the coefficient input port of each convolution filter module. If the coefficient input switch (Ena_on_off) of a filter MAC module (Filter_MAC) is turned on, the shared coefficients flow into that module; otherwise the switch (Ena_on_off) stays off and the coefficients do not enter.
(By analogy, multicast is one of the three basic destination address types of IPv6 packets: one-point-to-multipoint communication.)
In the chip system of this embodiment, the convolution processing hardware architecture of multiple local pixel data matrices and a shared coefficient data matrix, together with the coefficient connection routing applied before the pixel data are convolved with the weight coefficients, can be configured to adapt to different convolution kernel sizes and different feature extractions. The integer-aligned coefficient memory is accessed and, to exploit coefficient sharing, the shared-weight multicast network plays the data of the coefficient memory bank over the bus to the coefficient-bus input interface of each filter multiply-accumulate array. The parallel filter multiply-accumulate arrays can process multiple local pixel input arrays synchronously, producing different feature maps for subsequent pooling.
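The multicast loading of shared coefficients can be sketched as follows (a behavioral model; the dict fields are illustrative stand-ins for each module's Ena_on_off switch and coefficient input port):

```python
def multicast_weights(shared_weights, macs):
    """One read of the shared coefficient memory is broadcast over the bus;
    each Filter_MAC latches the coefficients only if its enable switch is
    high.  `macs` is a list of dicts with 'ena' and 'weights' fields."""
    for mac in macs:
        if mac['ena']:                   # switch on: coefficients flow in
            mac['weights'] = shared_weights
    return macs
```

A single memory access thus loads every enabled array simultaneously, which is the fast synchronous loading claimed in the beneficial effects.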
Example 6
The configurable neural convolutional network chip system of this embodiment, on the basis of embodiment 1, further includes:
a power supply gating unit connected to the filter multiply-accumulate array; and the power supply gating unit is used for controlling the start-stop working state of the filter multiply-accumulate array.
If a certain filter MAC module in the filter multiply-accumulate array does not need to be calculated, the power supply of the filter MAC module is cut off by using the power supply gating unit, and the static energy consumption and the dynamic energy consumption of the whole chip are reduced.
The filter multiply-accumulate array of this embodiment can be power-gated by a PG (power gating) unit, if the filter multiply-accumulate array does not need to perform calculation, PG is prohibited, i.e. the working power supply is turned off to eliminate static power consumption and dynamic power consumption, and when the filter multiply-accumulate array needs to be changed, the PG gate needs to be enabled first to supply power; the filter multiply-accumulate array can perform clock gating through a CG (clock gating) unit, temporarily stops the data updating of the filter multiply-accumulate array, enters a maintaining stage, stores the calculated data of the previous clock, and reduces the dynamic power consumption. The coefficient data after the 16-bit sharing coefficient memory is subjected to integer and 16-bit format alignment is stored in the 16-bit sharing coefficient memory.
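The power-gating behavior described above can be modelled as follows (a behavioral sketch; unpowered logic is modelled as losing its state, and all names are illustrative):

```python
class PowerGatedMac:
    """Behavioral model of a PG-controlled filter MAC: with the power gate
    off, the module is unpowered (state lost, neither static nor dynamic
    power drawn); it must be re-enabled before it can compute again."""
    def __init__(self):
        self.powered = True
        self.result = None

    def set_power(self, on):
        self.powered = on
        if not on:
            self.result = None    # unpowered logic retains no state

    def mac(self, pixels, weights):
        if not self.powered:
            raise RuntimeError("PG gate must be enabled before computing")
        self.result = sum(p * w for p, w in zip(pixels, weights))
        return self.result
```

This contrasts with clock gating, which merely holds old data: power gating discards the state entirely in exchange for eliminating static leakage as well.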
Example 7
The configurable neural convolutional network chip system of this embodiment, on the basis of embodiment 1, further includes:
and the offset coefficient memory is used for storing a preset offset coefficient.
Wherein the bias coefficients are used to adjust the neuron output.
In the chip system of the present embodiment, the pixel/coefficient connections have a routing function, power consumption is low, resource utilization is maximized, and the neural convolutional network is configurable.
Example 8
The configuration method of the configurable neural convolutional network chip system of the embodiment comprises the following steps:
Step 1: each local pixel and its corresponding weight coefficient are sparsely configured to adapt to changes in convolution kernel size;
Step 2: each sparsely configured local pixel is convolved with a preset convolution kernel, each convolution result is multiplied by the weight coefficient of the corresponding local pixel, and the products are accumulated;
Step 3: a preset bias coefficient is added to the convolution accumulation result to adjust how easily the hidden neurons activate;
Step 4: hidden-layer neurons are max-pooled to reduce the number of neurons for subsequent convolutions.
In a specific implementation, before each local pixel and its corresponding weight coefficient are sparsely configured, the method includes:
pre-compressing the image pixel data, retaining only single-channel grayscale data and removing the remaining chrominance channel data;
dividing the image into different local areas to obtain all the local pixel data in the image;
and initializing the weight coefficient corresponding to each local pixel, converting it to integer form, and aligning it to data of a preset bit width.
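The grayscale pre-compression step can be sketched as follows (the ITU-R BT.601 luma weights are an assumption; the patent says only that a single gray channel is retained and the chroma data removed):

```python
import numpy as np

def precompress(rgb):
    """Keep only a single luminance channel for edge/contour extraction,
    discarding the chroma data.  BT.601 luma weights are assumed here."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
```

Dropping two of the three channels cuts the pixel storage and the subsequent convolution workload to one third before any sparsification is applied.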
In another embodiment, the power gating unit controls the start/stop working state of the filter multiply-accumulate array.
If a certain filter MAC module in the filter multiply-accumulate array does not need to compute, the power gating unit cuts off that module's power supply, reducing both the static and dynamic energy consumption of the whole chip.
In another embodiment, whether the computation data are updated is determined by the clock signal output by the clock gating unit.
The filter multiply-accumulate array is composed of several filter MAC modules. If the data computed by a certain filter MAC module only needs to be held rather than updated, clock gating forces that module's clock input to a constant level. This prevents new computation data from being latched, preserves only the old post-convolution data, and reduces the module's dynamic energy consumption. When the gate is re-enabled, the clock passes through again and the computation data are updated; letting the clock through again after a period of time also mitigates the effect of charge loss caused by leakage.
In another embodiment, the weight coefficients stored in the weight coefficient memory are synchronously broadcast to the filter multiply accumulate array via a multicast format to enable weight coefficient sharing between the weight coefficient memory and the filter multiply accumulate array.
Wherein, multicast is one of 3 basic destination address types of IPv6 data packets, and multicast is one-point-to-multipoint communication.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (11)

1. A configurable neural convolutional network chip system, comprising at least one neural network configuration unit, each neural network configuration unit comprising:
the sparse unit is used for sparsely configuring each local pixel and its corresponding weight coefficient so as to adapt to changes in convolution kernel size; the image is divided into different local areas, and different edge features are extracted based on the input pixels of each local area, i.e., the local pixel data are convolved with convolution kernels in preparation for extracting different edge features; the filter multiply-accumulate array is used for convolving each sparsely configured local pixel with a preset convolution kernel, multiplying each convolution result by the weight coefficient of the corresponding local pixel, and accumulating; the weight coefficient memory synchronously broadcasts its stored weight coefficients to the filter multiply-accumulate array in multicast mode so as to realize weight coefficient sharing;
the convolution processing hardware architecture of multiple local pixel data matrices and a shared coefficient data matrix, and the coefficient connection routing applied before the pixel data are convolved with the weight coefficients, can be configured to adapt to different convolution kernel sizes and different feature extractions; the integer-aligned weight coefficient memory is accessed and, to exploit coefficient sharing, the shared-weight multicast network plays the data of the coefficient memory bank over the bus to the coefficient-bus input interface of the filter multiply-accumulate array; the parallel filter multiply-accumulate arrays can synchronously process multiple local pixel input arrays to obtain different feature maps for subsequent pooling;
the accumulation unit is used for adding a preset bias coefficient to the convolution accumulation result output by the filter multiply-accumulate array so as to adjust how easily the hidden neurons activate;
a maximum pooling unit for performing maximum pooling processing on hidden layer neurons to reduce the number of neurons for subsequent convolution;
the clock gating unit is connected with the filter multiply-accumulate array; the clock gating unit is used for realizing whether to update the calculation data according to the output clock signal.
2. The configurable neuro-convolutional network chip system of claim 1, further comprising:
and the pixel data memory is used for storing all local pixel data in the image.
3. The configurable neuro-convolutional network chip system of claim 1, further comprising:
and the weight coefficient initialization unit is used for initializing, performing integer and aligning the weight coefficient corresponding to each local pixel into data with preset digits.
4. The configurable neuro-convolutional network chip system of claim 3, further comprising:
and the weight coefficient memory is used for storing the weight coefficient processed by the weight coefficient initialization unit.
5. The configurable neuro-convolutional network chip system of claim 1, further comprising:
a power supply gating unit connected to the filter multiply-accumulate array; and the power supply gating unit is used for controlling the start-stop working state of the filter multiply-accumulate array.
6. The configurable neuro-convolutional network chip system of claim 1, further comprising:
and the offset coefficient memory is used for storing a preset offset coefficient.
7. A method of configuring a configurable neural convolutional network chip system as claimed in any one of claims 1-6, comprising:
each local pixel and its corresponding weight coefficient are sparsely configured to adapt to changes in convolution kernel size;
each sparsely configured local pixel is convolved with a preset convolution kernel, each convolution result is multiplied by the weight coefficient of the corresponding local pixel, and the products are accumulated;
a preset bias coefficient is added to the convolution accumulation result to adjust how easily the hidden neurons activate;
hidden-layer neurons are max-pooled to reduce the number of neurons for subsequent convolutions.
8. The method of configuring a configurable neural convolutional network chip system of claim 7, wherein before sparsely configuring each local pixel and its corresponding weight coefficient, respectively, comprises:
pre-compressing image pixel data, only retaining single-channel gray data, and removing the rest chroma channel data;
dividing the image into different local areas to obtain all local pixel data in the image; and initializing the weight coefficient corresponding to each local pixel, and carrying out integer and alignment to data with preset digits.
9. The method of configuring a configurable neural convolutional network chip system of claim 7, wherein before sparsely configuring each local pixel and its corresponding weight coefficient, further comprising: and controlling the start-stop working state of the filter multiply-accumulate array by using the power supply gating unit.
10. The method of configuring a configurable neural convolutional network chip system of claim 7, wherein before sparsely configuring each local pixel and its corresponding weight coefficient, further comprising: whether the calculation data is updated or not is realized by using the clock signal output by the clock gating unit.
11. The method of configuring a configurable neural convolutional network chip system of claim 7, wherein before sparsely configuring each local pixel and its corresponding weight coefficient, further comprising: and synchronously broadcasting the weight coefficients stored in the weight coefficient memory to the filter multiply-accumulate array through a multicast mode so as to realize the sharing of the weight coefficients between the weight coefficient memory and the filter multiply-accumulate array.
CN201910128679.5A 2019-02-21 2019-02-21 Configurable neural convolution network chip system and configuration method thereof Expired - Fee Related CN109948775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910128679.5A CN109948775B (en) 2019-02-21 2019-02-21 Configurable neural convolution network chip system and configuration method thereof


Publications (2)

Publication Number Publication Date
CN109948775A CN109948775A (en) 2019-06-28
CN109948775B true CN109948775B (en) 2021-10-19

Family

ID=67006908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910128679.5A Expired - Fee Related CN109948775B (en) 2019-02-21 2019-02-21 Configurable neural convolution network chip system and configuration method thereof

Country Status (1)

Country Link
CN (1) CN109948775B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396165A (en) * 2020-11-30 2021-02-23 珠海零边界集成电路有限公司 Arithmetic device and method for convolutional neural network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1010437B (en) * 1988-06-02 1990-11-14 清华大学 Real-time image neighbourhood processor
US9904874B2 (en) * 2015-11-05 2018-02-27 Microsoft Technology Licensing, Llc Hardware-efficient deep convolutional neural networks
CN107871136A (en) * 2017-03-22 2018-04-03 中山大学 The image-recognizing method of convolutional neural networks based on openness random pool
CN107832841B (en) * 2017-11-14 2020-05-05 福州瑞芯微电子股份有限公司 Power consumption optimization method and circuit of neural network chip
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array
CN108805266B (en) * 2018-05-21 2021-10-26 南京大学 Reconfigurable CNN high-concurrency convolution accelerator
CN108875917A (en) * 2018-06-28 2018-11-23 中国科学院计算技术研究所 A kind of control method and device for convolutional neural networks processor

Also Published As

Publication number Publication date
CN109948775A (en) 2019-06-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211019