CN109409512B - Flexibly configurable neural network computing unit, computing array and construction method thereof


Info

Publication number
CN109409512B
CN109409512B (application CN201811133940.2A; published as CN109409512A)
Authority
CN
China
Prior art keywords
data
buffer
state
convolution
calculation
Prior art date
Legal status
Active
Application number
CN201811133940.2A
Other languages
Chinese (zh)
Other versions
CN109409512A (en)
Inventor
Ren Pengju (任鹏举)
Fan Long (樊珑)
Zhao Boran (赵博然)
Zong Pengchen (宗鹏陈)
Chen Fei (陈飞)
Zheng Nanning (郑南宁)
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811133940.2A
Publication of CN109409512A
Application granted
Publication of CN109409512B
Legal status: Active (granted)

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means (under G06N 3/06 Physical realisation)
    • G06N 3/045 Combinations of networks (under G06N 3/04 Architecture, e.g. interconnection topology)

Abstract

The invention discloses a flexibly configurable neural network computing unit, a computing array and a construction method thereof. The neural network computing unit comprises a configurable storage module, a configurable control module and a time-division-multiplexable multiply-add calculation module. The configurable storage module comprises a feature map data buffer, a step data buffer and a weight data buffer; the configurable control module comprises a counter module and a state machine module; the multiply-add calculation module comprises a multiplier and an accumulator. The invention can support convolution calculations of any type and parallel computation of multi-size convolution kernels, fully exploits the flexibility and data reusability of the convolutional neural network computing unit, greatly reduces the system power consumption caused by data migration, and improves the computing efficiency of the system.

Description

Flexibly configurable neural network computing unit, computing array and construction method thereof
Technical Field
The invention belongs to the field of neural network hardware architecture, and particularly relates to a flexibly configurable neural network computing unit, a computing array and a construction method thereof.
Background
A flexible hardware computing architecture has a significant impact on the hardware implementation of convolutional neural networks. The convolutional layer is the principal structure in a convolutional neural network and is characterized by a large amount of computation and strong data reusability. Through weight sharing, the convolutional layer reduces the complexity of the network model, greatly reduces the number of parameters, and avoids the complicated feature extraction and data reconstruction processes of traditional recognition algorithms.
In a convolutional neural network, the convolutional layer convolves the same group of input feature maps with a group of convolution kernels, one per output channel, producing as many output feature maps as there are output channels and thereby completing feature extraction. As convolutional neural networks continue to develop and the demand for them grows, network models multiply, networks deepen, and the convolution modes of convolutional layers become complicated and variable.
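As context for the description above, a minimal NumPy reference model of this convolutional-layer computation is sketched below; the array shapes, variable names and the absence of padding are our illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def conv_layer(ifmaps, weights, stride):
    """Reference model of a convolutional layer: A input channels convolved
    with one K x K kernel per (output, input) channel pair, giving B output
    feature maps. No padding is assumed (an illustrative choice)."""
    B, A, K, _ = weights.shape          # weights: (B, A, K, K)
    _, H_in, W_in = ifmaps.shape        # ifmaps:  (A, H_in, W_in)
    H_out = (H_in - K) // stride + 1
    W_out = (W_in - K) // stride + 1
    ofmaps = np.zeros((B, H_out, W_out))
    for b in range(B):                  # one kernel set per output channel
        for y in range(H_out):
            for x in range(W_out):
                window = ifmaps[:, y*stride:y*stride+K, x*stride:x*stride+K]
                ofmaps[b, y, x] = np.sum(window * weights[b])  # multiply-accumulate
    return ofmaps
```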
Therefore, a neural network computing unit architecture that is highly flexible, offers high computing performance and can be reused is of great significance for the hardware implementation of convolutional layers. At present, most convolutional-layer computing units in hardware implementations support only one type of convolution mode, cannot handle the different kinds of convolutional layers found within a network model, and cannot fully exploit the data reusability of convolutional layers.
Disclosure of Invention
The invention aims to provide a flexibly configurable neural network computing unit, a computing array and a construction method thereof, which effectively enhance the flexibility of convolutional layers in hardware implementations, improve the computing efficiency of the system, and fully exploit the data reusability of the convolutional layer, thereby reducing system power consumption and the use of storage resources.
To achieve this aim, the technical solution is as follows:
a flexibly configurable neural network computing unit comprises: a configurable storage module, a configurable control module and a time-division-multiplexable multiply-add calculation module;
the configurable storage module comprises: a feature map data buffer, a step data buffer and a weight data buffer;
the configurable control module comprises: a counter module and a state machine module;
the multiply-add calculation module comprises: a multiplier and an accumulator.
Further, the feature map data buffer is used for storing part of the feature map data used in convolution calculation and recycling feature map data that is shared; the maximum length of the buffer is L1, of size $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$, where $K$ is the size of a convolution kernel in the convolutional layer, $A$ is the number of input channels to be mapped in the computing unit, and $i$ is the index of the convolutional layer in the target network;
the step data buffer is used for providing the data that must be updated to the feature map buffer when the convolution kernel slides by one step; the maximum length of the buffer is L2, of size $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$, where $S$ is the stride of a convolution kernel in the convolutional layer;
the weight data buffer is used for storing weight data and can recycle it; the length of the buffer is L3, of size $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$, where $B$ is the number of output channels to be mapped in the computing unit.
Furthermore, the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter;
in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module.
Furthermore, the neural network computing unit is provided with a feature map data input port and a weight data input port;
the feature map data input port is connected to the input of the first selector; the two outputs of the first selector are connected to the input of the step data buffer and to the first input of the second selector respectively, the output of the step data buffer is connected to the second input of the second selector, and the output of the second selector is connected to the input of the feature map data buffer;
the weight data input port is connected to the input of the weight data buffer;
the output of the feature map data buffer and the output of the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected to the output of the neural network computing unit through the register, the accumulator and the fourth selector.
Furthermore, in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module;
the states of the feature map data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state;
the states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state and a no-cycle state.
Further, in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer. A behavioral sketch of these working modes follows.
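The sketch below models one output cycle of the feature map data buffer in each of these working modes; the deque-backed buffers and the state names are our illustrative choices, not the patent's signal names:

```python
from collections import deque
from enum import Enum, auto

class FmapState(Enum):
    INIT = auto(); READY = auto(); WAIT = auto()
    FULL_CYCLE = auto(); UPDATE = auto(); HALF_CYCLE = auto(); NO_CYCLE = auto()

def fmap_buffer_step(state, buf, step_buf, out):
    """One output cycle of the feature map buffer (behavioral sketch).
    buf / step_buf are deques whose head is the next datum out; out collects
    data sent to the multiply-add module."""
    if state in (FmapState.INIT, FmapState.READY, FmapState.WAIT):
        return                              # nothing issued yet / stalled
    d = buf.popleft()
    out.append(d)                           # datum enters the multiplier
    if state == FmapState.FULL_CYCLE:
        buf.append(d)                       # recycle to the tail of the buffer
    elif state == FmapState.UPDATE:
        buf.append(step_buf.popleft())      # fetch stride-update data instead
    elif state == FmapState.HALF_CYCLE:
        buf.insert(len(buf) - 1, d)         # re-enter just before the new datum
    # NO_CYCLE: the datum is consumed and not returned
```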
A computing array is generated by instantiating a plurality of the configurable computing units; the computing array is partitioned into regions, different regions can be given different convolutional layer parameters, and parallel computation of different types of convolution modes is completed.
A computing array is generated by connecting the flexibly configurable neural network computing units in a row-fixed data flow manner; the size of the computing array is determined by the hardware resources, the target network model and the computing performance requirements of the system. The width of the array is $K$, which must be greater than or equal to the maximum convolution kernel size $K_{max}$ in the network model, and also greater than or equal to the sum of the sizes of the convolution kernels that must be computed in parallel when kernels of different sizes occur in the same convolutional layer. The basic length of the array is $H$, equal to the minimum size of all convolutional-layer output feature maps in the network model; the actual length of the array is extended by powers of two ($2^n$) according to the available hardware resources and the computing performance requirements of the system. When convolutional layers with kernels of different sizes must be computed in parallel, let the kernel sizes be $K_1, K_2, \ldots, K_i$, with

$$K_1 + K_2 + \cdots + K_i \le K.$$

The computing array is divided transversely into $i$ regions of size $K_1 \times H, K_2 \times H, \ldots, K_i \times H$; different convolution-type parameters are input to different regions, and the computing units in each region configure their storage and calculation modules to complete the parallel computation of multi-size convolution kernels.
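A sketch of this region partitioning, checking the width constraint stated above; the way a region is represented is an assumption made for illustration:

```python
def partition_array(kernel_sizes, K, H):
    """Split a K x H computing array transversely into one K_j x H region per
    kernel size, requiring K_1 + ... + K_i <= K as in the text above."""
    assert sum(kernel_sizes) <= K, "array width cannot host all kernels in parallel"
    regions, col = [], 0
    for Kj in kernel_sizes:
        regions.append({'kernel': Kj, 'cols': (col, col + Kj), 'rows': H})
        col += Kj
    return regions  # columns col..K-1 remain for other work or stay idle

# e.g. a 16-wide array hosting 3x3, 5x5 and 7x7 kernels side by side
print(partition_array([3, 5, 7], K=16, H=13))
```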
A construction method of a flexibly configurable neural network computing unit comprises the following steps:
Step one: extract the network parameters from the model of the target network.
Step two: based on step one, design the configurable storage module in the neural network computing unit, which stores the partial feature map data and weight data used for computation and comprises: a feature map data buffer, a step data buffer and a weight data buffer.
Step three: based on step one, design the configurable control module in the neural network computing unit; under different convolution modes the control module configures different buffer sizes for the storage module, generates the various working modes of each buffer during convolution calculation and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state machine module.
Step four: based on step one, design the multiply-add calculation module in the neural network computing unit, which multiplies feature map data by weights and accumulates the products into partial sums of the convolution result, and comprises: a time-division-multiplexable multiplier and adder.
Step five: combining steps two, three and four, provide five convolutional layer parameters, namely the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped to the computing unit, to the configurable control module of the neural network computing unit through an external input port; the configurable control module configures the buffer space required by that layer's convolution calculation for the configurable storage module and controls the configurable storage module to output the corresponding data to the multiply-add calculation module. Different convolutional layers can thus complete their partial convolution calculations on the same neural network computing unit simply by providing the corresponding convolution parameters.
Further, step one specifically comprises: extracting the required parameters from the target network model, including the convolution kernel size $K_i$ and sliding stride $S_i$ of each convolutional layer, the output feature map size $H_i$ of each convolutional layer, and the numbers $A_i$ and $B_i$ of input and output channels to be mapped in the computing unit for each convolutional layer, where $i$ is the index of the convolutional layer;
in step two: the feature map data buffer stores the partial pixel data used in convolution calculation and recycles pixel data that is shared, with buffer length $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$; the step data buffer provides the data to be updated to the feature map buffer when the convolution kernel slides by one step, with buffer length $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$; the weight data buffer stores the weight data and can recycle it, with buffer length $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$;
in step three: the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter; in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module; the states comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state.
Further, the different states of the state machine determine the different working modes of the buffer, specifically:
in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer.
Further, step four specifically comprises: the multiply-add calculation module comprises a multiplier and an accumulator; by time-division multiplexing, the multiplier and accumulator operate at N times the working frequency, so that N neural network computing units can share one multiplier and one accumulator; the accumulation depth of the accumulator equals the convolution kernel size of the current convolutional layer. A sketch of this sharing scheme follows.
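A minimal model of the sharing scheme, assuming one multiply-accumulate circuit running N times faster than the base clock is granted to the N units in round-robin; operand values and the stream layout are placeholders:

```python
def shared_mac(operand_streams):
    """Time-division multiplexing sketch: one MAC serves N units round-robin,
    one fast clock tick per MAC. operand_streams[u] lists unit u's
    (pixel, weight) pairs; its length is the accumulation depth, which the
    text above ties to the convolution kernel size of the current layer."""
    n = len(operand_streams)
    depth = len(operand_streams[0])
    partial_sums = [0.0] * n
    for fast_cycle in range(n * depth):     # N fast cycles per base clock cycle
        u = fast_cycle % n                  # unit owning the MAC this tick
        x, w = operand_streams[u][fast_cycle // n]
        partial_sums[u] += x * w
    return partial_sums

# two units sharing one MAC, accumulation depth 3
print(shared_mac([[(1, 2), (3, 4), (5, 6)],
                  [(1, 1), (2, 2), (3, 3)]]))  # -> [44.0, 14.0]
```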
Furthermore, a computing array is generated by instantiating a plurality of the configurable computing units; the array is partitioned into regions, different regions can be given different convolutional layer parameters, and parallel computation of different types of convolution modes is completed.
Further, in step five, the external input port provides five input signals: the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped inside the computing unit. The control module configures the storage space of each buffer: the feature map data buffer length is configured as k*a, the step data buffer length as s*a, and the weight data buffer length as k*a*b. The control module also configures the upper limit of each counter: the input data counter and output data counter limits are k*a, the input weight counter limit is k*a*b, the output channel number counter limit is b, and the output feature map size counter limit is h. When the input counters all reach their upper limits, the computing unit starts calculating, and each output counter counts accordingly and controls the transitions of the state machine; when each output counter reaches its upper limit, one convolution calculation for part or all of the output channels of the convolutional layer is complete. The configuration sketch below illustrates these derived values.
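The derivation of every buffer length and counter limit from the five externally supplied parameters can be written down directly; the dict keys are our own labels, not signal names from the patent:

```python
def configure_unit(k, s, h, a, b):
    """Map the five convolutional layer parameters (kernel size k, stride s,
    output feature map size h, mapped input/output channels a, b) onto the
    buffer lengths and counter upper limits listed above."""
    return {
        'fmap_buffer_len':   k * a,      # feature map data buffer
        'step_buffer_len':   s * a,      # step data buffer
        'weight_buffer_len': k * a * b,  # weight data buffer
        'in_data_limit':     k * a,      # input data counter
        'in_weight_limit':   k * a * b,  # input weight counter
        'out_data_limit':    k * a,      # output data counter
        'out_channel_limit': b,          # output channel number counter
        'ofmap_size_limit':  h,          # output feature map size counter
    }

# e.g. a 3x3 stride-1 layer, 4 input and 2 output channels, 13x13 output map
print(configure_unit(k=3, s=1, h=13, a=4, b=2))
```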
Furthermore, according to the hardware resources and the computing performance requirements of the system, a plurality of the configurable computing units can be instantiated and interconnected to generate a convolution computing array capable of completing the convolution calculations of different types of convolutional layers. In some network models the same convolutional layer contains kernels of two or more sizes; the array can then be partitioned into regions, each supplied with different convolution parameters. To keep the output results of all regions of the array synchronized, the time difference between the outputs of computing units in different regions is derived from the difference between the kernel sizes; the computing units in regions with less work wait for those in regions with more work until the difference is zero, then start calculating, which guarantees the synchronization of the array's outputs and completes the parallel calculation of different convolution modes. A sketch of this wait-cycle computation follows.
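One way to realize the wait-until-zero rule is sketched below. The per-output cost model (K_j * a MAC cycles per output, i.e. one kernel row across a mapped input channels) is our assumption for illustration; the patent only states that the difference is derived from the kernel sizes:

```python
def wait_cycles(kernel_sizes, a=1):
    """Idle cycles per region before starting, so that all regions finish each
    output at the same time. Assumes a unit spends K_j * a MAC cycles per
    output (illustrative cost model, not specified by the patent)."""
    slowest = max(kernel_sizes) * a
    return {K: slowest - K * a for K in kernel_sizes}

print(wait_cycles([3, 5, 7], a=2))  # -> {3: 8, 5: 4, 7: 0}
```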
Compared with the prior art, the invention has the following beneficial effects. The invention discloses a flexibly configurable neural network computing unit, a computing array and a construction method thereof. A plurality of the configurable neural network computing units are instantiated and arranged to generate a complete convolution computing array; the array can be partitioned into regions, different convolution parameters can be input to different regions, and parallel computation of different convolution modes can be completed. The invention designs a hardware architecture for the convolutional layer of a convolutional neural network. While guaranteeing the computing performance of the system, the architecture supports the convolution modes of convolutional layers in different network models, greatly improving the flexibility of the system. The working modes of the buffers inside the computing unit make full use of the data reusability of the convolutional neural network, effectively reducing the system power consumption generated by data migration and easing the storage burden to a certain extent. A computing array composed of multiple computing units can support convolution kernels of different sizes computing in parallel, fully exploiting the algorithmic parallelism and data reusability of the convolutional layers of a convolutional neural network.
Drawings
FIG. 1 is a schematic diagram of an overall structure of a flexibly configurable neural network computing unit according to the present invention;
FIG. 2 is a schematic diagram of a control module of a flexibly configurable neural network computing unit according to the present invention;
FIG. 3 is a schematic diagram of a state machine for a feature map data buffer in a control module;
FIG. 4 is a schematic diagram illustrating the generation of a computational array by a plurality of computational units according to the present invention.
Detailed Description
The invention will be described in further detail below with reference to the accompanying drawings.
referring to fig. 1, a flexibly configurable neural network computing unit of the present invention includes: the system comprises a configurable storage module, a configurable control module and a multiply-add calculation module capable of time division multiplexing; the configurable memory module includes: a characteristic map data buffer, a step data buffer and a weight data buffer; the configurable control module includes: a counter module and a state machine module; the multiplication and addition calculation module comprises: a multiplier and an accumulator.
The feature map data buffer stores part of the feature map data used in convolution calculation and recycles feature map data that is shared; the maximum length of the buffer is L1, of size $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$, where $K$ is the size of a convolution kernel in the convolutional layer, $A$ is the number of input channels to be mapped in the computing unit, and $i$ is the index of the convolutional layer in the target network. The step data buffer provides the data to be updated to the feature map buffer when the convolution kernel slides by one step; the maximum length of the buffer is L2, of size $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$, where $S$ is the stride of a convolution kernel in the convolutional layer. The weight data buffer stores weight data and can recycle it; the length of the buffer is L3, of size $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$, where $B$ is the number of output channels to be mapped in the computing unit.
The neural network computing unit is provided with a feature map data input port and a weight data input port. The feature map data input port is connected to the input of the first selector 1; the two outputs of the first selector 1 are connected to the input of the step data buffer and to the first input of the second selector 2 respectively; the output of the step data buffer is connected to the second input of the second selector 2, and the output of the second selector 2 is connected to the input of the feature map data buffer. The weight data input port is connected to the input of the weight data buffer. The output of the feature map data buffer and the output of the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected to the output of the neural network computing unit through the register, the accumulator and the fourth selector 4. The output of the feature map data buffer also feeds back through the input of the third selector.
Referring to FIG. 2, the internal structure of the buffer control module mainly comprises a counter module and a state machine module. The counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter. In the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module.
Referring to FIG. 3, a schematic diagram of the state machine of the feature map data buffer in the control module, the states comprise: an initialization state S0, a data-ready state S1, a wait state S6, a full-cycle state S2, an update-data state S3, a half-cycle state S4 and a no-cycle state S5, where the different states determine the different working modes of the buffer. The control signals input externally to the computing array provide the buffer control module with five pieces of information: the convolution kernel size, the sliding stride of the convolution window, the output feature map size, and the numbers of input and output channels mapped to the array. From these, the upper limit of each counter is obtained; the counter values drive the state transitions of the state machine for the corresponding kernel size, completing the operation of each storage buffer under different convolution kernel sizes. The states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state and a no-cycle state.
The different states of the state machine determine the different working modes of the buffer, specifically:
in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer.
Referring to FIG. 4, a plurality of computing units are connected in a row-fixed data flow manner to generate a computing array, whose scale is determined by the hardware resources, the target network model and the computing performance requirements of the system. The width of the array is $K$, which must be greater than or equal to the maximum convolution kernel size $K_{max}$ in the network model, and also greater than or equal to the sum of the sizes of the convolution kernels that must be computed in parallel when kernels of different sizes occur in the same convolutional layer. The basic length of the array is $H$, equal to the minimum size of all convolutional-layer output feature maps in the network model; the actual length can be extended by powers of two ($2^n$) according to the available hardware resources and the computing performance requirements of the system. When convolutional layers with kernels of different sizes must be computed in parallel, let the kernel sizes be $K_1, K_2, \ldots, K_i$, with

$$K_1 + K_2 + \cdots + K_i \le K.$$

The computing array is divided transversely into $i$ regions of size $K_1 \times H, K_2 \times H, \ldots, K_i \times H$; different convolution-type parameters are input to different regions, and the computing units in each region configure their storage and calculation modules to complete the parallel computation of multi-size convolution kernels.
The construction method of the flexibly configurable neural network computing unit of the invention comprises the following steps:
Step one: extract the network parameters from the model of the target network.
Step two: based on step one, design the configurable storage module in the neural network computing unit, which stores the partial feature map data and weight data used for computation and comprises: a feature map data buffer, a step data buffer and a weight data buffer.
Step three: based on step one, design the configurable control module in the neural network computing unit; under different convolution modes the control module configures different buffer sizes for the storage module, generates the various working modes of each buffer during convolution calculation and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state machine module.
Step four: based on step one, design the multiply-add calculation module in the neural network computing unit, which multiplies feature map data by weights and accumulates the products into partial sums of the convolution result, and comprises: a time-division-multiplexable multiplier and adder.
Step five: combining steps two, three and four, provide five convolutional layer parameters, namely the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped to the computing unit, to the configurable control module of the neural network computing unit through an external input port; the configurable control module configures the buffer space required by that layer's convolution calculation for the configurable storage module and controls the configurable storage module to output the corresponding data to the multiply-add calculation module. Different convolutional layers can thus complete their partial convolution calculations on the same neural network computing unit simply by providing the corresponding convolution parameters.
With the flexibly configurable neural network computing unit, computing array and construction method of the invention, a convolution mode of any type only needs to supply a few convolution parameters to the computing unit; the computing unit configures its internal storage and calculation modules by itself, and inputting the feature maps and weight data then completes the convolution calculation of the corresponding mode. This greatly improves the flexibility of convolutional layers in hardware implementations of convolutional neural networks and facilitates their rapid deployment on hardware.
In the invention, the external input provides the convolutional layer parameters to the control module of the computing unit, and the control module configures the storage space appropriately and controls the storage module to work in the corresponding mode. Each buffer in the storage module has several working modes, so shared data can be utilized to the maximum extent. Different convolutional layers can be completed on the same computing array built from the configurable computing units, and by partitioning the array into regions, parallel convolution calculations with kernels of different sizes can be completed on it. The invention supports convolution calculations of any type and parallel computation of multi-size convolution kernels, fully exploits the flexibility and data reusability of the convolutional neural network computing unit, greatly reduces the system power consumption caused by data migration, and improves the computing efficiency of the system.

Claims (6)

1. A flexibly configurable neural network computing unit, characterized by comprising: a configurable storage module, a configurable control module and a time-division-multiplexable multiply-add calculation module;
the configurable storage module comprises: a feature map data buffer, a step data buffer and a weight data buffer;
the configurable control module comprises: a counter module and a state machine module;
the multiply-add calculation module comprises: a multiplier and an accumulator;
the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter;
in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module;
the states of the feature map data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state;
the states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state and a no-cycle state;
the feature map data buffer is used for storing part of the feature map data used in convolution calculation and recycling feature map data that is shared, the maximum length of the buffer being L1, of size $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$, where $K$ is the size of a convolution kernel in the convolutional layer, $A$ is the number of input channels to be mapped in the computing unit, and $i$ is the index of the convolutional layer in the target network;
the step data buffer is used for providing the data to be updated to the feature map buffer when the convolution kernel slides by one step, the maximum length of the buffer being L2, of size $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$, where $S$ is the stride of a convolution kernel in the convolutional layer;
the weight data buffer is used for storing weight data and can recycle it, the length of the buffer being L3, of size $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$, where $B$ is the number of output channels to be mapped in the computing unit.
2. The flexibly configurable neural network computing unit of claim 1, characterized in that the computing unit is provided with a feature map data input port and a weight data input port;
the feature map data input port is connected to the input of the first selector; the two outputs of the first selector are connected to the input of the step data buffer and to the first input of the second selector respectively, the output of the step data buffer is connected to the second input of the second selector, and the output of the second selector is connected to the input of the feature map data buffer;
the weight data input port is connected to the input of the weight data buffer;
the output of the feature map data buffer and the output of the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected to the output of the neural network computing unit through the register, the accumulator and the fourth selector.
3. The flexibly configurable neural network computing unit of claim 1, characterized in that:
in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer.
4. A computing array, characterized in that it is generated by connecting the flexibly configurable neural network computing units of any one of claims 1 to 3 in a row-fixed data flow manner; the computing array is partitioned into regions, different regions can be given different convolutional layer parameters, and parallel computation of different types of convolution modes is completed;
the size of the computing array is determined by the hardware resources, the target network model and the computing performance requirements of the system; the width of the array is $K$, which must be greater than or equal to the maximum convolution kernel size $K_{max}$ in the network model and greater than or equal to the sum of the sizes of the convolution kernels that must be computed in parallel when kernels of different sizes occur in the same convolutional layer; the basic length of the array is $H$, equal to the minimum size of all convolutional-layer output feature maps in the network model, and the actual length of the array is extended by powers of two ($2^n$) according to the available hardware resources and the computing performance requirements of the system; when convolutional layers with kernels of different sizes must be computed in parallel, let the kernel sizes be $K_1, K_2, \ldots, K_i$, with

$$K_1 + K_2 + \cdots + K_i \le K;$$

the computing array is divided transversely into $i$ regions of size $K_1 \times H, K_2 \times H, \ldots, K_i \times H$; different convolution-type parameters are input to different regions, and the computing units in each region configure their storage and calculation modules to complete the parallel computation of multi-size convolution kernels.
5. A method of constructing a flexibly configurable neural network computing unit, comprising the flexibly configurable neural network computing unit of any one of claims 1 to 3, characterized by comprising the following steps:
Step one: extract the network parameters from the model of the target network.
Step two: based on step one, design the configurable storage module in the neural network computing unit, which stores the partial feature map data and weight data used for computation and comprises: a feature map data buffer, a step data buffer and a weight data buffer.
Step three: based on step one, design the configurable control module in the neural network computing unit; under different convolution modes the control module configures different buffer sizes for the storage module, generates the various working modes of each buffer during convolution calculation and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state machine module.
Step four: based on step one, design the multiply-add calculation module in the neural network computing unit, which multiplies feature map data by weights and accumulates the products into partial sums of the convolution result, and comprises: a time-division-multiplexable multiplier and adder.
Step five: combining steps two, three and four, provide five convolutional layer parameters, namely the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped to the computing unit, to the configurable control module of the neural network computing unit through an external input port; the configurable control module configures the buffer space required by that layer's convolution calculation for the configurable storage module and controls the configurable storage module to output the corresponding data to the multiply-add calculation module; different convolutional layers complete their partial convolution calculations on the same neural network computing unit by providing the corresponding convolution parameters to the computing unit.
6. The method of constructing a flexibly configurable neural network computing unit according to claim 5, characterized in that step one specifically comprises: extracting the required parameters from the target network model, including the convolution kernel size $K_i$ and sliding stride $S_i$ of each convolutional layer, the output feature map size $H_i$ of each convolutional layer, and the numbers $A_i$ and $B_i$ of input and output channels to be mapped in the computing unit for each convolutional layer, where $i$ is the index of the convolutional layer;
in step two: the feature map data buffer stores the partial pixel data used in convolution calculation and recycles pixel data that is shared, with buffer length $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$; the step data buffer provides the data to be updated to the feature map buffer when the convolution kernel slides by one step, with buffer length $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$; the weight data buffer stores the weight data and can recycle it, with buffer length $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$;
in step three: the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter; in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module; the states comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state.
Application CN201811133940.2A, priority/filing date 2018-09-27: Flexibly configurable neural network computing unit, computing array and construction method thereof. Status: Active. Granted as CN109409512B.

Priority Application (1)

Application Number: CN201811133940.2A; Priority/Filing Date: 2018-09-27; Title: Flexibly configurable neural network computing unit, computing array and construction method thereof

Publications (2)

CN109409512A (en), published 2019-03-01
CN109409512B (en), granted 2021-02-19

Family

ID: 65465369
Family application: CN201811133940.2A, priority/filing date 2018-09-27, granted as CN109409512B (Active): Flexibly configurable neural network computing unit, computing array and construction method thereof
Country status: CN, CN109409512B (en)





Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant