CN109409512B - Flexibly configurable neural network computing unit, computing array and construction method thereof


Info

Publication number
CN109409512B
CN109409512B (application CN201811133940.2A; published as CN109409512A)
Authority
CN
China
Prior art keywords
data
buffer
state
convolution
calculation
Prior art date
Legal status
Active
Application number
CN201811133940.2A
Other languages
Chinese (zh)
Other versions
CN109409512A (en)
Inventor
Ren Pengju (任鹏举)
Fan Long (樊珑)
Zhao Boran (赵博然)
Zong Pengchen (宗鹏陈)
Chen Fei (陈飞)
Zheng Nanning (郑南宁)
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811133940.2A
Publication of CN109409512A
Application granted
Publication of CN109409512B
Legal status: Active (granted)

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means (under G06N 3/06 Physical realisation)
    • G06N 3/045 Combinations of networks (under G06N 3/04 Architecture, e.g. interconnection topology)

Abstract

The invention discloses a flexibly configurable neural network computing unit, a computing array and a construction method thereof. The neural network computing unit comprises a configurable storage module, a configurable control module and a time-division-multiplexable multiply-add calculation module. The configurable storage module comprises a feature map data buffer, a step data buffer and a weight data buffer; the configurable control module comprises a counter module and a state machine module; the multiply-add calculation module comprises a multiplier and an accumulator. The invention can support convolution calculations of any type and parallel computation of multi-size convolution kernels, fully exploits the flexibility and data reusability of the convolutional neural network computing unit, greatly reduces the system power consumption caused by data migration, and improves the computing efficiency of the system.

Description

Flexibly configurable neural network computing unit, computing array and construction method thereof
Technical Field
The invention belongs to the field of neural network hardware architecture, and particularly relates to a flexibly configurable neural network computing unit, a computing array and a construction method thereof.
Background
A flexible hardware computing architecture has a significant impact on the hardware implementation of convolutional neural networks. The convolutional layer is the principal structure in a convolutional neural network and is characterized by a large amount of computation and strong data reusability. Through weight sharing, the convolutional layer reduces the complexity of the network model, greatly reduces the number of parameters, and avoids the complicated feature extraction and data reconstruction processes of traditional recognition algorithms.
In a convolutional neural network, the convolutional layer convolves the same group of input feature maps with a group of convolution kernels, one per output channel, producing as many output feature maps as there are output channels and thereby completing feature extraction. As convolutional neural networks continue to develop and the demand for them grows, network models multiply, networks deepen, and the convolution modes of convolutional layers become complicated and variable.
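As context for the description above, a minimal NumPy reference model of this convolutional-layer computation is sketched below; the array shapes, variable names and the absence of padding are our illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def conv_layer(ifmaps, weights, stride):
    """Reference model of a convolutional layer: A input channels convolved
    with one K x K kernel per (output, input) channel pair, giving B output
    feature maps. No padding is assumed (an illustrative choice)."""
    B, A, K, _ = weights.shape          # weights: (B, A, K, K)
    _, H_in, W_in = ifmaps.shape        # ifmaps:  (A, H_in, W_in)
    H_out = (H_in - K) // stride + 1
    W_out = (W_in - K) // stride + 1
    ofmaps = np.zeros((B, H_out, W_out))
    for b in range(B):                  # one kernel set per output channel
        for y in range(H_out):
            for x in range(W_out):
                window = ifmaps[:, y*stride:y*stride+K, x*stride:x*stride+K]
                ofmaps[b, y, x] = np.sum(window * weights[b])  # multiply-accumulate
    return ofmaps
```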
Therefore, a neural network computing unit architecture that is highly flexible, offers high computing performance and can be reused is of great significance for the hardware implementation of convolutional layers. At present, most convolutional-layer computing units in hardware implementations support only one type of convolution mode, cannot handle the different kinds of convolutional layers found within a network model, and cannot fully exploit the data reusability of convolutional layers.
Disclosure of Invention
The invention aims to provide a flexibly configurable neural network computing unit, a computing array and a construction method thereof, which effectively enhance the flexibility of convolutional layers in hardware implementations, improve the computing efficiency of the system, and fully exploit the data reusability of the convolutional layer, thereby reducing system power consumption and the use of storage resources.
To achieve this aim, the technical solution is as follows:
a flexibly configurable neural network computing unit comprises: a configurable storage module, a configurable control module and a time-division-multiplexable multiply-add calculation module;
the configurable storage module comprises: a feature map data buffer, a step data buffer and a weight data buffer;
the configurable control module comprises: a counter module and a state machine module;
the multiply-add calculation module comprises: a multiplier and an accumulator.
Further, the feature map data buffer is used for storing part of the feature map data used in convolution calculation and recycling feature map data that is shared; the maximum length of the buffer is L1, of size $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$, where $K$ is the size of a convolution kernel in the convolutional layer, $A$ is the number of input channels to be mapped in the computing unit, and $i$ is the index of the convolutional layer in the target network;
the step data buffer is used for providing the data that must be updated to the feature map buffer when the convolution kernel slides by one step; the maximum length of the buffer is L2, of size $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$, where $S$ is the stride of a convolution kernel in the convolutional layer;
the weight data buffer is used for storing weight data and can recycle it; the length of the buffer is L3, of size $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$, where $B$ is the number of output channels to be mapped in the computing unit.
Furthermore, the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter;
in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module.
Furthermore, the neural network computing unit is provided with a feature map data input port and a weight data input port;
the feature map data input port is connected to the input of the first selector; the two outputs of the first selector are connected to the input of the step data buffer and to the first input of the second selector respectively, the output of the step data buffer is connected to the second input of the second selector, and the output of the second selector is connected to the input of the feature map data buffer;
the weight data input port is connected to the input of the weight data buffer;
the output of the feature map data buffer and the output of the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected to the output of the neural network computing unit through the register, the accumulator and the fourth selector.
Furthermore, in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module;
the states of the feature map data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state;
the states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state and a no-cycle state.
Further, in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer. A behavioral sketch of these working modes follows.
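The sketch below models one output cycle of the feature map data buffer in each of these working modes; the deque-backed buffers and the state names are our illustrative choices, not the patent's signal names:

```python
from collections import deque
from enum import Enum, auto

class FmapState(Enum):
    INIT = auto(); READY = auto(); WAIT = auto()
    FULL_CYCLE = auto(); UPDATE = auto(); HALF_CYCLE = auto(); NO_CYCLE = auto()

def fmap_buffer_step(state, buf, step_buf, out):
    """One output cycle of the feature map buffer (behavioral sketch).
    buf / step_buf are deques whose head is the next datum out; out collects
    data sent to the multiply-add module."""
    if state in (FmapState.INIT, FmapState.READY, FmapState.WAIT):
        return                              # nothing issued yet / stalled
    d = buf.popleft()
    out.append(d)                           # datum enters the multiplier
    if state == FmapState.FULL_CYCLE:
        buf.append(d)                       # recycle to the tail of the buffer
    elif state == FmapState.UPDATE:
        buf.append(step_buf.popleft())      # fetch stride-update data instead
    elif state == FmapState.HALF_CYCLE:
        buf.insert(len(buf) - 1, d)         # re-enter just before the new datum
    # NO_CYCLE: the datum is consumed and not returned
```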
A computing array is generated by instantiating a plurality of the configurable computing units; the computing array is partitioned into regions, different regions can be given different convolutional layer parameters, and parallel computation of different types of convolution modes is completed.
A computing array is generated by connecting the flexibly configurable neural network computing units in a row-fixed data flow manner; the size of the computing array is determined by the hardware resources, the target network model and the computing performance requirements of the system. The width of the array is $K$, which must be greater than or equal to the maximum convolution kernel size $K_{max}$ in the network model, and also greater than or equal to the sum of the sizes of the convolution kernels that must be computed in parallel when kernels of different sizes occur in the same convolutional layer. The basic length of the array is $H$, equal to the minimum size of all convolutional-layer output feature maps in the network model; the actual length of the array is extended by powers of two ($2^n$) according to the available hardware resources and the computing performance requirements of the system. When convolutional layers with kernels of different sizes must be computed in parallel, let the kernel sizes be $K_1, K_2, \ldots, K_i$, with

$$K_1 + K_2 + \cdots + K_i \le K.$$

The computing array is divided transversely into $i$ regions of size $K_1 \times H, K_2 \times H, \ldots, K_i \times H$; different convolution-type parameters are input to different regions, and the computing units in each region configure their storage and calculation modules to complete the parallel computation of multi-size convolution kernels.
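A sketch of this region partitioning, checking the width constraint stated above; the way a region is represented is an assumption made for illustration:

```python
def partition_array(kernel_sizes, K, H):
    """Split a K x H computing array transversely into one K_j x H region per
    kernel size, requiring K_1 + ... + K_i <= K as in the text above."""
    assert sum(kernel_sizes) <= K, "array width cannot host all kernels in parallel"
    regions, col = [], 0
    for Kj in kernel_sizes:
        regions.append({'kernel': Kj, 'cols': (col, col + Kj), 'rows': H})
        col += Kj
    return regions  # columns col..K-1 remain for other work or stay idle

# e.g. a 16-wide array hosting 3x3, 5x5 and 7x7 kernels side by side
print(partition_array([3, 5, 7], K=16, H=13))
```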
A construction method of a flexibly configurable neural network computing unit comprises the following steps:
Step one: extract the network parameters from the model of the target network.
Step two: based on step one, design the configurable storage module in the neural network computing unit, which stores the partial feature map data and weight data used for computation and comprises: a feature map data buffer, a step data buffer and a weight data buffer.
Step three: based on step one, design the configurable control module in the neural network computing unit; under different convolution modes the control module configures different buffer sizes for the storage module, generates the various working modes of each buffer during convolution calculation and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state machine module.
Step four: based on step one, design the multiply-add calculation module in the neural network computing unit, which multiplies feature map data by weights and accumulates the products into partial sums of the convolution result, and comprises: a time-division-multiplexable multiplier and adder.
Step five: combining steps two, three and four, provide five convolutional layer parameters, namely the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped to the computing unit, to the configurable control module of the neural network computing unit through an external input port; the configurable control module configures the buffer space required by that layer's convolution calculation for the configurable storage module and controls the configurable storage module to output the corresponding data to the multiply-add calculation module. Different convolutional layers can thus complete their partial convolution calculations on the same neural network computing unit simply by providing the corresponding convolution parameters.
Further, step one specifically comprises: extracting the required parameters from the target network model, including the convolution kernel size $K_i$ and sliding stride $S_i$ of each convolutional layer, the output feature map size $H_i$ of each convolutional layer, and the numbers $A_i$ and $B_i$ of input and output channels to be mapped in the computing unit for each convolutional layer, where $i$ is the index of the convolutional layer;
in step two: the feature map data buffer stores the partial pixel data used in convolution calculation and recycles pixel data that is shared, with buffer length $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$; the step data buffer provides the data to be updated to the feature map buffer when the convolution kernel slides by one step, with buffer length $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$; the weight data buffer stores the weight data and can recycle it, with buffer length $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$;
in step three: the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter; in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module; the states comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state.
Further, the different states of the state machine determine the different working modes of the buffer, specifically:
in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer.
Further, step four specifically comprises: the multiply-add calculation module comprises a multiplier and an accumulator; by time-division multiplexing, the multiplier and accumulator operate at N times the working frequency, so that N neural network computing units can share one multiplier and one accumulator; the accumulation depth of the accumulator equals the convolution kernel size of the current convolutional layer. A sketch of this sharing scheme follows.
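A minimal model of the sharing scheme, assuming one multiply-accumulate circuit running N times faster than the base clock is granted to the N units in round-robin; operand values and the stream layout are placeholders:

```python
def shared_mac(operand_streams):
    """Time-division multiplexing sketch: one MAC serves N units round-robin,
    one fast clock tick per MAC. operand_streams[u] lists unit u's
    (pixel, weight) pairs; its length is the accumulation depth, which the
    text above ties to the convolution kernel size of the current layer."""
    n = len(operand_streams)
    depth = len(operand_streams[0])
    partial_sums = [0.0] * n
    for fast_cycle in range(n * depth):     # N fast cycles per base clock cycle
        u = fast_cycle % n                  # unit owning the MAC this tick
        x, w = operand_streams[u][fast_cycle // n]
        partial_sums[u] += x * w
    return partial_sums

# two units sharing one MAC, accumulation depth 3
print(shared_mac([[(1, 2), (3, 4), (5, 6)],
                  [(1, 1), (2, 2), (3, 3)]]))  # -> [44.0, 14.0]
```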
Furthermore, a computing array is generated by instantiating a plurality of the configurable computing units; the array is partitioned into regions, different regions can be given different convolutional layer parameters, and parallel computation of different types of convolution modes is completed.
Further, in step five, the external input port provides five input signals: the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped inside the computing unit. The control module configures the storage space of each buffer: the feature map data buffer length is configured as k*a, the step data buffer length as s*a, and the weight data buffer length as k*a*b. The control module also configures the upper limit of each counter: the input data counter and output data counter limits are k*a, the input weight counter limit is k*a*b, the output channel number counter limit is b, and the output feature map size counter limit is h. When the input counters all reach their upper limits, the computing unit starts calculating, and each output counter counts accordingly and controls the transitions of the state machine; when each output counter reaches its upper limit, one convolution calculation for part or all of the output channels of the convolutional layer is complete. The configuration sketch below illustrates these derived values.
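The derivation of every buffer length and counter limit from the five externally supplied parameters can be written down directly; the dict keys are our own labels, not signal names from the patent:

```python
def configure_unit(k, s, h, a, b):
    """Map the five convolutional layer parameters (kernel size k, stride s,
    output feature map size h, mapped input/output channels a, b) onto the
    buffer lengths and counter upper limits listed above."""
    return {
        'fmap_buffer_len':   k * a,      # feature map data buffer
        'step_buffer_len':   s * a,      # step data buffer
        'weight_buffer_len': k * a * b,  # weight data buffer
        'in_data_limit':     k * a,      # input data counter
        'in_weight_limit':   k * a * b,  # input weight counter
        'out_data_limit':    k * a,      # output data counter
        'out_channel_limit': b,          # output channel number counter
        'ofmap_size_limit':  h,          # output feature map size counter
    }

# e.g. a 3x3 stride-1 layer, 4 input and 2 output channels, 13x13 output map
print(configure_unit(k=3, s=1, h=13, a=4, b=2))
```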
Furthermore, according to the hardware resources and the computing performance requirements of the system, a plurality of the configurable computing units can be instantiated and interconnected to generate a convolution computing array capable of completing the convolution calculations of different types of convolutional layers. In some network models the same convolutional layer contains kernels of two or more sizes; the array can then be partitioned into regions, each supplied with different convolution parameters. To keep the output results of all regions of the array synchronized, the time difference between the outputs of computing units in different regions is derived from the difference between the kernel sizes; the computing units in regions with less work wait for those in regions with more work until the difference is zero, then start calculating, which guarantees the synchronization of the array's outputs and completes the parallel calculation of different convolution modes. A sketch of this wait-cycle computation follows.
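One way to realize the wait-until-zero rule is sketched below. The per-output cost model (K_j * a MAC cycles per output, i.e. one kernel row across a mapped input channels) is our assumption for illustration; the patent only states that the difference is derived from the kernel sizes:

```python
def wait_cycles(kernel_sizes, a=1):
    """Idle cycles per region before starting, so that all regions finish each
    output at the same time. Assumes a unit spends K_j * a MAC cycles per
    output (illustrative cost model, not specified by the patent)."""
    slowest = max(kernel_sizes) * a
    return {K: slowest - K * a for K in kernel_sizes}

print(wait_cycles([3, 5, 7], a=2))  # -> {3: 8, 5: 4, 7: 0}
```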
Compared with the prior art, the invention has the following beneficial effects. The invention discloses a flexibly configurable neural network computing unit, a computing array and a construction method thereof. A plurality of the configurable neural network computing units are instantiated and arranged to generate a complete convolution computing array; the array can be partitioned into regions, different convolution parameters can be input to different regions, and parallel computation of different convolution modes can be completed. The invention designs a hardware architecture for the convolutional layer of a convolutional neural network. While guaranteeing the computing performance of the system, the architecture supports the convolution modes of convolutional layers in different network models, greatly improving the flexibility of the system. The working modes of the buffers inside the computing unit make full use of the data reusability of the convolutional neural network, effectively reducing the system power consumption generated by data migration and easing the storage burden to a certain extent. A computing array composed of multiple computing units can support convolution kernels of different sizes computing in parallel, fully exploiting the algorithmic parallelism and data reusability of the convolutional layers of a convolutional neural network.
Drawings
FIG. 1 is a schematic diagram of an overall structure of a flexibly configurable neural network computing unit according to the present invention;
FIG. 2 is a schematic diagram of a control module of a flexibly configurable neural network computing unit according to the present invention;
FIG. 3 is a schematic diagram of a state machine for a feature map data buffer in a control module;
FIG. 4 is a schematic diagram illustrating the generation of a computational array by a plurality of computational units according to the present invention.
Detailed Description
The invention will be described in further detail below with reference to the accompanying drawings.
referring to fig. 1, a flexibly configurable neural network computing unit of the present invention includes: the system comprises a configurable storage module, a configurable control module and a multiply-add calculation module capable of time division multiplexing; the configurable memory module includes: a characteristic map data buffer, a step data buffer and a weight data buffer; the configurable control module includes: a counter module and a state machine module; the multiplication and addition calculation module comprises: a multiplier and an accumulator.
The feature map data buffer stores part of the feature map data used in convolution calculation and recycles feature map data that is shared; the maximum length of the buffer is L1, of size $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$, where $K$ is the size of a convolution kernel in the convolutional layer, $A$ is the number of input channels to be mapped in the computing unit, and $i$ is the index of the convolutional layer in the target network. The step data buffer provides the data to be updated to the feature map buffer when the convolution kernel slides by one step; the maximum length of the buffer is L2, of size $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$, where $S$ is the stride of a convolution kernel in the convolutional layer. The weight data buffer stores weight data and can recycle it; the length of the buffer is L3, of size $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$, where $B$ is the number of output channels to be mapped in the computing unit.
The neural network computing unit is provided with a feature map data input port and a weight data input port. The feature map data input port is connected to the input of the first selector 1; the two outputs of the first selector 1 are connected to the input of the step data buffer and to the first input of the second selector 2 respectively; the output of the step data buffer is connected to the second input of the second selector 2, and the output of the second selector 2 is connected to the input of the feature map data buffer. The weight data input port is connected to the input of the weight data buffer. The output of the feature map data buffer and the output of the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected to the output of the neural network computing unit through the register, the accumulator and the fourth selector 4. The output of the feature map data buffer also feeds back through the input of the third selector.
Referring to FIG. 2, the internal structure of the buffer control module mainly comprises a counter module and a state machine module. The counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter. In the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module.
Referring to FIG. 3, a schematic diagram of the state machine of the feature map data buffer in the control module, the states comprise: an initialization state S0, a data-ready state S1, a wait state S6, a full-cycle state S2, an update-data state S3, a half-cycle state S4 and a no-cycle state S5, where the different states determine the different working modes of the buffer. The control signals input externally to the computing array provide the buffer control module with five pieces of information: the convolution kernel size, the sliding stride of the convolution window, the output feature map size, and the numbers of input and output channels mapped to the array. From these, the upper limit of each counter is obtained; the counter values drive the state transitions of the state machine for the corresponding kernel size, completing the operation of each storage buffer under different convolution kernel sizes. The states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state and a no-cycle state.
The different states of the state machine determine the different working modes of the buffer, specifically:
in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer.
Referring to FIG. 4, a plurality of computing units are connected in a row-fixed data flow manner to generate a computing array, whose scale is determined by the hardware resources, the target network model and the computing performance requirements of the system. The width of the array is $K$, which must be greater than or equal to the maximum convolution kernel size $K_{max}$ in the network model, and also greater than or equal to the sum of the sizes of the convolution kernels that must be computed in parallel when kernels of different sizes occur in the same convolutional layer. The basic length of the array is $H$, equal to the minimum size of all convolutional-layer output feature maps in the network model; the actual length can be extended by powers of two ($2^n$) according to the available hardware resources and the computing performance requirements of the system. When convolutional layers with kernels of different sizes must be computed in parallel, let the kernel sizes be $K_1, K_2, \ldots, K_i$, with

$$K_1 + K_2 + \cdots + K_i \le K.$$

The computing array is divided transversely into $i$ regions of size $K_1 \times H, K_2 \times H, \ldots, K_i \times H$; different convolution-type parameters are input to different regions, and the computing units in each region configure their storage and calculation modules to complete the parallel computation of multi-size convolution kernels.
The construction method of the flexibly configurable neural network computing unit of the invention comprises the following steps:
Step one: extract the network parameters from the model of the target network.
Step two: based on step one, design the configurable storage module in the neural network computing unit, which stores the partial feature map data and weight data used for computation and comprises: a feature map data buffer, a step data buffer and a weight data buffer.
Step three: based on step one, design the configurable control module in the neural network computing unit; under different convolution modes the control module configures different buffer sizes for the storage module, generates the various working modes of each buffer during convolution calculation and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state machine module.
Step four: based on step one, design the multiply-add calculation module in the neural network computing unit, which multiplies feature map data by weights and accumulates the products into partial sums of the convolution result, and comprises: a time-division-multiplexable multiplier and adder.
Step five: combining steps two, three and four, provide five convolutional layer parameters, namely the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped to the computing unit, to the configurable control module of the neural network computing unit through an external input port; the configurable control module configures the buffer space required by that layer's convolution calculation for the configurable storage module and controls the configurable storage module to output the corresponding data to the multiply-add calculation module. Different convolutional layers can thus complete their partial convolution calculations on the same neural network computing unit simply by providing the corresponding convolution parameters.
With the flexibly configurable neural network computing unit, computing array and construction method of the invention, a convolution mode of any type only needs to supply a few convolution parameters to the computing unit; the computing unit configures its internal storage and calculation modules by itself, and inputting the feature maps and weight data then completes the convolution calculation of the corresponding mode. This greatly improves the flexibility of convolutional layers in hardware implementations of convolutional neural networks and facilitates their rapid deployment on hardware.
In the invention, the external input provides the convolutional layer parameters to the control module of the computing unit, and the control module configures the storage space appropriately and controls the storage module to work in the corresponding mode. Each buffer in the storage module has several working modes, so shared data can be utilized to the maximum extent. Different convolutional layers can be completed on the same computing array built from the configurable computing units, and by partitioning the array into regions, parallel convolution calculations with kernels of different sizes can be completed on it. The invention supports convolution calculations of any type and parallel computation of multi-size convolution kernels, fully exploits the flexibility and data reusability of the convolutional neural network computing unit, greatly reduces the system power consumption caused by data migration, and improves the computing efficiency of the system.

Claims (6)

1. A flexibly configurable neural network computing unit, characterized by comprising: a configurable storage module, a configurable control module and a time-division-multiplexable multiply-add calculation module;
the configurable storage module comprises: a feature map data buffer, a step data buffer and a weight data buffer;
the configurable control module comprises: a counter module and a state machine module;
the multiply-add calculation module comprises: a multiplier and an accumulator;
the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter;
in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module;
the states of the feature map data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state;
the states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state and a no-cycle state;
the feature map data buffer is used for storing part of the feature map data used in convolution calculation and recycling feature map data that is shared, the maximum length of the buffer being L1, of size $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$, where $K$ is the size of a convolution kernel in the convolutional layer, $A$ is the number of input channels to be mapped in the computing unit, and $i$ is the index of the convolutional layer in the target network;
the step data buffer is used for providing the data to be updated to the feature map buffer when the convolution kernel slides by one step, the maximum length of the buffer being L2, of size $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$, where $S$ is the stride of a convolution kernel in the convolutional layer;
the weight data buffer is used for storing weight data and can recycle it, the length of the buffer being L3, of size $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$, where $B$ is the number of output channels to be mapped in the computing unit.
2. The flexibly configurable neural network computing unit of claim 1, characterized in that the computing unit is provided with a feature map data input port and a weight data input port;
the feature map data input port is connected to the input of the first selector; the two outputs of the first selector are connected to the input of the step data buffer and to the first input of the second selector respectively, the output of the step data buffer is connected to the second input of the second selector, and the output of the second selector is connected to the input of the feature map data buffer;
the weight data input port is connected to the input of the weight data buffer;
the output of the feature map data buffer and the output of the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected to the output of the neural network computing unit through the register, the accumulator and the fourth selector.
3. The flexibly configurable neural network computing unit of claim 1, characterized in that:
in the initialization state, no data has entered the computing unit;
in the data-ready state, input data has entered the computing unit but its amount is not yet sufficient to start the calculation;
in the wait state, used when convolution kernels of different sizes run in parallel, a computing unit with a smaller convolution kernel, having less work to do, waits for the computing units with larger kernels so that the output result data stay synchronized;
in the full-cycle state, if the data currently output by the buffer will be reused, it returns to the tail of the buffer's allocated space as it enters the multiply-add calculation module, completing the recycling;
in the update-data state, which exists only in the feature map data buffer, the currently output data does not need to be reused: it enters the multiply-add calculation module while new data is fetched from the step data buffer and written to the tail of the feature map data buffer;
in the half-cycle state, which exists only in the feature map data buffer and follows the update-data state, the currently output data returns to the position just before the updated data in the buffer as it enters the multiply-add calculation module;
in the no-cycle state, the data currently output by the buffer does not need to be reused: it only enters the multiply-add calculation module and is not returned to the buffer.
4. A computing array, characterized in that it is generated by connecting the flexibly configurable neural network computing units of any one of claims 1 to 3 in a row-fixed data flow manner; the computing array is partitioned into regions, different regions can be given different convolutional layer parameters, and parallel computation of different types of convolution modes is completed;
the size of the computing array is determined by the hardware resources, the target network model and the computing performance requirements of the system; the width of the array is $K$, which must be greater than or equal to the maximum convolution kernel size $K_{max}$ in the network model and greater than or equal to the sum of the sizes of the convolution kernels that must be computed in parallel when kernels of different sizes occur in the same convolutional layer; the basic length of the array is $H$, equal to the minimum size of all convolutional-layer output feature maps in the network model, and the actual length of the array is extended by powers of two ($2^n$) according to the available hardware resources and the computing performance requirements of the system; when convolutional layers with kernels of different sizes must be computed in parallel, let the kernel sizes be $K_1, K_2, \ldots, K_i$, with

$$K_1 + K_2 + \cdots + K_i \le K;$$

the computing array is divided transversely into $i$ regions of size $K_1 \times H, K_2 \times H, \ldots, K_i \times H$; different convolution-type parameters are input to different regions, and the computing units in each region configure their storage and calculation modules to complete the parallel computation of multi-size convolution kernels.
5. A method of constructing a flexibly configurable neural network computing unit, comprising the flexibly configurable neural network computing unit of any one of claims 1 to 3, characterized by comprising the following steps:
Step one: extract the network parameters from the model of the target network.
Step two: based on step one, design the configurable storage module in the neural network computing unit, which stores the partial feature map data and weight data used for computation and comprises: a feature map data buffer, a step data buffer and a weight data buffer.
Step three: based on step one, design the configurable control module in the neural network computing unit; under different convolution modes the control module configures different buffer sizes for the storage module, generates the various working modes of each buffer during convolution calculation and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state machine module.
Step four: based on step one, design the multiply-add calculation module in the neural network computing unit, which multiplies feature map data by weights and accumulates the products into partial sums of the convolution result, and comprises: a time-division-multiplexable multiplier and adder.
Step five: combining steps two, three and four, provide five convolutional layer parameters, namely the convolution kernel size k, the convolution kernel stride s, the output feature map size h, and the numbers a and b of input and output channels mapped to the computing unit, to the configurable control module of the neural network computing unit through an external input port; the configurable control module configures the buffer space required by that layer's convolution calculation for the configurable storage module and controls the configurable storage module to output the corresponding data to the multiply-add calculation module; different convolutional layers complete their partial convolution calculations on the same neural network computing unit by providing the corresponding convolution parameters to the computing unit.
6. The method of constructing a flexibly configurable neural network computing unit according to claim 5, characterized in that step one specifically comprises: extracting the required parameters from the target network model, including the convolution kernel size $K_i$ and sliding stride $S_i$ of each convolutional layer, the output feature map size $H_i$ of each convolutional layer, and the numbers $A_i$ and $B_i$ of input and output channels to be mapped in the computing unit for each convolutional layer, where $i$ is the index of the convolutional layer;
in step two: the feature map data buffer stores the partial pixel data used in convolution calculation and recycles pixel data that is shared, with buffer length $\max\{K_1A_1, K_2A_2, \ldots, K_iA_i\}$; the step data buffer provides the data to be updated to the feature map buffer when the convolution kernel slides by one step, with buffer length $\max\{S_1A_1, S_2A_2, \ldots, S_iA_i\}$; the weight data buffer stores the weight data and can recycle it, with buffer length $\max\{K_1A_1B_1, K_2A_2B_2, \ldots, K_iA_iB_i\}$;
in step three: the counter module comprises an input data counter, an input weight counter, an output data counter, an output channel number counter and an output feature map size counter; in the state machine module, a feature map buffer state machine and a weight buffer state machine are provided for each convolution kernel size, and the state machines transition between states according to the values of the counters in the counter module; the states comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, an update-data state, a half-cycle state and a no-cycle state.
Application CN201811133940.2A, priority/filing date 2018-09-27: Flexibly configurable neural network computing unit, computing array and construction method thereof. Status: Active. Granted as CN109409512B.

Priority Application (1)

Application Number: CN201811133940.2A; Priority/Filing Date: 2018-09-27; Title: Flexibly configurable neural network computing unit, computing array and construction method thereof

Publications (2)

CN109409512A (en), published 2019-03-01
CN109409512B (en), granted 2021-02-19

Family

ID: 65465369
Family application: CN201811133940.2A, priority/filing date 2018-09-27, granted as CN109409512B (Active): Flexibly configurable neural network computing unit, computing array and construction method thereof
Country status: CN, CN109409512B (en)





Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant