CN115759213A - Convolutional neural network coprocessor for one-dimensional convolution

Info

Publication number: CN115759213A
Application number: CN202211077216.9A
Authority: CN (China)
Prior art keywords: input, RAM, data, convolution, output
Legal status: Pending
Original language: Chinese (zh)
Inventors: 张琛 (Zhang Chen), 王新安 (Wang Xin'an), 李健 (Li Jian), 王晨阳 (Wang Chenyang), 李秋平 (Li Qiuping)
Current assignee: Peking University Shenzhen Graduate School
Original assignee: Peking University Shenzhen Graduate School
Application filed by Peking University Shenzhen Graduate School; priority to CN202211077216.9A; publication of CN115759213A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A convolutional neural network coprocessor for one-dimensional convolution comprises a central controller, an on-chip RAM, and a multiply accumulator array. The multiply accumulator array comprises Kout × Xin multiply accumulators arranged in Kout layers and Xin columns. Each multiply accumulator receives the input feature map data of one channel and the weight data of one channel of a convolution kernel, and performs the convolution of a single-channel one-dimensional convolution kernel with a one-dimensional input feature map. Multiply accumulators in the same column receive the input feature map data of the same channel; each layer of multiply accumulators corresponds to one convolution kernel, and each multiply accumulator receives the weight data of one channel of the convolution kernel corresponding to its layer.

Description

Convolutional neural network coprocessor for one-dimensional convolution
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a convolutional neural network coprocessor for one-dimensional convolution.
Background
Many engineering applications need to process one-dimensional signals, such as voice signals and bioelectric signals. Because these signals are characterized by a voltage amplitude that changes over time, they can be recorded as a one-dimensional array ordered along the time axis, i.e., a one-dimensional signal. Artificial intelligence is applied in ever more fields and plays an important role in one-dimensional signal processing: for example, an artificial intelligence algorithm can perform semantic analysis on a voice signal, or classify collected bioelectric signals to analyze human health information. Compared with traditional data analysis, artificial intelligence shows a clear advantage in processing efficiency and recognition accuracy. There is also growing demand for one-dimensional signal analysis on portable devices such as wearables; for example, to realize real-time and accurate health monitoring, applying artificial intelligence algorithms in portable health monitoring devices, such as smart watches, has become a new trend. Because their battery capacity is limited, portable devices have very stringent power consumption requirements. However, artificial intelligence algorithms generally involve large data volumes and heavy computation, and the low efficiency and high power consumption of running them on a traditional general-purpose microcontroller are unacceptable. In the biomedical field, many research teams at home and abroad have therefore proposed dedicated bioelectric signal processors tailored to the characteristics of artificial intelligence algorithms, so as to reduce the power consumption and latency of processing bioelectric signals with such algorithms.
Because portable devices have strict requirements on power consumption and hardware cost, current research mainly builds hardware acceleration architectures around low-complexity artificial intelligence algorithms such as the SVM (Support Vector Machine) or small-scale NN (Neural Network) algorithms. SVM-based bioelectric signal processors are simple to implement and achieve high accuracy in some applications, so they were widely adopted in early research work. Hsu et al. proposed a low-power machine-learning-assisted ECG (Electrocardiogram) processor for mobile health applications [1], designing a function-switchable classification engine that can switch between an SVM and MLC (Maximum Likelihood Classification); features extracted by a preceding processing engine are fed into the classification engine to complete classification. In 2018, Zhejiang University proposed a low-power ECG processor for arrhythmia detection [2] that also includes an SVM classifier; to reduce the classifier's complexity, principal component analysis is used for feature dimensionality reduction.
Compared with the support vector machine, a neural network can achieve higher classification accuracy. The CNN (Convolutional Neural Network) in particular is highly robust to noise and can distinguish fine morphological differences in signals even when the data are noisy, so it is increasingly used for one-dimensional signal processing; for example, the CNN is one of the most common algorithms for ECG arrhythmia detection in biomedicine [3]. However, because the CNN algorithm is computation-intensive and carries a large number of weights, the operation speed of current processors still needs improvement, and a processor that can effectively accelerate the CNN algorithm is needed.
References:
[1] S.-Y. Hsu, Y. Ho, P.-Y. Chang, C. Su, and C.-Y. Lee, "A 48.6-to-105.2 μW Machine Learning Assisted Cardiac Sensor SoC for Mobile Healthcare Applications," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, Apr. 2014.
[2] Z. Chen et al., "An Energy-Efficient ECG Processor With Weak-Strong Hybrid Classifier for Arrhythmia Detection," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 7, Jul. 2018.
[3] Shu Lih Oh et al., "Automated beat-wise arrhythmia diagnosis using modified U-net on extended electrocardiographic recordings with heterogeneous arrhythmia types," Computers in Biology and Medicine, vol. 105, pp. 92-101, Dec. 2018.
Disclosure of Invention
The invention mainly solves the technical problem of effectively accelerating the CNN algorithm in one-dimensional signal processing.
In one embodiment, the present invention provides a convolutional neural network coprocessor for one-dimensional convolution, comprising:
the central controller is used for generating data addresses and controlling all parts in the convolutional neural network coprocessor to work;
the on-chip RAM is connected with the central controller and is used for storing input characteristic diagram data and weight data of the convolution kernel;
the multiply accumulator array is connected with the central controller and the on-chip RAM, and is used for reading the input feature map data and convolution kernel weight data from the on-chip RAM according to the data address and performing the convolution operation between the input feature map and the convolution kernels; the multiply accumulator array comprises Kout × Xin multiply accumulators, where Kout and Xin are natural numbers greater than 0, Kout being the number of layers and Xin the number of columns, and each multiply accumulator is used for performing the convolution of a single-channel one-dimensional convolution kernel with a one-dimensional input feature map;
when the convolution operation is performed, the on-chip RAM broadcasts the input feature map data and the convolution kernel weight data to all the multiply accumulators in parallel; each multiply accumulator receives the input feature map data of one channel, and the multiply accumulators in the same column at different layers share the input feature map data of one channel; each layer of multiply accumulators corresponds to one convolution kernel, and each multiply accumulator receives the weight data of one channel of the convolution kernel corresponding to its layer.
The convolutional neural network coprocessor for one-dimensional convolution according to this embodiment comprises a central controller, an on-chip RAM, and a multiply accumulator array. The multiply accumulator array comprises Kout × Xin multiply accumulators arranged in Kout layers and Xin columns; each multiply accumulator receives the input feature map data of one channel and the weight data of one channel of a convolution kernel, and performs the convolution of a single-channel one-dimensional convolution kernel with a one-dimensional input feature map; the multiply accumulators in the same column receive the input feature map data of the same channel; each layer of multiply accumulators corresponds to one convolution kernel, and each multiply accumulator receives the weight data of one channel of the convolution kernel corresponding to its layer. Since the number of output feature map channels equals the number of convolution kernels, the multiply accumulator array realizes parallel computation across the Xin input feature map channels and the Kout output feature map channels; at the same time, the input feature map data are shared among the layers of multiply accumulators, which reduces the number of memory accesses and effectively accelerates the operation of the one-dimensional convolutional neural network.
Drawings
FIG. 1 is a schematic diagram of a convolutional neural network;
FIG. 2 is a schematic diagram of a convolutional neural network coprocessor for one-dimensional convolution according to an embodiment;
FIG. 3 is a diagram illustrating an exemplary multiply accumulator array;
FIG. 4 is a diagram of the mapping relationship between the multiply accumulator and the input signature graph channel and the convolution kernel channel;
FIG. 5 is a diagram illustrating an exemplary multiply accumulator;
FIG. 6 is a diagram illustrating one embodiment of a one-dimensional convolution process performed by a multiply accumulator;
FIG. 7 is a diagram illustrating a structure of a multiply accumulator array according to another embodiment;
FIG. 8 is a diagram illustrating an arrangement format of output signature data stored in an on-chip RAM according to an embodiment;
FIG. 9 is a schematic structural diagram of a convolutional neural network coprocessor for one-dimensional convolution according to another embodiment;
FIG. 10 is a diagram illustrating an exemplary multiply accumulator array according to yet another embodiment;
FIG. 11 is a diagram illustrating an exemplary embodiment of a pooling unit;
FIG. 12 is a diagram illustrating an arrangement format of input feature map data of a fully connected layer, an arrangement format after weight blocking, and a correspondence relationship therebetween, according to an embodiment;
FIG. 13 is a diagram illustrating a format of an operation instruction according to an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings, wherein like elements in different embodiments are given like reference numbers. In the following description, numerous specific details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that in different instances some of the features may be omitted or replaced with other elements, materials, or methods. In some instances, certain operations related to the present application are not shown or described in detail, to avoid obscuring the core of the present application with excessive description; a detailed description of these operations is unnecessary, as those skilled in the art can fully understand them from the specification and the general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Likewise, the various steps or actions in the method descriptions may be transposed or reordered, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings serve only to describe certain embodiments and do not imply a required order unless it is otherwise stated that a certain order must be followed.
Component numbering such as "first", "second", etc. is used herein only to distinguish the described objects and carries no sequential or technical meaning. Unless otherwise indicated, the terms "connected" and "coupled" as used in this application include both direct and indirect connections (couplings).
Definitions of some terms used in this application:
1D-CNN: one-dimensional Convolutional Neural Networks, namely Convolutional Neural Networks with One-dimensional characteristic diagrams as input characteristic diagrams;
MAC: multiplier-accumulator, multiply accumulator;
RAM: random Access Memory, random Access Memory;
DMA: direct Memory Access, direct Memory Access;
FIFO: first In First Out, first In First Out.
The convolutional neural network CNN is robust: it can extract useful features from noise-contaminated signals, resolves subtle variations between signals well, and maintains high accuracy in multi-classification tasks. Owing to these advantages, the CNN is increasingly applied to one-dimensional signals; for example, the 1D-CNN is increasingly used in medical bioelectric signal classification applications such as myocardial ischemia detection, arrhythmia detection, and epilepsy detection.
In recent years, many research teams have proposed deep neural network accelerators capable of accelerating CNN operations, such as the DianNao series from Cambricon, the UNPU from KAIST, and Thinker from Tsinghua University; but these are designed for image processing application scenarios, have complex hardware structures and high power consumption overhead, and are not suitable for portable health monitoring devices. Some research groups have proposed CNN accelerator hardware architectures for biosignal processing. National Chiao Tung University proposed a CNN coprocessor for real-time epilepsy detection, with which a RISC-V core is interconnected via a custom interface to configure the coprocessor and transfer data. That CNN coprocessor is controlled by a state machine composed of several micro control units; the computation of the CNN algorithm is completed by 1 PE (Processing Element), which is shared by all layers of the CNN, and the corresponding micro control unit is selected to control the execution of the PE according to the operation stage. This saves hardware area but has low operation efficiency; moreover, since the micro control units are fixed by the algorithm steps, the coprocessor is only suitable for accelerating a CNN algorithm with a fixed structure, which limits its application scenarios.
In summary, among existing biosignal classification processors, SVM classifiers offer limited accuracy, while the CNN accelerators are designed for a specific algorithm in a particular application and therefore do not support configurable CNN parameters; moreover, CNN algorithms with different parameters may run on them with low efficiency.
The invention provides a novel convolutional neural network coprocessor that strikes a good trade-off among computational accuracy, operation throughput, and hardware utilization, so as to at least solve the above problems. Its hardware architecture is accelerated specifically for the CNN algorithm, so good accuracy can be maintained across different application scenarios. To obtain a higher throughput, the invention provides a MAC array for accelerating 1D-CNN convolution operations, realizing parallel operation across input feature map channels and across output feature map channels. At the same time, to adapt to convolution kernels of different sizes, the MAC array does not compute the weights within a convolution kernel in parallel; instead, each MAC completes a single-channel one-dimensional convolution, avoiding the waste of hardware resources caused by the typically small kernel sizes of CNN algorithms. The coprocessor can be applied in devices such as computers, smart phones, and smart watches, connected to the device's main processor, such as a CPU, and assists with convolutional neural network operations under the main processor's instructions.
To better understand the technical scheme of the present invention, the convolutional neural network CNN is first briefly introduced. Referring to fig. 1, a convolutional neural network is mainly composed of the following layers: convolutional layers, activation layers, pooling layers, and fully-connected layers (the fully-connected layers are the same as in conventional neural networks). In practice, a convolutional layer and its activation layer together are also commonly referred to simply as a convolutional layer. The feature map input to the CNN undergoes convolution and activation operations, then pooling, and finally the fully-connected layer operation, which outputs the inference result; for a classification task, for example, the inference result includes the probability of each class. The convolution, activation, and pooling operations may be performed multiple times, so the CNN may include multiple convolutional, activation, and pooling layers (denoted by n layers in fig. 1, where n is a natural number greater than 0). The convolutional neural network coprocessor of the present invention mainly targets one-dimensional signals; thus the feature map input to the CNN is one-dimensional, i.e., a feature map of size I_len × 1, where I_len is the length of the feature map.
Referring to fig. 2, the convolutional neural network coprocessor for one-dimensional convolution according to an embodiment of the present invention includes a central controller 1, an on-chip RAM 2, and a multiply accumulator array ("MAC array") 3, each described below.
The central controller 1 is used for receiving an operation instruction sent by the main processor, decoding the operation instruction, and generating a data address and a corresponding control signal so as to control each component in the convolutional neural network coprocessor to work, thereby ensuring accurate reading and writing of data and execution of operation. For example, for the on-chip RAM2 and the MAC array 3, the central controller 1 controls writing and reading of data in the on-chip RAM2, and controls the MAC array 3 to perform accurate operation.
The on-chip RAM2 is connected to the central controller 1, and stores input feature map data and weight data of the convolution kernel.
The input feature map may have multiple channels, and the convolution kernels may also have multiple channels; each convolution kernel has the same number of channels as the input feature map, and the number of output feature map channels equals the number of convolution kernels. Input and output feature maps are defined per layer: each layer's input feature map differs, and the output feature map of one layer is the input feature map of the next. For example, during a convolution operation the on-chip RAM 2 may store the input feature map of the convolutional layer, and during a pooling operation it may store the input feature map of the pooling layer, i.e., the output feature map of the activation layer.
The multiply accumulator array 3 is connected with the central controller 1 and the on-chip RAM 2, reads the input feature map data and convolution kernel weight data from the on-chip RAM 2 according to the data addresses generated by the central controller 1, and performs the convolution operation between the input feature map and the convolution kernels. Referring to fig. 3, the multiply accumulator array 3 includes Kout × Xin multiply accumulators, where Kout and Xin are natural numbers greater than 0, Kout being the number of layers and Xin the number of columns; that is, the multiply accumulator array 3 is divided into Kout layers of Xin multiply accumulators each, and each multiply accumulator performs the convolution of a single-channel one-dimensional convolution kernel with a one-dimensional input feature map.
The MAC array provided by the invention is mainly used for accelerating the operation of the convolutional layers in the 1D-CNN algorithm. The convolution calculation of the 1D-CNN algorithm can be divided into 4 nested loops; the calculation process is expressed in pseudo code as follows:
    for (k = 0; k < ofmap_num; k++) {                  // loop 4: over convolution kernels (output channels)
        for (y = 0; y < ofmap_size; y++) {             // loop 3: over output feature map elements
            sum = bias[k];
            for (x = 0; x < ifmap_num; x++) {          // loop 2: over input feature map channels
                for (i = 0; i < K_size; i++) {         // loop 1: over the K_size weights of one kernel channel
                    sum += ifmap[x][y * stride + i] * weight[k][x][i];
                }
            }
            ofmap[k][y] = ReLu(sum);
        }
    }
In the pseudo code, k is the index of the output feature map channel and ofmap_num is the number of output feature map channels, i.e., the number of convolution kernels; y is the index of an element within one channel of the output feature map and ofmap_size is the length of the output feature map; x is the index of the input feature map channel and ifmap_num is the number of input feature map channels; i is the index of a weight within the convolution kernel and K_size is the length of the convolution kernel (the convolution kernel size), i.e., each kernel channel contains K_size weights; stride is the step size, bias is the bias value, and ReLu is the activation function.
The MAC array proposed by the present invention performs loop 2 and loop 4 of the above pseudo code in parallel to speed up the convolution process. The number of layers Kout and the number of columns Xin of the MAC array 3 correspond to the parallelism parameters obtained by unrolling loop 4 and loop 2 of the 1D-CNN algorithm, respectively. Kout and Xin can be set according to the characteristics of the one-dimensional feature maps processed by the convolutional neural network and actual needs, and are not limited here. The inventors have found that for many one-dimensional signals, such as bioelectric signals, the number of feature map channels and the number of convolution kernels are generally multiples of 4, so Kout and Xin are preferably multiples of 4, for example Kout = 4 and Xin = 4.
Referring to fig. 3, when the convolution operation is performed, the on-chip RAM 2 broadcasts the input feature map data and the convolution kernel weight data to all MACs in parallel. Each MAC receives the input feature map data of one channel, and the MACs in the same column at different layers share the input feature map data of one channel; that is, the MACs of the same column receive the input feature map data of the same channel. Each layer of MACs corresponds to one convolution kernel, and each MAC receives the weight data of one channel of the convolution kernel corresponding to its layer. For example, with Kout = 4 and Xin = 4, the mapping between each MAC and the input feature map channels and convolution kernel channels is shown in fig. 4, where I_Cx denotes the x-th channel of the input feature map and Kk_Cx denotes the x-th channel of the k-th convolution kernel.
Thus the Xin MACs of each layer realize parallel operation across the Xin input feature map channels, accelerating loop 2; the Kout layers of the MAC array realize parallel operation across Kout convolution kernels, accelerating loop 4. The whole MAC array 3 can perform Kout × Xin single-channel one-dimensional convolutions in parallel. The MAC array 3 thereby realizes parallel computation across the Xin input feature map channels and the Kout output feature map channels, while the input feature map data are shared among the MAC layers, reducing the number of memory accesses and effectively accelerating the operation of the one-dimensional convolutional neural network.
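For illustration only, the broadcast and the MAC-to-channel mapping of fig. 4 can be modeled by the following C sketch (the array sizes, the flat data layout, and the name broadcast_step are illustrative assumptions, not part of the claimed hardware):

```c
#include <stdint.h>

#define KOUT 4   /* layers: one per convolution kernel          */
#define XIN  4   /* columns: one per input feature map channel  */

/* One broadcast cycle: the on-chip RAM 2 drives every MAC in parallel.
 * All KOUT MACs of column c consume the input sample of channel c,
 * while the MAC at layer k, column c consumes the weight of channel c
 * of kernel k (the I_Cx / Kk_Cx mapping of fig. 4). */
static void broadcast_step(const int16_t ifmap_sample[XIN],
                           const int16_t weight_sample[KOUT][XIN],
                           int32_t psum[KOUT][XIN])
{
    for (int k = 0; k < KOUT; k++)        /* in hardware these two loops */
        for (int c = 0; c < XIN; c++)     /* run fully in parallel       */
            psum[k][c] += (int32_t)weight_sample[k][c] * ifmap_sample[c];
}
```

The key point the sketch makes explicit is that the ifmap_sample value of column c is consumed by all Kout layers in the same cycle, which is what reduces the number of on-chip RAM accesses.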
The hardware configuration of the MAC is shown in fig. 5. Each MAC includes a multiplier 301, an adder 302, a third multiplexer 303, a Partial Sum Register (Partial Sum Register) 304, a fourth multiplexer 305, and an output Register 306. A first input end and a second input end of the multiplier 301 are respectively used for inputting the weight data and the input feature map data, and an output end of the multiplier is connected with a first input end of the adder 302; a second input terminal of the adder 302 is connected to an output terminal of the third multiplexer 303, and an output terminal thereof is connected to an input terminal of the partial sum register 304; the first input end of the third multiplexer 303 is connected with the output end of the partial sum register 304, the second input end can be set to 0, and the gating end SEL0 is connected with the central controller 1; the partial sum register 304 is used for temporarily storing the partial sum of the intermediate process of the single-channel one-dimensional convolution operation; a first input end of the fourth multiplexer 305 is connected with an output end of the output register 306, a second input end is connected with an output end of the partial sum register 304, and a gating end SEL1 is connected with the central controller 1; the input of the output register 306 is connected to the output of the fourth multiplexer 305, and the convolution result Psum (which is also a partial sum of the entire MAC layer) is stored in the output register 306 for output to the next calculation.
The detailed calculation process of the MAC is described below taking a 5 × 1 input feature map and a 3 × 1 convolution kernel as an example. This single-channel one-dimensional convolution is shown in fig. 6, where the convolution step size is 1, the convolution kernel is one channel of a convolution kernel, and the input feature map is one channel of the input feature map. In the first clock cycle after the convolution starts, the central controller 1 sets SEL0 = 1 and SEL1 = 0, and the MAC calculates psum0 = 0 + A1 × W1 and stores it in the partial sum register 304. In the second clock cycle, the central controller 1 sets SEL0 = 0 and SEL1 = 0, and the MAC calculates psum1 = psum0 + A2 × W2 and stores it in the partial sum register 304. In the third clock cycle, the central controller 1 sets SEL0 = 0 and SEL1 = 1; the MAC calculates psum2 = psum1 + A3 × W3 = P1, and the resulting single-channel convolution feature value P1 is updated into the output register 306 for output to the next component. The convolution kernel then slides by step 1 to compute the next feature value P2, so in the fourth clock cycle the central controller 1 sets SEL0 = 1 and SEL1 = 0, and the MAC calculates psum0 = 0 + A2 × W1; subsequent calculation proceeds in the same way until the feature value P2 is obtained, and the feature value P3 is calculated likewise, finally yielding the complete partial sum Psum. It can be seen that every K_size clock cycles the central controller 1 sets SEL0 to 1 to start the calculation of a feature value and then sets SEL0 to 0 to accumulate the partial sums, the partial sums to be accumulated being stored in the partial sum register 304. Likewise, every K_size clock cycles the central controller 1 sets SEL1 to 1 to update the feature value output by the partial sum register 304 into the output register 306 for further calculation, and then sets SEL1 to 0 to keep the value in the output register 306 unchanged.
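By way of illustration, the MAC datapath just described can be modeled cycle by cycle with the following C sketch (the struct, the function name mac_cycle, and the treatment of register outputs as the values latched at the previous clock edge are illustrative assumptions; signal names follow fig. 5):

```c
#include <stdint.h>

typedef struct {
    int32_t partial_sum;  /* partial sum register 304 */
    int32_t out_reg;      /* output register 306      */
} mac_t;

/* One clock edge of the MAC of fig. 5.  SEL0 = 1 restarts the
 * accumulation at 0 (first weight of a new feature value); SEL1 = 1
 * copies a completed feature value from the partial sum register into
 * the output register. */
static void mac_cycle(mac_t *m, int16_t a, int16_t w, int sel0, int sel1)
{
    int32_t psum_q = m->partial_sum;           /* registered output          */
    if (sel1)                                  /* fourth multiplexer 305     */
        m->out_reg = psum_q;
    m->partial_sum = (sel0 ? 0 : psum_q)       /* third multiplexer 303      */
                   + (int32_t)a * (int32_t)w;  /* multiplier 301 + adder 302 */
}
```

Driving sel0 = 1 on the first of every K_size cycles, and sel1 = 1 on the cycle in which the completed partial sum is visible at the register output, reproduces the sequence of feature values P1, P2, P3 of fig. 6.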
The convolution results of all MACs in each layer need to be added, usually together with an offset value, before the activation operation. Referring to figs. 7 and 9, the MAC array 3 in one embodiment further includes adder trees 31 and an activation operation unit 32; there is one adder tree 31 per MAC layer, and the adder tree 31 of each layer is connected to every MAC in that layer to accumulate the convolution results of all MACs in the layer into a convolution partial sum. For example, if a layer has three MACs whose convolution results over successive output positions are [P11, P12, P13], [P21, P22, P23], and [P31, P32, P33], the adder tree 31 accumulates them element-wise into the convolution partial sum [P11+P21+P31, P12+P22+P32, P13+P23+P33].
The on-chip RAM 2 also stores the offset values of the convolution kernels and the output feature map data. The adder tree 31 of each layer is further connected to the on-chip RAM 2, obtains from it the offset value of the convolution kernel corresponding to its layer, and adds the convolution partial sum to the offset value to obtain the output feature map data to be activated.
The activation operation unit 32 is connected to the adder tree 31 and the on-chip RAM 2, performs function activation processing on the output feature map data to be activated to obtain the output feature map data, and writes the output feature map data into the on-chip RAM 2 to await the next layer of operation. After activation, the to-be-activated data of each MAC layer becomes the output feature map data of one channel, so the Kout layers yield the output feature map data of Kout channels. The activation operation may use the ReLu function as the activation function.
Since the MAC array 3 can only process Kout convolution kernels and Xin channels of each kernel in one round of operation, blocked (grouped) computation is needed when there are more kernels or channels. Specifically, when the number of input feature map channels Chi is greater than Xin and/or the number of convolution kernels Cho is greater than Kout, the input feature map data are divided into Chi/Xin groups of Xin channels each, and the convolution kernel weight data are divided into Cho/Kout groups of Kout kernels each, where Chi and Cho are natural numbers greater than 0.
When the convolution operation is performed, the MAC array 3 reads each group of weight data from the on-chip RAM 2 in sequence; for each group of weight data, the MAC array 3 reads each group of input feature map data from the on-chip RAM 2 in sequence and convolves it with that group of weight data. In total (Cho/Kout) × (Chi/Xin) rounds of convolution are performed, completing the convolution between the input feature map data of all channels and all convolution kernels.
Referring to fig. 7, to support blocked computation, each adder tree 31 may also be connected to a FIFO memory 33 that temporarily stores convolution partial sums. In each round, the partial sum produced by the adder tree 31 is added to the partial sum stored in the FIFO memory 33 and written back, updating the FIFO contents for the next round. When one group of weight data has been convolved with all groups of input feature map data, the FIFO memory 33 holds the partial sums accumulated over all channels; the adder tree 31 then fetches the offset value of the convolution kernel corresponding to its layer from the on-chip RAM 2, adds it to the convolution partial sums in the FIFO memory 33 to obtain the output feature map data to be activated, and the FIFO memory 33 is cleared. The activation operation unit 32 processes these data into the output feature map data, after which the MAC array 3 continues with the next group of weight data in the same manner.
Take Kout = 4 and Xin = 4, and a convolutional layer with the following parameters: input feature map channels Chi = 8, length I_len, number of convolution kernels (i.e., output feature map channels) Cho = 16, convolution kernel size K_size, convolution step size stride = 1. The input feature map data are divided into 2 groups of 4 channels each, the convolution kernel weight data are divided into 4 groups of 4 kernels each, and the convolutional layer is computed in 8 rounds. Round 1 completes the calculation of the 1st group of convolution kernels with the 1st group of input feature map data, and the resulting convolution partial sums are stored in the FIFO memory 33; round 2 completes the calculation of the 1st group of convolution kernels with the 2nd group of input feature map data, at which point the FIFO memory 33 has accumulated the partial sums of all channels, and the activation operation unit 32 stores the output feature maps into the on-chip RAM 2. Because the group contains 4 convolution kernels, the output feature maps of 4 channels, namely channels 1 to 4, are obtained. Similarly, rounds 3 and 4 complete the calculation of the output feature maps of channels 5 to 8, and the remaining rounds proceed by analogy; after all 8 rounds, the output feature maps of all channels are stored in the on-chip RAM 2 and the convolutional layer operation is finished.
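The round scheduling of this example can be summarized by the following behavioral C sketch (the hook functions fifo_clear, conv_group, and add_bias_and_activate are hypothetical placeholders for the hardware operations described above, not a real software interface of the coprocessor):

```c
#include <stdio.h>

#define KOUT 4
#define XIN  4

/* Hypothetical placeholders for the hardware actions described above. */
static void fifo_clear(void) { /* empty FIFO 33 */ }
static void conv_group(int wg, int ig)
{
    printf("round: kernel group %d x input group %d\n", wg, ig);
}
static void add_bias_and_activate(int wg)
{
    printf("emit output channels %d to %d\n", wg * KOUT + 1, (wg + 1) * KOUT);
}

/* Blocked convolutional layer: (Cho/KOUT) x (Chi/XIN) rounds in total.
 * The FIFO 33 accumulates partial sums across input-channel groups;
 * bias addition and activation happen only after the last input group
 * of each weight group (8 rounds for Chi = 8, Cho = 16). */
void conv_layer_blocked(int Chi, int Cho)
{
    for (int wg = 0; wg < Cho / KOUT; wg++) {   /* weight groups */
        fifo_clear();
        for (int ig = 0; ig < Chi / XIN; ig++)  /* input groups  */
            conv_group(wg, ig);
        add_bias_and_activate(wg);
    }
}
```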
Because most one-dimensional signals involve relatively small input data volumes and convolutional layers with few parameters, the output feature map of a convolutional layer can be temporarily stored in its entirety in the on-chip RAM 2, without transferring data to an external memory. If the convolutional neural network coprocessor operates on 16-bit fixed-point numbers and the on-chip RAM 2 has a bandwidth of 64 bits, then under the above grouped calculation strategy the data arrangement of the 16-channel output feature map of the convolutional layer in the on-chip RAM 2 is as shown in fig. 8, where ofmap_cx denotes the x-th channel of the output feature map. The output feature maps of channels 1 to 4 occupy the first I_len - K_size + 1 cached 64-bit words, and the output feature maps of the remaining channels are likewise written into the on-chip RAM 2 with every 4 channels occupying I_len - K_size + 1 64-bit words. The output feature map temporarily stored in the on-chip RAM 2 can be sent directly to the next component for operation, for example directly to the pooling operation unit to complete the down-sampling operation, which avoids redundant reads and writes of an external memory, improves operation efficiency, and reduces the power consumed by memory accesses.
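Under the 16-bit / 64-bit assumptions above, the location of an output sample in the layout of fig. 8 can be modeled as follows (the helper names and zero-based channel numbering are illustrative):

```c
/* Location of output sample (channel ch, position y) under fig. 8's
 * layout: four 16-bit channels packed per 64-bit word, each group of
 * 4 channels occupying O_len = I_len - K_size + 1 consecutive words.
 * Channels are numbered from 0 here. */
static unsigned ofmap_word_index(unsigned ch, unsigned y, unsigned O_len)
{
    return (ch / 4) * O_len + y;   /* which 64-bit word             */
}

static unsigned ofmap_lane(unsigned ch)
{
    return ch % 4;                 /* which 16-bit lane in the word */
}
```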
Referring to figs. 9 and 10, the on-chip RAM 2 in one embodiment includes an Input RAM 21, a Weight RAM 22, a Bias RAM 23, and an Output RAM 24.
The input RAM 21 is connected to the MAC array 3, stores the input feature map data, and broadcasts it in parallel to all the MACs. Specifically, the input RAM 21 stores the input feature map data before each convolution operation and broadcasts it to all MACs during the convolution operation.
The weight RAM 22 is connected to the MAC array 3, stores the convolution kernel weight data, and broadcasts it to all the MACs in parallel. The weight RAM 22 includes Kout partitions, one corresponding to each MAC layer; each partition stores the weight data of all channels of one convolution kernel and broadcasts its stored weight data to the MACs of the corresponding layer.
Figs. 9 and 10 take Kout = 4 and Xin = 4 as an example, so the Weight RAM 22 includes 4 partitions, denoted Weight RAM[0], Weight RAM[1], Weight RAM[2], and Weight RAM[3]. As shown in fig. 10, if the convolutional neural network coprocessor operates on 16-bit fixed-point numbers, each partition has an output bandwidth of 64 bits; during the convolution operation, each partition broadcasts the weight data of 4 channels of a convolution kernel in parallel to the MACs of its layer, and each MAC receives the weight data of 1 channel. It will be appreciated that the input RAM 21 likewise has an output bandwidth of 64 bits and broadcasts the input feature map data of 4 channels in parallel to all MACs, with the MACs of the same column at different layers sharing the input feature map data of 1 channel.
The offset RAM23 is connected to the adder tree 31, and stores offset values of the convolution kernels.
The output RAM24 is connected to the activation arithmetic unit 32 for storing output characteristic map data.
Referring to fig. 9, the convolutional neural network coprocessor for one-dimensional convolution according to an embodiment of the present invention further includes a pooling operation unit 4. The pooling operation unit 4 is connected to the input RAM 21 and the output RAM 24, is responsible for the pooling operation of the pooling layer, and performs the pooling operation on the output feature map data stored in the output RAM 24, storing the pooled output feature map data in the input RAM 21 as the input feature map data of the next convolution or fully-connected layer operation. The pooling operation performed by the pooling operation unit 4 may be maximum pooling, average pooling, or the like.
Referring to fig. 11, the pooling operation unit 4 in this embodiment is mainly composed of comparators CMP and completes the maximum pooling operation.
The number C_num of comparators CMP in the pooling operation unit 4 can be chosen according to actual needs; for convenience of description, 4 comparators CMP are taken as an example. As shown in fig. 11, in one embodiment the signal ports of the comparator CMP include a clock terminal CLK, a control terminal CE, a gating terminal SEL, an input terminal A, and an output terminal Y, with the input and output 16 bits wide. Assuming the output bandwidth of the output RAM 24 is 4 × 16 bits, it can output 4 16-bit output feature map values from different channels in parallel, which are sent to the 4 16-bit comparators CMP to complete the maximum pooling operation; each comparator CMP pools the output feature map data of one channel. The number of comparisons required for the comparator CMP to produce one feature value equals the pooling size P_size. When SEL = 0, A is compared with 0 and the value of A is assigned to Y; when SEL = 1, A is compared with Y and the larger value is assigned to Y. In every P_size clock cycles, SEL = 0 in the first cycle and SEL = 1 in the subsequent cycles, so the comparator CMP outputs one feature value every P_size clock cycles, which is written into the input RAM 21.
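A minimal C sketch of one comparator cycle follows (the function name cmp_cycle is an illustrative assumption; note that the "compare A with 0" behavior on SEL = 0 presumes post-ReLu, non-negative data, in which case Y simply becomes A):

```c
#include <stdint.h>

/* One clock of a 16-bit comparator CMP of fig. 11.  SEL = 0 starts a
 * new pooling window; SEL = 1 keeps the running maximum.  One feature
 * value is produced every P_size cycles. */
static void cmp_cycle(int16_t a, int sel, int16_t *y)
{
    if (sel == 0)
        *y = (a > 0) ? a : 0;        /* compare A with 0, assign to Y */
    else
        *y = (a > *y) ? a : *y;      /* compare A with Y, keep larger */
}
```

Driving sel = 0 on the first of every P_size cycles and sel = 1 thereafter reproduces the maximum over each pooling window.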
During the pooling operation, the input RAM 21 serves as the output buffer of the pooling layer: the pooled output feature map is temporarily stored in the input RAM 21 as the input feature map of the next convolution or fully-connected layer operation. This likewise avoids redundant reads and writes of the external memory, improves operation efficiency, and reduces power consumption.
Similarly, for the pooling layer operation, if the number of channels of the output feature map stored in the output RAM 24 is greater than C_num, blocked operation is also needed: the output feature map is divided into Cho/C_num blocks and Cho/C_num rounds of pooling are performed, each round pooling C_num channels. The arrangement of the data in the input RAM 21 is similar to that in the output RAM 24; for example, if the output feature map data in the output RAM 24 are arranged as shown in fig. 8, the pooled output feature map data in the input RAM 21 are also arranged as in fig. 8, although the occupied storage space differs.
The MAC array of the invention can also accelerate the fully-connected layer operation of the 1D-CNN algorithm. The fully-connected layer operation usually requires flattening the pooled output feature map of the previous layer into one one-dimensional feature vector, which is then combined with the fully-connected layer weights by inner products. Performed literally, this is equivalent to a convolution over a single-channel input feature map, which wastes most of the output bandwidth of the on-chip RAM 2 and leaves most of the MACs idle. To solve this problem, the invention provides a block operation method for the fully-connected layer.
When the fully-connected layer operation is performed, the pooled input feature map stored in the input RAM 21 is first left unflattened, keeping its original storage arrangement; the weights of the fully-connected layer are stored in blocks, the number of blocks being equal to the number of channels of the previous layer's output feature map, i.e., the number of channels Chi of the input feature map data in the input RAM 21.
Specifically, the weight data corresponding to each output feature value of the fully-connected layer is divided into Chi blocks, and the fully-connected layer weight data are then treated as convolution kernel weight data: the weight data corresponding to each output feature value serves as the weight data of one convolution kernel, with each block serving as one channel of that kernel. If the fully-connected layer has K output feature values, the weight data of K convolution kernels is obtained and stored in the weight RAM 22, where K is a natural number greater than 0.
The MAC array 3 reads the input feature map data from the input RAM 21 and the weight data from the weight RAM 22, performs the convolution of the input feature map with the K convolution kernels to complete the fully-connected layer operation, and stores the resulting inference result in the output RAM 24.
Assume the output feature map of the pooling layer has 8 channels of length I_len. The weights corresponding to one output feature value are then stored in 8 blocks, each containing I_len weights, which is equivalent to cutting the weight vector into a convolution kernel of 8 channels; the fully-connected weights corresponding to the remaining output feature values are blocked in the same way. Fig. 12 shows the weight blocks of the first output feature value stored in Weight RAM[0], where fcw_cx denotes the x-th block; each block corresponds to one channel of the input feature map. After blocking, the calculation of the fully-connected layer is equivalent to the convolution of this kernel with the 8-channel input feature map of length I_len, an equivalence that can be expressed by the following formula:
    OUT[0] = Σ_{i=1}^{Chi × I_len} ifmap[i] × fcw[i] = Σ_{c=1}^{Chi} Σ_{i=1}^{I_len} ifmap_c[i] × fcw_c[i]
where OUT[0] is the first output feature value; ifmap[i] is the i-th feature value of the flattened input feature map and fcw[i] the i-th weight corresponding to the first output feature value; ifmap_c[i] is the i-th feature value of the c-th channel of the input feature map and fcw_c[i] the i-th weight of the c-th block of the weight data corresponding to the first output feature value; i and c are natural numbers greater than 0.
The difference from a convolutional layer is that the kernel length of this convolution equals the length of the input feature map, I_len; the SEL0 signal of the third multiplexer 303 therefore controls the MAC to accumulate the intermediate partial sums of the convolution I_len - 1 times before the convolution result is sent to the adder tree 31. If Kout = 4, then as with the blocked convolutional layer operation above, each output feature value requires two rounds of calculation: the first round completes the convolution of channels 1 to 4 and the second round that of channels 5 to 8. With this blocking strategy the whole MAC array 3 computes Kout output feature values in parallel, improving the efficiency of the fully-connected layer operation.
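The equivalence exploited here can be checked with a short C model (the array sizes and test data are illustrative; both loops compute the same first output feature value):

```c
#include <stdio.h>
#include <stdint.h>

#define CHI   8   /* channels of the pooled feature map          */
#define I_LEN 5   /* length of each channel (illustrative value) */

int main(void)
{
    static int16_t ifmap[CHI][I_LEN]; /* un-flattened, as kept in the input RAM */
    static int16_t fcw[CHI][I_LEN];   /* FC weights cut into CHI blocks         */
    for (int c = 0; c < CHI; c++)     /* arbitrary test data                    */
        for (int i = 0; i < I_LEN; i++) {
            ifmap[c][i] = (int16_t)(c + i);
            fcw[c][i]   = (int16_t)(c - i);
        }

    /* Flattened inner product: the textbook fully-connected layer. */
    long out_flat = 0;
    for (int c = 0; c < CHI; c++)
        for (int i = 0; i < I_LEN; i++)
            out_flat += (long)ifmap[c][i] * fcw[c][i];

    /* Blocked form: CHI single-channel convolutions with kernel length
     * I_LEN at their single valid position (one MAC each), summed by
     * the adder tree of the layer. */
    long out_conv = 0;
    for (int c = 0; c < CHI; c++) {
        long psum = 0;                       /* one MAC, I_LEN accumulations */
        for (int i = 0; i < I_LEN; i++)
            psum += (long)ifmap[c][i] * fcw[c][i];
        out_conv += psum;                    /* adder tree across the layer  */
    }
    printf("%ld == %ld\n", out_flat, out_conv);
    return 0;
}
```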
It should be noted that figs. 8 and 10 to 12 take as an example a convolutional neural network coprocessor that processes 16-bit data with an on-chip RAM 2 input/output bandwidth of 64 bits; this is not limiting, and other data widths and bandwidths may be set according to actual needs.
Referring to fig. 9, the convolutional neural network coprocessor for one-dimensional convolution according to an embodiment of the present invention further includes a direct memory access module (DMA) 5, a first multiplexer (Switch 1) 6, and a second multiplexer (Switch 2) 7. The pooling operation unit 4 includes an output terminal and an input terminal.
The direct memory access module 5 reads the input feature map data, weight data, and offset values from the external memory and writes them into the input RAM 21, weight RAM 22, and offset RAM 23, respectively; it also writes the inference results stored in the output RAM 24 back to the external memory.
The first multiplexer 6 comprises an input terminal, a first output terminal and a second output terminal, the input terminal is connected with the output RAM24, the first output terminal is connected with the direct memory access module 5, and the second output terminal is connected with the input terminal of the pooling arithmetic unit 4.
The second multiplexer 7 comprises a first input connected to the output of the pooling arithmetic unit 4, a second input connected to the direct memory access module 5, and an output connected to the input RAM 21.
The direct memory access module 5 may be connected via a data bus to the weight RAM22, the offset RAM23, a first output of the first multiplexer 6 and a second input of the second multiplexer 7.
When the convolutional neural network coprocessor reads the input characteristic diagram data from the external memory, the second input end of the second multiplexer 7 is conducted, so that the direct memory access module 5 writes the input characteristic diagram data into the input RAM 21; when the pooling operation is performed, the second output end of the first multiplexer 6 is conducted with the first input end of the second multiplexer 7, so that the pooling operation unit 4 reads the output characteristic diagram data in the output RAM24 to perform the pooling operation and stores the output characteristic diagram data in the input RAM 21; when the inference result is output, the first output terminal of the first multiplexer 6 is turned on, so that the direct memory access module 5 reads the inference result from the output RAM24 and writes it back to the external memory.
It can be seen that the input RAM 21 has two data sources: in the stage of reading input feature map data from the external memory, the source is the data bus connected to the direct memory access module 5, through which the direct memory access module 5 writes data into the input RAM 21; in the pooling layer operation stage, the source is the pooling operation unit 4. The data output by the output RAM 24 likewise has two destinations: in the pooling layer operation stage, data are output to the pooling operation unit 4; in the stage of outputting the inference result, data are output to the data bus connected to the direct memory access module 5, which writes the inference result back to the external memory. That is, during the pooling operation the roles of the input RAM 21 and the output RAM 24 are exchanged: the output RAM 24, which stores the output feature map of the convolutional layer operation, provides the input feature map for the pooling layer operation, and the output feature map of the pooling layer operation is stored in the input RAM 21. The input RAM 21 and output RAM 24 are thus fully utilized, the number of external memory accesses is reduced, operation efficiency is improved, and memory access power consumption is reduced.
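For illustration, the three routings just described can be captured in a small C sketch (the enum values and selector functions below are hypothetical names standing in for the control signals the central controller 1 applies to the first multiplexer 6 (Switch1) and the second multiplexer 7 (Switch2)):

```c
#include <stdio.h>

typedef enum { MODE_LOAD_IFMAP, MODE_POOLING, MODE_WRITE_RESULT } xfer_mode_t;

/* Hypothetical selector hooks standing in for the multiplexer controls. */
static void switch1_to_dma(void)  { printf("Switch1: output RAM 24 -> DMA bus\n"); }
static void switch1_to_pool(void) { printf("Switch1: output RAM 24 -> pooling unit 4\n"); }
static void switch2_to_dma(void)  { printf("Switch2: DMA bus -> input RAM 21\n"); }
static void switch2_to_pool(void) { printf("Switch2: pooling unit 4 -> input RAM 21\n"); }

static void set_dataflow(xfer_mode_t mode)
{
    switch (mode) {
    case MODE_LOAD_IFMAP:                           /* INPUT instruction  */
        switch2_to_dma();
        break;
    case MODE_POOLING:                              /* POOL instruction   */
        switch1_to_pool();
        switch2_to_pool();
        break;
    case MODE_WRITE_RESULT:                         /* OUTPUT instruction */
        switch1_to_dma();
        break;
    }
}
```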
Referring to fig. 9, the convolutional neural network coprocessor for one-dimensional convolution according to an embodiment of the present invention further includes an instruction memory (InstMem) 8 for storing operation instructions. An operation instruction comprises an operation code and operation parameters, the operation code identifying the instruction type; the instruction types include the input instruction, convolution operation instruction, pooling operation instruction, fully-connected layer operation instruction, and output instruction. The central controller 1 is also connected to the instruction memory 8, reads operation instructions from it, identifies the instruction type by the operation code, decodes the instruction, and performs the corresponding control operations according to the operation parameters.
To make the 1D-CNN algorithm highly flexible and configurable when run on the convolutional neural network coprocessor, the invention defines a set of operation instructions tailored to the characteristics of the algorithm, which precisely control the operation of each layer. Referring to fig. 13, the operation instructions according to an embodiment of the present invention include the INPUT instruction, the convolution operation instruction (CONV), the pooling operation instruction (POOL), the fully-connected layer operation instruction (FC), and the OUTPUT instruction. The operation code may be 3 bits long; the operation parameters may be of variable length, and each operation instruction may include several operation parameters, whose lengths and number differ with the instruction type. Each operation instruction is described below with reference to fig. 13.
The operation parameters of the input instruction include the start address D_addr of the input feature map data in the external memory, the input feature map length I_len, and the number of input feature map channels Chi. The operation parameters of the convolution operation instruction include the start address W_addr of the weight data in the external memory, the input feature map length I_len, the number of input feature map channels Chi, the convolution kernel length K_size, and the number of output feature map channels Cho. The operation parameters of the pooling operation instruction include the input feature map length I_len, the number of input feature map channels Chi, and the pooling size P_size. The operation parameters of the fully-connected layer operation instruction include the start address W_addr of the weight data in the external memory, the input feature map length I_len, and the output feature map length O_len. The operation parameters of the output instruction include the write address D_addr of the output feature map data in the external memory, the output feature map length O_len, and the number of output feature map channels Cho.
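These instruction formats can be sketched as C declarations (the binary opcode encodings and field widths below are illustrative assumptions; fig. 13 fixes the instruction set and the exact bit counts):

```c
#include <stdint.h>

/* 3-bit operation codes; the numeric encodings are assumed. */
typedef enum {
    OP_INPUT  = 0x0,   /* load input feature map from external memory */
    OP_CONV   = 0x1,   /* convolutional layer operation               */
    OP_POOL   = 0x2,   /* pooling layer operation                     */
    OP_FC     = 0x3,   /* fully-connected layer operation             */
    OP_OUTPUT = 0x4    /* write inference result back                 */
} opcode_t;

/* Operation parameters per instruction type (field widths assumed). */
typedef struct { uint32_t D_addr; uint16_t I_len; uint16_t Chi; } input_insn_t;
typedef struct { uint32_t W_addr; uint16_t I_len; uint16_t Chi;
                 uint16_t K_size; uint16_t Cho; }                  conv_insn_t;
typedef struct { uint16_t I_len; uint16_t Chi; uint16_t P_size; }  pool_insn_t;
typedef struct { uint32_t W_addr; uint16_t I_len; uint16_t O_len; } fc_insn_t;
typedef struct { uint32_t D_addr; uint16_t O_len; uint16_t Cho; }  output_insn_t;
```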
The 1D-CNN algorithm to be executed is first mapped into an instruction stream and a data stream composed of the above-described operation instructions, and then the main processor writes the instructions into the instruction memory 8. When the 1D-CNN algorithm is executed, the central controller 1 reads an operation instruction from the instruction memory 8, identifies the instruction type by an operation code, and decodes it.
When an input instruction is read, the central controller 1 controls the second input end of the second multiplexer 7 to be conducted, so that the input port of the input RAM 21 is connected to the data bus, and controls the direct memory access module 5 to read the input feature map data from the external memory and write it into the input RAM 21 according to the operation parameters of the input instruction.
When a convolution operation instruction is read, the central controller 1 controls the direct memory access module 5, according to the operation parameters of the convolution operation instruction, to read the weight data and offset values of the convolution kernels from the external memory and write them into the weight RAM 22 and the offset RAM 23 respectively, the operation parameters being sent to the direct memory access module 5 for the read. The central controller 1 then generates the data addresses of the weight RAM 22 and offset RAM 23 based on the operation parameters, and controls the MAC array 3 to read the input feature map data from the input RAM 21, the weight data from the weight RAM 22, and the offset values from the offset RAM 23, perform the convolution operation to obtain the output feature map data, and write the output feature map data into the output RAM 24.
When a pooling operation instruction is read, the central controller 1 controls the second output end of the first multiplexer 6 and the first input end of the second multiplexer 7 to be conducted, so that the input port of the input RAM 21 is connected to the output of the pooling operation unit 4 and the output port of the output RAM 24 is connected to the input of the pooling operation unit 4; it then controls the pooling operation unit 4, according to the operation parameters of the pooling operation instruction, to read the output feature map data in the output RAM 24, perform the pooling operation, and store the result in the input RAM 21.
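Functionally, pooling reads the previous layer's results from the output RAM 24 and writes the pooled data into the input RAM 21, so the two RAMs swap roles between layers. A behavioral sketch follows; max pooling is assumed here for concreteness, since the patent does not fix the pooling type.

```python
def pool_layer(output_ram, I_len, Chi, P_size):
    """Behavioral model of the pooling operation unit 4: read Chi channels
    from the output RAM, pool each one, and return the data that is written
    into the input RAM for the next layer."""
    next_input_ram = []
    for ch in range(Chi):
        channel = output_ram[ch][:I_len]
        pooled = [max(channel[i:i + P_size])      # max pooling assumed
                  for i in range(0, I_len - P_size + 1, P_size)]
        next_input_ram.append(pooled)
    return next_input_ram
```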
When a fully-connected layer operation instruction is read, the central controller 1 controls the direct memory access module 5 to read the weight data of the fully-connected layer from the external memory and write it into the weight RAM 22 according to the operation parameters of the fully-connected layer operation instruction; the operation parameters must be sent to the direct memory access module 5 for this read. The central controller 1 then generates the data address of the weight RAM 22 from the operation parameters of the fully-connected layer operation instruction and controls the MAC array 3 to read input feature map data from the input RAM 21 and weight data from the weight RAM 22, perform the convolution operation to obtain the inference result, and write the inference result into the output RAM 24.
When an output instruction is read, the central controller 1 controls the first output end of the first multiplexer 6 to be conducted, so that the output port of the output RAM 24 is connected to the data bus, and controls the direct memory access module 5 to read the inference result from the output RAM 24 and write it back to the external memory; the operation parameters must be sent to the direct memory access module 5 for this write.
It should be noted that the operation codes and operation-parameter bit widths shown in fig. 13 are only one embodiment and are not limiting; other widths may be chosen according to actual needs.
Referring to fig. 9, in an embodiment, the instruction memory 8 is connected to the main processor through an AXI-Lite bus, the 1D-CNN algorithm to be executed is first mapped into an instruction stream and a data stream formed by the operation instructions, and then the main processor writes the instructions into the instruction memory 8 through the AXI-Lite bus; the direct memory access module 5 is connected to the external memory via an AXI bus, and the direct memory access module 5 writes the inference result to a specified address of the external memory via the AXI bus.
According to the convolutional neural network coprocessor for one-dimensional convolution of the above embodiments, the MAC array comprises Kout × Xin MACs arranged in Kout layers and Xin columns. Each MAC receives the input feature map data of one channel and the weight data of one channel of a convolution kernel and performs the convolution of a single-channel one-dimensional convolution kernel with a one-dimensional input feature map; MACs in the same column receive the input feature map data of the same channel, each MAC layer corresponds to one convolution kernel, and each MAC receives the weight data of one channel of the convolution kernel corresponding to its layer. This achieves parallel computation across the Xin channels of the input feature map and across the Kout channels of the output feature map, while the input feature map data is shared among the MAC layers, which effectively accelerates the computation of the one-dimensional convolutional neural network, reduces the number of memory accesses, improves computational efficiency, and lowers the power consumption of memory reads and writes.
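The dataflow described above can be captured in a short reference model: for each output position, the MACs of one layer each convolve one input channel, and the layer's adder tree sums the Xin products into one output channel. A minimal Python sketch, assuming Chi equals Xin and Cho equals Kout (the blocked case is treated below), a valid convolution without padding, and a ReLU activation (the patent leaves the activation function open):

```python
def mac_array_conv(ifm, weights, biases):
    """Reference model of the Kout x Xin MAC array.
    ifm:     Xin channels, each a 1-D list (input feature map)
    weights: Kout kernels, each with Xin channels of length K_size
    biases:  one bias value per kernel (per output channel)"""
    Xin, Kout = len(ifm), len(weights)
    K_size = len(weights[0][0])
    O_len = len(ifm[0]) - K_size + 1
    ofm = []
    for k in range(Kout):                      # one MAC layer per kernel
        out_ch = []
        for pos in range(O_len):
            # each of the Xin MACs convolves its own channel; the layer's
            # adder tree sums the partial results and adds the bias
            acc = sum(
                sum(ifm[c][pos + i] * weights[k][c][i] for i in range(K_size))
                for c in range(Xin)
            )
            out_ch.append(max(acc + biases[k], 0))   # assumed ReLU
        ofm.append(out_ch)
    return ofm
```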
At the same time, the invention balances operation throughput against hardware cost: each MAC completes a single-channel one-dimensional convolution on its own, without parallelism across the weights within a convolution kernel, so it adapts to convolution kernels of different sizes. This avoids idle hardware resources when the kernel size is small and reduces the waste of hardware resources.
In one embodiment, for convolution operations with more input channels and more convolution kernels than the array provides, the MAC array completes the acceleration of the whole convolution in multiple rounds through partial-sum buffering and block-wise computation. The MAC array can therefore be kept small in hardware scale, larger algorithms can be completed with limited hardware, and hardware cost is reduced.
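This multi-round schedule amounts to two nested loops, over weight groups and over input-channel groups, with per-layer FIFOs holding the partial sums between rounds (compare claim 3). A simplified model follows, assuming Chi and Cho are exact multiples of Xin and Kout and that `conv_round` is a single-round MAC-array convolution returning raw partial sums (no bias, no activation); the function and parameter names are assumptions of this sketch.

```python
def blocked_conv(ifm_groups, weight_groups, bias_groups, conv_round):
    """ifm_groups:    Chi/Xin groups, each holding Xin input channels
    weight_groups: Cho/Kout groups, each holding Kout kernels
    bias_groups:   one list of Kout bias values per weight group
    conv_round:    one MAC-array pass -> Kout channels of partial sums"""
    ofm = []
    for kernels, biases in zip(weight_groups, bias_groups):
        fifo = None                              # partial sums between rounds
        for ifm in ifm_groups:                   # sweep all input groups
            partial = conv_round(ifm, kernels)
            fifo = partial if fifo is None else [
                [a + b for a, b in zip(p_ch, f_ch)]
                for p_ch, f_ch in zip(partial, fifo)
            ]
        # last input group done: add the biases, then the FIFO is cleared
        ofm.extend([x + b for x in ch] for ch, b in zip(fifo, biases))
    return ofm
```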
In one embodiment, the roles of the input RAM and the output RAM are interchanged during the pooling operation, which reduces the number of reads and writes to the external memory, improves computational efficiency, and lowers the power consumption of memory accesses.
In one embodiment, the weight data of the fully-connected layer is partitioned into blocks and the fully-connected operation is converted into a convolution computed in parallel on the MAC array, which accelerates the fully-connected layer and makes full use of the hardware resources.
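Concretely, the weight vector belonging to one output value of the fully-connected layer is split into Chi blocks, and each block becomes one channel of a "convolution kernel" whose length equals the per-channel feature map length, so one MAC-array pass produces one output value per kernel (compare claim 6). A sketch of this reshaping; the function name and list layout are assumptions of this sketch:

```python
def fc_weights_as_kernels(fc_weights, Chi):
    """fc_weights: K rows, one per output value of the fully-connected
    layer, each a flat list of length Chi * I_len.
    Returns K 'convolution kernels', each with Chi channels of length
    I_len, ready to be stored in the weight RAM partitions."""
    kernels = []
    for row in fc_weights:               # one row per output feature value
        block = len(row) // Chi          # per-channel kernel length (I_len)
        kernels.append([row[c * block:(c + 1) * block] for c in range(Chi)])
    return kernels
```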
The hardware architecture of the convolutional neural network coprocessor accelerates CNN algorithms in general, so it maintains good accuracy across different application scenarios. In addition, in some embodiments of the invention, the coprocessor accommodates the CNN algorithms of different application scenarios in two ways. First, configurable operation parameters are realized in the form of operation instructions: runtime parameters such as the convolution kernel size, the number of input feature map channels, and the pooling size can be adjusted by sending operation instructions to the coprocessor, so CNN algorithms with different parameters are supported. Second, larger-scale CNN algorithms can be completed through multi-round block computation. Together these greatly increase the flexibility and the range of applicable scenarios of the coprocessor.
The present invention has been described in terms of specific examples, which are provided to aid in understanding the invention and are not intended to be limiting. Numerous simple deductions, modifications or substitutions may also be made by those skilled in the art in light of the present teachings.

Claims (10)

1. A convolutional neural network coprocessor for one-dimensional convolution, comprising:
the central controller is used for generating data addresses and controlling the operation of each component of the convolutional neural network coprocessor;
the on-chip RAM is connected with the central controller and is used for storing input feature map data and the weight data of the convolution kernels;
the multiplication accumulator array is connected with the central controller and the on-chip RAM and is used for reading input feature map data and convolution kernel weight data from the on-chip RAM according to the data addresses and performing the convolution operation between the input feature map and the convolution kernels, wherein the multiplication accumulator array comprises Kout × Xin multiplication accumulators, Kout and Xin both being natural numbers greater than 0, Kout being the number of layers and Xin the number of columns, and each multiplication accumulator is used for performing the convolution of a single-channel one-dimensional convolution kernel with a one-dimensional input feature map;
when the convolution operation is performed, the on-chip RAM broadcasts the input feature map data and the convolution kernel weight data to all multiplication accumulators in parallel; each multiplication accumulator receives the input feature map data of one channel, and multiplication accumulators in the same column across different layers share the input feature map data of one channel; each layer of multiplication accumulators corresponds to one convolution kernel, and each multiplication accumulator receives the weight data of one channel of the convolution kernel corresponding to its layer.
2. The convolutional neural network coprocessor of claim 1, wherein the on-chip RAM is further used to store the bias values of the convolution kernels and the output feature map data;
the multiplication accumulator array also comprises an adder tree and an activation operation unit, wherein each layer of multiplication accumulators is provided with one adder tree, and the adder tree of each layer is connected with each multiplication accumulator of the layer and is used for accumulating convolution operation results of all the multiplication accumulators of the layer to obtain convolution partial sums;
the adder tree of each layer is also connected with the on-chip RAM and is used for obtaining the bias value of the convolution kernel corresponding to that layer from the on-chip RAM and adding the convolution partial sum and the bias value to obtain the output feature map data to be activated;
and the activation operation unit is connected with the adder trees and the on-chip RAM and is used for applying the activation function to the output feature map data to be activated to obtain the output feature map data and writing it into the on-chip RAM.
3. The convolutional neural network coprocessor of claim 2, wherein each adder tree is connected with a FIFO memory;
when the number of input feature map channels Chi is greater than Xin and/or the number of convolution kernels Cho is greater than Kout, the input feature map data is divided into Chi/Xin groups, each group comprising the input feature map data of Xin channels, and the convolution kernel weight data is divided into Cho/Kout groups, each group comprising the weight data of Kout convolution kernels, Chi and Cho both being natural numbers greater than 0;
the multiplication accumulator array reads each group of weight data from the on-chip RAM in turn and, for each group of weight data, reads each group of input feature map data from the on-chip RAM in turn and convolves it with that group of weight data, thereby completing the convolution operation between the input feature map data of all channels and all convolution kernels;
the FIFO memory is used for temporarily storing convolution part sums, wherein the convolution part sums obtained by the adder tree in each round of convolution operation are added with the convolution part sums stored by the FIFO memory, then the convolution part sums are written into the FIFO memory to update the convolution part sums in the FIFO memory, when one group of weight data finishes the convolution operation with all input feature map data groups, the adder tree obtains the offset value of the convolution kernel corresponding to the layer where the adder tree is located from the on-chip RAM, the convolution part sums in the FIFO memory and the offset value are added to obtain output feature map data to be activated, and then the FIFO memory is emptied.
4. The convolutional neural network coprocessor of claim 2 or 3, wherein the on-chip RAM comprises input RAM, weight RAM, bias RAM, and output RAM;
the input RAM is connected with the multiplication accumulator array and is used for storing input feature map data and broadcasting it to all multiplication accumulators in parallel;
the weight RAM is connected with the multiplication accumulator array and is used for storing convolution kernel weight data and broadcasting it to all multiplication accumulators in parallel, wherein the weight RAM comprises Kout partitions, one corresponding partition per layer of multiplication accumulators; each partition stores the weight data of one convolution kernel and broadcasts the stored weight data to the multiplication accumulators of its corresponding layer;
the bias RAM is connected with the adder tree and is used for storing bias values of convolution kernels;
the output RAM is connected with the activation operation unit and used for storing output characteristic diagram data.
5. The convolutional neural network coprocessor according to claim 4, further comprising a pooling operation unit connected with the input RAM and the output RAM, for pooling the output feature map data stored in the output RAM and storing the result in the input RAM as the input feature map data for the next convolution operation or fully-connected layer operation.
6. The convolutional neural network coprocessor of claim 5, wherein the weight RAM is further configured to store weight data for a fully-connected layer when performing fully-connected layer operations;
when the fully-connected layer operation is performed, the weight data corresponding to one output feature value of the fully-connected layer is divided into Chi blocks, where Chi is the number of channels of the input feature map data stored in the input RAM; the weight data corresponding to each output feature value serves as the weight data of one convolution kernel, each block serving as one channel of that kernel, and the resulting weight data of K convolution kernels is stored in the weight RAM, K being a natural number greater than 0 that denotes the number of output feature values of the fully-connected layer;
and the multiplication accumulator array reads the input feature map data from the input RAM and the weight data from the weight RAM and performs the convolution of the input feature map with the K convolution kernels, thereby completing the fully-connected layer operation, obtaining the inference result, and storing it in the output RAM.
7. The convolutional neural network coprocessor of claim 6, further comprising a direct memory access module, a first multiplexer, and a second multiplexer; the pooling operation unit comprises an output end and an input end;
the direct memory access module is used for reading input feature map data, weight data and bias values from an external memory and writing the input feature map data, the weight data and the bias values into the input RAM, the weight RAM and the bias RAM respectively, and is also used for writing the inference result stored in the output RAM back to the external memory;
the first multiplexer comprises an input end, a first output end and a second output end, the input end is connected with the output RAM, the first output end is connected with the direct memory access module, and the second output end is connected with the input end of the pooling operation unit;
the second multiplexer comprises a first input end, a second input end and an output end, the first input end is connected with the output end of the pooling operation unit, the second input end is connected with the direct memory access module, and the output end is connected with the input RAM;
when the input characteristic diagram data is read from the external memory, the second input end of the second multiplexer is conducted, so that the direct memory access module writes the input characteristic diagram data into the input RAM; when the pooling operation is carried out, the second output end of the first multiplexer is conducted with the first input end of the second multiplexer, so that the pooling operation unit reads the output characteristic diagram data in the output RAM to carry out the pooling operation and stores the pooled data in the input RAM; when the inference result is output, the first output end of the first multiplexer is conducted, so that the direct memory access module reads the inference result from the output RAM and writes the inference result back to the external memory.
8. The convolutional neural network coprocessor of claim 7, further comprising an instruction memory for storing operation instructions, the operation instructions comprising operation codes and operation parameters, the operation codes being used to identify the instruction type, the instruction types comprising input instructions, convolution operation instructions, pooling operation instructions, fully-connected layer operation instructions, and output instructions;
the central controller is used for reading the operation instruction from the instruction memory, identifying the type of the instruction through the operation code, decoding the instruction and carrying out corresponding control operation according to the operation parameters.
9. The convolutional neural network coprocessor of claim 8, wherein the operation parameters of the input instruction include a start address of the input feature map data in the external memory, the input feature map length, and the number of input feature map channels;
the operation parameters of the convolution operation instruction comprise a start address of the weight data in the external memory, the input feature map length, the number of input feature map channels, the convolution kernel length, and the number of output feature map channels;
the operation parameters of the pooling operation instruction comprise the length of an input feature map, the number of channels of the input feature map and the pooling size;
the operation parameters of the fully-connected layer operation instruction comprise a start address of the weight data in the external memory, the input feature map length, and the output feature map length;
the operation parameters of the output instruction comprise a write address of the output feature map data in the external memory, the output feature map length, and the number of output feature map channels;
when the input instruction is read, the central controller controls the second input end of the second multiplexer to be conducted, and controls the direct memory access module to read the input feature map data from the external memory and write it into the input RAM according to the operation parameters of the input instruction;
when the convolution operation instruction is read, the central controller controls the direct memory access module to read the weight data and bias values of the convolution kernels from the external memory according to the operation parameters of the convolution operation instruction and write them into the weight RAM and the bias RAM respectively, generates the data addresses of the weight RAM and the bias RAM according to the operation parameters of the convolution operation instruction, controls the multiplication accumulator array to read input feature map data from the input RAM, weight data from the weight RAM, and bias values from the bias RAM and perform the convolution operation to obtain the output feature map data, and writes the output feature map data into the output RAM;
when the pooling operation instruction is read, the central controller controls the second output end of the first multiplexer to be connected with the first input end of the second multiplexer, controls the pooling operation unit to read the output characteristic diagram data in the output RAM according to the operation parameters of the pooling operation instruction to perform pooling operation, and stores the pooled data in the input RAM;
when the fully-connected layer operation instruction is read, the central controller controls the direct memory access module to read the weight data of the fully-connected layer from the external memory and write it into the weight RAM according to the operation parameters of the fully-connected layer operation instruction, generates the data address of the weight RAM according to the operation parameters of the fully-connected layer operation instruction, controls the multiplication accumulator array to read input feature map data from the input RAM and weight data from the weight RAM and perform the convolution operation to obtain the inference result, and writes the inference result into the output RAM;
when the output instruction is read, the central controller controls the first output end of the first multiplexer to be conducted, and controls the direct memory access module to read the inference result from the output RAM and write it back to the external memory.
10. The convolutional neural network coprocessor of claim 9, wherein the instruction memory is connected to a main processor through an AXI-Lite bus, the main processor writing instructions into the instruction memory through the AXI-Lite bus; the direct memory access module is connected with an external memory through an AXI bus, and the direct memory access module writes the inference result to a specified address of the external memory through the AXI bus.
CN202211077216.9A 2022-09-05 2022-09-05 Convolution neural network coprocessor for one-dimensional convolution Pending CN115759213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211077216.9A CN115759213A (en) 2022-09-05 2022-09-05 Convolution neural network coprocessor for one-dimensional convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211077216.9A CN115759213A (en) 2022-09-05 2022-09-05 Convolution neural network coprocessor for one-dimensional convolution

Publications (1)

Publication Number Publication Date
CN115759213A true CN115759213A (en) 2023-03-07

Family

ID=85349593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211077216.9A Pending CN115759213A (en) 2022-09-05 2022-09-05 Convolution neural network coprocessor for one-dimensional convolution

Country Status (1)

Country Link
CN (1) CN115759213A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150784A (en) * 2022-12-30 2023-05-23 上海物骐微电子有限公司 Neural network safety protection method, system, accelerator and chip
CN116150784B (en) * 2022-12-30 2023-09-05 上海物骐微电子有限公司 Neural network safety protection method, system, accelerator and chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination