CN110991630A - Convolutional neural network processor for edge computing - Google Patents

Convolutional neural network processor for edge computing

Info

Publication number
CN110991630A
Authority
CN
China
Prior art keywords
input
unit
bit
data
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911091701.XA
Other languages
Chinese (zh)
Inventor
郭炜
王宇吉
魏继增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201911091701.XA
Publication of CN110991630A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A simple set of control instructions for convolutional neural networks is provided that can realize the basic operations of a convolutional neural network, such as convolutional layers, pooling layers, the ReLU activation function, and fully connected layers; by combining and ordering these instructions, the accelerator is applicable to a wide range of convolutional neural networks. The accelerator implements an efficient systolic array that maximizes data reuse and reduces the number of accesses the systolic array makes to the data cache units while reading data. The accelerator supports computation at two data precisions, 16-bit and 8-bit fixed point, enabling mixed-precision computation across different layers of the same network and greatly reducing power consumption. The accelerator minimizes the number of data transfers in the cache, reducing power consumption, while the convolutional-layer and fully-connected-layer computations share the same systolic array, greatly improving resource utilization.

Description

Convolutional neural network processor for edge computing
Technical Field
The invention relates to convolutional neural network processors, and more particularly to a convolutional neural network processor oriented toward edge computing.
Background
In recent years, cloud computing has become an increasingly mainstream trend. Cloud computing lets a company store and process data (and perform other computing tasks) over a remote network of servers, colloquially referred to as "the cloud," rather than only on its own physical hardware. The integration and centralization of cloud computing have proven cost-effective and flexible, but the rise of the Internet of Things and mobile computing has placed considerable pressure on network bandwidth. Ultimately, not all smart devices need to rely on cloud computing; in some cases, the round-trip transmission of data should be minimized or avoided. Meanwhile, with the explosive development of the IoT (Internet of Things) and the spread of wireless networks, the number of network edge devices (mobile terminals, sensors, etc.) and the data they generate are growing rapidly; according to an IDC prediction, the total amount of global data would exceed 40 ZB (zettabytes) by 2020, with 45% of the data generated by the Internet of Things processed at the edge of the network. Under these conditions, a centralized processing model built around cloud computing cannot efficiently handle the data generated by edge devices. In the context of the interconnection of everything, the centralized cloud computing model (traditional cloud computing) has four main shortcomings: poor real-time performance, insufficient bandwidth, high energy consumption, and risks to data security and privacy [2-4]. To address these problems, the edge computing model has emerged for processing the massive data generated by edge devices. Edge computing is a new computing model that performs computation at the edge of the network; the data it operates on include downlink data from cloud services and uplink data from ubiquitously interconnected services, and the "edge" refers to any computing and network resource along the path between the data sources and the cloud computing center. The edge computing model has three distinct advantages: a large amount of temporary data is processed at the edge of the network rather than uploaded to the cloud, greatly reducing the pressure on network bandwidth and data center power consumption; data are processed close to their producers, greatly reducing system latency and shortening service response time; and edge computing no longer uploads users' private data, protecting the security and privacy of user data.
With the advent of the edge computing era, Artificial Intelligence (AI) is gradually migrating from the cloud to edge devices. Consider autonomous vehicles, which require large numbers of sensors to collect data: a 2016 Intel report estimated that a single autonomous vehicle generates 4 TB of data per day. For safe and reliable operation, an autopilot system must react to its surroundings immediately; in this scenario any delay may be fatal, and given that a round trip to a central server can take several seconds, this data cannot all be uploaded to a cloud processor and must instead be stored and computed on in edge compute nodes. As another example, the rapid development of computer vision challenges the image sensor to process intelligently: the image sensor is no longer merely an image acquisition device, but needs to make local, real-time, intelligent decisions for tasks such as classification, recognition, and tracking. Today, AI is ubiquitous, from consumer to enterprise applications. With the explosive growth of connected devices, coupled with demands for privacy and confidentiality, low latency, and limited bandwidth, AI models trained in the cloud are increasingly required to run at the edge. Among AI models, deep learning models, with their multiple layers of representation, have been widely used in image recognition, speech recognition, natural language processing, and intelligent management, and these application fields are also the most common applications on edge devices.
The convolutional neural network is an efficient recognition method developed in recent years that has attracted great attention. Its distinctive network structure can effectively reduce the complexity of feedback neural networks; in the field of pattern classification in particular, a convolutional neural network avoids complex image preprocessing and can take raw images directly as input, which is why it has been so widely applied. In general, the basic structure of a convolutional neural network includes two kinds of layers. One is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the features of that local receptive field are extracted; once a local feature is extracted, its positional relation to other features is also determined. The other is the feature mapping layer: each computational layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on the plane share equal weights. The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, giving the feature maps shift invariance.
Disclosure of Invention
The technical problem the invention aims to solve is to provide an edge-computing-oriented convolutional neural network processor that can greatly improve resource utilization.
The technical scheme adopted by the invention is as follows: an edge-computing-oriented convolutional neural network processor, comprising:
the instruction cache unit, which stores the instruction set for the convolutional neural network;
the decoder, which reads an instruction from the instruction cache unit, reads parameters of the length corresponding to the instruction type, and outputs a start signal of the corresponding type according to the parameter type;
the control unit, which receives the start signal output by the decoder and, according to the type of the start signal, activates the control subunit of the corresponding type to output control signals;
the input feature map cache unit, which stores the input feature map data required during convolution calculation and outputs corresponding input feature map data according to the control signal received from the control unit;
the intermediate result cache unit, which stores intermediate data generated during calculation and reads and outputs corresponding intermediate data according to the control signal received from the control unit;
the weight cache unit, which stores the weight data required during convolution calculation and outputs corresponding weight data according to the control signal received from the control unit;
and the processing unit, which, according to the control signal received from the control unit, receives input feature map data from the input feature map cache unit, intermediate data from the intermediate result cache unit, and weight data from the weight cache unit, processes them accordingly, stores the processed input feature map data in the input feature map cache unit, and stores the processed intermediate data in the intermediate result cache unit.
To support a variety of conventional convolutional neural networks, the present invention provides an edge-computing-oriented convolutional neural network processor together with a simple set of control instructions that can realize the basic operations of a convolutional neural network, such as convolutional layers, pooling layers, the ReLU activation function, and fully connected layers; by combining and ordering these instructions, the accelerator is applicable to a wide range of convolutional neural networks. The accelerator implements an efficient systolic array that maximizes data reuse and reduces the number of accesses the systolic array makes to the data cache units while reading data. The accelerator supports computation at two data precisions, 16-bit and 8-bit fixed point, enabling mixed-precision computation across different layers of the same network and greatly reducing power consumption. The accelerator minimizes the number of data transfers in the cache, reducing power consumption, while the convolutional-layer and fully-connected-layer computations share the same systolic array, greatly improving resource utilization.
Drawings
FIG. 1 is a block diagram of the overall architecture of the edge-computing-oriented convolutional neural network processor of the present invention;
FIG. 2 is a block diagram of the overall composition of the systolic array of the present invention;
FIG. 3 is a block diagram of the overall structure of a multiply-add computing unit in the systolic array;
FIG. 4 is a block diagram of the overall structure of the activation function calculation unit of the present invention;
FIG. 5 is a block diagram of the overall structure of the pooling calculation unit of the present invention.
In the drawings:
1: instruction cache unit; 2: decoder;
3: control unit; 3.1: input feature map control subunit;
3.2: weight control subunit; 3.3: systolic array control subunit;
3.4: activation function control subunit; 3.5: pooling control subunit;
3.6: intermediate result control subunit; 4: input feature map cache unit;
5: intermediate result cache unit; 6: weight cache unit;
7: processing unit; 7.1: shift register group;
7.1.1: input feature map shift register group; 7.1.2: weight shift register group;
7.2: systolic array; 7.2.1: multiply-add computing unit;
7.3: activation function calculation unit; 7.3.1: comparator;
7.3.2: zero-value register; 7.3.3: multiplexer;
7.4: pooling calculation unit; 7.4.1: comparator;
7.4.2: result register; 7.4.3: multiplier;
7.4.4: adder; 7.4.5: mean constant register
Detailed Description
An edge-computation-oriented convolutional neural network processor according to the present invention is described in detail below with reference to the embodiments and the drawings.
In this architecture: 1) a simple instruction set architecture is provided that covers the common computation patterns in a convolutional neural network, namely convolution, pooling, fully connected computation, and activation functions, so that a convolutional neural network is executed through a sequence of instructions; 2) a systolic array is designed that supports both convolutional-layer and fully-connected-layer computation; 3) a pooling calculation unit is designed to perform the pooling-layer computation; 4) an activation function calculation unit is designed to compute the activation functions of the convolutional neural network.
As shown in fig. 1, the edge-computing-oriented convolutional neural network processor of the present invention comprises: an instruction cache unit 1, a decoder 2, a control unit 3, an input feature map cache unit 4, an intermediate result cache unit 5, a weight cache unit 6, and a processing unit 7. Specifically:
the instruction cache unit 1 is used for storing an instruction set facing the convolutional neural network.
The convolutional neural network includes 3 types of common layer structures including convolutional layers, pooling layers and full-link layers, and an activation function operation, in order for the accelerator of the present invention to support all convolutional neural networks, it is necessary to store a set of instruction sets facing the convolutional neural network in the instruction cache unit 1, where the instruction sets facing the convolutional neural network stored in the instruction cache unit 1 include:
convolution calculation instruction parameters, including an instruction type parameter, an input feature map size parameter, a weight size parameter, a padding size parameter, a stride size parameter, a parameter giving the number of multiply-add units used, an input feature map cache unit base address parameter, and an intermediate result cache unit base address parameter; the first 6 parameters are 8 bits and the last 2 are 16 bits;
fully connected calculation instruction parameters, including an instruction type parameter, an input feature map size parameter, a weight size parameter, a parameter giving the number of multiply-add units used, an input feature map cache unit base address parameter, and an intermediate result cache unit base address parameter; the first 4 parameters are 8 bits and the last 2 are 16 bits;
pooling calculation instruction parameters, including an instruction type parameter, a pooling type parameter, an input feature map size parameter, a weight size parameter, a stride size parameter, an input feature map cache unit base address parameter, and an intermediate result cache unit base address parameter; the first 5 parameters are 8 bits and the last 2 are 16 bits.
The convolutional-neural-network-oriented instruction set is stored in the instruction cache unit 1 in binary format. Because the accelerator is used only for the feedforward pass of the neural network, its instructions do not change once determined: the instruction cache unit is never updated while the accelerator runs and is only read. The instruction cache unit 1 is a 64 KB, 8-bit-wide static random access memory (SRAM) with an 8-bit input interface, an 8-bit output interface, a 16-bit address interface, a 1-bit clock signal, a 1-bit read-write signal, and a 1-bit start signal.
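As an illustration only, the layout above can be expressed as a short Python sketch that packs one convolution instruction into the byte-wide instruction cache; the opcode values and the little-endian byte order here are our own assumptions, not specified by the patent:

    import struct

    # Hypothetical opcode values; the patent does not specify the encodings.
    OP_CONV, OP_FC, OP_POOL = 0x01, 0x02, 0x03

    def pack_conv_instruction(n, k, p, s, num_mac, ifmap_base, result_base):
        """Pack a convolution instruction: six 8-bit parameters (instruction
        type, input feature map size, weight size, padding, stride, number of
        multiply-add units used) followed by two 16-bit base addresses."""
        return struct.pack("<6B2H", OP_CONV, n, k, p, s, num_mac,
                           ifmap_base, result_base)

    # Example: 28x28 input, 5x5 weights, padding 0, stride 1, 25 MAC units.
    word = pack_conv_instruction(28, 5, 0, 1, 25, 0x0000, 0x1000)
    assert len(word) == 10  # 6 x 1 byte + 2 x 2 bytes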
The decoder 2 reads instructions from the instruction cache unit 1, reads parameters of the length corresponding to each instruction type, and outputs a start signal of the corresponding type according to the parameter type.
The decoder 2 implements instruction fetching. When operation begins, the decoder 2 first reads an instruction from the instruction cache unit, reads parameters of the length corresponding to the instruction type, and issues a start signal to the control subunit for that instruction in the control unit 3 so that the subunit begins operating. When the corresponding operation completes, the decoder 2 receives an end signal, reads the next instruction and its parameters from the instruction cache unit 1, and repeats the process until all instructions have been executed.
The decoder 2 takes the 8-bit output of the instruction cache unit 1 as input. The decoder 2 outputs: six 1-bit signals, namely the input feature map control subunit start signal, the weight control subunit start signal, the systolic array control subunit start signal, the activation function control subunit start signal, the pooling control subunit start signal, and the intermediate result control subunit start signal; six 8-bit values, namely the input feature map size, the weight size, the padding size, the stride size, the number of multiply-add units used, and the pooling type; and two 16-bit values, namely the input feature map cache unit base address and the intermediate result cache unit base address.
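Behaviorally, the fetch/dispatch loop can be modeled as in the sketch below; this is not RTL, and it reuses the hypothetical opcodes assumed earlier, with total instruction lengths following the parameter counts given above (convolution 10 bytes, fully connected 8, pooling 9):

    # Total instruction length in bytes per (hypothetical) opcode, counting
    # the instruction type byte plus the two 16-bit base addresses.
    INSTR_BYTES = {0x01: 6 + 4, 0x02: 4 + 4, 0x03: 5 + 4}

    def decode_loop(icache, start_subunit, wait_for_end_signal):
        pc = 0
        while pc < len(icache):
            opcode = icache[pc]
            if opcode not in INSTR_BYTES:
                break                      # no further valid instructions
            params = icache[pc + 1:pc + INSTR_BYTES[opcode]]
            start_subunit(opcode, params)  # raise the matching start signal
            wait_for_end_signal()          # block until the subunit finishes
            pc += INSTR_BYTES[opcode]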
The control unit 3 receives the start signals output by the decoder 2 and, according to the type of each start signal, activates the control subunit of the corresponding type to output control signals. The control unit 3 comprises:
(1) The input feature map control subunit 3.1 receives the input feature map control subunit start signal from the decoder 2, controls the read/write order of data in the input feature map cache unit 4, and outputs a 1-bit enable signal, a 1-bit read-write signal, and a 16-bit input feature map address signal to the input feature map cache unit 4.
An 8-bit row position register and an 8-bit column position register are provided inside the input feature map control subunit 3.1. To make full use of data reusability, the input feature map control subunit 3.1 reads the input feature map in row-major order. The input feature map address is computed as:
Addr_c = Addr_a + Row × N + Col (1)
wherein Addr_c is the input feature map address; Addr_a is the input feature map cache unit base address from the instruction parameters in the instruction cache unit 1; N is the input feature map size from the instruction parameters in the instruction cache unit 1; Row is the row position register; Col is the column position register;
(2) The weight control subunit 3.2 receives the weight control subunit start signal from the decoder 2 and outputs a 1-bit enable signal, a 1-bit read-write signal, and a 16-bit weight address to the weight cache unit 6, controlling the order in which data are read from the weight cache unit 6; while the start signal is asserted, the weight address is incremented by 1 each cycle;
(3) The systolic array control subunit 3.3 receives the systolic array control subunit start signal from the decoder 2 and outputs three 1-bit signals, namely the input feature map shift register group start signal, the weight shift register group start signal, and the systolic array start signal, which control the data-shifting process of the shift register group and the systolic computation process in the processing unit 7. The systolic array control subunit 3.3 first uses the weight shift register group start signal to make the shift register group 7.1 receive the weight data and shift it, cycle by cycle, into the weight register of each multiply-add computing unit. It then uses the input feature map shift register group start signal to make the shift register group 7.1 receive the input feature map data. Finally, it starts the systolic array 7.2 with the systolic array start signal.
(4) The activation function control subunit 3.4 receives the activation function control subunit start signal from the decoder 2 and outputs a 1-bit signal that controls the activation function calculation unit in the processing unit 7.
(5) The pooling control subunit 3.5 receives the pooling control subunit start signal from the decoder 2 and outputs a 1-bit signal that controls the pooling calculation unit in the processing unit 7;
(6) The intermediate result control subunit 3.6 receives the intermediate result control subunit start signal from the decoder 2 and outputs a 1-bit enable signal, a 1-bit read-write signal, and a 16-bit intermediate result address to the intermediate result cache unit 5, controlling the read/write order of data in the intermediate result cache unit 5. An 8-bit row position register and an 8-bit column position register are provided inside the intermediate result control subunit 3.6. The intermediate result address is computed as:
Addr_w = Addr_i + Row × M + Col (2)
M = (N + 2 × P - K)/S + 1 (3)
wherein Addr_w is the intermediate result address; Addr_i is the intermediate result base address; M is the output feature map size given by equation (3); N is the input feature map size from the instruction parameters; K is the weight size from the instruction parameters; P is the padding size from the instruction parameters; S is the stride size from the instruction parameters; Row is the row position register; Col is the column position register.
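The address arithmetic of equations (1) through (3) can be sanity-checked with a short sketch (row-major layout as described; treating the division in equation (3) as exact is our own assumption):

    def ifmap_addr(addr_a, row, col, n):
        """Equation (1): address of input feature map element (row, col)."""
        return addr_a + row * n + col

    def output_size(n, p, k, s):
        """Equation (3): output feature map size M."""
        return (n + 2 * p - k) // s + 1

    def intermediate_addr(addr_i, row, col, m):
        """Equation (2): address of intermediate result element (row, col)."""
        return addr_i + row * m + col

    # Example: 28x28 input, 5x5 weights, no padding, stride 1 -> M = 24.
    m = output_size(28, 0, 5, 1)
    assert m == 24
    assert intermediate_addr(0x1000, 1, 2, m) == 0x1000 + 1 * 24 + 2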
The input feature map cache unit 4 stores the input feature map data required during convolution calculation and outputs corresponding input feature map data according to the control signal received from the control unit 3.
During operation, the input feature map cache unit 4 outputs the required input feature map data to the shift register group 7.1 in the processing unit 7 and stores the calculation results of the activation function calculation unit 7.3 in the processing unit 7. Because the on-chip cache size limits the capacity of the input feature map cache unit, the input feature map cache unit 4 is connected to an external storage unit 9 through a data bus 8 so that data can be exchanged when needed; that is, the input feature map data required for calculation are transferred over the data bus 8 into the input feature map cache unit 4 before subsequent calculation. The input feature map cache unit 4 therefore retains both data input and output interfaces. In the invention, the input feature map cache unit 4 is a 64 KB, 8-bit-wide static random access memory (SRAM), and its interface comprises an 8-bit input interface, an 8-bit output interface, a 16-bit address interface, a 1-bit clock signal interface, a 1-bit read-write signal interface, and a 1-bit start signal interface.
The intermediate result cache unit 5 stores the intermediate data generated during calculation and reads and outputs corresponding intermediate data according to the control signal received from the control unit 3.
During operation, the intermediate result cache unit 5 feeds the data required for pooling to the pooling calculation unit 7.4 in the processing unit 7 and stores the calculation results of the activation function calculation unit 7.3 in the processing unit 7. Because the on-chip cache size limits the capacity of the intermediate result cache unit, the intermediate result cache unit 5 is connected to the external storage unit 9 through the data bus 8 so that data can be exchanged when needed, transferring the intermediate results required for calculation into the intermediate result cache unit 5 for the next computation. The intermediate result cache unit 5 retains both data input and output interfaces; it is a 64 KB, 8-bit-wide static random access memory (SRAM) whose interface comprises an 8-bit input interface, an 8-bit output interface, a 16-bit address interface, a 1-bit clock signal interface, a 1-bit read-write signal interface, and a 1-bit start signal interface.
The weight cache unit 6 stores the weight data required during convolution calculation and outputs corresponding weight data according to the control signal received from the control unit 3.
During operation, the weight cache unit 6 feeds the required weight data to the shift register group 7.1 in the processing unit 7; it does not need to accept data updates while running and only outputs the weight data at the corresponding addresses. However, because the capacity of the weight cache unit 6 is limited, it is connected to the external storage unit 9 through the data bus 8 so that data can be exchanged when needed, transferring the weight data required for calculation into the weight cache unit 6 before the subsequent computation. The weight cache unit retains both data input and output interfaces. The weight cache unit 6 is a 64 KB, 8-bit-wide static random access memory (SRAM) whose interface comprises an 8-bit input interface, an 8-bit output interface, a 16-bit address interface, a 1-bit clock signal interface, a 1-bit read-write signal interface, and a 1-bit start signal interface.
The processing unit 7, according to the control signal received from the control unit 3, receives input feature map data from the input feature map cache unit 4, intermediate data from the intermediate result cache unit 5, and weight data from the weight cache unit 6, processes them accordingly, stores the processed input feature map data in the input feature map cache unit 4, and stores the processed intermediate data in the intermediate result cache unit 5. The processing unit 7 comprises:
(1) The shift register group 7.1 performs serial input and parallel output of input feature map data and weight data according to the control signals of the systolic array control subunit 3.3 in the control unit 3. The shift register group 7.1 comprises an input feature map shift register group 7.1.1 with an 8-bit input and a 200-bit output, and a weight shift register group 7.1.2 with an 8-bit input and a 48-bit output.
The input feature map shift register group 7.1.1 receives one 8-bit input value each cycle; after 25 cycles, all the data received over those 25 cycles, 200 bits in total, are output together in the following cycle. The weight shift register group 7.1.2 receives one 8-bit input value each cycle; after 6 cycles, all the data received over those 6 cycles, 48 bits in total, are output together in the following cycle.
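A behavioral sketch of this serial-in, parallel-out operation follows (one Python "cycle" per shifted byte; register widths and reset behavior are simplified):

    class ShiftRegisterGroup:
        """Serial-in / parallel-out: one 8-bit value enters per cycle; after
        `depth` cycles the whole word (depth x 8 bits) is output at once."""
        def __init__(self, depth):
            self.depth = depth      # 25 for the ifmap group, 6 for weights
            self.regs = []

        def shift_in(self, byte):
            self.regs.append(byte & 0xFF)
            if len(self.regs) == self.depth:
                word, self.regs = self.regs, []
                return word         # parallel output on the following cycle
            return None             # still filling

    ifmap_group = ShiftRegisterGroup(25)    # 25 x 8 = 200-bit output
    weight_group = ShiftRegisterGroup(6)    # 6 x 8 = 48-bit output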
(2) The systolic array 7.2, according to the control signals of the systolic array control subunit 3.3 in the control unit 3, receives the input feature map output by the input feature map shift register group in the shift register group 7.1 and the weight data output by the weight shift register group, and performs the inner-product computation. As shown in fig. 2, the systolic array 7.2 consists of 150 multiply-add computing units 7.2.1 arranged as a 6 × 25 matrix: 6 rows of 25 units each. The multiply-add computing units 7.2.1 of each row are connected in series in sequence; the input of the first multiply-add computing unit 7.2.1 of each row is connected to the weight shift register group 7.1.2, and the last multiply-add computing unit 7.2.1 is connected to the activation function calculation unit 7.3. The multiply-add computing units 7.2.1 of each column are likewise connected in series in sequence, and the input of the first multiply-add computing unit 7.2.1 of each column is connected to the input feature map shift register group 7.1.1. The systolic array 7.2 computes a 5 × 5 weight window in parallel, improving the accelerator's throughput.
The systolic array 7.2 is a computing structure in which data streams flow synchronously through adjacent elements of a two-dimensional array. For convolution, the systolic array requires the input feature map and the convolution weights to be unrolled from two-dimensional (or higher-dimensional) arrays into several one-dimensional arrays; each one-dimensional array unrolled from the input feature map represents the input window corresponding to one sliding position of the weight window during convolution. In the systolic array, the unrolled one-dimensional weight array is fixed in advance in the registers of the corresponding computing units; the unrolled one-dimensional input feature map arrays flow downward in sequence, and each intermediate result flows from left to right. Once all data of the corresponding one-dimensional array have been multiplied and accumulated, the final result is the convolution result for that convolution window.
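The unrolling and accumulate-left-to-right dataflow can be illustrated with a functional (not cycle-accurate) model; the im2col-style helper, the square single-channel input, and the unit-stride default below are our own simplifications:

    import numpy as np

    def unroll_windows(ifmap, k, stride=1):
        """Expand the 2-D input into one 1-D array per sliding position of
        the k x k weight window, as described in the text."""
        n = ifmap.shape[0]
        m = (n - k) // stride + 1
        return np.stack([ifmap[r:r + k, c:c + k].ravel()
                         for r in range(0, m * stride, stride)
                         for c in range(0, m * stride, stride)])

    def systolic_conv(ifmap, weights, stride=1):
        """Weights stay fixed in the PEs; each unrolled window flows through,
        and partial sums accumulate from left to right until a full window
        has been multiplied and accumulated."""
        k = weights.shape[0]
        windows = unroll_windows(ifmap, k, stride)
        w = weights.ravel()
        acc = np.zeros(len(windows))
        for pe in range(k * k):            # one PE per weight element
            acc += windows[:, pe] * w[pe]  # partial sums flow rightward
        m = int(np.sqrt(len(windows)))
        return acc.reshape(m, m)

    # Example: 6x6 input, 5x5 kernel -> 2x2 output feature map.
    out = systolic_conv(np.arange(36.0).reshape(6, 6), np.ones((5, 5)))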
As shown in fig. 3, each multiply-add computing unit 7.2.1 comprises an 8-bit input feature map register a, a 16-bit intermediate result register b, an 8-bit weight register c, a multiplier d, and an adder e. The input of the input feature map register a is connected to the output of the input feature map register a in the previous multiply-add computing unit 7.2.1 of the same column; the input of the weight register c is connected to the output of the weight register c in the previous multiply-add computing unit 7.2.1 of the same row; and the input of the intermediate result register b is connected to the output of the intermediate result register b in the previous multiply-add computing unit 7.2.1 of the same row. The data output by the input feature map register a is multiplied by the data output by the weight register c in the multiplier d, added in the adder e to the data output by the intermediate result register b, and the sum is output to the intermediate result register b of the next multiply-add computing unit 7.2.1 in the same row. The output of the input feature map register a is further connected to the input of the input feature map register a in the next multiply-add computing unit 7.2.1 of the same column, and the output of the weight register c is further connected to the input of the weight register c in the next multiply-add computing unit 7.2.1 of the same row.
(3) The activation function calculation unit 7.3 implements the ReLU activation function of the convolutional neural network and comprises a comparator 7.3.1, a 16-bit zero-value register 7.3.2, and a multiplexer 7.3.3. One input of the comparator 7.3.1 is connected to the outputs of the systolic array 7.2 and the pooling calculation unit 7.4, and the other input is connected to the output of the zero-value register 7.3.2. The inputs of the multiplexer 7.3.3 are connected to the outputs of the systolic array 7.2 and the pooling calculation unit 7.4, the output of the zero-value register 7.3.2, and the output of the comparator 7.3.1; the output of the multiplexer 7.3.3 is connected to the inputs of the input feature map cache unit 4 and the intermediate result cache unit 5. The control signal inputs of the comparator 7.3.1 and the multiplexer 7.3.3 are connected to the activation function control subunit 3.4 in the control unit 3, and the ReLU computation of the convolutional neural network is performed according to the control signal of the activation function control subunit 3.4. During computation, the comparator 7.3.1 judges whether the incoming data is less than zero; if so, the multiplexer 7.3.3 outputs the value of the zero-value register 7.3.2; otherwise, the multiplexer 7.3.3 passes the incoming data through directly to the input feature map cache unit 4 or the intermediate result cache unit 5;
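Functionally, this comparator-plus-multiplexer datapath reduces to the following sketch (integer data assumed for simplicity):

    ZERO = 0  # contents of the 16-bit zero-value register 7.3.2

    def relu_unit(x):
        """Comparator 7.3.1 checks the input against the zero-value register;
        multiplexer 7.3.3 selects the register if the input is negative,
        otherwise it passes the input through unchanged."""
        is_negative = x < ZERO
        return ZERO if is_negative else x

    assert relu_unit(-7) == 0 and relu_unit(42) == 42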
(4) The pooling calculation unit 7.4 performs mean pooling and max pooling; as shown in fig. 5, it comprises a comparator 7.4.1, a multiplier 7.4.3, an adder 7.4.4, a 32-bit result register 7.4.2 that holds intermediate results, and a 16-bit mean constant register 7.4.5 that holds the mean constant. The control signal input of the result register 7.4.2 is connected to the pooling control subunit 3.5 in the control unit 3. One input of the comparator 7.4.1 is connected to the output of the intermediate result cache unit 5 and the other to one output of the result register 7.4.2; the output of the comparator 7.4.1 is connected to one input of the result register 7.4.2. The multiplier 7.4.3 is connected to the output of the intermediate result cache unit 5 and the output of the mean constant register 7.4.5; the inputs of the adder 7.4.4 are connected to the output of the multiplier 7.4.3 and to one output of the result register 7.4.2, and the other input of the result register 7.4.2 is connected to the output of the adder 7.4.4. The final output of the result register 7.4.2 is connected to the input of the activation function calculation unit 7.3. The result register 7.4.2 performs mean accumulation or maximum comparison according to the pooling type parameter in the pooling instruction handled by the pooling control subunit 3.5. If the pooling type is max pooling, the comparator 7.4.1 compares the output of the result register 7.4.2 with the data input from the intermediate result cache unit 5, stores the maximum in the result register 7.4.2, and the result register 7.4.2 outputs it to the activation function calculation unit 7.3. If the pooling type is mean pooling, the data input from the intermediate result cache unit 5 is multiplied in the multiplier 7.4.3 by the value of the mean constant register 7.4.5, then added in the adder 7.4.4 to the value output by the result register 7.4.2; the sum is stored in the result register 7.4.2, which outputs it to the activation function calculation unit 7.3.
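A functional sketch of the two pooling modes over one window follows; the window is presented as a flat list, and the assumption that the mean constant is precomputed as 1/window-size is our own:

    def pool_window(values, pool_type, mean_constant=None):
        """Max pooling: comparator 7.4.1 keeps the running maximum in result
        register 7.4.2. Mean pooling: each input is scaled by the mean
        constant (register 7.4.5) in multiplier 7.4.3 and accumulated into
        result register 7.4.2 by adder 7.4.4."""
        if pool_type == "max":
            result = values[0]
            for v in values[1:]:
                result = v if v > result else result
        else:  # mean pooling
            if mean_constant is None:
                mean_constant = 1.0 / len(values)
            result = 0.0
            for v in values:
                result += v * mean_constant
        return result  # forwarded to the activation function unit 7.3

    assert pool_window([1, 5, 3, 2], "max") == 5
    assert pool_window([2, 2, 2, 2], "mean") == 2.0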

Claims (10)

1. An edge-computing-oriented convolutional neural network processor, comprising:
an instruction cache unit (1), which stores the instruction set for the convolutional neural network;
a decoder (2), which reads an instruction from the instruction cache unit (1), reads parameters of the length corresponding to the instruction type, and outputs a start signal of the corresponding type according to the parameter type;
a control unit (3), which receives the start signal output by the decoder (2) and, according to the type of the start signal, activates the control subunit of the corresponding type to output control signals;
an input feature map cache unit (4), which stores the input feature map data required during convolution calculation and outputs corresponding input feature map data according to the control signal received from the control unit (3);
an intermediate result cache unit (5), which stores intermediate data generated during calculation and reads and outputs corresponding intermediate data according to the control signal received from the control unit (3);
a weight cache unit (6), which stores the weight data required during convolution calculation and outputs corresponding weight data according to the control signal received from the control unit (3);
and a processing unit (7), which, according to the control signal received from the control unit (3), receives input feature map data from the input feature map cache unit (4), intermediate data from the intermediate result cache unit (5), and weight data from the weight cache unit (6), processes them accordingly, stores the processed input feature map data in the input feature map cache unit (4), and stores the processed intermediate data in the intermediate result cache unit (5).
2. The edge-computing-oriented convolutional neural network processor of claim 1, wherein the convolutional-neural-network-oriented instruction set stored in the instruction cache unit (1) comprises:
convolution calculation instruction parameters, including an instruction type parameter, an input feature map size parameter, a weight size parameter, a padding size parameter, a stride size parameter, a parameter giving the number of multiply-add units used, an input feature map cache unit base address parameter, and an intermediate result cache unit base address parameter; the first 6 parameters are 8 bits and the last 2 are 16 bits;
fully connected calculation instruction parameters, including an instruction type parameter, an input feature map size parameter, a weight size parameter, a parameter giving the number of multiply-add units used, an input feature map cache unit base address parameter, and an intermediate result cache unit base address parameter; the first 4 parameters are 8 bits and the last 2 are 16 bits;
and pooling calculation instruction parameters, including an instruction type parameter, a pooling type parameter, an input feature map size parameter, a weight size parameter, a stride size parameter, an input feature map cache unit base address parameter, and an intermediate result cache unit base address parameter; the first 5 parameters are 8 bits and the last 2 are 16 bits.
3. The edge-computing-oriented convolutional neural network processor of claim 2, wherein the convolutional-neural-network-oriented instruction set is stored in the instruction cache unit (1) in binary format; the instruction cache unit (1) is a 64 KB, 8-bit-wide static random access memory with an 8-bit input interface, an 8-bit output interface, a 16-bit address interface, a 1-bit clock signal, a 1-bit read-write signal, and a 1-bit start signal.
4. The edge-computing-oriented convolutional neural network processor of claim 1, wherein, when the decoder (2) begins operating, the decoder (2) first reads an instruction from the instruction cache unit, reads parameters of the length corresponding to the instruction type, and issues a start signal to the control subunit for that instruction in the control unit (3); when the corresponding operation completes, the decoder (2) receives an end signal and reads the next instruction and its parameters from the instruction cache unit (1), until all instructions have been executed.
5. The edge-computing-oriented convolutional neural network processor of claim 1, wherein the control unit (3) comprises:
an input feature map control subunit (3.1), which receives the input feature map control subunit start signal from the decoder (2), controls the read/write order of data in the input feature map cache unit (4), and outputs a 1-bit enable signal, a 1-bit read-write signal, and a 16-bit input feature map address signal to the input feature map cache unit (4);
the calculation formula of the input feature map address is as follows:
Addr_c = Addr_a + Row × N + Col (1)
wherein Addr_c is the input feature map address; Addr_a is the input feature map cache unit base address from the instruction parameters in the instruction cache unit (1); N is the input feature map size from the instruction parameters in the instruction cache unit (1); Row is the row position register; Col is the column position register;
a weight control subunit (3.2), which receives the weight control subunit start signal from the decoder (2) and outputs a 1-bit enable signal, a 1-bit read-write signal, and a 16-bit weight address to the weight cache unit (6), controlling the order in which data are read from the weight cache unit (6), the weight address being incremented by 1 each cycle while the start signal is asserted;
a systolic array control subunit (3.3), which receives the systolic array control subunit start signal from the decoder (2) and outputs three 1-bit signals, namely an input feature map shift register group start signal, a weight shift register group start signal, and a systolic array start signal, which control the data-shifting process and the systolic computation process in the processing unit (7);
an activation function control subunit (3.4), which receives the activation function control subunit start signal from the decoder (2) and outputs a 1-bit signal controlling the activation function computation in the processing unit (7);
a pooling control subunit (3.5), which receives the pooling control subunit start signal from the decoder (2) and outputs a 1-bit signal controlling the pooling computation in the processing unit (7);
and an intermediate result control subunit (3.6), which receives the intermediate result control subunit start signal from the decoder (2) and controls the read/write order of data in the intermediate result cache unit (5); the intermediate result address of the intermediate result control subunit (3.6) is computed as:
Addr_w = Addr_i + Row × M + Col (2)
M = (N + 2 × P - K)/S + 1 (3)
wherein Addr_w is the intermediate result address; Addr_i is the intermediate result base address; M is the output feature map size given by equation (3); N is the input feature map size from the instruction parameters; K is the weight size from the instruction parameters; P is the padding size from the instruction parameters; S is the stride size from the instruction parameters; Row is the row position register; Col is the column position register.
6. The edge-computing-oriented convolutional neural network processor of claim 1, wherein, during operation, the input feature map cache unit (4) outputs the required input feature map data to the processing unit (7) and stores the activation-function calculation results from the processing unit (7); the input feature map cache unit (4) is also connected to an external storage unit (9) through a data bus (8) for data exchange.
7. The edge-computing-oriented convolutional neural network processor of claim 1, wherein, during operation, the intermediate result cache unit (5) feeds the data required for pooling to the processing unit (7) and stores the activation-function calculation results from the processing unit (7); the intermediate result cache unit (5) is also connected to the external storage unit (9) through the data bus (8) for data exchange, the intermediate results required for calculation being transferred into the intermediate result cache unit (5); the intermediate result cache unit (5) is a 64 KB, 8-bit-wide static random access memory (SRAM) whose interface comprises an 8-bit input interface, an 8-bit output interface, a 16-bit address interface, a 1-bit clock signal interface, a 1-bit read-write signal interface, and a 1-bit start signal interface.
8. The edge-computing-oriented convolutional neural network processor of claim 1, wherein the weight cache unit (6) feeds the required weight data to the processing unit (7); the weight cache unit (6) is also connected to the external storage unit (9) through the data bus (8) for data exchange; the weight cache unit (6) is a 64 KB, 8-bit-wide static random access memory whose interface comprises an 8-bit input interface, an 8-bit output interface, a 16-bit address interface, a 1-bit clock signal interface, a 1-bit read-write signal interface, and a 1-bit start signal interface.
9. The edge-computing-oriented convolutional neural network processor of claim 1, wherein the processing unit (7) comprises:
(1) a shift register group (7.1), which performs serial input and parallel output of input feature map data and weight data according to the control signals of the systolic array control subunit (3.3) in the control unit (3); the shift register group (7.1) comprises an input feature map shift register group (7.1.1) with an 8-bit input and a 200-bit output, and a weight shift register group (7.1.2) with an 8-bit input and a 48-bit output;
(2) a systolic array (7.2), which, according to the control signals of the systolic array control subunit (3.3) in the control unit (3), receives the input feature map output by the input feature map shift register group in the shift register group (7.1) and the weight data output by the weight shift register group, and performs the inner-product computation; the systolic array (7.2) consists of 150 multiply-add computing units (7.2.1) arranged as a 6 × 25 matrix, 6 rows of 25 units each; the multiply-add computing units (7.2.1) of each row are connected in series in sequence, the input of the first multiply-add computing unit (7.2.1) of each row is connected to the weight shift register group (7.1.2), and the last multiply-add computing unit (7.2.1) is connected to the activation function calculation unit (7.3); the multiply-add computing units (7.2.1) of each column are connected in series in sequence, and the input of the first multiply-add computing unit (7.2.1) of each column is connected to the input feature map shift register group (7.1.1);
(3) an activation function calculation unit (7.3), which implements the ReLU activation function of the convolutional neural network and comprises: a comparator (7.3.1), a 16-bit zero-value register (7.3.2), and a multiplexer (7.3.3), wherein one input of the comparator (7.3.1) is connected to the outputs of the systolic array (7.2) and the pooling calculation unit (7.4), and the other input is connected to the output of the zero-value register (7.3.2); the inputs of the multiplexer (7.3.3) are connected to the outputs of the systolic array (7.2) and the pooling calculation unit (7.4), the output of the zero-value register (7.3.2), and the output of the comparator (7.3.1); the output of the multiplexer (7.3.3) is connected to the inputs of the input feature map cache unit (4) and the intermediate result cache unit (5); the control signal inputs of the comparator (7.3.1) and the multiplexer (7.3.3) are connected to the activation function control subunit (3.4) in the control unit (3), and the ReLU computation of the convolutional neural network is performed according to the control signal of the activation function control subunit (3.4); during computation, the comparator (7.3.1) judges whether the incoming data is less than zero; if so, the multiplexer (7.3.3) outputs the value of the zero-value register (7.3.2); otherwise, the multiplexer (7.3.3) passes the incoming data through directly to the input feature map cache unit (4) or the intermediate result cache unit (5);
(4) a pooling calculation unit (7.4), which performs mean pooling and max pooling and comprises a comparator (7.4.1), a multiplier (7.4.3), an adder (7.4.4), a 32-bit result register (7.4.2) that holds intermediate results, and a 16-bit mean constant register (7.4.5) that holds the mean constant; wherein the control signal input of the result register (7.4.2) is connected to the pooling control subunit (3.5) in the control unit (3); one input of the comparator (7.4.1) is connected to the output of the intermediate result cache unit (5) and the other input to one output of the result register (7.4.2); the output of the comparator (7.4.1) is connected to one input of the result register (7.4.2); the multiplier (7.4.3) is connected to the output of the intermediate result cache unit (5) and the output of the mean constant register (7.4.5); the inputs of the adder (7.4.4) are connected to the output of the multiplier (7.4.3) and to one output of the result register (7.4.2); the other input of the result register (7.4.2) is connected to the output of the adder (7.4.4); and the final output of the result register (7.4.2) is connected to the input of the activation function calculation unit (7.3); the result register (7.4.2) performs mean accumulation or maximum comparison according to the pooling type parameter in the pooling instruction handled by the pooling control subunit (3.5).
10. The edge-computing-oriented convolutional neural network processor of claim 9, wherein each multiply-add computing unit (7.2.1) comprises an 8-bit input feature map register (a), a 16-bit intermediate result register (b), an 8-bit weight register (c), a multiplier (d), and an adder (e), wherein the input of the input feature map register (a) is connected to the output of the input feature map register (a) in the previous multiply-add computing unit (7.2.1) of the same column; the input of the weight register (c) is connected to the output of the weight register (c) in the previous multiply-add computing unit (7.2.1) of the same row; the input of the intermediate result register (b) is connected to the output of the intermediate result register (b) in the previous multiply-add computing unit (7.2.1) of the same row; the data output by the input feature map register (a) and the data output by the weight register (c) are multiplied in the multiplier (d), added in the adder (e), and output to the intermediate result register (b) of the next multiply-add computing unit (7.2.1); the output of the input feature map register (a) is further connected to the input of the input feature map register (a) in the next multiply-add computing unit (7.2.1) of the same column, and the output of the weight register (c) is further connected to the input of the weight register (c) in the next multiply-add computing unit (7.2.1) of the same row.
CN201911091701.XA 2019-11-10 2019-11-10 Convolutional neural network processor for edge computing Pending CN110991630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911091701.XA CN110991630A (en) 2019-11-10 2019-11-10 Convolutional neural network processor for edge computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911091701.XA CN110991630A (en) 2019-11-10 2019-11-10 Convolutional neural network processor for edge computing

Publications (1)

Publication Number Publication Date
CN110991630A (en) 2020-04-10

Family

ID=70083761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911091701.XA Pending CN110991630A (en) 2019-11-10 2019-11-10 Convolutional neural network processor for edge computing

Country Status (1)

Country Link
CN (1) CN110991630A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726625A (en) * 2020-06-17 2020-09-29 苏州甫腾智能科技有限公司 Intelligent visual perception processing method and device and intelligent visual perception Internet of things terminal
CN112100118A (en) * 2020-08-05 2020-12-18 中科驭数(北京)科技有限公司 Neural network computing method, device and storage medium
CN112257843A (en) * 2020-09-23 2021-01-22 浙江大学 System for expanding instruction set based on MobileNetV1 network inference task
CN112329910A (en) * 2020-10-09 2021-02-05 东南大学 Deep convolutional neural network compression method for structure pruning combined quantization
CN113065032A (en) * 2021-03-19 2021-07-02 内蒙古工业大学 Self-adaptive data sensing and collaborative caching method for edge network ensemble learning
CN113420788A (en) * 2020-10-12 2021-09-21 黑芝麻智能科技(上海)有限公司 Integer-based fusion convolution layer in convolutional neural network and fusion convolution method
TWI823408B (en) * 2022-05-27 2023-11-21 國立成功大學 Mechanical device cloud control system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3153996A2 (en) * 2015-10-07 2017-04-12 Altera Corporation Method and apparatus for implementing layers on a convolutional neural network accelerator
US20180307976A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with separate computation units
CN108805274A (en) * 2018-05-28 2018-11-13 重庆大学 The hardware-accelerated method and system of Tiny-yolo convolutional neural networks based on FPGA
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LU Ye; CHEN Yao; LI Tao; CAI Ruichu; GONG Xiaoli: "Embedded FPGA Convolutional Neural Network Construction Method for Edge Computing" *
XIAO Hao; ZHU Yongxin; WANG Ning; TIAN Li; WANG Hui: "Design of an FPGA Hardware Accelerator for Convolutional Neural Networks" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726625A (en) * 2020-06-17 2020-09-29 苏州甫腾智能科技有限公司 Intelligent visual perception processing method and device and intelligent visual perception Internet of things terminal
CN112100118A (en) * 2020-08-05 2020-12-18 中科驭数(北京)科技有限公司 Neural network computing method, device and storage medium
CN112257843A (en) * 2020-09-23 2021-01-22 浙江大学 System for expanding instruction set based on MobileNetV1 network inference task
CN112257843B (en) * 2020-09-23 2022-06-28 浙江大学 System for expanding instruction set based on MobileNet V1 network inference task
CN112329910A (en) * 2020-10-09 2021-02-05 东南大学 Deep convolutional neural network compression method for structure pruning combined quantization
CN113420788A (en) * 2020-10-12 2021-09-21 黑芝麻智能科技(上海)有限公司 Integer-based fusion convolution layer in convolutional neural network and fusion convolution method
CN113065032A (en) * 2021-03-19 2021-07-02 内蒙古工业大学 Self-adaptive data sensing and collaborative caching method for edge network ensemble learning
TWI823408B (en) * 2022-05-27 2023-11-21 國立成功大學 Mechanical device cloud control system

Similar Documents

Publication Publication Date Title
CN110991630A (en) Convolutional neural network processor for edge calculation
US11341399B2 (en) Reducing power consumption in a neural network processor by skipping processing operations
US20230418610A1 (en) Deep vision processor
CN109104876B (en) Arithmetic device and related product
CN110050267B (en) System and method for data management
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
US11775430B1 (en) Memory access for multiple circuit components
WO2020073211A1 (en) Operation accelerator, processing method, and related device
CN112183718B (en) Deep learning training method and device for computing equipment
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN112163601B (en) Image classification method, system, computer device and storage medium
CN111897579A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
CN117501245A (en) Neural network model training method and device, and data processing method and device
CN111860773A (en) Processing apparatus and method for information processing
CN111582465A (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN112966729A (en) Data processing method and device, computer equipment and storage medium
CN110458285B (en) Data processing method, data processing device, computer equipment and storage medium
CN116185937B (en) Binary operation memory access optimization method and device based on multi-layer interconnection architecture of many-core processor
Guo et al. A high-efficiency fpga-based accelerator for binarized neural network
CN112766475B (en) Processing component and artificial intelligence processor
CN113128688B (en) General AI parallel reasoning acceleration structure and reasoning equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20231208)