CN111191774A - Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof - Google Patents

Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof

Info

Publication number
CN111191774A
CN111191774A
Authority
CN
China
Prior art keywords
data
unit
vector
weight
accumulation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811355622.0A
Other languages
Chinese (zh)
Other versions
CN111191774B (en)
Inventor
党韩兵
刘文庭
刘学彦
尹东
詹进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fullhan Microelectronics Co ltd
Original Assignee
Shanghai Fullhan Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fullhan Microelectronics Co ltd filed Critical Shanghai Fullhan Microelectronics Co ltd
Priority to CN201811355622.0A priority Critical patent/CN111191774B/en
Publication of CN111191774A publication Critical patent/CN111191774A/en
Application granted granted Critical
Publication of CN111191774B publication Critical patent/CN111191774B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a simplified convolutional neural network-oriented low-cost accelerator architecture and a processing method thereof, wherein the accelerator architecture comprises: a data and weight tensor storage unit, a data reading unit, a data vector storage unit, a data vector reading unit, a data vector register unit, a weight reading unit, m groups of weight vector storage units, m groups of weight vector reading units, m groups of weight vector register units, and m groups of vector operation units.

Description

Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
Technical Field
The invention relates to the technical field of calculation and artificial intelligence, in particular to a simplified convolutional neural network-oriented low-cost accelerator architecture and a processing method thereof.
Background
In recent years, great progress has been made on problems such as image object recognition, speech recognition, and natural language processing, and these advances have benefited from the development of deep learning techniques. Neural network structures, as the basis of deep learning, are continuously evolving, including the convolutional neural networks that are most important in the field of image recognition. The main motivation behind this evolution was to improve performance, which produced a series of "general convolutional neural networks" such as AlexNet, ZFNet, VGGNet, GoogLeNet and ResNet. These networks improve performance mainly by increasing the number of layers, while the core computational structure changes little, which makes them relatively friendly to hardware accelerator design; conventional convolutional neural network accelerators are designed mainly with reference to such network structures and usually take the 3 × 3 × N 3D convolution as the basic operation unit.
However, for some terminal devices with limited resources, the above general convolutional neural networks are too large to deploy, so network lightweighting became another direction of evolution, and another series of "simplified convolutional neural networks" appeared, for example: SqueezeNet, MobileNet, ShuffleNet, DenseNet, MobileNetV2, MnasNet, etc. Compared with general convolutional neural networks, these simplified network structures are characterized by heterogeneity and fragmentation: the structural differences between networks grow, more lightweight components appear, and the computation patterns of those components differ more widely. For example, the main structure of MobileNetV2 consists of 1 × 1 × N convolutions (accounting for more than 97% of the computation) and depthwise separable convolutions. As a result, existing neural network accelerators process such simplified convolutional neural networks with low efficiency or excessive power consumption.
Disclosure of Invention
In order to overcome the above defects in the prior art, the present invention provides a simplified convolutional neural network-oriented low-cost accelerator architecture and a processing method thereof, so as to solve the problem that conventional neural network accelerators are inefficient or consume excessive power when processing simplified convolutional neural networks.
To achieve the above object, the present invention provides a simplified convolutional neural network oriented low-cost accelerator architecture, which includes:
the data and weight tensor storage unit is used for storing an input data tensor, an input weight tensor and an output data tensor;
the data reading unit is used for reading the data tensor from the data and weight tensor storage unit, segmenting and vectorizing the read tensor according to a preset rule, and storing the tensor into the data vector storage unit;
the data vector storage unit is used for storing a batch of data vectors;
the data vector reading unit is used for reading data vectors from the data vector storage unit according to a preset rule and writing the data vectors into the data vector register unit;
the data vector register unit is used for registering 1 or more data vectors;
the weight reading unit is used for reading the weight tensor from the data and weight tensor storage unit, segmenting and vectorizing the read tensor according to a preset rule and storing the segmented and vectorized tensor to the corresponding weight vector storage unit;
m groups of weight vector storage units, wherein each group of weight units respectively stores a batch of weight vectors;
m sets of weight vector reading units, wherein each set of weight vector reading unit reads parameter vectors from corresponding weight vector storage units according to preset rules and writes the read weight vectors into corresponding weight vector register units;
m groups of weight vector register units, wherein each group of weight vector register units is used for registering 1 or more weight vectors;
and the m groups of vector operation units are used for acquiring corresponding data and weight vectors from the data vector register units and the weight vector register units for processing, and storing the processing results into the data and weight tensor storage unit, with each group of vector operation units processing a pair of data and weight vectors every clock cycle; the m groups of vector operation units share the data vectors, while each group exclusively owns its weight vectors.
Preferably, the data vector reading unit and the vector operation unit adopt a handshake synchronization mechanism to coordinate the reading and operation processing of the data vector.
Preferably, the m sets of vector operation units share the data vectors and each exclusively owns its own weight vectors.
Preferably, the weight reading unit and the vector operation unit coordinate the reading and operation of the weight vector by adopting a handshake synchronization mechanism.
Preferably, the vector operation unit includes:
the vector multiplication and accumulation operation unit is used for reading a pair of data vectors and weight vectors in each period and carrying out multiplication and accumulation operation on the data vectors and the weight vectors to obtain a multiplication and accumulation operation result sequence Rv;
the grouping and accumulation arithmetic unit is used for receiving the output result sequence Rv of the vector multiplication and accumulation processing unit, grouping the received multiplication and accumulation result sequences according to a preset rule and accumulating the grouped multiplication and accumulation result sequences together to obtain an accumulation result Rg;
the activation pooling operation unit is used for performing activation pooling on the output sequence Rg of the grouping accumulation operation unit, and the processed output result is recorded as Ra;
and the grouping accumulation storage unit is used for storing the operation result of the grouping accumulation operation unit.
Preferably, the packet accumulation operation unit includes:
a data receiving unit for receiving the result Rv of the vector multiply-accumulate operation unit;
the historical accumulation result judging and processing unit is used for judging whether the received grouping to which the current Rv belongs has a historical accumulation result or not, if so, reading the corresponding historical accumulation result, accumulating the historical accumulation result and the current Rv, recording the accumulated result as Rg, and starting the accumulation completion judging and processing unit; if the group to which the current Rv belongs does not have a historical accumulation result, directly taking the current Rv as an accumulation result, recording the accumulated result as Rg, and starting an accumulation completion judgment processing unit;
the affiliated grouping accumulation completion judging and processing unit is used for judging whether the current Rv is the last member of the affiliated grouping, if the current Rv is the last member, the affiliated grouping final accumulation is completed, the accumulated value Rg is output to the activated pooling operation unit for subsequent processing, then the processing unit enters the all-grouping accumulation judgment and processing unit, if the current Rv is not the last member of the affiliated grouping, the affiliated grouping final accumulation is not completed, the current accumulated result Rg is stored in the grouping accumulation storage unit to be used as the historical accumulation result of the affiliated grouping, and then the processing unit enters the all-grouping accumulation judgment and processing unit;
and the all-packet accumulation judging and processing unit is used for judging whether all the packet accumulation processing is finished, if so, finishing the current flow, and if not, returning to the data receiving unit to continue the packet accumulation processing.
Preferably, the active pooling operation unit includes:
the batch standardization and activation processing unit is used for carrying out batch standardization and activation processing on the grouping accumulation result;
the judging unit is used for judging whether pooling processing or depthwise separable convolution processing is to be performed; if depthwise separable convolution processing is selected, the depthwise separable convolution processing unit is started, and if pooling processing is selected, the pooling processing unit is started;
the depthwise separable convolution processing unit is used for carrying out depthwise separable convolution processing on the data subjected to the batch standardization and activation processing, and carrying out batch standardization and activation processing again on the data after the depthwise separable convolution;
the pooling processing unit is used for pooling the data after batch standardization and activation processing;
and the tensor processing unit is used for selecting the processing result of the depthwise separable convolution processing unit or the pooling processing unit and carrying out tensor accumulation processing with the stored specified tensor.
Preferably, the vector multiply accumulate unit and the packet accumulate unit are in one-to-one or many-to-one relationship.
Preferably, the packet accumulation operation unit and the active pooling operation unit are in one-to-one or many-to-one relationship.
In order to achieve the above object, the present invention further provides a processing method of a simplified convolutional neural network-oriented low-cost accelerator architecture, which includes the following steps:
step S1, reading the data tensor from the data and weight tensor storage unit by using the data reading unit, and storing the read tensor into the data vector storage unit after segmenting and vectorizing the tensor according to a preset rule;
step S2, reading the data vector from the data vector storage unit and writing the data vector into the data vector register unit according to a certain rule by the data vector reading unit;
step S3, a weight reading unit is used for reading the weight tensor from the data and weight tensor storage unit, and the read tensor is segmented and vectorized according to a preset rule and then stored in m groups of weight vector storage units;
step S4, using m sets of weight vector reading units to read parameter vectors from corresponding weight vector storage units according to preset rules, and writing the read weight vectors into corresponding weight vector register units;
and step S5, acquiring corresponding data and weight vectors from the data vector register unit and the weight vector register unit by using m groups of vector operation units for processing, and storing the processing result into the data and weight tensor storage unit, wherein each group of vector operation units processes a pair of data and weight vectors every clock cycle.
Preferably, the step S5 further includes:
step S500, reading a pair of data vectors D and weight vectors W by using a vector multiply accumulate operation unit in each period, and performing multiply accumulate operation on the data vectors D and the weight vectors W to obtain a multiply accumulate operation result sequence Rv;
step S501, a grouping and accumulation operation unit is used for receiving the output result sequence Rv of the vector multiply-accumulate processing unit, and the received multiply-accumulate result sequence is grouped and accumulated together according to a preset rule to obtain an accumulation result Rg;
and step S502, performing activation pooling processing on the output sequence Rg of the grouping accumulation operation unit by using the activation pooling operation unit, and storing the output sequence Rg into a grouping accumulation storage unit.
Compared with the prior art, the simplified convolutional neural network-oriented low-cost accelerator architecture and the processing method thereof solve the problems of low efficiency or excessive power consumption when the conventional neural network accelerator processes the simplified convolutional neural network.
Drawings
FIG. 1 is a schematic structural diagram of a simplified convolutional neural network-oriented low-cost accelerator architecture according to the present invention;
FIG. 2 is a detailed structure diagram of a vector operation unit according to an embodiment of the present invention;
FIG. 3 is a detailed structure diagram of a block-and-accumulate unit according to an embodiment of the present invention;
FIG. 4 is a detailed diagram of an active pooling unit according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating two sets of vector multiply-accumulate units corresponding to one set of grouping accumulation units and one set of activation pooling operation units in an embodiment of the present invention;
FIG. 6 is a diagram illustrating two sets of vector multiply-accumulate units and two sets of grouping accumulation units corresponding to one set of activation pooling operation units in an embodiment of the present invention;
FIG. 7 is a flowchart of a processing method of a simplified convolutional neural network-oriented low-cost accelerator according to the present invention;
FIG. 8 is a detailed flowchart of step S5 according to an embodiment of the present invention;
FIG. 9 is a detailed flowchart of step S501 according to an embodiment of the present invention;
FIG. 10 is a detailed flowchart of step S502 according to an embodiment of the present invention;
FIG. 11 is a diagram of 3D tensors for data according to an embodiment of the present invention;
FIG. 12 is a diagram illustrating weight 4D tensors according to an embodiment of the present invention;
FIG. 13 is a diagram illustrating weight tensor division in an embodiment of the present invention;
FIG. 14 is a diagram illustrating data tensor division according to an embodiment of the present invention;
FIG. 15 is a row diagram of partitioned data and weight tensors according to an embodiment of the present invention;
FIG. 16 is a flow chart of row-level 3D convolution processing according to an embodiment of the present invention;
FIG. 17 is a block diagram of vector multiply accumulate results for row-level 3D convolution processing according to an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will be readily apparent to those skilled in the art from the present disclosure, which describes the embodiments of the invention by way of specific examples in conjunction with the accompanying drawings. The invention is capable of other and different embodiments, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Fig. 1 is a schematic structural diagram of a simplified convolutional neural network-oriented low-cost accelerator architecture according to the present invention. As shown in fig. 1, the simplified convolutional neural network-oriented low-cost accelerator architecture of the present invention includes:
the data and weight tensor storage unit 100 is configured to store an input data tensor, an input weight tensor, and an output data tensor, and in the system, the data and weight tensor storage unit 100 may be implemented by an SRAM (Static Random-Access memory) or an sdram (synchronous Dynamic Random Access memory).
The data reading unit 101 is configured to read a data tensor from the data and weight tensor storage unit 100, and store the read tensor into the data vector storage unit 102 after segmenting and vectorizing the tensor according to a preset rule.
The data vector storage unit 102 is used for storing a batch of data vectors. It is generally implemented physically with an SRAM (Static Random-Access Memory) that supports reading one data vector per clock cycle. Compared with the data and weight tensor storage unit, its access power consumption and latency are lower; by caching data that need repeated access, it reduces the access frequency of the more costly data and weight tensor storage unit, thereby saving system bandwidth and power consumption.
The data vector reading unit 103 is configured to read a data vector from the data vector storage unit 102 according to a preset rule and write the data vector into the data vector register unit 104, and the data vector reading unit 103 and the vector operation unit coordinate reading and operation processing of the data vector by using a "handshake synchronization mechanism".
The data vector register unit 104 is configured to register 1 or several data vectors. It is usually implemented physically with registers, which can serve the operation array's data vector accesses with extremely low latency and extremely low power consumption.
The weight reading unit 105 is configured to read a weight tensor from the data and weight tensor storage unit 100, and segment and vectorize the read tensor according to a preset rule, and store the segmented and vectorized tensor in the corresponding weight vector storage unit.
The m sets of weight vector storage units 106-1 to 106-m, where each set stores a batch of weight vectors. Similar to the data vector storage unit 102, each weight vector storage unit is also implemented physically with an SRAM (Static Random-Access Memory) and supports reading one weight vector per clock cycle.
The m sets of weight vector reading units 107-1 to 107-m, where each set reads weight vectors from its corresponding weight vector storage unit according to preset rules and writes them into the corresponding weight vector register unit; the weight vector reading units and the vector operation units likewise coordinate the reading and operation processing of the weight vectors through a handshake synchronization mechanism.
The m sets of weight vector register units 108-1 to 108-m, where each set registers 1 or several weight vectors. Similar to the data vector register unit 104, they are usually implemented physically with registers and can serve the operation array's weight vector accesses with extremely low latency and extremely low power consumption.
The m groups of vector operation units 109-1 to 109-m are used for acquiring the corresponding data and weight vectors from the data vector register unit and the weight vector register units for processing, and for storing the processing results into the data and weight tensor storage unit; each group of vector operation units can process a pair of data and weight vectors per clock cycle. It should be noted that the m groups of vector operation units 109-1 to 109-m share the data vectors while each group exclusively owns its weight vectors.
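As a minimal illustration of this shared-data, exclusive-weight arrangement, one cycle of the array might be modeled as follows (a Python sketch for exposition only; the function name and list-based data layout are our assumptions, not part of the patent):

```python
def broadcast_cycle(data_vec, weight_vecs):
    """One clock cycle of the operation array: the single registered data
    vector is broadcast to all m vector operation units, while each unit
    multiply-accumulates it against its own exclusive weight vector."""
    return [sum(w_i * d_i for w_i, d_i in zip(w, data_vec))
            for w in weight_vecs]

# Example: m = 2 vector operation units, vector dimension n = 3
print(broadcast_cycle([1, 2, 3], [[1, 0, 1], [2, 2, 2]]))  # -> [4, 12]
```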
FIG. 2 is a detailed structure diagram of a vector operation unit according to an embodiment of the present invention. As shown in fig. 2, each vector operation unit includes:
the vector multiply-accumulate unit 201 is used for reading a pair of data vectors D (D [1] … D [ n ]) and weight vectors W (W [1] … W [ n ]) every period, and performing multiply-accumulate operation thereon, as shown in the following formula:
Figure BDA0001866018480000081
wherein n is the dimension of the vector, Rv is the result of the multiply-accumulate operation, and vecMAC (W, D) is used for representing the vector multiply-accumulate operation for the convenience of description.
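In software terms this per-cycle operation is a plain dot product; a minimal sketch (illustrative only, with names of our choosing) is:

```python
def vec_mac(w, d):
    """Rv = vecMAC(W, D): element-wise multiply and accumulate of two
    n-dimensional vectors, one such operation per clock cycle."""
    assert len(w) == len(d)
    return sum(w_i * d_i for w_i, d_i in zip(w, d))

# Example with n = 4
print(vec_mac([1, 2, 3, 4], [5, 6, 7, 8]))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```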
A grouping and accumulation operation unit 202, configured to receive the output result sequence Rv of the vector multiply-accumulate processing unit 201, and group and accumulate the received multiply-accumulate result sequence according to a preset rule, specifically, as shown in fig. 3, the grouping and accumulation operation unit 202 further includes:
a data receiving unit 2021, configured to receive the result Rv of the vector multiply-accumulate operation unit;
a historical accumulation result determining and processing unit 2022, configured to determine whether a received packet to which the current Rv belongs has a historical accumulation result, if so, read a corresponding historical accumulation result, accumulate the historical accumulation result and the current Rv, record the accumulated result as Rg, and start the accumulation completion determining and processing unit 2023; if the group to which the current Rv belongs does not have a history accumulation result, the current Rv is directly used as an accumulation result, the accumulated result is recorded as Rg, and the accumulation completion judgment processing unit 2023 is started
The affiliated packet accumulation completion judgment processing unit 2023 is used for judging whether the current Rv is the last member of the affiliated packet, if the current Rv is the last member, the affiliated packet is subjected to final accumulation completion, an accumulated value Rg is output to the activated pooling operation unit for subsequent processing, and then the processing unit 2024 enters all packet accumulation judgment processing units, if the current Rv is not the last member of the affiliated packet, the affiliated packet is not subjected to final accumulation, the current accumulated result Rg is stored in the packet accumulation storage unit to serve as the historical accumulated result of the affiliated packet, and then the processing unit 2024 enters all packet accumulation judgment processing units;
the all-packet accumulation judging unit 2024 is configured to judge whether all the packet accumulation processing is finished, if yes, end the current flow, and otherwise, return to the data receiving unit 2021 to continue the packet accumulation processing.
An active pooling operation unit 203, configured to perform active pooling on an output sequence Rg of the packet accumulation operation unit, where a processed output result is denoted as Ra, as shown in fig. 4, the active pooling operation unit 203 includes:
a batch normalization and Activation processing unit 2031 configured to perform batch normalization and Activation processing (Activation) on the packet accumulation result;
a judging unit 2032, for judging whether pooling processing or depthwise separable convolution processing is to be performed; if depthwise separable convolution processing is selected, the depthwise separable convolution processing unit 2033 is started, and if pooling processing is selected, the pooling processing unit 2034 is started;
a depthwise separable convolution processing unit 2033, configured to perform depthwise separable convolution (Depthwise Separable Convolution) processing on the batch-normalized and activated data, and to perform batch normalization and activation processing again on the data after the depthwise separable convolution;
a Pooling processing unit 2034 for Pooling (Pooling) the data after the batch normalization and activation processing;
a tensor processing unit 2035, which selects the processing result of the depthwise separable convolution processing unit 2033 or the pooling processing unit 2034 and performs tensor accumulation (Tensor Add) processing with the stored specified tensor.
A grouping accumulation storage unit 204, configured to store the operation result of the grouping accumulation operation unit 202.
It should be noted here that, in the design of the accelerator's operation array, the vector multiply-accumulate units, the grouping accumulation units and the activation pooling operation units are not limited to a one-to-one correspondence; they may also be in a many-to-one relationship. FIG. 5 shows an example in which two sets of vector multiply-accumulate units correspond to one set of grouping accumulation units and one set of activation pooling operation units; there, the grouping accumulation unit can combine the output results of the vector multiply-accumulate units of different channels for grouping accumulation processing. FIG. 6 shows an example in which two sets of multiply-accumulate operation units and two sets of grouping accumulation operation units correspond to one set of activation pooling operation units; the activation pooling unit can then fulfil processing requirements that need multi-channel information, for example the MFM (Max-Feature-Map) processing used by some convolutional network structures.
FIG. 7 is a flowchart of a processing method of a simplified convolutional neural network-oriented low-cost accelerator according to the present invention. As shown in fig. 7, the processing method of a simplified convolutional neural network oriented low-cost accelerator of the present invention includes the following steps:
step S1, reading the data tensor from the data and weight tensor storage unit by using the data reading unit, and storing the read tensor into the data vector storage unit after segmenting and vectorizing the tensor according to a preset rule;
step S2, reading the data vector from the data vector storage unit and writing the data vector into the data vector register unit according to a preset rule by using the data vector reading unit;
step S3, a weight reading unit is used for reading the weight tensor from the data and weight tensor storage unit, and the read tensor is segmented and vectorized according to a preset rule and then stored in m groups of weight vector storage units;
step S4, using m sets of weight vector reading units to read parameter vectors from corresponding weight vector storage units according to a certain rule, and writing the read weight vectors into corresponding weight vector register units;
and step S5, acquiring corresponding data and weight vectors from the data vector register unit and the weight vector register unit by using m groups of vector operation units for processing, and storing the processing results into the data and weight tensor storage units, wherein each group of vector operation units can process a pair of data and weight vectors per clock cycle.
Specifically, as shown in fig. 8, step S5 further includes:
Step S500, a pair of data vectors D (D[1] … D[n]) and weight vectors W (W[1] … W[n]) is read by the vector multiply-accumulate operation unit every cycle, and the multiply-accumulate operation is performed on them, as shown in the following formula:

Rv = vecMAC(W, D) = W[1]×D[1] + W[2]×D[2] + … + W[n]×D[n]

where n is the dimension of the vectors and Rv is the result of the multiply-accumulate operation; for convenience of description, vecMAC(W, D) is used to denote the vector multiply-accumulate operation.
Step S501, a grouping and accumulation arithmetic unit is used for receiving the output result sequence Rv of the vector multiply-accumulate processing unit, and the received multiply-accumulate result sequence is grouped and accumulated together according to a certain rule. Fig. 9 shows a packet accumulation processing flow in step S501, and the specific flow is described as follows:
step S501a, receiving the result Rv of the vector multiply-accumulate unit, and then step S501 b;
step S501b, judging whether the grouping to which the current Rv belongs has a historical accumulation result, if so, turning to step S501c, otherwise, turning to step S501 d;
step S501c, reading a corresponding historical accumulation result from the grouping accumulation storage unit, accumulating the historical accumulation result and the current Rv, recording the accumulated result as Rg and transferring to step S501 e;
step S501d, if no historical accumulation result exists in the group to which the Rv belongs, directly taking the Rv as an accumulation result, recording the accumulated result as Rg and transferring to step S501 e;
step S501e, judging whether the current Rv is the last member of the belonged group, if the current Rv is the last member, switching to step S501f, and if not, switching to step S501 g;
in step S501f, if the final accumulation of the group is completed, the accumulated value Rg is output to the active pooling operation unit for subsequent processing, and the process goes to step S501 h.
Step S501g, if the final accumulation of the belonged grouping is not completed, storing the current accumulation result Rg into the grouping accumulation storage unit as the historical accumulation result of the belonged grouping, and going to step S501 h;
step S501h, if all the packet accumulation processing is finished, the current flow is finished, otherwise, the flow proceeds to step S501a, and the packet accumulation processing is continued.
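Steps S501a-S501h form a small state machine keyed by group membership. A software sketch of the same flow (assuming, for illustration only, that each incoming Rv arrives tagged with its group identifier and a last-member flag) could read:

```python
def group_accumulate(rv_stream, emit):
    """Model of steps S501a-S501h. rv_stream yields (group_id, rv, is_last)
    tuples; `storage` plays the role of the grouping accumulation storage
    unit, and `emit` hands a finished Rg to the activation pooling unit."""
    storage = {}
    for group_id, rv, is_last in rv_stream:       # S501a: receive Rv
        if group_id in storage:                   # S501b/S501c: add history
            rg = storage.pop(group_id) + rv
        else:                                     # S501d: first member
            rg = rv
        if is_last:                               # S501e/S501f: group done
            emit(group_id, rg)
        else:                                     # S501g: save partial sum
            storage[group_id] = rg

# Example: two interleaved groups "a" and "b" of two members each
stream = [("a", 1, False), ("b", 10, False), ("a", 2, True), ("b", 20, True)]
group_accumulate(stream, lambda g, rg: print(g, rg))  # prints: a 3, b 30
```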
Step S502, performing active pooling processing on the output sequence Rg of the packet accumulation unit by using an active pooling operation unit, recording the processed output result as Ra, and storing the Ra into the packet accumulation storage unit, where fig. 10 shows an active pooling processing flow of step S502, and the flow is specifically described as follows:
step S502a, performing Batch Normalization (Batch Normalization) and Activation (Activation) on the packet accumulation result, and then proceeding to step S502 b;
step S502b, judging whether the pooling process or the deep separable convolution process is selected, if the pooling process is selected, then going to step S502d, otherwise going to step S502 c;
step S502c, performing depthwise separable convolution (Depthwise Separable Convolution) processing on the batch-normalized and activated data, and proceeding to step S502e;
step S502d, performing Pooling (Pooling) processing on the data after batch standardization and activation processing, and then turning to step S502 f;
step S502e, carrying out batch standardization and activation processing on the data after the deep separation convolution again, and turning to step S502 f;
in step S502f, Tensor accumulation (Tensor Add) processing is performed on the result of the processing in step S502d or step S502e and the stored designated Tensor.
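Taken together, steps S502a-S502f form the branch-and-merge control flow sketched below (a toy Python model; the depthwise branch is reduced to an element-wise scaling and the pooling to a width-2 max purely to show the control flow, so these stand-ins are our assumptions, not the hardware operators):

```python
def batch_norm_act(x, scale=1.0, bias=0.0):
    """Stand-in for batch normalization followed by a ReLU activation."""
    return [max(scale * v + bias, 0.0) for v in x]

def activation_pooling(rg, use_depthwise, dw_weights=None, residual=None):
    """Toy model of steps S502a-S502f for one row of grouped results rg."""
    x = batch_norm_act(rg)                                   # S502a
    if use_depthwise:                                        # S502b: branch
        w = dw_weights or [1.0] * len(x)
        x = [v * w_i for v, w_i in zip(x, w)]                # S502c (toy depthwise)
        x = batch_norm_act(x)                                # S502e
    else:
        x = [max(x[i:i + 2]) for i in range(0, len(x), 2)]   # S502d (width-2 max pool)
    if residual is not None:                                 # S502f: Tensor Add
        x = [a + b for a, b in zip(x, residual)]
    return x

print(activation_pooling([1.0, -2.0, 3.0, 4.0], use_depthwise=False))  # [1.0, 4.0]
```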
The following will be illustrated by specific examples:
embodiment 1, an embodiment of convolutional neural network processing, as follows:
the input data of the convolutional neural network is a 3D tensor of HI × WI × CI, as shown in fig. 11, where WI, HI, and CI are the width, height, and channel number of the input data, respectively. The weight of the input to the convolutional network is a 3 × 3 × CI × M4D tensor, as shown in fig. 12, where 3 and 3 are the width and height of the convolution kernel, CI is the number of input channels and M is the number of output channels, assuming that both the horizontal and vertical step size (Stride) of the convolution process are 1.
FIG. 13 shows an example of weight division: the weight is divided into 3 × 1 × d tensors along the vertical direction and the input channel direction, where d = ⌊n/3⌋ (⌊x⌋ denotes rounding x down), and the number of divisions along the input channel direction is L = ⌈CI/d⌉ (⌈x⌉ denotes rounding x up). Each divided tensor is then converted into a weight vector of dimension n; if the converted dimension is less than n, the missing dimensions are padded with 0 to form an n-dimensional vector. After dividing and vectorizing the weight tensor, each weight channel is a 1 × 3 × L × M tensor, which is denoted as WT, and the elements of the tensor WT are vectors of dimension n.
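These split sizes can be computed as below (a sketch; the round-up for L is our reading, implied by the zero-padding of the final slice):

```python
import math

def split_sizes(n, ci):
    """Split sizes for dividing a 3 x 3 x CI weight kernel into 3 x 1 x d
    sub-tensors that each fit one n-dimensional hardware vector."""
    d = n // 3               # d = floor(n / 3): 3*d elements fit in one vector
    L = math.ceil(ci / d)    # slices along the input channel direction
    return d, L

# Example: n = 16, CI = 32 -> d = 5, L = 7; each 3x1x5 slice fills 15 of the
# 16 vector lanes, and the unused lane / tail of the last slice is zero-padded.
print(split_sizes(16, 32))   # (5, 7)
```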
Fig. 14 shows an example of data division corresponding to the weight division, where the left side of the figure is the division of the first row, the right side is the division of the second row, and so on. Note that since the vertical step size of the convolution is 1 and the divided tensor height is 3, the divided data of the first and second rows overlap, and the size of each divided data tensor is also 3 × 1 × d, the same as the weight tensor. Each divided tensor is then converted into a vector of dimension n; if the converted dimension is less than n, the missing dimensions are padded with 0. When the convolution is performed without padding, the input data is divided into an (HI−2) × WI × L tensor, denoted DT, whose elements are vectors of dimension n.
Referring to the block diagram of the operation array in fig. 1, if the number of an output weight channel is denoted as t (t = 1, 2, …, M), the output weight channel is associated with a weight vector storage unit as follows:
P=((t-1)%m)+1
wherein "%" represents modulus operation, P is the number of the weight storage unit associated with the output channel t, the weight of the output channel t is vectorized and then stored in the weight vector storage unit with the number P during array operation, if M > M, a plurality of output channels are associated with one weight storage unit, and at the moment, the associated output channels are processed one by one according to a certain rule, generally according to the priority processing with a smaller t/M value.
Since the processing method of each vector operation unit and each output channel is similar, the following describes the process of performing convolution operation processing on the accelerator by taking only one output channel as an example.
FIG. 15 shows a row taken horizontally from the divided data tensor and weight tensor. The data row is DT[1][1][1], DT[1][2][1], …, DT[1][WI][1], abbreviated for convenience of description as D[1], D[2], …, D[WI]; the weight row is WT[1][1][1][1], WT[1][2][1][1] and WT[1][3][1][1], abbreviated as W[1], W[2] and W[3]. The flow of the row-level 3D convolution processing on the fetched data and weight rows is shown in fig. 16, and is described in detail as follows:
Step 1601, the weight vector W[i] is read from the weight vector storage unit and written into the corresponding weight vector register unit. The registered weight vector is time-division multiplexed with D[1], D[2], …, D[WI] during the vector multiply-accumulation: a weight vector read once from the weight vector storage unit can be reused WI times, so the switching period of the weight vector register is WI cycles. This time-division multiplexing of the weight vectors reduces the frequency of accesses to the weight vector storage unit and thereby reduces system power consumption.
Step 1602, the data vector D[j] is read from the data vector storage unit and written into the data vector register unit every cycle. The data vector register switches once per cycle, so the registered vectors are not multiplexed in time; however, if multiple output channels are processed simultaneously, the data vectors are multiplexed across those channels. In addition, the data in the data vector storage unit are reused 3 times for different weights, which reduces the accesses to the more costly data and weight tensor storage unit and thus saves system bandwidth and power consumption.
Step 1603, a vector multiply accumulate operation is performed on the vectors W [ i ] and D [ j ], and the result is marked as Rv [ i ] [ j ].
Step 1604, grouping accumulation processing is performed on Rv[i][j]. The grouping method is shown in fig. 17: the Rv connected by an arrow form one group, the grouping accumulation operation unit accumulates the Rv of each group together, and the final accumulation results are the results of the row-level 3D convolution operation.
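Combining steps 1601-1604, the row-level processing can be modeled as below, reusing the vec_mac sketch from earlier (the grouping Rv[1][j] + Rv[2][j+1] + Rv[3][j+2] is our reading of the arrow-connected groups in fig. 17, for stride 1 without padding):

```python
def row_level_conv(W, D):
    """Row-level 3D convolution over one data row (steps 1601-1604).
    W: 3 weight vectors (one per kernel column); D: WI data vectors.
    Each W[i] is registered once and reused for all WI data vectors."""
    WI = len(D)
    # Steps 1601-1603: one vec_mac per (weight, data) pair
    Rv = [[vec_mac(W[i], D[j]) for j in range(WI)] for i in range(3)]
    # Step 1604: accumulate each arrow-connected group into one output
    return [Rv[0][j] + Rv[1][j + 1] + Rv[2][j + 2] for j in range(WI - 2)]

# Example with n = 1 and WI = 4: behaves like a plain 1-D convolution
print(row_level_conv([[1], [0], [-1]], [[1], [2], [3], [4]]))  # [-2, -2]
```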
Compared with the prior art, the operation array structure of the neural network accelerator has the following advantages:
1. The operation array adopts a mode in which 1 group of data vectors corresponds to m (m ≥ 1) groups of weight vectors; all vector operation units in the array share one group of data vector inputs, while each exclusively owns one group of weight vector inputs.
2. The terms data and weight in the operation array are used only to distinguish the two data flows in the description; for different computing tasks, their stored contents can be interchanged, in which case all vector operation units in the array essentially share the weight vectors and exclusively own the data vectors.
3. The data and the weights in the operation array each adopt a 3-level storage access mode: a vector register unit, a vector storage unit and a tensor storage unit. The 3 storage levels correspond, in physical implementation, to storage devices with different access efficiency and power consumption, and the bandwidth and power consumption of the system are reduced through hierarchical multiplexing.
4. In the vector register unit layer, according to different requirements of calculation tasks, the operation array can select to switch data vectors or weight vectors every clock cycle, specifically as follows:
1) switching data vectors, and keeping weight vectors unchanged;
2) switching weight vectors, and keeping the data vectors unchanged;
3) simultaneously switching the data vector and the weight vector;
5. The data storage and operation at the core of the operation array take the vector as the basic unit; a data vector can be multiplexed simultaneously among different vector operation units, and a weight vector can be time-division multiplexed within the same vector operation unit.
6. The multiply-accumulate operation of the operation array is divided into two parts: the vector multiply-accumulate operation unit and the grouping accumulation operation unit. The vector multiply-accumulation carries the core workload, while the grouping accumulation provides enough flexibility to meet the requirements of different operation tasks. Because the grouping accumulation operates on the results of the vector multiply-accumulation, the number of values it handles is reduced by a factor of n (n is usually greater than 16) compared with the number of multiply-accumulate operations, so the grouping accumulation unit can support flexible operation processing in the system at very small cost.
7. The depthwise separable convolution processing is part of the activation pooling operation unit and is handled by an independent operation unit. Separating the traditional 3D convolution from the depthwise separable convolution and processing them in different operation parts avoids the performance and cost loss that would result from forcing the accelerator's operation array to compromise between the traditional 3D convolution and the depthwise separable convolution, whose data and weight multiplexing characteristics differ greatly.
8. Because an independent operation part is used, the depthwise separable convolution can be cascaded with the traditional 3D convolution, merging two layers of operation processing into one, which greatly reduces the memory access bandwidth required for the depthwise separable convolution data;
9. Both the depthwise separable convolution processing and the pooling processing need row-level data cache space; to avoid an oversized cache, the two kinds of processing run in parallel and multiplex the input data cache, which effectively reduces the system implementation cost;
10. Tensor accumulation (Tensor Add) processing is incorporated into the activation pooling flow, which effectively avoids repeated reading and writing of the current processing layer when the processing results of different layers are accumulated.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims (10)

1. A reduced convolutional neural network oriented low cost accelerator architecture comprising:
the data and weight tensor storage unit is used for storing an input data tensor, an input weight tensor and an output data tensor;
the data reading unit is used for reading the data tensor from the data and weight tensor storage unit, segmenting and vectorizing the read tensor according to a preset rule, and storing the tensor into the data vector storage unit;
the data vector storage unit is used for storing a batch of data vectors;
the data vector reading unit is used for reading data vectors from the data vector storage unit according to a preset rule and writing the data vectors into the data vector register unit;
the data vector register unit is used for registering 1 or more data vectors;
the weight reading unit is used for reading the weight tensor from the data and weight tensor storage unit, segmenting and vectorizing the read tensor according to a preset rule and storing the segmented and vectorized tensor to the corresponding weight vector storage unit;
m groups of weight vector storage units, wherein each group of weight units respectively stores a batch of weight vectors;
m sets of weight vector reading units, wherein each set of weight vector reading unit reads parameter vectors from corresponding weight vector storage units according to preset rules and writes the read weight vectors into corresponding weight vector register units;
m groups of weight vector register units, wherein each group of weight vector register units is used for registering 1 or more weight vectors;
and the m groups of vector operation units are used for acquiring corresponding data and weight vectors from the data vector register units and the weight vector register units for processing, storing processing results into the data and weight tensor storage units, and processing a pair of data and weight vectors by each group of vector operation units every clock cycle.
2. The reduced convolutional neural network-oriented low-cost accelerator architecture of claim 1, wherein: the data vector reading unit and the vector operation unit adopt a handshake synchronization mechanism to coordinate the reading and operation processing of the data vector.
3. The reduced convolutional neural network-oriented low-cost accelerator architecture of claim 1, wherein: the m groups of vector operation units share the data vectors and each exclusively owns its own weight vectors.
4. The reduced convolutional neural network-oriented low-cost accelerator architecture of claim 1, wherein: the weight reading unit and the vector operation unit adopt a handshake synchronization mechanism to coordinate the reading and operation processing of the weight vector.
5. The reduced convolutional neural network-oriented low-cost accelerator architecture of claim 1, wherein the vector operation unit comprises:
the vector multiplication and accumulation operation unit is used for reading a pair of data vectors and weight vectors in each period and carrying out multiplication and accumulation operation on the data vectors and the weight vectors to obtain a multiplication and accumulation operation result sequence Rv;
the grouping and accumulation arithmetic unit is used for receiving the output result sequence Rv of the vector multiplication and accumulation processing unit, grouping the received multiplication and accumulation result sequences according to a preset rule and accumulating the grouped multiplication and accumulation result sequences together to obtain an accumulation result Rg;
the activation pooling operation unit is used for performing activation pooling on the output sequence Rg of the grouping accumulation operation unit, and the processed output result is recorded as Ra;
and the grouping accumulation storage unit is used for storing the operation result of the grouping accumulation operation unit.
6. The reduced convolutional neural network-oriented low-cost accelerator architecture of claim 5, wherein the packet accumulation arithmetic unit comprises:
a data receiving unit for receiving the result Rv of the vector multiply-accumulate operation unit;
the historical accumulation result judging and processing unit is used for judging whether the received grouping to which the current Rv belongs has a historical accumulation result or not, if so, reading the corresponding historical accumulation result, accumulating the historical accumulation result and the current Rv, recording the accumulated result as Rg, and starting the accumulation completion judging and processing unit; if the group to which the current Rv belongs does not have a historical accumulation result, directly taking the current Rv as an accumulation result, recording the accumulated result as Rg, and starting an accumulation completion judgment processing unit;
the affiliated grouping accumulation completion judging and processing unit is used for judging whether the current Rv is the last member of the affiliated grouping, if the current Rv is the last member, the affiliated grouping final accumulation is completed, the accumulated value Rg is output to the activated pooling operation unit for subsequent processing, then the processing unit enters the all-grouping accumulation judgment and processing unit, if the current Rv is not the last member of the affiliated grouping, the affiliated grouping final accumulation is not completed, the current accumulated result Rg is stored in the grouping accumulation storage unit to be used as the historical accumulation result of the affiliated grouping, and then the processing unit enters the all-grouping accumulation judgment and processing unit;
and the all-packet accumulation judging and processing unit is used for judging whether all the packet accumulation processing is finished, if so, finishing the current flow, and if not, returning to the data receiving unit to continue the packet accumulation processing.
7. The reduced convolutional neural network-oriented low-cost accelerator architecture of claim 5, wherein the active pooling unit comprises:
the batch standardization and activation processing unit is used for carrying out batch standardization and activation processing on the grouping accumulation result;
the judging unit is used for judging whether pooling processing or depthwise separable convolution processing is to be performed; if depthwise separable convolution processing is selected, the depthwise separable convolution processing unit is started, and if pooling processing is selected, the pooling processing unit is started;
the depthwise separable convolution processing unit is used for carrying out depthwise separable convolution processing on the data subjected to the batch standardization and activation processing, and carrying out batch standardization and activation processing again on the data after the depthwise separable convolution;
the pooling processing unit is used for pooling the data after batch standardization and activation processing;
and the tensor processing unit is used for selecting the processing result of the depthwise separable convolution processing unit or the pooling processing unit and carrying out tensor accumulation processing with the stored specified tensor.
8. The reduced convolutional neural network-oriented low-cost accelerator architecture of claim 5, wherein: the vector multiply-accumulate operation unit and the grouping accumulate operation unit as well as the grouping accumulate operation unit and the activation pooling operation unit are in one-to-one or many-to-one relationship.
9. A processing method for a simplified convolutional neural network-oriented low-cost accelerator architecture comprises the following steps:
step S1, reading the data tensor from the data and weight tensor storage unit by using the data reading unit, and storing the read tensor into the data vector storage unit after segmenting and vectorizing the tensor according to a preset rule;
step S2, reading the data vector from the data vector storage unit and writing the data vector into the data vector register unit according to a certain rule by the data vector reading unit;
step S3, a weight reading unit is used for reading the weight tensor from the data and weight tensor storage unit, and the read tensor is segmented and vectorized according to a preset rule and then stored in m groups of weight vector storage units;
step S4, using m sets of weight vector reading units to read parameter vectors from corresponding weight vector storage units according to preset rules, and writing the read weight vectors into corresponding weight vector register units;
and step S5, acquiring corresponding data and weight vectors from the data vector register unit and the weight vector register unit by using m groups of vector operation units for processing, and storing the processing result into the data and weight tensor storage unit, wherein each group of vector operation units processes a pair of data and weight vectors every clock cycle.
10. The processing method of a reduced convolutional neural network-oriented low-cost accelerator architecture as claimed in claim 9, wherein step S5 further comprises:
step S500, reading a pair of data vectors D and weight vectors W by using a vector multiply accumulate operation unit in each period, and performing multiply accumulate operation on the data vectors D and the weight vectors W to obtain a multiply accumulate operation result sequence Rv;
step S501, a grouping and accumulation operation unit is used for receiving the output result sequence Rv of the vector multiply-accumulate processing unit, and the received multiply-accumulate result sequence is grouped and accumulated together according to a preset rule to obtain an accumulation result Rg;
and step S502, performing activation pooling processing on the output sequence Rg of the grouping accumulation operation unit by using the activation pooling operation unit, and storing the output sequence Rg into a grouping accumulation storage unit.
CN201811355622.0A 2018-11-14 2018-11-14 Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof Active CN111191774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811355622.0A CN111191774B (en) 2018-11-14 2018-11-14 Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811355622.0A CN111191774B (en) 2018-11-14 2018-11-14 Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof

Publications (2)

Publication Number Publication Date
CN111191774A true CN111191774A (en) 2020-05-22
CN111191774B CN111191774B (en) 2023-04-07

Family

ID=70709044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811355622.0A Active CN111191774B (en) 2018-11-14 2018-11-14 Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof

Country Status (1)

Country Link
CN (1) CN111191774B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product
CN108446761A (en) * 2018-03-23 2018-08-24 中国科学院计算技术研究所 A kind of neural network accelerator and data processing method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product
CN108446761A (en) * 2018-03-23 2018-08-24 中国科学院计算技术研究所 A kind of neural network accelerator and data processing method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Qiu Yue et al., "Design and Implementation of an FPGA-based Convolutional Neural Network Accelerator", Microelectronics & Computer *
Xiao Hao et al., "FPGA Hardware Accelerator Design for Convolutional Neural Networks", Industrial Control Computer *
Lu Sicong et al., "Quantum Machine Learning", Control Theory & Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786021A (en) * 2021-01-26 2021-05-11 东南大学 Lightweight neural network voice keyword recognition method based on hierarchical quantization
CN112786021B (en) * 2021-01-26 2024-05-14 东南大学 Lightweight neural network voice keyword recognition method based on hierarchical quantization
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set

Also Published As

Publication number Publication date
CN111191774B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN107657581B (en) Convolutional neural network CNN hardware accelerator and acceleration method
US10140123B2 (en) SIMD processing lanes storing input pixel operand data in local register file for thread execution of image processing operations
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN116541647A (en) Operation accelerator, processing method and related equipment
KR20230078652A (en) Method and system for hierarchical weighted sparse convolution processing
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN110163338B (en) Chip operation method and device with operation array, terminal and chip
WO2022007266A1 (en) Method and apparatus for accelerating convolutional neural network
CN108520297B (en) Programmable deep neural network processor
CN109146065B (en) Convolution operation method and device for two-dimensional data
CN112905530B (en) On-chip architecture, pooled computing accelerator array, unit and control method
CN113673701A (en) Method for operating neural network model, readable medium and electronic device
CN110796236A (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN110377874B (en) Convolution operation method and system
CN113313247A (en) Operation method of sparse neural network based on data flow architecture
WO2021142713A1 (en) Neural network processing method, device and system
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
CN110414672B (en) Convolution operation method, device and system
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
CN111191774B (en) Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
CN113254391B (en) Neural network accelerator convolution calculation and data loading parallel method and device
CN115221102A (en) Method for optimizing convolution operation of system on chip and related product
CN115836346A (en) In-memory computing device and data processing method thereof
CN116451755A (en) Acceleration method and device of graph convolution neural network and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant