CN111178508B - Computing device and method for executing full connection layer in convolutional neural network - Google Patents


Info

Publication number
CN111178508B
CN111178508B (application CN201911374362.6A)
Authority
CN
China
Prior art keywords
effective
data
feature map
bit
weight data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911374362.6A
Other languages
Chinese (zh)
Other versions
CN111178508A (en)
Inventor
Name not disclosed at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN201911374362.6A priority Critical patent/CN111178508B/en
Publication of CN111178508A publication Critical patent/CN111178508A/en
Application granted granted Critical
Publication of CN111178508B publication Critical patent/CN111178508B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an operation device and method for executing a full connection layer in a convolutional neural network. The device comprises a data management module for storing and reading a feature map and weight data in a preset format, the preset format comprising a valid data count, a flag bit region, and a valid data storage area, where the flag bit region indicates whether the data at each corresponding position of the feature map or the weight data is valid; and a convolution calculation device for performing a convolution operation on the valid data of the feature map and the valid data of the weight data according to the flag bit region information. Embodiments of the invention have at least the following effects: removing invalid data from the sparse feature map and the corresponding weight data reduces the data volume, saves data storage space, lowers the bandwidth required for the operation, and eliminates a large amount of redundant calculation; implementing the calculation process in hardware increases the calculation speed and improves calculation efficiency.

Description

Computing device and method for executing full connection layer in convolutional neural network
Technical Field
The present invention relates to convolutional neural networks, and more particularly, to an arithmetic device and method for executing a full connection layer in a convolutional neural network.
Background
Deep learning based on convolutional neural networks achieves high accuracy in image recognition, object detection, speech recognition, and similar tasks, and is therefore widely used in security monitoring, driver assistance, intelligent companion robots, intelligent healthcare, and other fields.
The full connection layer is a main network layer in a convolutional neural network; it mainly classifies the feature points extracted by the network. In practice it can be regarded as a convolutional layer with a 1x1 kernel whose output is a single point. However, because the feature data and the number of classification items are very large, the amount of computation is usually large. Commonly used convolutional neural network models therefore involve a large amount of calculation and occupy a large amount of data storage space.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the prior art. It therefore provides an operation device for executing the full connection layer in a convolutional neural network that can reduce the amount of calculation and save data storage space.
The invention also provides a method for the full connection layer in a convolutional neural network using the above operation device.
An arithmetic device for executing a full connection layer in a convolutional neural network according to an embodiment of the first aspect of the present invention includes: a data management module for storing and reading the feature map and the weight data in a preset format, wherein the preset format comprises a valid data count, a flag bit region, and a valid data storage area, and the flag bit region indicates whether the data at each corresponding position of the feature map or the weight data is valid; and a convolution calculation device for performing a convolution operation on the valid data of the feature map and the valid data of the weight data according to the flag bit region information.
The operation device for executing the full connection layer in the convolutional neural network according to the embodiment of the invention has at least the following beneficial effects: removing invalid data from the sparse feature map and the corresponding weight data reduces the data volume, saves data storage space, and eliminates a large number of redundant calculations; implementing the calculation process in hardware increases the calculation speed and improves calculation efficiency.
According to some embodiments of the invention, the convolution calculation device comprises: a flag bit processing module for removing the flag bits that are invalid at the same position in the flag bit region of the feature map and the flag bit region of the weight data, and processing them to obtain the valid flag bit information of the feature map and the valid flag bit information of the weight data respectively; a data selection module for finding the positions where the flags are valid at the same position in both the valid flag bit information of the feature map and the valid flag bit information of the weight data, obtaining the position information of all such valid flag bits, and sequentially selecting a corresponding number of valid feature points and valid weights according to those positions; and a multiply-accumulate module comprising an allocation unit and two processing units, each processing unit comprising a plurality of multiply-accumulators, the allocation unit allocating the corresponding valid feature points and valid weights to the two processing units for multiply-accumulate calculation according to the bit widths of the feature map and the weight data. These modules operate in a pipelined and parallel manner, further increasing the operation speed.
According to some embodiments of the invention, the flag bit processing module removes, through a 16-bit priority selector, the data that are invalid at the same position in the flag bit regions of the feature map and the weight data. Invalid data in the sparse feature map and the weight data are all zeros, and the priority selector can quickly remove all of them.
According to some embodiments of the invention, each processing unit comprises 8-bit multiply-accumulators. This arrangement can handle multiply-accumulate operations for feature maps and weight data in a variety of common bit-width combinations.
A method for a full connection layer in a convolutional neural network according to a second aspect of the present invention comprises the steps of: a storage and reading step, in which the feature map and the weight data are stored and read in a preset format, wherein the preset format comprises a valid data count, a flag bit region, and a valid data storage area, and the flag bit region indicates whether the data at each corresponding position of the feature map or the weight data is valid; and a convolution calculation step, in which a convolution operation is performed on the valid data of the feature map and the valid data of the weight data according to the flag bit region information.
The method for the full connection layer in the convolutional neural network according to the embodiment of the invention has at least the following beneficial effects: removing invalid data from the sparse feature map and the corresponding weight data reduces the data volume, saves data storage space, lowers the bandwidth required for the operation, eliminates a large amount of redundant calculation, increases the calculation speed, and improves calculation efficiency.
According to some embodiments of the invention, the convolution calculation step comprises: a flag bit processing step, in which the flag bit region of the feature map and the flag bit region of the weight data are read in groups of the same preset length, the positions whose flag bits are invalid in both regions are removed, the valid flag bit information of the feature map and the valid flag bit information of the weight data are obtained, and an end signal is sent when the whole flag bit region has been processed; a valid data extraction step, in which the size of a search window is determined according to the bit widths of the feature map and the weight data, the positions where the flags are valid at the same position in both the valid flag bit information of the feature map and that of the weight data are found within the search window, the positions of all such valid flag bits are obtained, and several pairs of valid feature points and valid weights are selected sequentially according to those positions; and an allocation calculation step, in which the valid feature points and valid weights are allocated to two processing units for multiply-accumulate calculation, according to the bit widths of the feature map and the weight data, until the end signal is received. These steps are pipelined and performed in parallel, saving calculation time and improving overall calculation efficiency.
According to some embodiments of the invention, the size of the search window is configured as follows: if the maximum bit width of the feature map and the weight data is 16 bits, the search window is 16 bits; if the maximum bit width is 8 bits, the search window is 32 bits. This yields better hardware efficiency and facilitates the subsequent accumulation calculation.
According to some embodiments of the invention, in the flag bit processing step the preset length is 128 bits, and in the valid data extraction step 8 pairs of valid feature points and valid weights are selected. These two values give good hardware efficiency and facilitate the calculation of the multiply-accumulate module.
According to some embodiments of the invention, the next starting position of the search window is determined by the position of the valid flag bit corresponding to the last pair of valid feature points and valid weights currently selected. This ensures that every valid feature point and valid weight that should participate in the operation is included.
According to some embodiments of the invention, when the weight data has multiple sets, ping-pong storage is used during the reading and convolution calculation process. This saves memory space during the calculation and reduces hardware cost.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic block diagram of an arithmetic device according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature map and a weight data storage according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of an arithmetic device including internal modules of a convolution computing device according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a multiply-accumulate module in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of steps of a method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a convolution calculation step in a method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of flag bit processing in the method according to the embodiment of the present invention;
FIG. 8 is a schematic diagram of effective data extraction in a method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of allocation data of an allocation unit in a method according to an embodiment of the present invention;
reference numerals:
a data management module 100, a convolution computing device 200, a flag bit processing module 210, a data selecting module 220, a multiply-accumulate module 230, an allocation unit 231, and a processing unit 232;
a storage reading step S100, a convolution calculation step S200, a flag bit processing step S210, a valid data extraction step S220, and an allocation calculation step S230.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include it. "First" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of the technical features indicated, or their precedence.
Referring to fig. 1, an arithmetic device according to an embodiment of the present invention includes a data management module 100 and a convolution calculation device 200. The data management module 100 stores and reads the feature map and the weight data in a preset format. The preset format comprises a valid data count, a flag bit region, and a valid data storage area, where the flag bit region indicates whether the data at each corresponding position of the feature map or the weight data is valid. Taking a feature map as an example, suppose a feature map contains 8×8×128 feature points, some of which are non-zero valid data. Referring to fig. 2, the number of valid data items is recorded in the Data0 area, the valid data are stored sequentially in the Data2 area, and each bit of the Data1 area indicates whether the data at the corresponding position of the original feature map is non-zero valid data. In fig. 2, the first data item is valid, so its flag is 1 and it is stored in the first position of the Data2 area; the second data item is 0 and thus invalid, so its flag is 0 and it is not stored; the third data item is likewise zero and not stored; the fourth data item is valid, so its flag is 1 and it is stored in the second position of the Data2 area. Hence the Data1 area of a feature map containing 8×8×128 feature points occupies 8×8×128 bits, i.e. 8×128 bytes. The weight data comprise a plurality of weights, and the valid weights are stored in the same preset format; it should be understood that the number of weights in each set of weight data matches the number of feature points of the corresponding feature map.
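The Data0/Data1/Data2 layout described above can be sketched in Python as a software model (the patent implements this in hardware; the function names here are illustrative, not from the patent):

```python
def pack(values):
    """Encode a dense list into the patent's sparse format:
    Data0 = count of non-zero values, Data1 = one flag bit per
    original position, Data2 = the non-zero values packed in order."""
    flags = [1 if v != 0 else 0 for v in values]   # Data1
    valid = [v for v in values if v != 0]          # Data2
    return len(valid), flags, valid                # Data0, Data1, Data2

def unpack(count, flags, valid):
    """Reconstruct the dense array from (Data0, Data1, Data2)."""
    out, i = [], 0
    for f in flags:
        if f:
            out.append(valid[i])
            i += 1
        else:
            out.append(0)
    assert i == count  # every packed value must be consumed
    return out
```

For a sparse feature map, Data2 shrinks to only the non-zero entries, which is the source of the storage and bandwidth savings the patent claims.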
The convolution calculation device 200 performs a convolution operation on the valid data of the feature map and the valid data of the weight data based on the flag bit region information.
Referring to fig. 3, in the arithmetic device according to the embodiment of the present invention, the convolution calculation device 200 includes a flag bit processing module 210, a data selection module 220, and a multiply-accumulate module 230. The flag bit processing module 210 removes the flag bits that are invalid at the same position in the flag bit region of the feature map and the flag bit region of the weight data, and processes them to obtain the valid flag bit information of the feature map and the valid flag bit information of the weight data respectively. The data selection module 220 finds the positions where the flags are valid at the same position in both streams of valid flag bit information, obtains the position information of all such valid flag bits, and sequentially selects a corresponding number of valid feature points and valid weights according to those positions. The multiply-accumulate module 230 multiplies and accumulates the valid feature points and valid weights; referring to fig. 4, it comprises an allocation unit 231 and two processing units 232, and the allocation unit 231 allocates the corresponding valid feature points and valid weights to the two processing units for multiply-accumulate calculation according to the bit widths of the feature map and the weight data.
The steps of the method of embodiments of the present invention will be described below with a specific example: a feature map containing 8×8×128 feature points, a convolution kernel of 8×8, 16 sets of weight data, and an output of 1×1×16. Each set of weight data has the same amount of data as its corresponding feature map, i.e. 8×8×128 weights.
Referring to fig. 5, the method of an embodiment of the present invention includes a storage and reading step S100 and a convolution calculation step S200. In the storage and reading step S100, the feature map and the weight data are stored and read in a preset format, the preset format comprising a valid data count, a flag bit region, and a valid data storage area, where the flag bit region indicates whether the data at each corresponding position of the feature map or the weight data is valid. Taking the feature map as an example, referring to fig. 2, the reading process in an embodiment of the invention is: first read Data0, i.e. the valid data count; then read Data1, i.e. the flag bit region; finally determine the read length of Data2 from the valid data count and read out Data2. Because there are 16 sets of weight data, ping-pong storage is used to buffer them after reading. In the convolution calculation step S200, a convolution operation is performed on the valid data of the feature map and the valid data of the weight data according to the flag bit region information.
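The ping-pong (double-buffer) storage used for the multiple sets of weight data can be modelled as follows (an illustrative software sketch, not the hardware design; class and method names are assumptions): the loader fills one buffer while the compute side reads the other, and the roles swap for each new set of weights.

```python
class PingPongBuffer:
    """Two buffers: the loader fills one while the compute side
    reads the other; swap() exchanges their roles."""
    def __init__(self):
        self._buf = [None, None]
        self._load = 0                 # index currently being filled

    def load(self, data):
        self._buf[self._load] = data   # fill the "ping" side

    def swap(self):
        self._load ^= 1                # hand the filled buffer to compute

    def current(self):
        return self._buf[self._load ^ 1]  # buffer visible to compute side
```

This lets the reading of the next weight set overlap with the convolution on the current one, so only two buffers are needed regardless of how many weight sets there are.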
Referring to fig. 6, the convolution calculation step S200 includes a flag bit processing step S210, a valid data extraction step S220, and an allocation calculation step S230. In the embodiment of the present invention, these three steps complete the convolution calculation of the full connection layer through pipelining and parallel processing; it should be understood that the reference numerals do not impose a strict ordering of the steps. In the flag bit processing step S210, the flag bit region of the feature map and the flag bit region of the weight data are read in groups of the same preset length, the positions whose flag bits are invalid in both regions are removed, the valid flag bit information of the feature map and of the weight data is obtained, and an end signal is sent when the whole flag bit region has been processed. The end signal informs the valid data extraction step S220 and the allocation calculation step S230 that the calculation process is complete. The valid data extraction step S220 further processes the valid flag bit information from S210: it determines the size of a search window according to the bit widths of the feature map and the weight data, finds within the search window the positions where the flags are valid in both streams of valid flag bit information, obtains the positions of all such valid flag bits, and sequentially selects several pairs of valid feature points and valid weights accordingly.
In the allocation calculation step S230, the valid feature points and valid weights are allocated to the two processing units for multiply-accumulate calculation, according to the bit widths of the feature map and the weight data, until the end signal is received. Because the flag bit processing step S210 performs an initial screening, the processing speed is further increased.
In the embodiment of the present invention, the flag bit processing step S210 processes the flag bits as shown in fig. 7. In each processing cycle, 128 bits of the flag bit region of the feature map and 128 bits of that of the weight data are read simultaneously; in a pipelined, parallel fashion the 128 bits are divided into eight 16-bit segments, a 16-bit priority selector removes every position whose flag bits are zero in both segments, and the remaining bits are compactly spliced into new valid flag bit information, which is then output. When the last group of the flag bit region has been processed, an end signal is generated and passed to the next stage. Fig. 7 shows the processing of part of the flag bit data: whenever the flag bits at the same position in the feature map and the weight data are both 0, the flag bit at that position is removed, yielding the valid flag bit information 111101111011110111111 of the feature map and the valid flag bit information 111011111111101011011 of the weight data.
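The joint removal performed by the priority-selector stage can be sketched in software as follows (a hypothetical model; the hardware operates on 16-bit segments in parallel):

```python
def compress_flags(fmap_flags, weight_flags):
    """Drop every position whose flag is 0 in BOTH the feature-map and
    the weight flag streams. Positions where at least one flag is 1 are
    kept, so the count of 1s in each stream -- used later to address the
    packed valid-data areas -- is unchanged by the compression."""
    kept = [(f, w) for f, w in zip(fmap_flags, weight_flags) if f or w]
    return [f for f, _ in kept], [w for _, w in kept]
```

Note that the output streams may still contain zeros (as in the fig. 7 example): a position survives if either operand is non-zero, and the later data selection step picks out only the positions where both are non-zero.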
The valid data extraction step S220 further processes the valid flag bit information obtained in the flag bit processing step S210, obtains the position information of all valid flag bits, and fetches the corresponding valid data according to that position information. Referring to fig. 8, the rightmost data in the figure is the valid flag bit information read first. In the embodiment of the present invention, each cycle of the valid data extraction step S220 provides 8 valid feature points and 8 valid weights to the multiply-accumulate module for the multiply-accumulate operation. The module first determines the size of the search window from the valid flag bit information output by the previous stage and the maximum bit width of the feature map and the weight data: if the maximum bit width is 8 bits, a 32-bit search window is used; if it is 16 bits, a 16-bit search window is used. In fig. 8 the search window is 32 bits. First, the positions where the flag bits of the feature map and the weight data are simultaneously non-zero are found within the window. The first row of fig. 8 shows the valid flag bit information of the feature map; the third row shows that of the weight data; the fifth row is only an identification aid, does not represent actual storage, and marks the positions whose flag bits are valid in both streams. Next, 8 valid data items are selected in order from each of the feature map and the weight data according to this information, and the position of each item in its valid data storage area is determined. For example, the second row of fig. 8 gives the position numbers of the 8 valid feature points within the feature map's valid data storage area (idx 0 denoting the first position), and the fourth row gives the position numbers of the 8 valid weights within the weight data's valid data storage area (again with idx 0 denoting the first position); the storage position of a valid data item is obtained by counting the valid flags preceding its flag bit. It should be understood that a valid feature point and a valid weight identified at the same position do not necessarily occupy the same position in their respective valid data storage areas: in fig. 8, for the last pair selected in the first search window, the idx of the valid feature point is 13 while the idx of the valid weight is 12. Finally, the non-zero valid data participating in the operation, i.e. the valid feature points and valid weights, are fetched in parallel from the feature map and weight data according to the position information idx and passed to the multiply-accumulate module 230. In the next cycle, non-zero data continue to be selected within a new search window until the end signal from the flag bit processing step S210 is received. Referring to fig. 8, the new search window starts at the bit immediately after the last pair of valid data (feature point and weight) selected in the current window.
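The search-window selection can be sketched as follows (a hypothetical software model; the hardware computes the counts in parallel). A pair is selected where both compressed flags are 1, and each operand's idx is the number of 1s preceding it in its own flag stream, which is exactly its offset in the packed valid-data area:

```python
def select_pairs(f_flags, w_flags, start, window, max_pairs=8):
    """Return up to max_pairs tuples (pos, f_idx, w_idx) from the
    window [start, start+window), plus the start of the next window
    (one bit past the last selected position, as in the patent)."""
    pairs = []
    end = min(start + window, len(f_flags))
    for pos in range(start, end):
        if f_flags[pos] and w_flags[pos]:
            # idx = count of valid flags before this position
            pairs.append((pos, sum(f_flags[:pos]), sum(w_flags[:pos])))
            if len(pairs) == max_pairs:
                break
    next_start = (pairs[-1][0] + 1) if pairs else end
    return pairs, next_start
```

Because the two flag streams can have different numbers of 1s before a given position, f_idx and w_idx of a pair may differ, matching the idx 13 vs. idx 12 case in fig. 8.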
The allocation calculation step S230 is carried out by the multiply-accumulate module 230. Its allocation unit distributes the valid data pairs (feature points and weights) provided by the valid data extraction step S220 to the two processing units for multiply-accumulate operation according to the bit widths of the feature map and the weight data, and finishes the final calculation upon the end signal sent by the flag bit processing step S210, yielding the multiply-accumulate result of the whole full connection layer. Each processing unit includes eight 8-bit multiply-accumulators. Fig. 9 shows how the operands of data sources (feature map and weight data) with different bit widths are distributed to a single processing unit: for example, when both the feature points and the weights are 8 bits wide, one processing unit can complete the multiply-accumulate of 8 feature-point/weight pairs, with data and multiply-accumulators in one-to-one correspondence as in the figure. If the feature points are 8 bits wide and the weights 16 bits wide, one processing unit can complete the multiply-accumulate of only 4 pairs of valid data; following fig. 9, 4 pairs are allocated to the first processing unit and the other 4 pairs to the second, and the results are then accumulated to obtain the final value.
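The distribution rule above can be sketched as a software model (illustrative only; function names are assumptions). Each processing unit has eight 8-bit multiply-accumulators, so when either operand is 16 bits wide a pair occupies two MACs and each unit takes only 4 pairs:

```python
def dispatch(pairs, f_bits, w_bits):
    """Split (feature, weight) pairs across the two processing units.
    8-bit x 8-bit: one unit takes 8 pairs per cycle; if either operand
    is 16-bit, each pair needs two 8-bit MACs, so each unit takes 4."""
    per_unit = 8 if max(f_bits, w_bits) == 8 else 4
    return pairs[:per_unit], pairs[per_unit:2 * per_unit]

def mac(pairs, acc=0):
    """Multiply-accumulate the pairs assigned to one processing unit."""
    for f, w in pairs:
        acc += f * w
    return acc
```

Running both units on their allocated pairs and summing their partial results gives the contribution of one cycle to the full connection layer's output.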
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present invention.

Claims (7)

1. An arithmetic device for executing a full connection layer in a convolutional neural network, comprising:
the data management module is used for storing and reading the feature map and the weight data in a preset format, wherein the preset format comprises a valid data count, a flag bit region, and a valid data storage area, and the flag bit region indicates whether the data at each corresponding position of the feature map or the weight data is valid;
the convolution calculation device is used for performing a convolution operation on the valid data of the feature map and the valid data of the weight data according to the flag bit region information;
the convolution calculation device further includes:
a flag bit processing module for removing the flag bits that are invalid at the same position in the flag bit region of the feature map and the flag bit region of the weight data, and processing them to obtain the valid flag bit information of the feature map and the valid flag bit information of the weight data respectively;
a data selection module for finding the positions where the flags are valid at the same position in both the valid flag bit information of the feature map and the valid flag bit information of the weight data, obtaining the position information of all such valid flag bits, and sequentially selecting a corresponding number of valid feature points and valid weights according to those positions;
a multiply-accumulate module comprising an allocation unit and two processing units, each processing unit comprising a plurality of multiply-accumulators, the allocation unit being used for allocating the corresponding valid feature points and valid weights to the two processing units for multiply-accumulate calculation according to the bit widths of the feature map and the weight data;
and the flag bit processing module removes the data that are invalid at the same position in the flag bit regions of the feature map and the weight data through a 16-bit priority selector.
2. The computing device for executing a fully connected layer in a convolutional neural network of claim 1, wherein each processing unit comprises 8-bit multiply-accumulators.
3. A method for the fully connected layer of a convolutional neural network using the device of any one of claims 1 to 2, comprising the steps of:
a storing and reading step of storing and reading the feature map and the weight data according to a preset format, wherein the preset format comprises an effective data quantity, a flag bit region and an effective data storage region, and the flag bit region indicates whether the data at the corresponding position of the feature map or the weight data is effective;
a convolution calculation step of performing a convolution operation on the effective data of the feature map and the effective data of the weight data according to the information of the flag bit regions;
wherein the convolution calculation step further comprises:
a flag bit processing step of reading the flag bit region of the feature map and the flag bit region of the weight data in groups of the same preset length, removing the data that is invalid at the same positions of the two flag bit regions, processing to obtain the effective flag bit information of the feature map and the effective flag bit information of the weight data, and sending an end signal when processing of the flag bit regions is complete;
an effective data extraction step of determining the size of a search window according to the bit widths of the feature map and the weight data, searching within the window for the positions at which the flags at the same position in the effective flag bit information of the feature map and of the weight data are both effective, obtaining the positions of all effective flag bits, and sequentially selecting several pairs of effective feature points and effective weights according to those positions;
and an allocation calculation step of allocating the effective feature points and the effective weights to the two processing units for multiply-accumulate calculation, according to the bit widths of the feature map and the weight data, until the end signal is received.
4. The method for a fully connected layer in a convolutional neural network according to claim 3, wherein the size of the search window is configured such that:
if the maximum bit width of the feature map and the weight data is 16 bits, the size of the search window is 16 bits;
and if the maximum bit width of the feature map and the weight data is 8 bits, the size of the search window is 32 bits.
5. The method for a fully connected layer in a convolutional neural network according to claim 3, wherein
the preset length in the flag bit processing step is 128 bits;
and the effective data extraction step selects 8 pairs of effective feature points and effective weights.
6. The method for a fully connected layer in a convolutional neural network according to claim 3, wherein the next starting position of the search window is determined by the position, among all the effective flag bits, corresponding to the last pair of effective feature points and effective weights currently selected.
7. The method for a fully connected layer in a convolutional neural network according to claim 3, wherein ping-pong storage is used in the reading and convolution calculation process when there are multiple copies of the weight data.
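A minimal software sketch of the data selection of claim 1 and the window sizing of claim 4 follows, under the assumption that only effective values are stored compactly while the flag bits record their original positions. All function names are illustrative, and this models behavior only, not the claimed hardware:

```python
def search_window_bits(feat_bits: int, weight_bits: int) -> int:
    # Claim 4: a 16-bit window when the widest operand is 16 bits,
    # a 32-bit window when both operands are at most 8 bits wide.
    return 16 if max(feat_bits, weight_bits) == 16 else 32

def select_pairs(feat_flags, feat_vals, w_flags, w_vals):
    """Select (feature point, weight) pairs at positions where BOTH flags
    are effective; every other position contributes nothing to the output,
    which is the sparsity saving the claims describe."""
    fi = wi = 0  # cursors into the compacted effective-value arrays
    pairs = []
    for ff, wf in zip(feat_flags, w_flags):
        if ff and wf:
            pairs.append((feat_vals[fi], w_vals[wi]))
        fi += ff  # advance a cursor only past stored (effective) values
        wi += wf
    return pairs

assert search_window_bits(8, 16) == 16
assert search_window_bits(8, 8) == 32
# Positions 0 and 3 are effective in BOTH flag regions, so two pairs result.
assert select_pairs([1, 0, 1, 1], [3, 5, 7],
                    [1, 1, 0, 1], [2, 4, 6]) == [(3, 2), (7, 6)]
```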
CN201911374362.6A 2019-12-27 2019-12-27 Computing device and method for executing full connection layer in convolutional neural network Active CN111178508B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911374362.6A CN111178508B (en) 2019-12-27 2019-12-27 Computing device and method for executing full connection layer in convolutional neural network

Publications (2)

Publication Number Publication Date
CN111178508A CN111178508A (en) 2020-05-19
CN111178508B true CN111178508B (en) 2024-04-05

Family

ID=70654067

Country Status (1)

Country Link
CN (1) CN111178508B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488291B (en) * 2020-11-03 2024-06-04 珠海亿智电子科技有限公司 8-Bit quantization compression method for neural network
KR20220062892A (en) * 2020-11-09 2022-05-17 삼성전자주식회사 Electronic device and method for controlling electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304923A (en) * 2017-12-06 2018-07-20 腾讯科技(深圳)有限公司 Convolution algorithm processing method and Related product
CN109063825A (en) * 2018-08-01 2018-12-21 清华大学 Convolutional neural networks accelerator
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110163042A (en) * 2018-04-13 2019-08-23 腾讯科技(深圳)有限公司 Image-recognizing method and device
CN110543936A (en) * 2019-08-30 2019-12-06 北京空间飞行器总体设计部 Multi-parallel acceleration method for CNN full-connection layer operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant