CN111199268B - Implementation method and device of full connection layer, electronic equipment and computer readable storage medium
- Publication number: CN111199268B (application CN201811375742.7A)
- Authority
- CN
- China
- Prior art keywords
- feature
- input
- data processing
- components
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology; G06N3/045—Combinations of networks
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology
- Y02—Technologies or applications for mitigation or adaptation against climate change; Y02D—Climate change mitigation technologies in information and communication technologies [ICT]; Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a method and an apparatus for implementing a fully-connected layer, an electronic device, and a computer-readable storage medium. The method comprises the following steps: obtaining a plurality of input features for a fully-connected layer, each input feature comprising a plurality of first feature components, and, when the total number of obtained input features reaches a first preset threshold, simultaneously inputting the input features to a plurality of data processing units corresponding to the fully-connected layer; then obtaining the weight coefficient of each of the plurality of first feature components in the output features of the fully-connected layer; and then, in the plurality of data processing units, determining the output feature corresponding to each input feature in parallel according to the plurality of first feature components and the weight coefficients corresponding to them. The embodiments of the application enable multiplexing of the weight coefficients of the fully-connected layer and improve the utilization of the multiply accumulators in the neural network.
Description
Technical Field
The present application relates to the field of neural networks, and in particular, to a method and an apparatus for implementing a fully-connected layer, an electronic device, and a computer-readable storage medium.
Background
Currently, Convolutional Neural Networks (CNNs) are widely used across fields of artificial intelligence. A CNN is a deep feed-forward artificial neural network comprising convolutional layers, pooling layers, and Fully Connected (FC) layers. Compared with other neural network algorithms, a CNN can process larger images, and is characterized by a large amount of computation, high bandwidth requirements, and relatively fixed operations. In the FC layer, the local features of the processed object need to be integrated into global features. Although the amount of computation in the FC layer is small, the number of weight coefficients (weights) it requires is large; in most cases, the weights required by the FC layer account for over 70% of the weights of the entire neural network. In existing FC-layer implementations, as soon as the system generates a local feature, the weights are immediately read to transform it. This not only prevents multiplexing of the FC-layer weights, but also results in low utilization of the Multiply Accumulators (MACs) in the whole neural network.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, and a computer-readable storage medium for implementing a fully-connected layer, which enable multiplexing of the weight coefficients of the FC layer and improve the utilization of the MACs in a neural network.
A first aspect of the embodiments of the present application provides a method for implementing a fully-connected layer, including:
obtaining a plurality of input features for a fully-connected layer, each input feature of the plurality of input features comprising a plurality of first feature components;
when the total number of the acquired input features reaches a first preset threshold, simultaneously inputting the plurality of input features to a plurality of data processing units corresponding to the fully-connected layer;
acquiring the weight coefficient of each of the plurality of first feature components in the output features of the fully-connected layer;
in the plurality of data processing units, determining, in parallel, the output feature corresponding to each input feature according to the plurality of first feature components and the weight coefficients corresponding to the plurality of first feature components.
Correspondingly, a second aspect of the embodiments of the present application provides an apparatus for implementing a fully-connected layer, including:
an obtaining module, configured to obtain a plurality of input features for a fully-connected layer, each input feature of the plurality of input features comprising a plurality of first feature components;
a transmission module, configured to simultaneously input the plurality of input features to a plurality of data processing units corresponding to the fully-connected layer when the total number of the acquired input features reaches a first preset threshold;
the obtaining module is further configured to obtain the weight coefficient of each of the plurality of first feature components in the output features of the fully-connected layer;
and a processing module, comprising the plurality of data processing units corresponding to the fully-connected layer, configured to determine, in parallel, the output feature corresponding to each input feature according to the plurality of first feature components and the weight coefficients corresponding to the plurality of first feature components.
Wherein the transmission module is further configured to:
simultaneously inputting one of the plurality of input features to each of the plurality of data processing units;
the processing module is further configured to:
in each data processing unit, determining the output feature corresponding to the one input feature according to the plurality of first feature components in the input feature and the weight coefficients corresponding to the plurality of first feature components.
Wherein the transmission module is further configured to:
simultaneously inputting the plurality of input features to each of the plurality of data processing units corresponding to the fully connected layer;
the processing module is further configured to:
in each data processing unit, determining at least one second feature component in a plurality of second feature components included in the output feature corresponding to each input feature according to the plurality of first feature components in each input feature and the weight coefficients corresponding to the plurality of first feature components;
and combining the at least one second characteristic component determined by each data processing unit to obtain an output characteristic corresponding to each input characteristic.
Wherein the output feature corresponding to the one input feature comprises a plurality of second feature components;
the processing module is further configured to:
determining, according to the weight coefficients corresponding to the plurality of first feature components, the contribution value of each first feature component in the one input feature to each second feature component in the plurality of second feature components;
and determining the output feature corresponding to the one input feature according to the contribution values.
Wherein the processing module is further configured to:
removing from each data processing unit any first feature component whose contribution values have all been determined.
Wherein the processing module is further configured to:
determining, according to the weight coefficients corresponding to the plurality of first feature components, the contribution value of each first feature component in each input feature to each second feature component in the at least one second feature component;
and determining the at least one second feature component according to the contribution values.
Wherein the obtaining module is further configured to:
determining a storage space required for storing the weight coefficients;
when the storage space is smaller than a second preset threshold, the operation of obtaining the weight coefficient of each first characteristic component in the plurality of first characteristic components in the output characteristic of the full connection layer is performed.
A third aspect of embodiments of the present application provides an electronic device, including: a processor, a memory, a communication interface, and a bus;
the processor, the memory and the communication interface are connected through the bus and complete mutual communication;
the memory stores executable program code;
the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the implementation method of the full connection layer disclosed in the first aspect of the embodiment of the present application.
Accordingly, an embodiment of the present application provides a storage medium, where the storage medium is used to store an application program, and the application program is used to execute, at runtime, the implementation method of the full connection layer disclosed in the first aspect of the embodiment of the present application.
Accordingly, an embodiment of the present application provides an application program, where the application program is configured to execute, at runtime, the implementation method of the full connection layer disclosed in the first aspect of the embodiment of the present application.
In the embodiments of the present application, a plurality of input features for the fully-connected layer are acquired, each comprising a plurality of first feature components; when the total number of the acquired input features reaches a first preset threshold, the input features are simultaneously input to a plurality of data processing units corresponding to the fully-connected layer; then the weight coefficient of each of the plurality of first feature components in the output features of the fully-connected layer is acquired; and then, in the plurality of data processing units, the output feature corresponding to each input feature is determined in parallel according to the plurality of first feature components and the weight coefficients corresponding to them. Compared with the prior-art approach of reading the weight coefficients once for every input feature when performing the FC-layer computation, the method provided by the embodiments of the present application realizes multiplexing of the weight coefficients. In addition, the first preset threshold on the total number of input features may be selected based on the number of data processing units corresponding to the FC layer, so that each data processing unit undertakes the FC-layer computation of at least one input feature, which avoids idle data processing units and improves the utilization of the MACs.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present application; those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a neural network processor provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of an implementation method of a full connection layer according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an input feature provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of writing input features into PEs according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an output feature provided by an embodiment of the present application;
fig. 6 is a schematic flowchart of another implementation method of a full connection layer according to an embodiment of the present application;
FIG. 7 is another schematic diagram of writing input features into PEs according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another output characteristic provided by embodiments of the present application;
FIG. 9 is a schematic structural diagram of an apparatus for implementing a fully-connected layer according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a neural network processor according to an embodiment of the present application. As shown in the figure, the neural network processor in the embodiment of the present application includes a data memory, a data reading unit, a weight memory, a data restoring unit, and a plurality of data processing units (Processing Elements, PEs). Each PE may include an input data cache, a convolution operation unit, and an output data cache. The data memory is used for storing the input features (input feature maps) generated by each layer during the neural network computation, so that the input features can be integrated into output features (output feature maps). The data reading unit is used for reading the input features from the data memory and sending them into the input data cache of the corresponding PE. The weight memory is used for storing the weight coefficients (weights), which may form a weight matrix, required by each layer during the neural network computation. The data restoring unit is used for storing the output features in the output data caches back into the data memory. The PEs are used for completing the computation of the FC layer: the convolution operation unit reads the input features from the input data cache and performs the FC-layer computation or other convolution operations, and the output data cache stores the output features computed by the convolution operation unit. Based on the above neural network processor, the embodiments of the present application provide the following implementation methods of a fully-connected layer.
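As an illustration of this layout, the following is a minimal Python sketch of the processor described above; the class and attribute names (NeuralNetworkProcessor, PE, input_buffer, and so on) are assumptions of this sketch, not identifiers from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class PE:
    """One data processing unit: input data cache, MAC-based convolution unit, output data cache."""
    input_buffer: list = field(default_factory=list)   # holds first feature components (ci)
    output_buffer: dict = field(default_factory=dict)  # co index -> accumulated value

    def mac(self, ci_value: float, weight: float, co_index: int) -> None:
        # the convolution operation unit: multiply-accumulate one contribution into co
        acc = self.output_buffer.get(co_index, 0.0)
        self.output_buffer[co_index] = acc + ci_value * weight

@dataclass
class NeuralNetworkProcessor:
    data_memory: dict = field(default_factory=dict)    # batch index -> input feature (list of ci)
    weight_memory: dict = field(default_factory=dict)  # (ci index, co index) -> weight coefficient
    pes: list = field(default_factory=lambda: [PE() for _ in range(16)])
```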
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a method for implementing a full link layer according to an embodiment of the present disclosure. As shown in the figures, the method in the embodiment of the present application includes:
s201, acquiring a plurality of input features aiming at the full connection layer, wherein each input feature in the plurality of input features comprises a plurality of first feature components.
In a specific implementation, the CNN includes convolutional layers, pooling layers, and fully-connected layers. In the process of image processing with a convolutional neural network, the fully-connected layer integrates the large number of image features obtained after the processing of the convolutional layers and pooling layers, so that the image can be subsequently classified or otherwise processed. Therefore, the input features for the FC layer generated in the CNN may be acquired in real time, and the acquired input features stored in the data memory. The plurality of first feature components of each input feature can be processed in batches: each input feature is stored in one batch, and each first feature component occupies one input channel (ci) in that batch. Each first feature component may be a number, a vector, a matrix, and so on. For convenience of description, the x-th first feature component of an input feature is hereinafter referred to as cix.
For example: as shown in fig. 3, the input features T0, T1, T2, …, T15 are stored in the data memory in batch0, batch1, …, batch15, respectively. Wherein each input feature comprises 6 feature components, which correspond to ci0, ci1, …, ci5, respectively.
And S202, when the total number of the acquired multiple input features reaches a first preset threshold, simultaneously inputting the multiple input features to multiple data processing units corresponding to the full connection layer.
In a specific implementation, the plurality of data processing units corresponding to the FC layer may be configured to perform the FC-layer computation for the input features in the plurality of batches, for example: PE0 performs the FC-layer computation for the input feature in batch0, PE1 for the input feature in batch1, …, PE15 for the input feature in batch15. The first preset threshold may be determined according to the number of data processing units corresponding to the FC layer and/or the storage capacity of the data memory. For example: if there are 16 data processing units corresponding to the FC layer, the first preset threshold may be 16.
When the total number of the acquired input features reaches the first preset threshold, one of the acquired input features may be simultaneously input to each data processing unit of the plurality of data processing units. For each acquired input feature, the input feature may first be read from the data memory and then input to one of the plurality of data processing units corresponding to the FC layer, where it may be stored in the input data cache of that data processing unit. Accordingly, each data processing unit holds only one input feature, and the input features in different data processing units are different, in order to improve the utilization of the multiply accumulators.
For example: as shown in fig. 3, a total of 16 input features are acquired, including T0, T1, …, T15, which are stored in batch0, batch1, …, batch15, respectively. The FC layer has 16 data processing units, including PE0, PE1, …, PE15. Therefore, as shown in fig. 4, T0 may be input to PE0, T1 to PE1, …, and T15 to PE15. The feature components of each input feature can be read from each batch in the following order and stored in the input data cache of the corresponding PE (where "batcha-cib" denotes ci b in batch a; a sketch of this read order follows the list):
batch0-ci0,batch1-ci0,…,batch15-ci0,
batch0-ci1,batch1-ci1,…,batch15-ci1,
……
batch0-ci5,batch1-ci5,…,batch15-ci5。
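The read order above is a channel-major traversal. The following Python sketch (an illustration under the assumptions of 16 batches and 6 input channels; the function name is invented for this sketch) makes it explicit:

```python
# Channel-major read order: ci0 of every batch, then ci1 of every batch, ...
# "batch{b}-ci{c}" in the list above corresponds to the pair (b, c) below.
NUM_BATCHES, NUM_CI = 16, 6

def read_order_method1():
    for c in range(NUM_CI):           # outer loop over input channels
        for b in range(NUM_BATCHES):  # inner loop over batches
            yield (b, c)              # read batch{b}-ci{c} into the input cache of PE{b}

# e.g. the first three reads are batch0-ci0, batch1-ci0, batch2-ci0:
assert list(read_order_method1())[:3] == [(0, 0), (1, 0), (2, 0)]
```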
s203, acquiring a weight coefficient of each first characteristic component in the plurality of first characteristic components in the output characteristics of the full connection layer.
In a specific implementation, in the CNN, the weight coefficients required by the FC layer may be read from a Double Data Rate (DDR) memory and stored in the weight memory. The output feature comprises a plurality of second feature components, and each first feature component contributes to each second feature component. Therefore, the obtained weight coefficients include the weight coefficient of each first feature component in each second feature component.
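In other words, each second feature component is a weighted sum of all the first feature components. Using the cix/cox notation introduced above and writing the weight coefficient of cix in coy as w_{x,y} (a notational assumption of this sketch), the FC-layer computation can be expressed as:

$$co_y = \sum_{x} w_{x,y} \cdot ci_x$$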
As shown in fig. 5, in the same way as the storage of the input features, a plurality of second feature components included in the output features may also be stored in a batch in the output data buffer of the corresponding data processing unit, and each second feature component occupies one output channel (co). For convenience of description, the xth second feature component of the output feature will be referred to as cox hereinafter.
And S204, in the data processing units, according to the first characteristic components and the weight coefficients corresponding to the first characteristic components, determining the output characteristics corresponding to the input characteristics in parallel.
In a specific implementation, the weight coefficients of each first feature component in each second feature component may be broadcast to the data processing units one by one; when all the weight coefficients of one first feature component have been broadcast, broadcasting switches to the weight coefficients of the next first feature component. Once the parameters of the CNN are determined, the weight coefficients required to compute each input feature in the FC layer are the same, so the weight coefficients required by each data processing unit are also the same.
For example: each input feature comprises 6 first feature components ci0, ci1, …, ci5, and each output feature comprises 128 second feature components co0, co1, …, co127. The weight coefficients may be broadcast to each PE in the following order, where cix-coy denotes the weight coefficient of the x-th first feature component in the y-th second feature component:
ci0-co0,ci0-co1,…,ci0-co127,
ci1-co0,ci1-co1,…,ci1-co127,
……
ci5-co0,ci5-co1,…,ci5-co127
Next, in each data processing unit, the contribution value of each first feature component of the input feature held by that unit to each second feature component of the corresponding output feature may be determined according to the weight coefficients, where the contribution value may be the product of the first feature component and the weight coefficient. It should be noted that the operations in the multiple data processing units are executed in parallel.
For example, suppose the input feature T0 includes ci0, ci1, and ci2, whose values are 0, 6, and 5, respectively, and the weight coefficients ci0-co0, ci1-co0, and ci2-co0 are 0.12, 0.15, and 0.2, respectively. Then the contribution values of ci0, ci1, and ci2 to co0 are 0 × 0.12 = 0, 6 × 0.15 = 0.9, and 5 × 0.2 = 1, respectively.
In order to maximize the multiplexing of each first feature component in the input feature and minimize the data buffering in the data processing unit, for the plurality of first feature components in the input feature, the contribution values of one first feature component to every second feature component in the output feature may be calculated first, after which that first feature component is deleted from the data processing unit and computation switches to the next first feature component. That is, a first feature component is only switched out after it has been applied to every second feature component.
For example: after ci0 is input into the PE, ci0-co0, ci0-co1, …, ci0-co127 are broadcast to the PE, so the PE can first calculate the contribution values of ci0 to co0, co1, …, co127 and then delete ci0 from the input data cache; after ci1 is input to the PE, ci1-co0, ci1-co1, …, ci1-co127 are broadcast to the PE, and the PE can calculate the contribution values of ci1 to co0, co1, …, co127 and delete ci1 from the input data cache; and so on. By analogy, the contribution value of each ci to each co is obtained, and thus each second feature component can be obtained.
Then, the output feature is determined from the contribution values: the sum of the contribution values of all the first feature components to a given second feature component is taken as that second feature component, and the plurality of second feature components together constitute the output feature.
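Putting the broadcast order, the contribution values, and the timely deletion of used components together, the following Python sketch simulates this first method (assuming 16 PEs, 6 ci per input feature, and 128 co per output feature; function and variable names are invented for this sketch):

```python
NUM_PE, NUM_CI, NUM_CO = 16, 6, 128

def fc_method1(inputs, weights):
    """inputs[p][c]: value of ci{c} of the input feature held by PE{p};
    weights[c][y]: weight coefficient of ci{c} in co{y} (identical for all PEs)."""
    in_buf = [list(feature) for feature in inputs]     # per-PE input data cache
    out_buf = [[0.0] * NUM_CO for _ in range(NUM_PE)]  # per-PE output data cache
    for c in range(NUM_CI):                 # broadcast order: all co of ci{c} first
        for y in range(NUM_CO):
            w = weights[c][y]               # one broadcast of ci{c}-co{y} reaches every PE
            for p in range(NUM_PE):         # the 16 PEs work in parallel in hardware
                out_buf[p][y] += in_buf[p][c] * w   # contribution of ci{c} to co{y}
        for p in range(NUM_PE):
            in_buf[p][c] = None             # ci{c} fully used: delete it from the cache
    return out_buf                          # out_buf[p] is the output feature P{p}
```

Because the loop over PEs sits innermost, each weight coefficient is fetched once and consumed by all 16 input features, which is exactly the weight multiplexing described above.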
For example, as shown in fig. 5, the output data cache of each PE stores the output feature corresponding to the input feature input to that PE, where the output feature is formed by co0, co1, …, co127.
It should be noted that after each data processing unit calculates the output feature, the output feature needs to be stored in the output data cache, and is then read and stored into the data memory by the data restoring unit. Therefore, if the storage capacity of the output data cache is smaller than the storage space occupied by the output feature, the plurality of second feature components in the output feature need to be grouped.
For example: the output data cache of PE0 can store 128 co, while the output feature P0 corresponding to the input feature T0 contains 148 second feature components, so 148 is decomposed into 128 + 20. First, 128 second feature components are calculated in PE0 and stored in its output data cache; after these 128 second feature components have been read out of the output data cache of PE0, the remaining 20 second feature components are calculated.
To sum up, the core idea of this implementation method of the fully-connected layer is: the weight coefficients are broadcast to the plurality of data processing units simultaneously, and each data processing unit is responsible for computing one of the acquired input features, so that multiple PEs synchronously perform the FC-layer computation of multiple input features and the weight coefficients are multiplexed.
The following describes a complete implementation flow of the implementation method of the full connection layer in the embodiment of the present application by using an example.
Suppose 16 input features T0, T1, …, T15 are acquired and stored in batch0, batch1, …, batch15 of the data memory, respectively, each input feature comprising 6 first feature components ci0, ci1, …, ci5 (as shown in fig. 3). The output features corresponding to T0, T1, …, T15 are P0, P1, …, P15, respectively, and each output feature includes 128 second feature components co0, co1, …, co127. The specific implementation flow is as follows:
1) T0, T1, …, T15 are read from batch0, batch1, …, batch15 in the following order and stored in the input data caches of PE0, PE1, …, PE15, respectively: ci0 of every input feature is read first, then ci1 of every input feature, and so on. As shown in fig. 4, ci0, ci1, …, ci5 of T0 are input to PE0; ci0, ci1, …, ci5 of T1 to PE1; …; ci0, ci1, …, ci5 of T15 to PE15.
batch0-ci0,batch1-ci0,…,batch15-ci0,
batch0-ci1,batch1-ci1,…,batch15-ci1,
……
batch0-ci5,batch1-ci5,…,batch15-ci5。
2) The weight coefficients of each ci in each co can be broadcast to PE0, PE1, …, PE15 in the following order. This broadcast order enables each PE to multiplex each ci to the maximum extent and minimizes the input data cache, because once the contributions of cix to each of co0, co1, …, co127 have been calculated in one pass, cix can be deleted from the input data cache.
ci0-co0,ci0-co1,…,ci0-co127,
ci1-co0,ci1-co1,…,ci1-co127,
……
ci5-co0,ci5-co1,…,ci5-co127
3) As shown in fig. 5, co0, co1, …, co127 of each output feature are calculated in turn in each PE and stored in the output data cache, yielding P0, P1, …, P15.
In the embodiment of the application, a plurality of input features for the fully-connected layer are acquired, each input feature comprising a plurality of first feature components; when the total number of acquired input features reaches a preset threshold, the input features are simultaneously input into the plurality of data processing units corresponding to the fully-connected layer. The weight coefficient of each first feature component in the output features of the fully-connected layer is then acquired, and in the plurality of data processing units the output feature corresponding to each input feature is determined in parallel according to the plurality of first feature components and the weight coefficients corresponding to them. Compared with the prior-art approach of performing the FC-layer computation every time an input feature is generated, the method of this embodiment accumulates input features and, once a certain number have been accumulated, uses the multiple data processing units to process them in parallel, completing the FC-layer computation of the input features while reading the weight coefficients only once. This achieves multiplexing of the FC-layer weight coefficients, greatly reduces the bandwidth for reading weight coefficients, and improves the MAC utilization of the neural network. In addition, by matching the read order of the input feature components with the broadcast order of the weight coefficients, data that has already been consumed is deleted in time, effectively reducing the data-cache pressure in the data processing units.
Referring to fig. 6, fig. 6 is a schematic flowchart of another implementation method of a full connection layer according to an embodiment of the present disclosure. As shown in the figure, the method in the embodiment of the present application includes:
s601, acquiring a plurality of input features aiming at the full connection layer, wherein each input feature in the plurality of input features comprises a plurality of first feature components.
In a specific implementation, the CNN includes convolutional layers, pooling layers, and fully-connected layers. In the process of image processing with a convolutional neural network, the fully-connected layer integrates the large number of image features obtained after the processing of the convolutional layers and pooling layers, so that the image can be subsequently classified or otherwise processed. Therefore, the input features for the FC layer generated in the CNN may be acquired in real time, and the acquired input features stored in the data memory. The plurality of first feature components of each input feature can be processed in batches: each input feature is stored in one batch, and each first feature component occupies one ci in that batch. Each first feature component may be a number, a vector, a matrix, or the like.
And S602, when the total number of the acquired multiple input features reaches a preset threshold, simultaneously inputting the multiple input features to each data processing unit in multiple data processing units corresponding to the full connection layer.
In a specific implementation, the preset threshold may be determined according to the number of data processing units corresponding to the FC layer and/or the storage capacity of the data memory, for example 16 or 10. For each acquired input feature, the input feature may first be read from the data memory and then input to each of the plurality of data processing units corresponding to the FC layer, where it may be stored in the input data cache of the data processing unit. Thus, each data processing unit holds all the acquired input features.
For example: the total 16 input features acquired to the system generation include T0, T1, …, T15, which are stored in batch0, batch1, …, batch15, respectively. The FC layer has 16 data processing units, including PE0, PE1, …, PE15. Therefore, as shown in fig. 7, T0 may be input into PE0, PE1, …, PE15; inputting T1 into PE0, PE1, … and PE15; …; finally, inputting T15 into PE0, PE1, … and PE15. Specifically, the characteristic components of each T0, T1, …, T15 may be read from the data memory in the following order and stored in the input data buffer of each PE.
batch0-ci0,batch0-ci1,…,batch0-ci5,
batch1-ci0,batch1-ci1,…,batch1-ci5,
……
batch15-ci0,batch15-ci1,…,batch15-ci5。
S603, acquiring a weight coefficient of each first characteristic component in the plurality of first characteristic components in the output characteristics of the full connection layer.
In a specific implementation, in the CNN, the weight coefficients required by the FC layer may be read from the DDR and stored in the weight memory. The output feature comprises a plurality of second feature components, and each first feature component contributes to each second feature component. Therefore, the obtained weight coefficients include the weight coefficient of each first feature component in each second feature component.
And S604, in each data processing unit, determining at least one second characteristic component in a plurality of second characteristic components contained in the output characteristic corresponding to each input characteristic according to the plurality of first characteristic components in each input characteristic and the weight coefficients corresponding to the plurality of first characteristic components.
In a specific implementation, the plurality of second feature components may be numbered and assigned one by one, in ascending order of their numbers, to the data processing units; the number of second feature components contained in each output feature is the same. Accordingly, the weight coefficients required by each data processing unit may be delivered to it one by one according to the second feature components it processes; here the weight coefficients required by different data processing units are different.
Then, in each data processing unit, each of the at least one second feature component that the unit is responsible for processing is determined according to the weight coefficients, with the operations in the plurality of data processing units performed simultaneously.
For example: as shown in fig. 8, each output feature contains 128 second feature components co0, co1, …, co127. There are 16 PEs in total, including PE0, PE1, …, PE15. Then co0 may be allocated first to PE0, co1 to PE1, co2 to PE2, …, and co15 to PE15; then co16 is assigned to PE0, co17 to PE1, and so on. By analogy, the second feature components calculated in PEi are co(i + j×16), where i = 0, 1, 2, …, 15 and j = 0, 1, …, 7; after co(i + j×16) is obtained, it may be stored in the output data cache. A sketch of this assignment is given below.
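A small Python sketch of this round-robin assignment (names invented for illustration):

```python
NUM_PE, NUM_CO = 16, 128

def cos_for_pe(i: int) -> list:
    """Output channels co(i + j*16), j = 0..7, handled by PE{i}."""
    return [i + j * NUM_PE for j in range(NUM_CO // NUM_PE)]

assert cos_for_pe(0) == [0, 16, 32, 48, 64, 80, 96, 112]
# equivalently, co{y} is handled by PE{y % 16}
```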
Accordingly, the corresponding weight coefficients may be input to each PE in the following order, where "cix-coy (PEz)" means that the weight coefficient of cix in coy is input to PEz:
ci0-co0(PE0),ci0-co1(PE1),…,ci0-co15(PE15),
ci0-co16(PE0),ci0-co17(PE1),…,ci0-co31(PE15),
……
ci0-co112(PE0),ci0-co113(PE1),…,ci0-co127(PE15),
ci1-co0(PE0),ci1-co1(PE1),…,ci1-co15(PE15),
ci1-co16(PE0),ci1-co17(PE1),…,ci1-co31(PE15),
……
ci5-co112(PE0),ci5-co113(PE1),…,ci5-co127(PE15)
In order to maximize the multiplexing of each first feature component in the input features and minimize the data buffering in the data processing unit, computation switches to the next first feature component only after the current one has been applied to every second feature component handled by that unit.
For example: co0, co16, co32, …, co112 need to be calculated in PE0. Then, for ci0, ci1, …, ci5 in turn, the contribution values of ci0 to co0, co16, co32, …, co112 are calculated first and ci0 is deleted from PE0; next, the contribution values of ci1 to co0, co16, co32, …, co112 are calculated and ci1 is deleted from PE0; and so on.
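The following Python sketch simulates the per-PE computation of this embodiment (assuming 16 PEs, 16 input features with 6 ci each, and 128 co per output feature; names are invented for this sketch):

```python
NUM_PE, NUM_CI, NUM_CO, NUM_BATCH = 16, 6, 128, 16

def fc_method2_pe(i, inputs, weights):
    """inputs[b][c]: ci{c} of input feature T{b} (every PE holds all input features);
    weights[c][y]: weight coefficient of ci{c} in co{y}; PE{i} only receives
    the coefficients of the co it is responsible for."""
    my_cos = [i + j * NUM_PE for j in range(NUM_CO // NUM_PE)]  # co0, co16, ... for PE0
    partial = {y: [0.0] * NUM_BATCH for y in my_cos}  # co{y} of every output feature
    in_buf = [list(f) for f in inputs]                # input data cache of PE{i}
    for c in range(NUM_CI):
        for y in my_cos:
            w = weights[c][y]
            for b in range(NUM_BATCH):
                partial[y][b] += in_buf[b][c] * w     # contribution of ci{c} of T{b} to co{y}
        for b in range(NUM_BATCH):
            in_buf[b][c] = None                       # ci{c} of every T{b} fully used: delete
    return partial    # later combined across PEs into P0, P1, ..., P15
```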
S605, combining the at least one second characteristic component determined by each data processing unit to obtain an output characteristic corresponding to each input characteristic.
For example, as shown in fig. 8, co0, co16, co32, …, co112 of the output feature P0 corresponding to the input feature T0 may be read from the output data cache of PE0; co1, co17, co33, …, co113 of P0 from PE1; …; and co15, co31, co47, …, co127 of P0 from PE15. These co are then combined in ascending order of their numbers into co0, co1, co2, …, co127, which constitute P0.
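A minimal sketch of this combining step (continuing the assumptions above):

```python
NUM_PE = 16

def combine(per_pe_cos):
    """per_pe_cos[i][j]: co(i + j*16) computed by PE{i} for one output feature.
    Returns the co interleaved in ascending order, i.e. the output feature."""
    num_co = NUM_PE * len(per_pe_cos[0])
    return [per_pe_cos[y % NUM_PE][y // NUM_PE] for y in range(num_co)]
```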
Optionally, to avoid the bandwidth consumption of reading the weight coefficients from the DDR multiple times during the FC-layer computation of the multiple input features, the size of the memory space occupied by the weight coefficients required to complete the computation of all the input features in the FC layer can be determined first; if this space is smaller than the storage capacity of the weight memory, all the required weight coefficients can be read from the DDR at once and stored in the weight memory.
When the weight memory cannot store all the weight coefficients, the plurality of second feature components in the output feature need to be grouped. For example: the weight memory can only store the weight coefficients of ci0, ci1, … and ci5 in co0, co1, … and co100, so that the co0, co1, … and co100 can be calculated first, and then the co101, co102, … and co127 can be calculated.
To sum up, the core idea of this implementation method of the fully-connected layer is: for each output feature, the plurality of second feature components it contains are split into a plurality of groups, and each of the PEs corresponding to the FC layer is responsible for calculating one group. In this way, multiple PEs synchronously perform the FC computation of multiple input features, one output feature is obtained by combining the results of multiple PEs, and the weight coefficients are multiplexed.
The following describes a complete implementation flow of the implementation method of the full connection layer in the embodiment of the present application by using an example.
Suppose 16 input features T0, T1, …, T15 are acquired and stored in batch0, batch1, …, batch15 of the data memory, respectively, each input feature comprising 6 first feature components ci0, ci1, …, ci5 (as shown in fig. 3). The output features corresponding to T0, T1, …, T15 are P0, P1, …, P15, respectively, and each output feature includes 128 second feature components co0, co1, …, co127. The specific implementation flow is as follows:
1) T0, T1, …, T15 are read from batch0, batch1, …, batch15 in the following order and broadcast to PE0, PE1, …, PE15. As shown in fig. 7, T0, T1, …, T15 are input to each PE:
batch0-ci0,batch0-ci1,…,batch0-ci5,
batch1-ci0,batch1-ci1,…,batch1-ci5,
……
batch15-ci0,batch15-ci1,…,batch15-ci5。
2) co0, co1, …, co127 of each output feature are divided into 16 groups (one per PE), where the i-th group includes co(i + j×16), i = 0, 1, 2, …, 15, j = 0, 1, …, 7. PEi is then used to calculate co(i + j×16) of each output feature;
3) According to the co(i + j×16) corresponding to PEi, the corresponding weight coefficients are input to PEi as follows:
ci0-co0(PE0),ci0-co1(PE1),…,ci0-co15(PE15),
ci0-co16(PE0),ci0-co17(PE1),…,ci0-co31(PE15),
……
ci0-co112(PE0),ci0-co113(PE1),…,ci0-co127(PE15),
ci1-co0(PE0),ci1-co1(PE1),…,ci1-co15(PE15),
ci1-co16(PE0),ci1-co17(PE1),…,ci1-co31(PE15),
……
ci5-co112(PE0),ci5-co113(PE1),…,ci5-co127(PE15)
4) As shown in fig. 8, co(i + j×16) of each of the output features P0, P1, …, P15 is calculated in PEi from the weight coefficients and T0, T1, …, T15;
5) The co(i + j×16) of the same output feature are taken from each PE and combined to yield P0, P1, …, P15.
In the embodiment of the application, a plurality of input features for the fully-connected layer are acquired, each input feature comprising a plurality of first feature components; when the total number of acquired input features reaches a preset threshold, the input features are simultaneously input to each of the plurality of data processing units corresponding to the fully-connected layer. The weight coefficient of each first feature component in the output features of the fully-connected layer is then acquired; in each data processing unit, at least one second feature component of the plurality of second feature components contained in the output feature corresponding to each input feature is determined in parallel according to the plurality of first feature components and their corresponding weight coefficients; and finally the at least one second feature component determined by each data processing unit is combined to obtain the output feature corresponding to each input feature. In this embodiment, each output feature is split into a plurality of groups of feature components and each data processing unit calculates one of the groups, so that multiple data processing units jointly compute one output feature, with the operations in the multiple data processing units performed in parallel. This achieves multiplexing of the FC-layer weight coefficients, reduces the bandwidth for reading weight coefficients, and improves the MAC utilization of the neural network.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an implementation apparatus of a full connection layer according to an embodiment of the present disclosure. As shown in the figures, the apparatus in the embodiment of the present application includes:
an obtaining module 901 is configured to obtain a plurality of input features for a fully connected layer, where each input feature in the plurality of input features includes a plurality of first feature components.
In a specific implementation, the CNN includes convolutional layers, pooling layers, and fully-connected layers. In the process of image processing with a convolutional neural network, the fully-connected layer integrates the large number of image features obtained after the processing of the convolutional layers and pooling layers, so that the image can be subsequently classified or otherwise processed. Therefore, the input features for the FC layer generated in the CNN can be acquired in real time and stored in the data memory. The plurality of first feature components of each input feature can be processed in batches: each input feature is stored in one batch, and each first feature component occupies one ci. Each first feature component may be a number, a vector, a matrix, and so on.
A transmission module 902, configured to, when the total number of the obtained multiple input features reaches a first preset threshold, simultaneously input the multiple input features into multiple data processing units corresponding to the full connection layer.
In a specific implementation, the plurality of data processing units corresponding to the FC layer may be configured to perform the FC-layer computation for the input features in the plurality of batches, respectively. The first preset threshold may be determined according to the number of data processing units corresponding to the FC layer and/or the storage capacity of the data memory. For example: if there are 16 data processing units corresponding to the FC layer, the first preset threshold may be 16.
When the total number of the acquired input features reaches the first preset threshold, one of the acquired input features may be simultaneously input to each data processing unit of the plurality of data processing units. For each acquired input feature, the input feature may first be read from the data memory and then input to one of the plurality of data processing units corresponding to the FC layer, where it may be stored in the input data cache of that data processing unit. Accordingly, each data processing unit holds only one input feature, and the input features in different data processing units are different, in order to improve the utilization of the multiply accumulators.
Alternatively, the acquired plurality of input features may be simultaneously input to each data processing unit of the plurality of data processing units. For each acquired input feature, the input feature may first be read from the data memory and then input to each of the plurality of data processing units corresponding to the FC layer, where it may be stored in the input data cache of the data processing unit. Thus, each data processing unit holds all the acquired input features.
The obtaining module 901 is further configured to obtain a weight coefficient of each first feature component in the output features of the fully-connected layer.
In a specific implementation, in the CNN, the weight coefficients required by the FC layer may be read from the DDR and stored in the weight memory. The output characteristic comprises a plurality of second characteristic components, and each first characteristic component contributes to each second characteristic component. Therefore, the obtained weight coefficients include the weight coefficient of each first feature component in each second feature component.
And a processing module 903, which includes the multiple data processing units corresponding to the fully-connected layer and is configured to determine, in the multiple data processing units and in parallel, the output feature corresponding to each input feature according to the multiple first feature components and the weight coefficients corresponding to the multiple first feature components.
In a specific implementation, the weight coefficient of each first feature component in each second feature component may first be broadcast to the data processing units one by one. Once the parameters of the CNN are determined, the weight coefficients required to compute each input feature in the FC layer are the same.
Next, in each data processing unit, a contribution value of each first feature component in one input feature input to the data processing unit to each second feature component in a plurality of second feature components included in an output feature corresponding to the input feature may be determined according to a weight coefficient, where the contribution value may be a product of the first feature component and the weight coefficient.
In order to maximize the multiplexing of each first feature component in the input features and minimize the data buffering in the data processing unit, for the plurality of first feature components in an input feature, the contribution values of one first feature component to every second feature component in the output feature may be calculated first, after which that first feature component is deleted from the data processing unit and computation switches to the next first feature component. That is, a first feature component is switched out only after it has been applied to every second feature component.
Then, the output feature is determined from the contribution values: the sum of the contribution values of all the first feature components to a given second feature component is taken as that second feature component, and the plurality of second feature components together constitute the output feature.
Optionally, in each data processing unit, at least one second feature component of the plurality of second feature components contained in the output feature corresponding to each input feature may be determined according to the plurality of first feature components in each input feature and the weight coefficients corresponding to the plurality of first feature components. The at least one second feature component determined by each data processing unit is then combined to obtain the output feature.
Specifically, the plurality of second feature components may be numbered, and the second feature components may be assigned to each data processing unit one by one in the order of the numbers from small to large for processing. And the number of the second characteristic components contained in each output characteristic is the same. Accordingly, a corresponding weight coefficient may be input to each data processing unit in accordance with the second feature component processed in each data processing unit. Then, in each data processing unit, each of at least one second feature component that the data processing unit is responsible for processing is determined according to the weight coefficient.
In order to maximize the multiplexing of each first feature component in the input features and minimize the data buffering in the data processing unit, computation switches to the next first feature component only after the current one has been applied to every second feature component.
In the embodiment of the application, a plurality of input features for the fully-connected layer are obtained, each input feature comprising a plurality of first feature components; when the total number of obtained input features reaches a preset threshold, the input features are first simultaneously input into the plurality of data processing units corresponding to the fully-connected layer. The weight coefficient of each first feature component in the output features of the fully-connected layer is then acquired, and in the plurality of data processing units the output feature corresponding to each input feature is determined in parallel according to the plurality of first feature components and the weight coefficients corresponding to them. Compared with the prior-art approach of performing the FC-layer computation every time an input feature is generated, the apparatus of this embodiment accumulates input features and, once a certain number have been accumulated, uses the multiple data processing units to compute them in parallel. The FC-layer computation of multiple input features is completed while the weight coefficients are read only once, achieving multiplexing of the FC-layer weight coefficients, greatly reducing the bandwidth for reading weight coefficients, and improving the MAC utilization of the neural network. In addition, by matching the read order of the input feature components with the broadcast order of the weight coefficients, data that has already been consumed is deleted in time, effectively reducing the data-cache pressure in the data processing units.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in the figure, the electronic device may include: at least one processor 1001, such as a CPU; at least one communication interface 1002; at least one memory 1003; and at least one bus 1004. The bus 1004 is used to enable connection and communication among these components. In this embodiment, the communication interface 1002 of the electronic device may be a wired sending port, or a wireless device, for example including an antenna apparatus, for signaling or data communication with other node devices. The memory 1003 may be a high-speed RAM memory or a non-volatile memory, e.g., at least one disk memory; optionally, it may also be at least one storage device located remotely from the processor 1001. A set of program codes is stored in the memory 1003, and the processor 1001 calls the program codes stored in the memory to perform the following operations:
obtaining a plurality of input features for a fully connected layer, each input feature of the plurality of input features comprising a plurality of first feature components;
when the total number of the acquired multiple input features reaches a first preset threshold value, simultaneously inputting the multiple input features to multiple data processing units corresponding to the full connection layer;
acquiring a weight coefficient of each first feature component in the plurality of first feature components in the output feature of the full connection layer;
in the multiple data processing units, according to the multiple first characteristic components and the weight coefficients corresponding to the multiple first characteristic components, output characteristics corresponding to each input characteristic are determined in parallel.
The processor 1001 is further configured to perform the following operation steps:
simultaneously inputting one of the plurality of input features to each of the plurality of data processing units;
the processor 1001 is further configured to perform the following operation steps:
in each data processing unit, determining the output feature corresponding to the one input feature according to the plurality of first feature components in the input feature and the weight coefficients corresponding to the plurality of first feature components.
The processor 1001 is further configured to perform the following operation steps:
simultaneously inputting the plurality of input features to each of the plurality of data processing units.
The processor 1001 is further configured to perform the following operation steps:
in each data processing unit, determining at least one second feature component in a plurality of second feature components included in the output feature corresponding to each input feature according to the plurality of first feature components in each input feature and the weight coefficients corresponding to the plurality of first feature components;
and combining the at least one second characteristic component determined by each data processing unit to obtain an output characteristic corresponding to each input characteristic.
Wherein the output feature corresponding to the one input feature comprises a plurality of second feature components;
the processor 1001 is further configured to perform the following operation steps:
determining a contribution value of each first feature component in the one input feature to each second feature component in the plurality of second feature components according to the weight coefficients corresponding to the plurality of first feature components;
and determining the output feature corresponding to the one input feature according to the contribution value.
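As a non-limiting illustration of the contribution values described above: the contribution of first feature component x[i] to second feature component j is x[i]·W[i][j], and each output component is the accumulation of the contributions over all i.

```python
import numpy as np

def output_from_contributions(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """x: (in_dim,), one input feature; W: (in_dim, out_dim) weight coefficients."""
    out = np.zeros(W.shape[1], dtype=x.dtype)
    for i in range(x.shape[0]):   # one first feature component at a time
        out += x[i] * W[i]        # its contribution to every second component
    return out                    # identical to x @ W
```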
The processor 1001 is further configured to perform the following operation steps:
removing, from each data processing unit, the first feature components whose contribution values have been determined.
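A hedged sketch of this timely-removal step: once a first feature component's contributions to all second feature components have been accumulated, it is popped from the unit's local buffer, bounding the cache pressure. The deque-based buffer and the streaming interface are illustrative assumptions.

```python
from collections import deque

def stream_fc(components, weight_rows):
    """components: scalars x[i] in read order; weight_rows: weight vectors W[i]
    broadcast in the matching order, so each component is consumed exactly once."""
    buffer = deque(components)     # the unit's local feature cache
    out = None
    for row in weight_rows:        # broadcast order matches the read order
        x_i = buffer.popleft()     # remove the consumed component immediately
        contribution = [x_i * w for w in row]
        out = contribution if out is None else [a + c for a, c in zip(out, contribution)]
    return out

# stream_fc([1.0, 2.0], [[0.5, 0.5], [1.0, 0.0]]) -> [2.5, 0.5]
```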
The processor 1001 is further configured to perform the following operation steps:
determining a contribution value of each first feature component in each input feature to each second feature component in the at least one second feature component according to the weight coefficients corresponding to the plurality of first feature components;
determining the at least one second feature component based on the contribution value.
The processor 1001 is further configured to perform the following operation steps:
determining a storage space required for storing the weight coefficients;
when the storage space is smaller than a second preset threshold, performing the operation of obtaining the weight coefficient of each first feature component in the plurality of first feature components in the output feature of the full connection layer.
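A minimal sketch of this storage-space check, with hypothetical constants (the capacity and coefficient width are assumptions; the claims equate the second preset threshold with the storage capacity of a weight memory):

```python
WEIGHT_MEMORY_BYTES = 256 * 1024  # hypothetical weight-memory capacity

def can_read_weights_once(in_dim: int, out_dim: int,
                          bytes_per_coeff: int = 2) -> bool:
    required = in_dim * out_dim * bytes_per_coeff  # space the coefficients need
    return required <= WEIGHT_MEMORY_BYTES         # fits: read them in one pass

# Example: a 512x10 FC layer at 2 bytes per coefficient needs 10,240 bytes,
# so all weight coefficients could be fetched once and kept resident.
```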
It should be noted that the embodiment of the present application also provides a storage medium, where the storage medium is used to store an application program, and the application program is used to execute, when running, the operations performed by the electronic device in the implementation method of the full connection layer shown in fig. 2 and fig. 6.
It should be noted that the embodiment of the present application also provides an application program, where the application program is configured to execute, when running, the operations performed by the electronic device in the implementation method of the full connection layer shown in fig. 2 and fig. 6.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above-mentioned embodiments further explain the objects, technical solutions, and advantages of the present application in detail. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included in the protection scope of the present application.
Claims (8)
1. An image processing method, characterized in that the method comprises:
obtaining a plurality of input features of an image for a fully connected layer, each input feature of the plurality of input features comprising a plurality of first feature components;
when the total number of the acquired input features reaches a first preset threshold value, simultaneously inputting the input features to a plurality of data processing units corresponding to the full connection layer, wherein the first preset threshold value is determined according to the number of the data processing units;
obtaining a weight coefficient of each of the plurality of first feature components in the output feature of the fully-connected layer, including: determining the size of a storage space occupied by the weight coefficients required by the calculation of all the plurality of input features in the fully-connected (FC) layer, and, when the storage space is smaller than or equal to the storage capacity of a weight memory, reading all the weight coefficients required by the FC-layer calculation at one time and storing them in the weight memory;
in the plurality of data processing units, determining the output feature corresponding to each input feature in parallel according to the plurality of first feature components and the weight coefficients corresponding to the plurality of first feature components, including: when the plurality of input features are simultaneously input to each data processing unit of the plurality of data processing units, determining, in each data processing unit, at least one second feature component of a plurality of second feature components contained in the output feature corresponding to each input feature according to the plurality of first feature components in each input feature and the weight coefficients corresponding to the plurality of first feature components; and combining the at least one second feature component determined by each data processing unit to obtain the output feature corresponding to each input feature, wherein each data processing unit holds the acquired plurality of input features;
and classifying or processing the image according to the output feature corresponding to each input feature.
2. The method of claim 1, wherein the determining, in the plurality of data processing units, the output feature corresponding to each input feature in parallel according to the plurality of first feature components and the weight coefficients corresponding to the plurality of first feature components comprises:
when one input feature of the plurality of input features is simultaneously input to each data processing unit of the plurality of data processing units, determining, in each data processing unit, the output feature corresponding to the one input feature according to the plurality of first feature components in the one input feature and the weight coefficients corresponding to the plurality of first feature components.
3. The method of claim 2, wherein the output feature corresponding to the one input feature comprises a plurality of second feature components;
the determining, according to the plurality of first feature components in the one input feature and the weight coefficients corresponding to the plurality of first feature components, the output feature corresponding to the one input feature includes:
determining a contribution value of each first feature component in the one input feature to each second feature component in the plurality of second feature components according to the weight coefficients corresponding to the plurality of first feature components;
and determining the output feature corresponding to the one input feature according to the contribution value.
4. The method according to claim 3, wherein after determining the contribution value of each first feature component in the one input feature to each second feature component in the plurality of second feature components according to the weight coefficients corresponding to the plurality of first feature components, the method further comprises:
removing, from each data processing unit, the first feature components whose contribution values have been determined.
5. The method according to claim 1, wherein the determining at least one second feature component of a plurality of second feature components contained in the output feature corresponding to each input feature according to the plurality of first feature components in each input feature and the weight coefficients corresponding to the plurality of first feature components comprises:
determining a contribution value of each first feature component in each input feature to each second feature component in the at least one second feature component according to the weight coefficients corresponding to the plurality of first feature components;
determining the at least one second feature component based on the contribution value.
6. An image processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a plurality of input features of an image for a fully connected layer, each input feature of the plurality of input features comprising a plurality of first feature components;
a transmission module, configured to, when the total number of the obtained plurality of input features reaches a first preset threshold, simultaneously input the plurality of input features to a plurality of data processing units corresponding to the full connection layer, including: determining the size of a storage space occupied by the weight coefficients required by the calculation of all the plurality of input features in the fully-connected (FC) layer, and, when the storage space is smaller than or equal to the storage capacity of a weight memory, reading all the weight coefficients required by the FC-layer calculation at one time and storing them in the weight memory, wherein the first preset threshold is determined according to the number of the plurality of data processing units;
the obtaining module is further configured to obtain a weight coefficient of each of the plurality of first feature components in the output feature of the fully-connected layer;
a processing module, comprising the plurality of data processing units corresponding to the full connection layer, configured to determine, in the plurality of data processing units, the output feature corresponding to each input feature in parallel according to the plurality of first feature components and the weight coefficients corresponding to the plurality of first feature components, including: when the plurality of input features are simultaneously input to each data processing unit of the plurality of data processing units, determining, in each data processing unit, at least one second feature component of a plurality of second feature components contained in the output feature corresponding to each input feature according to the plurality of first feature components in each input feature and the weight coefficients corresponding to the plurality of first feature components; and combining the at least one second feature component determined by each data processing unit to obtain the output feature corresponding to each input feature, wherein each data processing unit holds the acquired plurality of input features;
the processing module is further configured to classify or process the image according to the output feature corresponding to each input feature.
7. An electronic device, comprising: a processor, a memory, a communication interface, and a bus;
the processor, the memory, and the communication interface are connected through the bus and communicate with one another;
the memory stores executable program code;
the processor reads the executable program code stored in the memory and runs a program corresponding to the executable program code, so as to execute the image processing method according to any one of claims 1 to 5.
8. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the image processing method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811375742.7A CN111199268B (en) | 2018-11-19 | 2018-11-19 | Implementation method and device of full connection layer, electronic equipment and computer readable storage medium |
PCT/CN2019/114085 WO2020103653A1 (en) | 2018-11-19 | 2019-10-29 | Method and apparatus for realizing fully connect layer, and electronic device and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811375742.7A CN111199268B (en) | 2018-11-19 | 2018-11-19 | Implementation method and device of full connection layer, electronic equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111199268A CN111199268A (en) | 2020-05-26 |
CN111199268B (en) | 2023-04-07 |
Family
ID=70745912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811375742.7A Active CN111199268B (en) | 2018-11-19 | 2018-11-19 | Implementation method and device of full connection layer, electronic equipment and computer readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111199268B (en) |
WO (1) | WO2020103653A1 (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10552732B2 (en) * | 2016-08-22 | 2020-02-04 | Kneron Inc. | Multi-layer neural network |
US20180096249A1 (en) * | 2016-10-04 | 2018-04-05 | Electronics And Telecommunications Research Institute | Convolutional neural network system using adaptive pruning and weight sharing and operation method thereof |
CN108122030A (en) * | 2016-11-30 | 2018-06-05 | 华为技术有限公司 | A kind of operation method of convolutional neural networks, device and server |
CN107239824A (en) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | Apparatus and method for realizing sparse convolution neural net accelerator |
CN108805267B (en) * | 2018-05-28 | 2021-09-10 | 重庆大学 | Data processing method for hardware acceleration of convolutional neural network |
- 2018-11-19: CN CN201811375742.7A patent/CN111199268B/en active Active
- 2019-10-29: WO PCT/CN2019/114085 patent/WO2020103653A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111199268A (en) | 2020-05-26 |
WO2020103653A9 (en) | 2020-07-02 |
WO2020103653A1 (en) | 2020-05-28 |
Similar Documents
Publication | Title |
---|---|
JP6806412B2 (en) | Methods and devices for optimizing models applicable to pattern recognition and terminal devices | |
CN108416327B (en) | Target detection method and device, computer equipment and readable storage medium | |
US20210224125A1 (en) | Operation Accelerator, Processing Method, and Related Device | |
CN109978142B (en) | Neural network model compression method and device | |
US20210073569A1 (en) | Pooling device and pooling method | |
TWI770432B (en) | Method, device and electronic apparatus for image restoration and storage medium thereof | |
CN110852961A (en) | Real-time video denoising method and system based on convolutional neural network | |
CN109598250B (en) | Feature extraction method, device, electronic equipment and computer readable medium | |
CN112819157B (en) | Neural network training method and device, intelligent driving control method and device | |
CN111338695A (en) | Data processing method based on pipeline technology and related product | |
CN112783807B (en) | Model calculation method and system | |
WO2021147276A1 (en) | Data processing method and apparatus, and chip, electronic device and storage medium | |
JP6935868B2 (en) | Image recognition device, image recognition method, and program | |
CN114492775A (en) | Data processing method and device, neural network accelerator and storage medium | |
CN111199268B (en) | Implementation method and device of full connection layer, electronic equipment and computer readable storage medium | |
CN112200310B (en) | Intelligent processor, data processing method and storage medium | |
CN110399881B (en) | End-to-end quality enhancement method and device based on binocular stereo image | |
CN113657576A (en) | Convolutional neural network model lightweight method and device, and image identification method | |
CN110781223A (en) | Data processing method and device, processor, electronic equipment and storage medium | |
US20210224632A1 (en) | Methods, devices, chips, electronic apparatuses, and storage media for processing data | |
US11361052B2 (en) | Method of formatting a weight matrix, an accelerator using the formatted weight matrix, and a system including the accelerator | |
CN113627587A (en) | Multichannel convolutional neural network acceleration method and device | |
CN113128673A (en) | Data processing method, storage medium, neural network processor and electronic device | |
CN114416863A (en) | Method, apparatus, and medium for performing model-based parallel distributed reasoning | |
CN110852202A (en) | Video segmentation method and device, computing equipment and storage medium |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 518000 1st floor, building 17, Shenzhen Dayun software Town, 8288 Longgang Avenue, Yuanshan street, Longgang District, Shenzhen City, Guangdong Province. Applicant after: Shenzhen Yuntian lifeI Technology Co.,Ltd. Address before: 518000 1st floor, building 17, Shenzhen Dayun software Town, 8288 Longgang Avenue, Yuanshan street, Longgang District, Shenzhen City, Guangdong Province. Applicant before: SHENZHEN INTELLIFUSION TECHNOLOGIES Co.,Ltd. |
| | GR01 | Patent grant | |