CN114511766A - Image identification method based on deep learning and related device - Google Patents

Image identification method based on deep learning and related device

Info

Publication number
CN114511766A
Authority
CN
China
Prior art keywords
convolution
feature
feature map
convolution kernel
total
Prior art date
Legal status
Pending
Application number
CN202210092535.0A
Other languages
Chinese (zh)
Inventor
徐钒鑫
吴伟煊
刘蓓蓓
吕赫
向伟
Current Assignee
Southwest Minzu University
Original Assignee
Southwest Minzu University
Priority date
Filing date
Publication date
Application filed by Southwest Minzu University
Priority to CN202210092535.0A
Publication of CN114511766A

Classifications

    • G06F18/253 Fusion techniques of extracted features (Pattern recognition)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (Pattern recognition)
    • G06F18/24 Classification techniques (Pattern recognition)
    • G06N3/045 Combinations of networks (Neural networks)
    • G06N3/048 Activation functions (Neural networks)
    • G06N3/08 Learning methods (Neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image recognition method based on deep learning and a related device. Feature extraction is performed on an image to be recognized to obtain a plurality of feature maps to be processed. Multiple rounds of first convolution operations are then performed on each feature map to be processed based on preset first expansion factors, and a first total feature map corresponding to each feature map to be processed is determined from the feature recognition result obtained in each round. The recognition result of the image to be recognized is then determined through a decoder. Because each round of the first convolution operation uses a different first expansion factor, the feature recognition result of each round lies under a different receptive field; and because each round's feature recognition result serves as the input of the next round, the last round of the first convolution operation yields a first total feature map that fuses feature results under different receptive fields. This process greatly reduces the parameter count of the neural network model and improves its convergence speed.

Description

Image identification method based on deep learning and related device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image recognition method and a related apparatus based on deep learning.
Background
Image recognition technology is mostly based on deep-learning image classification algorithms, which extract the features of the image to be recognized through pooling layers or stacked convolution layers. To improve classification accuracy, the related art often performs a downsampling operation to enlarge the receptive field and thereby increase the semantic information of the feature image. However, enlarging the receptive field causes loss of the feature image's detail information.
To address these problems, a feature pyramid is currently the common approach: features of the image to be recognized under different receptive fields are extracted with several convolution kernels, and the features under the different receptive fields are then fused. Although this can restore the detail information of the feature map to a certain extent, it greatly increases the number of parameters of the neural network model and reduces the convergence rate of the network.
Disclosure of Invention
The embodiments of the application provide an image recognition method based on deep learning and a related device, in which multiple rounds of convolution are performed on a feature map to be processed with expansion factors of different values so as to fuse the feature recognition results under different receptive fields. This reduces the parameter count of the neural network model and improves the convergence speed.
In a first aspect, an embodiment of the present application provides an image identification method based on deep learning, where the method includes:
performing feature extraction on an image to be recognized to obtain a plurality of feature maps to be processed that represent different dimensions of the image to be recognized; wherein all feature maps to be processed have the same size;
performing multiple rounds of first convolution operations on each feature map to be processed based on preset first expansion factors, and determining a first total feature map corresponding to each feature map to be processed according to the feature recognition result obtained by each round of the first convolution operation; wherein the first expansion factor corresponding to each round of the first convolution operation is different, and first expansion factors with different values correspond to feature recognition results under different receptive fields;
determining, through a decoder, the preset classification to which the first total feature map belongs, and taking the preset classification as the recognition result of the image to be recognized; wherein the first convolution operation proceeds as follows:
performing a convolution operation on an input item with a first convolution kernel based on the first expansion factor corresponding to the current round to obtain a first sub-feature map corresponding to the input item; and performing a convolution operation on the first sub-feature map with a second convolution kernel to obtain the feature recognition result of the input item under the receptive field corresponding to the first expansion factor; wherein the input item of the first round of the first convolution operation is the feature map to be processed, the input item of each non-first round is the feature recognition result obtained in the previous round, and the first convolution kernel and the second convolution kernel differ in size.
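For illustration only, the following is a minimal PyTorch sketch of one round of the first convolution operation, assuming the 128-channel, 3x3/1x1 configuration described in the embodiments below; the per-map ("each picture") behaviour of the first kernel is modelled here with a grouped convolution, which is an assumption consistent with the parameter counts given later.

```python
import torch
from torch import nn

class FirstConvRound(nn.Module):
    """One round of the 'first convolution operation' (illustrative sketch).

    The first convolution kernel is a dilated 3x3 convolution applied to each
    input map separately (groups=channels), producing the first sub-feature
    maps; the second convolution kernel is a 1x1 convolution that sums the
    responses at the same position across all maps.
    """

    def __init__(self, channels: int = 128, dilation: int = 2):
        super().__init__()
        # padding=dilation keeps the spatial size unchanged so rounds can chain.
        self.first_kernel = nn.Conv2d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation,
                                      groups=channels, bias=False)
        self.second_kernel = nn.Conv2d(channels, channels, kernel_size=1,
                                       bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first_sub_feature = self.first_kernel(x)      # first sub-feature map
        return self.second_kernel(first_sub_feature)  # feature recognition result
```

With this split, one round costs 128 x 3 x 3 + 128 x 128 x 1 x 1 = 17536 weights, matching the per-round parameter count derived later in the description.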
According to the embodiment of the application, feature extraction is performed on the image to be recognized to obtain a plurality of feature maps to be processed corresponding to it. Multiple rounds of first convolution operations are performed on each feature map to be processed based on preset first expansion factors, and a first total feature map corresponding to each feature map to be processed is determined from the feature recognition result obtained in each round. A decoder then determines the preset classification of the first total feature map, yielding the recognition result of the image to be recognized. Because each round of the first convolution operation uses a different first expansion factor, and first expansion factors with different values correspond to different receptive fields, the feature recognition result of each round lies under a different receptive field; and because each round's feature recognition result serves as the input of the next round, the last round of the first convolution operation yields a first total feature map that fuses feature results under different receptive fields. This process greatly reduces the parameter count of the neural network model and improves its convergence speed.
In some possible embodiments, the preset dimensions of the first convolution kernel and the second convolution kernel are the same, and after obtaining the multiple feature maps to be processed that represent different dimensions of the image to be recognized, the method further includes:
before performing multiple rounds of first convolution operations on each feature map to be processed, performing dimension processing on each feature map to be processed so that its dimension is the same as the preset dimension.
In the embodiment of the present application, the preset dimensions of the first convolution kernel and the second convolution kernel are the same, and before performing multiple rounds of the first convolution operation on each feature map to be processed, it is necessary to perform dimension processing on each feature map to be processed, so that the dimension of each feature map to be processed is the same as the preset dimension. Therefore, the sub-feature graph obtained by performing convolution operation on the feature graph to be processed through the first convolution kernel can be directly used for convolution through the second convolution kernel, and the processing speed is improved.
In some possible embodiments, the number of the second convolution kernels is greater than the number of the first convolution kernels, and performing convolution on the input items by using the first convolution kernels includes:
performing convolution operation on each picture in the input item to obtain a first sub-feature map corresponding to each picture; wherein the picture is a feature picture to be processed or a feature identification result;
the performing convolution operation on the first sub-feature map by using a second convolution kernel comprises:
and performing convolution operation on each picture in the input item, and adding pixel values at the same position of each picture in the convolution process to obtain feature extraction results corresponding to all pictures in the input item.
The number of the second convolution kernels is larger than that of the first convolution kernels, and when the first sub-feature graph output by the first convolution kernels is convolved by the second convolution kernels, multiple feature recognition results of the first sub-feature graph under different dimensionalities can be obtained, and detail information of the image is improved.
In some possible embodiments, the determining, by the decoder, the preset classification to which the first total feature map belongs includes:
controlling the regression head to perform a second convolution operation on the first total feature map based on a second expansion factor so as to determine a region of interest corresponding to the first total feature map; and,
controlling the classification head to perform a third convolution operation on the first total feature map based on the second expansion factor so as to determine a classification detection result of the first total feature map; wherein the second expansion factor is different in value from the first expansion factor;
if the classification detection result represents that the first total feature map is an identifiable image, performing category identification on the region of interest of the first total feature map, and determining a preset classification to which the first total feature map belongs according to an identification result; if the image is not the recognizable image, outputting prompt information indicating that the image cannot be recognized.
In the embodiment of the application, the regression head is controlled to determine the region of interest of the first total feature map while the classification head is controlled to determine the classification detection result of the first total feature map; when the classification detection result indicates that the first total feature map is a recognizable image, class recognition is performed on the region of interest to determine its preset classification. The regression head and the classification head are processed synchronously in parallel, which improves the processing speed.
In some possible embodiments, controlling the regression head to perform a second convolution operation on the first total feature map based on a second expansion factor to determine the region of interest corresponding to the first total feature map includes:
performing convolution operation on the first total feature map by adopting a third convolution kernel based on the second expansion factor to obtain a second sub-feature map under the receptive field corresponding to the second expansion factor; wherein the third convolution kernel is the same size as the first convolution kernel;
performing a convolution operation on the second sub-feature map with a fourth convolution kernel to obtain a second total feature map corresponding to the second sub-feature map; and performing a convolution operation on the second total feature map with a fifth convolution kernel to obtain the region of interest; wherein the fourth convolution kernel is the same size as the second convolution kernel, the number of fourth convolution kernels is greater than the number of third convolution kernels, and the fifth convolution kernel is the same size as the first convolution kernel.
In the embodiment of the application, the regression head is controlled to perform a convolution operation on the first total feature map with a third convolution kernel based on the second expansion factor to obtain a second sub-feature map under the corresponding receptive field, then to perform a convolution operation on the second sub-feature map with a fourth convolution kernel to obtain a second total feature map representing the feature fusion result of the second sub-feature maps, and finally to determine the region of interest from the second total feature map with a fifth convolution kernel that has region-of-interest recognition capability, improving the selection precision of the region of interest.
In some possible embodiments, controlling the classification head to perform a third convolution operation on the first total feature map based on the second expansion factor to determine the classification detection result of the first total feature map includes:
performing convolution operation on the first total feature map by adopting a sixth convolution kernel based on the second expansion factor to obtain a third sub-feature map under the receptive field corresponding to the second expansion factor; wherein the sixth convolution kernel is the same size as the first convolution kernel;
performing a convolution operation on the third sub-feature map with a seventh convolution kernel to obtain a third total feature map corresponding to the third sub-feature map; and performing a convolution operation on the third total feature map with an eighth convolution kernel to obtain the classification detection result; wherein the sixth and eighth convolution kernels are each the same size as the first convolution kernel, the seventh convolution kernel is the same size as the second convolution kernel, and the number of seventh convolution kernels is greater than the number of sixth convolution kernels.
The classification head is controlled to perform convolution operation on the first total feature map by adopting a sixth convolution kernel based on the second expansion factor to obtain a third sub-feature map under the receptive field corresponding to the second expansion factor, then the third sub-feature map is subjected to convolution operation by adopting a seventh convolution kernel to obtain a third total feature map representing the feature fusion result of the third sub-feature map, and finally the third total feature map capable of being identified by the neural network model is screened out by an eighth convolution kernel with classification detection capability so as to improve the model identification precision.
In some possible embodiments, each of the classification head and the regression head in the decoder includes at least one expansion convolution block.
In the application, multiple rounds of convolution with expansion factors of different values are performed on the feature map to be processed so as to fuse the feature recognition results under different receptive fields. No feature-pyramid-style multi-scale feature fusion is needed, which avoids the feature loss caused by unequal feature weights during such fusion; the decoder of the application needs only 1 regression head and 1 classification head to guarantee the recognition precision of the classification result.
In a second aspect, an embodiment of the present application provides an image recognition apparatus based on deep learning, the apparatus including:
the feature extraction module is configured to perform feature extraction on an image to be recognized and obtain a plurality of feature maps to be processed that represent different dimensions of the image to be recognized; wherein all feature maps to be processed have the same size;
the feature fusion module is configured to perform multiple rounds of first convolution operations on each feature map to be processed based on preset first expansion factors, so as to determine a first total feature map corresponding to each feature map to be processed according to the feature recognition result obtained by each round of the first convolution operation; wherein the first expansion factor corresponding to each round of the first convolution operation is different, and first expansion factors with different values correspond to feature recognition results under different receptive fields;
the image recognition module is configured to determine, through a decoder, the preset classification to which the first total feature map belongs, and to take the preset classification as the recognition result of the image to be recognized; wherein the first convolution operation proceeds as follows:
performing a convolution operation on an input item with a first convolution kernel based on the first expansion factor corresponding to the current round to obtain a first sub-feature map corresponding to the input item; and performing a convolution operation on the first sub-feature map with a second convolution kernel to obtain the feature recognition result of the input item under the receptive field corresponding to the first expansion factor; wherein the input item of the first round of the first convolution operation is the feature map to be processed, the input item of each non-first round is the feature recognition result obtained in the previous round, and the first convolution kernel and the second convolution kernel differ in size.
In some possible embodiments, the preset dimensions of the first convolution kernel and the second convolution kernel are the same, and after the obtaining of the multiple feature maps to be processed representing different dimensions of the image to be recognized is performed, the feature extraction module is further configured to:
before performing multiple rounds of first convolution operations on each feature map to be processed, performing dimension processing on each feature map to be processed so that the dimension of each feature map to be processed is the same as the preset dimension.
In some possible embodiments, the number of second convolution kernels is greater than the number of first convolution kernels, and in performing the convolution operation on the input item with the first convolution kernel, the feature fusion module is configured to:
performing convolution operation on each picture in the input item to obtain a first sub-feature map corresponding to each picture; wherein the picture is a feature picture to be processed or a feature identification result;
performing the convolution operation on the first sub-feature map with the second convolution kernel, the feature fusion module configured to:
and performing convolution operation on each picture in the input item, and adding pixel values at the same position of each picture in the convolution process to obtain feature extraction results corresponding to all pictures in the input item.
In some possible embodiments, in determining, through the decoder, the preset classification to which the first total feature map belongs, the image recognition module is configured to:
control the regression head to perform a second convolution operation on the first total feature map based on a second expansion factor so as to determine a region of interest corresponding to the first total feature map; and,
controlling the classification head to perform a third convolution operation on the first total feature map based on the second expansion factor so as to determine a classification detection result of the first total feature map; wherein the second expansion factor is different in value from the first expansion factor;
and if the classification detection result represents that the first total feature map is an identifiable image, performing category identification on the region of interest of the first total feature map, and determining a preset classification to which the first total feature map belongs according to the identification result.
In some possible embodiments, in controlling the regression head to perform a second convolution operation on the first total feature map based on a second expansion factor to determine the region of interest corresponding to the first total feature map, the image recognition module is configured to:
performing convolution operation on the first total feature map by adopting a third convolution kernel based on the second expansion factor to obtain a second sub-feature map under the receptive field corresponding to the second expansion factor; wherein the third convolution kernel is the same size as the first convolution kernel;
performing a convolution operation on the second sub-feature map with a fourth convolution kernel to obtain a second total feature map corresponding to the second sub-feature map; and performing a convolution operation on the second total feature map with a fifth convolution kernel to obtain the region of interest; wherein the third and fifth convolution kernels are both the same size as the first convolution kernel, the fourth convolution kernel is the same size as the second convolution kernel, and the number of fourth convolution kernels is greater than the number of third convolution kernels.
In some possible embodiments, in controlling the classification head to perform a third convolution operation on the first total feature map based on the second expansion factor to determine the classification detection result of the first total feature map, the image recognition module is configured to:
performing convolution operation on the first total feature map by adopting a sixth convolution kernel based on the second expansion factor to obtain a third sub-feature map under the receptive field corresponding to the second expansion factor; wherein the sixth convolution kernel is the same size as the first convolution kernel;
performing a convolution operation on the third sub-feature map with a seventh convolution kernel to obtain a third total feature map corresponding to the third sub-feature map; and performing a convolution operation on the third total feature map with an eighth convolution kernel to obtain the classification detection result; wherein the sixth and eighth convolution kernels are each the same size as the first convolution kernel, the seventh convolution kernel is the same size as the second convolution kernel, and the number of seventh convolution kernels is greater than the number of sixth convolution kernels.
In some possible embodiments, each of the classification head and the regression head in the decoder includes at least one expansion convolution block.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement any of the methods as provided in the first aspect of the application.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the electronic device to perform any one of the methods as provided in the first aspect of the present application.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program that, when executed by a processor, implements any of the methods as provided in the first aspect of the present application.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a feature pyramid FPN according to an embodiment of the present disclosure;
fig. 2 is a schematic view of a Yolof network structure shown in the embodiment of the present application;
fig. 3a is a flowchart illustrating an overall image recognition method based on deep learning according to an embodiment of the present application;
FIG. 3b is a schematic diagram of a neural network structure according to the present application;
fig. 4 is a block diagram of an image recognition apparatus 400 based on deep learning according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described in detail and clearly with reference to the accompanying drawings. In the description of the embodiments of the present application, "/" will mean "or" unless otherwise specified, for example, a/B may mean a or B; "and/or" in the text is only an association relationship describing an associated object, and means that three relationships may exist, for example, a and/or B may mean: three cases of a alone, a and B both, and B alone exist, and in addition, "a plurality" means two or more than two in the description of the embodiments of the present application.
In the description of the embodiments of the present application, the term "plurality" means two or more unless otherwise specified, and other terms and the like should be understood similarly, and the preferred embodiments described herein are only for the purpose of illustrating and explaining the present application, and are not intended to limit the present application, and features in the embodiments and examples of the present application may be combined with each other without conflict.
To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide method steps as shown in the following embodiments or figures, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In steps where no necessary causal relationship exists logically, the order of execution of these steps is not limited to the order of execution provided by the embodiments of the present application. The method may be executed in sequence or in parallel according to the embodiments or methods shown in the drawings during actual processing or execution by a control device.
As mentioned above, in order to improve the classification accuracy in the related art, a downsampling operation is performed to enlarge the scope of the receptive field, so as to increase the semantic information of the feature image. However, the scope of the receptive field is enlarged, which results in the loss of detail information of the characteristic image.
In the related art, a feature pyramid is mostly adopted: features of the image to be recognized under different receptive fields are extracted with several convolution kernels, and the features under the different receptive fields are then fused. Specifically, FIG. 1 illustrates the network structure of a feature pyramid (FPN). As shown in fig. 1, the FPN includes a bottom-up convolution pathway (the left side of fig. 1) and a top-down convolution pathway (the right side of fig. 1). Because higher layers carry richer feature semantics while lower layers carry less semantics but more position information, the FPN laterally connects, from top to bottom, the low-resolution high-semantic higher-layer features with the high-resolution low-semantic lower-layer features, fusing each feature map with its adjacent map from the left pathway, so that the features at every scale have rich semantic information and the detail information of the feature maps is improved. Concretely, of two adjacent feature levels, the higher-layer feature is upsampled 2x (i.e., a suitable interpolation algorithm inserts new elements between the pixels of the original map, doubling the image size), the lower-layer feature is passed through a 1x1 convolution to change its channel count, and the corresponding elements of the upsampled result and the 1x1 convolution result are then simply added.
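As an illustration of the fusion step just described (not part of the claimed method), one top-down FPN merge might look like the following PyTorch sketch; the channel counts are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

def fpn_merge(top: torch.Tensor, low: torch.Tensor,
              lateral: nn.Conv2d) -> torch.Tensor:
    """One top-down FPN step: upsample the higher-level feature 2x by
    interpolation, run the lower-level feature through a 1x1 convolution
    to match channel counts, then add the two element-wise."""
    upsampled = F.interpolate(top, scale_factor=2, mode="nearest")
    return upsampled + lateral(low)

# Example: fuse a (N, 256, 13, 13) top feature with a (N, 512, 26, 26) low one.
lateral = nn.Conv2d(512, 256, kernel_size=1)  # 1x1 conv changes the channel count
merged = fpn_merge(torch.randn(1, 256, 13, 13),
                   torch.randn(1, 512, 26, 26), lateral)
```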
The advantage of this method is that feature extraction is performed on the image at every scale, so multi-scale feature maps can be generated and all of them carry strong semantic information. Its disadvantages are a larger number of model parameters and a larger memory footprint.
In order to solve the above problems, the inventive concept of the present application is as follows: feature extraction is performed on the image to be recognized to obtain a plurality of feature maps to be processed. Multiple rounds of first convolution operations are performed on each feature map to be processed based on preset first expansion factors, and a first total feature map corresponding to each feature map to be processed is determined from the feature recognition result obtained in each round. A decoder then determines the preset classification of the first total feature map, yielding the recognition result of the image to be recognized. Because each round of the first convolution operation uses a different first expansion factor, and first expansion factors with different values correspond to different receptive fields, the feature recognition result of each round lies under a different receptive field; and because each round's feature recognition result serves as the input of the next round, the last round yields a first total feature map that fuses feature results under different receptive fields. This process greatly reduces the parameter count of the neural network model and improves its convergence speed.
To facilitate understanding of the technical solution provided by the present application, a brief description is first made on the Yolof neural network, as shown in fig. 2:
fig. 2 shows a network structure of the Yolof neural network, including a feature extraction skeleton Backbone, an Encoder and a Decoder. The Backbone is used for performing feature extraction on the image to be identified so as to obtain a feature image of the image to be identified.
The output of the Backbone in this network structure comes in two versions: one is a feature image with a downsampling rate of 32 relative to the input image and 2048 channels, i.e., C5 shown in fig. 2; the other is a feature image with a downsampling rate of 16 and 2048 channels, i.e., DC5 shown in fig. 2.
The feature image is then input into the Encoder; this network structure uses a Dilated Encoder as the encoder. The Dilated Encoder takes the features output by the Backbone as input, first reduces the number of feature channels to 512 with a 1x1 convolution and a 3x3 convolution, and then extracts features with 4 consecutive residual modules. In each residual module, a 1x1 convolution reduces the number of feature channels to 1/4, a 3x3 expansion convolution (i.e., dilated or atrous convolution) then enlarges the receptive field, and finally a 1x1 convolution expands the channel count 4 times. The expansion factors of the dilated convolutions in the 4 residual modules are 2, 4, 6 and 8 respectively. Feature fusion results of the feature image under different receptive fields are thus obtained after 4 rounds of expansion convolution operations (one round per expansion factor).
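A sketch of one such residual module, following the description above (the BN/ReLU placement is an assumption based on the usual Yolof implementation):

```python
from torch import nn

class DilatedResidualBlock(nn.Module):
    """Yolof Dilated Encoder residual module: a 1x1 conv reduces channels
    to 1/4, a dilated 3x3 conv enlarges the receptive field, and a 1x1
    conv expands the channels back 4x; the input is added residually."""

    def __init__(self, channels: int = 512, dilation: int = 2):
        super().__init__()
        mid = channels // 4
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)

# Four consecutive modules with expansion factors 2, 4, 6 and 8:
blocks = nn.Sequential(*(DilatedResidualBlock(512, d) for d in (2, 4, 6, 8)))
```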
Finally, the feature fusion result is input into the Decoder. The Decoder of Yolof includes a regression head containing 4 expansion convolution blocks (Conv-BN-ReLU) and a classification head containing 2 expansion convolution blocks (Conv-BN-ReLU); an expansion convolution block performs several expansion convolution operations. Each anchor in the regression head carries an objectness prediction for its prediction box (i.e., the region of interest), and the final classification score is obtained by multiplying the output of the classification branch by the objectness prediction. That is, the regression head determines the region of interest in the feature fusion result, and the classification head performs classification recognition on that region of interest to obtain the classification of the input image.
The feature fusion process of the encoder in the Yolof neural network uses only single-level features for detection and does not need multi-layer feature detection like the feature pyramid FPN shown in fig. 1, so its memory footprint is greatly reduced compared with FPN. On this basis, the Yolof-based network structure of the application performs multiple rounds of convolution operations with convolution kernels of different sizes on the feature map to be processed to obtain the first total feature map that fuses feature results under different receptive fields; the parameters these operations require are greatly reduced compared with Yolof, so the model convergence rate is improved.
The following describes an image recognition method based on deep learning in detail with reference to the accompanying drawings, specifically as shown in fig. 3a, including the following steps:
step 301: performing feature extraction on an image to be identified to obtain a plurality of feature maps to be processed, which represent the image to be identified and have different dimensions; wherein the sizes of all feature graphs to be processed are the same;
the embodiment of the application selects RegNetX-400MF as a basic construction feature extraction network, the network comprises 4 modules, the feature graph output by each module is reduced by 2 times compared with the feature graph output by the previous module, and downsampling is completed by increasing convolution step length. Each module is formed by cascading different block numbers, wherein the number is [1,2,7 and 12], and a similar residual error network structure is formed. Each convolution layer is followed by a Batch Normalization layer (Batch Normalization) and a correction nonlinear function layer (ReLU). The Batch Normalization layer can normalize the output, and the ReLU layer adds nonlinearity to the output of the upper layer. The characteristic dimension of RegNetX-400MF is 384. That is, feature extraction is performed on N images to be recognized to obtain a feature image of (N, 384, H, W), where H and W are the length and width dimensions of the image. Namely N pieces of characteristic maps to be processed with 384 dimensions and H multiplied by W.
Step 302: performing multiple rounds of first convolution operations on each feature map to be processed based on preset first expansion factors, and determining a first total feature map corresponding to each feature map to be processed according to the feature recognition result obtained by each round of the first convolution operation; wherein the first expansion factor corresponding to each round of the first convolution operation is different, and first expansion factors with different values correspond to feature recognition results under different receptive fields;
determining, through a decoder, the preset classification to which the first total feature map belongs, and taking the preset classification as the recognition result of the image to be recognized; wherein the first convolution operation proceeds as follows:
performing a convolution operation on an input item with a first convolution kernel based on the first expansion factor corresponding to the current round to obtain a first sub-feature map corresponding to the input item; and performing a convolution operation on the first sub-feature map with a second convolution kernel to obtain the feature recognition result of the input item under the receptive field corresponding to the first expansion factor; wherein the input item of the first round of the first convolution operation is the feature map to be processed, the input item of each non-first round is the feature recognition result obtained in the previous round, and the first convolution kernel and the second convolution kernel differ in size.
In some possible embodiments, because the image dimension to be convolved needs to be the same as the dimension of the convolution kernel, the preset dimensions of the first convolution kernel and the second convolution kernel can be set to be the same, so that the sub-feature map obtained by performing convolution operation on the feature map to be processed through the first convolution kernel can be directly used for performing convolution on the second convolution kernel, and the processing rate is improved.
Further, in the embodiment of the present application, before performing multiple rounds of first convolution operations on each feature map to be processed, dimension processing is performed on each feature map to be processed so that its dimension is the same as the preset dimension.
Specifically, the feature image extracted by the feature extraction backbone in the Yolof neural network is (N, 512, 25, 25), i.e., N 512-dimensional feature images of size 25 x 25. The application continues to use the convolution framework of Yolof, so a feature map to be processed of size (N, 512, 25, 25) is required as input. Therefore, the (N, 384, H, W) feature map obtained in step 301 is first raised in dimension with a 1x1 convolution (512 kernels of size 1x1), then resized into a feature map to be processed of size (N, 512, 25, 25); a standard 3x3 convolution operation is then applied to the (N, 512, 25, 25) feature map to be processed to refine the contextual semantics.
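A sketch of this dimension alignment; the bilinear resizing mode is an assumption, as the text does not specify the interpolation used:

```python
import torch
import torch.nn.functional as F
from torch import nn

raise_dim = nn.Conv2d(384, 512, kernel_size=1)           # 1x1 conv: 384 -> 512
refine = nn.Conv2d(512, 512, kernel_size=3, padding=1)   # standard 3x3 conv

def make_pending_feature_map(features: torch.Tensor) -> torch.Tensor:
    """(N, 384, H, W) backbone features -> (N, 512, 25, 25) map to be processed."""
    x = raise_dim(features)
    x = F.interpolate(x, size=(25, 25), mode="bilinear", align_corners=False)
    return refine(x)  # refines the contextual semantics
```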
In order to obtain feature fusion results of the feature map to be processed under different receptive fields, the embodiment of the present application sets first expansion factors of multiple values, each value corresponding to one round of the first convolution operation. A first total feature map representing the fusion of feature results under different receptive fields can therefore be obtained by performing multiple rounds of first convolution operations on the image to be processed.
In the embodiment of the present application, the number of second convolution kernels is greater than the number of first convolution kernels. When the first convolution operation of step 302 is executed, the first convolution kernel first performs a convolution operation on each picture in the input item to obtain the first sub-feature map corresponding to each picture, where a picture is a feature map to be processed or a feature recognition result. The second convolution kernel then performs a convolution operation on the first sub-feature maps, and pixel values at the same position are added during the convolution to obtain the feature extraction result corresponding to all pictures in the input item.
Specifically, the first expansion factors of the application take the values 2, 4, 6 and 8, i.e., 4 rounds of the first convolution operation are needed for the feature map to be processed. In the first round, a 1x1 convolution is first applied to the (N, 512, 25, 25) feature map to be processed to obtain the required (N, 128, 25, 25) feature map to be processed; then, with a first expansion factor of value 2, a 128-dimensional 3x3 first convolution kernel performs a convolution operation on the feature map to be processed to obtain first sub-feature maps of size (N, 128, 25, 25). Taking one feature map to be processed as an example, the convolution by the first convolution kernel yields first sub-feature maps of the feature map to be processed under 128 different dimensions (i.e., 128 first sub-feature maps).
Next, the above (N, 128, 25, 25) first sub-feature maps are convolved with 128 second convolution kernels of 128 dimensions and size 1x1, and pixel values at the same position of different maps are added during the convolution to obtain N x 128 feature recognition results of size (N, 128, 25, 25). The first round of the first convolution operation is then complete: through the convolutions of the first and second convolution kernels, it yields, in 128 dimensions, the feature fusion result of each first sub-feature map under the receptive field corresponding to the first expansion factor of value 2.
Further, the above operation of convolving with the first convolution kernel and then the second convolution kernel is repeated with the first expansion factors of values 4, 6 and 8 respectively. Unlike the first round, the input of each non-first round of the first convolution operation is the feature recognition result obtained in the previous round: the second round takes feature recognition result 1 from the first round as its input and produces feature recognition result 2; the third round takes feature recognition result 2 as input and produces feature recognition result 3; and the fourth round takes feature recognition result 3 as input and produces feature recognition result 4. Because the receptive field of each round's feature recognition result is different, result 4 obtained in the final fourth round is equivalent to a fusion of feature recognition results under different receptive fields, i.e., the first total feature map.
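Chaining the four rounds per the description above, reusing the FirstConvRound sketch given after the first-aspect description (the 512-to-128 reduction is the 1x1 convolution mentioned at the start of the first round):

```python
from torch import nn

# FirstConvRound: see the illustrative sketch earlier in this document.
reduce = nn.Conv2d(512, 128, kernel_size=1)  # (N, 512, 25, 25) -> (N, 128, 25, 25)
rounds = nn.ModuleList(FirstConvRound(128, d) for d in (2, 4, 6, 8))

def first_total_feature_map(pending):
    x = reduce(pending)
    for r in rounds:   # each round's output is the next round's input
        x = r(x)
    return x           # fuses feature results under all four receptive fields
```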
In the Yolof decoder processing stage, a convolution operation must be performed on (N, 128, 25, 25) feature images with 128-dimensional convolution kernels of size 3x3 based on the expansion factor of each round, so each round requires 128 x 3 x 3 x 128 = 147456 parameters. In the above flow, each round of the first convolution operation of the application requires 128 x 3 x 3 + 128 x 128 x 1 x 1 = 17536 parameters. Compared with Yolof, the application reduces the parameter count by nearly 9 times per round, greatly improving the processing speed of the neural network.
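The arithmetic behind this comparison (biases ignored), as a quick check:

```python
yolof_per_round = 128 * 3 * 3 * 128               # standard 3x3 conv: 147456
ours_per_round = 128 * 3 * 3 + 128 * 128 * 1 * 1  # depthwise 3x3 + pointwise 1x1: 17536
print(yolof_per_round, ours_per_round,
      round(yolof_per_round / ours_per_round, 1))  # 147456 17536 8.4
```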
When the decoder determines the preset classification of the first total feature map, the regression head needs to be controlled to perform second convolution operation on the first total feature map based on a second expansion factor so as to determine the region of interest corresponding to the first total feature map; and controlling the classification head to perform third convolution operation on the first total feature map based on the second expansion factor so as to determine a classification detection result of the first total feature map.
In practice, for the regression head: a convolution operation is performed on the first total feature map with a third convolution kernel based on the second expansion factor to obtain a second sub-feature map under the receptive field corresponding to the second expansion factor; a convolution operation is performed on the second sub-feature map with a fourth convolution kernel to obtain a second total feature map corresponding to the second sub-feature map; and a convolution operation is performed on the second total feature map with a fifth convolution kernel to obtain the region of interest.
Specifically, the (N, 128, 25, 25) first total feature map is taken as the input item, and a convolution operation is performed with a 128-dimensional 3x3 third convolution kernel based on a second expansion factor of value 1 to obtain N x 128 second sub-feature maps, i.e., second sub-feature maps of size (N, 128, 25, 25). The (N, 128, 25, 25) second sub-feature maps are convolved with 128-dimensional 1x1 fourth convolution kernels to obtain the second total feature map under the receptive field corresponding to the second expansion factor, which is then raised to 512 dimensions. The (N, 512, 25, 25) second total feature map is convolved with 3x3 fifth convolution kernels of shape ((4+1) x 9, 512, 3, 3) to obtain an output of size (N, (4+1) x 9, 25, 25); of this, (N, 1 x 9, 25, 25) is transmitted to the classification head over a residual connection, and the remaining (N, 4 x 9, 25, 25) outputs the region of interest. Here, "4" represents the position of a region of interest, "9" represents the number of regions of interest, and "1" represents the presence of an object within the feature map.
For the classification head: a convolution operation is performed on the first total feature map with a sixth convolution kernel based on the second expansion factor to obtain a third sub-feature map under the receptive field corresponding to the second expansion factor; a convolution operation is performed on the third sub-feature map with a seventh convolution kernel to obtain a third total feature map corresponding to the third sub-feature map; and a convolution operation is performed on the third total feature map with an eighth convolution kernel to obtain the classification detection result.
Specifically, the (N, 128, 25, 25) first total feature map is taken as the input item, and a convolution operation is performed with a 128-dimensional 3x3 sixth convolution kernel based on a second expansion factor of value 1 to obtain N x 128 third sub-feature maps, i.e., third sub-feature maps of size (N, 128, 25, 25). The (N, 128, 25, 25) third sub-feature maps are convolved with 128-dimensional 1x1 seventh convolution kernels to obtain the third total feature map under the receptive field corresponding to the second expansion factor, which is then raised to 512 dimensions; the (N, 512, 25, 25) third total feature map is convolved with 3x3 eighth convolution kernels of shape (9 x 34, 512, 3, 3) to obtain a third total feature map of size (N, 9 x 34, 25, 25). The (N, 1 x 9, 25, 25) output of the regression head is expanded to (N, 1 x 9 x 34, 25, 25) and residually connected to the (N, 9 x 34, 25, 25) obtained above, and after the convolution operation the 9 most likely classes are obtained. The experimental data of the application are trained on mahjong tiles, with 9 recognition possibilities corresponding to the "9" above, i.e., 9 image contents the neural network can recognize. Mahjong has 34 classification results, corresponding to the "34" above, i.e., the classifications the neural network can recognize (for example, the "character (wan)", "two of circles", and "east wind" tiles). This process amounts to a preliminary classification detection of the image content of the third total feature map: the third total feature maps recognizable by the neural network model (the 9 possibilities above) are screened out, and the region of interest of each screened third total feature map is then classified according to the region-of-interest position returned by the regression head, so as to determine the preset classification to which it belongs (i.e., which of the 34 mahjong-tile classifications). This preset classification is the final recognition result of the image to be recognized.
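A structural sketch of the two heads as just described; the kernel shapes follow the text, while the padding choices (which keep the 25 x 25 spatial size) and the grouped first convolution are assumptions:

```python
from torch import nn

def decoder_head(out_channels: int) -> nn.Sequential:
    """Common pattern of the regression and classification heads: a 3x3
    convolution under the second expansion factor (value 1), a 1x1 fusion
    convolution, a 1x1 step up to 512 dimensions, then a 3x3 prediction
    convolution producing out_channels maps."""
    return nn.Sequential(
        nn.Conv2d(128, 128, 3, padding=1, dilation=1, groups=128),
        nn.Conv2d(128, 128, 1),
        nn.Conv2d(128, 512, 1),
        nn.Conv2d(512, out_channels, 3, padding=1),
    )

num_anchors, num_classes = 9, 34
regression_head = decoder_head((4 + 1) * num_anchors)   # box offsets + objectness
classification_head = decoder_head(num_anchors * num_classes)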
It should be noted that the third, fourth and fifth convolution kernels are used to determine the position of the region of interest, while the sixth, seventh and eighth convolution kernels are used to screen out the total feature maps that the neural network can recognize. Although the dimensions and sizes of the third and fourth convolution kernels are the same as those of the first and second convolution kernels, and likewise for the sixth and seventh convolution kernels, the internal configuration parameters that determine each kernel's specific function differ; these can be set based on actual requirements, which the present application does not limit.
In the decoder proposed by Yolof, the classification head first performs one complete expansion convolution operation, with expansion factor 1, on the (N, 512, 25, 25) feature map produced by the Encoder, then repeats it, performing 2 complete expansion convolution operations with expansion factor 1 in total; the resulting feature map is classified with a standard 3x3 convolution, and the result is residually connected with the (N, (4+1) x 9, 25, 25) feature map from the regression head to obtain the classes on an (N, 9, 25, 25) feature map. The regression head likewise first performs one complete expansion convolution operation, with expansion factor 1, on the (N, 512, 25, 25) feature map produced by the Encoder, then repeats the step 3 more times, performing 4 complete expansion convolution operations with expansion factor 1 in total; a standard 3x3 convolution then performs feature regression to obtain an (N, (4+1) x 9, 25, 25) feature map, of which (N, 1 x 9, 25, 25) is passed to the classification head residual connection and the remaining (N, 4 x 9, 25, 25) outputs the region of interest.
It should be noted here that, under any given expansion factor, performing the convolution operations with the different convolution kernels corresponds, in the neural network structure, to one complete expansion convolution operation executed by one expansion convolution block. Taking the regression head of the present application as an example, the whole process of performing the convolution operations with the third, fourth and fifth convolution kernels based on the second expansion factor is equivalent to the operation executed by one expansion convolution block. It can be seen that the classification head of Yolof mentioned above needs to complete 2 expansion convolution operations with an expansion factor of 1 and its regression head needs to complete 4, i.e. the classification head of Yolof contains 2 expansion convolution blocks and its regression head contains 4. In the flow of the present application described above, the classification head and the regression head each perform only one expansion convolution operation, based on the second expansion factor; that is, only one expansion convolution block needs to be provided in each of the classification head and the regression head.
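As an illustration of this block structure, the sketch below (PyTorch; channel counts taken from the shapes stated above, everything else assumed) expresses one expansion convolution block as a dilated 3 × 3 convolution followed by a 1 × 1 convolution, and contrasts the 2-block and 4-block heads of Yolof with the single block per head used here.

```python
import torch.nn as nn

def expansion_conv_block(channels, dilation):
    # One complete expansion convolution operation: a dilated 3x3 convolution
    # followed by a 1x1 convolution (channel counts are illustrative).
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
        nn.Conv2d(channels, channels, 1))

# Yolof-style decoder towers: 2 blocks in the classification head and 4 in
# the regression head, all with expansion factor 1.
yolof_cls_tower = nn.Sequential(*[expansion_conv_block(512, 1) for _ in range(2)])
yolof_reg_tower = nn.Sequential(*[expansion_conv_block(512, 1) for _ in range(4)])

# This application: a single block per head, driven by the second expansion
# factor (value 1 in the experiments described above).
ours_cls_tower = expansion_conv_block(128, 1)
ours_reg_tower = expansion_conv_block(128, 1)
```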
The network structure of the present application is shown in fig. 3b. First, the image to be recognized is input into RegNetX for feature extraction. The extracted feature maps to be processed are then handled by a modified version of the encoder flow in Yolof: the original 4 rounds of expansion convolution performed with a 3 × 3 convolution kernel are replaced so that, based on E1 to E4 (i.e., the 4 values of the first expansion factor), 4 rounds of the first convolution operation are performed using a first convolution kernel of 3 × 3 size and a second convolution kernel of 1 × 1 size, yielding the first total feature map. Finally, the first total feature map is fed into the regression head and the classification head of the decoder respectively. The regression head is used to determine the region of interest to be identified, and the classification head is used to classify the image content in the region of interest, giving the classification to which the image to be recognized belongs.
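A minimal sketch of this modified encoder flow follows, under the assumption that the four first expansion factors E1 to E4 take the illustrative values (2, 4, 6, 8) (their actual values are not stated in this passage) and that each round feeds the previous round's feature identification result forward:

```python
import torch
import torch.nn as nn

def first_conv_round(channels, dilation):
    # One round of the first convolution operation: a 3x3 first convolution
    # kernel dilated by the given first expansion factor, then a 1x1 second
    # convolution kernel.
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
        nn.Conv2d(channels, channels, 1))

# E1..E4 assumed as (2, 4, 6, 8); each round consumes the previous result.
encoder = nn.Sequential(*[first_conv_round(128, d) for d in (2, 4, 6, 8)])

x = torch.randn(1, 128, 25, 25)   # a feature map to be processed
first_total = encoder(x)          # first total feature map, (1, 128, 25, 25)
print(first_total.shape)
```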
Mahjong tiles are used as the training set to verify the feasibility of the technical solution of the present application. First, a mahjong tile data set is obtained, annotated and cleaned, and divided into a training set and a test set. Feature extraction is used to obtain the feature maps to be processed for the target data set, and progressively stacked expansion convolutions are used to extract and fuse the features of those maps, enhancing small-target detection performance; target detection is then carried out on a feature map whose size is a fixed fraction of the original image size. Specifically, the data set and the corresponding sample labels are randomly shuffled and divided into a training set and a test set on an 8:2 basis.
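A minimal sketch of this shuffle-and-split step (the function name and the fixed seed are illustrative, not taken from the application):

```python
import random

def split_dataset(samples, labels, train_ratio=0.8, seed=0):
    # Randomly shuffle the (sample, label) pairs, then split them 8:2
    # into a training set and a test set.
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]

train_set, test_set = split_dataset([f"img_{i}.jpg" for i in range(100)],
                                    [i % 34 for i in range(100)])
print(len(train_set), len(test_set))  # 80 20
```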
The neural network is trained with the training set and the corresponding sample labels as input data, performance is evaluated in each training round, and a neural network model with mahjong tile detection capability is obtained after the performance evaluation indexes converge. The test data set is then detected to obtain the detection results. Specifically, RegNetX-400MF is selected as the basis for constructing the feature extraction network. The feature extraction network comprises 4 modules; the feature map output by each module is downsampled by a factor of 2 relative to that of the previous module, and the downsampling is accomplished by increasing the convolution stride. Each module is a cascade of convolution blocks, with block counts of [1, 2, 7, 12], forming a residual-like network structure. Each convolutional layer is followed by a Batch Normalization layer and a ReLU activation layer; the Batch Normalization layer normalizes the output, and the ReLU layer adds nonlinearity to the output of the previous layer. Feature extraction yields a feature map at a fixed fraction of the original image size.
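The module layout just described can be sketched as follows. Note that the real RegNetX-400MF is built from bottleneck blocks with grouped convolutions, so this plain Conv-BN-ReLU version only illustrates the stated module count, the block depths [1, 2, 7, 12], and the stride-2 downsampling, not the actual block design; the stage widths follow the commonly published RegNetX-400MF configuration and are otherwise assumptions.

```python
import torch
import torch.nn as nn

def make_module(in_ch, out_ch, depth):
    # The first conv of each module downsamples with stride 2; every conv
    # is followed by Batch Normalization and ReLU, as described above.
    layers = []
    for i in range(depth):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                             stride=2 if i == 0 else 1, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

depths = [1, 2, 7, 12]        # block counts of the 4 modules
widths = [32, 64, 160, 384]   # assumed RegNetX-400MF stage widths
backbone = nn.Sequential(*[make_module(w_in, w_out, d) for (w_in, w_out, d)
                           in zip([3] + widths[:-1], widths, depths)])
feat = backbone(torch.randn(1, 3, 400, 400))
print(feat.shape)             # torch.Size([1, 384, 25, 25]); halved 4 times
```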
Feature extraction and fusion are then carried out on this feature map through progressively stacked expansion convolutions to enhance the detection performance for dense targets, finally producing a feature map of the same size as the input feature map. Next, through two groups of convolution modules, object classification and position prediction are performed respectively, the results are mapped back to the original image, and an accurate mahjong target detection result is output after non-maximum suppression. Finally, the neural network is trained with the training set and the corresponding sample labels as input data. The input size of the network is not fixed: the maximum picture size in each Batch is read and used as that Batch's input size, pictures smaller than the maximum size are expanded by padding (a sketch of this padding follows below), and the mahjong target detection model is obtained once the performance evaluation indexes converge during training.
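A sketch of the per-batch padding described above, assuming CHW tensors and zero fill (the application does not specify the fill value):

```python
import torch

def pad_batch(images):
    # Pad every image in the batch to the batch's maximum height and width,
    # so that the batch input size follows its largest picture.
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    out = images[0].new_zeros(len(images), images[0].shape[0], max_h, max_w)
    for i, img in enumerate(images):
        out[i, :, :img.shape[1], :img.shape[2]] = img
    return out

batch = pad_batch([torch.randn(3, 380, 420), torch.randn(3, 400, 400)])
print(batch.shape)  # torch.Size([2, 3, 400, 420])
```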
The neural network model is then used to extract features from the test set, detection is performed on feature maps at the same fixed fraction of the original image size, and the accurate target categories and position coordinates are finally obtained through non-maximum suppression. The mahjong target detector is trained using the MMDetection target detection toolbox deep learning framework.
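The final suppression step can be sketched with torchvision's NMS operator; the thresholds are illustrative, and whether suppression is applied per class or class-agnostically is not specified here:

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, iou_thr=0.5, score_thr=0.05):
    # Drop low-confidence boxes, then apply non-maximum suppression to
    # obtain the final target positions and confidences.
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thr)  # indices of surviving boxes
    return boxes[kept], scores[kept]

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(postprocess(boxes, scores)[0])  # the overlapping second box is suppressed
```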
The experiments prove that providing only one expansion convolution block in each of the classification head and the regression head of the decoder of the present application is sufficient to obtain recognition accuracy similar to that of Yolof.
Based on the same inventive concept, an embodiment of the present application further provides an image recognition apparatus 400 based on deep learning, specifically as shown in fig. 4, including:
the feature extraction module 401 is configured to perform feature extraction on an image to be recognized and acquire a plurality of feature maps to be processed that represent different dimensions of the image to be recognized; wherein the sizes of all the feature maps to be processed are the same;
a feature fusion module 402, configured to perform multiple rounds of the first convolution operation on each feature map to be processed based on preset first expansion factors, so as to determine the first total feature map corresponding to each feature map to be processed according to the feature identification result obtained in each round of the first convolution operation; wherein the first expansion factor corresponding to each round of the first convolution operation is different, and first expansion factors with different values correspond to different receptive fields for the feature identification results;
an image recognition module 403, configured to determine, through a decoder, the preset classification to which the first total feature map belongs, and take the preset classification as the recognition result of the image to be recognized; wherein the first convolution operation proceeds as follows:
performing a convolution operation on the input item with the first convolution kernel based on the first expansion factor corresponding to the round, to obtain the first sub-feature map corresponding to the input item; and performing a convolution operation on the first sub-feature map with the second convolution kernel, to obtain the feature identification result of the input item under the receptive field corresponding to that first expansion factor; wherein the input items of the first round of the first convolution operation are the feature maps to be processed, and the input items of each subsequent round are the feature identification results obtained in the previous round; the first convolution kernel and the second convolution kernel differ in size.
In some possible embodiments, the preset dimensions of the first convolution kernel and the second convolution kernel are the same, and after acquiring the multiple feature maps to be processed that represent different dimensions of the image to be recognized, the feature extraction module 401 is further configured to:
before performing multiple rounds of the first convolution operation on each feature map to be processed, perform dimension processing on each feature map to be processed so that the dimension of each feature map to be processed is the same as the preset dimension.
In some possible embodiments, the number of second convolution kernels is greater than the number of first convolution kernels. When performing the convolution operation on the input item with the first convolution kernel, the feature fusion module 402 is configured to:
perform the convolution operation on each picture in the input item to obtain the first sub-feature map corresponding to each picture; wherein a picture is a feature map to be processed or a feature identification result;
and when performing the convolution operation on the first sub-feature maps with the second convolution kernel, the feature fusion module 402 is configured to:
perform the convolution operation on each picture in the input item, adding the pixel values at the same position of each picture during the convolution, so as to obtain the feature extraction result corresponding to all pictures in the input item.
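The same-position pixel addition described here is what a 1 × 1 convolution computes across its input channels. The toy check below makes that visible by freezing the kernel weights to 1 purely for illustration (in practice the weights are learned):

```python
import torch
import torch.nn as nn

# A 1x1 convolution sums same-position pixel values across input channels,
# weighted by its kernel; with all weights set to 1 it is a plain sum.
fuse = nn.Conv2d(in_channels=128, out_channels=1, kernel_size=1, bias=False)
with torch.no_grad():
    fuse.weight.fill_(1.0)

subs = torch.randn(2, 128, 25, 25)  # stacked first sub-feature maps
fused = fuse(subs)
assert torch.allclose(fused, subs.sum(dim=1, keepdim=True), atol=1e-4)
```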
In some possible embodiments, when determining, through the decoder, the preset classification to which the first total feature map belongs, the image recognition module 403 is configured to:
control the regression head to perform a second convolution operation on the first total feature map based on a second expansion factor, so as to determine the region of interest corresponding to the first total feature map; and
control the classification head to perform a third convolution operation on the first total feature map based on the second expansion factor, so as to determine the classification detection result of the first total feature map; wherein the second expansion factor differs in value from the first expansion factors;
and, if the classification detection result indicates that the first total feature map is an identifiable image, perform category identification on the region of interest of the first total feature map, and determine the preset classification to which the first total feature map belongs according to the identification result.
In some possible embodiments, when controlling the regression head to perform the second convolution operation on the first total feature map based on the second expansion factor so as to determine the region of interest corresponding to the first total feature map, the image recognition module 403 is configured to:
perform a convolution operation on the first total feature map with a third convolution kernel based on the second expansion factor, to obtain a second sub-feature map under the receptive field corresponding to the second expansion factor; wherein the third convolution kernel is the same size as the first convolution kernel;
perform a convolution operation on the second sub-feature map with a fourth convolution kernel to obtain a second total feature map corresponding to the second sub-feature map, and perform a convolution operation on the second total feature map with a fifth convolution kernel to obtain the region of interest; wherein the third and fifth convolution kernels are both the same size as the first convolution kernel, the fourth convolution kernel is the same size as the second convolution kernel, and the number of fourth convolution kernels is greater than the number of third convolution kernels.
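A sketch of this regression-head flow follows (PyTorch; the 512-dimensional width of the fourth kernel and the split of the (4+1) × 9 output into box offsets and objectness follow the decoder description earlier, and are otherwise assumptions):

```python
import torch
import torch.nn as nn

class RegressionHeadSketch(nn.Module):
    def __init__(self, ch=128, mid_ch=512, anchors=9, dilation=1):
        super().__init__()
        # third kernel: 3x3, dilated by the second expansion factor
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation)
        # fourth kernel: 1x1, with more kernels than the third (512 assumed)
        self.conv4 = nn.Conv2d(ch, mid_ch, 1)
        # fifth kernel: 3x3, regressing the (4+1)*9 outputs
        self.conv5 = nn.Conv2d(mid_ch, (4 + 1) * anchors, 3, padding=1)

    def forward(self, x):                 # x: first total feature map
        x = self.conv4(self.conv3(x))     # second total feature map
        out = self.conv5(x)               # (N, (4+1)*9, 25, 25)
        boxes, objectness = out[:, :4 * 9], out[:, 4 * 9:]
        return boxes, objectness          # region of interest + objectness

boxes, obj = RegressionHeadSketch()(torch.randn(1, 128, 25, 25))
print(boxes.shape, obj.shape)  # (1, 36, 25, 25) (1, 9, 25, 25)
```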
In some possible embodiments, when controlling the classification head to perform the third convolution operation on the first total feature map based on the second expansion factor so as to determine the classification detection result of the first total feature map, the image recognition module 403 is configured to:
perform a convolution operation on the first total feature map with a sixth convolution kernel based on the second expansion factor, to obtain a third sub-feature map under the receptive field corresponding to the second expansion factor; wherein the sixth convolution kernel is the same size as the first convolution kernel;
perform a convolution operation on the third sub-feature map with a seventh convolution kernel to obtain a third total feature map corresponding to the third sub-feature map, and perform a convolution operation on the third total feature map with an eighth convolution kernel to obtain the classification detection result; wherein the sixth and eighth convolution kernels are each the same size as the first convolution kernel, the seventh convolution kernel is the same size as the second convolution kernel, and the number of seventh convolution kernels is greater than the number of sixth convolution kernels.
In some possible embodiments, the classification head and the regression head in the decoder each include at least one expansion convolution block.
The electronic device 130 according to this embodiment of the present application is described below with reference to fig. 5. The electronic device 130 shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the electronic device 130 is represented in the form of a general electronic device. The components of the electronic device 130 may include, but are not limited to: the at least one processor 131, the at least one memory 132, and a bus 133 that connects the various system components (including the memory 132 and the processor 131).
Bus 133 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The memory 132 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the electronic device 130, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur via input/output (I/O) interfaces 135. Also, the electronic device 130 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 136. As shown, the network adapter 136 communicates with the other modules of the electronic device 130 over the bus 133. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 130, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 132 comprising instructions, executable by the processor 131 of the apparatus 400 to perform the above-described method is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising computer programs/instructions which, when executed by the processor 131, implement a deep learning based image recognition method as provided herein.
In an exemplary embodiment, aspects of a deep learning based image recognition method provided by the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of a deep learning based image recognition method according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A program product for deep learning based image recognition of embodiments of the present application may employ a portable compact disk read-only memory (CD-ROM) and include program code, and may be executable on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. In the case of a remote electronic device, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external electronic device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided so as to be embodied by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, causing a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. An image recognition method based on deep learning, characterized in that the method comprises:
performing feature extraction on an image to be identified to obtain a plurality of feature maps to be processed that represent different dimensions of the image to be identified; wherein the sizes of all the feature maps to be processed are the same;
performing multiple rounds of first convolution operations on each feature map to be processed based on preset first expansion factors, and determining a first total feature map corresponding to each feature map to be processed according to the feature identification result obtained by each round of first convolution operations; wherein the first expansion factor corresponding to each round of first convolution operation is different, and first expansion factors with different values represent different receptive fields corresponding to the feature identification results;
determining a preset classification to which the first total feature map belongs through a decoder, and taking the preset classification as the identification result of the image to be identified; wherein the first convolution operation proceeds as follows:
performing a convolution operation on an input item with a first convolution kernel based on the first expansion factor corresponding to the round, to obtain a first sub-feature map corresponding to the input item; performing a convolution operation on the first sub-feature map with a second convolution kernel, to obtain a feature identification result of the input item under the receptive field corresponding to the first expansion factor; wherein the input items of the first round of the first convolution operation are the feature maps to be processed, and the input items of each non-first round are the feature identification results obtained in the previous round; the first convolution kernel and the second convolution kernel differ in size.
2. The method according to claim 1, wherein the preset dimensions of the first convolution kernel and the second convolution kernel are the same, and after obtaining a plurality of feature maps to be processed that characterize different dimensions of the image to be recognized, the method further comprises:
before performing multiple rounds of the first convolution operation on each feature map to be processed, performing dimension processing on each feature map to be processed so that the dimension of each feature map to be processed is the same as the preset dimension.
3. The method of claim 2, wherein the number of second convolution kernels is greater than the number of first convolution kernels, and wherein performing the convolution operation on the input items with the first convolution kernels comprises:
performing convolution operation on each picture in the input item to obtain a first sub-feature map corresponding to each picture; wherein the picture is a feature picture to be processed or a feature identification result;
the performing, by using the second convolution kernel, a convolution operation on the first sub-feature map includes:
and performing convolution operation on each picture in the input item, and adding pixel values at the same position of each picture in the convolution process to obtain feature extraction results corresponding to all pictures in the input item.
4. The method of claim 1, wherein determining, by the decoder, the preset classification to which the first total feature map belongs comprises:
controlling the regression head to perform a second convolution operation on the first total feature map based on a second expansion factor so as to determine a region of interest corresponding to the first total feature map; and
controlling the classification head to perform a third convolution operation on the first total feature map based on the second expansion factor so as to determine a classification detection result of the first total feature map; wherein the second expansion factor is different in value from the first expansion factor;
and if the classification detection result represents that the first total feature map is an identifiable image, performing category identification on the region of interest of the first total feature map, and determining a preset classification to which the first total feature map belongs according to the identification result.
5. The method of claim 4, wherein the controlling the regression head to perform a second convolution operation on the first total feature map based on a second dilation factor to determine the region of interest corresponding to the first total feature map comprises:
performing convolution operation on the first total feature map by adopting a third convolution kernel based on the second expansion factor to obtain a second sub-feature map under the receptive field corresponding to the second expansion factor; wherein the third convolution kernel is the same size as the first convolution kernel;
performing convolution operation on the second sub-feature map with a fourth convolution kernel to obtain a second total feature map corresponding to the second sub-feature map; performing convolution operation on the second total feature map with a fifth convolution kernel to obtain the region of interest; wherein the third convolution kernel and the fifth convolution kernel are both the same size as the first convolution kernel; the fourth convolution kernels are the same size as the second convolution kernels, and the number of the fourth convolution kernels is greater than the number of the third convolution kernels.
6. The method of claim 4, wherein the controlling the classification head to perform a third convolution operation on the first total feature map based on the second dilation factor to determine a classification detection result for the first total feature map comprises:
performing convolution operation on the first total feature map by adopting a sixth convolution kernel based on the second expansion factor to obtain a third sub-feature map under the receptive field corresponding to the second expansion factor; wherein the sixth convolution kernel is the same size as the first convolution kernel;
performing convolution operation on the third sub-feature map with a seventh convolution kernel to obtain a third total feature map corresponding to the third sub-feature map; performing convolution operation on the third total feature map with an eighth convolution kernel to obtain the classification detection result; wherein the sixth convolution kernel and the eighth convolution kernel are each the same size as the first convolution kernel; the seventh convolution kernel is the same size as the second convolution kernel, and the number of the seventh convolution kernels is greater than the number of the sixth convolution kernels.
7. The method according to any of claims 1-6, wherein at least one expansion convolution block is included in each of the classification head and the regression head of the decoder.
8. An apparatus for image recognition based on deep learning, the apparatus comprising:
the feature extraction module is configured to perform feature extraction on an image to be identified and acquire a plurality of feature maps to be processed that represent different dimensions of the image to be identified; wherein the sizes of all the feature maps to be processed are the same;
the feature fusion module is configured to perform multiple rounds of first convolution operations on each feature map to be processed based on preset first expansion factors, so as to determine a first total feature map corresponding to each feature map to be processed according to the feature identification result obtained by each round of first convolution operations; wherein the first expansion factor corresponding to each round of first convolution operation is different, and first expansion factors with different values represent different receptive fields corresponding to the feature identification result;
the image identification module is configured to determine, through the decoder, the preset classification to which the first total feature map belongs, and take the preset classification as the identification result of the image to be identified; wherein the first convolution operation proceeds as follows:
performing a convolution operation on an input item with a first convolution kernel based on the first expansion factor corresponding to the round, to obtain a first sub-feature map corresponding to the input item; performing a convolution operation on the first sub-feature map with a second convolution kernel, to obtain a feature identification result of the input item under the receptive field corresponding to the first expansion factor; wherein the input items of the first round of the first convolution operation are the feature maps to be processed, and the input items of each non-first round are the feature identification results obtained in the previous round; the first convolution kernel and the second convolution kernel differ in size.
9. An electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A computer storage medium, characterized in that the computer storage medium stores a computer program for causing a computer to perform the method according to any one of claims 1-7.
CN202210092535.0A 2022-01-26 2022-01-26 Image identification method based on deep learning and related device Pending CN114511766A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination