CN112598045A - Method for training neural network, image recognition method and image recognition device


Info

Publication number
CN112598045A
Authority
CN
China
Prior art keywords
image
fusion
features
feature
image features
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202011496692.5A
Other languages
Chinese (zh)
Inventor
李轩屹
侯海波
王涛
张梦鹿
Current Assignee (listing may be inaccurate)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202011496692.5A
Publication of CN112598045A
Legal status: Pending


Classifications

    • G06F 18/253 Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06F 18/214 Pattern recognition; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 Pattern recognition; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06V 10/30 Image or video recognition or understanding; Image preprocessing; Noise filtering


Abstract

The present disclosure provides a method for training a neural network, an image recognition method, and an image recognition apparatus, which can be used in the field of artificial intelligence or in other fields. The neural network comprises: a multi-scale feature extraction network for extracting multilayer image features of a training image; a multi-scale feature fusion network for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features, where the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features; and a classifier for determining the recognition result of the training image based on the image recognition features. The training method comprises: adjusting the parameters of the neural network so that the recognition result for the input training image approaches the labeling result of the training image.

Description

Method for training neural network, image recognition method and image recognition device
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method for training a neural network, an image recognition method, and an image recognition apparatus.
Background
With the rapid development of science and technology, artificial intelligence has advanced rapidly in the image domain. A computer can recognize images according to the different semantic information they present. Early approaches required manually extracting image features and then feeding the extracted features into a classifier for image classification. With the advent of deep learning, manual feature extraction is no longer needed: features are gathered by feature learning and then used for image recognition.
In the process of implementing the disclosed concept, the applicant found at least the following problem in the related art: existing deep learning models use only a single feature network to recognize an image and do not fully consider the influence of features at different layers on image recognition, which limits the accuracy of image recognition.
Disclosure of Invention
In view of the above, the present disclosure provides a method for training a neural network, an image recognition method, and an image recognition apparatus, so as to at least partially solve the prior-art problem that image recognition accuracy suffers because the influence of features at different layers on image recognition is not fully considered.
One aspect of the present disclosure provides a method of training a neural network. The neural network comprises: a multi-scale feature extraction network for extracting multilayer image features of a training image; a multi-scale feature fusion network for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features, where the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features; and a classifier for determining the recognition result of the training image based on the image recognition features. The method comprises: adjusting the parameters of the neural network so that the recognition result for the input training image approaches the labeling result of the training image.
One aspect of the present disclosure provides an image recognition method, including: acquiring an input image; and processing the input image with a trained neural network to obtain a recognition result for the input image. The neural network comprises: a multi-scale feature extraction network for extracting multilayer image features of a training image; a multi-scale feature fusion network for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features, where the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features; and a classifier for determining the recognition result of the training image based on the image recognition features. The neural network is trained by adjusting its parameters so that the recognition result for the input training image approaches the labeling result of the training image.
An aspect of the present disclosure provides an image recognition apparatus comprising an image acquisition module and an image recognition module. The image acquisition module is used for acquiring an input image; the image recognition module is used for processing the input image with a trained neural network to obtain a recognition result for the input image. The neural network comprises: a multi-scale feature extraction network for extracting multilayer image features of a training image; a multi-scale feature fusion network for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features, where the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features; and a classifier for determining the recognition result of the training image based on the image recognition features. The neural network is trained by adjusting its parameters so that the recognition result for the input training image approaches the labeling result of the training image.
Another aspect of the present disclosure provides an electronic device comprising one or more processors and a storage device storing executable instructions that, when executed by the processors, implement the method of training a neural network and/or the image recognition method described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing a neural network training method and/or an image recognition method as above when executed.
Another aspect of the present disclosure provides a computer program comprising computer executable instructions for implementing a neural network training method and/or an image recognition method as above when executed.
According to the method for training the neural network, the image recognition method, and the image recognition apparatus of the present disclosure, the multi-scale feature fusion network performs weighted fusion on the multilayer image features, such as shallow detail features and deep semantic features, mining more image information so that richer texture information can be extracted. At the same time, features that are more robust to the various changes an image may undergo can be obtained, so that the features complement one another and classification accuracy improves.
According to the method for training the neural network, the image recognition method, and the image recognition apparatus of the present disclosure, features are selected by a weighted attention mechanism: features are extracted with different convolution kernels, and the information among the different kernels is learned, so that weights are assigned to the different kernels (kernel) across channels for characterization. This improves the quality of the extracted multilayer image features and thereby the image recognition performance of the trained neural network.
According to the method for training the neural network, the image recognition method, and the image recognition apparatus of the present disclosure, the image is preprocessed before model training or image recognition, which at least partially solves the problem that noise such as underexposure and focus blur keeps the image recognition rate from meeting requirements. On the one hand, Gaussian noise in the original image can be eliminated by Gaussian filtering and by weighted fusion of the Gaussian-filtered image with the original image, the filtered image being fused with a negative weight. On the other hand, the tone of the image can be edited nonlinearly, moving the image from a linear response to exposure intensity toward a response closer to human perception, which improves recognition accuracy for overexposed or underexposed images.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an application scenario of a method of training a neural network, an image recognition method, and an image recognition apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates an exemplary system architecture to which the method of training a neural network, the image recognition method, and the image recognition apparatus may be applied, according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a method of training a neural network, in accordance with an embodiment of the present disclosure;
fig. 4 schematically illustrates a structural diagram of the grouped convolution of a ResNeXt network according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a structural schematic of a self-attention mechanism network according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a multi-scale feature fusion process diagram according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a multi-scale feature fusion process according to another embodiment of the disclosure;
FIG. 8 schematically illustrates a structural schematic of a neural network according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow diagram of a method of training a neural network, in accordance with another embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow chart of an image recognition method according to an embodiment of the present disclosure;
fig. 11 schematically shows a block diagram of an image recognition apparatus according to an embodiment of the present disclosure; and
FIG. 12 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, the construction is intended in the sense one having skill in the art would understand it (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together). Where a convention analogous to "at least one of A, B, or C, etc." is used, it is likewise intended in the sense one having skill in the art would understand it (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together). The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features.
With the development of science and technology, more and more business transactions in organizations (such as banks, public institutions, and enterprises) require identity verification through face recognition. For example, when a bank card is issued, the customer's face is captured and the face information is entered into a database for later identity verification. Mobile payment, storefront face-recognition payment, and face-recognition login to applications or websites are likewise increasingly common in users' daily life and work. Existing face recognition technology uses only a single feature network for recognition; its error rate is high, and its performance cannot meet user requirements.
The method for training the neural network of the present disclosure comprises a training-data input process and a model-parameter adjustment process. In the training-data input process, a training image is input into the neural network; after this process completes, the model-parameter adjustment process begins, in which the parameters of the neural network are adjusted so that the recognition result for the input training image approaches the labeling result of the training image. The neural network comprises: a multi-scale feature extraction network for extracting multilayer image features of the training image; a multi-scale feature fusion network for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features, where the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features; and a classifier for determining the recognition result of the training image based on the image recognition features.
The embodiments of the present disclosure recognize images based on a multi-scale convolutional network with an attention mechanism, remedying the shortcomings of existing single-feature image recognition. Using a multi-feature network with a fused attention mechanism has the following advantages. First, because images are influenced by the environment (illumination, acquisition angle) during acquisition, the images are preprocessed to eliminate the influence of the external environment on recognition. Second, the idea of the attention mechanism is added to the convolutional layers, giving different weight information to different features and thereby screening out the feature information useful for recognition. Then, the shallow detail features and the deep semantic features are spliced (concat) and fused, so that the network can fully learn multi-feature information of the same image and achieve higher image recognition accuracy.
Fig. 1 schematically illustrates an application scenario of a method for training a neural network, an image recognition method, and an image recognition apparatus according to an embodiment of the present disclosure.
As shown in the left diagram of fig. 1, a user logs in to a system, application, or other computer software on a notebook computer by face recognition; because the notebook camera may have low resolution or the user may be in a dark scene, the accuracy of the face recognition result needs to be improved. The same holds in scenes such as storefront face-scanning payment and security monitoring. As shown in the right diagram of fig. 1, when a user uses a mobile phone for mobile payment or to unlock the screen, face recognition authentication may fail for reasons such as insufficient exposure.
It should be noted that the above illustrated scenarios are only examples and are not limited herein.
Fig. 2 schematically illustrates an exemplary system architecture to which the method of training a neural network, the image recognition method, and the image recognition apparatus may be applied, according to an embodiment of the present disclosure. It should be noted that fig. 2 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments may not be applied to other devices, systems, environments, or scenarios. It should also be noted that the method of training a neural network, the image recognition method, and the image recognition apparatus provided by the embodiments of the present disclosure may be used for image recognition in the field of artificial intelligence, and also in various fields other than artificial intelligence, such as the financial field.
As shown in fig. 2, the system architecture 200 according to this embodiment may include terminal devices 201, 202, 203, a network 204 and a server 205. The network 204 may include a plurality of gateways, routers, hubs, network wires, etc. to provide a medium for communication links between the end devices 201, 202, 203 and the server 205. Network 204 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 201, 202, 203 to interact with other terminal devices and the server 205 via the network 204 to receive or transmit information and the like, such as transmitting model training instructions, image recognition instructions, image data, image recognition results and the like. The terminal devices 201, 202, 203 may be installed with various communication client applications, such as an image recognition application, a bank application, an e-commerce application, a web browser application, a search application, an office application, an instant messaging tool, a mailbox client, social platform software, etc. (just examples).
The terminal devices 201, 202, 203 include, but are not limited to, smart phones, desktop computers, augmented reality devices, tablet computers, remote video surveillance terminals, laptop portable computers, and other electronic devices that can support image recognition, image processing, and the like. The terminal device can be stored with a neural network for image recognition.
The server 205 may receive and process model training requests, image recognition requests, model download requests, and the like. For example, the server 205 may be a back office management server, a cluster of servers, or the like. The background management server can analyze and process the received service request, information request and the like, and feed back the processing result (such as an image recognition result, model parameters obtained by training a model and the like) to the terminal equipment.
It should be noted that the method of training a neural network and the image recognition method provided by the embodiments of the present disclosure may be executed by the terminal devices 201, 202, 203 or by the server 205. Accordingly, the image recognition apparatus provided by the embodiments of the present disclosure may be disposed in the terminal devices 201, 202, 203 or in the server 205. It should be understood that the numbers of terminal devices, networks, and servers are merely illustrative; there may be any number of each, as the implementation requires.
To facilitate understanding of the technical solution of the embodiments of the present disclosure, the residual network (ResNet) is first introduced by way of example.
The residual network is a convolutional neural network proposed by four researchers from Microsoft Research; it won the image classification and object detection tasks of the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The residual network is easy to optimize and can gain accuracy from considerably increased depth. Its internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
ResNet recognizes well and reduces the difficulty of network training at least in part because, if the later layers of a deep network are identity mappings, the model degenerates into a shallow network; the problem then becomes learning an identity mapping function. It is hard to fit some layers directly to a potential identity mapping H(x) = x, which may be why deep networks are difficult to train. If, however, the network instead learns a residual function F(x) = H(x) - x, then F(x) = 0 yields an identity mapping, and fitting the residual is much easier. For example, if an output changes from 5.1 to 5.2, the mapping output increases by about 2%, whereas in a residual structure the residual goes from 0.1 to 0.2, an increase of 100%. The latter change clearly has a larger effect on weight adjustment, so it works better. The idea of residuals is to remove the identical main part so as to highlight small changes.
The disclosed embodiments improve the accuracy of model training and image recognition results based at least in part on residual networks.
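To make the residual idea concrete, the following is a minimal sketch of a residual block. A PyTorch implementation is assumed here, and the channel count is illustrative; this is a sketch of the principle rather than the specific network of the present disclosure.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Minimal residual block: output = F(x) + x (identity skip connection)."""
        def __init__(self, channels: int):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The skip connection passes x through unchanged, so the body only
            # needs to fit the residual F(x); F(x) = 0 yields an identity mapping.
            return self.relu(self.body(x) + x)

When the optimal mapping of some layers is close to identity, the body only has to fit a small residual, which matches the 0.1-to-0.2 example above.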
Fig. 3 schematically shows a flow diagram of a method of training a neural network according to an embodiment of the present disclosure.
As shown in fig. 3, the method includes operations S302 to S304.
In operation S302, a training image is input to a neural network.
In this embodiment, the neural network may include: a multi-scale feature extraction network, a multi-scale feature fusion network, and a classifier.
The multi-scale feature extraction network is used for extracting multi-layer image features of the training image.
The multi-scale feature fusion network is used for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features; the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features.
The classifier is used for determining the recognition result of the training image based on the image recognition features.
In one embodiment, ResNeXt, an improved variant of the residual network (ResNet), may be selected. Its advantage is that it widens the network by adding a hyperparameter for the number of independent paths, the "cardinality". In the network structure, the cardinality follows the idea of grouped convolution: the multiple feature maps are divided into different groups that are convolved separately, and the convolution results of the different groups are finally merged. This avoids vanishing gradients as the network depth increases and allows diverse image features to be convolved.
Fig. 4 schematically shows a structural diagram of the grouped convolution of a ResNeXt network according to an embodiment of the present disclosure.
As shown in fig. 4, ResNeXt adopts a strategy intermediate between ordinary convolution and depthwise separable convolution: grouped convolution, balancing the two by controlling the number of groups (the cardinality). The topology of every ResNeXt branch is the same. By using 1 × 1 convolution kernels together with 3 × 3 kernels (Conv_3), each feature map is provided with information from different receptive fields; and provided the outputs of the different convolution kernels have the same size (e.g. 128) and number of channels (e.g. 32 groups), the elements are convenient to fuse, so the feature maps of the different channels can be fused by concatenation (concat).
In addition, ResNeXt is selected to improve image recognition efficiency because the identical topology of its branches better matches the hardware design principles of the GPU.
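The grouped-convolution idea can be sketched as follows. A PyTorch implementation is assumed, with the standard ResNeXt bottleneck shape (1 × 1, grouped 3 × 3, 1 × 1) and illustrative channel counts (128 mid channels split into 32 groups); it is a sketch, not the exact network of fig. 4.

    import torch
    import torch.nn as nn

    class ResNeXtBlock(nn.Module):
        """Bottleneck block whose 3x3 stage is a grouped convolution (cardinality groups)."""
        def __init__(self, in_ch: int = 256, mid_ch: int = 128, cardinality: int = 32):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(mid_ch),
                nn.ReLU(inplace=True),
                # groups=cardinality divides the feature maps into 32 groups,
                # convolves each group separately, and concatenates the results.
                nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1,
                          groups=cardinality, bias=False),
                nn.BatchNorm2d(mid_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, in_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(in_ch),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual connection, as in ResNet.
            return self.relu(self.block(x) + x)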
In operation S304, the recognition result for the input training image is made to approach the labeling result of the training image by adjusting parameters of the neural network.
In this embodiment, model training may be performed with a back-propagation algorithm. The parameters of the neural network include, but are not limited to: the parameters of the convolution kernels and the weights and biases of the network layers; when the neural network further includes other networks, such as the self-attention mechanism network, its parameters also include the relevant parameters of those networks, without limitation here.
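As an illustration of this parameter-adjustment step, the following is a minimal back-propagation training step. PyTorch, a cross-entropy loss against the labeling results, and an externally constructed model and optimizer are assumed; the names are illustrative.

    import torch.nn as nn

    def train_step(model, images, labels, optimizer, criterion=nn.CrossEntropyLoss()):
        """One back-propagation step pushing recognition results toward the labels."""
        optimizer.zero_grad()
        logits = model(images)            # recognition results for the training images
        loss = criterion(logits, labels)  # gap between recognition and labeling results
        loss.backward()                   # gradients w.r.t. kernels, weights, biases
        optimizer.step()                  # adjust the parameters of the neural network
        return loss.item()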
The disclosed embodiments widen the network. In the network structure, the cardinality follows the idea of grouped convolution: the multiple feature maps are divided into different groups that are convolved separately, and the convolution results of the different groups are finally spliced. This avoids vanishing gradients as the network depth increases, allows diverse features to be convolved, and helps improve the image recognition accuracy of the trained neural network based on those diverse features.
The structure of the neural network is exemplified below.
In one embodiment, the multi-scale feature extraction network comprises a plurality of levels of feature extraction networks, each level of feature extraction network for extracting image features at different depths.
For example, each of the plurality of hierarchical feature extraction networks includes: at least two branch networks, a self-attention mechanism network and a feature fusion network.
At least two branch networks, each configured to obtain at least two sets of feature maps using convolution kernels of different sizes (e.g., a 3 × 3 kernel (Conv_3) and a 5 × 5 kernel), where the first size of the larger kernel is determined from the second size of the smaller kernel and a preset dilation rate.
The self-attention mechanism network is used for determining the characteristic map weight of each of at least two groups of characteristic maps.
The feature fusion network is used for carrying out weighted fusion on the at least two groups of feature maps based on the feature map weights of the at least two groups of feature maps to obtain fused image features.
Fig. 5 schematically illustrates a structural schematic of a self-attention mechanism network according to an embodiment of the disclosure.
As shown in fig. 5, the self-attention mechanism network includes: a global pooling layer (Global pooling), a first fully-connected layer (FC), a normalization layer (BN), and a second fully-connected layer (FC). In fig. 5, SKNet is a lightweight network that adaptively adjusts its receptive field.
The global average pooling layer is used for obtaining the global information of each group of feature maps.
The first fully-connected layer is configured to determine a feature map weight for each of the at least two sets of feature maps based on the global information and an activation function (e.g., a ReLU function).
The normalization layer is used for normalizing the feature map weights of the at least two groups of feature maps.
The second fully-connected layer is configured to determine the normalized feature map weights of the at least two sets of feature maps using a softmax function.
The feature fusion network is specifically configured to perform weighted fusion (Add) on at least two sets of feature maps based on normalized feature map weights of the at least two sets of feature maps (e.g., two sets of feature maps obtained based on a convolution kernel (Conv _3) having a size of 3 × 3 and a convolution kernel (Conv _5) having a size of 5 × 5), so as to obtain fused image features.
In a specific embodiment, a weighted attention mechanism is adopted to select the features: features are extracted with different convolution kernels, and the information among the different kernels is learned, so that weights are assigned to the different kernels (kernel) across channels for characterization. The specific operations are as follows.
First, the features undergo a split operation: convolution kernels of different sizes (3 × 3 and 5 × 5) produce two sets of feature maps. The 5 × 5 branch is implemented as a 3 × 3 dilated convolution with dilation rate 2, which enlarges the network receptive field exponentially. The effective kernel size of a dilated convolution is k' = k + (k - 1)(r - 1), where k is the original kernel size and r is the dilation rate; for k = 3 and r = 2, k' = 5, matching the 5 × 5 branch.
Then the results of the split operation are fused to form U, so that each feature map carries information from different receptive fields; the elements are added while ensuring that the different convolution kernels produce the same output size and number of channels. Global average pooling is then applied to all feature maps, reducing each to 1 × 1 to obtain the global information of each channel, S_c, as shown in formula (1):

S_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} U_c(i, j)    formula (1)
Then the output S_c is fed into a fully connected network so that the activation unit can determine the information proportion of each channel. The ReLU activation function provides the nonlinear selection needed for a more accurate weight assignment, and the fully connected layer reduces the number of output neurons, i.e. performs a dimension-reduction operation. With B denoting batch normalization, δ the ReLU activation function, r the reduction ratio, L the minimum length, and Z the resulting weight vector, formula (2) is:

Z = δ(B(W S)), with W ∈ R^(d×C)    formula (2)

where d is the size after dimension reduction, given by formula (3):

d = max(C/r, L)    formula (3)
Next, the reduced representation carrying the fused (Fuse) weights is expanded back to the original dimension through a fully connected layer to represent the weight of each channel, and a softmax function regresses the weight of each feature map, so that the weights of the feature maps produced by different convolution kernels for the same channel sum to 1, as shown in formula (4):

a_c = e^(A_c Z) / (e^(A_c Z) + e^(B_c Z)),  b_c = e^(B_c Z) / (e^(A_c Z) + e^(B_c Z)),  a_c + b_c = 1    formula (4)

where A_c and B_c are the rows of the two branch-specific fully connected layers. Finally, the weights are multiplied with their respective original feature maps, and the results are fused and superposed pixel by pixel to form the final feature map V.
In the present disclosure, the features are thus selected by a weighted attention mechanism: features are extracted with different convolution kernels and the information among the different kernels is learned, so that weights are assigned to the different kernels (kernel) across channels for characterization, which improves the quality of the extracted image features.
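The split, fuse and select steps above can be sketched as one module. Assumptions: PyTorch; two branches, a 3 × 3 kernel and a 3 × 3 kernel with dilation rate 2 (effective 5 × 5); reduction ratio r and minimum length L as in formulas (2) and (3); the layer sizes are illustrative.

    import torch
    import torch.nn as nn

    class SelectiveKernel(nn.Module):
        """Two-branch selective-kernel attention: split, fuse, select."""
        def __init__(self, channels: int, r: int = 16, L: int = 32):
            super().__init__()
            d = max(channels // r, L)  # formula (3): size after dimension reduction
            # Split: 3x3 kernel, and 3x3 kernel with dilation 2 (effective 5x5).
            self.conv3 = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            self.conv5 = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            # Fuse: global average pooling (formula (1)) and dimension reduction (formula (2)).
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Sequential(
                nn.Linear(channels, d), nn.BatchNorm1d(d), nn.ReLU(inplace=True))
            # Select: one fully connected layer per branch; softmax over branches (formula (4)).
            self.fc_a = nn.Linear(d, channels)
            self.fc_b = nn.Linear(d, channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            u3, u5 = self.conv3(x), self.conv5(x)
            u = u3 + u5                          # element-wise fusion U
            s = self.pool(u).flatten(1)          # formula (1): global information S
            z = self.fc(s)                       # formula (2): reduced weight vector Z
            ab = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)  # (batch, 2, channels)
            ab = torch.softmax(ab, dim=1)        # formula (4): per-channel weights sum to 1
            a, b = ab[:, 0], ab[:, 1]
            # Weighted pixel-by-pixel superposition forms the final feature map V.
            return u3 * a[..., None, None] + u5 * b[..., None, None]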
In one embodiment, fig. 6 schematically illustrates a multi-scale feature fusion process schematic according to an embodiment of the disclosure.
As shown in fig. 6, the multi-scale feature fusion network may include: multiple up-sampling networks (Up × 2), a sub-fusion image feature acquisition network (operating on C2 to C5), a splicing module (concat), and a fully connected network (not shown).
The plurality of up-sampling networks are used for respectively up-sampling the image features of the last hierarchy and the sub-fusion image features of each intermediate hierarchy (every hierarchy except the last and the first), obtaining up-sampled image features for the last hierarchy and each intermediate hierarchy; the dimensions of the up-sampled image features of the current intermediate hierarchy are the same as the dimensions of the image features of the hierarchy above it.
The sub-fusion image feature acquisition network is used for performing weighted fusion, based on the respective hierarchical weights of the multilayer image features, of the up-sampled image features of the last hierarchy with the image features of the hierarchy above it, or of the up-sampled image features of an intermediate hierarchy with the image features of the hierarchy above it, to obtain the sub-fusion image features of the hierarchy above the current one. Each image feature may first be processed by a 1 × 1 convolution kernel (conv_1) to facilitate feature fusion.
The splicing module is used for splicing the image features of the last hierarchy of the training image with the sub-fusion image features of all the other hierarchies to obtain the fused image features.
The fully connected network is used for performing feature learning on the fused image features to determine the image recognition features.
In one embodiment, the multi-scale feature fusion network may further perform an aliasing cancellation operation on the upsampled image features, for example, the multi-scale feature fusion network may further include: a plurality of convolutional networks and a pooling network.
Fig. 7 schematically illustrates a multi-scale feature fusion process according to another embodiment of the present disclosure.
As shown in fig. 7, the plurality of convolution networks (Conv 3 × 3) are used to convolve the sub-fusion image features of the hierarchy above the current one, yielding de-aliased sub-fusion image features, so that the plurality of up-sampling networks up-sample these de-aliased sub-fusion image features respectively.
The pooling network is used for performing feature selection on the fused image features to obtain the pooled fused image features; the feature selection is performed by spatial pyramid pooling (SPP-Pooling).
Accordingly, the fully-connected network is specifically configured to perform feature learning on the pooled fused image features to determine image recognition features.
In a specific embodiment, the shallow detail features and the deep semantic features of the convolutional layers are weighted and fused, mining more image information so as to extract richer texture information; at the same time, features more robust to the various changes an image may undergo are obtained, and the features complement one another to improve classification accuracy. In general, the higher the accuracy achieved with a layer's features, the more class information those features carry and the larger that layer's fusion weight. Let the feature map extracted from convolution layer i be C_i, its classification accuracy a_i, and its feature weight w_{i,(i+1)}, where i = 2, 3, 4, 5. Feature weighted fusion may include the following operations.
First, the weights of the fifth and fourth layers, w_{5,4} and w_{4,5}, are obtained from the classification accuracies by normalization, as shown in formula (5):

w_{5,4} = a_5 / (a_4 + a_5),  w_{4,5} = a_4 / (a_4 + a_5)    formula (5)
Then the two layers of features are weighted and fused according to these weights, and a 3 × 3 convolution is applied to eliminate the feature-aliasing effect caused by up-sampling, yielding P_4 as shown in formula (6):

P_4 = Conv_{3×3}(w_{5,4} × P_5 + w_{4,5} × C_4)    formula (6)
The remaining layers are fused in the same manner to give P_3 and P_2. The fused feature maps are merged by concatenation (concat), and SPP pooling finally performs feature selection to obtain the feature F shown in formula (7):

F = SPP(concat(P_5, P_4, P_3, P_2))    formula (7)
Then the extracted features are input into the fully connected layer for feature learning, and finally into the classifier for image recognition.
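A condensed sketch of this fusion pipeline follows. Assumptions: PyTorch; C2 to C5 already reduced to a common channel width by 1 × 1 convolutions; the level weights precomputed from the per-layer accuracies as in formula (5); SPP simplified to adaptive max pooling at a few grid sizes; and, since concatenation needs one spatial size, all levels are resized to the P2 grid before the concat of formula (7).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def spp(x, bins=(1, 2, 4)):
        """Simplified spatial pyramid pooling: pool at several grids, then flatten."""
        return torch.cat([F.adaptive_max_pool2d(x, b).flatten(1) for b in bins], dim=1)

    def fuse_levels(c2, c3, c4, c5, w, smooth):
        """Top-down weighted fusion; w[k] = (w_high, w_low), smooth = three 3x3 convs."""
        p5 = c5
        p4 = smooth[0](w[0][0] * F.interpolate(p5, scale_factor=2) + w[0][1] * c4)  # formula (6)
        p3 = smooth[1](w[1][0] * F.interpolate(p4, scale_factor=2) + w[1][1] * c3)
        p2 = smooth[2](w[2][0] * F.interpolate(p3, scale_factor=2) + w[2][1] * c2)
        # Resize to a common grid, concatenate, then select features: formula (7).
        target = p2.shape[-2:]
        feats = [F.interpolate(p, size=target) for p in (p5, p4, p3, p2)]
        return spp(torch.cat(feats, dim=1))

Here smooth could be, for example, nn.ModuleList([nn.Conv2d(256, 256, 3, padding=1) for _ in range(3)]) when every level carries 256 channels; the 3 × 3 convolutions play the anti-aliasing role of formula (6).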
Fig. 8 schematically shows a structural schematic of a neural network according to an embodiment of the present disclosure.
As shown in fig. 8, each multi-scale feature extraction network may include a plurality of blocks (Block), Block_1 to Block_n in fig. 8, where each block may include 32 channel groups. By adding the attention mechanism to the convolutional layers, different weight information is given to different features, so that the feature information useful for recognition is screened out. The shallow detail features and the deep semantic features are then spliced and fused, so that the network can fully learn multi-feature information of the same image, yielding higher image recognition accuracy.
Fig. 9 schematically illustrates a flow diagram of a method of training a neural network according to another embodiment of the present disclosure.
As shown in fig. 9, before performing operation S302, the method may further include operation S901.
In operation S901, a training image is preprocessed to reduce noise information in the training image, wherein the noise information includes noise caused by at least one of underexposure and focus blur.
Specifically, preprocessing the training image includes at least one of the following.
For example, Gaussian noise in the training images is eliminated by image weighted fusion.
For example, nonlinear tone editing is applied to the training image.
In one embodiment, eliminating gaussian noise in the training image through image weighted fusion may include the following operations.
First, each pixel of the training image is scanned with a convolution kernel to obtain the filtered training image, where the value of the current pixel in the filtered image is the weighted average gray value of that pixel and its neighboring pixels.
And then, weighting and fusing the filtered training image and the training image to eliminate Gaussian noise in the training image.
For example, image weighted fusion and gamma correction are used to preprocess the acquired image, addressing noise introduced at acquisition time such as underexposure and focus blur. For the noise, the image is first Gaussian-filtered: the pixel values of the whole image are averaged with weights, each pixel is scanned by the convolution, and the weighted average gray value of the surrounding pixels becomes the pixel's filtered value. The Gaussian-filtered image is then weighted-fused with the original image, the filtered image carrying a negative weight, so as to eliminate the Gaussian noise in the original. Second, the tone of the image is edited nonlinearly, moving the image from a linear response to exposure intensity toward a response closer to human perception, which compensates for overexposure or underexposure. This effectively improves recognition accuracy for images collected in underexposed environments.
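A sketch of these two preprocessing steps is given below. OpenCV and NumPy are assumed; the kernel size, the negative fusion weight, and the gamma value are illustrative choices, and the file name is hypothetical.

    import cv2
    import numpy as np

    def denoise_by_weighted_fusion(img, sigma=1.5, alpha=1.5, beta=-0.5):
        """Gaussian-filter the image, then fuse the filtered copy with a negative weight."""
        blurred = cv2.GaussianBlur(img, (5, 5), sigma)  # weighted average gray values
        # alpha * original + beta * filtered; alpha + beta = 1 keeps overall brightness.
        return cv2.addWeighted(img, alpha, blurred, beta, 0)

    def gamma_correct(img, gamma=0.6):
        """Nonlinear tone editing; gamma < 1 brightens underexposed images."""
        table = np.array([(i / 255.0) ** gamma * 255 for i in range(256)]).astype(np.uint8)
        return cv2.LUT(img, table)

    # img = cv2.imread("face.jpg")  # hypothetical input path
    # img = gamma_correct(denoise_by_weighted_fusion(img))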
The preprocessed training images may then be input into a neural network. It should be noted that, in the process of performing image recognition, the input image may also be preprocessed, and details are not described here.
Another aspect of the present disclosure also provides an image recognition method.
Fig. 10 schematically shows a flow chart of an image recognition method according to an embodiment of the present disclosure.
As shown in fig. 10, the image recognition method includes operations S1002 to S1004.
In operation S1002, an input image is acquired. The input image may be acquired by various terminal devices such as those shown in fig. 1, or obtained from a network.
In operation S1004, the input image is processed using the trained neural network, resulting in a recognition result for the input image.
The neural network includes: a multi-scale feature extraction network, a multi-scale feature fusion network, and a classifier.
The multi-scale feature extraction network is used for extracting multi-layer image features of the training image.
The multi-scale feature fusion network is used for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features; the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features.
The classifier is used for determining the recognition result of the training image based on the image recognition features.
For example, the neural network is trained by adjusting its parameters so that the recognition result for the input training image approaches the labeling result of the training image.
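A minimal sketch of the inference call, assuming PyTorch and an already preprocessed image tensor; the names are illustrative.

    import torch

    @torch.no_grad()
    def recognize(model, image_tensor):
        """Run the trained network on one preprocessed input image."""
        model.eval()                                # disable training-time behavior
        logits = model(image_tensor.unsqueeze(0))   # add the batch dimension
        return logits.argmax(dim=1).item()          # index of the recognized class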
It should be noted that, the image recognition process may refer to the relevant content of the image recognition process involved in the above neural network training process, and is not described herein again.
Fig. 11 schematically shows a block diagram of an image recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 11, the image recognition apparatus 1100 may include: an image acquisition module 1110 and an image recognition module 1120.
The image acquisition module 1110 is used to acquire an input image.
The image recognition module 1120 is configured to process the input image using the trained neural network to obtain a recognition result for the input image.
The neural network includes: a multi-scale feature extraction network, a multi-scale feature fusion network, and a classifier.
The multi-scale feature extraction network is used for extracting multi-layer image features of the training image.
The multi-scale feature fusion network is used for performing weighted fusion on the multilayer image features based on their respective hierarchical weights to obtain fused image features, and for determining image recognition features based on the fused image features; the hierarchical weight of each feature layer is positively correlated with that layer's degree of influence on the recognition result, and the dimensionality of the fused image features is greater than the sum of the dimensionalities of the individual multilayer image features.
The classifier is used for determining the recognition result of the training image based on the image recognition features.
For example, the neural network is trained by adjusting its parameters so that the recognition result for the input training image approaches the labeling result of the training image.
The embodiments of the present disclosure can provide higher accuracy when the acquired client image is compared against identity information. Specifically, preprocessing the acquired image helps reduce the influence of environmental factors such as noise on recognition accuracy at acquisition time. In addition, weighted attention and multilayer feature fusion are added to the convolution operations, selecting diverse and discriminative features, which benefits image recognition accuracy.
It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described in detail herein.
Any of the modules, units, or at least part of the functionality of any of them according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, units according to the embodiments of the present disclosure may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by any other reasonable means of hardware or firmware by integrating or packaging the circuits, or in any one of three implementations of software, hardware and firmware, or in any suitable combination of any of them. Alternatively, one or more of the modules, units according to embodiments of the present disclosure may be implemented at least partly as computer program modules, which, when executed, may perform the respective functions.
For example, any number of the image acquisition module 1110 and the image recognition module 1120 may be combined in one module to be implemented, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the image acquisition module 1110 and the image recognition module 1120 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, at least one of the image acquisition module 1110 and the image recognition module 1120 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.
FIG. 12 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure. The electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 12, an electronic apparatus 1200 according to an embodiment of the present disclosure includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access memory (RAM) 1203. The processor 1201 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), among others. The processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1203, various programs and data necessary for the operation of the electronic apparatus 1200 are stored. The processor 1201, the ROM 1202, and the RAM 1203 are communicatively connected to one another by a bus 1204. The processor 1201 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1202 and/or the RAM 1203. Note that the programs may also be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1200 may also include an input/output (I/O) interface 1205, which is likewise connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a display device such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as necessary, so that a computer program read out from it can be installed into the storage section 1208 as needed.
According to embodiments of the present disclosure, the method flows described above may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209 and/or installed from the removable medium 1211. When executed by the processor 1201, the computer program performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, and the like described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments, or may exist separately without being assembled into that apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 1202 and/or the RAM 1203 and/or one or more memories other than the ROM 1202 and the RAM 1203 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program that contains program code for performing the method provided by the embodiments of the present disclosure. When the computer program product is run on an electronic device, the program code is configured to cause the electronic device to implement the method for training a neural network or the image recognition method provided by the embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed through the communication section 1209, and/or installed from the removable medium 1211. The computer program containing the program code may be transmitted using any suitable network medium, including but not limited to wireless and wired media, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the latter cases, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined in various ways, even if such combinations or sub-combinations are not expressly recited in the present disclosure. These embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the respective embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and their equivalents. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to fall within the scope of the present disclosure.

Claims (13)

1. A method for training a neural network, the neural network comprising:
the multi-scale feature extraction network is used for extracting multilayer image features of the training images;
the multi-scale feature fusion network is used for performing weighted fusion on the multilayer image features based on the respective hierarchical weights of the multilayer image features to obtain fusion image features, and determining image recognition features based on the fusion image features, wherein the respective hierarchical weights of the multilayer image features are positively correlated with the respective degrees of influence of the multilayer image features on the recognition result, and the dimensionality of the fusion image features is greater than the sum of the respective dimensionalities of the multilayer image features; and
a classifier for determining a recognition result of the training image based on the image recognition features;
wherein the method comprises the following steps:
inputting the training image into the neural network; and
adjusting parameters of the neural network so that the recognition result of the input training image approaches the labeling result of the training image.
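For illustration only, the following PyTorch sketch shows one possible reading of claim 1: multilayer features are fused with learnable per-level weights and a classifier is trained so that its output approaches the label. All sizes, the softmax normalization of the weights, the concatenation-based fusion, and the SGD hyperparameters are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class WeightedFusionClassifier(nn.Module):
    """Hypothetical weighted multi-scale fusion head (all sizes assumed)."""
    def __init__(self, level_dims=(64, 128, 256), num_classes=10):
        super().__init__()
        # One learnable weight per feature level; softmax keeps them positive,
        # so levels with more influence on the result receive larger weights.
        self.level_w = nn.Parameter(torch.ones(len(level_dims)))
        self.classifier = nn.Linear(sum(level_dims), num_classes)

    def forward(self, feats):                      # feats: list of (B, d_i)
        w = torch.softmax(self.level_w, dim=0)
        weighted = [w[i] * f for i, f in enumerate(feats)]
        fused = torch.cat(weighted, dim=1)         # fusion image features
        return self.classifier(fused)              # recognition result

# Training step: adjust parameters so the recognition result of the
# training image approaches its labeling result.
model = WeightedFusionClassifier()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
feats = [torch.randn(8, d) for d in (64, 128, 256)]   # stand-in features
labels = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(feats), labels)
opt.zero_grad()
loss.backward()
opt.step()
```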
2. The method of claim 1, wherein the multi-scale feature fusion network comprises:
a plurality of up-sampling networks for respectively up-sampling the image features of the last hierarchy, or the sub-fusion image features of each intermediate hierarchy other than the last hierarchy and the first hierarchy, to obtain up-sampled image features of each of the intermediate hierarchies and the last hierarchy, wherein the dimension of the up-sampled image feature of a current hierarchy is the same as the dimension of the image feature of its previous hierarchy;
a sub-fusion image feature acquisition network for performing weighted fusion on the up-sampled image feature of the last hierarchy and the image feature of its previous hierarchy based on the respective hierarchical weights of the multilayer image features, or performing weighted fusion on the up-sampled image feature of each of the at least one intermediate hierarchy and the image feature of its respective previous hierarchy based on the respective hierarchical weights of the multilayer image features, so as to obtain the sub-fusion image feature of the hierarchy previous to the current hierarchy;
a splicing module for concatenating the image features of the last hierarchy of the training image and the sub-fusion image features of all hierarchies other than the last hierarchy to obtain the fusion image features; and
a fully connected network for performing feature learning on the fusion image features to determine the image recognition features.
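A minimal sketch of the top-down fusion of claim 2, assuming all levels share a channel count and that the hierarchical weights are given scalars; the interpolation mode and the flatten-then-concatenate splicing are illustrative choices, and the de-aliasing convolution of claim 3 is noted in a comment but omitted.

```python
import torch
import torch.nn.functional as F

def top_down_fuse(feats, level_w):
    """feats: feature maps ordered shallow -> deep, each (B, C, H_i, W_i);
    level_w: one scalar hierarchical weight per level (assumed normalized)."""
    sub = feats[-1]                       # start from the last hierarchy
    fused = [sub.flatten(1)]
    for i in range(len(feats) - 2, -1, -1):
        # up-sample so the dimension matches the previous hierarchy's feature
        up = F.interpolate(sub, size=feats[i].shape[-2:], mode="nearest")
        # weighted fusion -> sub-fusion image feature of hierarchy i
        # (claim 3 would insert a convolution here to de-alias `sub`)
        sub = level_w[i] * feats[i] + level_w[i + 1] * up
        fused.append(sub.flatten(1))
    # "splicing": concatenate the last-hierarchy feature with all
    # sub-fusion features; a fully connected network would follow.
    return torch.cat(fused, dim=1)

fused = top_down_fuse([torch.randn(2, 8, 32, 32), torch.randn(2, 8, 16, 16)],
                      level_w=[0.6, 0.4])
```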
3. The method of claim 2, wherein the multi-scale feature fusion network further comprises:
a plurality of convolution networks for performing a convolution operation on the sub-fusion image feature of the hierarchy previous to the current hierarchy to obtain de-aliased sub-fusion image features, so that the up-sampling networks respectively up-sample the de-aliased sub-fusion image features;
a pooling network for performing feature selection on the fusion image features to obtain pooled fusion image features; and
the fully connected network is specifically configured to perform feature learning on the pooled fusion image features to determine the image recognition features.
4. The method of claim 1, wherein the multi-scale feature extraction network comprises a plurality of levels of feature extraction networks, each level of feature extraction network being used for extracting image features at a different depth.
5. The method of claim 4, wherein each of the plurality of tiers of feature extraction networks comprises:
at least two branch networks for obtaining at least two groups of feature maps based on convolution kernels of different sizes, wherein the first size of the larger convolution kernel among the convolution kernels of different sizes is determined based on the second size of the smaller convolution kernel and a preset expansion (dilation) ratio;
a self-attention mechanism network for determining a feature map weight for each of the at least two groups of feature maps; and
a feature fusion network for performing weighted fusion on the at least two groups of feature maps based on the respective feature map weights of the at least two groups of feature maps to obtain fused image features.
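Deriving the large kernel from the small kernel size and a preset expansion (dilation) ratio resembles selective-kernel convolutions: a 3x3 kernel with dilation 2, for example, covers a 5x5 receptive field without storing a 5x5 kernel. The sketch below assumes that reading; the channel counts and dilation value are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class TwoBranchConv(nn.Module):
    """Two convolution branches with different receptive fields (sketch)."""
    def __init__(self, in_ch=64, out_ch=64, small_k=3, dilation=2):
        super().__init__()
        # padding chosen so both branches preserve the spatial size
        self.small = nn.Conv2d(in_ch, out_ch, small_k, padding=small_k // 2)
        self.large = nn.Conv2d(in_ch, out_ch, small_k, dilation=dilation,
                               padding=dilation * (small_k - 1) // 2)

    def forward(self, x):
        # two groups of feature maps, one per branch
        return [self.small(x), self.large(x)]

branches = TwoBranchConv()(torch.randn(2, 64, 32, 32))
```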
6. The method of claim 5, wherein the self-attention mechanism network comprises:
a global average pooling layer for obtaining global information of each group of feature maps;
a first fully connected layer for determining the feature map weight of each of the at least two groups of feature maps based on the global information and an activation function;
a normalization layer for normalizing the feature map weights of the at least two groups of feature maps;
a second fully connected layer for determining the normalized feature map weight of each of the at least two groups of feature maps based on a loss function; and
the feature fusion network is specifically configured to perform weighted fusion on the at least two groups of feature maps based on the normalized feature map weights of the at least two groups of feature maps to obtain the fused image features.
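The claim-6 attention can be read as the squeeze-and-excite pattern: global average pooling gathers global information, a fully connected layer scores each branch, and softmax normalizes the scores before the weighted fusion. The sketch below simplifies the exact layer stack (activation, second fully connected layer, loss-based weighting), and its sizes are assumptions.

```python
import torch
import torch.nn as nn

class BranchAttention(nn.Module):
    """Normalized per-branch weights derived from pooled global information."""
    def __init__(self, channels=64, num_branches=2):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Linear(channels, num_branches)   # one weight per branch

    def forward(self, branches):                      # list of (B, C, H, W)
        stacked = torch.stack(branches, dim=1)        # (B, n, C, H, W)
        info = self.gap(sum(branches)).flatten(1)     # global information (B, C)
        w = torch.softmax(self.fc(info), dim=1)       # normalized weights (B, n)
        # weighted fusion of the branches -> fused image feature
        return (w.view(*w.shape, 1, 1, 1) * stacked).sum(dim=1)

fused = BranchAttention()([torch.randn(2, 64, 32, 32),
                           torch.randn(2, 64, 32, 32)])
```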
7. The method of claim 1, further comprising: preprocessing the training image to reduce noise information in the training image, wherein the noise information includes noise caused by at least one of underexposure and focus blur.
8. The method of claim 7, wherein the pre-processing the training images comprises at least one of:
eliminating Gaussian noise in the training image through image weighted fusion; and
performing nonlinear tone editing on the training image.
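Nonlinear tone editing is commonly realized as a gamma curve; the sketch below assumes that reading, and the gamma value is arbitrary.

```python
import numpy as np

def nonlinear_tone_edit(img, gamma=0.7):
    """Apply a gamma curve via a lookup table; img is a uint8 image array.
    gamma < 1 brightens underexposed regions, gamma > 1 darkens them."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return lut[img]
```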
9. The method of claim 8, wherein the eliminating gaussian noise in the training image by image weighted fusion comprises:
scanning each pixel in the training image with a convolution kernel to obtain a filtered training image, wherein the value of each pixel in the filtered training image is the weighted average gray value of that pixel and its neighboring pixels; and
performing weighted fusion on the filtered training image and the original training image to eliminate the Gaussian noise in the training image.
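A short OpenCV sketch of the claim-9 denoising: Gaussian filtering replaces each pixel with a weighted average of itself and its neighbours, and the filtered image is then fused back with the original by weighting. The kernel size, sigma, blend weights, and the file name are all assumptions.

```python
import cv2

img = cv2.imread("train.jpg")                    # hypothetical training image
# each pixel becomes a weighted average of itself and its neighbours
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)
# weighted fusion of the filtered image with the original
denoised = cv2.addWeighted(img, 0.4, blurred, 0.6, 0)
```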
10. An image recognition method, comprising:
acquiring an input image; and
processing the input image by using the trained neural network to obtain a recognition result for the input image;
wherein the neural network comprises:
the multi-scale feature extraction network is used for extracting multilayer image features of the training images;
the multi-scale feature fusion network is used for performing weighted fusion on the multilayer image features based on the respective hierarchical weights of the multilayer image features to obtain fusion image features, and determining image recognition features based on the fusion image features, wherein the respective hierarchical weights of the multilayer image features are positively correlated with the respective degrees of influence of the multilayer image features on the recognition result, and the dimensionality of the fusion image features is greater than the sum of the respective dimensionalities of the multilayer image features; and
a classifier for determining a recognition result of the training image based on the image recognition features;
wherein the neural network is trained by adjusting parameters of the neural network so that the recognition result of an input training image approaches the labeling result of the training image.
11. An image recognition apparatus comprising:
the image acquisition module is used for acquiring an input image; and
the image recognition module is used for processing the input image by using the trained neural network to obtain a recognition result for the input image;
wherein the neural network comprises:
the multi-scale feature extraction network is used for extracting multilayer image features of the training images;
the multi-scale feature fusion network is used for performing weighted fusion on the multilayer image features based on the respective hierarchical weights of the multilayer image features to obtain fusion image features, and determining image recognition features based on the fusion image features, wherein the respective hierarchical weights of the multilayer image features are positively correlated with the respective degrees of influence of the multilayer image features on the recognition result, and the dimensionality of the fusion image features is greater than the sum of the respective dimensionalities of the multilayer image features; and
a classifier for determining a recognition result of the training image based on the image recognition features;
wherein the neural network is trained by adjusting parameters of the neural network so that the recognition result of an input training image approaches the labeling result of the training image.
12. An electronic device, comprising:
one or more processors;
a storage device for storing executable instructions which, when executed by the one or more processors, implement the method for training a neural network according to any one of claims 1 to 9, or implement the image recognition method according to claim 10.
13. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, implement a method of training a neural network as claimed in any one of claims 1 to 9, or implement an image recognition method as claimed in claim 10.
CN202011496692.5A 2020-12-17 2020-12-17 Method for training neural network, image recognition method and image recognition device Pending CN112598045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011496692.5A CN112598045A (en) 2020-12-17 2020-12-17 Method for training neural network, image recognition method and image recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011496692.5A CN112598045A (en) 2020-12-17 2020-12-17 Method for training neural network, image recognition method and image recognition device

Publications (1)

Publication Number Publication Date
CN112598045A true CN112598045A (en) 2021-04-02

Family

ID=75199055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011496692.5A Pending CN112598045A (en) 2020-12-17 2020-12-17 Method for training neural network, image recognition method and image recognition device

Country Status (1)

Country Link
CN (1) CN112598045A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269077A (en) * 2021-05-19 2021-08-17 青岛科技大学 Underwater acoustic communication signal modulation mode identification method based on improved gating network and residual error network
CN113269077B (en) * 2021-05-19 2023-04-07 青岛科技大学 Underwater acoustic communication signal modulation mode identification method based on improved gating network and residual error network
CN113361540A (en) * 2021-05-25 2021-09-07 商汤集团有限公司 Image processing method and device, electronic equipment and storage medium
CN113298092A (en) * 2021-05-28 2021-08-24 有米科技股份有限公司 Neural network training method and device for extracting multi-level image contour information
CN113657180A (en) * 2021-07-23 2021-11-16 浙江大华技术股份有限公司 Vehicle identification method, server and computer readable storage medium
CN113963352B (en) * 2021-09-22 2022-08-02 支付宝(杭州)信息技术有限公司 Method and device for recognizing picture and training neural network
CN113963352A (en) * 2021-09-22 2022-01-21 支付宝(杭州)信息技术有限公司 Method and device for recognizing picture and training neural network
CN114088399A (en) * 2021-10-20 2022-02-25 昆明理工大学 Bearing fault online diagnosis method and system based on deep separable convolution
CN113989569A (en) * 2021-10-29 2022-01-28 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113989569B (en) * 2021-10-29 2023-07-04 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium
CN114550137A (en) * 2022-02-22 2022-05-27 智道网联科技(北京)有限公司 Method and device for identifying traffic sign board and electronic equipment
CN114550137B (en) * 2022-02-22 2024-04-09 智道网联科技(北京)有限公司 Method and device for identifying traffic sign board and electronic equipment
CN114677661A (en) * 2022-03-24 2022-06-28 智道网联科技(北京)有限公司 Roadside identifier identification method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN112598045A (en) Method for training neural network, image recognition method and image recognition device
CN110197229B (en) Training method and device of image processing model and storage medium
CN111080628A (en) Image tampering detection method and device, computer equipment and storage medium
CN111738243B (en) Method, device and equipment for selecting face image and storage medium
US11354797B2 (en) Method, device, and system for testing an image
US20230021661A1 (en) Forgery detection of face image
CN109977832B (en) Image processing method, device and storage medium
US11605156B2 (en) Iterative image inpainting with confidence feedback
WO2023065503A1 (en) Facial expression classification method and electronic device
CN112257759A (en) Image processing method and device
CN113569740B (en) Video recognition model training method and device, and video recognition method and device
CN113505848A (en) Model training method and device
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN111861867A (en) Image background blurring method and device
CN114187515A (en) Image segmentation method and image segmentation device
CN112465737A (en) Image processing model training method, image processing method and image processing device
CN110414593B (en) Image processing method and device, processor, electronic device and storage medium
CN117009873A (en) Method for generating payment risk identification model, and method and device for payment risk identification
CN116434218A (en) Check identification method, device, equipment and medium suitable for mobile terminal
Wang et al. A multi-scale attentive recurrent network for image dehazing
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN117011156A (en) Image processing method, device, equipment and storage medium
CN113011410A (en) Training method of character recognition model, character recognition method and device
CN114332993A (en) Face recognition method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination