CN111680698A

CN111680698A - Image recognition method and device and training method and device of image recognition model

Info

Publication number: CN111680698A
Application number: CN202010318743.9A
Authority: CN
Inventors: 康丽萍; 魏晓明
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2020-04-21
Filing date: 2020-04-21
Publication date: 2020-09-18

Abstract

The application discloses an image recognition method and device and a training method and device for an image recognition model. The training method comprises the following steps: acquiring an original image, and extracting a multi-channel feature map of the original image by using a convolution network of an image recognition model; determining a plurality of attention maps based on the correlation of the channel feature maps; determining a salient region in the original image using a non-maximum suppression algorithm based on the plurality of attention maps; and obtaining a category recognition result of the salient region by using the classification network of the image recognition model, and determining a first classification loss according to the category recognition result, so as to optimize the parameters of the image recognition model according to the first classification loss. The image recognition model obtained through this training has strong feature expression capability, can recognize fine-grained features in an image, and achieves high recognition accuracy, thereby improving fine-grained image recognition performance.

Description

Image recognition method and device and training method and device of image recognition model

Technical Field

The application relates to the technical field of image recognition, in particular to an image recognition method and device and an image recognition model training method and device.

Background

Traditional image recognition is generally used to distinguish between different types of objects, such as flowers, birds, and automobiles. Fine-grained image recognition has also become an important research topic in the field of computer vision. Fine-Grained Visual Classification (FGVC) addresses the problem of intra-class classification: it distinguishes different sub-types within the same type of object, performing a more detailed subclass classification on similar basic classes.

However, because the inter-class variance is small and the intra-class variance is large, fine-grained image recognition is more difficult than ordinary image classification. Fine-grained image recognition has high application value in many service scenarios such as online shopping and consumption, so how to improve the feature expression capability of a fine-grained image recognition model and strengthen the distinction among fine-grained categories is a technical problem that urgently needs to be solved.

Disclosure of Invention

In view of the above, the present application is proposed to provide an image recognition method and apparatus, and an image recognition model training method and apparatus that overcome or at least partially solve the above problems.

According to a first aspect of the present application, there is provided a training method of an image recognition model, including:

acquiring an original image, and extracting a multi-channel feature map of the original image by using a convolution network of an image recognition model;

determining a plurality of attention maps based on the correlation of the channel feature maps;

determining salient regions in the original image using a non-maximum suppression algorithm based on a plurality of the attention maps;

and obtaining a category identification result of the salient region by utilizing the classification network of the image identification model, and determining a first classification loss according to the category identification result so as to optimize the parameters of the image identification model according to the first classification loss.

Optionally, the determining a plurality of attention maps based on the correlation of the channel feature maps includes:

performing global pooling on each channel feature map to obtain global features of each channel feature map;

determining the correlation among the channel feature maps based on the global features, and determining the activation weight of each channel feature map according to the correlation;

and carrying out recalibration on the weight of each channel feature map according to the activation weight, and determining a plurality of attention maps according to a recalibration result.

Optionally, the determining the salient region in the original image by using a non-maximum suppression algorithm based on a plurality of the attention maps comprises:

mapping each attention map to the original image to obtain a mapping region of each attention map;

determining probability values of the mapping regions based on the weights of the attention maps;

performing a local search according to the probability value of each mapping region and the intersection-over-union among the mapping regions, and retaining the mapping regions corresponding to local probability maxima;

determining the salient region according to the reserved mapping region.

Optionally, the obtaining a category identification result of the salient region by using the classification network of the image identification model includes:

cropping the original image according to the salient region to obtain a salient image;

and inputting the salient image into the classification network to obtain a category recognition result of the salient image.

Optionally, the training method of the image recognition model further includes:

acquiring a non-salient image and inputting the non-salient image into a classification network of the image recognition model to obtain a second classification loss;

and optimizing the parameters of the image recognition model according to the second classification loss and the first classification loss.

Optionally, the acquiring the non-salient image comprises:

randomly sampling among the channels corresponding to the attention maps according to the weight of each attention map;

mapping the attention map corresponding to the randomly sampled channel to the original image to obtain a sampled image;

and determining a mask region of the sampled image according to a preset pixel threshold, and obtaining the non-salient image according to the mask region of the sampled image.

Optionally, the training method of the image recognition model further includes:

inputting the multi-channel feature map corresponding to the original image into a classification network of the image recognition model to obtain a third classification loss;

and optimizing parameters of the image recognition model according to the first classification loss, the second classification loss and the third classification loss.

According to a second aspect of the present application, there is provided an image recognition method comprising:

acquiring an image to be identified;

generating a multi-channel feature map of the image to be recognized by using a convolution network of an image recognition model;

determining a salient region of the image to be identified based on the correlation of each channel feature map;

and classifying the salient region by utilizing a classification network of the image recognition model to obtain a class recognition result of the image to be recognized, wherein the image recognition model is obtained by training based on the training method of the image recognition model.

According to a third aspect of the present application, there is provided a training apparatus for an image recognition model, comprising:

the first acquisition unit is used for acquiring an original image and extracting a multi-channel feature map of the original image by using a convolution network of an image recognition model;

a first determination unit for determining a plurality of attention maps based on the correlation of the respective channel feature maps;

a second determination unit configured to determine a salient region in the original image using a non-maximum suppression algorithm based on a plurality of the attention maps;

and the first optimization unit is used for obtaining a class identification result of the salient region by utilizing a classification network of the image identification model, and determining a first classification loss according to the class identification result so as to optimize parameters of the image identification model according to the first classification loss.

According to a fourth aspect of the present application, there is provided an image recognition apparatus comprising:

the second acquisition unit is used for acquiring an image to be identified;

the generating unit is used for generating a multi-channel feature map of the image to be identified by using a convolution network of an image identification model;

the third determining unit is used for determining a salient region of the image to be recognized based on the correlation of the channel feature maps;

and the classification unit is used for classifying the salient region by utilizing the classification network of the image recognition model to obtain a class recognition result of the image to be recognized, wherein the image recognition model is obtained by training based on the training device of the image recognition model.

According to a fifth aspect of the present application, there is provided an electronic device comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method of training an image recognition model as defined in any one of the above, or to perform an image recognition method as defined in any one of the above.

According to a sixth aspect of the present application, there is provided a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the training method of the image recognition model according to any one of the above, or perform the image recognition method according to any one of the above.

According to the technical solution of the present application, an original image is acquired, and a multi-channel feature map of the original image is extracted by using the convolution network of the image recognition model; a plurality of attention maps are determined based on the correlation of the channel feature maps; a salient region in the original image is determined using a non-maximum suppression algorithm based on the plurality of attention maps; and a category recognition result of the salient region is obtained by using the classification network of the image recognition model, and a first classification loss is determined according to the category recognition result, so as to optimize the parameters of the image recognition model according to the first classification loss. The image recognition model obtained through this training has strong feature expression capability, can recognize fine-grained features in an image, and achieves high recognition accuracy, thereby improving fine-grained image recognition performance.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a diagram illustrating a classification method for a fine-grained visual classification task in the prior art;

FIG. 2 shows a flow diagram of a method of training an image recognition model according to one embodiment of the present application;

FIG. 3 shows a schematic diagram of a SENet structure according to an embodiment of the present application;

FIG. 4 illustrates a schematic flow chart of training an image recognition model according to an embodiment of the present application;

FIG. 5 shows a schematic flow diagram of an image recognition method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a training apparatus for image recognition models according to an embodiment of the present application;

FIG. 7 shows a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;

FIG. 8 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 9 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The prior art proposes a classification method for fine-grained visual classification tasks. As shown in FIG. 1, the method is based on a Weakly Supervised Data Augmentation Network (WS-DAN) combined with an attention mechanism, so that the network can focus on the most discriminative parts of an image without additional annotation information. Specifically, based on a feature map (dimensions [h, w, channels]), the method acquires attention maps to characterize the salient regions in an image and, under the guidance of the attention maps, performs data augmentation on the original image through a salient image (attention crop) and a non-salient image (attention drop), thereby improving the image feature expression capability. The two key steps for determining the attention maps and the attention crop are as follows:

(1) according to the selected number of regions N, the feature maps of the first N channels are always selected as the attention maps during training, namely [h, w, 0:N];

(2) random sampling is performed among the attention maps of the N channels according to a certain probability distribution, and the feature map of the sampled channel is used to obtain the attention crop region.

The probability distribution is determined as follows:

part_weights = tf.reduce_mean(tf.reduce_mean(attention_map, axis=0), axis=0)
part_weights = tf.sqrt(part_weights)

The attention crop is obtained as follows, where selected_index is the channel index selected according to part_weights:

selected_index = np.random.choice(np.arange(0, N), 1, p=part_weights)[0]
mask = attention_map[:, :, selected_index]
threshold = random.uniform(T1, T2)
itemindex = np.where(mask >= mask.max() * threshold)
ymin = itemindex[0].min() / height - 0.1
ymax = itemindex[0].max() / height + 0.1
xmin = itemindex[1].min() / width - 0.1
xmax = itemindex[1].max() / width + 0.1
bbox = np.asarray([ymin, xmin, ymax, xmax], dtype=np.float32)
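For reference, the prior-art sampling steps above can be gathered into one self-contained NumPy sketch. This is only an illustration under stated assumptions, not the WS-DAN implementation itself: the function name attention_crop_bbox, the default thresholds t1 and t2, and the padding value are hypothetical; the weights are normalized so that they form a valid probability distribution; the box is clipped to the range 0 to 1; and the attention maps are assumed to be non-negative so that the square root is defined.

```python
import numpy as np

def attention_crop_bbox(attention_map, n_parts, t1=0.4, t2=0.6, pad=0.1, rng=None):
    """Return a normalized [ymin, xmin, ymax, xmax] box from one sampled channel."""
    rng = np.random.default_rng() if rng is None else rng
    height, width, _ = attention_map.shape
    maps = attention_map[:, :, :n_parts]  # first N channels, as in the prior art

    # Sampling distribution: square root of the mean response of each channel.
    part_weights = np.sqrt(maps.mean(axis=(0, 1)))
    part_weights = part_weights / part_weights.sum()  # assumption: normalize so p sums to 1

    selected_index = rng.choice(n_parts, p=part_weights)
    mask = maps[:, :, selected_index]
    threshold = rng.uniform(t1, t2)
    rows, cols = np.where(mask >= mask.max() * threshold)

    # Padded, normalized box; clipping to [0, 1] is an added assumption.
    ymin = max(rows.min() / height - pad, 0.0)
    ymax = min(rows.max() / height + pad, 1.0)
    xmin = max(cols.min() / width - pad, 0.0)
    xmax = min(cols.max() / width + pad, 1.0)
    return np.asarray([ymin, xmin, ymax, xmax], dtype=np.float32)

# Toy input: every channel responds on rows 2..4 and columns 3..6 of a 10x10 map.
toy_map = np.zeros((10, 10, 4))
toy_map[2:5, 3:7, :] = 1.0
toy_bbox = attention_crop_bbox(toy_map, n_parts=4, rng=np.random.default_rng(0))
```

Because the channel is drawn at random and only the first N channels are ever considered, the resulting box depends on neither the correlation between channels nor the overlap between candidate regions.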

It can be seen that the prior-art procedure for determining the attention crop based on the weakly supervised data augmentation network has at least the following two problems:

(1) the attention maps are always the first N channels, and the correlation among different feature channels is not considered;

(2) the attention crop is chosen only by randomly sampling a channel, with the average response magnitudes of the N channels as weights, so the positional relationship among the attention maps of the N channels is not fully used to select the most reasonable attention crop.

Because of these problems, the recognition accuracy of prior-art fine-grained image recognition models still needs to be improved.

Based on this, an embodiment of the present application provides a training method for an image recognition model, as shown in fig. 2, the training method for an image recognition model includes the following steps S210 to S240:

Step S210, obtaining an original image, and extracting a multi-channel feature map of the original image by using a convolution network of an image recognition model.

First, an original image is obtained and input into the convolution network of the image recognition model, where convolution and other transformation operations produce the multi-channel feature map corresponding to the original image. The convolution network may be an Inception V4 network, a deep convolutional neural network that serves as the basic network structure for image feature extraction in the embodiment of the present application. Of course, those skilled in the art may also use other types of convolutional neural networks according to actual needs, such as a Region-based Convolutional Neural Network (R-CNN for short), which are not listed one by one here.

In step S220, a plurality of attention maps are determined based on the correlation of the channel feature maps.

The embodiment of the present application introduces an attention mechanism as the basis for fine-grained image recognition. The attention mechanism imitates the internal process of biological observation behavior: it aligns internal experience with external sensation so as to increase the fineness of observation in selected regions. Because the attention mechanism can quickly extract important features from sparse data, it is widely used in natural language processing and image processing tasks.

As described above, the prior-art image recognition method does not consider the correlation between different feature channels when determining the attention maps, which may affect the accuracy of the final recognition result. In the embodiment of the present application, by contrast, after the multi-channel feature map corresponding to the original image is obtained, the attention maps are determined based on the correlation between the channel feature maps, so that the selected attention maps better capture the fine-grained features in the image.

Determining the attention maps according to the correlation among the channel feature maps can be implemented with a Squeeze-and-Excitation Network (SENet for short). Its core idea is that attention is distributed mainly over the feature channels; that is, different channels of an image receive different degrees of attention, so that more reasonable and accurate fine-grained image features can be learned.

And step S230, determining a salient region in the original image by using a non-maximum suppression algorithm based on a plurality of the attention maps.

An attention map reflects regions of possible interest in the image. After a plurality of attention maps are determined according to the correlation among channels, the attention maps can be further screened to determine the salient region (attention crop) in the image, so as to obtain a more accurate salient region.

As described above, the prior-art way of determining the attention crop from the attention maps does not make full use of the positional relationship among the attention maps of the channels, so the most reasonable attention crop may not be selected and the recognition performance of the resulting model still needs to be improved. The embodiment of the present application therefore determines the salient region with a non-maximum suppression (NMS) algorithm, which takes the positional relationship among the attention maps into account.

Step S240, obtaining a classification recognition result of the salient region by using the classification network of the image recognition model, and determining a first classification loss according to the classification recognition result, so as to optimize parameters of the image recognition model according to the first classification loss.

After the salient region corresponding to the original image is obtained, the classification network of the image recognition model classifies the image containing the salient region to obtain the category recognition result of the salient-region image. The classification loss of the image recognition model, namely the difference between the category predicted by the model for the salient region and the true category of the salient region, is output accordingly, and the model parameters are continuously optimized through back propagation of the classification loss value.

The image recognition model obtained through the above training process can recognize fine-grained features of an image and can therefore achieve a better image recognition result.

In an embodiment of the application, the determining a plurality of attention maps based on the correlation of the channel feature maps includes: performing global pooling on each channel feature map to obtain global features of each channel feature map; determining the correlation among the channel feature maps based on the global features, and determining the activation weight of each channel feature map according to the correlation; and carrying out recalibration on the weight of each channel feature map according to the activation weight, and determining a plurality of attention maps according to a recalibration result.

In the embodiment of the present application, SENet is used to determine the attention maps; FIG. 3 is a schematic diagram of the SENet structure. The core idea of SENet is to measure the importance of different channels through a set of learned weights, that is, the original feature map is recalibrated by applying the learned weights. SENet mainly comprises three parts: the Squeeze operation, the Excitation operation, and the feature scaling (Fscale) operation.

First, the Squeeze operation compresses the features along the spatial dimensions: through a global pooling operation, each two-dimensional feature channel in the feature map U is turned into a single real number, which to some extent has a global receptive field. The output dimension (1x1xC) matches the number of input feature channels (WxHxC). This output characterizes the global distribution of responses over the feature channels and allows layers close to the input to obtain a global receptive field as well.

Second, the Excitation operation, a mechanism similar to the gates in a recurrent neural network, generates a weight for each feature channel through learned parameters that characterize the correlation between feature channels. Specifically, a series of non-linear mappings may be applied to the 1x1xC global feature generated above, for example a fully connected (FC for short) layer, a rectified linear unit (ReLU for short) activation function, and another fully connected layer, followed by a sigmoid activation function (which squashes each value into the range 0 to 1) to obtain the activation weight of each channel feature map.

Finally, the feature scaling (Fscale) operation, also called reweighting (Reweight), treats the obtained activation weights as the importance of each feature channel after feature selection and multiplies them channel by channel onto the previous feature map U, completing the recalibration of the original features in the channel dimension.

After the weight recalibration of the channel feature maps is completed, the feature maps corresponding to the N channels with the largest weights can be selected as the attention maps. The channels with higher saliency selected in this adaptive manner better reflect the fine-grained features of the image and improve the overall recognition performance of the model.
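The three operations can be illustrated with a minimal NumPy sketch. It is a sketch under assumptions rather than the network used in the embodiment: global average pooling is assumed for the Squeeze step, the function name se_recalibrate and the reduction ratio R are hypothetical, and the fully connected weights are random placeholders, whereas in a real model they are learned.

```python
import numpy as np

def se_recalibrate(feature_map, w1, b1, w2, b2):
    """Squeeze -> Excitation -> Scale on an (H, W, C) feature map."""
    # Squeeze: global average pooling gives one real number per channel, shape (C,).
    z = feature_map.mean(axis=(0, 1))
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one activation weight per channel.
    hidden = np.maximum(z @ w1 + b1, 0.0)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))
    # Scale: multiply every channel by its activation weight (broadcast over H and W).
    return feature_map * weights, weights

rng = np.random.default_rng(0)
H, W, C, R = 7, 7, 16, 4  # R is a hypothetical channel-reduction ratio
U = rng.random((H, W, C))
w1, b1 = rng.normal(size=(C, C // R)), np.zeros(C // R)
w2, b2 = rng.normal(size=(C // R, C)), np.zeros(C)
recalibrated, activation_weights = se_recalibrate(U, w1, b1, w2, b2)
```

The bottleneck of C // R hidden units is a common way to keep the two fully connected layers small; the source does not specify the layer sizes.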
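Selecting the N channels with the largest weights can be sketched as follows. The function name select_attention_maps and the toy weights are hypothetical; the only assumption is that one activation weight per channel is already available.

```python
import numpy as np

def select_attention_maps(feature_map, activation_weights, n):
    """Keep the n channels of an (H, W, C) feature map with the largest weights."""
    # Channel indices sorted by descending activation weight.
    order = np.argsort(activation_weights)[::-1][:n]
    return feature_map[:, :, order], order

toy_features = np.arange(16).reshape(2, 2, 4)
toy_weights = np.array([0.1, 0.9, 0.5, 0.7])
toy_maps, toy_order = select_attention_maps(toy_features, toy_weights, n=2)
```

Unlike the fixed slice [h, w, 0:N], the selected indices here change with the input, because the activation weights are computed from each image's own feature map.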

In an embodiment of the present application, the determining the salient region in the original image using the non-maximum suppression algorithm based on a plurality of the attention maps comprises: mapping each attention map to the original image to obtain a mapping region of each attention map; determining probability values of the mapping regions based on the weights of the attention maps; performing a local search according to the probability value of each mapping region and the intersection-over-union among the mapping regions, and retaining the mapping regions corresponding to local probability maxima; and determining the salient region according to the retained mapping regions.

In a specific implementation, the attention maps may first be mapped back to the original image, yielding an image marked with a plurality of rectangular boxes, and NMS processing is then performed. Specifically, the probability value of each rectangular box in the image is determined according to the weight of each attention map. Taking the rectangular box with the maximum probability value as a reference, the Intersection-over-Union (IoU for short) between each of the other rectangular boxes and the reference box is calculated, a local search is performed, and the rectangular boxes corresponding to local probability maxima are retained. Finally, the image corresponding to the salient region is determined according to the probability values of the retained rectangular boxes. Intersection-over-Union is a concept used in object detection: it is the overlap rate between a generated candidate box and the original labeled box, that is, the ratio of their intersection to their union.

For example, assume that the attention maps selected above have N = 6 channels and that their average response magnitudes are ordered A < B < C < D < E < F. Then: 1) the maximum-probability rectangular box F (that is, the box with the largest response) is determined first and marked as a box to be retained; 2) starting from F, it is judged whether the overlap of each of A through E with F, namely the IoU of the two boxes, is greater than a set threshold; supposing the overlap of B and D with F exceeds the threshold, B and D are discarded; 3) from the remaining boxes A, C, and E, the box E with the highest probability value is selected and marked as a box to be retained, the IoU of E with each of A and C is then judged, and any box whose IoU exceeds the set threshold is removed. These steps are repeated until no boxes remain, at which point all boxes to be retained have been marked. From the retained boxes, the one ranked highest, that is, the one with the largest response, is selected as the attention crop (of course, several boxes may also be retained, and the number of output boxes can be controlled by parameters as needed).
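The procedure in this example can be sketched in plain Python. The box coordinates, the scores, and the threshold of 0.5 are hypothetical values chosen so that B and D overlap F and A overlaps E; boxes are written as (ymin, xmin, ymax, xmax).

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (ymin, xmin, ymax, xmax) boxes."""
    inter_h = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_w = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the names of the retained boxes, highest score first."""
    remaining = sorted(boxes, key=lambda name: scores[name], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring box still in play
        kept.append(best)
        # Discard every box that overlaps the retained box too much.
        remaining = [n for n in remaining if iou(boxes[best], boxes[n]) <= iou_threshold]
    return kept

boxes = {
    "A": (21, 20, 30, 30), "B": (1, 1, 10, 10), "C": (40, 40, 50, 50),
    "D": (0, 0, 9, 10), "E": (20, 20, 30, 30), "F": (0, 0, 10, 10),
}
scores = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4, "E": 0.5, "F": 0.6}
kept = nms(boxes, scores)   # ['F', 'E', 'C']
attention_crop = kept[0]    # the retained box with the largest response
```

With these values, F suppresses B and D, E suppresses A, and C survives because it overlaps nothing, so three non-overlapping candidates remain and F is taken as the attention crop.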

In an embodiment of the application, the obtaining of the category recognition result of the salient region by using the classification network of the image recognition model includes: cropping the original image according to the salient region to obtain a salient image; and inputting the salient image into the classification network to obtain a category recognition result of the salient image.

Alternatively, unlike the above embodiment, a person skilled in the art may first perform NMS processing on the attention maps and then map the resulting salient region back to the original image to obtain the salient image.

In this case, the salient region obtained by the NMS algorithm is essentially a region of the feature map, and its position in the original image is obtained by mapping it back to the original image. To strengthen the learning of the salient region, the corresponding region is cropped out of the original image and resized to the same size as the original image to obtain the salient image, which is then fed into the classification network to obtain its category recognition result.
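Cropping the salient region and resizing it to the size of the original image can be sketched as follows. The function name crop_and_resize is hypothetical, the box is assumed to be given in normalized [ymin, xmin, ymax, xmax] form, and nearest-neighbor resizing is used only to keep the sketch dependency-free; the source does not specify the interpolation method.

```python
import numpy as np

def crop_and_resize(image, bbox, out_height, out_width):
    """Crop a normalized [ymin, xmin, ymax, xmax] box and resize it (nearest neighbor)."""
    height, width = image.shape[:2]
    ymin, xmin, ymax, xmax = np.clip(np.asarray(bbox, dtype=np.float64), 0.0, 1.0)
    # Convert to pixel indices, keeping at least one row and one column.
    y0 = min(int(np.floor(ymin * height)), height - 1)
    x0 = min(int(np.floor(xmin * width)), width - 1)
    y1 = max(int(np.ceil(ymax * height)), y0 + 1)
    x1 = max(int(np.ceil(xmax * width)), x0 + 1)
    crop = image[y0:y1, x0:x1]
    # Nearest-neighbor index maps from output pixels to crop pixels.
    rows = np.arange(out_height) * crop.shape[0] // out_height
    cols = np.arange(out_width) * crop.shape[1] // out_width
    return crop[rows][:, cols]

toy_image = np.arange(64).reshape(8, 8)
salient_image = crop_and_resize(toy_image, [0.25, 0.25, 0.75, 0.75], 8, 8)
```

Here the central 4x4 region of the toy image is enlarged back to 8x8, so the classifier sees the salient part at a higher effective resolution.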

In an embodiment of the present application, the training method of the image recognition model further includes: acquiring a non-salient image and inputting the non-salient image into a classification network of the image recognition model to obtain a second classification loss; and optimizing the parameters of the image recognition model according to the second classification loss and the first classification loss.

In a specific implementation, the image recognition model in the embodiment of the present application can be trained on the salient region of the image and on the non-salient region of the image at the same time. The non-salient region is the part of the image other than the salient region. To prevent the features learned by the model from becoming overly focused and to enhance the generalization capability of the model, the non-salient image is also fed into the classification network of the image recognition model for training, and the model parameters are further optimized according to the resulting loss.

In one embodiment of the present application, the acquiring the non-salient image comprises: randomly sampling among the channels corresponding to the attention maps according to the weight of each attention map; mapping the attention map corresponding to the randomly sampled channel to the original image to obtain a sampled image; and determining a mask region of the sampled image according to a preset pixel threshold, and obtaining the non-salient image according to the mask region of the sampled image.

In a specific implementation, the non-salient image may be obtained as follows: 1) one channel is randomly sampled from the N attention maps as the basis for the subsequent data augmentation; random selection both increases robustness and lets the model attend to multiple parts of the object, and the sampled feature channel may then be normalized to facilitate subsequent operations; 2) the attention map corresponding to the sampled channel is mapped back to the original image to obtain a sampled image; 3) a mask region is determined according to a preset pixel threshold, and the mask is multiplied element-wise with the sampled image corresponding to the sampled channel to obtain the non-salient image (attention drop) corresponding to the original image. The mask in the embodiment of the present application shields the salient region of the image so that it does not take part in the computation, which lets the model focus on information in regions other than the salient region and enhances the generalization capability of the model.
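The three steps can be sketched with NumPy as follows. Several details are assumptions made for illustration: the function name attention_drop and the default threshold of 0.5 are hypothetical, the attention map is mapped back to the image by nearest-neighbor upsampling, and the mask is applied to the pixels of the original image, as is common in attention-dropping augmentation.

```python
import numpy as np

def attention_drop(image, attention_maps, weights, pixel_threshold=0.5, rng=None):
    """Zero out the region where one sampled attention map responds strongly."""
    rng = np.random.default_rng() if rng is None else rng
    # 1) Sample one channel with probability proportional to its weight.
    p = np.asarray(weights, dtype=np.float64)
    k = rng.choice(len(p), p=p / p.sum())
    amap = attention_maps[:, :, k]
    # Normalize the sampled channel to [0, 1] (small epsilon avoids division by zero).
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-12)
    # 2) Map the attention map back to image resolution (nearest neighbor).
    height, width = image.shape[:2]
    rows = np.arange(height) * amap.shape[0] // height
    cols = np.arange(width) * amap.shape[1] // width
    sampled = amap[rows][:, cols]
    # 3) Mask is 0 where the response reaches the threshold, 1 elsewhere.
    mask = (sampled < pixel_threshold).astype(image.dtype)
    return image * (mask[..., None] if image.ndim == 3 else mask)

toy_image = np.ones((4, 4, 3))
toy_maps = np.array([[1.0, 0.0], [0.0, 0.0]]).reshape(2, 2, 1)
non_salient = attention_drop(toy_image, toy_maps, weights=[1.0])
```

In the toy case the single attention map responds only in its top-left cell, so the top-left 2x2 block of the image is zeroed and the rest is left unchanged.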

In an embodiment of the present application, the training method of the image recognition model further includes: inputting the multi-channel feature map corresponding to the original image into a classification network of the image recognition model to obtain a third classification loss; and optimizing parameters of the image recognition model according to the first classification loss, the second classification loss and the third classification loss.

In a specific implementation, in addition to training on the salient region and the non-salient region of the image, the image recognition model in the embodiment of the application can also send the multi-channel feature map corresponding to the original image directly into the classification network of the model to take part in training at the same time. The loss function obtained by training on the original image is combined with the loss functions corresponding to the salient region image and the non-salient region image to jointly optimize the parameters of the model, which further enhances the recognition effect and the generalization capability of the model.

The classification network in the embodiment of the application may adopt a softmax classification network, which yields a corresponding softmax loss. Softmax is essentially a normalized exponential function that gives the probability of the input belonging to each class. In addition, when training on the original image, a center loss function (center loss) can be introduced as one of the jointly optimized loss functions so that each attention map locates the same object part every time. Adding the sum of squared differences between each feature map and its part center (that is, the center loss) as a penalty term anchors each feature map to the center of its part, and the part center is updated according to each learned feature map.
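The two loss terms described above can be illustrated with a short Python sketch. The function names and the moving-average update rate `beta` are assumptions made for illustration; the embodiment does not specify how the part centers are updated beyond stating that they follow the learned feature maps.

```python
import math

def softmax(logits):
    """Normalized exponential: turns raw scores into class probabilities."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def center_loss(part_features, part_centers):
    """Sum of squared differences between each part feature and its center."""
    return sum((f - c) ** 2
               for feat, cen in zip(part_features, part_centers)
               for f, c in zip(feat, cen))

def update_centers(part_features, part_centers, beta=0.05):
    """Move each center a small step towards the observed feature (assumed rule)."""
    return [[c + beta * (f - c) for f, c in zip(feat, cen)]
            for feat, cen in zip(part_features, part_centers)]
```

For two equal logits the softmax loss is log 2, and a part feature (1, 2) with center (0, 0) gives a center loss of 5.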

As shown in fig. 4, an embodiment of the present application further provides a training flow diagram of an image recognition model. First, an original image is obtained, and feature extraction is performed on it through the convolution network of the image recognition model to obtain multi-channel feature maps of size W×H×C. The multi-channel feature maps are input into the SENet of the model, which recalibrates their weights, and the top N feature maps by recalibrated weight are selected as attention maps. The attention maps are then mapped to the original image to obtain a plurality of images marked with rectangular frames, the rectangular frames are screened using a non-maximum suppression (NMS) algorithm, and the regions of the retained rectangular frames are cropped and resized to obtain a saliency image of the same size as the original image. The saliency image is sent into the classification network of the model for classification to obtain the softmax classification loss of the saliency image.

On the other hand, a mask region of the image can be generated based on the attention map, and a non-salient image is generated according to the mask region and the corresponding channel image. The non-salient image is used as another input of the classification network to obtain the softmax classification loss of the non-salient image. At the same time, the multi-channel feature map obtained from the original image is also used as an input of the classification network, which yields the softmax classification loss and the center loss of the original image. Finally, the parameters of the model are jointly optimized based on the loss functions output by the classification network for each branch.
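One simple way to combine the branch losses for joint optimization is a weighted sum, sketched below in Python. The embodiment does not state how the losses are weighted, so the unweighted sum of the three classification losses and the hypothetical `center_weight` factor are assumptions.

```python
def joint_loss(loss_salient, loss_non_salient, loss_original, loss_center,
               center_weight=1.0):
    """Combine the three softmax losses with the center-loss penalty term.

    center_weight is a hypothetical balancing factor, not taken from the source.
    """
    return (loss_salient + loss_non_salient + loss_original
            + center_weight * loss_center)
```

The resulting scalar would be the quantity minimized when updating the model parameters for all branches together.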

As shown in fig. 5, an embodiment of the present application further provides an image recognition method, where the image recognition method includes steps S510 to S540 as follows:

Step S510, acquiring an image to be recognized.

When fine-grained identification is carried out on an image, the image to be identified is acquired firstly and is used as the input of a subsequent image identification model.

Step S520, generating a multi-channel feature map of the image to be recognized by using a convolution network of an image recognition model.

And inputting the obtained image to be identified into a convolution network of the image identification model for convolution processing, and further obtaining a multi-channel characteristic diagram corresponding to the image.

Step S530, determining a salient region of the image to be recognized based on the correlation of the channel feature maps.

According to the correlation of the channel feature maps, a squeeze-and-excitation operation is performed on each channel feature map using SENet (Squeeze-and-Excitation Networks) in the image recognition model to obtain a plurality of attention maps, and the attention maps are processed using a non-maximum suppression algorithm to obtain the salient region of the image to be recognized.

Step S540, classifying the salient region by using the classification network of the image recognition model to obtain a class recognition result of the image to be recognized, wherein the image recognition model is obtained by training based on the training method of the image recognition model as described in any one of the previous items.

Inputting the obtained image corresponding to the salient region into a classification network of a model for classification, and further obtaining a class identification result of the image to be identified, wherein the image identification model in the embodiment of the application is obtained by training in the following way:

acquiring an original image, and extracting a multi-channel characteristic map of the original image by using a convolution network of an image recognition model; determining a plurality of attention maps based on the correlation of the channel feature maps; determining salient regions in the original image using a non-maximum suppression algorithm based on a plurality of the attention maps; and obtaining a category identification result of the salient region by utilizing the classification network of the image identification model, and determining a first classification loss according to the category identification result so as to optimize the parameters of the image identification model according to the first classification loss.

Because the image recognition method takes into account the correlation and the positional relationship among the feature channels, the image category recognition result output by the model has higher accuracy.

As shown in fig. 6, an embodiment of the present application further provides an apparatus 600 for training an image recognition model, where the apparatus 600 includes: a first obtaining unit 610, a first determining unit 620, a second determining unit 630 and a first optimizing unit 640.

The first obtaining unit 610 of the embodiment of the present application is configured to obtain an original image, and extract a multi-channel feature map of the original image by using a convolution network of an image recognition model.

Firstly, an original image is obtained, the original image is input into a convolution network of an image recognition model to perform transformation operations such as convolution and the like, and then a multichannel feature map corresponding to the original image is obtained, wherein the convolution network can adopt an Inception V4 network, the Inception V4 is a deep convolution neural network, and is a basic network structure adopted in the embodiment of the application for image feature extraction. Of course, those skilled in the art can also use other types of Convolutional Neural Networks such as Region-based Convolutional Neural Networks (R-CNNs), which are not listed here.

The first determining unit 620 of the embodiment of the present application is configured to determine a plurality of attention maps based on the correlation of each channel feature map.

The second determining unit 630 of the embodiment of the present application is configured to determine a salient region in the original image by using a non-maximum suppression algorithm based on a plurality of the attention maps.

The first optimizing unit 640 according to the embodiment of the present application is configured to obtain a class identification result of the significant region by using a classification network of the image identification model, and determine a first classification loss according to the class identification result, so as to optimize a parameter of the image identification model according to the first classification loss.

In an embodiment of the application, the first determining unit 620 is further configured to: performing global pooling on each channel feature map to obtain global features of each channel feature map; determining the correlation among the channel feature maps based on the global features, and determining the activation weight of each channel feature map according to the correlation; and carrying out recalibration on the weight of each channel feature map according to the activation weight, and determining a plurality of attention maps according to a recalibration result.
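The pooling, weighting and recalibration steps above follow the usual squeeze-and-excitation pattern, which can be sketched in Python as follows. The function names, the two fully connected layers with ReLU and sigmoid activations, and the nested-list representation of the feature maps are assumptions for illustration; the weight matrices `w1`, `w2` and biases `b1`, `b2` stand in for learned parameters.

```python
import math

def global_avg_pool(feature_maps):
    """Squeeze: reduce each channel feature map to one global scalar."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def se_weights(feature_maps, w1, b1, w2, b2):
    """Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel."""
    z = global_avg_pool(feature_maps)
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]

def top_n_attention_maps(feature_maps, channel_weights, n):
    """Recalibrate each channel by its weight and keep the N highest-weighted."""
    order = sorted(range(len(channel_weights)),
                   key=lambda c: channel_weights[c], reverse=True)[:n]
    maps = [[[channel_weights[c] * v for v in row] for row in feature_maps[c]]
            for c in order]
    return order, maps
```

Because both fully connected layers see the pooled values of all channels, the weight assigned to one channel depends on the others, which is how the correlation among channels enters the activation weights.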

In an embodiment of the present application, the second determining unit 630 is further configured to: map each attention map to the original image to obtain a mapping region of each attention map; determine probability values of the mapping regions based on the weights of the attention maps; perform a local search according to the probability value of each mapping region and the intersection over union (IoU) among the mapping regions, and retain a plurality of mapping regions corresponding to local probability maxima; and determine the salient region according to the retained mapping regions.
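The local search over mapping regions corresponds to standard greedy non-maximum suppression, sketched below in Python. The box format `(x1, y1, x2, y2)`, the function names, and the 0.5 overlap threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # a box survives only if it does not overlap any already-kept box too much
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, given boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)` and `(20, 20, 30, 30)` with scores 0.9, 0.8 and 0.7, the second box overlaps the first with an IoU of about 0.68 and is suppressed, so the indices 0 and 2 are retained.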

In an embodiment of the present application, the first optimization unit 640 is further configured to: cutting the original image according to the saliency area to obtain a saliency image; and inputting the saliency image into the classification network to obtain a category identification result of the saliency image.
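Cropping the salient region and resizing it for the classification network can be sketched in Python as follows. The function name, the `(x1, y1, x2, y2)` box format, the single-channel nested-list image, and the nearest-neighbour resizing are assumptions; the embodiment only states that the region is cropped and converted to the size of the original image.

```python
def crop_and_resize(image, box, out_h, out_w):
    """Cut the box out of the image and resize it with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    crop = [row[x1:x2] for row in image[y1:y2]]
    ch, cw = len(crop), len(crop[0])
    # each output pixel copies the crop pixel at the proportional position
    return [[crop[y * ch // out_h][x * cw // out_w] for x in range(out_w)]
            for y in range(out_h)]
```

A practical implementation would more likely use bilinear interpolation from an image library; nearest-neighbour sampling is used here only to keep the sketch dependency-free.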

In one embodiment of the present application, the apparatus further comprises: a first input unit, configured to acquire a non-salient image and input the non-salient image into the classification network of the image recognition model to obtain a second classification loss; and a second optimization unit, configured to optimize the parameters of the image recognition model according to the second classification loss and the first classification loss.

In an embodiment of the present application, the first input unit is further configured to: randomly sample the channels corresponding to the attention maps according to the weight of each attention map; map the attention map corresponding to the randomly sampled channel to the original image to obtain a sampled image; and determine a mask region of the sampled image according to a preset pixel threshold, and obtain the non-salient image according to the mask region of the sampled image.

In one embodiment of the present application, the apparatus further comprises: a second input unit, configured to input the multi-channel feature map corresponding to the original image into the classification network of the image recognition model to obtain a third classification loss; and a third optimizing unit, configured to optimize parameters of the image recognition model according to the first classification loss, the second classification loss, and the third classification loss.

As shown in fig. 7, an embodiment of the present application further provides an image recognition apparatus 700, where the apparatus 700 includes: a second obtaining unit 710, a generating unit 720, a third determining unit 730 and a classifying unit 740.

The second obtaining unit 710 in the embodiment of the application is configured to obtain an image to be identified.

The generating unit 720 in the embodiment of the present application is configured to generate a multi-channel feature map of the image to be recognized by using a convolutional network of an image recognition model.

The third determining unit 730 in this embodiment of the application is configured to determine a salient region of the image to be recognized based on the correlation of the channel feature maps.

The classification unit 740 of the embodiment of the application is configured to classify the salient region by using the classification network of the image recognition model to obtain a class recognition result of the image to be recognized, where the image recognition model is obtained by training based on the training device of the image recognition model.

Inputting the obtained images corresponding to the salient regions into a classification network of the model for classification, and further obtaining a class identification result of the image to be identified, wherein the image identification model in the embodiment of the application is obtained by training through a training device as follows:

the first acquisition unit is used for acquiring an original image and extracting a multi-channel characteristic map of the original image by using a convolution network of an image recognition model; a first determination unit for determining a plurality of attention maps based on the correlation of the respective channel feature maps; a second determination unit configured to determine a salient region in the original image using a non-maximum suppression algorithm based on a plurality of the attention maps; and the first optimization unit is used for obtaining a class identification result of the salient region by utilizing a classification network of the image identification model, and determining a first classification loss according to the class identification result so as to optimize parameters of the image identification model according to the first classification loss.

It should be noted that, for the specific implementation of each apparatus embodiment, reference may be made to the specific implementation of the corresponding method embodiment, which is not described herein again.

In summary, the technical scheme of the application obtains an original image and extracts a multi-channel feature map of the original image using the convolution network of an image recognition model; determines a plurality of attention maps based on the correlation of the channel feature maps; determines salient regions in the original image using a non-maximum suppression algorithm based on the plurality of attention maps; and obtains a category recognition result of the salient region using the classification network of the image recognition model, and determines a first classification loss according to the category recognition result so as to optimize the parameters of the image recognition model according to the first classification loss. The image recognition model obtained through training has strong feature expression capacity, can recognize fine-grained features in the image, can achieve high image recognition accuracy, and improves the fine-grained recognition performance of the image.

It should be noted that:

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The structure required to construct such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the training means or image recognition means of the image recognition model according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

For example, fig. 8 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 800 comprises a processor 810 and a memory 820 arranged to store computer executable instructions (computer readable program code). The memory 820 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 820 has a storage space 830 storing computer readable program code 831 for performing any of the method steps described above. For example, the storage space 830 for storing the computer-readable program code may include respective computer-readable program codes 831 for respectively implementing various steps in the above methods. The computer readable program code 831 may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer readable storage medium such as that shown in fig. 9. Fig. 9 shows a schematic diagram of a computer-readable storage medium according to an embodiment of the present application. The computer readable storage medium 900 stores computer readable program code 831 for executing the steps of the method according to the present application, which is readable by the processor 810 of the electronic device 800. When the computer readable program code 831 is executed by the electronic device 800, it causes the electronic device 800 to perform the steps of the method described above; in particular, the computer readable program code 831 stored by the computer readable storage medium may perform the method shown in any of the embodiments described above. The computer readable program code 831 may be compressed in a suitable form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.

Claims

1. A training method of an image recognition model is characterized by comprising the following steps:

2. The method for training an image recognition model according to claim 1, wherein the determining a plurality of attention maps based on the correlation of the channel feature maps comprises:

3. The method for training an image recognition model according to claim 1, wherein the determining the salient region in the original image by using a non-maximum suppression algorithm based on the plurality of attention maps comprises:

determining the salient region according to the reserved mapping region.

4. The method for training an image recognition model according to claim 1, wherein the obtaining the class recognition result of the salient region by using the classification network of the image recognition model comprises:

5. The method for training an image recognition model according to claim 1, wherein the method further comprises:

6. The method for training an image recognition model according to claim 5, wherein the acquiring the non-significant image comprises:

7. The method for training an image recognition model according to claim 5, wherein the method further comprises:

8. An image recognition method, comprising:

acquiring an image to be identified;

classifying the salient region by using a classification network of the image recognition model to obtain a class recognition result of the image to be recognized, wherein the image recognition model is obtained by training based on the training method of the image recognition model according to any one of claims 1 to 7.

9. An apparatus for training an image recognition model, comprising:

10. An image recognition apparatus, comprising:

the second acquisition unit is used for acquiring an image to be identified;

a classification unit, configured to classify the salient region by using a classification network of the image recognition model to obtain a class recognition result of the image to be recognized, where the image recognition model is obtained by training based on the training apparatus of the image recognition model according to claim 9.

11. An electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method of training an image recognition model as claimed in any one of claims 1 to 7, or to perform an image recognition method as claimed in claim 8.

12. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement a training method of an image recognition model as claimed in any one of claims 1 to 7, or perform an image recognition method as claimed in claim 8.