CN111553372A - Training image recognition network, image recognition searching method and related device - Google Patents

Training image recognition network, image recognition searching method and related device

Info

Publication number
CN111553372A
Authority
CN
China
Prior art keywords
image
training
training image
original
rearranged
Prior art date
Legal status
Granted
Application number
CN202010332194.0A
Other languages
Chinese (zh)
Other versions
CN111553372B (en)
Inventor
章书豪
夏雄尉
谢泽华
周泽南
苏雪峰
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN202010332194.0A
Publication of CN111553372A
Application granted
Publication of CN111553372B
Active legal status
Anticipated expiration legal status


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/10: Segmentation; Edge detection
    • G06T7/11: Region-based segmentation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method for training an image recognition network, an image recognition search method, and a related device. The method for training the image recognition network includes: segmenting an original training image into a plurality of training image blocks and labeling them; shuffling and rearranging the plurality of training image blocks according to an image salient region detection result of the original training image to obtain a rearranged training image of the original training image; and training an image recognition network with the original training image, the rearranged training image, and corresponding annotation data (comprising a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence) as training data to obtain an image recognition model. The image recognition search method includes: acquiring an image to be recognized; inputting the image to be recognized into the image recognition model and outputting target features and a target category of the image to be recognized; and searching an image database for similar images using the target features and the target category.

Description

Training image recognition network, image recognition searching method and related device
Technical Field
The present application relates to the field of image processing technologies, and in particular to methods and related apparatuses for training an image recognition network and for image recognition search.
Background
With the rapid development of science and technology, a user can casually photograph an item of interest in daily life and quickly obtain links to identical or similar items by searching with the item image, thereby meeting the user's need to search for items of interest. Such item image search is, in essence, image recognition search performed on the item image.
At present, image recognition search methods generally use a deep learning model to extract global features of the item image for recognition and search. However, for an item image with a complex scene, for example one in which the item occupies a relatively small area, the deep learning model extracts only the global features of the image, and the subsequent image recognition search focuses only on those global features. Important features of the item image are therefore easily missed, which greatly reduces the accuracy of image recognition search and degrades the user experience.
Disclosure of Invention
The technical problem addressed by the present application is to provide methods and related apparatuses for training an image recognition network and for image recognition search, so that the image recognition network attends to local features of an image and an image recognition model with enhanced perception of local image features is obtained. Even for images to be recognized with complex scenes, the accuracy of image recognition search can thereby be effectively improved, improving the user experience of image recognition search.
In a first aspect, an embodiment of the present application provides a method for training an image recognition network, where the method includes:
segmenting an original training image to obtain a plurality of training image blocks and labeling the training image blocks;
shuffling and rearranging the plurality of training image blocks based on an image salient region detection result of the original training image to obtain a rearranged training image of the original training image;
training an image recognition network based on the original training image, the rearranged training image, and corresponding annotation data to obtain an image recognition model, where the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is either an original label or a rearranged label.
Optionally, the shuffling and rearranging of the plurality of training image blocks based on the image salient region detection result of the original training image to obtain the rearranged training image includes:
detecting an image salient region of the original training image using an attention heat map model to obtain an attention heat map of the original training image;
and shuffling and rearranging the plurality of training image blocks based on the heat values of the attention heat map to obtain the rearranged training image of the original training image.
Optionally, the shuffling and rearranging of the plurality of training image blocks based on the image salient region detection result of the original training image includes:
shuffling, to a lesser degree, the training image blocks corresponding to positions of higher saliency in the detection result, and shuffling, to a greater degree, the training image blocks corresponding to positions of lower saliency.
Optionally, the training of the image recognition network based on the original training image, the rearranged training image, and the corresponding annotation data to obtain the image recognition model includes:
obtaining training features from the original training image and the rearranged training image using a feature extraction network in the image recognition network;
obtaining prediction data from the training features using a recognition network in the image recognition network, where the prediction data comprises a predicted coarse-grained image category, a predicted fine-grained image category, and a predicted image preprocessing category;
and training network parameters of the image recognition network with a network loss function, based on the prediction data and the annotation data, to obtain the image recognition model.
Optionally, the network loss function includes a coarse-grained image class classification loss function, a fine-grained image class classification loss function, an image preprocessing class classification loss function, and a loss function for restoring a rearranged training image to an original training image.
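The patent names these four loss terms but does not give them a closed form. Below is a minimal numpy sketch of one plausible combination; the cross-entropy formulation, the 0/1 block-order mismatch term standing in for the restoration loss, and the unit weights are all illustrative assumptions, not the patented formulation.

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted probabilities."""
    return -np.log(probs[label] + 1e-12)

def total_loss(coarse_probs, coarse_label,
               fine_probs, fine_label,
               prep_probs, prep_label,
               predicted_order, true_order,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms: coarse-grained class loss,
    fine-grained class loss, preprocessing class (original vs. rearranged)
    loss, and a block-order restoration term (here, the fraction of
    mispredicted block positions)."""
    restore = float(np.mean(np.asarray(predicted_order) != np.asarray(true_order)))
    terms = (cross_entropy(coarse_probs, coarse_label),
             cross_entropy(fine_probs, fine_label),
             cross_entropy(prep_probs, prep_label),
             restore)
    return float(sum(w * t for w, t in zip(weights, terms)))
```

With perfect predictions every term vanishes, so the total loss is numerically zero; any misclassification or block-order error adds a positive contribution.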
In a second aspect, an embodiment of the present application provides an image recognition search method using the image recognition model trained according to any one of the implementations of the first aspect, the method including:
acquiring an image to be recognized;
obtaining target features and a target category of the image to be recognized using the image recognition model;
and searching an image database for similar images of the image to be recognized based on the target features and the target category.
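As an illustration of the search step of the second aspect, the sketch below first filters a feature database by the predicted target category and then ranks the remaining entries by feature similarity. The function name and the choice of cosine similarity are assumptions for illustration; the claim does not fix a particular similarity measure.

```python
import numpy as np

def search_similar(query_feat, query_cat, db_feats, db_cats, top_k=3):
    """Return indices of the top_k database images most similar to the query,
    restricted to images that share the query's target category."""
    query = np.asarray(query_feat, dtype=float)
    db_feats = np.asarray(db_feats, dtype=float)
    candidates = np.where(np.asarray(db_cats) == query_cat)[0]  # category filter
    if candidates.size == 0:
        return []
    feats = db_feats[candidates]
    # cosine similarity between the query feature and each candidate feature
    sims = feats @ query / (np.linalg.norm(feats, axis=1) * np.linalg.norm(query) + 1e-12)
    return candidates[np.argsort(-sims)[:top_k]].tolist()
```

Filtering by category first shrinks the candidate set before the similarity ranking, which mirrors the claim's use of the target category together with the target features.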
In a third aspect, an embodiment of the present application provides an apparatus for training an image recognition network, where the apparatus includes:
a segmentation obtaining unit, configured to segment an original training image to obtain a plurality of training image blocks and label the training image blocks;
a rearrangement obtaining unit, configured to shuffle and rearrange the plurality of training image blocks based on an image salient region detection result of the original training image to obtain a rearranged training image of the original training image;
a training obtaining unit, configured to train an image recognition network based on the original training image, the rearranged training image, and corresponding annotation data to obtain an image recognition model, where the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is either an original label or a rearranged label.
Optionally, the rearrangement obtaining unit includes:
the detection obtaining subunit is configured to perform image salient region detection on the original training image by using an attention thermograph model, and obtain an attention thermograph of the original training image;
and the rearrangement obtaining subunit is configured to perform a disordering rearrangement on the plurality of training image blocks based on the heat of the attention heat map to obtain a rearranged training image of the original training image.
Optionally, the rearrangement obtaining unit is specifically configured to:
shuffle, to a lesser degree, the training image blocks corresponding to positions of higher saliency in the image salient region detection result of the original training image, and shuffle, to a greater degree, the training image blocks corresponding to positions of lower saliency.
Optionally, the training obtaining unit includes:
a first obtaining subunit, configured to obtain, based on the original training image and the rearranged training image, a training feature by using a feature extraction network in the image recognition network;
a second obtaining subunit, configured to obtain, based on the training features, prediction data using an identification network in the image identification network, where the prediction data includes a prediction coarse-grained image category, a prediction fine-grained image category, and a prediction image preprocessing category;
and a training obtaining subunit, configured to train network parameters of the image recognition network with a network loss function, based on the prediction data and the annotation data, to obtain the image recognition model.
Optionally, the network loss function includes a coarse-grained image class classification loss function, a fine-grained image class classification loss function, an image preprocessing class classification loss function, and a loss function for restoring a rearranged training image to an original training image.
In a fourth aspect, an embodiment of the present application provides an apparatus for image recognition search using the image recognition model according to any one of the implementations of the first aspect, the apparatus including:
an acquisition unit, configured to acquire an image to be recognized;
an obtaining unit, configured to obtain target features and a target category of the image to be recognized using the image recognition model;
and a searching unit, configured to search an image database for similar images of the image to be recognized based on the target features and the target category.
In a fifth aspect, embodiments of the present application provide an apparatus for training an image recognition network, the apparatus comprising a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
segmenting an original training image to obtain a plurality of training image blocks and labeling the training image blocks;
shuffling and rearranging the plurality of training image blocks based on an image salient region detection result of the original training image to obtain a rearranged training image of the original training image;
training an image recognition network based on the original training image, the rearranged training image, and corresponding annotation data to obtain an image recognition model, where the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is either an original label or a rearranged label.
In a sixth aspect, embodiments of the present application provide an apparatus for image recognition search, the apparatus comprising a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring an image to be recognized;
obtaining target features and a target category of the image to be recognized using the image recognition model;
and searching an image database for similar images of the image to be recognized based on the target features and the target category.
In a seventh aspect, an embodiment of the present application provides a machine-readable medium, on which instructions are stored, which when executed by one or more processors, cause an apparatus to perform the method for training an image recognition network according to any one of the first aspect; alternatively, the apparatus is caused to perform the method of image recognition search described in the second aspect above.
Compared with the prior art, the embodiments of the present application have the following advantages:
With the technical solution of the embodiments of the present application, an original training image is first segmented into a plurality of training image blocks, which are labeled; the plurality of training image blocks are then shuffled and rearranged according to an image salient region detection result of the original training image to obtain a rearranged training image of the original training image; finally, an image recognition network is trained with the original training image, the rearranged training image, and corresponding annotation data as training data to obtain an image recognition model, where the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is either an original label or a rearranged label. In this way, the training image blocks segmented from the original training image are shuffled and rearranged in a targeted manner according to the image salient region detection result to obtain the rearranged training image, and the original training image combined with the rearranged training image serves as the input of the image recognition network, so that the network attends to local image features and training yields an image recognition model with enhanced perception of local image features.
In addition, with the technical solution of the embodiments of the present application, an image to be recognized is first acquired; the image to be recognized is then input into the image recognition model, which outputs target features and a target category of the image to be recognized; finally, similar images are searched for in an image database using the target features and the target category. The target features obtained by the image recognition model thus reflect not only the global features but also the local features of the image, so that important features of the image to be recognized are not missed. Searching for similar images with the target features combined with the target category effectively improves the accuracy of image recognition search, even for images to be recognized with complex scenes, thereby improving the user experience of image recognition search.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic diagram of a system framework related to an application scenario in an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for training an image recognition network according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an original training image and the attention heat map of the original training image provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an original training image and a rearranged training image of the original training image according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a method for image recognition search according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an apparatus for training an image recognition network according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an apparatus for image recognition search according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an apparatus for training an image recognition network or an image recognition search according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Searching with an item image is, in essence, performing image recognition search on the item image. In the prior art, image recognition search methods generally use a deep learning model to extract global features of the item image for recognition and search. However, the inventors found through research that, for an item image with a relatively complex scene, the deep learning model extracts only the global features of the image, and the subsequent image recognition search focuses only on those global features, so that important features of the item image are easily missed, the accuracy of image recognition search is low, and the user experience of image recognition search is affected.
To solve this problem, in the embodiments of the present application, an original training image is segmented into a plurality of training image blocks, which are labeled; the plurality of training image blocks are shuffled and rearranged according to an image salient region detection result of the original training image to obtain a rearranged training image of the original training image; and an image recognition network is trained with the original training image, the rearranged training image, and the corresponding annotation data as training data to obtain an image recognition model, where the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is either an original label or a rearranged label. In this way, the training image blocks segmented from the original training image are shuffled and rearranged in a targeted manner using the image salient region detection result to obtain the rearranged training image, the original training image combined with the rearranged training image serves as the input of the image recognition network, the network attends to local image features, and training yields an image recognition model with enhanced perception of local image features.
In addition, in the embodiments of the present application, an image to be recognized is acquired; the image to be recognized is input into the image recognition model, which outputs target features and a target category of the image to be recognized; and similar images are searched for in an image database using the target features and the target category. The target features obtained by the image recognition model thus reflect not only the global features but also the local features of the image, so that important features of the image to be recognized are not missed. Searching for similar images with the target features combined with the target category effectively improves the accuracy of image recognition search, even for images to be recognized with complex scenes, thereby improving the user experience of image recognition search.
For example, one application scenario of the embodiments of the present application is shown in FIG. 1, which includes a terminal device 101, a processor 102, and an image database 103; the terminal device 101 may be a personal computer or another mobile terminal, such as a mobile phone or a tablet computer. The terminal device 101 collects a large number of original training images to form a training set; the processor 102 obtains the original training images from the terminal device 101 and obtains an image recognition model using the method for training the image recognition network of the embodiments of the present application. After the terminal device 101 sends an image to be recognized to the processor 102, the processor 102 searches the image database 103 for similar images of the image to be recognized using the image recognition search method of the embodiments of the present application.
It is to be understood that, in the above application scenarios, although the actions of the embodiments of the present application are described as being performed by the processor 102, the present application is not limited in terms of the subject of execution as long as the actions disclosed in the embodiments of the present application are performed.
It is to be understood that the above scenario is only one example of a scenario provided in the embodiment of the present application, and the embodiment of the present application is not limited to this scenario.
The following describes in detail specific implementations of a training image recognition network, an image recognition search method, and a related apparatus in the embodiments of the present application with reference to the accompanying drawings.
Exemplary method
Referring to fig. 2, a flowchart of a method for training an image recognition network in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the steps of:
step 201: and (4) segmenting the original training image to obtain a plurality of training image blocks and marking labels.
It should be noted that, in the prior art, the deep learning model is obtained only by learning the original training image, and the main focus is on the global features of the image; aiming at images with complex scenes, the depth learning model can extract the global features of the images, only the global features of the images are concerned in the subsequent image identification and search process, and the important features of the images are easy to miss. In the embodiment of the present application, an original training image is divided into a plurality of training image blocks and recombined to obtain a new training image, and on the basis of learning the global features of an attention image of the original training image, the new training image needs to be learned to pay attention to the local features of the attention image. Therefore, the original training image needs to be divided to obtain a plurality of training image blocks, and each training image block needs to be labeled, so as to make it clear that the label sequence of the training image block corresponding to the new training image is obtained by recombining the plurality of training image blocks. The number of the plurality of training image blocks may be preset based on the segmentation requirement in a specific scenario, for example, the number of the plurality of training image blocks may be 9, 16, 25, or 36, and so on.
As an example, in the embodiment of the present application, the number of training image blocks preset based on the segmentation requirement in a specific scene is 9, an original training image is uniformly segmented to obtain a total of 9 training image blocks, and the 9 training image blocks are labeled with labels 1, 2, 3, 4, 5, 6, 7, 8, and 9 in sequence.
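The uniform 3x3 segmentation and sequential labeling described above can be sketched as follows; the function name and the row-major labeling order are illustrative assumptions, since the patent does not fix a particular block ordering.

```python
import numpy as np

def split_into_blocks(image: np.ndarray, grid: int = 3):
    """Split an H x W (x C) image into grid*grid equally sized blocks.

    Returns the blocks in row-major order together with their labels
    1..grid*grid, mirroring the 9-block labeling in the example above.
    """
    h, w = image.shape[0], image.shape[1]
    bh, bw = h // grid, w // grid
    blocks, labels = [], []
    for row in range(grid):
        for col in range(grid):
            block = image[row * bh:(row + 1) * bh, col * bw:(col + 1) * bw]
            blocks.append(block)
            labels.append(row * grid + col + 1)  # labels 1..9 for a 3x3 grid
    return blocks, labels

# Example: a 90x90 single-channel image split into 9 blocks of 30x30
img = np.arange(90 * 90).reshape(90, 90)
blocks, labels = split_into_blocks(img, grid=3)
```

The label of each block records its original position, which is what makes the block label sequence of a rearranged image meaningful.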
Step 202: shuffle and rearrange the plurality of training image blocks based on the image salient region detection result of the original training image to obtain a rearranged training image of the original training image.
It should be noted that, after the plurality of training image blocks are obtained in step 201, they are shuffled and recombined into a new training image, the rearranged training image of the original training image. Relative to the original training image, the rearranged training image represents the image salient region more prominently and delimits it more clearly, so that the features of the salient region are attended to when the new training image is learned. In the embodiments of the present application, the recombination may be performed by shuffling and rearranging the plurality of training image blocks using the image salient region detection result of the original training image, and the resulting new training image is recorded as the rearranged training image of the original training image.
As an example, the original training image in the above example is segmented into 9 training image blocks labeled 1 through 9. After the blocks are shuffled and rearranged using the image salient region detection result of the original training image, the label sequence of the training image blocks in the rearranged training image is, for instance, 1, 3, 5, 7, 2, 4, 6, 8, 9.
In a specific implementation of step 202, an image salient region detection result of the original training image is first obtained, usually by performing image salient region detection on the original training image; the plurality of training image blocks are then shuffled and rearranged according to the detection result to obtain the rearranged training image of the original training image. The principle of the shuffle may be as follows: in the image salient region detection result of the original training image, the higher the saliency of a position, the lower the degree to which the corresponding training image block is shuffled; the lower the saliency, the higher the degree to which the corresponding training image block is shuffled.
Therefore, in an optional implementation manner of the embodiment of the present application, step 202, in which the plurality of training image blocks are shuffled and rearranged based on the image salient region detection result of the original training image to obtain the rearranged training image of the original training image, may include, but is not limited to, the following steps:
step A: and carrying out image salient region detection on the original training image to obtain an image salient region detection result.
Step B: shuffling and rearranging the plurality of training image blocks based on the image salient region detection result to obtain the rearranged training image of the original training image.
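Steps A and B can be sketched together as a saliency-guided shuffle: the most salient blocks stay in place and the less salient blocks are shuffled among themselves. The keep-the-top-half policy, the `keep_ratio` parameter, and all function names are illustrative assumptions; the patent only states that more salient blocks are disturbed less:

```python
import numpy as np

def saliency_guided_shuffle(num_blocks, block_saliency, keep_ratio=0.5, rng=None):
    """Return a permutation of block indices 0..num_blocks-1 in which the most
    salient blocks keep their positions and the remaining (less salient)
    blocks are shuffled among their own positions.  The all-or-nothing
    keep/shuffle split is one simple way to realise 'higher significance,
    lower disturbance'; it is not specified by this document."""
    rng = np.random.default_rng(rng)
    block_saliency = np.asarray(block_saliency, dtype=float)
    num_keep = int(round(keep_ratio * num_blocks))
    keep = set(np.argsort(block_saliency)[::-1][:num_keep].tolist())  # most salient
    movable = [i for i in range(num_blocks) if i not in keep]
    shuffled = movable[:]
    rng.shuffle(shuffled)                      # disturb only the low-saliency blocks
    perm = list(range(num_blocks))
    for pos, blk in zip(movable, shuffled):
        perm[pos] = blk
    return perm

# per-block saliency for a 3x3 grid (row-major), e.g. pooled from a heat map
saliency = [0.9, 0.1, 0.8, 0.2, 0.95, 0.05, 0.7, 0.3, 0.6]
perm = saliency_guided_shuffle(9, saliency, keep_ratio=0.5, rng=0)
```

A graded scheme (disturbance probability proportional to 1 minus saliency) would equally satisfy the stated principle; the threshold version is simply the shortest to show.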
It should be noted that the attention heat map model is a tool for visualizing a convolutional neural network: inputting an image into the attention heat map model outputs an attention heat map that obviously and clearly represents the image salient region, and observing the attention heat map makes the key region in the image clear. Therefore, for step A in the embodiment of the present application, the original training image may be input into the attention heat map model to output the attention heat map of the original training image. That is, in an optional implementation manner of the embodiment of the present application, step A, in which image salient region detection is performed on the original training image to obtain the image salient region detection result, may specifically be: performing image salient region detection on the original training image by using an attention heat map model to obtain the attention heat map of the original training image. Of course, in the embodiment of the present application, the image salient region detection may also adopt other detection manners besides the attention heat map model, and correspondingly, the obtained image salient region detection result may also be a detection result other than an attention heat map.
As an example, an original training image and its attention heat map are illustrated in FIG. 3, where the left image is the original training image and the right image is the attention heat map of the left image. The right image is obtained by inputting the left image into the attention heat map model; it obviously and clearly represents the image salient region in the left image, and observing it makes the key region in the left image clear.
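One hedged way to turn a heat map like the right image of FIG. 3 into per-block saliency scores for the 3x3 block grid is to average-pool the heat map over the grid. The pooling choice, function name, and toy heat map are assumptions made for illustration:

```python
import numpy as np

def block_heat(heat_map, grid=3):
    """Average-pool a 2-D attention heat map into a grid x grid score per
    training image block; higher scores mark blocks inside the salient region."""
    h, w = heat_map.shape
    bh, bw = h // grid, w // grid
    return np.array([[heat_map[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].mean()
                      for c in range(grid)] for r in range(grid)])

# toy heat map whose centre is "hot", as a heat map of a centred object would be
heat = np.zeros((6, 6))
heat[2:4, 2:4] = 1.0
scores = block_heat(heat, grid=3)
```

The resulting per-block scores are exactly the kind of input the disturbance rule of step B (higher heat, lower disturbance) operates on.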
Correspondingly, when the image salient region detection result is specifically the attention heat map, generally, the higher the heat at a position, the lower the degree of disturbance of the training image block corresponding to that position, and the lower the heat, the higher the degree of disturbance of the corresponding training image block; the plurality of training image blocks shuffled according to the heat of the attention heat map are recombined to obtain the rearranged training image. Therefore, in an optional implementation manner of the embodiment of the present application, step B, in which the plurality of training image blocks are shuffled and rearranged based on the image salient region detection result to obtain the rearranged training image of the original training image, may specifically be: shuffling and rearranging the plurality of training image blocks based on the heat of the attention heat map to obtain the rearranged training image of the original training image.
As an example, on the basis of FIG. 3, a schematic diagram of an original training image and its rearranged training image is shown in FIG. 4, where the left image is the original training image and the right image is the rearranged training image of the left image. The right image is obtained by dividing the left image into a plurality of training image blocks and then shuffling and rearranging them according to the right image in FIG. 3.
Step 203: training an image recognition network to obtain an image recognition model based on the original training image, the rearranged training image and the corresponding labeled data; the labeling data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label and a training image block label sequence, wherein the image preprocessing category label comprises an original label or a rearranged label.
It should be noted that, after the rearranged training image of the original training image is obtained through steps 201 to 202, not only the original training image but also the rearranged training image is used as input of the image recognition network, so that the two jointly train the image recognition network. In this way, on the basis of learning the original training image and focusing on its global features, the image recognition network also learns the rearranged training image and focuses on its local features, which enhances the perception capability of the obtained image recognition model for local image features. For either an original training image or a rearranged training image, the corresponding annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label and a training image block label sequence. The coarse-grained image category label is obtained by performing coarse-grained image category classification on the image, and the fine-grained image category label is obtained by performing fine-grained image category classification on the image; that is, the image category granularity represented by the fine-grained image category label is smaller and finer than that represented by the coarse-grained image category label. The image preprocessing category label is either an original label or a rearranged label.
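The annotation data described above might be represented as records like the following. The field names, category names, and dictionary layout are hypothetical; only the four kinds of labels come from this document:

```python
# Hypothetical annotation records for one original / rearranged training image pair.
original_annotation = {
    "coarse_label": "bird",                      # coarse-grained image category
    "fine_label": "Laysan albatross",            # finer category within "bird"
    "preprocess_label": "original",              # original vs. rearranged
    "block_label_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9],   # identity order
}
rearranged_annotation = {
    "coarse_label": "bird",                      # category labels are shared
    "fine_label": "Laysan albatross",
    "preprocess_label": "rearranged",
    "block_label_sequence": [1, 3, 5, 7, 2, 4, 6, 8, 9],   # order from the example above
}
```

The shared category labels and the differing preprocessing label plus block sequence are what let one image pair supervise all four tasks of step 203.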
In an embodiment of the application, the image recognition network comprises a feature extraction network and a recognition network. When step 203 is implemented, firstly, the original training image and the rearranged training image are input into the feature extraction network to output training features; then, the training features are input into the recognition network to output a predicted coarse-grained image category, a predicted fine-grained image category and a predicted image preprocessing category as prediction data; finally, back-propagation gradient training is performed on the network parameters of the image recognition network by using a network loss function over the prediction data and the annotation data, and when training is finished, the trained image recognition network is taken as the image recognition model. That is, in an optional implementation manner of the embodiment of the present application, step 203, in which the image recognition network is trained to obtain the image recognition model based on the original training image, the rearranged training image and the corresponding annotation data, may include the following steps C to E:
and C: and based on the original training image and the rearranged training image, utilizing a feature extraction network in the image recognition network to obtain training features.
Step D: obtaining prediction data by using a recognition network in the image recognition network based on the training features, wherein the prediction data comprises a predicted coarse-grained image category, a predicted fine-grained image category and a predicted image preprocessing category.
Step E: training network parameters of the image recognition network by using a network loss function based on the prediction data and the annotation data to obtain the image recognition model.
It should be further noted that, in the embodiment of the present application, coarse-grained image category classification and fine-grained image category classification need to be performed on the original training image and the rearranged training image, it needs to be determined whether an input image belongs to the original category or the rearranged category, and the rearranged training image needs to be reordered to restore the original training image. Therefore, four loss functions, namely a coarse-grained image category classification loss function, a fine-grained image category classification loss function, an image preprocessing category classification loss function, and a loss function for restoring the rearranged training image to the original training image, are combined to form the network loss function of the image recognition network. That is, in an optional implementation manner of the embodiment of the present application, the network loss function includes a coarse-grained image category classification loss function, a fine-grained image category classification loss function, an image preprocessing category classification loss function, and a loss function for restoring the rearranged training image to the original training image.
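The four-part network loss function can be sketched as a plain sum of the four terms. Equal weighting, the softmax cross-entropy form of the three classification losses, and modelling the restoration loss as a squared error over predicted block positions are all assumptions made for this sketch; this document does not fix those choices:

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stabilised)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def network_loss(coarse_logits, coarse_y,
                 fine_logits, fine_y,
                 pre_logits, pre_y,
                 pred_block_positions, true_block_positions):
    """Sum of the four losses named above: coarse-grained classification,
    fine-grained classification, preprocessing-category classification, and
    restoration of the rearranged image to the original block order."""
    restore_loss = float(np.mean((np.asarray(pred_block_positions, dtype=float)
                                  - np.asarray(true_block_positions, dtype=float)) ** 2))
    return (cross_entropy(coarse_logits, coarse_y)
            + cross_entropy(fine_logits, fine_y)
            + cross_entropy(pre_logits, pre_y)
            + restore_loss)

loss = network_loss(np.array([2.0, 0.1]), 0,        # coarse head: 2 classes
                    np.array([0.2, 1.5, 0.1]), 1,   # fine head: 3 classes
                    np.array([0.0, 3.0]), 1,        # original vs. rearranged
                    [1, 3, 2], [1, 2, 3])           # predicted vs. true block order
```

In a real implementation each term would typically carry a tunable weight, and the restoration head is often itself a per-block classification; the sum structure is the only part this document commits to.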
Through the various implementation manners provided by this embodiment, firstly, an original training image is divided into a plurality of training image blocks and the blocks are labeled; then, the plurality of training image blocks are shuffled and rearranged according to the image salient region detection result of the original training image to obtain a rearranged training image of the original training image; finally, an image recognition network is trained to obtain an image recognition model, with the original training image, the rearranged training image and the corresponding annotation data as training data, where the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label and a training image block label sequence, and the image preprocessing category label is an original label or a rearranged label. In this way, the plurality of training image blocks segmented from the original training image are shuffled and rearranged according to the image salient region detection result to obtain the rearranged training image; the original training image combined with the rearranged training image serves as input of the image recognition network, so that the image recognition network focuses on local features of the image, and an image recognition model with enhanced perception capability for local image features is obtained through training.
It should be noted that, on the basis of the above embodiment, for an image to be recognized with a complex scene, in order to avoid omitting important features of the image to be recognized, the image to be recognized may be input into the image recognition model after it is acquired. Even if the scene of the image to be recognized is complex, the image recognition model can focus on both the global features and the local features of the image to be recognized, so as to obtain the target features and the target category of the image to be recognized. In order to effectively improve the accuracy of image recognition search, similar images of the image to be recognized may then be searched in an image database through the target features and the target category.
Referring to fig. 5, a flowchart illustrating a method for image recognition search in an embodiment of the present application is shown. In an embodiment of the present application, with the image recognition model described in the above embodiment, the method may include the following steps:
step 501: and acquiring an image to be identified.
Step 502: obtaining the target features and the target category of the image to be recognized by using the image recognition model.
In the embodiment of the application, firstly, an image to be recognized is input into a feature extraction network in an image recognition model to obtain target features of the image to be recognized; then, the target feature is input into a recognition network in the image recognition model to obtain the target category of the image to be recognized.
Step 503: searching an image database for a similar image of the image to be recognized based on the target features and the target category.
In the embodiment of the application, for example, an image set corresponding to the target category may be determined in the image database, the similarity between the target features and the features of each image in the image set may be calculated, and similar images of the image to be recognized may be determined based on the similarity.
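The category-filtered search just described can be sketched as follows. The database record layout, cosine similarity as the similarity measure, and the top-k cutoff are illustrative assumptions, not requirements of this document:

```python
import numpy as np

def search_similar(query_feature, query_category, database, top_k=3):
    """Filter `database` (a list of dicts with 'feature' and 'category' keys)
    to the query's target category, rank the candidates by cosine similarity
    to the query feature, and return the top-k entries."""
    q = np.asarray(query_feature, dtype=float)
    q = q / (np.linalg.norm(q) + 1e-12)

    candidates = [rec for rec in database if rec["category"] == query_category]

    def cosine(rec):
        v = np.asarray(rec["feature"], dtype=float)
        return float(q @ (v / (np.linalg.norm(v) + 1e-12)))

    return sorted(candidates, key=cosine, reverse=True)[:top_k]

db = [
    {"id": "a", "category": "bird", "feature": [1.0, 0.0]},
    {"id": "b", "category": "bird", "feature": [0.6, 0.8]},
    {"id": "c", "category": "car",  "feature": [1.0, 0.0]},
]
results = search_similar([1.0, 0.1], "bird", db, top_k=2)
```

Filtering by target category first, as step 503 describes, shrinks the similarity computation to one category's image set instead of the whole database.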
Through the various implementation manners provided by this embodiment, firstly, an image to be recognized is acquired; then, the image to be recognized is input into the image recognition model, which outputs the target features and the target category of the image to be recognized; finally, similar images are searched in the image database by using the target features and the target category of the image to be recognized. The target features obtained by the image recognition model attend not only to the global features of the image but also to its local features, so that important features of the image to be recognized are not omitted. Searching for similar images by combining the target features with the target category can effectively improve the accuracy of image recognition search even for an image to be recognized with a complex scene, thereby improving the user experience of image recognition search.
Exemplary devices
Referring to fig. 6, a schematic structural diagram of an apparatus for training an image recognition network in an embodiment of the present application is shown. In the embodiment of the present application, the apparatus may specifically include:
a segmentation obtaining unit 601, configured to segment an original training image to obtain a plurality of training image blocks and mark labels;
a rearrangement obtaining unit 602, configured to shuffle and rearrange the plurality of training image blocks based on the image salient region detection result of the original training image to obtain a rearranged training image of the original training image;
a training obtaining unit 603, configured to train an image recognition network based on the original training image, the rearranged training image and the corresponding annotation data to obtain an image recognition model; the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label and a training image block label sequence, wherein the image preprocessing category label is an original label or a rearranged label.
In an optional implementation manner of this embodiment of this application, the rearrangement obtaining unit 602 includes:
a detection obtaining subunit, configured to perform image salient region detection on the original training image by using an attention heat map model to obtain the attention heat map of the original training image;
and a rearrangement obtaining subunit, configured to shuffle and rearrange the plurality of training image blocks based on the heat of the attention heat map to obtain the rearranged training image of the original training image.
In an optional implementation manner of the embodiment of the present application, the rearrangement obtaining unit 602 is specifically configured to:
shuffle and rearrange the plurality of training image blocks based on the image salient region detection result of the original training image, wherein training image blocks corresponding to positions with a higher degree of significance in the image salient region detection result are disturbed to a lower degree, and training image blocks corresponding to positions with a lower degree of significance are disturbed to a higher degree.
In an optional implementation manner of this embodiment of the present application, the training obtaining unit 603 includes:
a first obtaining subunit, configured to obtain, based on the original training image and the rearranged training image, a training feature by using a feature extraction network in the image recognition network;
a second obtaining subunit, configured to obtain, based on the training features, prediction data using an identification network in the image identification network, where the prediction data includes a prediction coarse-grained image category, a prediction fine-grained image category, and a prediction image preprocessing category;
and the training obtaining subunit is used for training the network parameters of the image recognition network by using a network loss function to obtain the image recognition model based on the prediction data and the annotation data.
In an optional implementation manner of the embodiment of the present application, the network loss function includes a coarse-grained image class classification loss function, a fine-grained image class classification loss function, an image preprocessing class classification loss function, and a loss function for restoring a rearranged training image to an original training image.
Through the various implementation manners provided by this embodiment, firstly, an original training image is divided into a plurality of training image blocks and the blocks are labeled; then, the plurality of training image blocks are shuffled and rearranged according to the image salient region detection result of the original training image to obtain a rearranged training image of the original training image; finally, an image recognition network is trained to obtain an image recognition model, with the original training image, the rearranged training image and the corresponding annotation data as training data, where the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label and a training image block label sequence, and the image preprocessing category label is an original label or a rearranged label. In this way, the plurality of training image blocks segmented from the original training image are shuffled and rearranged according to the image salient region detection result to obtain the rearranged training image; the original training image combined with the rearranged training image serves as input of the image recognition network, so that the image recognition network focuses on local features of the image, and an image recognition model with enhanced perception capability for local image features is obtained through training.
Referring to fig. 7, a schematic structural diagram of an apparatus for image recognition search in an embodiment of the present application is shown. In this embodiment of the present application, with the image recognition model described in the foregoing embodiment, the apparatus may specifically include:
an acquiring unit 701 configured to acquire an image to be recognized;
an obtaining unit 702, configured to obtain a target feature and a target category of the image to be recognized by using the image recognition model;
a searching unit 703, configured to search, in an image database, for a similar image of the image to be recognized based on the target feature and the target category.
Through the various implementation manners provided by this embodiment, firstly, an image to be recognized is acquired; then, the image to be recognized is input into the image recognition model, which outputs the target features and the target category of the image to be recognized; finally, similar images are searched in the image database by using the target features and the target category of the image to be recognized. The target features obtained by the image recognition model attend not only to the global features of the image but also to its local features, so that important features of the image to be recognized are not omitted. Searching for similar images by combining the target features with the target category can effectively improve the accuracy of image recognition search even for an image to be recognized with a complex scene, thereby improving the user experience of image recognition search.
FIG. 8 is a block diagram illustrating an apparatus 800 for training an image recognition network or image recognition search, according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800, the relative positioning of the components, such as a display and keypad of the apparatus 800, the sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method of training an image recognition network, the method comprising:
segmenting an original training image to obtain a plurality of training image blocks and marking labels;
performing disordering and rearrangement on the plurality of training image blocks based on the image salient region detection result of the original training image to obtain a rearranged training image of the original training image;
training an image recognition network to obtain an image recognition model based on the original training image, the rearranged training image and the corresponding labeled data; the labeling data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label and a training image block label sequence, wherein the image preprocessing category label comprises an original label or a rearranged label;
alternatively, the instructions enable the mobile terminal to perform a method of image recognition search, the method comprising:
acquiring an image to be recognized;
obtaining the target features and the target category of the image to be recognized by using the image recognition model;
and searching an image database for a similar image of the image to be recognized based on the target features and the target category.
Fig. 9 is a schematic structural diagram of a server in the embodiment of the present application. The server 900 may vary widely in configuration or performance and may include one or more Central Processing Units (CPUs) 922 (e.g., one or more processors) and memory 932, one or more storage media 930 (e.g., one or more mass storage devices) storing applications 942 or data 944. Memory 932 and storage media 930 can be, among other things, transient storage or persistent storage. The program stored on the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, a central processor 922 may be provided in communication with the storage medium 930 to execute a series of instruction operations in the storage medium 930 on the server 900.
The server 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958, one or more keyboards 956, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application in any way. Although the present application has been disclosed above by way of preferred embodiments, they are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can, without departing from the scope of the technical solution of the present application, make many possible variations and modifications to the technical solution, or amend it into equivalent embodiments. Therefore, any simple amendment, equivalent change, or modification made to the above embodiments in accordance with the technical essence of the present application, without departing from the content of the technical solution, still falls within the protection scope of the technical solution of the present application.

Claims (10)

1. A method of training an image recognition network, comprising:
segmenting an original training image to obtain a plurality of training image blocks and annotation labels;
performing disordering and rearrangement on the plurality of training image blocks based on the image salient region detection result of the original training image to obtain a rearranged training image of the original training image;
training an image recognition network to obtain an image recognition model based on the original training image, the rearranged training image, and corresponding annotation data; wherein the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is an original label or a rearranged label.
2. The method according to claim 1, wherein the performing a shuffle and a rearrangement of the training image blocks based on the image salient region detection result of the original training image to obtain a rearranged training image of the original training image comprises:
detecting an image salient region of the original training image by using an attention heat map model to obtain an attention heat map of the original training image;
and disordering and rearranging the plurality of training image blocks based on the heat degree of the attention heat map to obtain a rearranged training image of the original training image.
3. The method according to claim 1 or 2, wherein the performing of the shuffle of the plurality of training image blocks based on the image salient region detection result of the original training image comprises:
based on the image salient region detection result of the original training image, shuffling to a lesser degree the training image blocks corresponding to positions of higher saliency in the detection result, and shuffling to a greater degree the training image blocks corresponding to positions of lower saliency.
4. The method of claim 1, wherein training an image recognition network to obtain an image recognition model based on the original training image, the rearranged training image, and corresponding annotation data comprises:
based on the original training image and the rearranged training image, utilizing a feature extraction network in the image recognition network to obtain training features;
based on the training features, obtaining prediction data by utilizing a recognition network in the image recognition network, wherein the prediction data comprises a prediction coarse-grained image category, a prediction fine-grained image category and a prediction image preprocessing category;
and training network parameters of the image recognition network by using a network loss function to obtain the image recognition model based on the prediction data and the annotation data.
5. A method of image recognition search, using an image recognition model obtained by the method of any one of claims 1 to 4, comprising:
acquiring an image to be identified;
obtaining the target characteristics and the target category of the image to be recognized by using the image recognition model;
and searching a similar image of the image to be recognized in an image database based on the target feature and the target category.
6. An apparatus for training an image recognition network, comprising:
the segmentation obtaining unit is used for segmenting the original training image to obtain a plurality of training image blocks and annotation labels;
a rearrangement obtaining unit, configured to perform a disordering rearrangement on the plurality of training image blocks based on an image salient region detection result of the original training image, so as to obtain a rearranged training image of the original training image;
a training obtaining unit, configured to train an image recognition network to obtain an image recognition model based on the original training image, the rearranged training image, and corresponding annotation data; wherein the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is an original label or a rearranged label.
7. An apparatus for image recognition search, using an image recognition model obtained by the method of any one of claims 1 to 4, comprising:
the acquisition unit is used for acquiring an image to be recognized;
the obtaining unit is used for obtaining the target characteristics and the target categories of the image to be recognized by utilizing the image recognition model;
and the searching unit is used for searching the similar image of the image to be recognized in the image database based on the target feature and the target category.
8. An apparatus for training an image recognition network, comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs comprise instructions for:
segmenting an original training image to obtain a plurality of training image blocks and annotation labels;
performing disordering and rearrangement on the plurality of training image blocks based on the image salient region detection result of the original training image to obtain a rearranged training image of the original training image;
training an image recognition network to obtain an image recognition model based on the original training image, the rearranged training image, and corresponding annotation data; wherein the annotation data comprises a coarse-grained image category label, a fine-grained image category label, an image preprocessing category label, and a training image block label sequence, and the image preprocessing category label is an original label or a rearranged label.
9. An apparatus for image recognition search, comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs comprise instructions for:
acquiring an image to be identified;
obtaining the target characteristics and the target category of the image to be recognized by using the image recognition model;
and searching a similar image of the image to be recognized in an image database based on the target feature and the target category.
10. A machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of training an image recognition network of any of claims 1 to 4; or cause an apparatus to perform a method of image recognition searching as claimed in claim 5.
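Claims 1 to 3 describe segmenting the original image into blocks and shuffling them according to saliency: blocks at highly salient positions are disturbed less, blocks at weakly salient positions more. The claims do not fix an implementation; the Python sketch below illustrates one possible reading, in which blocks whose mean saliency falls below a quantile threshold are permuted among themselves while the remaining blocks keep their positions. The function name, grid size, and quantile rule are illustrative assumptions, not part of the claims.

```python
import numpy as np

def shuffle_by_saliency(image, saliency, grid=4, keep_quantile=0.5, seed=0):
    """Split `image` (H, W, C) into a grid x grid array of blocks, then permute
    only those blocks whose mean saliency falls below the `keep_quantile`
    quantile of all block scores; high-saliency blocks keep their positions.
    Returns the rearranged image and the block source order (one reading of
    the "training image block label sequence" in claim 1)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[0] // grid, image.shape[1] // grid
    blocks, scores, coords = [], [], []
    for i in range(grid):
        for j in range(grid):
            ys, xs = slice(i * h, (i + 1) * h), slice(j * w, (j + 1) * w)
            blocks.append(image[ys, xs].copy())
            scores.append(saliency[ys, xs].mean())
            coords.append((ys, xs))
    scores = np.asarray(scores)
    movable = np.where(scores < np.quantile(scores, keep_quantile))[0]
    order = np.arange(grid * grid)
    order[movable] = rng.permutation(movable)  # shuffle low-saliency blocks only
    out = image.copy()
    for dst, src in enumerate(order):
        ys, xs = coords[dst]
        out[ys, xs] = blocks[src]
    return out, order
```

Under this reading, the heat values of the attention heat map in claim 2 would simply replace the generic `saliency` array, and the degree of disturbance in claim 3 corresponds to how the quantile threshold partitions fixed from movable blocks.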
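Claim 4 trains the network against several prediction targets at once: a coarse-grained category, a fine-grained category, and the image preprocessing category (original vs. rearranged). A minimal NumPy sketch of such a multi-head loss follows, assuming linear heads over a shared feature vector and an unweighted sum of cross-entropies; the head names and the equal weighting are assumptions, since the patent does not fix the form of the network loss function.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # mean negative log-likelihood of the true class for each sample
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def multitask_loss(features, heads, labels):
    """Sum of cross-entropy losses over the prediction heads of claim 4
    (e.g. coarse-grained category, fine-grained category, preprocessing
    category). `heads` maps head name -> linear weight matrix; `labels`
    maps the same names -> integer class labels."""
    total = 0.0
    for name, weight in heads.items():
        logits = features @ weight  # shared features, per-head logits
        total += cross_entropy(softmax(logits), labels[name])
    return total
```

The shared `features` would come from the feature extraction network of claim 4; each head then produces one element of the prediction data, and the summed loss drives the network parameters.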
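Claim 5 searches for similar images using both the target category and the target feature of the query image. One natural reading, sketched below, first filters the database to entries sharing the query's category and then ranks the candidates by cosine similarity of their features; the function and variable names are illustrative, and the patent does not prescribe the similarity measure.

```python
import numpy as np

def search_similar(query_feat, query_cat, db_feats, db_cats, top_k=5):
    """Restrict candidates to database entries sharing the query's predicted
    category, then rank them by cosine similarity of features. Returns a list
    of (database index, similarity) pairs, best match first."""
    idx = np.where(db_cats == query_cat)[0]
    if idx.size == 0:
        return []  # no database entry in the query's category
    q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
    c = db_feats[idx] / (np.linalg.norm(db_feats[idx], axis=1, keepdims=True) + 1e-12)
    sims = c @ q
    order = np.argsort(-sims)[:top_k]
    return list(zip(idx[order].tolist(), sims[order].tolist()))
```

Filtering by category before ranking keeps the similarity computation restricted to a small candidate set, which is the practical benefit of predicting both a category and a feature for the image to be recognized.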
CN202010332194.0A 2020-04-24 2020-04-24 Training image recognition network, image recognition searching method and related device Active CN111553372B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010332194.0A CN111553372B (en) 2020-04-24 2020-04-24 Training image recognition network, image recognition searching method and related device

Publications (2)

Publication Number Publication Date
CN111553372A true CN111553372A (en) 2020-08-18
CN111553372B CN111553372B (en) 2023-08-08

Family

ID=72003970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010332194.0A Active CN111553372B (en) 2020-04-24 2020-04-24 Training image recognition network, image recognition searching method and related device

Country Status (1)

Country Link
CN (1) CN111553372B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567300A (en) * 2011-12-29 2012-07-11 方正国际软件有限公司 Picture document processing method and device
CN106445939A (en) * 2015-08-06 2017-02-22 阿里巴巴集团控股有限公司 Image retrieval, image information acquisition and image identification methods and apparatuses, and image identification system
US20170353300A1 (en) * 2016-06-01 2017-12-07 Fuji Electric Co., Ltd. Data processing apparatus, method for processing data, and medium
CN107515895A (en) * 2017-07-14 2017-12-26 中国科学院计算技术研究所 A kind of sensation target search method and system based on target detection
CN109871461A (en) * 2019-02-13 2019-06-11 华南理工大学 The large-scale image sub-block search method to be reordered based on depth Hash network and sub-block
CN110059769A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system rebuild are reset based on pixel for what streetscape understood
CN110263912A (en) * 2019-05-14 2019-09-20 杭州电子科技大学 A kind of image answering method based on multiple target association depth reasoning

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364918A (en) * 2020-11-10 2021-02-12 深圳力维智联技术有限公司 Abnormality recognition method, terminal, and computer-readable storage medium
CN112364918B (en) * 2020-11-10 2024-04-02 深圳力维智联技术有限公司 Abnormality recognition method, terminal, and computer-readable storage medium
CN112561893A (en) * 2020-12-22 2021-03-26 平安银行股份有限公司 Picture matching method and device, electronic equipment and storage medium
CN112633276A (en) * 2020-12-25 2021-04-09 北京百度网讯科技有限公司 Training method, recognition method, device, equipment and medium
CN113256621A (en) * 2021-06-25 2021-08-13 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113256621B (en) * 2021-06-25 2021-11-02 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113793323A (en) * 2021-09-16 2021-12-14 云从科技集团股份有限公司 Component detection method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN106557768B (en) Method and device for recognizing characters in picture
EP3179408B1 (en) Picture processing method and apparatus, computer program and recording medium
CN111553372A (en) Training image recognition network, image recognition searching method and related device
EP2998960B1 (en) Method and device for video browsing
CN107784279B (en) Target tracking method and device
CN109934275B (en) Image processing method and device, electronic equipment and storage medium
CN109951476B (en) Attack prediction method and device based on time sequence and storage medium
CN107463903B (en) Face key point positioning method and device
CN111242188B (en) Intrusion detection method, intrusion detection device and storage medium
CN106409317B (en) Method and device for extracting dream speech
US9799376B2 (en) Method and device for video browsing based on keyframe
CN109766473B (en) Information interaction method and device, electronic equipment and storage medium
CN110717399A (en) Face recognition method and electronic terminal equipment
CN110781842A (en) Image processing method and device, electronic equipment and storage medium
CN111614990A (en) Method and device for acquiring loading duration and electronic equipment
CN111797746B (en) Face recognition method, device and computer readable storage medium
CN110928425A (en) Information monitoring method and device
CN109740557B (en) Object detection method and device, electronic equipment and storage medium
CN110636377A (en) Video processing method, device, storage medium, terminal and server
CN113506325B (en) Image processing method and device, electronic equipment and storage medium
CN113506324B (en) Image processing method and device, electronic equipment and storage medium
CN113392898A (en) Training image classification model, image searching method and related device
CN111339964A (en) Image processing method and device, electronic equipment and storage medium
CN112784700A (en) Method, device and storage medium for displaying face image
CN113869336A (en) Image identification searching method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant