CN111274430A - Porcelain field image retrieval algorithm based on feature reconstruction supervision - Google Patents

Porcelain field image retrieval algorithm based on feature reconstruction supervision

Info

Publication number
CN111274430A
CN111274430A
Authority
CN
China
Prior art keywords
domain
search
searched
sample
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010059993.5A
Other languages
Chinese (zh)
Inventor
李可然
闫倩
郑洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Epailive Auction Beijing Co ltd
Original Assignee
Epailive Auction Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Epailive Auction Beijing Co ltd filed Critical Epailive Auction Beijing Co ltd
Priority to CN202010059993.5A priority Critical patent/CN111274430A/en
Publication of CN111274430A publication Critical patent/CN111274430A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 - Clustering; Classification
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/24 - Classification techniques
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a porcelain-field image retrieval algorithm based on feature reconstruction supervision. Addressing the characteristics of the application field and the defects of mainstream image retrieval methods, a deep-learning-based porcelain retrieval method with feature reconstruction supervision is proposed. The search domain represents a user's search sample and the searched domain represents the data set being searched; the method consists mainly of two parts, feature extraction and feature reconstruction supervision, combined with a multi-supervision learning approach. The method achieves an effective image retrieval function in the porcelain field with high accuracy and has application value. In addition, few image retrieval methods in domestic or foreign publications target the characteristics of this field, so the method is also innovative.

Description

Porcelain field image retrieval algorithm based on feature reconstruction supervision
Technical Field
The invention relates to the field of image retrieval, and in particular to a porcelain-field retrieval algorithm based on feature reconstruction supervision, used to improve the precision of porcelain image retrieval.
Background
The rapid development of computer technology has brought an enormous growth in the amount of information. Faced with massive data, timely and accurate retrieval has become a basic requirement for handling useful data efficiently. In the field of image retrieval, early methods met retrieval needs by analyzing keywords from textual descriptions of image content supplied by the requester, combined with manual labeling and similar techniques. However, as data volumes grew sharply and searchers' precision requirements rose, the shortcomings of these early methods became more and more prominent and could no longer meet demand. Content-based image retrieval, developed later, achieves accurate search-by-image and has been applied successfully in real search scenarios such as Baidu and Taobao. Although image search technology is maturing, there is still considerable room for improvement in accuracy, responsiveness, range of application, and other respects.
People today place growing importance on cultural and spiritual pursuits, and the enthusiasm for collecting cultural-relic porcelain increasingly highlights the importance of image retrieval in this field. However, unlike search over Baidu's general image collections or Taobao's e-commerce pictures, porcelain image search has more distinctive characteristics. First, many searching users are experts or enthusiasts, so the demand for search precision is high. Second, compared with general search engines and e-commerce platforms, the market for porcelain image retrieval is relatively small owing to the nature of the industry, so existing related algorithms are not mature. Furthermore, the searched objects are highly similar to one another: two similar blue-and-white bowls may differ only slightly in their patterns visually yet actually be two different bowls. Such slight differences increase the difficulty of accurate search and the complexity of the application.
Disclosure of Invention
The invention provides an image retrieval method based on feature reconstruction supervision, aiming to improve the retrieval precision of images in the porcelain field. The method achieves an effective image retrieval function in the porcelain field with high accuracy and has application value. In addition, few image retrieval methods in domestic or foreign publications target the characteristics of this field, so the method is also innovative.
Addressing the characteristics of the application field and the defects of mainstream image retrieval methods, a porcelain retrieval method based on feature reconstruction supervision is proposed on the basis of deep learning. The search domain represents a user's search sample and the searched domain represents the data set being searched; the method consists mainly of two parts, feature extraction and feature reconstruction supervision, combined with a multi-supervision learning approach. The algorithm model is shown in fig. 1. The technical scheme is as follows:
A porcelain-field image retrieval algorithm based on feature reconstruction supervision comprises the following steps:
Step 1: feature extraction. Features are extracted using one feature extraction network for the search domain and another for the searched domain; each feature extraction network comprises two parts, high-level semantic feature extraction and feature dimension-reduction coding.
Step 2: feature reconstruction. The neural network autonomously reconstructs the high-level semantic features of the search domain and the searched domain, and the feature extraction network is supervised and corrected according to the difference between the reconstructed high-level semantic information and the high-level semantic features extracted by the feature extraction network.
the reconstruction loss functions of the search domain, the searched domain positive sample and the searched domain negative sample are respectively as follows:
equation 2-1:
Figure BDA0002374143110000021
equation 2-2:
Figure BDA0002374143110000022
equations 2 to 3
Figure BDA0002374143110000023
Thus, the final reconstruction loss is:
equations 2-4:
Figure BDA0002374143110000038
in the above formula: ds
Figure BDA0002374143110000039
And
Figure BDA00023741431100000310
respectively reconstructing the output of the sub-network for the characteristics of the search domain, the searched domain positive sample and the searched domain negative sample;
Step 3: loss function. Triplet loss is used as the loss function; in the porcelain image search process, the porcelain image features of the search domain must be extracted and matched against the searched domain to obtain their similarity.
The objective of the triplet loss is to reduce the distance between the search sample and the searched-domain positive sample while increasing the distance between the search sample and the searched-domain negative sample:

$\left\| E_{i}^{S} - E_{i}^{B+} \right\|_{2}^{2} + a < \left\| E_{i}^{S} - E_{i}^{B-} \right\|_{2}^{2}, \quad \forall \left( E_{i}^{S}, E_{i}^{B+}, E_{i}^{B-} \right) \in T$

where $E_{i}^{S}$, $E_{i}^{B+}$ and $E_{i}^{B-}$ are the features of the i-th search-domain sample, searched-domain positive sample and searched-domain negative sample, respectively, and $a$ is the minimum margin; the input is therefore a triplet composed of a search-domain sample, a searched-domain positive sample and a searched-domain negative sample, and $T$ is the set of all possible triplets.
The final loss function is:

$L = \sum_{i}^{N} \max\left( \left\| E_{i}^{S} - E_{i}^{B+} \right\|_{2}^{2} - \left\| E_{i}^{S} - E_{i}^{B-} \right\|_{2}^{2} + a,\; 0 \right)$

When sampling for a given search-domain sample, the positive sample farthest from the search sample and the negative sample closest to it are selected, according to the formulas:

$E_{i}^{B+} = \arg\max_{E^{B+}} \left\| E_{i}^{S} - E^{B+} \right\|_{2}^{2}$

$E_{i}^{B-} = \arg\min_{E^{B-}} \left\| E_{i}^{S} - E^{B-} \right\|_{2}^{2}$

In application this can be adjusted to the actual situation, for example by selecting, within the mini-batch range, the positive and negative samples that satisfy the above formulas.
Step 4: feature matching. The degree of similarity is reflected by a distance computed in the feature space:

$d(x, y) = \sqrt{\sum_{i=1}^{n} \left( x_{i} - y_{i} \right)^{2}}$

where $x$ and $y$ are two points in the n-dimensional feature space; a smaller result indicates that the two features are closer in the feature space, i.e., that the images are more similar.
In the above, the search field represents a search sample of a user, and the searched field represents a searched data set; the feature extraction network of the search domain and the feature extraction network of the searched domain set parameters independently to retain information of the respective domains.
In the above, high-level semantic feature extraction refers to extracting the high-level semantic features of the search domain and the searched domain using five convolutional layers and one 4096-dimensional fully connected layer for each domain.
In the above, the convolutional layers have the same network configuration as those of GoogLeNet's Inception v1.
In the above, feature dimension-reduction coding means retaining the salient features of the search domain and the searched domain while reducing the number of features; it consists of three fully connected layers for each domain, which reduce the dimension of and encode the high-level semantic features $G^{S}$ and $G^{B}$ of the search domain and the searched domain, so that the image features of the two domains can subsequently be mapped into the same feature space, yielding the features $E^{S}$ and $E^{B}$ located in that common space.
In the above, the three fully connected layers have 2048, 1024 and 64 dimensions, and the fully connected parts share weights, ensuring a stronger semantic relation between the search-domain and searched-domain features when they are coded and mapped into the same feature space.
Transfer learning performs well in deep learning tasks. Although the high-level semantics of a neural network are difficult to interpret, it is generally accepted that the shallow layers of a neural network learn basic image information such as color and texture, and that a larger sample size yields more comprehensive, sufficient and general features. ImageNet is a large data set in the field of computer vision: its data sources are broad, its categories numerous and its picture styles varied, and it covers common object classes, so model parameters trained on it are widely applicable and well suited as initial parameters for network fine-tuning. Because the data set used in the invention is small in quantity yet has many classes, training on it alone can hardly achieve an ideal effect for the porcelain image retrieval task; the Inception v1 part of the GoogLeNet network is therefore not trained from scratch but fine-tuned from the weight parameters trained on the 1000-class ImageNet classification task. The method achieves an effective image retrieval function in the porcelain field with high accuracy and has application value. In addition, few image retrieval methods in domestic or foreign publications target the characteristics of this field, so the method is also innovative.
Drawings
FIG. 1 is a general model of the algorithm of the present invention.
Fig. 2 is a schematic diagram of a feature encoding and reconstructing network structure according to the present invention.
FIG. 3 is a diagram illustrating a search result according to an embodiment of the present invention.
Detailed Description
In order to facilitate understanding of the present invention, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Addressing the characteristics of the application field and the defects of mainstream image retrieval methods, the invention discloses a porcelain-field image retrieval algorithm based on feature reconstruction supervision. The specific steps are as follows:
step 1: feature extraction
In order to match the similarity of the search domain and the searched domain, two feature extraction networks are used to extract features from the search domain and the searched domain, respectively. As shown in fig. 1, a feature extraction network consists mainly of two parts: high-level semantic feature extraction and feature dimension-reduction coding. The high-level semantic feature extraction parts of the search domain and the searched domain each use five convolutional layers and one 4096-dimensional fully connected layer to extract the high-level semantic features of the two domains; the corresponding dimension-reduction coders each consist of three fully connected layers, which reduce the dimension of and encode the high-level semantic features $G^{S}$ and $G^{B}$ of the two domains so that the image features of both domains can be mapped into the same feature space, yielding the features $E^{S}$ and $E^{B}$ located in that common space.
the convolution layer in the search domain and the searched domain high-level semantic feature extraction network has the same network configuration as the convolution layer of the inceptionv1 of Googlenet, and a 4096-dimensional full-connection layer is connected behind the convolution layer. On one hand, compared with other classical feature extraction networks such as AlexNet and VGG, the network has deeper and wider network configuration and can extract better image features; on the other hand, the network has smaller network parameter number and can realize faster training and testing. And the feature dimension reduction and coding network consists of full connection layers of 2048, 1024 and 64 dimensions and is used for coding and mapping the features.
In order to strengthen the semantic relation between the search domain and the searched domain, the invention applies a partial weight-sharing constraint to the two feature extraction networks: the high-level semantic feature extraction parts of the search domain and the searched domain set their parameters independently to retain the information of each domain, while the fully connected layers of the dimension-reduction coder share weights, ensuring a stronger semantic relation between the features of the two domains when they are coded and mapped into the same feature space.
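To make the structure concrete, the following is a minimal PyTorch sketch of the two-branch feature extraction network described above: a GoogLeNet (Inception v1) backbone with a 4096-dimensional fully connected layer per domain, and a shared 2048/1024/64-dimensional dimension-reduction coder. The patent's experiments use Caffe; PyTorch, torchvision, and all class and variable names here are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DomainFeatureExtractor(nn.Module):
    """High-level semantic feature extraction for one domain (parameters not shared)."""
    def __init__(self):
        super().__init__()
        # Inception v1 pretrained on ImageNet; the torchvision builder disables
        # the auxiliary classifiers when loading pretrained weights.
        backbone = models.googlenet(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()        # keep the 1024-d pooled conv features
        self.conv = backbone
        self.fc = nn.Linear(1024, 4096)    # the 4096-d fully connected layer

    def forward(self, x):                  # x: (N, 3, 224, 224)
        g = self.conv(x)                   # high-level convolutional features
        return torch.relu(self.fc(g))      # G^S or G^B, 4096-d

class SharedEncoder(nn.Module):
    """Dimension-reduction coder; a single instance is used for both domains,
    which realizes the partial weight sharing described above."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4096, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, 64),           # E^S / E^B live in a 64-d common space
        )

    def forward(self, g):
        return self.net(g)

search_extractor = DomainFeatureExtractor()    # independent parameters (search domain)
searched_extractor = DomainFeatureExtractor()  # independent parameters (searched domain)
encoder = SharedEncoder()                      # one instance => shared weights
```

Using one SharedEncoder instance for both branches is the simplest way to express the constraint: the convolutional parts stay independent while every encoded feature passes through the same fully connected weights.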
Step 2: feature reconstruction
For multidimensional data, a network model needs a great deal of computation and time to learn the data characteristics. The more features there are, the higher the cost, yet the number of features is not simply proportional to the final effect: excessive features bring the risk of overfitting and reduce the robustness of the model, making application in real scenes difficult. Feature dimension reduction retains the salient features of the search domain and the searched domain while reducing the number of features. Since this process purposefully selects only part of the image information, some feature information is inevitably lost, and because computer-extracted features are hard to interpret, it is difficult to judge from a human perspective the quality of the semantic information retained after dimension reduction. Therefore, to compensate for the loss of useful image information caused by feature extraction and dimension reduction, the invention proposes a feature reconstruction sub-network. As shown in fig. 2, the neural network autonomously reconstructs the high-level semantic features of the search domain and the searched domain, and the feature extraction network is supervised and corrected according to the difference between the reconstructed high-level semantic information and the high-level semantic features extracted by the feature extraction network, ensuring semantic consistency of the search domain and the searched domain before and after feature dimension-reduction coding.
The reconstruction loss functions of the search domain and the searched domain (positive and negative samples) are, respectively:

Equation 2-1: $L_{rec}^{S} = \left\| D^{S} - G^{S} \right\|_{2}^{2}$

Equation 2-2: $L_{rec}^{B+} = \left\| D^{B+} - G^{B+} \right\|_{2}^{2}$

Equation 2-3: $L_{rec}^{B-} = \left\| D^{B-} - G^{B-} \right\|_{2}^{2}$

Thus, the final reconstruction loss is:

Equation 2-4: $L_{rec} = L_{rec}^{S} + L_{rec}^{B+} + L_{rec}^{B-}$

In the above formulas, $D^{S}$, $D^{B+}$ and $D^{B-}$ are the outputs of the feature reconstruction sub-network for the search domain, the searched-domain positive sample and the searched-domain negative sample, respectively, and $G^{S}$, $G^{B+}$ and $G^{B-}$ are the corresponding high-level semantic features extracted by the feature extraction network.
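Under the squared-L2 reconstruction form assumed in equations 2-1 to 2-4, the supervision can be sketched as follows; the decoder mirrors the shared encoder, and its exact shape is an assumption rather than something the patent specifies.

```python
import torch
import torch.nn as nn

# Feature reconstruction sub-network: decode the 64-d codes back to the
# 4096-d high-level semantic space (shape assumed, mirroring the encoder).
decoder = nn.Sequential(
    nn.Linear(64, 1024), nn.ReLU(),
    nn.Linear(1024, 2048), nn.ReLU(),
    nn.Linear(2048, 4096),
)

def reconstruction_loss(g_s, g_pos, g_neg, e_s, e_pos, e_neg):
    """g_*: high-level semantic features G; e_*: encoded features E (batched)."""
    d_s, d_pos, d_neg = decoder(e_s), decoder(e_pos), decoder(e_neg)  # D^S, D^{B+}, D^{B-}
    l_s   = ((d_s   - g_s  ) ** 2).sum(dim=1).mean()  # equation 2-1
    l_pos = ((d_pos - g_pos) ** 2).sum(dim=1).mean()  # equation 2-2
    l_neg = ((d_neg - g_neg) ** 2).sum(dim=1).mean()  # equation 2-3
    return l_s + l_pos + l_neg                        # equation 2-4
```

Because the gradient of this loss flows back through the encoder, minimizing it pushes the 64-d codes to keep the information needed to restore the 4096-d semantics, which is exactly the supervision role described above.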
Step 3: Loss function
The image retrieval task differs from the classification task: in classification, the label set is relatively fixed and the type of an image must be judged, whereas the goal of image retrieval is to determine whether two pictures show the same object, so the softmax function is not well suited. The present invention uses triplet loss as the loss function. Triplet loss is widely recognized in deep learning fields such as face recognition; much as face features are extracted in face recognition, in the porcelain image search process the porcelain image features of the search domain must be extracted and matched against the searched domain to obtain their similarity.
The objective of the triplet loss is to reduce the distance between the search sample and the positive sample while increasing the distance between the search sample and the negative sample:

$\left\| E_{i}^{S} - E_{i}^{B+} \right\|_{2}^{2} + a < \left\| E_{i}^{S} - E_{i}^{B-} \right\|_{2}^{2}, \quad \forall \left( E_{i}^{S}, E_{i}^{B+}, E_{i}^{B-} \right) \in T$

where $E_{i}^{S}$, $E_{i}^{B+}$ and $E_{i}^{B-}$ are the features of the i-th search-domain sample, searched-domain positive sample and searched-domain negative sample, respectively, $a$ is the minimum margin, the input is a triplet composed of a search-domain sample, a searched-domain positive sample and a searched-domain negative sample, and $T$ is the set of all possible triplets. The final loss function is:

$L = \sum_{i}^{N} \max\left( \left\| E_{i}^{S} - E_{i}^{B+} \right\|_{2}^{2} - \left\| E_{i}^{S} - E_{i}^{B-} \right\|_{2}^{2} + a,\; 0 \right)$

It can be seen from the formula that $L$ grows as the search-domain sample moves closer to the searched-domain negative sample or farther from the searched-domain positive sample; minimizing the loss therefore drives the neural network toward the desired behavior.
In addition, because the data set contains far more negative samples than positive samples, blindly selecting negative samples as part of the input lengthens training and makes it hard for the network to learn good features, hurting the final search precision; blindly selecting positive samples causes similar problems. The invention therefore uses triplet selection: for each search-domain sample, the hardest samples are chosen, according to the formulas

$E_{i}^{B+} = \arg\max_{E^{B+}} \left\| E_{i}^{S} - E^{B+} \right\|_{2}^{2}$

$E_{i}^{B-} = \arg\min_{E^{B-}} \left\| E_{i}^{S} - E^{B-} \right\|_{2}^{2}$

that is, the positive sample farthest from the search sample and the negative sample closest to it are selected as inputs of the triplet. For a neural network, being able to distinguish the "least similar positive sample" from the "most similar negative sample" indicates wider applicability.
In application this can be adjusted to the actual situation, for example by selecting, within the mini-batch range, the positive and negative samples that satisfy the above formulas.
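The mini-batch variant of this selection is often called batch-hard mining; a sketch under that reading follows, with the margin value and the use of class labels to mark positives being illustrative assumptions.

```python
import torch

def batch_hard_triplet_loss(e_query, e_gallery, labels_q, labels_g, margin=0.2):
    """e_query: (Nq, 64) search-domain features E^S; e_gallery: (Ng, 64)
    searched-domain features E^B; labels identify which porcelain item each
    image shows, so equal labels mark positive pairs."""
    d = torch.cdist(e_query, e_gallery) ** 2                       # squared L2 distances
    same = labels_q[:, None] == labels_g[None, :]                  # positive-pair mask
    pos_d = d.masked_fill(~same, float("-inf")).max(dim=1).values  # farthest positive
    neg_d = d.masked_fill(same, float("inf")).min(dim=1).values    # closest negative
    return torch.clamp(pos_d - neg_d + margin, min=0).mean()
```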
Step 4: Feature matching
In the porcelain search task, image features are extracted by a deep learning method, but the extracted features cannot be presented to the user directly as a final result; the features of two images must be turned into a qualitative or quantitative index in some way. Feature matching provides this quantitative expression of feature similarity. Computing the Euclidean distance in the feature space is one such method and is widely used in related fields such as face recognition.
The Euclidean distance reflects the real distance between two points in the feature space and is simple to compute and easy to understand. Each image feature obtained in the invention carries equal weight, so the degree of similarity can be reflected by the distance in the feature space:

$d(x, y) = \sqrt{\sum_{i=1}^{n} \left( x_{i} - y_{i} \right)^{2}}$

where $x$ and $y$ are two points in the n-dimensional feature space; a smaller result indicates that the two features are closer in the feature space, i.e., that the images are more similar.
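Retrieval then amounts to ranking the searched-domain features by this distance; a short sketch follows, where the function name and top-k cutoff are illustrative.

```python
import torch

def retrieve(query_feat, gallery_feats, top_k=5):
    """query_feat: (64,) encoded search-domain feature E^S;
    gallery_feats: (M, 64) precomputed searched-domain features E^B."""
    dists = torch.sqrt(((gallery_feats - query_feat) ** 2).sum(dim=1))
    order = torch.argsort(dists)           # ascending: most similar first
    return order[:top_k], dists[order[:top_k]]
```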
On the basis of the above, to verify the effectiveness of the described algorithm, it was implemented on the Caffe deep learning framework and experimental results are presented. The data set covers 32 porcelain classes and involves some 30,000 pictures, including blue-and-white bowls, solid-color bowls and double-eared vases, with a 4:1 ratio of training set to test set. The network training process was optimized with the SGD optimizer at a momentum of 0.9.
In the feature extraction part, the Inception v1 structure of GoogLeNet was fine-tuned from parameters trained on the 1000-class ImageNet classification data set.
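Combining the modules from the sketches above, the training setup described here could be expressed as follows; the patent's implementation is in Caffe, and the learning rate and the equal weighting of the two losses are illustrative assumptions.

```python
import torch

# SGD with momentum 0.9, per the text; the learning rate is an assumption.
params = (list(search_extractor.parameters())
          + list(searched_extractor.parameters())
          + list(encoder.parameters())
          + list(decoder.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

# One training step over a mini-batch (sketch):
#   loss = batch_hard_triplet_loss(e_s, e_b, labels_s, labels_b) \
#          + reconstruction_loss(g_s, g_pos, g_neg, e_s, e_pos, e_neg)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```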
The final search effect is shown in fig. 3: the left column is the retrieval sample and the right columns are the retrieval results, arranged from left to right in descending order of similarity. The original image ranks near the top of the retrieval results, indicating that the method presented here has application value.
The method can realize the effective image retrieval function in the field of porcelain, has high accuracy and has application value. In addition, in domestic and foreign data, image retrieval methods aiming at the characteristics of the field are fewer, so that the method has innovation.
The technical features mentioned above may be combined with each other to form various embodiments not listed here, all of which fall within the scope of the invention described in this specification; moreover, those skilled in the art may make modifications and variations in light of the above teachings, and all such modifications and variations falling within the spirit and scope of the invention as defined by the appended claims are intended to be covered.

Claims (6)

1. A porcelain-field image retrieval algorithm based on feature reconstruction supervision, characterized by comprising the following steps:
Step 1: feature extraction. Features are extracted using one feature extraction network for the search domain and another for the searched domain; each feature extraction network comprises two parts, high-level semantic feature extraction and feature dimension-reduction coding.
Step 2: feature reconstruction. The neural network autonomously reconstructs the high-level semantic features of the search domain and the searched domain, and the feature extraction network is supervised and corrected according to the difference between the reconstructed high-level semantic information and the high-level semantic features extracted by the feature extraction network.
the reconstruction loss functions of the search domain, the searched domain positive sample and the searched domain negative sample are respectively as follows:
equation 2-1:
Figure FDA0002374143100000011
equation 2-2:
Figure FDA0002374143100000012
equations 2 to 3
Figure FDA0002374143100000013
Thus, the final reconstruction loss is:
equations 2-4:
Figure FDA0002374143100000014
in the above formula: ds
Figure FDA0002374143100000015
And
Figure FDA0002374143100000016
respectively reconstructing the output of the sub-network for the characteristics of the search domain, the searched domain positive sample and the searched domain negative sample;
Step 3: loss function. Triplet loss is used as the loss function; in the porcelain image search process, the porcelain image features of the search domain must be extracted and matched against the searched domain to obtain their similarity.
The objective of the triplet loss is to reduce the distance between the search sample and the searched-domain positive sample while increasing the distance between the search sample and the searched-domain negative sample:

$\left\| E_{i}^{S} - E_{i}^{B+} \right\|_{2}^{2} + a < \left\| E_{i}^{S} - E_{i}^{B-} \right\|_{2}^{2}, \quad \forall \left( E_{i}^{S}, E_{i}^{B+}, E_{i}^{B-} \right) \in T$

where $E_{i}^{S}$, $E_{i}^{B+}$ and $E_{i}^{B-}$ are the features of the i-th search-domain sample, searched-domain positive sample and searched-domain negative sample, respectively, $a$ is the minimum margin, the input is a triplet composed of a search-domain sample, a searched-domain positive sample and a searched-domain negative sample, and $T$ is the set of all possible triplets. The final loss function is:

$L = \sum_{i}^{N} \max\left( \left\| E_{i}^{S} - E_{i}^{B+} \right\|_{2}^{2} - \left\| E_{i}^{S} - E_{i}^{B-} \right\|_{2}^{2} + a,\; 0 \right)$

When sampling for a given search-domain sample, the positive sample farthest from the search sample and the negative sample closest to it are selected, according to the formulas:

$E_{i}^{B+} = \arg\max_{E^{B+}} \left\| E_{i}^{S} - E^{B+} \right\|_{2}^{2}$

$E_{i}^{B-} = \arg\min_{E^{B-}} \left\| E_{i}^{S} - E^{B-} \right\|_{2}^{2}$

In application this can be adjusted to the actual situation, for example by selecting, within the mini-batch range, the positive and negative samples that satisfy the above formulas.
Step 4: feature matching. The degree of similarity is reflected by a distance computed in the feature space:

$d(x, y) = \sqrt{\sum_{i=1}^{n} \left( x_{i} - y_{i} \right)^{2}}$

where $x$ and $y$ are two points in the n-dimensional feature space; a smaller result indicates that the two features are closer in the feature space, i.e., that the images are more similar.
2. The algorithm of claim 1, wherein the search domain represents a user's search sample and the searched domain represents the data set being searched; the feature extraction network of the search domain and the feature extraction network of the searched domain set their parameters independently to retain the information of the respective domains.
3. The algorithm of claim 1, wherein the high-level semantic feature extraction refers to extracting the high-level semantic features of the search domain and the searched domain using five convolutional layers and one 4096-dimensional fully connected layer for each domain.
4. The algorithm of claim 3, wherein the convolutional layers have the same network configuration as those of GoogLeNet's Inception v1.
5. The algorithm of claim 1, wherein the feature dimension-reduction coding retains the salient features of the search domain and the searched domain while reducing the number of features, consists of three fully connected layers for each domain, reduces the dimension of and encodes the high-level semantic features $G^{S}$ and $G^{B}$ of the search domain and the searched domain, and then maps the image features of the search domain and the searched domain into the same feature space, obtaining the features $E^{S}$ and $E^{B}$ located in that space.
6. The algorithm of claim 5, wherein the three fully connected layers consist of fully connected layers of 2048, 1024 and 64 dimensions, and the fully connected parts share weights, ensuring a stronger semantic relation between the search-domain and searched-domain features when they are coded and mapped into the same feature space.
CN202010059993.5A 2020-01-19 2020-01-19 Porcelain field image retrieval algorithm based on feature reconstruction supervision Pending CN111274430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010059993.5A CN111274430A (en) 2020-01-19 2020-01-19 Porcelain field image retrieval algorithm based on feature reconstruction supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010059993.5A CN111274430A (en) 2020-01-19 2020-01-19 Porcelain field image retrieval algorithm based on feature reconstruction supervision

Publications (1)

Publication Number Publication Date
CN111274430A (en) 2020-06-12

Family

ID=71000601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010059993.5A Pending CN111274430A (en) 2020-01-19 2020-01-19 Porcelain field image retrieval algorithm based on feature reconstruction supervision

Country Status (1)

Country Link
CN (1) CN111274430A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154229A1 (en) * 2013-11-29 2015-06-04 Canon Kabushiki Kaisha Scalable attribute-driven image retrieval and re-ranking
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion
CN109815355A (en) * 2019-01-28 2019-05-28 网易(杭州)网络有限公司 Image search method and device, storage medium, electronic equipment
CN110059206A (en) * 2019-03-29 2019-07-26 银江股份有限公司 A kind of extensive hashing image search method based on depth representative learning
CN110175251A (en) * 2019-05-25 2019-08-27 西安电子科技大学 The zero sample Sketch Searching method based on semantic confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
聂一亮 (Nie Yiliang): "基于深度学习与特征融合的图像检索方法研究" (Research on image retrieval methods based on deep learning and feature fusion) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241682A (en) * 2020-09-14 2021-01-19 同济大学 End-to-end pedestrian searching method based on blocking and multi-layer information fusion
CN112241682B (en) * 2020-09-14 2022-05-10 同济大学 End-to-end pedestrian searching method based on blocking and multi-layer information fusion

Similar Documents

Publication Publication Date Title
CN111177446B (en) Method for searching footprint image
Laffont et al. Transient attributes for high-level understanding and editing of outdoor scenes
CN110851645B (en) Image retrieval method based on similarity maintenance under deep metric learning
CN103186538A (en) Image classification method, image classification device, image retrieval method and image retrieval device
WO2002080037A2 (en) Maximizing expected generalization for learning complex query concepts
CN108121781B (en) Related feedback image retrieval method based on efficient sample selection and parameter optimization
CN109710804B (en) Teaching video image knowledge point dimension reduction analysis method
CN107895028A (en) Using the Sketch Searching method of deep learning
Zhao et al. Transfer learning with ensemble of multiple feature representations
CN107168968A (en) Towards the image color extracting method and system of emotion
CN109408619B (en) Method for dynamically calculating similarity between question and answer in question-answering field
CN111694977A (en) Vehicle image retrieval method based on data enhancement
CN112733602A (en) Relation-guided pedestrian attribute identification method
Gao et al. Wallpaper texture generation and style transfer based on multi-label semantics
CN113420173A (en) Minority dress image retrieval method based on quadruple deep learning
CN111274430A (en) Porcelain field image retrieval algorithm based on feature reconstruction supervision
CN111914772B (en) Age identification method, age identification model training method and device
CN111522985B (en) Antique artwork image retrieval method based on depth-layer feature extraction and fusion
CN112508108A (en) Zero-sample Chinese character recognition method based on etymons
CN111382871A (en) Domain generalization and domain self-adaptive learning method based on data expansion consistency
Min et al. Overview of content-based image retrieval with high-level semantics
CN114564594A (en) Knowledge graph user preference entity recall method based on double-tower model
CN111091492B (en) Face image illumination migration method based on convolutional neural network
CN112116669B (en) Image aesthetic prediction method based on color and harmonic plane composition
CN110750672B (en) Image retrieval method based on deep measurement learning and structure distribution learning loss

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination