CN114647754A - Hand-drawn image real-time retrieval method fusing image label information

Info

Publication number: CN114647754A
Application number: CN202210396360.2A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 戴大伟, 唐晓宇, 刘颖格, 夏书银, 王国胤
Current assignee: Chongqing University of Posts and Telecommunications
Filing date: 2022-04-15
Publication date: 2022-06-21
Family ID: 81996817
Legal status: Pending

Classifications

    • G06F 16/535 — Information retrieval of still image data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 16/583 — Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
    • G06N 3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/047 — Neural networks; probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; learning methods


Abstract

The invention belongs to the field of image retrieval, and particularly relates to a hand-drawn image real-time retrieval method fusing image label information, comprising: extracting feature maps of the hand-drawn sketch and feature vectors of the real object images with an improved neural network model; when the feature vector of the hand-drawn sketch is generated for retrieval, calculating the Euclidean distances D between the sketch branch and all images, and taking the average value dm of D as the label distance reference value; according to the input label and the pseudo labels Pc (the Softmax-processed probability values of the corresponding input label category stored in the database), weighting the distance dm per database sample category to obtain the label-weighted distance value Dl; finally sorting the images in the database according to the sum of D and Dl and returning the top-k retrieved images. When the method is used to retrieve an early-stage sketch, information such as the color and characteristics of the target image can be used in the query, greatly improving retrieval efficiency when stroke information is scarce.

Description

Hand-drawn image real-time retrieval method fusing image label information
Technical Field
The invention belongs to the field of dynamic sketch retrieval, and particularly relates to a hand-drawn image real-time retrieval method fusing image label information.
Background
With the development of touch screens in recent years, sketch-based image retrieval, which can flexibly query natural images with unconstrained hand-drawn sketches, has received wide attention. By retrieval granularity, sketch retrieval can be classified into coarse-grained, category-level sketch retrieval (CBIR) and fine-grained sketch retrieval (FG-SBIR). FG-SBIR matches images against the details of a hand-drawn sketch, aiming to retrieve a specific photo in the gallery. Much progress has been made in FG-SBIR research, but three problems in the sketching process prevent FG-SBIR from being widely used in practice: (1) users with insufficient drawing skill produce sketches that vary widely, so retrieval efficiency is low; (2) the time required to draw a complete sketch must be considered, so sketch retrieval time must be reduced by retrieving the target image with as few strokes as possible; (3) sketches are abstract: a target image is generally retrieved with simple lines, and a sample sketch contains only black-and-white strokes and therefore little information. Moreover, sketches are diverse: target images with small style differences (such as women's high-heeled shoes) have extremely similar contours, so their sketches are also extremely similar and cannot distinguish the target images; this is a further reason why early-stage retrieval efficiency is low. In the traditional method, when a user searches for a commodity, the target image can only be queried with lines; if the user wants, for example, a red chair, the line information contains no color information, so the desired content can be retrieved only once the late-stage sketch is complete.
In summary, the prior art faces the technical problem of how to improve retrieval efficiency when stroke information is scarce.
Disclosure of Invention
To solve this technical problem, the invention provides a hand-drawn image real-time retrieval method fusing image label information, which adds sketch information to a retrieval framework fused with the sketch style and enhances the early-stage efficiency of sketch retrieval. When the user retrieves with an early sketch, information such as the color and characteristics of the target image is used in the query at the same time, which can greatly improve retrieval efficiency when stroke information is scarce.
A hand-drawn image real-time retrieval method fusing image label information comprises the following steps:
inputting a hand-drawn image and the label information of the target image into an improved neural network model trained on a training set, and retrieving in real time to obtain the retrieval result;
the training of the improved neural network model comprises1、f2、f3、fexWherein, f1To pre-train the network, f2For the layer of attention, f3To lower the dimension layer, fexA label extraction layer;
the training process of the improved neural network model comprises the following steps:
S1: constructing a training set, wherein the training set comprises an image set consisting of a plurality of images with their corresponding complete sketches, and an extended label set corresponding to the images, the extended label set of an image consisting of all label information of that image;
S2: in each training step, selecting one image in the image set as the target image, and training the f1, f2, f3 branches of the neural network model with the hand-drawn sketch corresponding to that image; fixing the f1, f2 parameters after training; once training is complete, extracting the embedded vectors of all target images through f1, f2, f3;
S3: inputting the target images in the image set into the trained f1 to obtain feature maps of the target images, inputting the feature maps into fex to predict the labels of the images, and training fex with a cross-entropy loss function according to the label information in the extended label set; fixing the parameters after training;
S4: rendering the complete sketch of each image in the image set into a plurality of sketches according to the stroke order of drawing, the rendered sketches forming the sketch branch set of the image set, and extracting the embedded vectors of the sketch branches through f1, f2, f3;
S5: calculating the error between the embedded vector of each picture in the sketch branch and the embedded vector of the target image with a triplet loss function, back-propagating the error with the goal of approaching the target image and moving away from non-target images, and adjusting the parameters of f3 in the model;
S6: obtaining the sketch branch of the next target image, and repeating steps S4-S6 until the model reaches the upper limit of training iterations.
Furthermore, the complete sketch of an image is rendered into N pictures according to the stroke order of drawing; the N pictures form a sketch branch, each picture in the sketch branch containing the first through n-th strokes of the complete sketch (1 ≤ n ≤ N), so that every picture has a different number of strokes. Arranged in ascending order of the number of strokes contained in the pictures, a sketch branch is S = {s1, s2, …, sn, …, sN}, where sn represents the picture containing the first through n-th strokes.
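As an illustration of this branch-rendering step, the following is a minimal Python sketch; the stroke format (each stroke a polyline of (x, y) points), canvas size, and line width are assumptions for demonstration, not specified by the invention.

```python
from PIL import Image, ImageDraw

def render_sketch_branch(strokes, size=(256, 256), width=3):
    """Render the branch S = {s1, ..., sN}: picture n holds strokes 1..n."""
    branch = []
    canvas = Image.new("L", size, color=255)       # white canvas
    draw = ImageDraw.Draw(canvas)
    for stroke in strokes:                         # strokes in drawing order
        draw.line(stroke, fill=0, width=width)     # add the n-th stroke
        branch.append(canvas.copy())               # snapshot picture s_n
    return branch

# A toy two-stroke sketch: branch[0] has one stroke, branch[1] has both.
branch = render_sketch_branch([[(20, 200), (128, 40), (236, 200)],
                               [(60, 160), (196, 160)]])
```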
Further, the images in the training set are given labels L = {l1, l2, …, ln, …, lN} for training the label extraction layer fex; the cross-entropy Loss expression is as follows:

Loss = -(1/N)·Σ_{n=1}^{N} Σ_{c=1}^{K} lnc·log(pnc)

wherein K represents the total number of categories contained in the label; N represents the total number of samples; n represents the n-th sample; pnc represents the probability that sample n belongs to class c; lnc represents the correct probability label of class c for sample n;
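A hedged PyTorch sketch of this training step follows; the shape of f1's feature map and the form of fex (pooling plus a linear classifier) are assumptions, since the invention only specifies that fex consumes f1's feature map and is trained with cross entropy.

```python
import torch
import torch.nn as nn

K = 10                                            # total label categories
f_ex = nn.Sequential(                             # assumed form of f_ex
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, K))
criterion = nn.CrossEntropyLoss()                 # softmax + the Loss above
optimizer = torch.optim.Adam(f_ex.parameters(), lr=1e-4)

feature_map = torch.randn(8, 512, 7, 7)           # stand-in for f1's output
labels = torch.randint(0, K, (8,))                # correct class per sample
loss = criterion(f_ex(feature_map), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```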
further, calculating the error of the embedded vector of each picture in the sketch branch and the embedded vector of the target image by adopting a triple Loss function, wherein the expression of the triple Loss is as follows:
Loss=max(d(VSi,Vp)-d(VSi,Vn)+α,0)
wherein, VSiAn embedded vector representing the ith picture in the sketch branch; vpAn embedded vector representing the target image; vnAn embedded vector representing a random one of the images in the image set other than the target image; α is a constant; d is the Euclidean distance calculation.
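The triplet loss can be rendered directly in PyTorch; the embedding dimension and the margin value α below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(v_si, v_p, v_n, alpha=0.2):
    """Loss = max(d(VSi, Vp) - d(VSi, Vn) + alpha, 0), d = Euclidean."""
    d_pos = F.pairwise_distance(v_si, v_p)   # distance to the target image
    d_neg = F.pairwise_distance(v_si, v_n)   # distance to a non-target image
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()

v_si, v_p, v_n = (torch.randn(4, 128) for _ in range(3))
print(triplet_loss(v_si, v_p, v_n))
```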
Further, inputting the hand-drawn sketch and the label information of the target image, retrieving in real time and obtaining the final retrieval result comprises the following steps:
Step one: the sketch entered by the user passes through the image distance network f1, f2, f3 to obtain the sketch embedded vector VSi of step i;
Step two: the Euclidean distance between VSi and the embedded vector Vp of each image in the database is calculated to obtain the distance vector D = {d1, d2, …, dn, …, dN};
Step three: the average value of the elements in the distance vector is calculated, the feature map output by f1 is input into fex to predict the label probabilities of the input sketch, and the label probabilities are processed with Softmax to obtain the pseudo label;
Step four: according to the relation between the pseudo label and the input label, the average value dm of the elements in the distance vector is weighted to obtain the label-weighted distance value Dl;
Step five: an attenuation coefficient is assigned to the label-weighted distance, and the images in the database are sorted according to the sum of D and Dl to obtain the retrieval result.
Further, a convolutional neural network is adopted to predict the label probabilities of images, and Softmax processing gives the set of probability vectors Pc = {p1c, p2c, …, pnc, …, pNc} of the N samples each belonging to class c; Pc is taken as the pseudo label, and the probability pnc that sample n belongs to class c is expressed as:

pnc = exp(Vnc) / Σ_{k=1}^{K} exp(Vnk)

wherein Vnc represents the probability vector that sample n belongs to class c; Vnk represents the probability vector of sample n over the total number of label categories; K represents the total number of categories contained in the label; N represents the total number of samples; n represents the n-th sample; pnc represents the probability that sample n belongs to class c.
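In code, the pseudo-label computation is a row-wise Softmax; the numpy sketch below assumes `logits` holds the per-class scores Vnc of the N database samples.

```python
import numpy as np

def pseudo_labels(logits):
    """pnc = exp(Vnc) / sum_k exp(Vnk), applied row-wise over K classes."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

P = pseudo_labels(np.random.randn(5, 3))   # 5 samples, K = 3 classes
print(P.sum(axis=1))                       # each row sums to 1
```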
Further, the average value dm of the elements in the distance vector is weighted according to the relation between the pseudo label and the input label to obtain the label-weighted distance value Dl. The maximum value Max(pn) of the pseudo label of sample n gives the label class to which the sample belongs; if Max(pn) > 0.8, sample n is marked as a credible sample, otherwise as an untrusted sample. If Max(pn) > 0.8 and the class is the same as the input label, sample n is a credible positive sample; if Max(pn) > 0.8 and the class differs from the input label, sample n is a credible negative sample; otherwise it is an untrusted sample and its distance is not weighted. The label-weighted distance value Dl is calculated as:

Dl(n) = ωp·pn·dm, if sample n is a credible positive sample; Dl(n) = ωn·pn·dm, if sample n is a credible negative sample; Dl(n) = 0, otherwise

wherein dm represents the average of the elements in the distance vector; dn represents the Euclidean distance between sample n and the sketch vector; N represents the total number of samples; ωp < 0 is the weight for credible positive sample labels; ωn > 0 is the weight for credible negative sample labels; pn is the pseudo-label probability value of sample n.
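A minimal numpy sketch of this weighting rule follows, assuming the sign convention above (ωp < 0 pulls credible positive samples closer in the final ranking, ωn > 0 pushes credible negative samples away); the concrete weight values are illustrative.

```python
import numpy as np

def label_weighted_distance(P, input_label, d_m, w_p=-0.5, w_n=0.5):
    """Dl[n] weights the reference distance d_m by the pseudo label of n."""
    D_l = np.zeros(len(P))
    for n, p in enumerate(P):            # p: class probabilities of sample n
        if p.max() <= 0.8:               # untrusted sample: not weighted
            continue
        if p.argmax() == input_label:    # credible positive sample
            D_l[n] = w_p * p.max() * d_m # w_p < 0: lowers the final distance
        else:                            # credible negative sample
            D_l[n] = w_n * p.max() * d_m # w_n > 0: raises the final distance
    return D_l
```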
Further, an attenuation coefficient is assigned to the label-weighted distance, and the images in the database are sorted according to the sum of D and Dl, with the expression:

Dfinal = D + ω·Dl

wherein D is the distance vector between the sketch branch and all images; Dl is the label-weighted distance; Dfinal is the distance by which the final sorting is performed; ω is the label-weighted distance weight, which gradually decreases as i increases, i.e., as the input sketch becomes more complete.
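Putting the retrieval-time pieces together, the following self-contained numpy sketch mirrors steps one to five over a toy database; the exponential decay schedule for ω (the invention only requires ω to decrease as step i grows), the gallery embeddings, and the stored pseudo labels are stand-in assumptions.

```python
import numpy as np

def query(v_sketch, V_db, P, input_label, step_i,
          w_p=-0.5, w_n=0.5, w0=1.0, decay=0.9, k=5):
    D = np.linalg.norm(V_db - v_sketch, axis=1)   # Euclidean distances dn
    d_m = D.mean()                                # label distance reference
    conf, cls = P.max(axis=1), P.argmax(axis=1)   # pseudo-label confidence
    D_l = np.where(conf > 0.8,                    # weight credible samples
                   np.where(cls == input_label, w_p, w_n) * conf * d_m,
                   0.0)
    omega = w0 * decay ** step_i                  # omega decays as i grows
    return np.argsort(D + omega * D_l)[:k]        # top-k image indices

rng = np.random.default_rng(0)
V_db = rng.normal(size=(100, 128))                # stand-in embeddings Vp
P = rng.dirichlet(np.ones(10), size=100)          # stand-in pseudo labels
print(query(rng.normal(size=128), V_db, P, input_label=3, step_i=2))
```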
The invention fuses image label information into sketch retrieval for early-stage queries: with few strokes in the early stage, images are retrieved according to the extended label set of the target image, and the sketch style is fused into the sketch retrieval framework, so the target image can be retrieved with the fewest sketch strokes, reducing the early retrieval time of the hand-drawn sketch and improving retrieval efficiency.
Drawings
FIG. 1 is a diagram of a baseline model of the present invention;
FIG. 2 is a diagram of a deep neural network search framework model according to the present invention;
FIG. 3 is a schematic diagram of the sketch branch rendering process and picture label classification.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A hand-drawn image real-time retrieval method fusing image label information, as shown in figs. 1-2, comprises:
acquiring the complete sketch of the target image and the label information of all target images, the label information of all target images forming the extended label set; then rendering the complete sketch into N pictures according to the stroke order of drawing, the N rendered pictures forming a sketch branch, each picture in the sketch branch containing the first through n-th strokes of the complete sketch (1 ≤ n ≤ N), so that every picture has a different number of strokes; in ascending order of stroke count, a sketch branch is S = {s1, s2, …, sn, …, sN}, where sn represents the picture containing the first through n-th strokes.
As shown in fig. 3, the QMUL-Shoe-V2 and QMUL-Chair-V2 data sets are selected from the image retrieval data sets for fine-grained sketch retrieval (FG-SBIR) as the training data sets of the present model; one image is selected from each of QMUL-Shoe-V2 and QMUL-Chair-V2 as a target image, the complete sketch and label information of the image are obtained to form the extended label set, and the sketch branch of the target image is obtained by rendering according to the stroke order of the picture.
Specifically, as shown in the label information of fig. 3(b), volunteers with different drawing backgrounds were recruited and asked to manually draw complete sketches from the target images. According to the picture information in the two data sets, corresponding label information is attached to each picture, the content of the label information being classified manually.
Specifically, as shown in the hand-drawn sketching process of fig. 3(a), a complete sketch is rendered into N pictures according to the completeness of the sketch; the N rendered pictures are the sketch branch, and each picture in the sketch branch contains the first through n-th strokes of the complete sketch. For example: the first picture in the sketch branch contains only the first stroke of the complete sketch, the second picture contains the first and second strokes, the third picture contains the first, second and third strokes, and so on.
A plurality of target images acquired from the QMUL-Shoe-V2 and QMUL-Chair-V2 data sets, together with the label information of the complete-sketch image sets corresponding to the target images, form the extended label set and constitute the training set. During model training, the label information is used to train the label branch of the images; during retrieval, the label predicted by the label branch for an image is compared with the input label to assist the search. The hand-drawn sketch and the label information of the target image are input into the trained improved neural network model, and the target image is retrieved in real time;
the training process of the improved neural network model comprises the following steps:
S1: constructing a training set, wherein the training set comprises an image set consisting of a plurality of images with their corresponding complete sketches, and an extended label set corresponding to the images, the extended label set of an image consisting of all label information of that image;
S2: in each training step, selecting one image in the image set as the target image, and training the f1, f2, f3 branches of the neural network model with the hand-drawn sketch corresponding to that image; fixing the f1, f2 parameters after training; once training is complete, extracting the embedded vectors of all target images through f1, f2, f3;
S3: inputting the target images in the image set into the trained f1 to obtain feature maps of the target images, inputting the feature maps into fex to predict the labels of the target images, and training fex with a cross-entropy loss function according to the label information in the extended label set; fixing the parameters after training;
S4: rendering the complete sketch of each image in the image set into a plurality of sketches according to the stroke order of drawing, the rendered sketches forming the sketch branch set of the image set, and extracting the embedded vectors of the sketch branches through f1, f2, f3;
S5: calculating the error between the embedded vector of each picture in the sketch branch and the embedded vector of the target image with a triplet loss function, back-propagating the error with the goal of approaching the target image and moving away from non-target images, and adjusting the parameters of f3 in the model;
S6: obtaining the sketch branch of the next target image, and repeating steps S4-S6 until the model reaches the upper limit of training iterations.
In step S3 of the training process of the improved neural network model, training fex with a cross-entropy loss function according to the label information in the extended label set comprises: setting labels L = {l1, l2, …, ln, …, lN} for the images in the training set to train the label extraction layer fex, the cross-entropy Loss expression being as follows:

Loss = -(1/N)·Σ_{n=1}^{N} Σ_{c=1}^{K} lnc·log(pnc)

wherein K represents the total number of categories contained in the label; N represents the total number of samples; pnc represents the probability that sample n belongs to class c; lnc represents the correct probability label of class c for sample n.
In step S5 of the training process of the improved neural network model, a triplet Loss function is adopted to calculate the error between the embedded vector of each picture in the sketch branch and the embedded vector of the target image; the expression of the triplet Loss is:

Loss = max(d(VSi, Vp) − d(VSi, Vn) + α, 0)

wherein VSi represents the embedded vector of the i-th picture in the sketch branch; Vp represents the embedded vector of the target image; Vn represents the embedded vector of a random image in the image set other than the target image; α is a constant; d is the Euclidean distance.
Preferably, each target image is passed through f1 and its feature map through fex to predict the label probabilities of the target image; the label probabilities are processed with Softmax to obtain a pseudo label, and the pseudo label is stored in the database;
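The offline side of this step can be sketched as below; f1 and f_ex are assumed to be torch modules as in the earlier sketches, and the resulting matrix P is what the retrieval stage reads from the database.

```python
import torch

@torch.no_grad()
def precompute_pseudo_labels(images, f1, f_ex):
    """Compute and return the Softmax pseudo labels of all gallery images."""
    logits = f_ex(f1(images))            # per-class scores Vnc from f1 + f_ex
    return torch.softmax(logits, dim=1)  # rows pn sum to 1; store in database
```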
further, label probability of a convolutional neural network prediction image is adopted, and a probability vector set P of N samples respectively belonging to the category c is obtained through Softmax processingc={p1c,p2c,…,pnc,…,pNcH, mixing PcAs pseudo label, the probability p that a sample n belongs to class cncExpressed as:
Figure BDA0003599166670000081
wherein, VncA probability vector representing that sample n belongs to class c; vnkLabel class representing sample nA probability vector of the total; k represents the total number of categories contained in the label; n represents the total number of samples; n represents the nth sample; p is a radical ofncRepresenting the probability that sample n belongs to class c;
further, the input sketch passes through an image distance network f1、f2、f3Obtaining a sketch embedding vector V of the step iSiCalculating VSiWith the embedded vector V of each image in the databasepThe Euclidean distance of (D), obtain distance vector D ═ D1,d2,…,dn,…,dN}; taking the average value D of DmSelecting a probability value pseudo label P corresponding to the input label category processed by Softmax stored in a database as a label distance reference value according to the input labelc={p1c,p2c,…,pnc,…,pNcH, for distance dmWeighting to obtain a label weighted distance value DlMax (p), the maximum value of the pseudo label of the sample nn) For the label class to which the sample belongs, if Max (p)n)>0.8, marking the sample n as a credible sample, otherwise, marking the sample n as an untrusted sample; if the pseudo tag Max (p)n)>0.8 and the same as the input label, wherein the sample n is a credible positive sample; if the pseudo label Max (p)n)>0.8 and different from the input label, the sample n is a credible negative sample; otherwise, the distance is an untrusted sample, and the distance is not weighted; meanwhile, an attenuation coefficient is given to the weighted distance, so that the influence of the label on the retrieval result is reduced along with the increase of the steps, and finally the label is obtained according to D and DlThe sum sequences the images in the database, compares the label information of the pseudo label in the database with the label information of the target image, and obtains a retrieval result; the expression is as follows:
Figure BDA0003599166670000082
Figure BDA0003599166670000083
Dfinal=D+ω·Dl
wherein, ω is a label weighted distance weight, and when i is increased, i.e. the input sketch is more complete, ω is gradually decreased; omegap<0,ωpRepresenting confidence negative sample label weighted weights, ωn>0,ωnRepresenting a trusted positive sample label weighted weight; dnRepresents the average of the elements in the distance vector; d is a distance vector between the sketch branch and all the images; dlWeighting the label distance; dfinalThe distance according to which the final sorting is based.
When no picture of the commodity exists and the commodity is difficult to describe in words, the user can hand-draw a sketch of the commodity on a touch-screen device from its mental image, and can simultaneously enter the characteristics of the commodity to be retrieved (color, height, shape, etc.) as part of the query. The commodity sketch is rendered into sketch branches and input into the trained neural network model; through the retrieval of the sketch branch and of the label branch, the model returns the k images most similar to the commodity sketch, improving retrieval efficiency when stroke information is scarce.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A hand-drawn image real-time retrieval method fusing image label information is characterized by comprising the following steps:
inputting a hand-drawn sketch and label information of a target image into the trained improved neural network model, and retrieving in real time to obtain a retrieval result;
the improved neural network model comprises1、f2、f3And fex,f1To pre-train the network, f2For the layer of attention, f3To lower the dimension layer, fexA label extraction layer;
the training process of the improved neural network model comprises the following steps:
S1: constructing a training set, wherein the training set comprises an image set consisting of a plurality of images with their corresponding complete sketches, and an extended label set corresponding to the images, the extended label set of an image consisting of all label information of that image;
S2: in each training step, selecting one image in the image set as the target image, and training the f1, f2, f3 branches of the neural network model with the hand-drawn sketch corresponding to that image; fixing the f1, f2 parameters after training; once training is complete, extracting the embedded vectors of all target images through f1, f2, f3;
S3: inputting the target images in the image set into the trained f1 to obtain feature maps of the target images, inputting the feature maps into fex to predict the labels of the target images, and training fex with a cross-entropy loss function according to the label information in the extended label set; fixing the parameters after training;
S4: rendering the complete sketch of each image in the image set into a plurality of sketches according to the stroke order of drawing, the rendered sketches forming the sketch branch set of the image set, and extracting the embedded vectors of the sketch branches through f1, f2, f3;
S5: calculating the error between the embedded vector of each picture in the sketch branch and the embedded vector of the target image with a triplet loss function, back-propagating the error with the goal of approaching the target image and moving away from non-target images, and adjusting the parameters of f3 in the model;
S6: obtaining the sketch branch of the next target image, and repeating steps S4-S6 until the model reaches the upper limit of training iterations.
2. The hand-drawn image real-time retrieval method fusing image label information according to claim 1, wherein labels L = {l1, l2, …, ln, …, lN} are set for the images in the training set to train the label extraction layer fex, the cross-entropy Loss expression being as follows:

Loss = -(1/N)·Σ_{n=1}^{N} Σ_{c=1}^{K} lnc·log(pnc)

wherein K represents the total number of categories contained in the label; N represents the total number of samples; n represents the n-th sample; pnc represents the probability that sample n belongs to class c; lnc represents the correct probability label of class c for sample n.
3. The hand-drawn image real-time retrieval method fusing image label information according to claim 1, wherein a triplet Loss function is adopted to calculate the error between the embedded vector of each picture in the sketch branch and the embedded vector of the target image, the expression of the triplet Loss being:

Loss = max(d(VSi, Vp) − d(VSi, Vn) + α, 0)

wherein VSi represents the embedded vector of the i-th picture in the sketch branch; Vp represents the embedded vector of the target image; Vn represents the embedded vector of a random image in the image set other than the target image; α is a constant; d is the Euclidean distance.
4. The hand-drawn image real-time retrieval method fusing image label information according to claim 1, wherein inputting the hand-drawn sketch and the label information of the target image, retrieving in real time and obtaining the final retrieval result comprises the following steps:
Step one: the sketch entered by the user passes through the image distance network f1, f2, f3 to obtain the sketch embedded vector VSi of step i;
Step two: the Euclidean distance between VSi and the embedded vector Vp of each image in the database is calculated to obtain the distance vector D = {d1, d2, …, dn, …, dN};
Step three: the average value of the elements in the distance vector is calculated, the feature map output by f1 is input into fex to predict the label probabilities of the input sketch, and the label probabilities are processed with Softmax to obtain the pseudo label;
Step four: according to the relation between the pseudo label and the input label, the average value dm of the elements in the distance vector is weighted to obtain the label-weighted distance value Dl;
Step five: an attenuation coefficient is assigned to the label-weighted distance, and the images in the database are sorted according to the sum of D and Dl to obtain the retrieval result.
5. The hand-drawn image real-time retrieval method fusing image label information according to claim 4, wherein a convolutional neural network is adopted to predict the label probabilities of images, and Softmax processing gives the set of probability vectors Pc = {p1c, p2c, …, pnc, …, pNc} of the N samples each belonging to class c; Pc is taken as the pseudo label, and the probability pnc that sample n belongs to class c is expressed as:

pnc = exp(Vnc) / Σ_{k=1}^{K} exp(Vnk)

wherein Vnc represents the probability vector that sample n belongs to class c; Vnk represents the probability vector of sample n over the total number of label categories; K represents the total number of categories contained in the label; N represents the total number of samples; n represents the n-th sample; pnc represents the probability that sample n belongs to class c.
6. The hand-drawn image real-time retrieval method fusing image label information according to claim 4, wherein the average value dm of the elements in the distance vector is weighted according to the relation between the pseudo label and the input label to obtain the label-weighted distance value Dl; the maximum value Max(pn) of the pseudo label of sample n gives the label class to which the sample belongs; if Max(pn) > 0.8, sample n is marked as a credible sample, otherwise as an untrusted sample; if Max(pn) > 0.8 and the class is the same as the input label, sample n is a credible positive sample; if Max(pn) > 0.8 and the class differs from the input label, sample n is a credible negative sample; otherwise it is an untrusted sample and its distance is not weighted; the label-weighted distance value Dl is calculated as:

Dl(n) = ωp·pn·dm, if sample n is a credible positive sample; Dl(n) = ωn·pn·dm, if sample n is a credible negative sample; Dl(n) = 0, otherwise

wherein dm represents the average of the elements in the distance vector; dn represents the Euclidean distance between sample n and the sketch vector; N represents the total number of samples; Dl represents the label-weighted distance value; ωp < 0 is the weight for credible positive sample labels; ωn > 0 is the weight for credible negative sample labels; pn is the pseudo-label probability value of sample n.
7. The hand-drawn image real-time retrieval method fusing image label information according to claim 4, wherein an attenuation coefficient is assigned to the label-weighted distance, and the images in the database are sorted according to the sum of D and Dl, with the expression:

Dfinal = D + ω·Dl

wherein D is the distance vector between the sketch branch and all images; Dl is the label-weighted distance; Dfinal is the distance by which the final sorting is performed; ω is the label-weighted distance weight, which gradually decreases as i increases, i.e., as the input sketch becomes more complete.

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Title
CN116310425A * | 2023-05-24 | 2023-06-23 | Fine-grained image retrieval method, system, equipment and storage medium
CN116310425B * | 2023-05-24 | 2023-09-26 | Fine-grained image retrieval method, system, equipment and storage medium

Similar Documents

Publication | Publication date | Title
CN109919108B (en) Remote sensing image rapid target detection method based on deep hash auxiliary network
CN111191732B (en) Target detection method based on full-automatic learning
CN110598029B (en) Fine-grained image classification method based on attention transfer mechanism
US20220415027A1 (en) Method for re-recognizing object image based on multi-feature information capture and correlation analysis
CN105701502B (en) Automatic image annotation method based on Monte Carlo data equalization
CN107683469A (en) A kind of product classification method and device based on deep learning
CN110633708A (en) Deep network significance detection method based on global model and local optimization
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN114841257B (en) Small sample target detection method based on self-supervision comparison constraint
CN112668579A (en) Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution
CN111753828A (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN112115291B (en) Three-dimensional indoor model retrieval method based on deep learning
Rad et al. Image annotation using multi-view non-negative matrix factorization with different number of basis vectors
CN111061904A (en) Local picture rapid detection method based on image content identification
CN111738113A (en) Road extraction method of high-resolution remote sensing image based on double-attention machine system and semantic constraint
CN110287952A (en) A kind of recognition methods and system for tieing up sonagram piece character
CN112347284A (en) Combined trademark image retrieval method
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN111340034A (en) Text detection and identification method and system for natural scene
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network
CN115292532B (en) Remote sensing image domain adaptive retrieval method based on pseudo tag consistency learning
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN114510594A (en) Traditional pattern subgraph retrieval method based on self-attention mechanism
CN112819837A (en) Semantic segmentation method based on multi-source heterogeneous remote sensing image
CN115457332A (en) Image multi-label classification method based on graph convolution neural network and class activation mapping

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination