CN114020948A - Sketch image retrieval method and system based on sorting clustering sequence identification selection - Google Patents

Sketch image retrieval method and system based on sorting clustering sequence identification selection

Info

Publication number
CN114020948A
CN114020948A (application number CN202111259946.6A)
Authority
CN
China
Prior art keywords
layer
image
sketch
image retrieval
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111259946.6A
Other languages
Chinese (zh)
Inventor
Chen Yaxiong (陈亚雄)
Tang Yibo (汤一博)
Li Xiaoyu (李小玉)
Zhao Dongjie (赵东婕)
Xiong Shengwu (熊盛武)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202111259946.6A priority Critical patent/CN114020948A/en
Publication of CN114020948A publication Critical patent/CN114020948A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sketch image retrieval method based on sorting clustering sequence identification selection. The invention designs a triplet Transformer backbone with a sequence discriminative selection module that captures the important discriminative regions between a sketch and a natural image, and proposes an objective function consisting of a triplet term, a category-level semantic term, a ranking-clustering term and a discriminative learning term, which preserves the semantic similarity of the hash codes while they are learned, captures the similarity between the different modalities, and optimizes ranking information so as to cluster similar instances and guide discriminative-region learning. Finally, the hash codes are used to retrieve natural images for a query sketch. The problems of redundant information and neglected ranking information are alleviated, the retrieval precision is higher, and the performance is further improved.

Description

Sketch image retrieval method and system based on sorting clustering sequence identification selection
Technical Field
The invention belongs to the technical field of image retrieval, relates to a sketch image retrieval method and a sketch image retrieval system, and particularly relates to a sketch image retrieval method and a sketch image retrieval system based on sorting clustering sequence identification selection.
Background
Due to the explosive growth of touch-screen devices, sketches are used more and more frequently: a user can draw a sketch on a touch-screen device with a finger anytime and anywhere, so it is meaningful to mine relevant natural images with a sketch. Interest in sketch image retrieval is therefore increasing; its purpose is to match natural images using a hand-drawn sketch as the query modality.
Existing sketch image retrieval methods fall roughly into two categories: hand-crafted methods and deep learning methods. Hand-crafted sketch image retrieval methods do not reduce the cross-domain difference between sketches and natural images well, because hand-crafted features neither represent the edges of natural images effectively nor align with sketches that exhibit large variation and ambiguity. Deep learning sketch image retrieval methods were proposed to address this cross-domain difference, but they still face two challenges: (1) sketches and natural images may contain different objects with similar contour shapes, and some deep learning sketch retrieval methods cannot capture the important discriminative regions between a sketch and a natural image, which causes information redundancy and ultimately harms retrieval performance; (2) ranking information is closely related to the retrieval results, yet existing methods neglect it when learning hash codes for the sketch retrieval task, so their performance is not ideal.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a sketch image retrieval method and system based on sorting clustering sequence identification selection. The method and system make full use of discriminative regions and ranking information to perform hash code learning: discriminative regions are first selected for the query sketch, and ranking information is simultaneously used to aggregate samples of the same category, so that the category of a sample can be recognized in the other modality. Finally, the hash codes are used to retrieve natural images.
The method adopts the technical scheme that: a sketch image retrieval method based on sorting clustering sequence identification selection is characterized in that a sketch image retrieval network is firstly constructed, and then the sketch image retrieval network is utilized to carry out sketch image retrieval;
the construction of the sketch image retrieval network specifically comprises the following steps:
step 1: constructing a sketch image retrieval network;
the sketch image retrieval network comprises a Transformer partition module, a linear projection module and a Transformer encoding module;
the Transformer partition module is used for dividing the input image of size H x W into M 2-D patch images x_p, each of size P x P, where M = HW/P^2;
the linear projection module is used for mapping each patch image output by the partition module to D dimensions, and adding a learnable position embedding to the patch embeddings to preserve position information; the resulting embedding vector is denoted z_0, and the output at position zero is a D-dimensional class token x_class;
the Transformer encoding module is used for receiving z_0 and mining the relations between the patch images in the sequence; it comprises L Transformer layers and a hash layer, each Transformer layer containing a multi-head self-attention layer MSA and a Conv_1x1 block, the Conv_1x1 block consisting of two convolutional layers with 1 x 1 kernels and one fully-connected layer; the input of each Transformer layer is the output of the previous layer; the output of the L-th Transformer layer is fed into the hash layer for deep hash function learning, and the output hash codes are used to construct the triplet term, the category-level semantic term and the ranking-clustering term in the objective function;
step 2: acquiring an existing sketch image data set, and dividing the data set into a training data set, a verification data set and a test data set;
and step 3: in the training dataset, N triplet elements are given
{(x_i^a, x_i^p, x_i^n)}_{i=1}^N and triplet labels {(y_i^a, y_i^p, y_i^n)}_{i=1}^N, where x_i^a, x_i^p and x_i^n respectively denote the anchor sketch, the positive-example image and the negative-example image of the i-th triplet; y_i^a denotes the class label of x_i^a, y_i^p denotes the class label of x_i^p, and y_i^n denotes the class label of x_i^n; N and I respectively denote the number of triplets and the number of samples in the dataset; a, p and n respectively denote the anchor image, the positive-example image and the negative-example image;
Step 4: training the sketch image retrieval network with the training set, computing the objective function of the network and updating its initial parameters; training runs for a preset number of rounds or until the loss no longer decreases, yielding the trained sketch image retrieval network.
The technical scheme adopted by the system of the invention is as follows: a sketch image retrieval system based on sorting clustering sequence identification selection, comprising the following modules:
the module 1 is used for constructing a sketch image retrieval network module;
the module 2 is used for searching the sketch images by utilizing the sketch image searching network;
the module 1 specifically comprises the following sub-modules:
the submodule 1 is used for constructing a sketch image retrieval network;
the sketch image retrieval network comprises a Transformer partition module, a linear projection module and a Transformer encoding module;
the Transformer partition module is used for dividing the input image of size H x W into M 2-D patch images x_p, each of size P x P, where M = HW/P^2;
the linear projection module is used for mapping each patch image output by the partition module to D dimensions, and adding a learnable position embedding to the patch embeddings to preserve position information; the resulting embedding vector is denoted z_0, and the output at position zero is a D-dimensional class token x_class;
the Transformer encoding module is used for receiving z_0 and mining the relations between the patch images in the sequence; it comprises L Transformer layers and a hash layer, each Transformer layer containing a multi-head self-attention layer MSA and a Conv_1x1 block, the Conv_1x1 block consisting of two convolutional layers with 1 x 1 kernels and one fully-connected layer; the input of each Transformer layer is the output of the previous layer; the output of the L-th Transformer layer is fed into the hash layer for deep hash function learning, and the output hash codes are used to construct the triplet term, the category-level semantic term and the ranking-clustering term in the objective function;
the submodule 2 is used for acquiring the existing sketch image data set and dividing the data set into a training data set, a verification data set and a test data set;
submodule step 3 for assigning N triplet elements in the training dataset
{(x_i^a, x_i^p, x_i^n)}_{i=1}^N and triplet labels {(y_i^a, y_i^p, y_i^n)}_{i=1}^N, where x_i^a, x_i^p and x_i^n respectively denote the anchor sketch, the positive-example image and the negative-example image of the i-th triplet; y_i^a denotes the class label of x_i^a, y_i^p denotes the class label of x_i^p, and y_i^n denotes the class label of x_i^n; N and I respectively denote the number of triplets and the number of samples in the dataset; a, p and n respectively denote the anchor image, the positive-example image and the negative-example image;
the submodule 4 is used for training the sketch image retrieval network with the training set, computing the objective function of the network and updating its initial parameters; training runs for a preset number of rounds or until the loss no longer decreases, yielding the trained sketch image retrieval network.
Compared with the prior art, the invention has the following advantages:
1) designing a triple transform backbone of a sequence identification selection module, and capturing an important identification domain between a sketch and a natural image;
2) and providing an objective function consisting of three groups of items, semantic similar items, sorting clustering items and distinguishing learning items, keeping the semantic similarity of the hash codes in the process of learning the hash codes, capturing the similarity between different modes, and optimizing sorting information so as to cluster similar examples and know distinguishing domain learning. The problems of redundant information and neglected sequencing information are solved, the retrieval precision is higher, and the performance is further improved.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a network structure diagram according to an embodiment of the present invention.
Fig. 3 is a comparison of the inventive method and the DSIH-V method on the extended TU-Berlin dataset. (a) Top-20 results retrieved with 256-bit hash codes using DSIH-V. (b) Top-20 results retrieved with 256-bit hash codes using DSIH. Incorrectly retrieved images are marked with × below the image.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The invention provides a sketch image retrieval method based on sorting clustering sequence identification selection, which makes full use of discriminative regions and ranking information to perform hash code learning: discriminative regions are first selected for the query sketch, and ranking information is simultaneously used to aggregate samples of the same category. Finally, the hash codes are used to retrieve natural images.
Referring to fig. 1, the sketch image retrieval method based on sorting clustering sequence identification selection provided by the invention comprises the steps of firstly constructing a sketch image retrieval network, and then utilizing the sketch image retrieval network to perform sketch image retrieval;
the method for constructing the sketch image retrieval network comprises the following specific steps of:
step 1: constructing a sketch image retrieval network;
referring to fig. 2, the sketch image retrieval network of the present embodiment includes a transform partitioning module, a linear projection module, and a transform encoding module;
the Transformer partition module divides the input image of size H x W into M 2-D patch images x_p, each of size P x P, where M = HW/P^2;
the linear projection module maps each patch image output by the partition module to D dimensions and adds a learnable position embedding to the patch embeddings to preserve position information; the resulting embedding vector is denoted z_0, and the output at position zero is a D-dimensional class token x_class.
The embedding vector is:

z_0 = [x_class; x_p^1 E; x_p^2 E; ...; x_p^M E] + E_pos    (1)

where x_p^1, x_p^2, ..., x_p^M respectively denote the 1st, 2nd, ..., M-th 2-D patch images; E denotes the patch-image embedding projection, and E_pos denotes the position embedding.
To better focus on the most significant regions, z_0 is fed into the Transformer encoding module, which mines the relations between the patch images in the sequence.
The Transformer encoding module of this embodiment comprises L Transformer layers and a hash layer, each Transformer layer containing a multi-head self-attention layer MSA and a Conv_1x1 block, the Conv_1x1 block consisting of two convolutional layers with 1 x 1 kernels and one fully-connected layer; the input of each Transformer layer is the output of the previous layer; the output of the L-th Transformer layer is fed into the hash layer for deep hash function learning, and the output hash codes are used to construct the triplet term, the category-level semantic term and the ranking-clustering term in the objective function.
The Transformer encoding module computes:

z'_l = MSA(LN(z_{l-1})) + z_{l-1}    (2)
z_l = CONV(LN(z'_l)) + z'_l    (3)

where LN(·) denotes layer normalization, z_l denotes the embedded image representation, z'_l denotes the output of the multi-head self-attention layer, and CONV(·) denotes the convolution operation.
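Equations (2) and (3) can be sketched as follows. Since a 1 x 1 convolution over a token sequence acts as a per-token linear map, the Conv_1x1 block is approximated here by two linear maps with a ReLU between them; this is a simplification of the block described above, not the exact embodiment:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, Wq, Wk, Wv, K):
    """Multi-head self-attention with K heads; also returns the per-head
    attention maps (used later for sequence discriminative selection)."""
    N, D = x.shape
    d = D // K
    outs, attns = [], []
    for h in range(K):
        q = x @ Wq[h]; k = x @ Wk[h]; v = x @ Wv[h]  # (N, d) each
        a = softmax(q @ k.T / np.sqrt(d))            # (N, N) attention map
        outs.append(a @ v)
        attns.append(a)
    return np.concatenate(outs, -1), np.stack(attns)  # (N, D), (K, N, N)

def transformer_layer(z_prev, Wq, Wk, Wv, W1, W2, K):
    # eq (2): z'_l = MSA(LN(z_{l-1})) + z_{l-1}
    attn_out, attn = msa(layer_norm(z_prev), Wq, Wk, Wv, K)
    z_mid = attn_out + z_prev
    # eq (3): z_l = CONV(LN(z'_l)) + z'_l, with 1x1 convs as per-token linears
    conv = np.maximum(layer_norm(z_mid) @ W1, 0) @ W2
    return conv + z_mid, attn

rng = np.random.default_rng(1)
N, D, K = 5, 16, 4
Wq, Wk, Wv = (rng.normal(size=(K, D, D // K)) * 0.1 for _ in range(3))
z, attn = transformer_layer(rng.normal(size=(N, D)), Wq, Wk, Wv,
                            rng.normal(size=(D, D)) * 0.1,
                            rng.normal(size=(D, D)) * 0.1, K)
print(z.shape, attn.shape)  # (5, 16) (4, 5, 5)
```

Each layer preserves the (M+1, D) token shape, so L such layers can be stacked as described.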
To fully exploit the attention information, sequence discriminative selection is used to select effective regions to form a new sequence. The input to the L-th Transformer layer is z_{L-1} = [z_{L-1}^0; z_{L-1}^1; ...; z_{L-1}^M], where z_{L-1}^1, ..., z_{L-1}^M respectively denote the M patch outputs of the (L-1)-th layer. The K-head self-attention weights of each layer except the L-th are w_l = [w_l^1, w_l^2, ..., w_l^K], where l ∈ {1, 2, ..., L-1}. For the self-attention of each layer, each patch image has K sets of weights, so the weights of the M patch images in each layer can be expressed as w_l^i, where i ∈ {1, 2, ..., K}. Multiplying the weights of the first L-1 layers gives the final weight:

w_f = ∏_{l=1}^{L-1} w_l    (4)

where w_f denotes the final weight, from which the discriminative regions can be selected.
The indices of the patch images carrying useful information are obtained from the selected regions, and these indices serve as position information to find the corresponding patch-image embeddings. The selected embeddings form a new sequence, which enters the L-th Transformer layer.
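The selection just described amounts to multiplying, layer by layer, the class-token-to-patch attention weights and keeping the indices of the highest-scoring patches. A minimal sketch follows; averaging over the K heads is a simplification made here for brevity (the patent keeps K groups of weights per layer):

```python
import numpy as np

def select_discriminative(attn_per_layer, top_m):
    """attn_per_layer: list of (K, M+1, M+1) self-attention maps for the
    first L-1 layers. Multiply the class-token-to-patch weights across the
    layers and keep the indices of the top_m patches."""
    w_f = None
    for a in attn_per_layer:
        # average heads, take class-token row, drop the class-token column
        w = a.mean(axis=0)[0, 1:]                 # (M,)
        w_f = w if w_f is None else w_f * w       # running product over layers
    return np.argsort(w_f)[::-1][:top_m]          # most discriminative patches

rng = np.random.default_rng(2)
K, M = 4, 6
attn = [rng.uniform(0.01, 1.0, size=(K, M + 1, M + 1)) for _ in range(3)]
idx = select_discriminative(attn, top_m=3)
print(len(idx))  # 3
```

The returned indices would then be used to gather the corresponding patch embeddings into the new, shorter sequence fed to the L-th layer.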
The L-th Transformer layer is followed by a hash layer. For any triplet element x_i^t (t ∈ {a, p, n}), the deep hash function is:

b_i^t = sign(φ(F(o_i^t; Θ_g)))    (5)

where sign(·) denotes the element-wise sign function; φ(·) denotes the tanh function; b_i^t denotes the k-bit hash code of sample x_i^t; o_i^t denotes the output of the L-th Transformer layer for sample x_i^t; F(·; Θ_g) denotes the deep hash function; and Θ_g denotes the weight parameter of the hash layer.
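The hash layer just described can be sketched as a single linear map followed by tanh and sign; the weight matrix Wg below stands in for Θ_g and is a toy value:

```python
import numpy as np

def hash_code(o_L, Wg, k):
    """Linear hash layer on the class-token output of the last Transformer
    layer, squashed by tanh, then binarised by sign."""
    h = np.tanh(o_L @ Wg)   # relaxed, hash-like code in (-1, 1)^k
    b = np.sign(h)          # binary code in {-1, +1}^k
    return h, b

rng = np.random.default_rng(3)
D, k = 16, 8
h, b = hash_code(rng.normal(size=(D,)), rng.normal(size=(D, k)), k)
print(b.shape, set(np.unique(b)) <= {-1.0, 1.0})  # (8,) True
```

Keeping both h (differentiable) and b (binary) mirrors the relaxation used later when optimizing the triplet term.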
Step 2: acquiring an existing sketch image data set, and dividing the data set into a training data set, a verification data set and a test data set; the validation set is used in the experimental process for verifying the model training effect, and only the model performance on the test set is written here.
An implementation example of the invention uses two datasets, the Sketchy dataset and the TU-Berlin dataset; each dataset is divided into a training set, a validation set and a test set at a ratio of 70:10:20.
And step 3: in the training dataset, N triplet elements are given
{(x_i^a, x_i^p, x_i^n)}_{i=1}^N and triplet labels {(y_i^a, y_i^p, y_i^n)}_{i=1}^N, where x_i^a, x_i^p and x_i^n respectively denote the anchor sketch, the positive-example image and the negative-example image of the i-th triplet; y_i^a denotes the class label of x_i^a, y_i^p denotes the class label of x_i^p, and y_i^n denotes the class label of x_i^n; N and I respectively denote the number of triplets and the number of samples in the dataset; a, p and n respectively denote the anchor image, the positive-example image and the negative-example image;
it is an object of the present invention to perform hash code learning to project instances to hash codes while preserving similarity between matching sketches and images. More specifically,
Figure BDA00033253045400000617
ratio of
Figure BDA00033253045400000618
And smaller, where H (·, ·) represents the Hamming distance,
Figure BDA00033253045400000619
and
Figure BDA00033253045400000620
respectively represent
Figure BDA00033253045400000621
And
Figure BDA00033253045400000622
the k-bit hash code of (a),
Figure BDA00033253045400000623
and
Figure BDA00033253045400000624
hash codes respectively representing an anchor image, a positive example image and a negative example image;
Step 4: training the sketch image retrieval network with the training set, computing the objective function of the network and updating its initial parameters; training runs for a preset number of rounds or until the loss no longer decreases, yielding the trained sketch image retrieval network.
In this embodiment, the learning rate is set to 0.0004, the loss function is optimized with the Adam optimizer, and the initial parameters are updated.
The invention proposes a new objective function consisting of a triplet term, a category-level semantic term, a ranking-clustering term and a discriminative learning term; it preserves semantic similarity while learning the hash codes, captures the similarity between different modalities, and optimizes ranking information to cluster similar instances and guide discriminative-region learning.
The present invention performs hash code learning that maps instances to hash codes while preserving the similarity of matching sketches and images. To capture the similarity between the different modalities, the triplet term can be defined as:

J_tri = Σ_{i=1}^N max(0, δ + H(b_i^a, b_i^p) − H(b_i^a, b_i^n))    (6)

where H(·,·) denotes the Hamming distance, δ denotes a boundary parameter, and max(·) denotes the maximum function.
However, the above triplet term is difficult to optimize during training, so the binary codes b_i^a, b_i^p and b_i^n are relaxed to hash-like codes h_i^a, h_i^p and h_i^n, and the 2-norm is used instead of the Hamming distance; the triplet term is redefined as follows:

J_tri = Σ_{i=1}^N max(0, δ + ‖h_i^a − h_i^p‖_2 − ‖h_i^a − h_i^n‖_2)    (7)

where ‖·‖_2 denotes the 2-norm of a vector.
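The relaxed triplet term just described, a hinge on the gap between the anchor-positive and anchor-negative 2-norm distances of the hash-like codes, can be sketched directly:

```python
import numpy as np

def triplet_term(h_a, h_p, h_n, delta):
    """Hinge loss with margin delta over a batch of relaxed hash codes,
    summed over the batch."""
    d_ap = np.linalg.norm(h_a - h_p, axis=1)   # anchor-positive distances
    d_an = np.linalg.norm(h_a - h_n, axis=1)   # anchor-negative distances
    return np.maximum(0.0, delta + d_ap - d_an).sum()

rng = np.random.default_rng(4)
anchor = rng.normal(size=(5, 8))
loss_easy = triplet_term(anchor, anchor, anchor + 10.0, delta=0.5)  # negatives far away
loss_hard = triplet_term(anchor, anchor + 10.0, anchor, delta=0.5)  # negatives identical
print(loss_easy, loss_hard > 0)  # 0.0 True
```

When negatives are already farther than the margin the term vanishes, which is why only violating triplets drive the gradients.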
Category-level semantic information helps strengthen the latent correlation between similar hash codes, so label information is used to provide category-level semantics for learning the hash function. The category-level semantic term is defined as:

J_sem = Σ_{i=1}^N [ L_ce(h_i^a, y_i^a) + L_ce(h_i^p, y_i^p) + L_ce(h_i^n, y_i^n) ]    (8)

where L_ce(·,·) denotes the cross-entropy function, and y_i^a, y_i^p and y_i^n respectively denote the label information of x_i^a, x_i^p and x_i^n.
Average Precision (AP) is a retrieval metric that judges whether relevant instances are at the top of the ranking list; the higher the AP value, the higher the degree of aggregation of the relevant instances. The AP of a query instance h_v can be approximated as:

AP(h_v) ≈ (1/|R_u|) Σ_{h_u ∈ R_u} [ (1 + Σ_{h_t ∈ R_u} σ(−d_ut/η)) / (1 + Σ_{h_t ∈ R_t} σ(−d_ut/η)) ]    (9)

where R_u denotes the positively related set, |R_u| denotes the number of positively related instances, R_t denotes the score set of all instances, η denotes a boundary parameter, σ(·) denotes the sigmoid function, and d_ut = [cos(h_v, h_u) − cos(h_v, h_t)], where cos(·,·) denotes cosine similarity, h_u ∈ R_u, h_t ∈ R_t, and h_v denotes the query instance.
To cluster similar natural images, the ranking-clustering term of the natural images can be expressed as:

J_rank^img = (1/V) Σ_{a=1}^V (1 − AP(h_a))    (10)

where AP(h_a) denotes the AP value of natural-image query instance h_a and V denotes the size of a batch of data. Likewise, the ranking-clustering term of the sketches can be expressed as:

J_rank^skt = (1/V) Σ_{s=1}^V (1 − AP(h_s))    (11)

where AP(h_s) denotes the AP value of a sketch query instance.
Thus, the final ranking-clustering term can be composed of equations 10 and 11:

J_rank = J_rank^img + J_rank^skt    (12)

where J_rank denotes the ranking-clustering term, which optimizes the ranking information to cluster similar instances.
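The differentiable AP approximation and the ranking-clustering term above can be sketched in the style of a smooth-AP surrogate, with ranks replaced by sums of sigmoids over the cosine-similarity differences d_ut. The exact functional form of the patent's equation is not reproduced in its text, so the specific sigmoid construction below is an assumption:

```python
import numpy as np

def sigmoid(x, eta):
    return 1.0 / (1.0 + np.exp(-x / eta))

def smooth_ap(h_q, pos, allc, eta):
    """Differentiable AP surrogate for query h_q: the rank of each positive
    is approximated by 1 + a sum of sigmoids of similarity differences."""
    cos = lambda a, b: (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    s_pos = np.array([cos(h_q, h) for h in pos])    # similarities to positives
    s_all = np.array([cos(h_q, h) for h in allc])   # similarities to all candidates
    ap = 0.0
    for su in s_pos:
        # subtracting 0.5 removes the self-comparison term sigmoid(0) = 0.5
        rank_pos = 1.0 + sigmoid(s_pos - su, eta).sum() - 0.5
        rank_all = 1.0 + sigmoid(s_all - su, eta).sum() - 0.5
        ap += rank_pos / rank_all
    return ap / len(s_pos)

def ranking_cluster_term(queries, pos_sets, all_sets, eta):
    """One minus the mean approximate AP over a batch of V queries."""
    aps = [smooth_ap(q, p, a, eta) for q, p, a in zip(queries, pos_sets, all_sets)]
    return 1.0 - np.mean(aps)

rng = np.random.default_rng(5)
q = np.ones(8)
pos = [q + rng.normal(scale=0.01, size=8) for _ in range(3)]
neg = [-q + rng.normal(scale=0.01, size=8) for _ in range(3)]
term = ranking_cluster_term([q], [pos], [pos + neg], eta=0.01)
print(0.0 <= term <= 1.0)  # True
```

When all positives rank above all negatives, the surrogate AP approaches 1 and the term approaches 0, which is the clustering behaviour the patent attributes to this loss.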
In order to improve discriminative-region learning, the similarity of the class tokens corresponding to different labels is minimized, and the similarity of the class tokens of samples with the same label is maximized. The discriminative learning term for a batch of sketch data can be expressed as:

J_dis^skt = Σ_{v_1=1}^V Σ_{v_2=1}^V [ 1(y_{v_1}^a = y_{v_2}^a)(1 − cos(s_{v_1}, s_{v_2})) + 1(y_{v_1}^a ≠ y_{v_2}^a) max(0, cos(s_{v_1}, s_{v_2}) − μ) ]    (13)

where cos(·,·) denotes cosine similarity, μ denotes a boundary parameter, 1(·) denotes the indicator function, and s_{v_1} and s_{v_2} respectively denote the class tokens of the v_1-th and v_2-th sketches at the L-th layer.
The discriminative learning term of the natural images can be expressed as:

J_dis^img = Σ_{v_1=1}^V Σ_{v_2=1}^V [ 1(y_{v_1} = y_{v_2})(1 − cos(c_{v_1}, c_{v_2})) + 1(y_{v_1} ≠ y_{v_2}) max(0, cos(c_{v_1}, c_{v_2}) − μ) ]    (14)

where c_{v_1} and c_{v_2} respectively denote the class tokens of the v_1-th and v_2-th images at the L-th layer, the image class tokens and labels being taken from both the anchor-point and positive-example images of the batch.
Thus, in conjunction with equations 13 and 14, the discriminative learning term can be defined as:

J_dis = J_dis^skt + J_dis^img    (15)

where J_dis denotes the discriminative learning term, which enables learning in the discriminative regions.
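A standard realization of the stated goal — pull same-label class tokens together, push different-label tokens below the margin μ — can be sketched as follows; the exact pairing the patent uses is not recoverable from its text, so this particular form is an assumption:

```python
import numpy as np

def discriminative_term(tokens, labels, mu):
    """Pairwise cosine loss over class tokens: same-label pairs are driven
    toward cosine similarity 1, different-label pairs below the margin mu."""
    cos = lambda a, b: (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    loss = 0.0
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, n):
            s = cos(tokens[i], tokens[j])
            if labels[i] == labels[j]:
                loss += 1.0 - s              # same class: drive cos -> 1
            else:
                loss += max(0.0, s - mu)     # different class: cap at margin
    return loss

toks = [np.array([1.0, 0.0]), np.array([1.0, 0.1]), np.array([-1.0, 0.0])]
labs = [0, 0, 1]
print(discriminative_term(toks, labs, mu=0.5) < 0.1)  # True
```

Here the two same-label tokens are nearly parallel and the different-label token points the opposite way, so the term is close to zero, illustrating the intended geometry.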
Considering the above four parts (the triplet term J_tri, the category-level semantic term J_sem, the ranking-clustering term J_rank and the discriminative learning term J_dis), the overall objective function can be defined as:

J = J_tri + α·J_sem + β·J_rank + γ·J_dis    (16)

where α, β and γ denote weight parameters and J denotes the overall objective function; the whole network is trained by combining the four loss terms proposed by the invention.
The network was trained on a GeForce GTX Titan X GPU, an Intel Core i7-5930K 3.50 GHz CPU and 64 GB RAM. Input instances are resized to 288 x 288; the loss function is optimized with the Adam optimizer at a learning rate of 0.0004, and the batch size is set to 64. To generate hash codes of 32, 64, 128, 256 and 512 bits, the hash code length k is set from 32 to 512. The initial weights of both the sketch branch and the image branch use weights pre-trained on the ImageNet dataset. For the triplet term, the boundary parameter δ is set to 0.5; the boundary parameter η in the ranking-clustering term is set to 0.01; and the boundary parameter μ in the discriminative learning term is set to 0.5. The hyper-parameters α, β and γ are set to 0.8, 0.1 and 1, respectively. The network is trained for 500 rounds or until the loss no longer decreases.
In this embodiment, the trained sketch image retrieval network is used to compute the top-n precision of the ranking list on the test dataset, yielding the mean average precision mAP and the precision of the top 200 results (precision@200); the higher these metric values, the better the performance of the method.
Referring to fig. 3, to verify the effectiveness of the different factors in the method of the invention, an ablation experiment is first performed: first, the method without the triplet term for learning the hash function (DSIH-T); second, the method without the Transformer for sketch image retrieval learning (DSIH-V); third, the method without the ranking-clustering term for hash code learning (DSIH-R); finally, the full method of the invention (DSIH). The retrieval performance of the method is then compared with advanced methods such as DBSH, GDH, DVML, DSH, TVAE and StyleMeUp.
TABLE 1
(Table 1 is reproduced as an image in the original publication.)
Table 1 shows the mAP values of the present invention and of DSIH-T, DSIH-V and DSIH-R for different embedding dimensions on the extended Sketchy dataset. The comparison shows that the proposed method achieves the highest average precision for the top-200 retrieval results across the different hash bit lengths on the extended Sketchy dataset.
TABLE 2
(Table 2 is reproduced as an image in the original publication.)
Table 2 shows the mAP values of the present invention and other methods for different embedding dimensions on the extended TU-Berlin dataset. The comparison shows that the proposed method achieves the highest average precision for the top-200 retrieval results across the different hash bit lengths on the extended TU-Berlin dataset.
TABLE 3
(Table 3 is reproduced as an image in the original publication.)
Table 3 compares the present invention with other existing methods, showing that the method of the invention achieves higher retrieval precision.
In a specific implementation, the above process can be run automatically by means of computer software.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A sketch image retrieval method based on sorting clustering sequence identification selection, characterized in that: a sketch image retrieval network is first constructed, and then used to retrieve sketch images;
the construction of the sketch image retrieval network specifically comprises the following steps:
step 1: constructing a sketch image retrieval network;
the sketch image retrieval network comprises a Transformer partitioning module, a linear projection module and a Transformer coding module;
the Transformer partitioning module is used for dividing the input image into M 2D patch images x_p; the input image is of size H × W, each patch image is of size P × P, and the number of patches is M = HW/P^2;
the linear projection module is used for mapping each patch image output by the partitioning module to D dimensions, and a learnable position embedding is added to the patch image embeddings to preserve position information; the resulting embedding vector is denoted z_0, and the output at position zero is a D-dimensional class token x_class;
the Transformer coding module takes z_0 as input and mines the relations between the patch images in the sequence; the Transformer coding module comprises L Transformer layers and a hash layer, wherein each Transformer layer comprises a multi-head self-attention layer MSA and a Conv1×1 block, the Conv1×1 block consisting of two convolutional layers with 1 × 1 convolution kernels and one fully-connected layer; for each Transformer layer, its input is the output of the previous layer; the output of the L-th Transformer layer is fed into the hash layer for deep hash function learning, and the output hash codes are used to construct the triplet term, the category-level semantic term and the ranking-clustering term in the objective function;
step 2: acquiring an existing sketch image data set, and dividing the data set into a training data set, a verification data set and a test data set;
and step 3: in the training dataset, N triplet elements {(x_i^a, x_i^p, x_i^n)}_{i=1}^N and triplet labels {(y_i^a, y_i^p, y_i^n)}_{i=1}^N are given, wherein x_i^a, x_i^p and x_i^n sequentially represent the anchor sketch, the positive example image and the negative example image of the i-th triplet; y_i^a represents the class label of x_i^a, y_i^p represents the class label of x_i^p, and y_i^n represents the class label of x_i^n; N and I respectively represent the number of triplet elements and the number of samples in the data set; a, p and n respectively denote the anchor image, the positive example image and the negative example image;
and step 4: training the sketch image retrieval network with the training set, computing the objective function of the sketch image retrieval network and updating the initial parameters of the network; training proceeds for a preset number of epochs or until the loss no longer decreases, yielding the trained sketch image retrieval network.
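As an illustrative sketch of the partitioning step in step 1, the division of an H × W image into M = HW/P^2 patches can be written in a few lines of numpy. The function and variable names here are illustrative stand-ins, not identifiers from the patent.

```python
import numpy as np

def partition_into_patches(img, P):
    """Split an H x W x C image into M = (H*W)/(P*P) flattened patches,
    mirroring the Transformer partitioning module of step 1."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    M = (H * W) // (P * P)
    # Rearrange into (H/P, P, W/P, P, C), then flatten each P x P x C patch.
    patches = (img.reshape(H // P, P, W // P, P, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(M, P * P * C))
    return patches

x_p = partition_into_patches(np.zeros((224, 224, 3)), P=16)
print(x_p.shape)  # (196, 768): M = 224*224/16^2 = 196 patches of dim 16*16*3
```

For a 224 × 224 RGB image with P = 16 this yields the familiar 196 patch tokens used by ViT-style models.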
2. The sketch image retrieval method based on sorting clustering sequence identification selection as claimed in claim 1, wherein: in step 1, the learnable position embedding is added to the patch image embeddings, and the embedding vector is:
z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^M E] + E_pos    (1)
wherein x_p^1, x_p^2, …, x_p^M respectively represent the 1st, 2nd, …, M-th patch images; E denotes the patch image embedding projection, and E_pos denotes the position embedding.
3. The sketch image retrieval method based on sorting clustering sequence identification selection as claimed in claim 1, wherein: in step 1, the Transformer coding module is:
z′_l = MSA(LN(z_{l-1})) + z_{l-1}    (2)
z_l = CONV(LN(z′_l)) + z′_l    (3)
wherein LN(·) represents layer normalization; z_l represents the embedded image representation; z′_l represents the output of the multi-head self-attention layer; and CONV(·) represents the convolution operation.
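A minimal numpy sketch of equations (2)-(3), using single-head attention for brevity (the patent uses K heads) and implementing the Conv1×1 block as a shared per-token linear map, which is what a 1 × 1 convolution over a token sequence amounts to. The parameter shapes and scaling are illustrative assumptions.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    mu = z.mean(-1, keepdims=True)
    sd = z.std(-1, keepdims=True)
    return (z - mu) / (sd + eps)

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_layer(z, Wq, Wk, Wv, W1, W2):
    """One pre-norm layer: z'_l = MSA(LN(z_{l-1})) + z_{l-1},
    then z_l = CONV(LN(z'_l)) + z'_l."""
    x = layer_norm(z)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # self-attention weights
    z_prime = attn @ v + z                          # residual after MSA
    h = layer_norm(z_prime)
    z_next = np.maximum(h @ W1, 0) @ W2 + z_prime   # Conv1x1 block + residual
    return z_next, attn

D = 64
rng = np.random.default_rng(0)
z = rng.normal(size=(197, D))
Ws = [rng.normal(size=(D, D)) * 0.05 for _ in range(5)]
z_out, attn = transformer_layer(z, *Ws)
print(z_out.shape, attn.shape)  # (197, 64) (197, 197)
```

Each row of `attn` sums to one, which is what makes the layer-wise weight product of claim 4 meaningful.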
4. The sketch image retrieval method based on sorting clustering sequence identification selection as claimed in claim 1, wherein: in step 1, the input of the L-th layer of the Transformer coding module is
z_{L-1} = [z_{L-1}^1; z_{L-1}^2; …; z_{L-1}^M]
wherein z_{L-1}^1, …, z_{L-1}^M respectively represent the M outputs of the (L-1)-th layer; the K-head self-attention weights of each layer except the L-th layer are
w_l = [w_l^1, w_l^2, …, w_l^K]
wherein l ∈ {1, 2, …, L-1}; for the self-attention of each layer, each patch image has K groups of weights, so the weights of the M patch images in each layer are expressed as
w_l^i = [w_l^{i,1}, w_l^{i,2}, …, w_l^{i,M}]
wherein i ∈ {1, 2, …, K}; the weights of the first L-1 layers are multiplied to obtain the final weight:
w_f = ∏_{l=1}^{L-1} w_l    (4)
wherein w_f represents the final weight used to select the discriminative regions;
the indices of the patch images carrying useful information are obtained from the selected regions; these indices are used as position information to locate the corresponding patch image embeddings, and the selected embeddings form a new sequence that enters the L-th Transformer layer.
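The selection step of claim 4 can be sketched as follows: multiply the attention matrices of the first L-1 layers and rank patches by the class-token row of the product. Single-head matrices are used here for brevity (head averaging omitted), and `top_m` is an illustrative parameter, not one from the patent.

```python
import numpy as np

def select_discriminative_patches(attn_weights, top_m):
    """Accumulate w_f = prod_l w_l over the first L-1 layers and use the
    class-token row of w_f to rank the M patches; return the indices of
    the top_m most informative patches."""
    w_f = attn_weights[0]
    for w in attn_weights[1:]:
        w_f = w_f @ w                 # attention flow across layers
    class_row = w_f[0, 1:]            # class-token attention to the M patches
    idx = np.argsort(class_row)[::-1][:top_m]
    return idx                        # indices into the patch sequence

rng = np.random.default_rng(0)
def rand_attn(n): a = rng.random((n, n)); return a / a.sum(-1, keepdims=True)
layers = [rand_attn(197) for _ in range(11)]  # stand-ins for L-1 = 11 layers
idx = select_discriminative_patches(layers, top_m=12)
print(idx.shape)  # (12,)
```

The selected indices would then be used to gather the corresponding patch embeddings into the new sequence fed to the L-th layer.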
5. The sketch image retrieval method based on sorting clustering sequence identification selection as claimed in claim 1, wherein: in step 1, the hash layer computes, for any given triplet element (x_i^a, x_i^p, x_i^n), the deep hash function:
b_i^* = sign(φ(g(z_L^*; θ_g))),  * ∈ {a, p, n}    (5)
wherein sign(·) represents the element-wise sign function; φ(·) represents the tanh function; b_i^* represents the K-bit hash code of the sample x_i^*; z_L^* represents the output of the L-th Transformer layer for the sample x_i^*; g(·) represents the deep hash function; and θ_g represents the weight parameters of the hash layer.
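A minimal sketch of the hash layer above, assuming the deep hash function is a single linear map with weights θ_g (the actual layer may be deeper): tanh relaxes the output to a hash-like code, and sign binarizes it.

```python
import numpy as np

def deep_hash(z_L, theta_g):
    """b = sign(tanh(z_L @ theta_g)): map the L-th layer output to a
    K-bit binary code. theta_g is a random stand-in for learned weights."""
    hash_like = np.tanh(z_L @ theta_g)   # phi(.): relaxed, continuous code
    return np.sign(hash_like)            # sign(.): final binary hash code

rng = np.random.default_rng(0)
b = deep_hash(rng.normal(size=(1, 64)), rng.normal(size=(64, 32)))
print(b.shape)  # (1, 32): a 32-bit hash code
```

During training the relaxed `hash_like` codes would feed the triplet, semantic, and ranking-clustering terms, with `sign` applied only at retrieval time.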
6. The sketch image retrieval method based on sorting clustering sequence identification selection as claimed in claim 1, wherein: in step 2, n data sets are used, and each data set is divided into a training set, a verification set and a test set in the ratio 70:10:20, wherein n is a preset value.
7. The sketch image retrieval method based on sorting clustering sequence identification selection as claimed in claim 1, wherein: in step 4, the objective function consists of a triplet term, a category-level semantic similarity term, a ranking-clustering term and a discriminative learning term; during hash code learning, the semantic similarity of the hash codes is preserved, the similarity across different modalities is captured, instances with similar ranking information are clustered, and discriminative region learning is performed;
the triplet term is defined as follows:
L_tri = Σ_{i=1}^{N} max(0, δ + ‖ĥ_i^a − ĥ_i^p‖_2^2 − ‖ĥ_i^a − ĥ_i^n‖_2^2)    (6)
wherein ‖·‖_2 represents the ℓ2 vector norm, δ represents a boundary parameter, and max(·) represents the maximum function; ĥ_i^a, ĥ_i^p and ĥ_i^n are hash-like codes relaxed from the binary codes b_i^a, b_i^p and b_i^n, and respectively represent the hash codes of the anchor image, the positive example image and the negative example image;
the category-level semantic similarity term is defined as follows:
L_sem = Σ_{*∈{a,p,n}} ℓ_ce(ŷ_i^*, y_i^*)    (7)
wherein ℓ_ce(·,·) represents the cross-entropy function, and ŷ_i^a, ŷ_i^p and ŷ_i^n respectively represent the predicted label information of x_i^a, x_i^p and x_i^n;
the ranking-clustering term is defined as follows:
L_rc = L_rc^I + L_rc^S
wherein L_rc^I represents the ranking-clustering term of the natural images and L_rc^S represents the ranking-clustering term of the sketch images, both constructed from the average precision values of the queries in a batch; AP(h_a) represents the AP value of the natural image query instance h_a, V represents the size of a batch of data, and AP(h_v) represents the AP value of the sketch query instance; R_u represents the positively correlated set, |R_u| denotes the number of positively correlated instances, and R_t represents the score set of all instances; η represents a boundary parameter, and d_ut = [cos(h_v, h_u) − cos(h_v, h_t)], wherein cos(·,·) denotes cosine similarity, h_u ∈ R_u, h_t ∈ R_t, and h_v represents a query instance;
the discriminative learning term is defined as follows:
L_dis = L_dis^S + L_dis^I
wherein L_dis^S is the discriminative learning term of the sketch data and L_dis^I is the discriminative learning term of the natural image data, both defined as margin-based terms on cosine similarity; cos(·,·) represents cosine similarity and μ represents a boundary parameter; z_L^{S,v1} and z_L^{S,v2} respectively denote the classification tokens of the v1-th and v2-th sketches at the L-th layer; z_L^{I,v1} and z_L^{I,v2} respectively denote the classification tokens of the v1-th and v2-th images at the L-th layer; z_L^{a,v1}, z_L^{a,v2}, z_L^{p,v1} and z_L^{p,v2} respectively represent the anchor token of the v1-th image at the L-th layer, the anchor token of the v2-th image, the positive example token of the v1-th image and the positive example token of the v2-th image at the L-th layer;
the objective function is defined as:
L = L_tri + α·L_sem + β·L_rc + γ·L_dis    (8)
wherein α, β and γ represent weight parameters.
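The triplet term and the weighted combination of the four objective terms can be illustrated with a small numpy sketch. δ corresponds to the boundary parameter and α, β, γ to the weight parameters above; the specific assignment of weights to terms is an assumption consistent with the claim.

```python
import numpy as np

def triplet_term(h_a, h_p, h_n, delta=1.0):
    """Triplet hinge on hash-like codes:
    sum over the batch of max(0, delta + ||h_a - h_p||^2 - ||h_a - h_n||^2)."""
    d_pos = ((h_a - h_p) ** 2).sum(-1)   # anchor-positive squared distance
    d_neg = ((h_a - h_n) ** 2).sum(-1)   # anchor-negative squared distance
    return float(np.maximum(0.0, delta + d_pos - d_neg).sum())

def objective(L_tri, L_sem, L_rc, L_dis, alpha, beta, gamma):
    """Weighted combination of the four objective terms."""
    return L_tri + alpha * L_sem + beta * L_rc + gamma * L_dis

h_a = np.array([[1.0, -1.0, 1.0]])
h_p = np.array([[1.0, -1.0, 1.0]])   # identical to the anchor code
h_n = np.array([[-1.0, 1.0, -1.0]])  # opposite code, far from the anchor
print(triplet_term(h_a, h_p, h_n, delta=1.0))  # 0.0: margin already satisfied
```

Swapping the positive and negative codes makes the hinge active, since the "positive" is then twelve squared units farther than the "negative".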
8. The sketch image retrieval method based on sorting clustering sequence identification selection according to any one of claims 1-7, wherein: the top-n precisions of the ranking list are calculated on the test data set by using the trained sketch image retrieval network to obtain the mean average precision mAP and the top-n precision; the higher the precision values, the better the performance of the method.
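The evaluation metrics of claim 8 can be sketched as follows; `relevant` is an illustrative 0/1 relevance array in ranked order, not an identifier from the patent.

```python
import numpy as np

def precision_at_n(relevant, n):
    """precision@n: fraction of the top-n ranked results that are relevant."""
    return float(np.sum(relevant[:n])) / n

def average_precision(relevant):
    """AP for one query: mean of precision@k over the ranks k of the
    relevant hits. mAP is the mean of AP over all queries."""
    hits = np.flatnonzero(relevant) + 1           # 1-based ranks of relevant items
    if hits.size == 0:
        return 0.0
    precisions = np.arange(1, hits.size + 1) / hits
    return float(precisions.mean())

ranked = np.array([1, 0, 1, 1, 0])
print(precision_at_n(ranked, 3))   # 2 of the top 3 are relevant -> 0.666...
print(average_precision(ranked))   # mean of 1/1, 2/3, 3/4 = 0.8055...
```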
9. A sketch image retrieval system based on sorting clustering sequence identification selection, characterized by comprising the following modules:
the module 1 is used for constructing a sketch image retrieval network;
the module 2 is used for retrieving sketch images by using the sketch image retrieval network;
the module 1 specifically comprises the following sub-modules:
the submodule 1 is used for constructing a sketch image retrieval network;
the sketch image retrieval network comprises a Transformer partitioning module, a linear projection module and a Transformer coding module;
the Transformer partitioning module is used for dividing the input image into M 2D patch images x_p; the input image is of size H × W, each patch image is of size P × P, and the number of patches is M = HW/P^2;
the linear projection module is used for mapping each patch image output by the partitioning module to D dimensions, and a learnable position embedding is added to the patch image embeddings to preserve position information; the resulting embedding vector is denoted z_0, and the output at position zero is a D-dimensional class token x_class;
the Transformer coding module takes z_0 as input and mines the relations between the patch images in the sequence; the Transformer coding module comprises L Transformer layers and a hash layer, wherein each Transformer layer comprises a multi-head self-attention layer MSA and a Conv1×1 block, the Conv1×1 block consisting of two convolutional layers with 1 × 1 convolution kernels and one fully-connected layer; for each Transformer layer, its input is the output of the previous layer; the output of the L-th Transformer layer is fed into the hash layer for deep hash function learning, and the output hash codes are used to construct the triplet term, the category-level semantic term and the ranking-clustering term in the objective function;
the submodule 2 is used for acquiring the existing sketch image data set and dividing the data set into a training data set, a verification data set and a test data set;
the submodule 3 is used for giving, in the training dataset, N triplet elements {(x_i^a, x_i^p, x_i^n)}_{i=1}^N and triplet labels {(y_i^a, y_i^p, y_i^n)}_{i=1}^N, wherein x_i^a, x_i^p and x_i^n sequentially represent the anchor sketch, the positive example image and the negative example image of the i-th triplet; y_i^a represents the class label of x_i^a, y_i^p represents the class label of x_i^p, and y_i^n represents the class label of x_i^n; N and I respectively represent the number of triplet elements and the number of samples in the data set; a, p and n respectively denote the anchor image, the positive example image and the negative example image;
the submodule 4 is used for training the sketch image retrieval network with the training set, computing the objective function of the sketch image retrieval network and updating the initial parameters of the network; training proceeds for a preset number of epochs or until the loss no longer decreases, yielding the trained sketch image retrieval network.
CN202111259946.6A 2021-10-28 2021-10-28 Sketch image retrieval method and system based on sorting clustering sequence identification selection Pending CN114020948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111259946.6A CN114020948A (en) 2021-10-28 2021-10-28 Sketch image retrieval method and system based on sorting clustering sequence identification selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111259946.6A CN114020948A (en) 2021-10-28 2021-10-28 Sketch image retrieval method and system based on sorting clustering sequence identification selection

Publications (1)

Publication Number Publication Date
CN114020948A true CN114020948A (en) 2022-02-08

Family

ID=80058252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111259946.6A Pending CN114020948A (en) 2021-10-28 2021-10-28 Sketch image retrieval method and system based on sorting clustering sequence identification selection

Country Status (1)

Country Link
CN (1) CN114020948A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596456A (en) * 2022-05-10 2022-06-07 四川大学 Image set classification method based on aggregated hash learning


Similar Documents

Publication Publication Date Title
CN111198959B (en) Two-stage image retrieval method based on convolutional neural network
CN107885764B (en) Rapid Hash vehicle retrieval method based on multitask deep learning
CN109241317B (en) Pedestrian Hash retrieval method based on measurement loss in deep learning network
Cakir et al. Adaptive hashing for fast similarity search
CN110942091B (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
KR102305568B1 (en) Finding k extreme values in constant processing time
CN108280187B (en) Hierarchical image retrieval method based on depth features of convolutional neural network
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN109271486B (en) Similarity-preserving cross-modal Hash retrieval method
Liu et al. Towards optimal binary code learning via ordinal embedding
CN110688474B (en) Embedded representation obtaining and citation recommending method based on deep learning and link prediction
CN102968419B (en) Disambiguation method for interactive Internet entity name
CN104112005B (en) Distributed mass fingerprint identification method
CN105808709A (en) Quick retrieval method and device of face recognition
CN114241273A (en) Multi-modal image processing method and system based on Transformer network and hypersphere space learning
CN113377981B (en) Large-scale logistics commodity image retrieval method based on multitask deep hash learning
CN114357120A (en) Non-supervision type retrieval method, system and medium based on FAQ
CN111325264A (en) Multi-label data classification method based on entropy
CN113836341A (en) Remote sensing image retrieval method based on unsupervised converter balance hash
CN115457332A (en) Image multi-label classification method based on graph convolution neural network and class activation mapping
CN114579794A (en) Multi-scale fusion landmark image retrieval method and system based on feature consistency suggestion
CN114020948A (en) Sketch image retrieval method and system based on sorting clustering sequence identification selection
CN113095229B (en) Self-adaptive pedestrian re-identification system and method for unsupervised domain
CN105117735A (en) Image detection method in big data environment
CN116108217B (en) Fee evasion vehicle similar picture retrieval method based on depth hash coding and multitask prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination