CN115098707A

CN115098707A - Cross-modal Hash retrieval method and system based on zero sample learning

Info

Publication number: CN115098707A
Application number: CN202210726686.7A
Authority: CN
Inventors: 余国先; 白振华; 王峻; 闫中敏; 鹿旭东
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-06-24
Filing date: 2022-06-24
Publication date: 2022-09-23

Abstract

The invention provides a cross-modal Hash retrieval method and a cross-modal Hash retrieval system based on zero sample learning, which are used for acquiring text data and picture data uploaded by a user; extracting depth features of the data from the acquired text data and the acquired picture data; quantizing the extracted depth features into hash codes, comparing the hash codes with data in a database to obtain Hamming distance sequencing, and selecting data of a quantity specified by a user as a retrieval result; the invention can realize effective and accurate retrieval of new and old category samples and overcome the difficulty of closure of the existing hash retrieval system.

Description

Cross-modal Hash retrieval method and system based on zero sample learning

Technical Field

The invention relates to the technical field of data retrieval, in particular to a cross-modal hash retrieval method and a cross-modal hash retrieval system based on zero sample learning.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

With the development of information technology and the explosive growth of multimedia data, people can easily acquire massive multimodal data such as text, pictures, videos, and the like. In an open environment facing the internet, how to quickly and accurately retrieve information from massive multi-modal data according to the requirements of users is very important. The hash learning method has been widely applied to the retrieval of large-scale data due to the advantages of low storage and efficient query, and cross-modal hash learning attracts more and more attention in the multi-modal data information retrieval.

The main objective of cross-modal hash learning is to map high-dimensional multi-modal data into a uniform hamming space in the form of a low-dimensional binary hash code, and meanwhile, the learned hash code is required to keep the spatial structure similarity of the original data, that is, the data similar to the original space should be similar after being converted into the binary hash code.

The current cross-modal hash method can be divided into two categories according to whether label data is used: supervised transmembrane state hashing and unsupervised transmembrane state hashing. The unsupervised cross-modal Hash method does not use the labeled information of the sample for training, and generally excavates the information such as the similarity and distribution of data in the modality and between the modalities to guide the generation of the Hash code; supervised cross-modal hashing methods generally take advantage of the inherent properties (e.g., labels) of exemplars and structured information to guide hash code generation and maintain cross-modal similarity for hash codes, and thus generally perform better than unsupervised methods.

The inventor finds that the cross-modal hashing method has made a remarkable progress, but also faces many problems, such as a large amount of marking information is needed, the efficiency of identifying new categories is low, and the matching of modal data is missing, so that the efficiency and the accuracy of data retrieval are low.

Disclosure of Invention

In order to solve the defects of the prior art, the invention provides a cross-modal Hash retrieval method and a cross-modal Hash retrieval system based on zero sample learning, wherein the relation among marked samples, part of marked samples and unmarked samples is fully mined based on a label completion strategy, so that the retrieval performance under the condition of label missing can be improved; the image and text feature extraction network based on the composite similarity and the deep learning can capture the depth features of the data, so that the cross-modal similarity between the data can be discovered; in addition, the category space embedding based on the category-level attribute vector can capture the association between the visible category and the invisible category, and can realize more efficient identification on new categories.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention provides a cross-modal Hash retrieval method based on zero sample learning.

A cross-modal hash retrieval method based on zero sample learning comprises the following processes:

acquiring text data and picture data uploaded by a user;

extracting depth features of the data from the acquired text data and the acquired picture data;

and quantizing the extracted depth features into hash codes, comparing the hash codes with data in a database to obtain Hamming distance sequencing, and selecting data of a quantity specified by a user as a retrieval result.

The invention provides a cross-modal hash retrieval system based on zero sample learning.

A cross-modal hash retrieval system based on zero sample learning, comprising:

a data acquisition module configured to: acquiring text data and picture data uploaded by a user;

a data feature extraction module configured to: extracting depth features of the data from the acquired text data and the acquired picture data;

a transmembrane state retrieval module configured to: and quantizing the extracted depth features into hash codes, comparing the hash codes with data in a database to obtain Hamming distance sequencing, and selecting data of a quantity specified by a user as a retrieval result.

A third aspect of the present invention provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the steps in the zero sample learning based cross-modal hash retrieval method according to the first aspect of the present invention.

A fourth aspect of the present invention provides an electronic device, which includes a memory, a processor, and a program stored in the memory and executable on the processor, where the processor implements the steps in the zero-sample-learning-based cross-modal hash search method according to the first aspect of the present invention when executing the program.

Compared with the prior art, the invention has the beneficial effects that:

1. according to the zero sample learning-based cross-modal Hash retrieval method and system, the depth feature expressions of the picture data and the text data are respectively extracted through the two deep neural networks, the depth features are embedded into a class attribute space based on the class attribute vector, and finally the learned data features are quantized into Hash codes, so that the accuracy of data retrieval is improved. 2. According to the zero sample learning-based cross-modal Hash retrieval method and system, the missing labels are supplemented through a label supplement strategy, and the supplemented label matrix is used for training, so that the retrieval performance under the environment with the missing labels is improved.

3. According to the zero sample learning-based cross-modal Hash retrieval method and system, the characteristics are extracted through a composite similarity guide deep neural network, and the learned characteristics are kept in cross-modal similarity.

4. According to the zero sample learning-based cross-modal hash retrieval method and system, the depth features are embedded into the category space based on the category-level attribute vectors, so that the relation between the visible category and the invisible category can be discovered, and the identification efficiency of a new category is improved.

5. The zero sample learning-based cross-modal Hash retrieval method and system provided by the invention have the advantages that the three steps of feature extraction, class space learning and Hash code learning are combined together for optimization, and compared with the conventional cross-modal method, the problem of incompatibility among multiple steps is solved.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the invention and not to limit the invention.

Fig. 1 is a schematic flowchart of a cross-modal hash retrieval method based on zero sample learning according to embodiment 1 of the present invention.

Fig. 2 is a connection diagram of a cross-modal hash retrieval system based on zero sample learning according to embodiment 2 of the present invention.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example 1:

as shown in fig. 1, an embodiment 1 of the present invention provides a cross-modal hash retrieval method based on zero sample learning, including the following processes:

acquiring text data and picture data uploaded by a user;

and extracting the depth features of the data from the acquired data uploaded by the user.

And quantizing the depth features of the extracted data into hash codes, comparing the hash codes with the data in the database to obtain Hamming distance sequencing, and selecting data of a quantity specified by a user as a retrieval result.

Specifically, the method comprises the following steps:

in this embodiment, the text data refers to: the method comprises the steps that text content input by a user in a search box is divided by using blank spaces and input to a trained text feature extraction network, and finally the text content is expressed as a 500-dimensional feature vector;

in this embodiment, the picture data refers to: the user uploads the picture files through the webpage, and the sizes of the picture files are unified into 224 × 3 after the picture files are uploaded.

The url data can be acquired, which refers to the url address of the picture input by the user, the picture uploaded by the user can be acquired through the url address, and the sizes of the uploaded image data are unified into 224 × 3.

Further, the trained text feature extraction network comprises the following training steps:

constructing a word embedding network with 500-dimensional output dimension; wherein, the word embedding network is realized by a word2vec network;

constructing a first training set; the first training set is a wiki corpus;

and inputting the first training set into a word embedding network, calculating a negative log-likelihood function through an output result of the training set, and optimizing parameters of the word embedding network through the function.

In this embodiment, extracting the depth feature of the data from the acquired data uploaded by the user specifically includes:

and inputting the acquired picture and text data uploaded by the user into the trained picture and text network to obtain the depth characteristics of the data.

According to the method, the depth characteristics of the sample are embedded into a class space through the attribute vectors of class levels, so that the relation between visible classes and invisible classes can be well explored, and the high-efficiency identification of new classes is realized. In addition, the invention uses a label completion strategy to complete the missing label matrix, thereby fully considering the samples with marks, partial marks and no marks and fully utilizing the label information.

In this embodiment, the training of the post-training image and text network includes:

constructing a picture and text feature extraction network with input dimensions of 500 and 224 × 3 respectively and output dimensions of 16 dimensions; the text feature extraction network is realized by adopting a fully-connected neural network; the image feature extraction network is realized by adopting a convolutional neural network VGG 19;

constructing a second training set and a verification set; the second training set and the verification set are picture and text modal data of a sample with known partial labels and labels corresponding to the two modal data; wherein the picture data size is 224 × 3; the text modal data size is 500; each tag in the tag has three values: 1 represents that the label exists, -1 represents that the label does not exist, 0 represents that the label is absent, a second training set and a verification set are used, firstly, the absent label matrix is converted into a completed label matrix through a label completion strategy, then, picture and text modal data are input into a picture and text feature extraction network, a loss function is calculated through an output result of the training set, picture and text sign extraction network parameters are optimized through the loss, and the parameter with the best diagnostic performance on the verification set is selected as the trained picture and text feature extraction network.

In an exemplary embodiment, in the training stage, the depth feature of the data is extracted from the acquired data uploaded by the user, and the method specifically includes:

and S1021, acquiring data of two modes of a sample picture and a text and label data corresponding to the sample, preprocessing the data, and converting the sample data into data which is easy to read and convenient for calculation of a convolutional neural network.

Specifically, the specific implementation process of converting the sample data into the data which is easy to read and convenient for the convolutional neural network to calculate is as follows:

for the image modal data, the original data are images with different sizes. However, the input size of the convolutional neural network is uniform. Therefore, the original picture size is uniformly converted into 224 × 3 for input to the picture feature extraction network.

For text modal data, the raw data is a plurality of words separated by spaces; in order to express the text data as digital features and measure the similarity between the text data, the text data is firstly subjected to word segmentation, then converted into one-shot codes and input into the word embedding network, and finally a 500-dimensional text feature vector is obtained.

And S1022, completing the missing label matrix through a label completion strategy.

Specifically, the completion of the missing tag matrix is performed through the tag completion strategy, which includes the following steps:

and (3) calculating a completed label matrix Z by adopting a label completion method MLML:

first, sample-level smoothness is computed, defined as follows:

wherein I represents an identity matrix, D _X Is a diagonal matrix used to normalize the similarity so that the term sample level smoothness is not affected by the sample similarity scale. V _X And representing the sample similarity for measuring the relationship between each pair of samples. Here, we can use affinity matrix for definition. If x _j Is not x _i K is adjacent to then order V _x (i, j) ═ 0, otherwise calculated as:

V _X (i,j)＝exp(-d ² (x _i ,x _j )/σ _i σ _j ) (2)

wherein d is _ij Represents the euclidean distance between samples i and j; sigma _i ＝d(x _i ,x _h ) K and h are self-defined parameters.

Thereafter, the smoothness at the label level is computed, defined as follows:

wherein I represents an identity matrix, D _C Is a diagonal matrix for normalizing the similarity so that the item of label level smoothness is not influenced by the label similarity V _C The effect of scale. V _C Representing tag similarity, defined as follows:

wherein

Is Y _i Represents the number of partially labeled samples.

Finally, a complementary tag matrix Z is calculated, defined as follows:

Z＝(1-α _X )(1-α _C )(I-α _C L _C ) ^-1 Y(I-α _X L _X ) ^-1 (5)

wherein alpha is _X And alpha _C Is hyperparametric, L _C And L _X The definitions of (a) have been given above.

It should be noted that other label completion methods may be used.

And S1023, defining the similarity between the interior and the modality of the two modalities of the picture and the text.

Specifically, the specific implementation process of defining the similarity between the two modalities and between the modalities of the picture and the text is as follows:

first, the feature similarity between the intra-modality data is defined as follows as a part of the final composite similarity:

where v represents the v-th modality (v ═ {1,2}, 1 represents a picture modality, 2 represents a text modality), and x represents _i ,x _j Representing the characteristics of sample i and sample j, edist (x) _i ,x _j ) Represents x _i ,x _j The euler distance therebetween.

Then, the label similarity between the data is defined as follows:

where v, v'. epsilon. {1,2} represents two modes, L _i ^v A set of labels representing samples i of modality v.

Then, defining the composite similarity of the samples in the modality according to the feature similarity and the label similarity of the samples, and defining the composite similarity as follows:

wherein,

represents the composite similarity between samples i, j within the modality v,

representing the feature similarity of samples i and j,

representing the label similarity of samples i and j, the above notation has been given above.

For two labeled data, the semantic similarity of the labels can be used as a supplement of the feature similarity, otherwise, the composite similarity is the same as the feature similarity. Thus, not only can the label and feature information of the training data be utilized to solve the problem of insufficient labels, but also the semantic related examples can be searched.

Finally, the composite similarity between the modalities is calculated according to the 3 similarities, which is defined as follows:

wherein

Representing the degree of similarity of labels of the exemplars between the two modalities,

and

the composite similarity of the two intra-modal samples i and j, respectively, whose definitions have been given above.

S1024, under the guidance of the joint similarity, through a unified target equation, the deep feature learning, the class space learning and the cross-modal Hash coding learning can be simultaneously realized, and the method specifically comprises the following steps:

firstly, a deep neural network is adopted to learn low-dimensional feature vectors from original pictures and text information. Meanwhile, the composite similarity is used as guiding information in the process, so that the original structural information can be kept in a low-dimensional space. Let x be ⁽¹⁾ And x ⁽²⁾ Is expressed as

And

and

for the parameters of the network, it is formulated as:

wherein,

representative sample x _i Output under two neural networks. S ^vv′ Representing a composite similarity matrix between modalities v and v'.

And then, the learning of an attribute expression space is guided through the attribute vector, and further the zero sample cross-modal hash is realized. To avoid domain drift and semantic errors, the attribute space is defined as follows:

wherein A is ⁽¹⁾ ,A ⁽²⁾ The attribute matrix representing the relationship between the defined classes can be obtained by means of word embedding. C ⁽¹⁾ ,C ⁽²⁾ Is the category space to be learned finally. Since the two modalities are actually from the same domain, although the category labels may be different, the potential categories from the two modalities may be the same without loss of generality, a ⁽¹⁾ ,A ⁽²⁾ Are arranged as their sum matrix.

Then, the required hash code is learned in the class expression space by minimizing the quantization loss, defined as follows:

wherein, W ⁽¹⁾ ,W ⁽²⁾ Coefficient matrices for two modes, B ⁽¹⁾ ,B ⁽²⁾ In the training process, the order B can be used for the finally learned hash code ⁽¹⁾ ＝B ⁽²⁾ ＝B。

And finally, combining the processes for optimization, wherein the final objective function is as follows:

where α and β are hyperparameters, the meaning of the other symbols in the formula has been given above.

In this embodiment, quantizing the depth features of the extracted data into hash codes specifically includes: and quantizing the feature vector of the data into hash codes through a hash function.

Illustratively, the depth features of the extracted data are quantized into hash codes, and the specific implementation manner is as follows:

let the depth feature of the extracted data be x, the hash function H is defined as follows:

H(x)＝sgn(x) (14)

wherein sgn (x) is defined as follows:

in this embodiment, the comparing with the data in the database to obtain the hamming distance sequence specifically includes:

calculating the Hamming distance between the quantized hash codes and the hash codes of all data in the database;

the hamming distances are sorted.

Further, the existing data in the database is stored as follows:

inputting sample picture modal data into the trained picture network to obtain the characteristics of the picture modal;

inputting sample text mode data into the trained text network to obtain the characteristics of the text mode;

quantizing the characteristics of the image and text modes through a hash function to obtain hash codes;

converting the hash code into a 10-system code and storing the 10-system code in a database.

Further, the hamming distance is calculated as follows:

converting the quantized Hash codes into decimal;

and calculating the Hamming distance between the query sample and the sample in the database through a formula.

And (3) setting the quantized hash code as b and the decimal corresponding to the b as x, and defining the Hamming distance between the decimal y corresponding to the hash code of the sample i in the database as follows:

bit(x^y) (16)

wherein ^ represents an exclusive-or operation, and bit (a) represents the number of 1 in the binary number corresponding to a.

Further, the sorting the hamming distances specifically includes: the calculated hamming distances are sorted from small to large.

Further, the selecting of the data of the amount specified by the user as a retrieval result, where the retrieval result is used to return to the user, specifically includes: the top N data are selected as the search results according to the number of queries N (10, 25, and 50) specified by the user.

Example 2:

as shown in fig. 2, an embodiment 2 of the present invention provides a cross-modal hash retrieval system based on zero sample learning, including:

The working method of the system is the same as the cross-modal hash retrieval method based on zero sample learning provided in embodiment 1, and details are not repeated here.

Example 3:

embodiment 3 of the present invention provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the steps in the zero-sample-learning-based cross-modal hash search method according to embodiment 1 of the present invention.

Example 4:

embodiment 4 of the present invention provides an electronic device, which includes a memory, a processor, and a program stored in the memory and executable on the processor, where the processor implements the steps in the zero-sample-learning-based cross-modal hash search method according to embodiment 1 of the present invention when executing the program.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A cross-modal hash retrieval method based on zero sample learning is characterized in that:

the method comprises the following steps:

acquiring text data and picture data uploaded by a user;

2. The zero-sample-learning-based cross-modal hash retrieval method of claim 1, wherein:

the text data is: and (3) separating the input contents of the text input by the user in the search box by using a blank space, inputting the separated input contents into the trained text feature extraction network, and finally expressing the separated input contents as multi-dimensional feature vectors.

3. The zero-sample-learning-based cross-modal hash retrieval method of claim 2, wherein:

the trained text feature extraction network comprises the following training processes:

constructing a multi-output-dimension word embedding network; wherein, the word embedding network is realized by a word2vec network;

constructing a first training set; the first training set is a wiki corpus;

and inputting the first training set into the word embedding network, calculating a negative log-likelihood function through an output result of the training set, and optimizing parameters of the word embedding network through the function.

4. The zero-sample-learning-based cross-modal hash retrieval method of claim 1, wherein:

the picture data is: and (4) unifying the sizes of the picture files uploaded by the user through the webpage into a preset size after the files are uploaded.

5. The zero-sample-learning-based cross-modal hash retrieval method of claim 1, wherein:

inputting the acquired picture data and text data uploaded by the user into the trained picture and text network to obtain the depth characteristics of the picture data and the text data, wherein the depth characteristics comprise:

acquiring data of two modes of a sample picture and a text and label data corresponding to the sample, preprocessing the data, and converting the sample data into data which is easy to read and convenient for calculation of a convolutional neural network;

defining the similarity between the interior of two modes of the picture and the text and the modes;

under the guidance of the joint similarity, through a unified target equation, the deep feature learning, the class space learning and the cross-modal Hash coding learning are simultaneously realized.

6. The zero-sample-learning-based cross-modal hash retrieval method of claim 5, wherein:

and completing the missing label matrix through a label completion strategy, wherein the method comprises the following steps:

calculating sample level smoothness;

calculating the smoothness of the label level;

calculating a completed label matrix according to the smoothness of the sample level and the smoothness of the label level;

or,

defining the similarity between the two modals of the picture and the text, comprising the following steps:

defining feature similarity between the data inside the modalities, wherein the similarity serves as a part of final composite similarity;

defining tag similarity between data;

defining the composite similarity of the modal internal samples according to the feature similarity and the label similarity of the samples, wherein for two pieces of labeled data, the label semantic similarity can be used as a supplement of the feature similarity, otherwise, the composite similarity is the same as the feature similarity;

calculating the composite similarity among the modalities according to the feature similarity, the label similarity and the composite similarity;

or,

under the guidance of joint similarity, through a unified target equation, the deep feature learning, the class space learning and the cross-modal Hash coding learning are simultaneously realized, and the method comprises the following steps:

a deep neural network is adopted, low-dimensional feature vectors are learned from original pictures and text information, and meanwhile composite similarity is used as guiding information in the process, so that original structural information can be kept in a low-dimensional space;

guiding the learning of an attribute expression space through an attribute vector to realize zero sample cross-modal hashing;

learning the required hash code in the class expression space by minimizing quantization loss;

the above processes are combined to carry out optimization, and an objective function is obtained as follows:

wherein alpha and beta are hyper-parameters.

7. The zero-sample-learning-based cross-modal hash retrieval method of claim 1, wherein:

the extracted depth features of the data are quantized into hash codes, including:

or,

comparing with data in a database to obtain a hamming distance ranking, including:

assuming that the quantized hash code is b and the decimal corresponding to the quantized hash code is x, and for the decimal y corresponding to the hash code of the sample i in the database, the hamming distance between the two is defined as follows:

bit(x^y)

and ^ represents an exclusive-or operation, bit (a) represents the number of 1 in a corresponding binary number of a, and the Hamming distances calculated in the step are sorted from small to large.

8. A cross-modal hash retrieval system based on zero sample learning is characterized in that:

the method comprises the following steps:

9. A computer-readable storage medium, on which a program is stored, which, when being executed by a processor, carries out the steps of the zero-sample-learning-based cross-modal hash retrieval method according to any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the steps of the zero sample learning based cross-modal hash retrieval method according to any of claims 1-7 when executing the program.