CN111753189A - Common characterization learning method for few-sample cross-modal Hash retrieval - Google Patents
- Publication number: CN111753189A (application CN202010476647.7A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/9535—Search customisation based on user profiles and personalisation
- G06F16/313—Selection or weighting of terms for indexing
- G06F16/325—Hash tables
- G06F16/338—Presentation of query results (unstructured textual data)
- G06F16/355—Class or cluster creation or modification
- G06F16/435—Filtering based on additional data, e.g. user or group profiles (multimedia data)
- G06F16/438—Presentation of query results (multimedia data)
- G06F16/45—Clustering; Classification (multimedia data)
- G06F16/538—Presentation of query results (still image data)
- G06F16/55—Clustering; Classification (still image data)
- G06F16/9538—Presentation of query results (retrieval from the web)
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention provides a common characterization learning method for few-sample cross-modal hash retrieval, which designs a Zhiji-Zhibi ("know yourself, know others") network. The network comprises two major modules: a self-knowing module and an other-knowing module. The self-knowing module fully exploits the information hidden in each sample, fusing features from different layers to extract features with more global character; building on it, the other-knowing module models the correlations among all samples and captures the nonlinear dependencies between data, so as to better learn a common representation of data from different modalities. Finally, a loss function that preserves intra-modal and inter-modal similarity is established, and the network is trained and optimized. The invention effectively alleviates the data-imbalance problem of the few-sample setting and learns a more representative common representation, thereby greatly improving cross-modal retrieval accuracy.
Description
Technical Field
The invention relates to the field of computer visual information retrieval, and in particular to a common characterization learning method for few-sample cross-modal hash retrieval.
Background
Data of many different modalities on the Internet grows day by day, so cross-modal retrieval finds ever wider application. Cross-modal retrieval takes data of one modality as the query, searches a database consisting of data of another modality, and returns similar data. Images and text are the two most common kinds of multimedia data; in addition, hashing maps high-dimensional data into low-dimensional binary codes, which speeds up retrieval and saves storage space. This invention therefore discusses only hash retrieval between images and text.
In recent years, academia has proposed various cross-modal hash retrieval algorithms based on deep learning, achieving good retrieval performance. Most of these algorithms design a deep network for each modality, train and learn separately, and independently map data of different modalities into a common space. However, this approach treats each sample as an independent individual and extracts a feature representation from that sample alone, ignoring the correlation information between different samples. When certain categories have only a few samples, the information of these few samples can be drowned out by categories with ample samples, so when training samples of some modality are insufficient, existing algorithms struggle to learn a good common representation. Data of different modalities are heterogeneous, and retrieval accuracy improves when the model can extract a powerful common representation across modalities. How to effectively exploit the information contained between data of different modalities and learn a representative common characterization is therefore the problem the few-sample cross-modal retrieval task must solve.
Cross-modal retrieval accuracy is directly determined by the common representation of the data. Inspired by the ancient saying "know yourself and know others, and you need never fear the outcome", a Zhiji-Zhibi network is proposed to learn a more powerful feature representation. Deep feature extraction is decomposed into two subtasks: 1) the self-knowing module learns a better representation directly from the sample itself. Different network layers encode different information: the lower layers of a convolutional neural network tend to encode structural information, while the higher layers tend to extract semantic information. Moreover, higher layers have larger receptive fields and better capture features of large targets, while lower layers have smaller receptive fields and mainly extract features of small targets. Fusing the features extracted from different layers therefore yields more global information and also addresses the multi-scale problem. On this basis, a self-aware module is designed that captures multi-layer abstract features and makes full use of the information of every layer in the deep neural network; 2) the other-knowing module further improves the feature representation by using other samples as context information. When humans learn something new, they learn faster if it resembles things already learned. This mode of human thinking is incorporated into the model design, giving the network the ability to perceive correlations.
The patent specification with application number 201910983514.6 discloses a text hash retrieval method based on deep learning: a bidirectional LSTM model extracts the semantic codes of each original vocabulary item of a word-embedding matrix; a text convolutional neural network is connected in parallel after the bidirectional LSTM and an attention mechanism is added; the output of the second fully connected layer is converted into hash codes with a sign function; category labels are reconstructed from the hash codes; and finally the vectors closest in Hamming distance to the query text's hash code are searched in the hash codes of a text library, completing the hash retrieval of the query text. However, that patent cannot effectively capture the correlations among data or extract a representative common characterization.
Disclosure of Invention
The invention provides a common characterization learning method for few-sample cross-modal hash retrieval that achieves high cross-modal retrieval accuracy.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a few-sample cross-modal Hash retrieval common characterization learning method comprises the following steps:
S1: dividing the data set and preprocessing the original image and text data;
S2: establishing two parallel deep network structures, and respectively extracting feature representations of the preprocessed images and texts;
S3: establishing a hash layer, and mapping the image and text features into a common space to obtain hash codes of the data of the different modalities;
S4: training, optimizing and testing the model using the triplet loss function;
S5: establishing a retrieval interface: data of one modality is input, retrieval is performed in a database formed from data of the other modality, and the top-k most similar samples are returned as the retrieval result.
Further, the specific process of step S1 is:
S11: dividing the data set: images and texts with identical label categories in the cross-modal data set form image-text pairs; several categories are randomly selected to have only a few training samples each and are called few-sample classes; the remaining classes, which have sufficient training samples, are called base classes; the data set is divided into a training set and a test set, where the number of samples of the few-sample classes in the training set is much smaller than that of the base classes, while the classes in the test set are relatively balanced;
S12: preprocessing the images and the texts respectively: the images are resized to a uniform size and normalized; for text, if the data set gives the text as independent words, a bag-of-words model converts the text into a word vector; if the given text is a sentence or an article, i.e. the words have temporal order, a pre-trained BERT model extracts the text's features and converts it into a vector.
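The preprocessing of S12 can be sketched roughly as follows; the helper names are hypothetical, and the uniform resizing and the BERT path are omitted:

```python
import numpy as np

def bow_vector(words, vocab):
    """Bag-of-words: count occurrences of each vocabulary word, ignoring order."""
    index = {w: i for i, w in enumerate(vocab)}
    v = np.zeros(len(vocab), dtype=np.float32)
    for w in words:
        if w in index:
            v[index[w]] += 1.0
    return v

def preprocess_image(img):
    """Scale pixel values to [0, 1], then standardize (one possible normalization)."""
    x = img.astype(np.float32) / 255.0
    return (x - x.mean()) / (x.std() + 1e-8)
```

In practice the normalization statistics would be computed over the whole training set rather than per image; the per-image form above just keeps the sketch self-contained.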
Further, the specific process of step S2 is:
S21: features of the image and the text are extracted with two parallel deep network frameworks, each comprising three parts: a primary feature extraction module, a self-knowing module and an other-knowing module; because a convolutional neural network can extract rich feature information from an image, the VGG19 model with its last fully connected layer removed serves as the image's primary feature extraction module; the image preprocessed in S12 is fed to this module, and the outputs of its last three convolutional blocks, after a global mean pooling layer, are denoted x1, x2, x3 respectively; for the text, three fully connected layers serve as the primary feature extraction module, and the features of different levels extracted by the different fully connected layers are denoted y1, y2, y3 respectively;
S22: features from different network layers carry different information (lower convolutional layers encode visual information, while higher layers tend to encode semantic information), so a self-knowing module is designed to fuse the features of different layers, capture global information, and obtain more representative features; the self-knowing module consists of several fully connected layers and a non-local block: the fully connected layers map the layer features obtained in S21 into vectors of the same dimension, and the non-local block computes the correlations among those same-dimension vectors;
S23: the few-sample classes have too few training samples for the deep network to learn a good feature representation from them alone; when people learn something new, they learn faster if it resembles things already known; inspired by this mode of human thinking, an other-knowing module is designed that uses other samples as context information to further improve the learned feature representation; for an input image q, S22 yields a feature I_self containing the image's own global information, whose dimension is then reduced by a fully connected layer to give s_q, with the same dimension as the text feature T_self; to integrate information from other samples, one might naturally use all samples as context when learning the feature, but using all samples wastes time and memory; instead, all samples are partitioned by their class labels and the mean feature of each class is computed; each mean feature is called a class vector, and the class vectors stand in for all sample features;
the correlation between s_q and the ith class vector v_i is computed as c_i = sigma(s_q^T W^[1:m] v_i), where W^[1:m] is a neural tensor comprising m slices that is updated during training, and sigma(.) is the ReLU activation function; next, a fully connected layer maps the correlation vector into a correlation coefficient z_i = W_z c_i + b_z, where W_z and b_z are the parameters of the fully connected layer;
after the correlation coefficients of image q with all class vectors pass through a softmax function, the normalized coefficients a_i = exp(z_i) / sum_j exp(z_j) are obtained; finally, the information of the other samples is integrated into the image feature by weighted summation, giving the final feature representation F_I = s_q + sum_i a_i v_i;
the resulting feature representation contains not only the sample's own feature information but also the experience information of other samples; in the same way, after passing through the other-knowing module, the text yields its final feature representation F_T.
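Assuming the standard m-slice neural-tensor form the description suggests (all function and variable names here are hypothetical, and the final fusion form is an assumption), the other-knowing step of S23 can be sketched in numpy as:

```python
import numpy as np

def know_others(s_q, class_vecs, W_tensor, W_z, b_z):
    """s_q: (d,) query image feature; class_vecs: (n, d) per-class mean features;
    W_tensor: (m, d, d) neural tensor with m slices; W_z: (m,); b_z: scalar.
    Correlate s_q with every class vector, squash each correlation vector to a
    coefficient, softmax-normalize, then fuse the weighted class context into s_q."""
    n = class_vecs.shape[0]
    z = np.empty(n)
    for i in range(n):
        c_i = np.maximum(0.0, s_q @ W_tensor @ class_vecs[i])  # ReLU(s_q^T W v_i), (m,)
        z[i] = W_z @ c_i + b_z                                 # fully connected layer
    a = np.exp(z - z.max())
    a /= a.sum()                                               # softmax weights
    return s_q + a @ class_vecs                                # weighted-sum fusion
```

With the class vectors standing in for all samples, the loop runs over n classes rather than the whole training set, which is the memory saving the description motivates.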
Further, in step S23, if there are n classes in total there are n class vectors, the ith image class vector being denoted v_i; during training, the class vectors are updated at intervals, so that all sample information is exploited without excessive computation.
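The class vectors above are simply per-class mean features; a minimal numpy sketch (function name hypothetical):

```python
import numpy as np

def class_vectors(features, labels, num_classes):
    """One class vector per class: the mean feature of that class's samples.
    The n class means replace the full sample set as context."""
    d = features.shape[1]
    vecs = np.zeros((num_classes, d))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            vecs[c] = features[mask].mean(axis=0)
    return vecs
```

Recomputing this at intervals (rather than every step) is what keeps the cost low while still reflecting the evolving features.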
Further, in step S22, the non-local block is implemented as follows: for an image, the same-dimension vectors are still denoted x1, x2, x3, with x = (x1, x2, x3); the response of the self-knowing module at the ith position is then o_i = (1/N(x)) sum_j f(x_i, x_j) G(x_j), where G(x_j) is a nonlinear mapping function, implemented with a 1×1 convolutional layer for ease of training; N(x) is a normalization factor; and f(x_i, x_j) computes the correlation between x_i and x_j; the responses are then fused to give the output of the image self-knowing module, I_self = mean(o_1, o_2, o_3), where mean(.) denotes the averaging operation; the output of the self-knowing module in the text network, obtained analogously, is T_self.
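Assuming a dot-product affinity for f and a plain linear map standing in for the 1×1 convolution G (neither is fixed by the text), the non-local fusion might look like:

```python
import numpy as np

def non_local_fuse(x, Wg):
    """x: (3, d) same-dimension features from three layers.
    Each position attends to all positions: o_i = (1/N) sum_j f(x_i, x_j) G(x_j);
    the module output is the mean of the responses."""
    f = x @ x.T                                        # f(x_i, x_j): dot-product correlation
    n = np.abs(f).sum(axis=1, keepdims=True) + 1e-8    # normalization factor N(x)
    g = x @ Wg                                         # G(.): 1x1-conv stand-in
    o = (f / n) @ g                                    # responses o_1..o_3
    return o.mean(axis=0)                              # fused output, i.e. I_self / T_self
```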
Further, the specific process of step S3 is:
S31: a fully connected layer serves as the hash layer, its dimension equal to the number of hash bits; the image and text features obtained in step S2 are mapped into a common hash space, giving the hash values of the image and the text respectively: H_I = W_I^T F_I + b_I and H_T = W_T^T F_T + b_T, where W_I, b_I and W_T, b_T are the parameters of the hash layers of the image network and the text network respectively;
S32: the hash values of the image and the text are respectively converted into (relaxed) binary codes using the tanh function:
BI=tanh(HI)
BT=tanh(HT)。
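Steps S31 and S32 amount to one fully connected layer followed by tanh; a minimal sketch with hypothetical parameter names. The tanh relaxation keeps training differentiable, while sign(.) (not stated here but standard for Hamming ranking) yields exact ±1 codes at query time:

```python
import numpy as np

def hash_layer(feat, W, b):
    """Fully connected hash layer: output dimension = number of hash bits."""
    return feat @ W + b

def relax_binarize(h):
    """tanh squashes hash values into (-1, 1) for training; thresholding at 0
    gives the exact binary codes used for retrieval."""
    return np.tanh(h), np.where(h >= 0, 1, -1)
```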
Further, the specific process of step S4 is:
S41: the goal of the triplet loss function is to make the distance between same-class samples smaller than the distance between different-class samples; it is computed as l(e, e+, e-) = max(0, ||e - e+||^2 - ||e - e-||^2 + alpha), where (e, e+, e-) is a triplet of hash codes, e and e+ belong to the same class, e and e- belong to different classes, alpha is a threshold (margin) parameter, and ||.|| denotes the Euclidean distance;
in the cross-modal retrieval task, semantic similarity must be preserved not only within a modality but also across modalities: if data of different modalities carry the same labels, their learned feature representations should be as similar as possible; triplet losses within the same modality (L_intra) and across modalities (L_inter) are therefore designed accordingly;
the overall loss function is L = L_intra + L_inter; during training, a small batch of image-text pairs is taken as input each time, the loss value L is computed, and the gradients obtained by the chain rule are back-propagated to update the network parameters until the model converges;
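The triplet loss above, in a minimal numpy form (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(e, e_pos, e_neg, alpha=0.5):
    """max(0, ||e - e+||^2 - ||e - e-||^2 + alpha): pull same-class codes
    together and push different-class codes at least alpha further apart."""
    d_pos = np.sum((e - e_pos) ** 2)
    d_neg = np.sum((e - e_neg) ** 2)
    return max(0.0, d_pos - d_neg + alpha)
```

L_intra would draw all three codes from one modality, while L_inter would take the anchor from one modality and the positive/negative codes from the other.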
S42: saving the trained model and testing it.
Further, the specific process of step S5 is:
an input interface is established, taking data of one modality as input, while the binary codes obtained by encoding the data of the other modality with the model form the database; the input data is mapped into binary codes by the model trained in S41, the Hamming distance between these codes and those in the database is computed, the database samples are sorted by Hamming distance, and the top-k most similar samples are returned as the retrieval result.
Further, in step S42, the model is tested as follows: the test set of one modality serves as the query set, and the training set of the other modality serves as the database; the trained model maps the samples in the query set and the database into binary codes; the database samples are sorted by the Hamming distance between their binary codes and each query's binary code, so that samples ranked nearer the front are more similar to the query; mAP (mean average precision) serves as the evaluation metric, with value range [0, 1]; it accounts for both precision and recall, and a higher mAP indicates better retrieval.
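The test protocol of S42 can be sketched as Hamming ranking plus average precision (helper names hypothetical; mAP is the mean of AP over all queries):

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Sort database indices by Hamming distance to the query, ascending."""
    dists = np.sum(query_code != db_codes, axis=1)
    return np.argsort(dists, kind="stable")

def average_precision(relevance):
    """AP over a ranked 0/1 relevance list (1 = same class as the query)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```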
Further, in step S11, the ratio of the number of few-sample classes to the number of base classes is about 1:4.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the Chi-Chi network provided by the invention can fully utilize the information of the samples and other samples, and combine the information in each sample and all data, rather than treating each sample as a discrete unit so as to learn more powerful feature representation. Even under the condition that the number of samples in certain categories is small, the technical scheme can effectively capture the data correlation and extract representative common characteristics, so that the cross-modal retrieval accuracy is greatly improved.
Drawings
FIG. 1 is a network framework diagram of the present invention;
FIG. 2 is a flow chart of the steps of the present invention;
FIG. 3 is a graph comparing experimental results of the method of the present invention with those of the prior art.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, a common characterization learning method for few-sample cross-modal hash retrieval includes the following steps:
S1: dividing the data set and preprocessing the original image and text data;
S2: establishing two parallel deep network structures, and respectively extracting feature representations of the preprocessed images and texts;
S3: establishing a hash layer, and mapping the image and text features into a common space to obtain hash codes of the data of the different modalities;
S4: training, optimizing and testing the model using the triplet loss function;
S5: establishing a retrieval interface: data of one modality is input, retrieval is performed in a database formed from data of the other modality, and the top-k most similar samples are returned as the retrieval result.
The specific process of step S1 is:
s11: dividing the data set: forming image text pairs by images and texts with consistent label categories in the cross-modal dataset, and randomly selecting a plurality of categories, wherein training samples corresponding to the categories are few and are called as few sample categories; other classes in the data set are called basic classes, and the training samples of the classes are enough; dividing the data set into a training set and a testing set, wherein the number of samples of a few sample classes in the training set is much smaller than that of samples of a basic class, and the number of samples of different classes in the testing set is relatively balanced;
s12: respectively preprocessing the image and the text: unifying the image size and normalizing the image; for text, if the text form given by the data set is an independent word, converting the text into a word vector by using a word bag model; if the given text is a sentence or an article, namely words have time sequence, extracting the characteristics of the text by using a pre-trained Bert model, and converting the text into a vector.
The specific process of step S2 is:
s21: respectively extracting features of the image and the text by using two parallel deep network frameworks, wherein each deep network framework comprises three parts: the device comprises a primary feature extraction module, a knowing module and a knowing module; due to the fact thatThe convolutional neural network can extract rich characteristic information of the image, so that the VGG19 model with the last full connection layer removed is used as a primary characteristic extraction module of the image; taking the image obtained after the preprocessing of S12 as the input of a primary feature extraction module, and respectively recording the outputs of the last three rolling blocks in the module after passing through a global mean pooling layer as x1,x2,x3(ii) a For the text, three full-connection layers are used as a primary feature extraction module of the text, and features of different layers extracted by different full-connection layers are respectively marked as y1,y2,y3;
S22: because the features obtained by different network layers contain different information, for example, the lower-layer convolutional layer encodes visual information, and the higher-layer convolutional layer tends to encode semantic information, a knowledge module is designed to fuse the features of different layers, capture global information and obtain more representative features; the self-learning module is composed of a plurality of fully connected layers and non-local blocks, wherein the fully connected layers are used for mapping the feature vectors of different layers obtained at S21 into vectors with the same dimension, and the non-local blocks are used for calculating the correlation among the vectors with the same dimension;
s23: the training samples of few sample classes are few, which is not enough for the deep network to effectively learn better feature representation; when people are learning new things, if the new things are similar to the already known things, the learning is faster; inspired by a human thinking mode, a learning module is designed, and other samples are used as context information to further improve the learned feature representation; for an input image q, after S22, a feature I containing the global information of the image itself is obtainedselfAnd then reducing the dimension of the features by using the full connection layer to obtain sqIts dimension and text feature TselfThe dimensions of the data are consistent; to integrate information from other samples, one naturally thinks of learning features using all samples as context information; however, using all samples wastes time and memory, so that all samples are divided by using the class information, the average feature of each class of sample is calculated, and the obtained average feature is calledThe category vectors are used for representing all sample characteristics;
whereinIs a nerve tensor comprising m slices, which is updated during the training process, σ (·) is the ReLU activation function; next, a full connection layer is used to map the correlation vector into the correlation coefficientWherein WzAnd bzIs a parameter of the fully-connected layer;
after the correlation coefficients of the image q and all the category vectors are processed by a softmax function, the normalized correlation coefficient is obtainedAnd finally, integrating other sample information into the image characteristics in a weighted summation mode to obtain a final characteristic representation:
the resulting feature representation contains not only the sample's own feature information but also experience information from other samples; the text feature is obtained in the same way after passing through the known module;
further, in step S23, if there are n categories in total, there are n category vectors, and the i-th image category vector is denoted c_i; during training, the category vectors are updated at intervals, so that all sample information can be utilized without making the computation too large.
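Under the assumption that the relevance between a sample feature and a category vector takes the standard neural-tensor form σ(s_qᵀ W^[1:m] c_i) (the text fixes only the roles of the tensor and the fully connected layer, not the exact algebra), the known module's context aggregation can be sketched in numpy; all names here are illustrative:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def known_module(s_q, class_vectors, W, W_z, b_z):
    """Aggregate class-vector context into a sample feature.

    s_q           : (d,)      reduced feature of the query sample
    class_vectors : (n, d)    one average feature per class
    W             : (m, d, d) neural tensor with m slices
    W_z, b_z      : (m,), scalar -- fully connected layer mapping each
                    relevance vector to a scalar relevance coefficient
    """
    n = class_vectors.shape[0]
    r = np.empty(n)
    for i, c_i in enumerate(class_vectors):
        # relevance vector: one sigma(s_q^T W_k c_i) entry per tensor slice
        v = relu(np.einsum('i,kij,j->k', s_q, W, c_i))
        r[i] = W_z @ v + b_z
    alpha = np.exp(r - r.max())
    alpha /= alpha.sum()                # softmax-normalised coefficients
    # weighted sum of class vectors, added back onto the sample feature
    return s_q + alpha @ class_vectors

rng = np.random.default_rng(0)
d, n, m = 8, 4, 3
f = known_module(rng.normal(size=d), rng.normal(size=(n, d)),
                 rng.normal(size=(m, d, d)), rng.normal(size=m), 0.1)
print(f.shape)
```

Updating the category vectors only at intervals, as the text describes, would leave this forward computation unchanged.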
Further, in step S22, the non-local block is implemented as follows: for an image, the same-dimension vectors are still denoted x_1, x_2, x_3, with x = (x_1, x_2, x_3); the response of the knowing module at the i-th position is z_i = (1/N(x)) Σ_j f(x_i, x_j) G(x_j), where G(x_j) is a nonlinear mapping function (implemented as a 1×1 convolution for ease of training), N(x) is a normalization factor, and f(x_i, x_j) computes the correlation between x_i and x_j; the features are then fused, and the output of the image knowing module is I_self = mean(z_1, z_2, z_3), where mean(·) denotes the averaging operation; the output of the knowing module in the text network is T_self.
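A minimal numpy sketch of this non-local fusion, taking f as an embedded-Gaussian correlation and G as the identity; both are stand-in assumptions for the 1×1 convolutions of a trained implementation:

```python
import numpy as np

def knowing_module(xs, G=lambda x: x):
    """Non-local fusion of same-dimension layer features x_1..x_3.

    xs : (3, d) stacked layer features; G is the nonlinear mapping
    (a 1x1 convolution in the text; identity here for illustration).
    """
    xs = np.asarray(xs, dtype=float)
    f = np.exp(xs @ xs.T)                 # f(x_i, x_j): pairwise correlation
    N = f.sum(axis=1, keepdims=True)      # normalisation factor N(x)
    z = (f / N) @ np.stack([G(x) for x in xs])  # response z_i at each position
    return z.mean(axis=0)                 # mean(.) over positions -> I_self

out = knowing_module(np.eye(3) * 0.5 + 0.1)
print(out.shape)
```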
The specific process of step S3 is:
s31: a fully connected layer is used as the hash layer, with its dimension consistent with the number of bits of the hash code; the image and text features obtained in step S24 are mapped into a common hash space, giving the hash codes of the image and the text respectively;
where the two sets of weights and biases are the parameters of the hash layer of the image network and of the text network, respectively;
s32: and respectively converting the hash codes of the image and the text into binary codes by utilizing a tanh function:
BI=tanh(HI)
BT=tanh(HT)。
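The binarization step of S32 can be sketched directly; the final sign step is our addition, since tanh alone yields relaxed codes in (−1, 1):

```python
import numpy as np

H_I = np.array([[0.9, -2.1, 0.3], [-0.4, 1.7, -0.8]])  # image hash-layer outputs
H_T = np.array([[1.2, -0.5, 0.1], [-1.1, 0.6, -2.0]])  # text hash-layer outputs

B_I = np.tanh(H_I)        # relaxed binary codes, as in S32
B_T = np.tanh(H_T)
B_I_bits = np.sign(B_I)   # strict +/-1 codes (assumed final quantisation)
print(B_I_bits)
```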
the specific process of step S4 is:
s41: the objective of the triplet loss function is to make the distance between same-class samples smaller than the distance between different-class samples; the calculation formula is l(e, e⁺, e⁻) = max(0, d(e, e⁺) − d(e, e⁻) + α);
where (e, e⁺, e⁻) is a triplet of hash codes; e and e⁺ belong to the same class, while e and e⁻ belong to different classes; α is a margin parameter; and d(·,·) denotes the Euclidean distance;
in the cross-modal retrieval task, semantic similarity must be preserved not only between data of the same modality but also between different modalities; that is, if data from different modalities have consistent labels, their learned feature representations should be as similar as possible, so triplet loss functions are designed both within a modality and across modalities;
the overall loss function is L = L_intra + L_inter; during training, a small batch of image-text pairs is taken as input each time, the loss value L is computed, and gradients are obtained by the chain rule and backpropagated to update the network parameters until the model converges;
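A toy numpy sketch of the intra-modal and inter-modal triplet terms under the margin form max(0, d(e, e⁺) − d(e, e⁻) + α); the vectors and margin value are illustrative:

```python
import numpy as np

def triplet_loss(e, e_pos, e_neg, alpha=0.5):
    """max(0, ||e - e+|| - ||e - e-|| + alpha) with Euclidean distance."""
    d_pos = np.linalg.norm(e - e_pos)
    d_neg = np.linalg.norm(e - e_neg)
    return max(0.0, d_pos - d_neg + alpha)

# image anchor with image pos/neg (intra-modal) and text pos/neg (inter-modal)
img = np.array([1.0, 0.0])
img_pos, img_neg = np.array([0.9, 0.1]), np.array([0.6, 0.3])
txt_pos, txt_neg = np.array([0.8, 0.2]), np.array([0.5, 0.4])

L_intra = triplet_loss(img, img_pos, img_neg)
L_inter = triplet_loss(img, txt_pos, txt_neg)
L = L_intra + L_inter    # overall loss L = L_intra + L_inter
print(L)
```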
s42: the trained model is saved and then tested.
The specific process of step S5 is:
an input interface is established, data of one modality is taken as input, and the binary codes obtained by encoding the data of the other modality with the model form a database; the input data is mapped into a binary code using the model trained in S41, the Hamming distances between this binary code and those in the database are computed, the database samples are sorted by Hamming distance, and the top k most similar samples are returned as the retrieval result.
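For ±1 codes, the Hamming distance can be obtained from a dot product, so the ranking step above can be sketched as (function and variable names are ours):

```python
import numpy as np

def hamming_rank(query_code, db_codes, k=3):
    """Return indices of the k database codes nearest in Hamming distance.

    Codes are +/-1 vectors; for such codes the Hamming distance equals
    (bits - dot(a, b)) / 2.
    """
    bits = db_codes.shape[1]
    dist = (bits - db_codes @ query_code) / 2
    return np.argsort(dist, kind='stable')[:k]

db = np.array([[ 1,  1, -1, -1],
               [ 1, -1, -1,  1],
               [-1, -1,  1,  1],
               [ 1,  1, -1,  1]])
top = hamming_rank(np.array([1, 1, -1, -1]), db, k=2)
print(top)
```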
In step S42, the process of testing the model is as follows: the test set of one modality is taken as the query set, and the training set of the other modality as the database; the trained model maps the samples in the query set and the database into binary codes; the database samples are sorted by the Hamming distance between their binary codes and those of the query samples, with samples ranked nearer the front being more similar to the query; mAP is used as the evaluation index of the model, with a value range of [0, 1]; this index considers both retrieval accuracy and recall, and a higher mAP indicates a better retrieval effect.
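A small sketch of the mAP computation used as the evaluation index, assuming binary (relevant/irrelevant) ground truth at each ranked position:

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, given a 0/1 relevance list in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    hits = np.cumsum(relevance)                  # relevant items seen so far
    ranks = np.arange(1, len(relevance) + 1)
    if hits[-1] == 0:
        return 0.0
    # precision at each relevant position, averaged over relevant items
    return float(np.sum(relevance * hits / ranks) / hits[-1])

def mean_average_precision(rankings):
    """mAP over queries; always lies in [0, 1]."""
    return float(np.mean([average_precision(r) for r in rankings]))

# two toy queries: ranked lists marked relevant (1) / irrelevant (0)
mAP = mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]])
print(mAP)
```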
In step S11, the ratio of the number of few-sample classes to the number of basic classes is approximately 1:4.
The scheme of the invention adopts two parallel deep networks (called the image network and the text network) to process images and text respectively. Each deep network contains four parts: a primary feature extractor, which extracts primary features of the sample (a VGG19 model for images and a bag-of-words or BERT model for text); a knowing module, which fuses the features of different layers to obtain more global information; a known module, which takes other samples as context information and computes the correlations among samples to capture the nonlinear dependencies between different samples, thereby obtaining a more representative feature representation; and a hash layer, which maps the obtained image and text features into a common space to learn common characteristics. Finally, the model is trained with the triplet loss function, preserving intra-modal and inter-modal similarity. During training, small batches of image-text pairs are input each time, and Adam is used as the optimizer. Training is iterated until the model converges, and the model is saved.
After the model is trained, its performance is tested; the flow is shown in FIG. 2. First, the trained image network and text network map the samples in the image and text training sets into hash codes, which are binarized with the tanh function to obtain binary codes serving as the image database and text database respectively. To test image-to-text retrieval performance, samples in the image test set are used as query images; after a query image is mapped into a binary code, the Hamming distances to the text database are computed and the corresponding text samples are sorted by Hamming distance, with a smaller Hamming distance indicating a more similar result. Finally, the mAP of image-to-text retrieval is computed from the ranking. Testing text-to-image retrieval is similar, except that the text test set is used as the query set and the binary codes of the image training set form the database.
FIG. 3 shows the mAP results of the present invention and other methods on the Wikipedia dataset. In the table, Image → Text denotes the image-to-text retrieval task, Text → Image the text-to-image retrieval task, K the number of samples per few-sample class in the training set, and 16 bits means the binary code is 16 bits long. The table shows that the retrieval performance of the invention on both tasks is higher than that of the other two methods, demonstrating its effectiveness.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.
Claims (10)
1. A few-sample cross-modal Hash retrieval common characterization learning method is characterized by comprising the following steps:
s1: dividing a data set and preprocessing original images and text data;
s2: establishing two parallel deep network structures, and respectively extracting characteristic representations of the preprocessed image and the preprocessed text;
s3: establishing a hash layer, and mapping the characteristics of the image and the text to a public space to obtain hash codes of data in different modes;
s4: training, optimizing and testing the model by utilizing the triple loss function;
s5: establishing a retrieval interface, inputting data of one modality, retrieving in a database formed by data of the other modality, and returning the top k most similar samples as the retrieval result.
2. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 1, wherein the specific process of step S1 is:
s11: dividing the data set: images and texts with consistent label categories in the cross-modal dataset form image-text pairs; several categories whose corresponding training samples are few are randomly selected and called few-sample classes; the other classes in the data set, whose training samples are sufficient, are called basic classes; the data set is divided into a training set and a test set, where the number of samples of the few-sample classes in the training set is much smaller than that of the basic classes, and the numbers of samples of different classes in the test set are relatively balanced;
s12: preprocessing the images and texts respectively: unifying the image size and normalizing the images; for text, if the text form given by the data set is independent words, a bag-of-words model converts the text into word vectors; if the given text is a sentence or an article, i.e. the words have temporal order, a pre-trained BERT model extracts the text features and converts the text into a vector.
3. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 2, wherein the specific process of step S2 is:
s21: two parallel deep network frameworks extract features of the image and the text respectively, each framework comprising three parts: a primary feature extraction module, a knowing module and a known module; because a convolutional neural network can extract rich image feature information, a VGG19 model with the last fully connected layer removed is taken as the primary feature extraction module for images; the image preprocessed in S12 is taken as the input of the primary feature extraction module, and the outputs of the last three convolutional blocks in the module, after a global mean pooling layer, are denoted x_1, x_2, x_3 respectively; for text, three fully connected layers serve as the primary feature extraction module, and the features of different layers extracted by the different fully connected layers are denoted y_1, y_2, y_3 respectively;
S22: because the features obtained by different network layers contain different information (for example, lower convolutional layers encode visual information, while higher convolutional layers tend to encode semantic information), a knowing module is designed to fuse the features of different layers, capture global information and obtain more representative features; the knowing module consists of several fully connected layers and non-local blocks: the fully connected layers map the feature vectors of different layers obtained in S21 into vectors of the same dimension, and the non-local blocks compute the correlations among these same-dimension vectors;
s23: the few-sample classes have very few training samples, which is not enough for a deep network to learn a good feature representation; when people learn new things, learning is faster if the new things are similar to things already known; inspired by this human way of thinking, a known module is designed that uses other samples as context information to further improve the learned feature representation; for an input image q, a feature I_self containing the global information of the image itself is obtained after S22, and a fully connected layer then reduces its dimension to give s_q, whose dimension is consistent with that of the text feature T_self; to integrate information from other samples, one naturally thinks of using all samples as context information when learning features; however, using all samples wastes time and memory, so all samples are partitioned using the class information, the average feature of each class of samples is computed, and the resulting average features, called category vectors, are used to represent all sample features;
the relevance between s_q and the i-th category vector c_i is computed as a relevance vector v_i = σ(s_qᵀ W^[1:m] c_i), where W^[1:m] is a neural tensor containing m slices that is updated during training, and σ(·) is the ReLU activation function; next, a fully connected layer maps the relevance vector into a relevance coefficient r_i = W_z v_i + b_z, where W_z and b_z are the parameters of that fully connected layer;
after the relevance coefficients of the image q with all category vectors are passed through a softmax function, normalized coefficients α_i are obtained; finally, the information of the other samples is integrated into the image feature by weighted summation, giving a final feature representation of the form F_q = s_q + Σ_i α_i c_i.
4. The method for learning common features for few-sample cross-modal hash retrieval as claimed in claim 3, wherein in step S23, if there are n categories in total, there are n category vectors, and the i-th image category vector is denoted c_i; during training, the category vectors are updated at intervals, so that all sample information can be utilized without making the computation too large.
5. The method for learning common features for few-sample cross-modal hash retrieval as claimed in claim 4, wherein in step S22, the non-local block is implemented as follows: for an image, the same-dimension vectors are still denoted x_1, x_2, x_3, with x = (x_1, x_2, x_3); the response of the knowing module at the i-th position is z_i = (1/N(x)) Σ_j f(x_i, x_j) G(x_j), where G(x_j) is a nonlinear mapping function (implemented as a 1×1 convolution for ease of training), N(x) is a normalization factor, and f(x_i, x_j) computes the correlation between x_i and x_j; the features are then fused, and the output of the image knowing module is I_self = mean(z_1, z_2, z_3), where mean(·) denotes the averaging operation; the output of the knowing module in the text network is T_self.
6. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 5, wherein the specific process of step S3 is:
s31: a fully connected layer is used as the hash layer, with its dimension consistent with the number of bits of the hash code; the image and text features obtained in step S24 are mapped into a common hash space, giving the hash codes of the image and the text respectively;
where the two sets of weights and biases are the parameters of the hash layer of the image network and of the text network, respectively;
s32: and respectively converting the hash codes of the image and the text into binary codes by utilizing a tanh function:
BI=tanh(HI)
BT=tanh(HT)。
7. the method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 6, wherein the specific process of step S4 is:
s41: the objective of the triplet loss function is to make the distance between same-class samples smaller than the distance between different-class samples; the calculation formula is l(e, e⁺, e⁻) = max(0, d(e, e⁺) − d(e, e⁻) + α);
where (e, e⁺, e⁻) is a triplet of hash codes; e and e⁺ belong to the same class, while e and e⁻ belong to different classes; α is a margin parameter; and d(·,·) denotes the Euclidean distance;
in the cross-modal retrieval task, semantic similarity must be preserved not only between data of the same modality but also between different modalities; that is, if data from different modalities have consistent labels, their learned feature representations should be as similar as possible, so triplet loss functions are designed both within a modality and across modalities;
the overall loss function is L = L_intra + L_inter; during training, a small batch of image-text pairs is taken as input each time, the loss value L is computed, and gradients are obtained by the chain rule and backpropagated to update the network parameters until the model converges;
s42: the trained model is saved and then tested.
8. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 7, wherein the specific process of step S5 is:
an input interface is established, data of one modality is taken as input, and the binary codes obtained by encoding the data of the other modality with the model form a database; the input data is mapped into a binary code using the model trained in S41, the Hamming distances between this binary code and those in the database are computed, the database samples are sorted by Hamming distance, and the top k most similar samples are returned as the retrieval result.
9. The method for learning common features for few-sample cross-modal hash retrieval as claimed in claim 8, wherein in step S42, the process of testing the model is as follows: the test set of one modality is taken as the query set, and the training set of the other modality as the database; the trained model maps the samples in the query set and the database into binary codes; the database samples are sorted by the Hamming distance between their binary codes and those of the query samples, with samples ranked nearer the front being more similar to the query; mAP is used as the evaluation index of the model, with a value range of [0, 1]; this index considers both retrieval accuracy and recall, and a higher mAP indicates a better retrieval effect.
10. The few-sample cross-modal hash retrieval common characterization learning method of claim 9, wherein in step S11, the ratio of the number of few-sample classes to the number of basic classes is approximately 1:4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010476647.7A CN111753189B (en) | 2020-05-29 | Few-sample cross-modal hash retrieval common characterization learning method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010476647.7A CN111753189B (en) | 2020-05-29 | Few-sample cross-modal hash retrieval common characterization learning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111753189A true CN111753189A (en) | 2020-10-09 |
CN111753189B CN111753189B (en) | 2024-07-05 |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112559810A (en) * | 2020-12-23 | 2021-03-26 | 上海大学 | Method and device for generating hash code by utilizing multi-layer feature fusion |
CN112613451A (en) * | 2020-12-29 | 2021-04-06 | 民生科技有限责任公司 | Modeling method of cross-modal text picture retrieval model |
CN112860935A (en) * | 2021-02-01 | 2021-05-28 | 西安电子科技大学 | Cross-source image retrieval method, system, medium and equipment |
CN113033695A (en) * | 2021-04-12 | 2021-06-25 | 北京信息科技大学 | Method for predicting faults of electronic device |
CN113408581A (en) * | 2021-05-14 | 2021-09-17 | 北京大数据先进技术研究院 | Multi-mode data matching method, device, equipment and storage medium |
CN114117153A (en) * | 2022-01-25 | 2022-03-01 | 山东建筑大学 | Online cross-modal retrieval method and system based on similarity relearning |
CN114880514A (en) * | 2022-07-05 | 2022-08-09 | 人民中科(北京)智能技术有限公司 | Image retrieval method, image retrieval device and storage medium |
CN115146488A (en) * | 2022-09-05 | 2022-10-04 | 山东鼹鼠人才知果数据科技有限公司 | Variable business process intelligent modeling system and method based on big data |
CN115203442A (en) * | 2022-09-15 | 2022-10-18 | 中国海洋大学 | Cross-modal deep hash retrieval method, system and medium based on joint attention |
WO2023078044A1 (en) * | 2021-11-05 | 2023-05-11 | 同方威视技术股份有限公司 | Method, system and device for checking authenticity of declaration information, and medium |
CN116244483A (en) * | 2023-05-12 | 2023-06-09 | 山东建筑大学 | Large-scale zero sample data retrieval method and system based on data synthesis |
CN116662490A (en) * | 2023-08-01 | 2023-08-29 | 山东大学 | Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information |
CN116825210A (en) * | 2023-08-28 | 2023-09-29 | 山东大学 | Hash retrieval method, system, equipment and medium based on multi-source biological data |
CN117056550A (en) * | 2023-10-12 | 2023-11-14 | 中国科学技术大学 | Long-tail image retrieval method, system, equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273517A (en) * | 2017-06-21 | 2017-10-20 | 复旦大学 | Picture and text cross-module state search method based on the embedded study of figure |
CN108170755A (en) * | 2017-12-22 | 2018-06-15 | 西安电子科技大学 | Cross-module state Hash search method based on triple depth network |
CN109299341A (en) * | 2018-10-29 | 2019-02-01 | 山东师范大学 | One kind confrontation cross-module state search method dictionary-based learning and system |
CN110222140A (en) * | 2019-04-22 | 2019-09-10 | 中国科学院信息工程研究所 | A kind of cross-module state search method based on confrontation study and asymmetric Hash |
US20200073968A1 (en) * | 2018-09-04 | 2020-03-05 | Inception Institute of Artificial Intelligence, Ltd. | Sketch-based image retrieval techniques using generative domain migration hashing |
Non-Patent Citations (1)
Title |
---|
陈兆佳: ""基于三元组深度哈希的跨模态检索方法"", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 02 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109471895B (en) | Electronic medical record phenotype extraction and phenotype name normalization method and system | |
CN111382272B (en) | Electronic medical record ICD automatic coding method based on knowledge graph | |
CN108334574B (en) | Cross-modal retrieval method based on collaborative matrix decomposition | |
CN110222140A (en) | A kind of cross-module state search method based on confrontation study and asymmetric Hash | |
CN112966127A (en) | Cross-modal retrieval method based on multilayer semantic alignment | |
CN113095415B (en) | Cross-modal hashing method and system based on multi-modal attention mechanism | |
CN113657450B (en) | Attention mechanism-based land battlefield image-text cross-modal retrieval method and system | |
CN112800292B (en) | Cross-modal retrieval method based on modal specific and shared feature learning | |
CN111680176A (en) | Remote sensing image retrieval method and system based on attention and bidirectional feature fusion | |
CN111475622A (en) | Text classification method, device, terminal and storage medium | |
CN111159407A (en) | Method, apparatus, device and medium for training entity recognition and relation classification model | |
CN109829065B (en) | Image retrieval method, device, equipment and computer readable storage medium | |
CN112949740B (en) | Small sample image classification method based on multilevel measurement | |
CN111400494B (en) | Emotion analysis method based on GCN-Attention | |
CN114298122B (en) | Data classification method, apparatus, device, storage medium and computer program product | |
CN114549850B (en) | Multi-mode image aesthetic quality evaluation method for solving modal missing problem | |
CN113806580B (en) | Cross-modal hash retrieval method based on hierarchical semantic structure | |
CN114896434B (en) | Hash code generation method and device based on center similarity learning | |
CN112860930B (en) | Text-to-commodity image retrieval method based on hierarchical similarity learning | |
CN114358188A (en) | Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment | |
CN111026887A (en) | Cross-media retrieval method and system | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN115130538A (en) | Training method of text classification model, text processing method, equipment and medium | |
CN108805280B (en) | Image retrieval method and device | |
CN116310339A (en) | Remote sensing image segmentation method based on matrix decomposition enhanced global features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |