CN111753189A - Common characterization learning method for few-sample cross-modal Hash retrieval - Google Patents

Common characterization learning method for few-sample cross-modal Hash retrieval

Info

Publication number
CN111753189A
Authority
CN
China
Prior art keywords
samples
text
image
data
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010476647.7A
Other languages
Chinese (zh)
Other versions
CN111753189B (en)
Inventor
王少英
赖韩江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010476647.7A priority Critical patent/CN111753189B/en
Priority claimed from CN202010476647.7A external-priority patent/CN111753189B/en
Publication of CN111753189A publication Critical patent/CN111753189A/en
Application granted granted Critical
Publication of CN111753189B publication Critical patent/CN111753189B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/435Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/438Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a common characterization learning method for few-sample cross-modal hash retrieval, which designs a "know yourself, know others" network comprising two main modules: a self-knowing module and an other-knowing module. The self-knowing module makes full use of the information hidden in the data, fusing features from different layers to extract more global features; building on it, the other-knowing module models the correlations among all samples and captures the non-linear dependencies between data, so as to better learn a common characterization of data from different modalities. Finally, a loss function that preserves intra-modal and inter-modal similarity is established, and the network is trained and optimized with it. The invention can effectively alleviate the data-imbalance problem in the few-sample setting and learn a more representative common characterization, thereby greatly improving cross-modal retrieval accuracy.

Description

Common characterization learning method for few-sample cross-modal Hash retrieval
Technical Field
The invention relates to the field of computer visual information retrieval, in particular to a common characterization learning method for cross-modal hash retrieval of few samples.
Background
Data of various modalities on the internet is growing day by day, so cross-modal retrieval is finding ever wider application. Cross-modal retrieval takes data of one modality as the query, searches a database composed of data of another modality, and returns similar data. Images and text are the two most common kinds of multimedia data, and hashing maps high-dimensional data to low-dimensional binary codes, which speeds up retrieval and saves storage space; therefore only hash retrieval between images and text is discussed here.
In recent years the academic community has proposed a variety of deep-learning-based cross-modal hash retrieval algorithms and achieved good retrieval performance. In general, these algorithms design a deep network for each modality, train and learn on each separately, and map data of different modalities into a common space independently. However, this approach treats each data sample as an independent individual and extracts a feature representation only from the sample itself, ignoring the correlation information between different data; when some categories have only a few samples, the information of these few samples may be drowned out by the categories with sufficient samples, so when training samples of the different modalities are insufficient the existing algorithms find it difficult to learn a good common characterization. Data of different modalities are heterogeneous, and retrieval accuracy improves when the model can extract a strong common representation for them. Therefore, how to effectively exploit the information shared between data of different modalities and learn a representative common characterization is the problem that the few-sample cross-modal retrieval task needs to solve.
Cross-modal retrieval accuracy is directly determined by the common representation of the data. Inspired by the ancient saying that one who knows oneself and knows others will never be defeated, a "know yourself, know others" network is proposed to learn a more powerful feature representation. Deep feature extraction is decomposed into two subtasks: 1) learning a better representation directly from the sample itself with the self-knowing module. Different network layers encode different information: the lower layers of a convolutional neural network tend to encode structural information, while the higher layers tend to extract semantic information. In addition, the receptive field of a high-level layer is larger and better suited to extracting features of large targets, while the receptive field of a low-level layer is smaller and mainly extracts features of small targets. Fusing the features extracted from different layers therefore not only yields more global information but also addresses the multi-scale problem. Based on this, a self-knowing module with self-perception capability is designed, which captures multi-layer abstract features and makes full use of the global information of every layer of the deep neural network; 2) further improving the feature representation with the other-knowing module, which uses other samples as context information. When a human learns something new, learning is faster if the new thing resembles something already learned. This way of thinking is incorporated into the model design, giving the network the ability to perceive correlations.
The patent specification with application number 201910983514.6 discloses a text hash retrieval method based on deep learning, which extracts the semantic code corresponding to each original vocabulary item of a word-embedding matrix with a bidirectional LSTM model, connects a text convolutional neural network in parallel after the bidirectional LSTM model and adds an attention mechanism, converts the output of the second fully connected layer into the corresponding hash code with a sign function, reconstructs the category labels from the hash codes, and finally searches the hash codes of the text library for the vectors closest in Hamming distance to the hash code of the query text, completing the hash retrieval of the query text. However, that patent cannot effectively capture the correlations among data and extract a representative common characterization.
Disclosure of Invention
The invention provides a common characterization learning method for few-sample cross-modal Hash retrieval with high cross-modal retrieval precision.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a few-sample cross-modal Hash retrieval common characterization learning method comprises the following steps:
s1: dividing a data set and preprocessing original images and text data;
s2: establishing two parallel deep network structures, and respectively extracting characteristic representations of the preprocessed image and the preprocessed text;
s3: establishing a hash layer, and mapping the characteristics of the image and the text to a public space to obtain hash codes of data in different modes;
s4: training, optimizing and testing the model by utilizing the triple loss function;
s5: and establishing a retrieval interface, inputting data of one modality, retrieving in a database formed by data of the other modality, and returning the top-k most similar samples as the retrieval result.
Further, the specific process of step S1 is:
s11: dividing the data set: images and texts whose label categories are consistent in the cross-modal dataset form image-text pairs; several categories are selected at random, and these categories, which have only a few corresponding training samples, are called few-sample categories; the other categories in the data set are called basic categories and have sufficient training samples; the data set is divided into a training set and a test set, where the number of samples of the few-sample categories in the training set is much smaller than that of the basic categories, while the numbers of samples of the different categories in the test set are relatively balanced;
s12: preprocessing the images and the texts respectively: unifying the image size and normalizing the images; for text, if the text given by the data set consists of independent words, converting the text into word vectors with a bag-of-words model; if the given text is a sentence or an article, i.e. the words have temporal order, extracting text features with a pre-trained BERT model and converting the text into a vector.
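A minimal sketch of the preprocessing in S12 is given below, assuming a PyTorch/torchvision pipeline. The 224×224 image size, the ImageNet normalization constants, the use of scikit-learn's CountVectorizer for the bag-of-words case, and mean-pooling of BERT hidden states are all assumptions; the patent only prescribes size unification, normalization, bag-of-words or BERT features.

```python
import torch
from torchvision import transforms
from sklearn.feature_extraction.text import CountVectorizer

# Image preprocessing: unify size and normalize (size and statistics are assumptions).
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Text preprocessing, case 1: independent words -> bag-of-words vectors.
def texts_to_bow(train_texts):
    vectorizer = CountVectorizer(binary=True)
    bow = vectorizer.fit_transform(train_texts).toarray()
    return torch.tensor(bow, dtype=torch.float32), vectorizer

# Text preprocessing, case 2: sentences/articles -> vectors from a pre-trained
# BERT encoder (mean pooling over token states is an illustrative choice).
def texts_to_bert(texts, tokenizer, bert_model):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert_model(**enc).last_hidden_state   # (B, L, hidden)
    return out.mean(dim=1)                          # (B, hidden)
```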
Further, the specific process of step S2 is:
s21: extracting features of the image and the text respectively with two parallel deep network frameworks, each of which comprises three parts: a primary feature extraction module, a self-knowing module and an other-knowing module; because a convolutional neural network can extract rich feature information from an image, the VGG19 model with its last fully connected layer removed is taken as the primary feature extraction module of the image; the image preprocessed in S12 is taken as the input of the primary feature extraction module, and the outputs of the last three convolutional blocks of the module, after a global average pooling layer, are recorded as x_1, x_2, x_3; for the text, three fully connected layers are used as its primary feature extraction module, and the features of different levels extracted by the different fully connected layers are recorded as y_1, y_2, y_3;
s22: because the features obtained from different network layers contain different information (for example, lower convolutional layers encode visual information while higher convolutional layers tend to encode semantic information), a self-knowing module is designed to fuse features of different layers, capture global information and obtain more representative features; the self-knowing module is composed of several fully connected layers and a non-local block, where the fully connected layers map the feature vectors of different layers obtained in S21 into vectors of the same dimension, and the non-local block computes the correlations among these same-dimension vectors;
s23: the few-sample categories have too few training samples for the deep network to effectively learn a good feature representation; when people learn new things, learning is faster if the new thing resembles something already known; inspired by this way of thinking, an other-knowing module is designed that uses other samples as context information to further improve the learned feature representation; for an input image q, a feature I_self containing the global information of the image itself is obtained after S22, and its dimension is then reduced with a fully connected layer to obtain s_q, whose dimension is consistent with that of the text feature T_self; to integrate information from other samples, one naturally thinks of using all samples as context information when learning features; however, using all samples wastes time and memory, so all samples are partitioned using the category information, the average feature of each category of samples is computed, the resulting average features are called category vectors, and the category vectors are used to represent all sample features;
s24: computing the correlation vector between the input image feature s_q and the i-th image category vector c_i^I:
r_i = σ(s_q^T W_r^[1:m] c_i^I)
where W_r^[1:m] is a neural tensor containing m slices, which is updated during training, and σ(·) is the ReLU activation function; next, a fully connected layer maps the correlation vector into a correlation coefficient
a_i = W_z r_i + b_z
where W_z and b_z are the parameters of the fully connected layer;
after the correlation coefficients of image q with all the category vectors are passed through a softmax function, the normalized correlation coefficients are obtained:
β_i = exp(a_i) / Σ_j exp(a_j)
and finally the information of the other samples is integrated into the image feature by weighted summation to obtain the final feature representation:
Î = s_q + Σ_i β_i c_i^I
the resulting feature representation contains not only the sample's own feature information but also the experience information of the other samples; similarly, after passing through the other-knowing module, the final text feature representation T̂ is obtained.
further, in step S23, if there are n categories in total, there are n category vectors, and the ith image category vector is recorded as
Figure BDA0002516033120000046
In the training process, the category vectors are updated at intervals, so that all sample information can be utilized, and the calculation amount is not too large.
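A minimal PyTorch sketch of the other-knowing module of S23/S24, following the reconstruction above, is given below: a neural tensor with m slices scores the correlation between the query feature s_q and each category vector, a fully connected layer maps each correlation vector to a scalar, softmax normalizes the scalars, and the weighted sum of category vectors is combined with s_q. Adding the weighted context back onto s_q (rather than, say, concatenating), the slice count, and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OtherKnowingModule(nn.Module):
    """Refines one sample feature using class-mean (category) vectors as context."""
    def __init__(self, dim, m_slices=8):
        super().__init__()
        # Neural tensor with m slices; each slice is a (dim x dim) bilinear form.
        self.W = nn.Parameter(torch.randn(m_slices, dim, dim) * 0.01)
        self.fc = nn.Linear(m_slices, 1)   # maps a correlation vector to a coefficient

    def forward(self, s_q, category_vectors):
        # s_q: (dim,); category_vectors: (n_classes, dim)
        # r_i = ReLU(s_q^T W^[1:m] c_i)  ->  (n_classes, m)
        r = F.relu(torch.einsum("d,mde,ne->nm", s_q, self.W, category_vectors))
        a = self.fc(r).squeeze(-1)                     # correlation coefficients (n_classes,)
        beta = F.softmax(a, dim=0)                     # normalized coefficients
        context = (beta.unsqueeze(-1) * category_vectors).sum(dim=0)
        return s_q + context                           # final feature representation

# Category vectors are class means of the current features, recomputed at intervals;
# this helper assumes every class appears at least once in `features`.
def class_mean_vectors(features, labels, n_classes):
    return torch.stack([features[labels == c].mean(dim=0) for c in range(n_classes)])
```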
Further, in step S22, the non-local block is implemented as follows: for the image, the same-dimension vectors are still recorded as x_1, x_2, x_3, with x = (x_1, x_2, x_3); the response of the self-knowing module at the i-th position is then
z_i = (1/N(x)) Σ_j f(x_i, x_j) G(x_j)
where G(x_j) is a non-linear mapping function, implemented with a 1×1 convolutional layer for ease of training, N(x) is a normalization factor, and f(x_i, x_j) computes the correlation between x_i and x_j; the features are then fused, and the output of the image self-knowing module is
I_self = mean(z_1, z_2, z_3)
where mean(·) denotes the averaging operation; the output of the self-knowing module in the text network is T_self.
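A minimal PyTorch sketch of the image-side primary extractor (S21) and the self-knowing module (S22) is given below. The slicing of torchvision's vgg19 feature stack at its last three blocks reflects that library's layer layout; taking f as an embedded dot-product similarity, N(x) as the number of positions, and 512 as the common dimension are assumptions, since the patent leaves these concrete choices open (a 1×1 convolution applied to a single vector reduces to a linear layer).

```python
import torch
import torch.nn as nn
from torchvision import models

class ImagePrimaryExtractor(nn.Module):
    """VGG19 without its classifier; returns GAP'd outputs of the last three conv blocks."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features
        # torchvision's vgg19.features has max-pool layers at indices 4, 9, 18, 27, 36,
        # so these slices end at the last three convolutional blocks.
        self.block3, self.block4, self.block5 = vgg[:19], vgg[19:28], vgg[28:37]
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, img):                       # img: (B, 3, H, W)
        f3 = self.block3(img)
        f4 = self.block4(f3)
        f5 = self.block5(f4)
        # x_1 (256-d), x_2 (512-d), x_3 (512-d): pooled features of the last three blocks
        return [self.gap(f).flatten(1) for f in (f3, f4, f5)]

class SelfKnowingModule(nn.Module):
    """Maps the three layer features to one dimension, then fuses them non-locally."""
    def __init__(self, in_dims=(256, 512, 512), dim=512):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, dim) for d in in_dims])  # same-dimension mapping
        self.theta = nn.Linear(dim, dim, bias=False)  # embeds x_i for the correlation f
        self.phi = nn.Linear(dim, dim, bias=False)    # embeds x_j for the correlation f
        self.g = nn.Linear(dim, dim, bias=False)      # G(x_j)

    def forward(self, feats):                     # feats: list of three (B, in_dim) tensors
        x = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, 3, dim)
        f = self.theta(x) @ self.phi(x).transpose(1, 2)   # (B, 3, 3) pairwise correlations
        z = (f @ self.g(x)) / x.size(1)                   # z_i = (1/N) sum_j f_ij G(x_j)
        return z.mean(dim=1)                              # I_self (or T_self for the text branch)
```

The same SelfKnowingModule applied to (y_1, y_2, y_3) in the text branch would yield T_self.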
Further, the specific process of step S3 is:
s31: taking a fully connected layer as the hash layer, the dimension of which is consistent with the number of bits of the hash code; mapping the image and text features obtained in step S24 into a common hash space to obtain the hash codes of the image and the text respectively:
H_I = W_h^I Î + b_h^I
H_T = W_h^T T̂ + b_h^T
where W_h^I, b_h^I and W_h^T, b_h^T are the parameters of the hash layers of the image network and of the text network respectively;
s32: converting the hash codes of the image and the text into binary codes with the tanh function:
B_I = tanh(H_I)
B_T = tanh(H_T).
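A minimal sketch of the hash layer in S31/S32: one fully connected layer per modality, sized to the hash-code length, followed by tanh to obtain relaxed codes in (-1, 1). The 16-bit code length and the 512-dimensional input are illustrative values.

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    def __init__(self, in_dim=512, bits=16):
        super().__init__()
        self.fc = nn.Linear(in_dim, bits)     # W_h, b_h of this modality's hash layer

    def forward(self, feat):
        h = self.fc(feat)                     # hash code H
        return torch.tanh(h)                  # relaxed binary code B in (-1, 1)

image_hash = HashLayer()   # applied to the final image feature
text_hash = HashLayer()    # applied to the final text feature
```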
further, the specific process of step S4 is:
s41: the objective of the triplet loss function is to make the distance between samples of the same class smaller than the distance between samples of different classes; it is computed as
l(e, e^+, e^-) = max(0, D(e, e^+) - D(e, e^-) + α)
where (e, e^+, e^-) is a triplet composed of hash codes, e and e^+ belong to the same class, e and e^- belong to different classes, α is a threshold (margin) parameter, and D(·, ·) denotes the Euclidean distance;
in the cross-modal retrieval task, semantic similarity must be preserved not only between data of the same modality but also between different modalities; that is, if the labels of data from different modalities are consistent, their learned feature representations should be as similar as possible, so intra-modal and inter-modal triplet losses are designed as follows:
L_intra = Σ l(B_I, B_I^+, B_I^-) + Σ l(B_T, B_T^+, B_T^-)
L_inter = Σ l(B_I, B_T^+, B_T^-) + Σ l(B_T, B_I^+, B_I^-)
the overall loss function is L = L_intra + L_inter; during training, a small batch of image-text pairs is taken as input each time, the loss value L is computed, gradients are obtained with the chain rule, and the network parameters are updated by back-propagation until the model converges;
s42: and storing the trained model, and testing the model.
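A minimal sketch of the loss in S41, following the reconstruction above: a Euclidean-distance triplet margin loss applied within each modality and across modalities, summed into L = L_intra + L_inter. The margin value and the batched positive/negative layout are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet(e, e_pos, e_neg, alpha=0.5):
    """max(0, D(e, e+) - D(e, e-) + alpha) with Euclidean distance D, averaged over the batch."""
    d_pos = F.pairwise_distance(e, e_pos)
    d_neg = F.pairwise_distance(e, e_neg)
    return F.relu(d_pos - d_neg + alpha).mean()

def total_loss(b_img, b_img_pos, b_img_neg, b_txt, b_txt_pos, b_txt_neg):
    # Intra-modal: anchor, positive and negative all come from the same modality.
    l_intra = triplet(b_img, b_img_pos, b_img_neg) + triplet(b_txt, b_txt_pos, b_txt_neg)
    # Inter-modal: anchor from one modality, positive/negative from the other.
    l_inter = triplet(b_img, b_txt_pos, b_txt_neg) + triplet(b_txt, b_img_pos, b_img_neg)
    return l_intra + l_inter
```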
Further, the specific process of step S5 is:
establishing an input interface, taking data of one modality as input, and forming a database from the binary codes obtained by encoding the data of the other modality with the model; the input data are mapped into binary codes with the model trained in S41, the Hamming distances between these codes and the binary codes in the database are computed, the database samples are sorted by Hamming distance, and the top-k most similar samples are returned as the retrieval result.
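A minimal sketch of this retrieval interface: the query is encoded, binarized, compared with the database codes by Hamming distance, and the k nearest samples are returned. Using sign() to obtain strict ±1 bits before comparison is an assumption.

```python
import torch

def hamming_distance(query_code, db_codes):
    # For ±1 codes of length b, Hamming distance = (b - dot product) / 2.
    b = query_code.numel()
    return (b - db_codes @ query_code) / 2

def retrieve_top_k(query_feat, encoder, db_codes, k=10):
    with torch.no_grad():
        query_code = torch.sign(encoder(query_feat))    # binarize the query
    dist = hamming_distance(query_code, db_codes)       # (N_database,)
    return torch.topk(dist, k, largest=False).indices   # indices of the k nearest samples
```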
Further, in step S42, the process of testing the model is as follows: the test set of one modality is taken as the query set, and the training set of the other modality as the database; the samples in the query set and in the database are mapped into binary codes with the trained model; the database samples are sorted by the Hamming distance between their binary codes and the binary code of each query sample, so that samples ranked nearer the front are more similar to the query; mAP is used as the evaluation index of the model, its value range is [0, 1], and it takes both retrieval precision and recall into account, so a higher mAP indicates a better retrieval effect of the model.
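A minimal sketch of the mAP evaluation: for each query, database items are ranked by Hamming distance and average precision is computed over the relevant items; mAP is the mean over queries. Single-label relevance (same label means relevant) is assumed for simplicity.

```python
import torch

def mean_average_precision(query_codes, query_labels, db_codes, db_labels):
    aps = []
    for code, label in zip(query_codes, query_labels):
        dist = (code.numel() - db_codes @ code) / 2           # Hamming distances to the database
        order = torch.argsort(dist)                           # rank database by distance
        relevant = (db_labels[order] == label).float()
        if relevant.sum() == 0:
            continue
        ranks = torch.arange(1, len(relevant) + 1, dtype=torch.float32)
        precision_at_hit = torch.cumsum(relevant, dim=0) / ranks
        aps.append((precision_at_hit * relevant).sum() / relevant.sum())
    return torch.stack(aps).mean().item()
```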
Further, in step S11, the ratio of the number of few-sample classes to the number of basic classes is about 1:4.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the Chi-Chi network provided by the invention can fully utilize the information of the samples and other samples, and combine the information in each sample and all data, rather than treating each sample as a discrete unit so as to learn more powerful feature representation. Even under the condition that the number of samples in certain categories is small, the technical scheme can effectively capture the data correlation and extract representative common characteristics, so that the cross-modal retrieval accuracy is greatly improved.
Drawings
FIG. 1 is a network framework diagram of the present invention;
FIG. 2 is a flow chart of the steps of the present invention;
FIG. 3 is a graph comparing experimental results of the method of the present invention with those of the prior art.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, a common characterization learning method for few-sample cross-modal hash retrieval includes the following steps:
s1: dividing a data set and preprocessing original images and text data;
s2: establishing two parallel deep network structures, and respectively extracting characteristic representations of the preprocessed image and the preprocessed text;
s3: establishing a hash layer, and mapping the characteristics of the image and the text to a public space to obtain hash codes of data in different modes;
s4: training, optimizing and testing the model by utilizing the triple loss function;
s5: and establishing a retrieval interface, inputting data of one mode, retrieving in a database formed by data of another mode, and returning the most similar top k sample as a retrieval result.
The specific process of step S1 is:
s11: dividing the data set: images and texts whose label categories are consistent in the cross-modal dataset form image-text pairs; several categories are selected at random, and these categories, which have only a few corresponding training samples, are called few-sample categories; the other categories in the data set are called basic categories and have sufficient training samples; the data set is divided into a training set and a test set, where the number of samples of the few-sample categories in the training set is much smaller than that of the basic categories, while the numbers of samples of the different categories in the test set are relatively balanced;
s12: preprocessing the images and the texts respectively: unifying the image size and normalizing the images; for text, if the text given by the data set consists of independent words, converting the text into word vectors with a bag-of-words model; if the given text is a sentence or an article, i.e. the words have temporal order, extracting text features with a pre-trained BERT model and converting the text into a vector.
The specific process of step S2 is:
s21: extracting features of the image and the text respectively with two parallel deep network frameworks, each of which comprises three parts: a primary feature extraction module, a self-knowing module and an other-knowing module; because a convolutional neural network can extract rich feature information from an image, the VGG19 model with its last fully connected layer removed is taken as the primary feature extraction module of the image; the image preprocessed in S12 is taken as the input of the primary feature extraction module, and the outputs of the last three convolutional blocks of the module, after a global average pooling layer, are recorded as x_1, x_2, x_3; for the text, three fully connected layers are used as its primary feature extraction module, and the features of different levels extracted by the different fully connected layers are recorded as y_1, y_2, y_3;
s22: because the features obtained from different network layers contain different information (for example, lower convolutional layers encode visual information while higher convolutional layers tend to encode semantic information), a self-knowing module is designed to fuse features of different layers, capture global information and obtain more representative features; the self-knowing module is composed of several fully connected layers and a non-local block, where the fully connected layers map the feature vectors of different layers obtained in S21 into vectors of the same dimension, and the non-local block computes the correlations among these same-dimension vectors;
s23: the few-sample categories have too few training samples for the deep network to effectively learn a good feature representation; when people learn new things, learning is faster if the new thing resembles something already known; inspired by this way of thinking, an other-knowing module is designed that uses other samples as context information to further improve the learned feature representation; for an input image q, a feature I_self containing the global information of the image itself is obtained after S22, and its dimension is then reduced with a fully connected layer to obtain s_q, whose dimension is consistent with that of the text feature T_self; to integrate information from other samples, one naturally thinks of using all samples as context information when learning features; however, using all samples wastes time and memory, so all samples are partitioned using the category information, the average feature of each category of samples is computed, the resulting average features are called category vectors, and the category vectors are used to represent all sample features;
s24: computing the correlation vector between the input image feature s_q and the i-th image category vector c_i^I:
r_i = σ(s_q^T W_r^[1:m] c_i^I)
where W_r^[1:m] is a neural tensor containing m slices, which is updated during training, and σ(·) is the ReLU activation function; next, a fully connected layer maps the correlation vector into a correlation coefficient
a_i = W_z r_i + b_z
where W_z and b_z are the parameters of the fully connected layer;
after the correlation coefficients of image q with all the category vectors are passed through a softmax function, the normalized correlation coefficients are obtained:
β_i = exp(a_i) / Σ_j exp(a_j)
and finally the information of the other samples is integrated into the image feature by weighted summation to obtain the final feature representation:
Î = s_q + Σ_i β_i c_i^I
the resulting feature representation contains not only the sample's own feature information but also the experience information of the other samples; similarly, after passing through the other-knowing module, the final text feature representation T̂ is obtained.
further, in step S23, if there are n categories in total, there are n category vectors, and the ith image category vector is recordedIs composed of
Figure BDA0002516033120000085
In the training process, the category vectors are updated at intervals, so that all sample information can be utilized, and the calculation amount is not too large.
Further, in step S22, the non-local block is implemented as follows: for the image, the same-dimension vectors are still recorded as x_1, x_2, x_3, with x = (x_1, x_2, x_3); the response of the self-knowing module at the i-th position is then
z_i = (1/N(x)) Σ_j f(x_i, x_j) G(x_j)
where G(x_j) is a non-linear mapping function, implemented with a 1×1 convolutional layer for ease of training, N(x) is a normalization factor, and f(x_i, x_j) computes the correlation between x_i and x_j; the features are then fused, and the output of the image self-knowing module is
I_self = mean(z_1, z_2, z_3)
where mean(·) denotes the averaging operation; the output of the self-knowing module in the text network is T_self.
The specific process of step S3 is:
s31: taking a fully connected layer as the hash layer, the dimension of which is consistent with the number of bits of the hash code; mapping the image and text features obtained in step S24 into a common hash space to obtain the hash codes of the image and the text respectively:
H_I = W_h^I Î + b_h^I
H_T = W_h^T T̂ + b_h^T
where W_h^I, b_h^I and W_h^T, b_h^T are the parameters of the hash layers of the image network and of the text network respectively;
s32: converting the hash codes of the image and the text into binary codes with the tanh function:
B_I = tanh(H_I)
B_T = tanh(H_T).
the specific process of step S4 is:
s41: the objective of the triplet loss function is to make the distance between samples of the same class smaller than the distance between samples of different classes; it is computed as
l(e, e^+, e^-) = max(0, D(e, e^+) - D(e, e^-) + α)
where (e, e^+, e^-) is a triplet composed of hash codes, e and e^+ belong to the same class, e and e^- belong to different classes, α is a threshold (margin) parameter, and D(·, ·) denotes the Euclidean distance;
in the cross-modal retrieval task, semantic similarity must be preserved not only between data of the same modality but also between different modalities; that is, if the labels of data from different modalities are consistent, their learned feature representations should be as similar as possible, so intra-modal and inter-modal triplet losses are designed as follows:
L_intra = Σ l(B_I, B_I^+, B_I^-) + Σ l(B_T, B_T^+, B_T^-)
L_inter = Σ l(B_I, B_T^+, B_T^-) + Σ l(B_T, B_I^+, B_I^-)
the overall loss function is L = L_intra + L_inter; during training, a small batch of image-text pairs is taken as input each time, the loss value L is computed, gradients are obtained with the chain rule, and the network parameters are updated by back-propagation until the model converges;
s42: and storing the trained model, and testing the model.
The specific process of step S5 is:
establishing an input interface, taking data of one modality as input, and forming a database from the binary codes obtained by encoding the data of the other modality with the model; the input data are mapped into binary codes with the model trained in S41, the Hamming distances between these codes and the binary codes in the database are computed, the database samples are sorted by Hamming distance, and the top-k most similar samples are returned as the retrieval result.
In step S42, the process of testing the model is as follows: the test set of one modality is taken as the query set, and the training set of the other modality as the database; the samples in the query set and in the database are mapped into binary codes with the trained model; the database samples are sorted by the Hamming distance between their binary codes and the binary code of each query sample, so that samples ranked nearer the front are more similar to the query; mAP is used as the evaluation index of the model, its value range is [0, 1], and it takes both retrieval precision and recall into account, so a higher mAP indicates a better retrieval effect of the model.
In step S11, the ratio of the number of few-sample classes to the number of basic classes is about 1:4.
The scheme of the invention uses two parallel deep networks (called the image network and the text network) to process the image and the text respectively. Each deep network contains four parts: a primary feature extractor, which extracts the primary features of the sample, using the VGG19 model for images and a bag-of-words or BERT model for text; a self-knowing module, which fuses the features of different layers to obtain more global information; an other-knowing module, which takes other samples as context information and computes the correlations among samples to capture the non-linear dependencies between different samples, thereby obtaining a more representative feature representation; and a hash layer, which maps the obtained image and text features into a common space to learn the common characterization. Finally, the model is trained with the triplet loss function to preserve intra-modal and inter-modal similarity. During training, a small batch of image-text pairs is input each time, and Adam is used as the optimizer. Training iterates until the model converges, and the model is then saved.
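A minimal sketch of this training loop, assuming the modules sketched earlier are assembled into image_net and text_net that map a mini-batch to relaxed hash codes. The Adam hyper-parameters are assumptions, and sample_triplets is a hypothetical helper (passed in by the caller) that returns positive/negative indices for each anchor given the labels.

```python
import torch

def train(image_net, text_net, train_loader, total_loss, sample_triplets,
          num_epochs=100, lr=1e-4):
    """image_net / text_net: end-to-end networks (primary extractor + self-knowing +
    other-knowing + hash layer) producing relaxed codes for a mini-batch."""
    params = list(image_net.parameters()) + list(text_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(num_epochs):
        for images, texts, labels in train_loader:        # mini-batches of image-text pairs
            b_img, b_txt = image_net(images), text_net(texts)
            pos, neg = sample_triplets(labels)            # positive/negative indices per anchor
            loss = total_loss(b_img, b_img[pos], b_img[neg],
                              b_txt, b_txt[pos], b_txt[neg])
            optimizer.zero_grad()
            loss.backward()                               # gradients via the chain rule
            optimizer.step()                              # back-propagation update
```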
After training, the model performance is tested; the flow is shown in fig. 2. First, the samples in the image and text training sets are mapped to hash codes with the trained image network and text network respectively, and then binarized with the tanh function to obtain binary codes, which serve as the image database and the text database. To test image-retrieves-text performance, the samples in the image test set are used as query images; after a query image is mapped to a binary code, its Hamming distances to the text database are computed, the corresponding samples in the text database are sorted by Hamming distance, and a smaller Hamming distance means a more similar result. Finally the mAP of image-retrieves-text is computed from the ranking. Testing text-retrieves-image performance is similar, except that the text test set is used as the query set and the binary codes corresponding to the image training set are used as the database.
FIG. 3 shows the mAP results on the Wikipedia dataset for the present invention and other methods. Image → Text in the table denotes the image-retrieves-text task, Text → Image denotes the text-retrieves-image task, K is the number of training samples of each few-sample class, and 16 bits means the binary code is 16 bits long. The table shows that the retrieval performance of the invention on both tasks is higher than that of the other two methods, which illustrates its effectiveness.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A few-sample cross-modal Hash retrieval common characterization learning method is characterized by comprising the following steps:
s1: dividing a data set and preprocessing original images and text data;
s2: establishing two parallel deep network structures, and respectively extracting characteristic representations of the preprocessed image and the preprocessed text;
s3: establishing a hash layer, and mapping the characteristics of the image and the text to a public space to obtain hash codes of data in different modes;
s4: training, optimizing and testing the model by utilizing the triple loss function;
s5: and establishing a retrieval interface, inputting data of one modality, retrieving in a database formed by data of the other modality, and returning the top-k most similar samples as the retrieval result.
2. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 1, wherein the specific process of step S1 is:
s11: dividing the data set: images and texts whose label categories are consistent in the cross-modal dataset form image-text pairs; several categories are selected at random, and these categories, which have only a few corresponding training samples, are called few-sample categories; the other categories in the data set are called basic categories and have sufficient training samples; the data set is divided into a training set and a test set, where the number of samples of the few-sample categories in the training set is much smaller than that of the basic categories, while the numbers of samples of the different categories in the test set are relatively balanced;
s12: preprocessing the images and the texts respectively: unifying the image size and normalizing the images; for text, if the text given by the data set consists of independent words, converting the text into word vectors with a bag-of-words model; if the given text is a sentence or an article, i.e. the words have temporal order, extracting text features with a pre-trained BERT model and converting the text into a vector.
3. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 2, wherein the specific process of step S2 is:
s21: extracting features of the image and the text respectively with two parallel deep network frameworks, each of which comprises three parts: a primary feature extraction module, a self-knowing module and an other-knowing module; because a convolutional neural network can extract rich feature information from an image, the VGG19 model with its last fully connected layer removed is taken as the primary feature extraction module of the image; the image preprocessed in S12 is taken as the input of the primary feature extraction module, and the outputs of the last three convolutional blocks of the module, after a global average pooling layer, are recorded as x_1, x_2, x_3; for the text, three fully connected layers are used as its primary feature extraction module, and the features of different levels extracted by the different fully connected layers are recorded as y_1, y_2, y_3;
s22: because the features obtained from different network layers contain different information (for example, lower convolutional layers encode visual information while higher convolutional layers tend to encode semantic information), a self-knowing module is designed to fuse features of different layers, capture global information and obtain more representative features; the self-knowing module is composed of several fully connected layers and a non-local block, where the fully connected layers map the feature vectors of different layers obtained in S21 into vectors of the same dimension, and the non-local block computes the correlations among these same-dimension vectors;
s23: the few-sample categories have too few training samples for the deep network to effectively learn a good feature representation; when people learn new things, learning is faster if the new thing resembles something already known; inspired by this way of thinking, an other-knowing module is designed that uses other samples as context information to further improve the learned feature representation; for an input image q, a feature I_self containing the global information of the image itself is obtained after S22, and its dimension is then reduced with a fully connected layer to obtain s_q, whose dimension is consistent with that of the text feature T_self; to integrate information from other samples, one naturally thinks of using all samples as context information when learning features; however, using all samples wastes time and memory, so all samples are partitioned using the category information, the average feature of each category of samples is computed, the resulting average features are called category vectors, and the category vectors are used to represent all sample features;
s24: computing the correlation vector between the input image feature s_q and the i-th image category vector c_i^I:
r_i = σ(s_q^T W_r^[1:m] c_i^I)
where W_r^[1:m] is a neural tensor containing m slices, which is updated during training, and σ(·) is the ReLU activation function; next, a fully connected layer maps the correlation vector into a correlation coefficient
a_i = W_z r_i + b_z
where W_z and b_z are the parameters of the fully connected layer;
after the correlation coefficients of image q with all the category vectors are passed through a softmax function, the normalized correlation coefficients are obtained:
β_i = exp(a_i) / Σ_j exp(a_j)
and finally the information of the other samples is integrated into the image feature by weighted summation to obtain the final feature representation:
Î = s_q + Σ_i β_i c_i^I
the resulting feature representation contains not only the sample's own feature information but also the experience information of the other samples; similarly, after passing through the other-knowing module, the final text feature representation T̂ is obtained.
4. the method for learning common features for low-sample cross-modal hash search as claimed in claim 3, wherein in step S23, if there are n classes, there are n class vectors, and the ith image class vector is recorded as
Figure FDA0002516033110000031
In the training process, the category vectors are updated at intervals, so that all sample information can be utilized, and the calculation amount is not too large.
5. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 4, wherein in step S22 the non-local block is implemented as follows: for the image, the same-dimension vectors are still recorded as x_1, x_2, x_3, with x = (x_1, x_2, x_3); the response of the self-knowing module at the i-th position is then
z_i = (1/N(x)) Σ_j f(x_i, x_j) G(x_j)
where G(x_j) is a non-linear mapping function, implemented with a 1×1 convolutional layer for ease of training, N(x) is a normalization factor, and f(x_i, x_j) computes the correlation between x_i and x_j; the features are then fused, and the output of the image self-knowing module is
I_self = mean(z_1, z_2, z_3)
where mean(·) denotes the averaging operation; the output of the self-knowing module in the text network is T_self.
6. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 5, wherein the specific process of step S3 is:
s31: taking a fully connected layer as the hash layer, the dimension of which is consistent with the number of bits of the hash code; mapping the image and text features obtained in step S24 into a common hash space to obtain the hash codes of the image and the text respectively:
H_I = W_h^I Î + b_h^I
H_T = W_h^T T̂ + b_h^T
where W_h^I, b_h^I and W_h^T, b_h^T are the parameters of the hash layers of the image network and of the text network respectively;
s32: converting the hash codes of the image and the text into binary codes with the tanh function:
B_I = tanh(H_I)
B_T = tanh(H_T).
7. the method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 6, wherein the specific process of step S4 is:
s41: the objective of the triplet loss function is to make the distance between samples of the same class smaller than the distance between samples of different classes; it is computed as
l(e, e^+, e^-) = max(0, D(e, e^+) - D(e, e^-) + α)
where (e, e^+, e^-) is a triplet composed of hash codes, e and e^+ belong to the same class, e and e^- belong to different classes, α is a threshold (margin) parameter, and D(·, ·) denotes the Euclidean distance;
in the cross-modal retrieval task, semantic similarity must be preserved not only between data of the same modality but also between different modalities; that is, if the labels of data from different modalities are consistent, their learned feature representations should be as similar as possible, so intra-modal and inter-modal triplet losses are designed as follows:
L_intra = Σ l(B_I, B_I^+, B_I^-) + Σ l(B_T, B_T^+, B_T^-)
L_inter = Σ l(B_I, B_T^+, B_T^-) + Σ l(B_T, B_I^+, B_I^-)
the overall loss function is L = L_intra + L_inter; during training, a small batch of image-text pairs is taken as input each time, the loss value L is computed, gradients are obtained with the chain rule, and the network parameters are updated by back-propagation until the model converges;
s42: and storing the trained model, and testing the model.
8. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 7, wherein the specific process of step S5 is:
establishing an input interface, taking data of one modality as input, and forming a database from the binary codes obtained by encoding the data of the other modality with the model; the input data are mapped into binary codes with the model trained in S41, the Hamming distances between these codes and the binary codes in the database are computed, the database samples are sorted by Hamming distance, and the top-k most similar samples are returned as the retrieval result.
9. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 8, wherein in step S42 the process of testing the model is as follows: the test set of one modality is taken as the query set, and the training set of the other modality as the database; the samples in the query set and in the database are mapped into binary codes with the trained model; the database samples are sorted by the Hamming distance between their binary codes and the binary code of each query sample, so that samples ranked nearer the front are more similar to the query; mAP is used as the evaluation index of the model, its value range is [0, 1], and it takes both retrieval precision and recall into account, so a higher mAP indicates a better retrieval effect of the model.
10. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 9, wherein in step S11 the ratio of the number of few-sample classes to the number of basic classes is about 1:4.
CN202010476647.7A 2020-05-29 Few-sample cross-modal hash retrieval common characterization learning method Active CN111753189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010476647.7A CN111753189B (en) 2020-05-29 Few-sample cross-modal hash retrieval common characterization learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010476647.7A CN111753189B (en) 2020-05-29 Few-sample cross-modal hash retrieval common characterization learning method

Publications (2)

Publication Number Publication Date
CN111753189A true CN111753189A (en) 2020-10-09
CN111753189B (en) 2024-07-05




Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN108170755A (en) * 2017-12-22 2018-06-15 西安电子科技大学 Cross-module state Hash search method based on triple depth network
US20200073968A1 (en) * 2018-09-04 2020-03-05 Inception Institute of Artificial Intelligence, Ltd. Sketch-based image retrieval techniques using generative domain migration hashing
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 One kind confrontation cross-module state search method dictionary-based learning and system
CN110222140A (en) * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A kind of cross-module state search method based on confrontation study and asymmetric Hash

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Zhaojia: "Cross-modal retrieval method based on triplet deep hashing" (基于三元组深度哈希的跨模态检索方法), China Master's Theses Full-text Database, Information Science and Technology, no. 02 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559810A (en) * 2020-12-23 2021-03-26 上海大学 Method and device for generating hash code by utilizing multi-layer feature fusion
CN112613451A (en) * 2020-12-29 2021-04-06 民生科技有限责任公司 Modeling method of cross-modal text picture retrieval model
CN112860935B (en) * 2021-02-01 2023-02-21 西安电子科技大学 Cross-source image retrieval method, system, medium and equipment
CN112860935A (en) * 2021-02-01 2021-05-28 西安电子科技大学 Cross-source image retrieval method, system, medium and equipment
CN113033695A (en) * 2021-04-12 2021-06-25 北京信息科技大学 Method for predicting faults of electronic device
CN113033695B (en) * 2021-04-12 2023-07-25 北京信息科技大学 Method for predicting faults of electronic device
CN113408581A (en) * 2021-05-14 2021-09-17 北京大数据先进技术研究院 Multi-mode data matching method, device, equipment and storage medium
WO2023078044A1 (en) * 2021-11-05 2023-05-11 同方威视技术股份有限公司 Method, system and device for checking authenticity of declaration information, and medium
CN114117153A (en) * 2022-01-25 2022-03-01 山东建筑大学 Online cross-modal retrieval method and system based on similarity relearning
CN114880514A (en) * 2022-07-05 2022-08-09 人民中科(北京)智能技术有限公司 Image retrieval method, image retrieval device and storage medium
CN115146488B (en) * 2022-09-05 2022-11-22 山东鼹鼠人才知果数据科技有限公司 Variable business process intelligent modeling system and method based on big data
CN115146488A (en) * 2022-09-05 2022-10-04 山东鼹鼠人才知果数据科技有限公司 Variable business process intelligent modeling system and method based on big data
CN115203442A (en) * 2022-09-15 2022-10-18 中国海洋大学 Cross-modal deep hash retrieval method, system and medium based on joint attention
CN116244483A (en) * 2023-05-12 2023-06-09 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN116662490A (en) * 2023-08-01 2023-08-29 山东大学 Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information
CN116662490B (en) * 2023-08-01 2023-10-13 山东大学 Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information
CN116825210A (en) * 2023-08-28 2023-09-29 山东大学 Hash retrieval method, system, equipment and medium based on multi-source biological data
CN116825210B (en) * 2023-08-28 2023-11-17 山东大学 Hash retrieval method, system, equipment and medium based on multi-source biological data
CN117056550A (en) * 2023-10-12 2023-11-14 中国科学技术大学 Long-tail image retrieval method, system, equipment and storage medium
CN117056550B (en) * 2023-10-12 2024-02-23 中国科学技术大学 Long-tail image retrieval method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
CN111382272B (en) Electronic medical record ICD automatic coding method based on knowledge graph
CN108334574B (en) Cross-modal retrieval method based on collaborative matrix decomposition
CN110222140A (en) A kind of cross-module state search method based on confrontation study and asymmetric Hash
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
CN113095415B (en) Cross-modal hashing method and system based on multi-modal attention mechanism
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111680176A (en) Remote sensing image retrieval method and system based on attention and bidirectional feature fusion
CN111475622A (en) Text classification method, device, terminal and storage medium
CN111159407A (en) Method, apparatus, device and medium for training entity recognition and relation classification model
CN109829065B (en) Image retrieval method, device, equipment and computer readable storage medium
CN112949740B (en) Small sample image classification method based on multilevel measurement
CN111400494B (en) Emotion analysis method based on GCN-Attention
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN113806580B (en) Cross-modal hash retrieval method based on hierarchical semantic structure
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN112860930B (en) Text-to-commodity image retrieval method based on hierarchical similarity learning
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN111026887A (en) Cross-media retrieval method and system
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN115130538A (en) Training method of text classification model, text processing method, equipment and medium
CN108805280B (en) Image retrieval method and device
CN116310339A (en) Remote sensing image segmentation method based on matrix decomposition enhanced global features

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant