CN114969423A - Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment - Google Patents


Info

Publication number
CN114969423A
CN114969423A
Authority
CN
China
Prior art keywords
image
text
similarity
features
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210718696.6A
Other languages
Chinese (zh)
Inventor
孟铃涛
张飞飞
徐常胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202210718696.6A priority Critical patent/CN114969423A/en
Publication of CN114969423A publication Critical patent/CN114969423A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image-text cross-modal retrieval model, method, and computer device based on locally shared semantic centers. Region features of an image and word-level features of a text are first extracted. A group of trainable semantic centers shared by images and texts is then defined; the similarity between each local feature and every semantic center is calculated, and the local features are assigned to the semantic centers according to these similarities, yielding multiple semantically aligned image representations and text representations. A bi-GRU models the weights of the image region features and the text word features at multiple levels, producing multi-level global representations that integrate the local features. The local similarity between an image and a text is calculated from the semantically aligned image and text representations, and the global similarity is calculated from their multi-level global representations. The invention effectively improves the accuracy of image-text cross-modal retrieval.

Description

Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment
Technical Field
The invention belongs to the field of image text cross-modal retrieval, and particularly relates to an image text cross-modal retrieval model and method based on a local shared semantic center and computer equipment.
Background
Image-text cross-modal retrieval aims to use data in one modality to retrieve data in another modality with the same semantics. It is an important research direction spanning machine vision, natural language processing, and multi-modal learning, and has become an active research topic. In recent years, with the development of deep learning, image-text cross-modal retrieval has achieved excellent performance. The task nevertheless remains challenging, because it requires not only a deep understanding of the semantics of images and texts, but also the ability to capture semantic correspondences across the modality gap.
To address these challenges, current methods focus on fine-grained correspondence between images and texts, highlighting important semantic knowledge through local alignment so as to understand images and texts more comprehensively. However, these methods ignore the heavy computational burden that local alignment incurs. Reducing the scale of local-feature interaction while still fully understanding images and texts is therefore important for image-text cross-modal retrieval.
Recently, clustering-based learning methods have succeeded in capturing semantic representations common to features. However, most current clustering-based feature learning focuses on global representations, ignoring fine-grained local information, and therefore cannot fully meet the challenges of image-text cross-modal retrieval. The invention therefore designs cluster centers shared by images and texts and adopts a soft assignment strategy to achieve fine-grained alignment between images and texts, thereby deeply understanding their semantic correspondence and improving retrieval efficiency.
Disclosure of Invention
The invention aims to express the semantic commonality of the local features of images and texts with a group of trainable semantic centers shared by both modalities, and to achieve fine-grained image-text alignment through these semantic centers, thereby mining deep image and text semantics while avoiding direct interaction between the local features and reducing the computation scale. Global alignment serves as a complement to local alignment, establishing cross-modal semantic correspondence between images and texts from multiple perspectives and summarizing semantic information more comprehensively. The technical scheme of the invention is as follows:
An image-text cross-modal retrieval model based on locally shared semantic centers is obtained through the following steps:
S1, extracting the region features of the image and the word-level features of the text, and then obtaining, through two independent mappings for each modality, the image features and text features used for local alignment and global alignment respectively;
S2, clustering the image features and text features from step S1 to obtain k initialized shared semantic centers;
S3, calculating the similarity between the image and text features from step S1 and the shared semantic centers from step S2, and using this similarity to aggregate the image features into k image semantic representations, one per shared semantic center, and the text features into k text semantic representations, one per shared semantic center;
S4, modeling the pooling operation over the image region features and the text word-level features from step S1 to obtain the image global representation and the text global representation;
S5, calculating the local similarity between the image and the text from the image and text semantic representations that share the same semantic center in step S3, calculating the global similarity from the image and text global representations in step S4, and expressing the overall similarity of the image and the text as the weighted sum of the local similarity and the global similarity, completing the modeling.
S6, training the image-text cross-modal retrieval model with the overall similarity, and performing real-time image-text cross-modal retrieval with the trained model.
As a preferred technical solution, the specific process of extracting the image and text features in step S1 includes:
Step S1-1, extracting the region features of the image with a pre-trained Faster-RCNN, and mapping the extracted region features through two independent multilayer perceptrons to obtain two groups of image features $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$;
Step S1-2, splitting the input text sentence into words, padding with zeros to a fixed word length, feeding the split and padded text into a pre-trained Bert to obtain word-level feature representations, and mapping them through two independent multilayer perceptrons to obtain two groups of text features $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$.
as a preferred technical solution, the specific process of initializing the semantic center in step S2 includes:
step S2-1, randomly sampling image features and text features in a training data set;
step S2-2, carrying out K-means clustering on the randomly sampled image features and text features to obtain K initialized clustering centers
Figure BDA0003710485310000026
And k < n;
and step S2-3, defining the initialized clustering center C as a trainable shared semantic center, and training the clustering center C along with the model.
As a preferred technical solution, the specific process of obtaining the semantically aligned image and text representations in step S3 includes:
Step S3-1, for the image features $V^l$ from step S1-1 and the text features $T^l$ from step S1-2, computing the cosine distances to the shared semantic centers C from step S2-3 to obtain an image-to-center similarity matrix and a text-to-center similarity matrix, and applying a softmax operation to each to obtain normalized similarity matrices;
Step S3-2, using the values of the normalized similarity matrices from step S3-1 as the weights of the image features $V^l$ from step S1-1 and the text features $T^l$ from step S1-2, so that the weighted sum of features within each modality gives the image features and text features corresponding to each semantic center; since the number of semantic centers in step S2-3 is k, the number of image features and text features aligned to the semantic centers is also k.
As a preferred technical solution, the specific process of obtaining the global representations of the image and text in step S4 includes:
Step S4-1, applying different pooling operations, such as maximum pooling, second-value pooling, and minimum pooling, to the image features $V^g$ from step S1-1 and the text features $T^g$ from step S1-2 to obtain pooled image features and text features;
Step S4-2, modeling the pooled features of the image and of the text with a bi-GRU to solve for the coefficients of the optimal pooling, and then obtaining the global features of the image and the text according to the solved optimal pooling strategy.
As a preferred technical solution, the specific process of calculating the image-text similarity and training the model in step S5 includes:
S5-1, fine-grained knowledge in the image and the text has been aligned by step S3-2; the local similarity between the image and the text for a given semantic center is represented by the cosine distance between the image feature and the text feature aligned to that same semantic center in step S3-2, and the sum of the local similarities over all semantic centers is taken as the local similarity between the image and the text;
S5-2, the global similarity between the image and the text is represented by the cosine similarity of the global features of the image and the text from step S4-2;
S5-3, finally, the overall similarity is expressed as the weighted sum of the local similarity and the global similarity, and the model is trained with a triplet ranking loss based on the overall similarity.
As a preferred technical solution, the process of image-text cross-modal retrieval in step S6 includes:
For any group of image-text pairs, first extracting the image and text features with the feature-extraction method of step S1, then obtaining the local and global features of the image and of the text according to steps S3 and S4, performing local alignment and global alignment of the image and text on the extracted features according to the method of step S5, calculating the image-text similarity, and obtaining the retrieval result.
A computer device in which are provided the instructions or program of the above image-text cross-modal retrieval model based on locally shared semantic centers, or the instructions or program of the above image-text cross-modal retrieval method.
The invention has the beneficial effects that:
(1) The method solves the problems that local alignment in traditional image-text cross-modal retrieval is computationally expensive and that fine-grained relations between images and texts cannot be deeply explored.
(2) In the image-text cross-modal retrieval method based on locally shared semantic centers, by learning a group of trainable semantic centers shared by images and texts, the local features of the image and the text can be aligned indirectly through the semantic centers, deeply mining the semantic relationship between image and text while reducing the interaction cost of local alignment.
(3) The invention applies soft assignment to the matching problem of clustering; soft assignment makes the weight coefficients smooth and differentiable, so the cluster centers can be trained end-to-end with the model, producing reliable shared semantic centers.
(4) On the basis of local alignment, the alignment of global features serves as auxiliary information to promote semantic matching between images and texts; understanding the image-text relationship from both local and global perspectives improves retrieval performance.
Drawings
FIG. 1 is a flow chart of image text cross-modal retrieval based on a locally shared semantic center.
Detailed Description
The method first extracts the region features of the image and the word-level features of the text, and obtains, through two independent mappings for each modality, the image features and text features used for local alignment and global alignment. A set of initialized cluster centers is obtained by clustering; these centers are defined as trainable shared semantic centers and updated during network training. The image and text features are aligned to the corresponding shared semantic centers according to their cosine distances to the centers, yielding image local features and text local features equal in number to the shared semantic centers. The global features of the image and the text are calculated by modeling the pooling operation over the image and text features. Local alignment is performed with the local features of the image and text, global alignment with their global features, giving a multi-perspective image-text similarity, and the model is trained with a triplet ranking loss.
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1 is a flowchart of the image-text cross-modal retrieval method based on locally shared semantic centers according to the present invention. The method first extracts the image and text features, then defines a group of trainable shared semantic centers and computes the semantically aligned local features of the image and text according to their relation to the shared centers, obtains the global features of the image and text by modeling the pooling of their features, computes the overall image-text similarity from local alignment and global alignment, and finally trains with a triplet ranking loss. The method specifically comprises the following steps:
s1, extracting image text features: and extracting the regional characteristics of the image and the word-level characteristics of the text respectively, and then obtaining the image characteristics and the text characteristics for local alignment and global alignment respectively through two layers of independent mapping.
The method specifically comprises the following steps: the region features of the image are extracted with a pre-trained Faster-RCNN and mapped by two independent multilayer perceptrons, $\mathrm{MLP}_{Vl}$ and $\mathrm{MLP}_{Vg}$, to obtain two groups of image features $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$. The input text sentence is then split into words, padded with zeros to a fixed word length, and fed into a pre-trained Bert to obtain word-level feature representations, which are mapped by two independent multilayer perceptrons, $\mathrm{MLP}_{Tl}$ and $\mathrm{MLP}_{Tg}$, to obtain two groups of text features $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$.
s2, initializing a shared semantic center: and performing K-Means clustering on the image features and the text features in the step S1 to obtain K initialized shared semantic centers.
The method specifically comprises the following steps: image features and text features are first randomly sampled in the training data set to obtain a number of untrained image and text features; K-means clustering is then performed on the randomly sampled image and text features to obtain k initialized cluster centers $C=\{c_i\}_{i=1}^{k}$, where $k<m$ and $k<n$; the initialized cluster centers C are then defined as trainable shared semantic centers whose parameters are updated along with network training.
S3, learning the semantically aligned representations of the image and text: calculating the similarity between the image and text features from step S1 and the shared semantic centers from step S2, and using this similarity to aggregate the image features into k image semantic representations corresponding to the shared semantic centers and the text features into k text semantic representations corresponding to the shared semantic centers.
The method specifically comprises the following steps: for the image features $V^l$ and text features $T^l$ from step S1, the cosine distances to the shared semantic centers C from step S2 are computed, giving an image-to-center similarity matrix and a text-to-center similarity matrix, and a softmax operation is applied to each to obtain normalized similarity matrices.
The values of the normalized similarity matrices are then used as the weights of the image features $V^l$ and text features $T^l$ from step S1; the weighted sum of features within each modality gives the image features and text features corresponding to each semantic center. Since the number of semantic centers in step S2 is k, the number of image features and of text features aligned to the semantic centers is also k.
S4, learning the global representations of the image and text: modeling, with a bi-GRU, the pooling of the image region features and text word-level features from step S1 to obtain the optimal image global representation and text global representation.
The method specifically comprises the following steps: different pooling operations, such as maximum pooling, second-value pooling, and minimum pooling, are applied to the image features $V^g$ and text features $T^g$ from step S1 to obtain pooled image features and text features.
The pooled features of the image and of the text are then each modeled with a bi-GRU to solve for the coefficients of the optimal pooling, and the global features of the image and the text are obtained according to the solved optimal pooling strategy.
S5, calculating the similarity of image texts: the local similarity of the image text is calculated by using the image semantic representation and the text semantic representation having the same shared semantic center in the step S3, the global similarity of the image text is calculated by using the image global representation and the text global representation in the step S4, and the overall similarity of the image and the text is expressed by a weighted sum of the local similarity and the global similarity.
The method specifically comprises the following steps: fine-grained knowledge in the image and the text has been aligned by step S3; the local similarity between the image and the text for a given semantic center is represented by the cosine distance between the image feature and the text feature aligned to that same semantic center in step S3, and the sum of the local similarities over all semantic centers is taken as the local similarity between the image and the text.
The global similarity between the image and the text is represented by the cosine similarity of the global features from step S4. The overall similarity of the image and the text is calculated as the weighted sum of the local similarity and the global similarity, and the model is finally trained with a triplet ranking loss based on the overall similarity.
The present invention will be explained below with reference to specific examples. The implementation of the present invention includes the model building and training process and the image text retrieval process, which are described in detail below.
1. The model building and training process comprises the following steps:
1.1 feature extraction Process for image text
The region features of the image and the word-level features of the text are extracted with a pre-trained Faster R-CNN and a pre-trained Bert, respectively; to support the subsequent alignment of images and texts from both local and global perspectives, the image features and text features are further mapped with two independent multilayer perceptrons.
1.1.1 feature extraction of images
Given an image I, a pre-trained Faster R-CNN is used to detect the regions $r_i$ in the image and extract the feature $f_i$ of each region $r_i$. Two independent multilayer perceptrons then map the region features $f_i$:
$v^l_i = \mathrm{MLP}_{Vl}(f_i)$ #(1)
$v^g_i = \mathrm{MLP}_{Vg}(f_i)$ #(2)
In Eqs. (1) and (2), $\mathrm{MLP}_{Vl}$ and $\mathrm{MLP}_{Vg}$ denote two independent multilayer perceptrons, giving the image features used for local alignment and global alignment, $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$, respectively.
1.1.2 feature extraction of text
Given a text S, the text is first split into individual words with a segmentation tool and padded with zeros to a fixed length. The fixed-length word sequence $s_i$ is input to the pre-trained Bert to obtain word-level text features $z_i$. Two independent multilayer perceptrons then map the word-level features $z_i$:
$z_i = \mathrm{Bert}(s_i)$ #(3)
$t^l_i = \mathrm{MLP}_{Tl}(z_i)$ #(4)
$t^g_i = \mathrm{MLP}_{Tg}(z_i)$ #(5)
In Eqs. (4) and (5), $\mathrm{MLP}_{Tl}$ and $\mathrm{MLP}_{Tg}$ denote two independent multilayer perceptrons, giving the text features used for local alignment and global alignment, $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$, respectively.
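To make the two-branch mapping of Eqs. (1)-(5) concrete, the following PyTorch sketch projects precomputed Faster R-CNN region features and Bert word features into the local-alignment and global-alignment spaces with two independent multilayer perceptrons. It is a minimal illustration under assumptions made here (feature dimensions, hidden sizes, and the use of precomputed region features), not the patented implementation.

```python
import torch
import torch.nn as nn

class TwoBranchProjector(nn.Module):
    """Maps backbone features into local- and global-alignment spaces
    with two independent two-layer MLPs (illustrative dimensions)."""
    def __init__(self, in_dim, embed_dim=1024):
        super().__init__()
        self.mlp_local = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.mlp_global = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, feats):                 # feats: (batch, num_tokens, in_dim)
        return self.mlp_local(feats), self.mlp_global(feats)

# Region features f_i are assumed precomputed by a Faster R-CNN detector
# (e.g. 36 regions x 2048 dims per image); word features z_i are assumed to come
# from a pre-trained Bert encoder (768 dims per padded token).
image_proj = TwoBranchProjector(in_dim=2048)
text_proj = TwoBranchProjector(in_dim=768)

region_feats = torch.randn(8, 36, 2048)       # placeholder for Faster R-CNN outputs
word_feats = torch.randn(8, 40, 768)          # placeholder for Bert outputs (padded to 40 words)

V_l, V_g = image_proj(region_feats)           # Eqs. (1)-(2)
T_l, T_g = text_proj(word_feats)              # Eqs. (4)-(5)
```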
1.2 initialization of semantic centers
The image features $V^l$ and text features $T^l$ used for local alignment are first randomly sampled in the training data set to obtain a number of untrained image features and text features. K-means clustering is then performed on the randomly sampled features to obtain k initialized cluster centers $C=\{c_i\}_{i=1}^{k}$, where $k<m$ and $k<n$. The initialized cluster centers C are then defined as trainable shared semantic centers, whose parameters are updated along with network training.
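One possible realization of this initialization is sketched below with scikit-learn's KMeans: sampled local-alignment features are clustered into k centers, and the centers are wrapped in an nn.Parameter so they are trained end-to-end with the model. The sample size and the value of k are illustrative assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def init_shared_centers(sampled_feats: torch.Tensor, k: int = 16) -> nn.Parameter:
    """Cluster randomly sampled local features (images and texts pooled together)
    into k centers and expose them as trainable shared semantic centers."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(sampled_feats.detach().cpu().numpy())
    centers = torch.tensor(kmeans.cluster_centers_, dtype=torch.float32)
    return nn.Parameter(centers)              # shape (k, d), updated during training

# Example: a random mix of untrained image and text local features from the training set.
sampled_feats = torch.randn(1000, 1024)       # placeholder for sampled V_l / T_l rows
shared_centers = init_shared_centers(sampled_feats, k=16)
```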
1.3 aligned semantic representation of image text
Semantically aligned image context features and text context features are obtained from the semantic commonality between the image, the text, and the shared semantic centers. Because the local features of the image and of the text are aligned with respect to the shared semantic centers, the local similarity between an image and a text can be represented by their context features under the same shared semantic center.
1.3.1 obtaining aligned semantic representations of images
To obtain the image context features aligned with the shared semantic centers, the cosine similarity between the image features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} v_j^{l}}{\lVert c_i \rVert \, \lVert v_j^{l} \rVert}$ #(6)
In Eq. (6), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $v_j^{l}$ denotes the j-th image feature used for local alignment, and $s_{ij}$ denotes the cosine similarity between the i-th shared semantic center and the j-th image feature used for local alignment. A softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(7)
In Eq. (7), λ denotes a temperature coefficient and $a_{ij}$ denotes the normalized cosine similarity. Taking $a_{ij}$ as the weight of $v_j^{l}$, the image context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{v} = \sum_{j=1}^{m} a_{ij} v_j^{l}$ #(8)
In Eq. (8), $p_i^{v}$ denotes the image context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned image features $P^{v}=\{p_i^{v}\}_{i=1}^{k}$.
1.3.2 obtaining aligned semantic representations of text
As in step 1.3.1, to obtain the text context features aligned with the shared semantic centers, the cosine similarity between the text features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} t_j^{l}}{\lVert c_i \rVert \, \lVert t_j^{l} \rVert}$ #(9)
In Eq. (9), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $t_j^{l}$ denotes the j-th text feature used for local alignment, and $s_{ij}$ denotes the cosine similarity between the i-th shared semantic center and the j-th text feature used for local alignment. A softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(10)
In Eq. (10), λ denotes a temperature coefficient and $a_{ij}$ denotes the normalized cosine similarity. Taking $a_{ij}$ as the weight of $t_j^{l}$, the text context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{t} = \sum_{j=1}^{n} a_{ij} t_j^{l}$ #(11)
In Eq. (11), $p_i^{t}$ denotes the text context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned text features $P^{t}=\{p_i^{t}\}_{i=1}^{k}$.
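The soft-assignment aggregation of Eqs. (6)-(11) can be sketched as below. The softmax axis (over the k centers, so that each local feature is softly assigned to the centers) and the temperature value are assumptions made here; the description only fixes that a temperature-scaled softmax normalizes the cosine similarities before the weighted sum.

```python
import torch
import torch.nn.functional as F

def aggregate_by_centers(local_feats, centers, temperature=10.0):
    """Soft-assign local features to shared semantic centers and aggregate.

    local_feats: (num_tokens, d)  image region or text word features (V_l or T_l)
    centers:     (k, d)           trainable shared semantic centers
    returns:     (k, d)           one context feature per semantic center (Eq. 8 / 11)
    """
    sim = F.normalize(centers, dim=-1) @ F.normalize(local_feats, dim=-1).t()   # (k, n), Eq. (6)/(9)
    assign = torch.softmax(temperature * sim, dim=0)                            # Eq. (7)/(10), assumed over centers
    return assign @ local_feats                                                  # Eq. (8)/(11)

centers = torch.randn(16, 1024)            # k = 16 shared semantic centers
V_l = torch.randn(36, 1024)                # image local features
T_l = torch.randn(40, 1024)                # text local features
p_v = aggregate_by_centers(V_l, centers)   # (16, 1024) image context features
p_t = aggregate_by_centers(T_l, centers)   # (16, 1024) text context features
```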
1.4 Global representation of image text
Compared with local alignment, the global alignment of images and texts provides more general and comprehensive semantic information for understanding their shared semantics, so semantic alignment from the global perspective can be regarded as auxiliary information for image-text alignment.
1.4.1 extracting Global features of an image
Multiple pooling operations are applied to the image features used for global alignment from step 1.1.1, giving several pooled image representations:
$\psi_i^{v} = \mathrm{max}_i(V^{g})$ #(12)
In Eq. (12), $\psi_i^{v}$ denotes a pooling result of the image features and $\mathrm{max}_i$ denotes i-th value pooling of the features $V^{g}$; for example, when i = 1, $\mathrm{max}_1$ denotes maximum pooling of the image features and $\psi_1^{v}$ is the result of maximum pooling. To find the optimal pooling strategy, all pooling results are modeled with a bi-GRU so as to approximate maximum pooling, second-value pooling, average pooling, or more complex pooling results:
$h_i^{v} = \mathrm{biGRU}(e_i^{v})$ #(13)
In Eq. (13), $e_i^{v}$ denotes the position encoding of the image features and $h_i^{v}$ denotes the output of the bi-GRU for that position encoding. Each output $h_i^{v}$ is a d-dimensional feature whose dimension is mapped with a fully-connected layer, after which a softmax normalization is applied:
$\theta_i^{v} = \mathrm{softmax}(W_v h_i^{v} + b_v)$ #(14)
In Eq. (14), $W_v$ denotes the weight matrix of the fully-connected layer, $b_v$ its bias, and $\theta_i^{v}$ the weight coefficient corresponding to the i-th value pooling result. The global feature of the image is represented by the weighted sum of the pooling results:
$g_v = \sum_i \theta_i^{v} \psi_i^{v}$ #(15)
1.4.2 extracting Global representations of text
As in step 1.4.1, multiple pooling operations are applied to the text features used for global alignment from step 1.1.1, giving several pooled text representations:
$\psi_i^{t} = \mathrm{max}_i(T^{g})$ #(16)
In Eq. (16), $\psi_i^{t}$ denotes a pooling result of the text features and $\mathrm{max}_i$ denotes i-th value pooling of the features $T^{g}$; for example, when i = 1, $\mathrm{max}_1$ denotes maximum pooling of the text features and $\psi_1^{t}$ is the result of maximum pooling. To find the optimal pooling strategy, all pooling results are modeled with a bi-GRU so as to approximate maximum pooling, second-value pooling, average pooling, or more complex pooling results:
$h_i^{t} = \mathrm{biGRU}(e_i^{t})$ #(17)
In Eq. (17), $e_i^{t}$ denotes the position encoding of the text features and $h_i^{t}$ denotes the output of the bi-GRU for that position encoding, whose dimension is mapped with a fully-connected layer, after which a softmax normalization is applied:
$\theta_i^{t} = \mathrm{softmax}(W_t h_i^{t} + b_t)$ #(18)
In Eq. (18), $W_t$ denotes the weight matrix of the fully-connected layer, $b_t$ its bias, and $\theta_i^{t}$ the weight coefficient corresponding to the i-th value pooling result. The global feature of the text is represented by the weighted sum of the pooling results:
$g_t = \sum_i \theta_i^{t} \psi_i^{t}$ #(19)
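A sketch of the pooling described in Eqs. (12)-(19): per-dimension value pooling (the sorted values give maximum, second-value, ..., minimum pooling) combined with coefficients that a small bi-GRU predicts from position encodings. The position-encoding scheme and the dimensions are illustrative assumptions; the text only fixes the overall structure (sorted pooling results, bi-GRU, fully-connected layer, softmax, weighted sum).

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Weighted combination of i-th value pooling results, with weights
    produced by a bi-GRU over position encodings (Eqs. 12-15 / 16-19)."""
    def __init__(self, max_positions=64, pe_dim=32, hidden=32):
        super().__init__()
        self.pos_embed = nn.Embedding(max_positions, pe_dim)    # position encodings e_i
        self.bigru = nn.GRU(pe_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)                       # maps h_i to a scalar coefficient

    def forward(self, feats):                  # feats: (batch, n, d), V_g or T_g
        n = feats.size(1)
        # i-th value pooling: sort each dimension in descending order,
        # so row i holds the i-th largest values (max, second value, ..., min).
        pooled, _ = feats.sort(dim=1, descending=True)           # Eq. (12)/(16)
        pos = torch.arange(n, device=feats.device).unsqueeze(0).expand(feats.size(0), -1)
        h, _ = self.bigru(self.pos_embed(pos))                   # Eq. (13)/(17)
        theta = torch.softmax(self.fc(h), dim=1)                 # Eq. (14)/(18), weights over positions
        return (theta * pooled).sum(dim=1)                       # Eq. (15)/(19): weighted sum -> (batch, d)

pool_v, pool_t = LearnedPooling(), LearnedPooling()
g_v = pool_v(torch.randn(8, 36, 1024))   # image global features
g_t = pool_t(torch.randn(8, 40, 1024))   # text global features
```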
1.5 image text similarity calculation
Because the local features of the image and the text are aligned through the shared semantic centers, the local similarity between an image and a text can be calculated from the image context feature and the text context feature under the same shared semantic center; the global similarity between the image and the text is calculated from their global features and serves as auxiliary information to improve retrieval accuracy.
1.5.1 local similarity of image text
Steps 1.3.1 and 1.3.2 have extracted the image context features and text context features aligned through the shared semantics; the local similarity is expressed by the cosine similarity between the image and text context features:
$r_i(v,t) = \dfrac{(p_i^{v})^{\top} p_i^{t}}{\lVert p_i^{v} \rVert \, \lVert p_i^{t} \rVert}$ #(20)
In Eq. (20), $r_i(v,t)$ denotes the cosine similarity between the image context feature $p_i^{v}$ and the text context feature $p_i^{t}$ under the shared semantic center $c_i$. The sum of the similarities over all aligned semantic centers is taken as the local similarity between the image and the text:
$R_l(v,t) = \sum_{i=1}^{k} r_i(v,t)$ #(21)
1.5.2 Global similarity of image text
Steps 1.4.1 and 1.4.2 have extracted the global representations of the image and the text, respectively; the global similarity is expressed by the cosine similarity of the global representations of the image and the text:
$R_g(v,t) = \dfrac{g_v^{\top} g_t}{\lVert g_v \rVert \, \lVert g_t \rVert}$ #(22)
In Eq. (22), $R_g(v,t)$ denotes the cosine similarity between the global feature $g_v$ of the image and the global feature $g_t$ of the text.
1.5.3 Overall similarity of image text
The local similarity and global similarity between the image and the text are obtained in steps 1.5.1 and 1.5.2, and the overall similarity of the image and the text is determined by both:
$R(v,t) = \beta_1 R_l(v,t) + \beta_2 R_g(v,t)$ #(23)
In Eq. (23), $\beta_1$ and $\beta_2$ are hyper-parameters that determine the local-global ratio; in practice, setting $\beta_1$ to 0.2 and $\beta_2$ to 1 gives good results. Based on the obtained similarity, the model is trained with a triplet ranking loss:
$L = \sum_{(v,t)\in\mathcal{D}} \{[\Delta - R(v,t) + R(v,\hat{t})]_{+} + [\Delta - R(v,t) + R(\hat{v},t)]_{+}\}$ #(24)
In Eq. (24), Δ is a hyper-parameter, and in practice setting Δ to 0.15 gives good results; $(v,t)$ denotes a positive sample pair from the data set $\mathcal{D}$; $\hat{t}=\arg\max_{t'\neq t} R(v,t')$ denotes the hardest negative sample of v, i.e. the t' satisfying t' ≠ t that maximizes R(v,t'); $\hat{v}=\arg\max_{v'\neq v} R(v',t)$ denotes the hardest negative sample of t, i.e. the v' satisfying v' ≠ v that maximizes R(v',t); $[x]_{+}\equiv\max(0,x)$. The triplet ranking loss pulls positive sample pairs closer together, with t' and v' as intermediate variables.
2. Cross-modal retrieval process for image text
After the model has been fully trained on the training set, the similarity between any image to be tested and all texts in the test library is calculated with Eq. (23), and the text with the maximum similarity is retrieved as the result; given a piece of text to be tested, its similarity to all images in the test library is calculated with Eq. (23), and the image with the maximum similarity is retrieved as the result.
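For the retrieval step itself, a minimal sketch: given the precomputed similarities between a query and all candidates in the test library (computed with Eq. (23)), the candidate with the maximum similarity is returned; top-k retrieval is shown as well, since evaluation usually reports Recall@K. The function name is hypothetical.

```python
import torch

def retrieve(query_to_candidates_sim: torch.Tensor, topk: int = 1):
    """query_to_candidates_sim: (num_candidates,) similarities R(query, candidate).
    Returns the indices of the top-k most similar candidates."""
    return torch.topk(query_to_candidates_sim, k=topk).indices

# Example: an image query against a library of 1000 captions.
sims = torch.randn(1000)            # placeholder for R(v, t) over the test library
best_text = retrieve(sims, topk=1)  # text with maximum similarity
top5_texts = retrieve(sims, topk=5)
```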
In summary, the invention discloses an image-text cross-modal retrieval method based on locally shared semantic centers. The method performs cross-modal semantic alignment of images and texts from the perspectives of local alignment and global alignment. For local alignment, a set of semantic centers shared by images and texts is trained; these centers describe the semantic commonality of the local features of images and texts, so the local features of both modalities can be aligned with respect to the same semantic centers. This local-alignment scheme avoids the costly direct interaction of local features, reducing the computation of local alignment while still mining fine-grained semantic information. For global alignment, images and texts are expressed with global features, which capture more comprehensive semantic knowledge and serve as auxiliary information to improve the accuracy of cross-modal retrieval. The method thus addresses the redundant computation and low recall of local alignment in image-text cross-modal retrieval.
The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent means or modifications that do not depart from the technical spirit of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. An image text cross-modal retrieval model based on a local shared semantic center is characterized in that the model is obtained by the following steps:
s1, extracting the regional features of the image and the word-level features of the text respectively, and then obtaining the image features and the text features for local alignment and global alignment respectively through two layers of independent mapping;
s2, clustering the image features and the text features in S1 to obtain k initialized shared semantic centers;
s3, obtaining the image text alignment semantic representation: calculating the similarity between the image text characteristics in the S1 and the shared semantic center in the step S2, aggregating the image characteristics into k image alignment semantic representations corresponding to the shared semantic center by using the similarity, namely the image context characteristics, and aggregating the text characteristics into k text alignment semantic representations corresponding to the shared semantic center, namely the text context characteristics;
s4, modeling the pooling operation of the regional characteristics and the text word-level characteristics of the image in the step 1 to obtain image global representation and text global representation;
s5, calculating the local similarity of the image text by using the image semantic representation and the text semantic representation with the same shared semantic center in the step S3, calculating the global similarity of the image text by using the image global representation and the text global representation in the step S4, and expressing the overall similarity of the image and the text by using the weighted sum of the local similarity and the global similarity to complete the modeling.
2. The model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S1 includes:
S1.1 Feature extraction of the image
Given an image I, a pre-trained Faster R-CNN is used to detect the regions $r_i$ in the image and extract the feature $f_i$ of each region $r_i$; two independent multilayer perceptrons then map the region features $f_i$:
$v^l_i = \mathrm{MLP}_{Vl}(f_i)$ #(1)
$v^g_i = \mathrm{MLP}_{Vg}(f_i)$ #(2)
In Eqs. (1) and (2), $\mathrm{MLP}_{Vl}$ and $\mathrm{MLP}_{Vg}$ denote two independent multilayer perceptrons, giving the image features used for local alignment and global alignment, $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$, respectively;
S1.2 Feature extraction of the text
Given a text S, the text is first split into individual words with a segmentation tool and padded with zeros to a fixed length; the fixed-length word sequence $s_i$ is input to a pre-trained Bert to obtain word-level text features $z_i$, which two independent multilayer perceptrons then map:
$z_i = \mathrm{Bert}(s_i)$ #(3)
$t^l_i = \mathrm{MLP}_{Tl}(z_i)$ #(4)
$t^g_i = \mathrm{MLP}_{Tg}(z_i)$ #(5)
In Eq. (3), Bert denotes the pre-trained Bert network, $s_i$ the original input text, and $z_i$ the word-level text features extracted by Bert; in Eqs. (4) and (5), $\mathrm{MLP}_{Tl}$ and $\mathrm{MLP}_{Tg}$ denote two independent multilayer perceptrons, giving the text features used for local alignment and global alignment, $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$, respectively.
3. the model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S2 includes:
s2.1 pairing image features V for local alignment in a training dataset l And text feature T for local alignment l Random sampling is carried out to obtain a plurality of untrained samplesThe image features and the text features of (a),
s2.2, carrying out K-means clustering on the randomly sampled image features and text features to obtain K initialized clustering centers
Figure FDA0003710485300000025
k < m and k < n,
s2.3, the initialized clustering center C is defined as a trainable shared semantic center, and parameters of the shared semantic center are updated along with network training.
4. The model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S3 includes:
S3.1 Obtaining the aligned semantic representation of the image
To obtain the image context features aligned with the shared semantic centers, the cosine similarity between the image features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} v_j^{l}}{\lVert c_i \rVert \, \lVert v_j^{l} \rVert}$ #(6)
In Eq. (6), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $v_j^{l}$ the j-th image feature used for local alignment, and $s_{ij}$ the cosine similarity between them; a softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(7)
In Eq. (7), λ denotes a temperature coefficient and $a_{ij}$ the normalized cosine similarity; taking $a_{ij}$ as the weight of $v_j^{l}$, the image context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{v} = \sum_{j=1}^{m} a_{ij} v_j^{l}$ #(8)
In Eq. (8), $p_i^{v}$ denotes the image context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned image features $P^{v}=\{p_i^{v}\}_{i=1}^{k}$;
S3.2 Obtaining the aligned semantic representation of the text
As in step S3.1, to obtain the text context features aligned with the shared semantic centers, the cosine similarity between the text features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} t_j^{l}}{\lVert c_i \rVert \, \lVert t_j^{l} \rVert}$ #(9)
In Eq. (9), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $t_j^{l}$ the j-th text feature used for local alignment, and $s_{ij}$ the cosine similarity between them; a softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(10)
In Eq. (10), λ denotes a temperature coefficient and $a_{ij}$ the normalized cosine similarity; taking $a_{ij}$ as the weight of $t_j^{l}$, the text context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{t} = \sum_{j=1}^{n} a_{ij} t_j^{l}$ #(11)
In Eq. (11), $p_i^{t}$ denotes the text context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned text features $P^{t}=\{p_i^{t}\}_{i=1}^{k}$.
5. The model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S4 includes:
S4.1 Extracting the global features of the image
Multiple pooling operations are applied to the image features used for global alignment in step S1, giving several pooled image representations:
$\psi_i^{v} = \mathrm{max}_i(V^{g})$ #(12)
In Eq. (12), $\mathrm{max}_i$ denotes i-th value pooling of the features $V^{g}$ and $\psi_i^{v}$ the corresponding pooling result; all pooling results are modeled with a bi-GRU to approximate different pooling results:
$h_i^{v} = \mathrm{biGRU}(e_i^{v})$ #(13)
In Eq. (13), $e_i^{v}$ denotes the position encoding of the image features and $h_i^{v}$ the corresponding output of the bi-GRU, whose dimension is mapped with a fully-connected layer before a softmax normalization:
$\theta_i^{v} = \mathrm{softmax}(W_v h_i^{v} + b_v)$ #(14)
In Eq. (14), $W_v$ denotes the weight matrix of the fully-connected layer, $b_v$ its bias, and $\theta_i^{v}$ the weight coefficient corresponding to the i-th value pooling result; the global feature of the image is represented by the weighted sum of the pooling results:
$g_v = \sum_i \theta_i^{v} \psi_i^{v}$ #(15)
S4.2 Extracting the global representation of the text
As in step S4.1, multiple pooling operations are applied to the text features used for global alignment in step S1, giving several pooled text representations:
$\psi_i^{t} = \mathrm{max}_i(T^{g})$ #(16)
In Eq. (16), $\mathrm{max}_i$ denotes i-th value pooling of the features $T^{g}$ and $\psi_i^{t}$ the corresponding pooling result; all pooling results are modeled with a bi-GRU to approximate different pooling results:
$h_i^{t} = \mathrm{biGRU}(e_i^{t})$ #(17)
In Eq. (17), $e_i^{t}$ denotes the position encoding of the text features and $h_i^{t}$ the corresponding output of the bi-GRU, whose dimension is mapped with a fully-connected layer before a softmax normalization:
$\theta_i^{t} = \mathrm{softmax}(W_t h_i^{t} + b_t)$ #(18)
In Eq. (18), $W_t$ denotes the weight matrix of the fully-connected layer, $b_t$ its bias, and $\theta_i^{t}$ the weight coefficient corresponding to the i-th value pooling result; the global feature of the text is represented by the weighted sum of the pooling results:
$g_t = \sum_i \theta_i^{t} \psi_i^{t}$ #(19)
6. the model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S5 includes:
s5.1 local similarity of image texts
The local similarity is represented by cosine similarity of image text context features:
Figure FDA00037104853000000421
in the formula (20)
Figure FDA0003710485300000051
Represented in a shared semantic center c i Contextual features of the lower image
Figure FDA0003710485300000052
Contextual features with text
Figure FDA0003710485300000053
Cosine similarity between the images and the texts, taking the sum of the similarity of all aligned semantic centers as the local similarity of the images and the texts:
Figure FDA0003710485300000054
s5.2 Global similarity of image text
The global similarity is represented by the cosine similarity of the image global representation and the text global representation:
Figure FDA0003710485300000055
r in the formula (22) g (v, t) represents the global feature g of the image v And global features g of text t Cosine similarity between them;
s5.3 Overall similarity of image text
According to the local similarity and the global similarity between the image texts obtained in the steps S5.1 and S5.2, the overall similarity between the image and the text is determined by the local similarity and the global similarity:
R(v,t)=β 1 R l (v,t)+β 2 R g (v,t)#(23)
in the formula (23) < beta > 1 And beta 2 Is a hyper-parameter that determines the local global scale.
7. The image text cross-modal retrieval model based on the local shared semantic center according to any one of claims 1 to 6, characterized in that the image text cross-modal retrieval model is trained by using the overall similarity; specifically:
The model is trained with a triplet ranking loss based on the obtained overall similarity:
$L = \sum_{(v,t)\in\mathcal{D}} \{[\Delta - R(v,t) + R(v,\hat{t})]_{+} + [\Delta - R(v,t) + R(\hat{v},t)]_{+}\}$ #(24)
In Eq. (24), Δ is a hyper-parameter, $(v,t)$ denotes a positive sample pair from the data set $\mathcal{D}$, $\hat{t}$ denotes the hardest negative sample of v, $\hat{v}$ denotes the hardest negative sample of t, and $[x]_{+}\equiv\max(0,x)$; the triplet ranking loss pulls positive sample pairs closer together.
8. The image text cross-modal retrieval method based on the image text cross-modal retrieval model with the locally shared semantic center according to any one of claims 1 to 6, characterized in that any image to be tested is input into the model according to any one of claims 1 to 6, the overall similarity between the image and all texts in the model test library is calculated, and the text with the maximum similarity is retrieved as the retrieval result; and for any piece of text to be tested, the similarity between the text and all images in the test library is calculated, and the image with the maximum similarity is retrieved as the retrieval result.
9. A computer device, characterized in that the computer device is provided with the instruction or program of the image text cross-modal retrieval model based on the local shared semantic center according to any one of claims 1 to 6 or the instruction or program of the image text cross-modal retrieval method according to claim 8.
CN202210718696.6A 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment Pending CN114969423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210718696.6A CN114969423A (en) 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210718696.6A CN114969423A (en) 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment

Publications (1)

Publication Number Publication Date
CN114969423A true CN114969423A (en) 2022-08-30

Family

ID=82965490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210718696.6A Pending CN114969423A (en) 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment

Country Status (1)

Country Link
CN (1) CN114969423A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning

Similar Documents

Publication Publication Date Title
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
WO2023093574A1 (en) News event search method and system based on multi-level image-text semantic alignment model
CN108733742B (en) Global normalized reader system and method
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111291188B (en) Intelligent information extraction method and system
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN110598005A (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN112687388B (en) Explanatory intelligent medical auxiliary diagnosis system based on text retrieval
CN110619121B (en) Entity relation extraction method based on improved depth residual error network and attention mechanism
CN106909537B (en) One-word polysemous analysis method based on topic model and vector space
CN113553440B (en) Medical entity relationship extraction method based on hierarchical reasoning
CN112256866B (en) Text fine-grained emotion analysis algorithm based on deep learning
CN113486667A (en) Medical entity relationship joint extraction method based on entity type information
CN110647904A (en) Cross-modal retrieval method and system based on unmarked data migration
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
CN114780690B (en) Patent text retrieval method and device based on multi-mode matrix vector representation
CN111984791A (en) Long text classification method based on attention mechanism
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Xiong et al. An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval
CN112860930A (en) Text-to-commodity image retrieval method based on hierarchical similarity learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination