CN116150418A - Image-text matching method and system based on mixed focusing attention mechanism

Info

Publication number: CN116150418A (application number CN202310424288.4A)
Authority: CN (China)
Prior art keywords: word, features, image, attention mechanism, feature
Legal status: Granted, currently active
Other languages: Chinese (zh)
Other versions: CN116150418B
Inventors: 鲍秉坤, 叶俊杰, 邵曦
Current and original assignee: Nanjing University of Posts and Telecommunications

Events:
    • Application filed by Nanjing University of Posts and Telecommunications; priority to CN202310424288.4A
    • Publication of CN116150418A
    • Application granted; publication of CN116150418B

Classifications

    • G06F16/532 Information retrieval of still image data; query formulation, e.g. graphical querying
    • G06F16/332 Information retrieval of unstructured textual data; query formulation
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/383 Retrieval of textual data characterised by metadata automatically derived from the content
    • G06F16/583 Retrieval of still image data characterised by metadata automatically derived from the content
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V2201/07 Indexing scheme relating to image or video recognition or understanding: target detection
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an image-text matching method and system based on a mixed focusing attention mechanism. The method comprises the following steps: S1, extracting the features of the salient regions in an image and the features of each word in the natural language description; S2, using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, thereby distinguishing effective from ineffective regional features; S3, realizing intra-modal fusion of the regional features and the word features with a gated self-attention mechanism, in which a gating signal controls the self-attention matrix so as to adaptively select the effective regional features and word features; and S4, calculating the matching score of the whole image and the sentence from the cross-modal and intra-modal regional features and word features. The invention enables mutual retrieval between images and texts.

Description

Image-text matching method and system based on mixed focusing attention mechanism
Technical Field
The invention belongs to the intersection of computer vision and natural language processing, and particularly relates to a method for computing the degree of matching between images and text.
Background
Images and text are the main media through which information propagates on the Internet, and they permeate people's daily lives. An image is visual data and is inherently different from natural language data such as text. Although the two modalities differ, in many scenarios the contents conveyed by an image and a text are closely related: an image and a sentence of natural language description usually share an intrinsic semantic association, and mining this association has great application prospects and value for achieving semantic alignment between images and natural language. By mining the similarity score between an image and a natural language text, semantically matching image-text pairs can be found, which greatly advances current text-to-image and image-to-text search and helps users find more valuable information on the Internet; this is the research value and significance of image-text matching.
An image-text matching method needs to score the degree of matching between a given image and a natural language description, so understanding the content of both the image and the description is the key to determining the matching score: only when an image-text matching method understands the content of the image and the text can it judge their degree of matching more accurately and comprehensively. To achieve fine-grained matching between images and texts, traditional image-text matching methods often use a pre-trained object detector to extract the salient regions in the image, and for the natural language description they extract the feature of each word in the sentence by sequence modeling. Matching the global information of the whole image and the whole description is thus converted into matching the local information of regions and words, and the degree of matching between image and text is computed bottom-up.
The above methods currently still face the following two challenges: (a) Redundant/noisy information exists: conventional image-text matching models often use a fixed number (typically 36) of region features extracted from the image in advance, but some regions contain no information related to the text (noise features), and some regions overlap to a certain degree (redundant features). (b) The image-text matching model cannot distinguish useful from useless information: a single-modal self-attention mechanism does not always focus on whether a region is useful, and existing cross-modal attention mechanisms usually use only one temperature coefficient for all regions in all pictures and cannot assign different temperature coefficients to different pictures.
Disclosure of Invention
The invention aims to solve the following technical problem: in the process of mutual retrieval between pictures and texts, how to remove redundant/noisy region information in the image and how to construct cross-modal and intra-modal attention mechanisms, so that the image-text matching method does not pay excessive attention to the redundant/noisy region information.
In order to solve the technical problems, the invention provides an image-text matching method based on a mixed focusing attention mechanism, which comprises the following steps:
s1, extracting characteristics of a salient region in an image and characteristics of each word in natural language description;
s2, utilizing a focused cross-modal attention mechanism to adaptively adjust temperature coefficients of the attention mechanism on different pictures, so as to distinguish effective and ineffective regional features and realize cross-modal context extraction and fusion of regional-level and word-level features;
s3, realizing intra-modal fusion of the regional features and the word features by using a gating self-attention mechanism, controlling a self-attention matrix to adaptively select effective regional features and word features by using a gating signal, masking noise and redundant regions, and enhancing the distinguishing degree of the different regional features and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
The image-text matching method based on the mixed focusing attention mechanism further comprises the following steps:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
In the foregoing image-text matching method based on the hybrid focusing attention mechanism, step S1 comprises two sub-steps:

Step S11, a pre-trained Faster R-CNN object detector is used to detect the K most salient regions in the image and extract the corresponding feature of each region; the features are then mapped into a d-dimensional hidden space, and the resulting region features are denoted V = {v_1, …, v_K}, where every element of a feature vector v_i is a real number, d denotes the feature dimension, R denotes the real number field, and R^d denotes the set of d-dimensional real vectors, so that v_i ∈ R^d.

Step S12, for a natural language description containing T words, a bidirectional gated recurrent unit (Bi-GRU) is adopted to extract the feature of each word. The forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state produced when reading each word:

    h_t^f = GRU_f(w_t, h_{t-1}^f)

where h_t^f denotes the hidden state of the forward pass, w_t denotes the one-hot code of the t-th word, and GRU_f denotes the forward pass of the Bi-GRU.

The backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state produced when reading each word:

    h_t^b = GRU_b(w_t, h_{t+1}^b)

where h_t^b denotes the hidden state of the backward pass and GRU_b denotes the backward pass of the Bi-GRU.

The word feature e_t is obtained by averaging the hidden state h_t^f of the forward pass and the hidden state h_t^b of the backward pass, namely:

    e_t = (h_t^f + h_t^b) / 2

A linear layer maps the word features into the d-dimensional hidden space; the result is denoted E = {e_1, …, e_T}, where d denotes the feature dimension.
In the foregoing image-text matching method based on the hybrid focusing attention mechanism, step S2 comprises two sub-steps.

Step S21, given the image region features V and the word features E of the description, their average features are computed separately and recorded as the image region average feature v_avg and the word average feature e_avg. With the image region average feature v_avg and the word average feature e_avg as query objects, the attention score of each region and each word is computed:

    a_i^v = u_1 · (W_1 v_i ⊙ W_2 v_avg)
    a_j^e = u_2 · (W_3 e_j ⊙ W_4 e_avg)

where a_i^v denotes the attention score of the image region average feature v_avg with respect to the i-th image region feature v_i, a_j^e denotes the attention score of the word average feature e_avg with respect to the j-th word feature e_j, W_1, W_2, W_3 and W_4 are the first, second, third and fourth parameter matrices, u_1 and u_2 are parameter vectors, and ⊙ denotes element-wise multiplication. The region and word features are weighted and summed by the attention scores to obtain the global features of the image and the text, namely:

    v_glob = Σ_i a_i^v v_i,    e_glob = Σ_j a_j^e e_j

where v_glob denotes the global feature of the image and e_glob denotes the global feature of the sentence description.

For a batch of images of size B, the focusing degree f_i of the current text description on the i-th image is computed as:

    f_i = σ( u_f · [v_glob^(i) ; e_glob] )

where u_f is a parameter vector, [· ; ·] denotes the concatenation of two feature vectors, v_glob^(i) denotes the global feature of the i-th image in the batch, and σ is the sigmoid activation function; this yields the focusing degree f_i of the current text description on the i-th image.

Step S22, after obtaining the region features V of the i-th image, the word features E of the text description, and the focusing score f_i of the description on the i-th image, the similarity score s_jk of each word to each region is computed through local word-region interaction, namely:

    s_jk = e_j v_k^T

where T denotes the transpose. Applying L2 normalization to the similarity scores yields the normalized similarity ŝ_jk, which represents the degree of similarity between the j-th word and the k-th region.

The attention score is given by:

    a_jk = exp(λ f_i ŝ_jk) / Σ_{k'} exp(λ f_i ŝ_jk')

where λ is the temperature coefficient of the attention mechanism. The attention scores of each word over the regions are used to weight and sum the region features, yielding the cross-modal context feature c_j corresponding to each word, namely:

    c_j = Σ_k a_jk v_k

A linear layer then fuses the j-th word feature e_j with its corresponding cross-modal context feature c_j, namely:

    m_j = FC_m([e_j ; c_j])

where m_j denotes the feature obtained after fusing the information of the two modalities and FC_m is a linear layer.

The global feature v_glob of the image obtained in step S21 and the global feature e_glob of the sentence description are fused into the fused global feature m_0, namely:

    m_0 = FC_m([v_glob ; e_glob])

The fused global feature m_0 and the fusion feature m_j corresponding to each word are merged and recorded as the multi-modal feature M = {m_0, m_1, …, m_T}.
In the foregoing image-text matching method based on the hybrid focusing attention mechanism, in step S3 the attention coefficient matrix is computed as follows:

    A = FC_q(M) FC_k(M)^T

where FC_q and FC_k denote two linear layers with different parameters.

The gating signal g is computed as:

    g = σ( M u_g )

where σ is an activation function and u_g is a learnable parameter vector. The gating signal g contains one scalar per multi-modal feature, and each scalar element g_j is regarded as the importance of the corresponding feature. Before the rows of the attention matrix A are normalized with softmax, the gating scores are separated into important and unimportant features by a threshold, i.e., each g_j is fixed to a hard score:

    g_j = β_low  if g_j < τ;    g_j = β_high  if g_j ≥ τ

where τ is the threshold, β_low is the score assigned to unimportant local features, and β_high is the score assigned to important local features.

The gating vector is expressed as g = [g_0, g_1, …, g_T]. The j-th gating signal g_j weights the j-th column of the attention score matrix A, expressed as:

    A'_ij = g_j · A_ij

where A_ij denotes an element of the attention score matrix A. The rows of the gated attention matrix A' are then normalized with the softmax function.

The updated global feature is obtained by a weighted sum of the multi-modal features with the attention scores, namely:

    Z = σ( softmax(A') FC_v(M) )

where σ is an activation function, FC_v is a linear layer, and M is the multi-modal feature matrix obtained in the previous step.

The feature matrix updated by the gated self-modal attention mechanism is recorded as Z = {z_0, z_1, …, z_T}, where z_0 denotes the updated global feature and z_1, …, z_T denote the updated local features.
In the foregoing image-text matching method based on the hybrid focusing attention mechanism, in step S4 the score of the current image-text pair is predicted with a linear layer from the updated features Z obtained in step S3, expressed as:

    S(I, T) = σ( FC_s(z_0) )

where σ is the sigmoid activation function, FC_s denotes a linear layer, and S(I, T) denotes the matching score between the image I and the text description T.
In the foregoing image-text matching method based on the hybrid focusing attention mechanism, in step S5 the triplet loss function L is expressed as:

    L = [γ − S(I, T) + S(I, T')]_+ + [γ − S(I, T) + S(I', T)]_+

where [x]_+ = max(x, 0), γ is a threshold, and T' and I' are the first and second hardest negative samples, respectively.

The formula for optimizing all the linear layers with the triplet loss function is:

    θ_new = θ − η · ∂L/∂θ,    θ ∈ {θ_1, θ_2}

where θ_1 and θ_2 denote the first and second parameter scalars inside a linear layer, respectively, θ is a parameter scalar before optimization, θ_new is the optimized parameter scalar, η is the learning rate, and ∂L/∂θ denotes the gradient computation.
An image-text matching system based on a mixed focusing attention mechanism comprises the following functional modules:
and a characteristic extraction module of the image-text pair: extracting the characteristics of the salient region in the image and the characteristics of each word in the natural language description;
cross-modal attention mechanism module: the focused cross-modal attention mechanism is utilized to adaptively adjust the temperature coefficients of the attention mechanism to different pictures, so that effective and ineffective regional features are distinguished, and cross-modal context extraction and fusion of regional-level and word-level features are realized;
gated self-attention mechanism module: the intra-modal fusion of the regional features and the word features is realized by using a gating self-attention mechanism, the effective regional features and word features are adaptively selected by controlling the self-attention matrix through a gating signal, noise and redundant regions are covered, and the distinguishing degree of the different regional features and word features is enhanced;
and a matching score calculation module: the cross-modal and self-modal region features and word features are used to calculate the matching score for the entire image and sentence.
The foregoing image-text matching system based on the hybrid focusing attention mechanism further comprises:
loss function optimizing module: and optimizing all linear layers in the feature extraction module, the modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module of the image-text pair by using the triple loss function, and executing the working processes of the feature extraction module, the modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module of the image-text pair after optimizing.
A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as described above.
The invention has the following beneficial effects: it can automatically judge whether the content of a given image is consistent with a given natural language description and obtain a matching score, and can be used for cross-modal retrieval on the Internet, i.e., retrieving the text corresponding to a given image or retrieving the image corresponding to a given text; in the process of image-text matching it can adaptively filter and compress redundant or noisy region features, thereby better realizing mutual retrieval between images and texts.
Drawings
Fig. 1 is a flow chart of a graph-text matching method based on a mixed focus attention mechanism.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings.
Example 1
As shown in fig. 1, the present invention provides a graph-text matching method based on a hybrid focusing attention mechanism, which comprises the following steps:
s1, extracting characteristics of image-text pairs, namely extracting characteristics of a salient region in an image and characteristics of each word in natural language description;
s2, utilizing a focused cross-modal attention mechanism to adaptively adjust temperature coefficients of the attention mechanism on different pictures, so as to distinguish effective and ineffective regional features and realize cross-modal context extraction and fusion of regional-level and word-level features;
s3, realizing intra-modal fusion of the regional features and the word features by using a gating self-attention mechanism, controlling a self-attention matrix to adaptively select effective regional features and word features by using a gating signal, masking noise and redundant regions, and enhancing the distinguishing degree of the different regional features and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
In step S1, two sub-steps are included: extracting the salient region features in the image, and extracting the word features in the natural language description.

Step S11, region features are extracted from the image. A pre-trained Faster R-CNN object detector is used to detect the K most salient regions in the image (a typical value of K is 36) and extract the corresponding feature of each region; the features are then mapped into a d-dimensional hidden space, and the resulting region features are denoted V = {v_1, …, v_K}, where every element of a feature vector v_i is a real number, d denotes the feature dimension, R denotes the real number field, and R^d denotes the set of d-dimensional real vectors, so that v_i ∈ R^d.

Step S12, word features are extracted from the text. For a natural language description containing T words, a bidirectional gated recurrent unit (Bi-GRU) is adopted to extract the feature of each word. The forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state produced when reading each word:

    h_t^f = GRU_f(w_t, h_{t-1}^f)

where h_t^f denotes the hidden state of the forward pass, w_t denotes the one-hot code of the t-th word, and GRU_f denotes the forward pass of the Bi-GRU.

Then, the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state produced when reading each word:

    h_t^b = GRU_b(w_t, h_{t+1}^b)

where h_t^b denotes the hidden state of the backward pass and GRU_b denotes the backward pass of the Bi-GRU.

Finally, the word feature e_t is obtained by averaging the hidden state h_t^f of the forward pass and the hidden state h_t^b of the backward pass, namely:

    e_t = (h_t^f + h_t^b) / 2

A linear layer maps the word features into the d-dimensional hidden space; the result is denoted E = {e_1, …, e_T}, where d denotes the feature dimension.
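A minimal PyTorch sketch of step S1 is given below. It assumes the region descriptors are already produced by an external pre-trained Faster R-CNN (represented here by a random tensor), and the dimensions (region_dim, word_dim, vocab_size, d) and module names are illustrative assumptions rather than the patent's exact configuration.

```python
# Illustrative sketch of step S1 (not the authors' released code).
# Assumes K region descriptors per image come from a pre-trained Faster R-CNN.
import torch
import torch.nn as nn

class TextImageFeatureExtractor(nn.Module):
    def __init__(self, region_dim=2048, vocab_size=10000, word_dim=300, d=1024):
        super().__init__()
        self.region_fc = nn.Linear(region_dim, d)        # map region features to the d-dim hidden space
        self.embed = nn.Embedding(vocab_size, word_dim)  # dense stand-in for one-hot word codes
        self.bigru = nn.GRU(word_dim, d, batch_first=True,
                            bidirectional=True)          # forward + backward passes
        self.word_fc = nn.Linear(d, d)                   # map averaged word features to the d-dim space

    def forward(self, regions, word_ids):
        # regions: (B, K, region_dim) from the detector; word_ids: (B, T)
        V = self.region_fc(regions)                      # (B, K, d) region features
        h, _ = self.bigru(self.embed(word_ids))          # (B, T, 2d) hidden states
        h_fwd, h_bwd = h.chunk(2, dim=-1)                # forward / backward hidden states
        E = self.word_fc((h_fwd + h_bwd) / 2)            # e_t = (h_t^f + h_t^b) / 2
        return V, E

extractor = TextImageFeatureExtractor()
V, E = extractor(torch.randn(2, 36, 2048), torch.randint(0, 10000, (2, 12)))
print(V.shape, E.shape)   # torch.Size([2, 36, 1024]) torch.Size([2, 12, 1024])
```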
In step S2, after the salient region features in the image and the word features in the text have been extracted, local interactions between the two modalities are performed with a focused cross-modal attention mechanism to obtain complementary cross-modal information. In order to distinguish the importance of the regions, a focusing operation is applied to the attention matrix produced by the cross-modal attention mechanism, which increases the differences between attention scores so that redundant and noisy regions can be better filtered out while useful regions are retained. This step comprises two sub-steps: first the computation of the attention focusing score, and then the realization of the focused cross-modal attention mechanism.
In step S21, the attention focusing score is computed. For a batch of images during training, a given description should match different images to different degrees: for a matching image-text sample pair the current description should match its corresponding image strongly, and otherwise weakly. The attention focusing score is therefore computed from the global content of the image and the text, measuring the focusing degree of the current text on the different images in a batch of samples.
This embodiment distinguishes the focusing degree of the text on the images through global information, complementing the local nature of the cross-modal attention mechanism. Given the image region features V and the word features E of the description, their average features are computed and recorded as the image region average feature v_avg and the word average feature e_avg. With the image region average feature v_avg and the word average feature e_avg as query objects, the attention score of each region and each word is computed:

    a_i^v = u_1 · (W_1 v_i ⊙ W_2 v_avg)
    a_j^e = u_2 · (W_3 e_j ⊙ W_4 e_avg)

where a_i^v denotes the attention score of the image region average feature v_avg with respect to the i-th image region feature v_i, a_j^e denotes the attention score of the word average feature e_avg with respect to the j-th word feature e_j, W_1, W_2, W_3 and W_4 are the first, second, third and fourth parameter matrices, u_1 and u_2 are parameter vectors, and ⊙ denotes element-wise multiplication. The region and word features are weighted and summed by the attention scores to obtain the global features of the image and the text, namely:

    v_glob = Σ_i a_i^v v_i,    e_glob = Σ_j a_j^e e_j

where v_glob denotes the global feature of the image and e_glob denotes the global feature of the sentence description.

For a batch of images of size B, the focusing degree f_i of the current text description on the i-th image is computed as:

    f_i = σ( u_f · [v_glob^(i) ; e_glob] )

where u_f is a parameter vector, [· ; ·] denotes the concatenation of two feature vectors, v_glob^(i) denotes the global feature of the i-th image in the batch, and σ is the sigmoid activation function; this yields the focusing degree f_i of the current text description on the i-th image.
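A sketch of the focusing score of step S21 is shown below. The attention pooling uses the described ingredients (four parameter matrices, two parameter vectors, element-wise products, a sigmoid over concatenated global features), but the softmax normalisation and all layer names and dimensions are assumptions, not the patent's exact parameterisation.

```python
# Illustrative sketch of the attention-focusing score (step S21); assumed parameterisation.
import torch
import torch.nn as nn

class FocusScore(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.w_v1, self.w_v2 = nn.Linear(d, d), nn.Linear(d, d)   # parameter matrices for regions
        self.w_e1, self.w_e2 = nn.Linear(d, d), nn.Linear(d, d)   # parameter matrices for words
        self.u_v, self.u_e = nn.Linear(d, 1), nn.Linear(d, 1)     # parameter vectors
        self.u_f = nn.Linear(2 * d, 1)                            # focusing-score vector

    def pool(self, X, w1, w2, u):
        mean = X.mean(dim=1, keepdim=True)                        # average feature as query
        scores = u(w1(X) * w2(mean)).softmax(dim=1)               # attention score per element
        return (scores * X).sum(dim=1)                            # weighted sum -> global feature

    def forward(self, V, E):
        # V: (B, K, d) region features of a batch of images; E: (T, d) words of one description
        v_glob = self.pool(V, self.w_v1, self.w_v2, self.u_v)     # (B, d) global image features
        e_glob = self.pool(E.unsqueeze(0), self.w_e1, self.w_e2, self.u_e).expand(V.size(0), -1)
        f = torch.sigmoid(self.u_f(torch.cat([v_glob, e_glob], dim=-1)))   # (B, 1) focus per image
        return f.squeeze(-1)

f = FocusScore()(torch.randn(8, 36, 1024), torch.randn(12, 1024))
print(f.shape)   # torch.Size([8]) -- focusing degree of the text on each image in the batch
```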
Step S22 is the realization of the focused cross-modal attention mechanism. After obtaining the region features V of the i-th image, the word features E of the text description, and the focusing score f_i of the description on the i-th image, the similarity score s_jk of each word to each region is computed through local word-region interaction, namely:

    s_jk = e_j v_k^T

where T denotes the transpose. Applying L2 normalization to the similarity scores yields the normalized similarity ŝ_jk, which represents the degree of similarity between the j-th word and the k-th region.

Existing image-text matching methods control how sharply each word attends to the regions through a hyper-parameter temperature coefficient λ, obtaining the attention score a_jk as:

    a_jk = exp(λ ŝ_jk) / Σ_{k'} exp(λ ŝ_jk')

When the temperature coefficient λ increases, the attention scores become more concentrated and the j-th word attends to only one or a few regions; when the hyper-parameter temperature coefficient λ decreases, the attention scores become more dispersed and the degree of attention of the j-th word to all regions tends to be uniform.

In the above manner the hyper-parameter temperature coefficient λ is fixed, so the different images in one batch usually share the same temperature coefficient. In this embodiment the temperature coefficient is controlled by the focusing score f_i, so that the text description can have different temperature coefficients for different images and the validity of the regions in different images can be better distinguished, i.e., whether the regions of different images carry useful information or are noisy or redundant. The attention score in this embodiment is obtained by the following formula:

    a_jk = exp(λ f_i ŝ_jk) / Σ_{k'} exp(λ f_i ŝ_jk')

Through the focusing score f_i, the focused cross-modal attention mechanism in this embodiment can distinguish different images more effectively.

The attention scores of each word over the regions are used to weight and sum the region features, yielding the cross-modal context feature c_j corresponding to each word, namely:

    c_j = Σ_k a_jk v_k

A linear layer then fuses the j-th word feature e_j with its corresponding cross-modal context feature c_j, namely:

    m_j = FC_m([e_j ; c_j])

where m_j denotes the feature obtained after fusing the information of the two modalities and FC_m is a linear layer.

The global feature v_glob of the image obtained in step S21 and the global feature e_glob of the sentence description are fused into the fused global feature m_0, namely:

    m_0 = FC_m([v_glob ; e_glob])

The fused global feature m_0 and the fusion feature m_j corresponding to each word are merged and recorded as the multi-modal feature M = {m_0, m_1, …, m_T}. In the next step, the extraction and fusion of intra-modal information are realized through the gated self-attention mechanism.
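The focused cross-modal attention of step S22 could be sketched as follows. It assumes a SCAN-style formulation (clamping and L2-normalising the word-region similarities, softmax over regions), with the per-image focusing score simply scaling the base temperature; the function names, base temperature value and fusion layer are assumptions for illustration.

```python
# Illustrative sketch of focused cross-modal attention (step S22); assumed formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def focused_cross_attention(E, V, focus, lam=9.0):
    """E: (T, d) word features, V: (K, d) region features of one image,
    focus: scalar focusing score of the text for this image, lam: base temperature."""
    s = E @ V.t()                                   # (T, K) word-region similarities
    s = F.normalize(s.clamp(min=0), dim=0)          # L2-normalise similarities per region
    attn = F.softmax(lam * focus * s, dim=1)        # temperature adapted by the focusing score
    return attn @ V                                 # (T, d) cross-modal context per word

class CrossModalFusion(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)             # fuse each word feature with its context

    def forward(self, E, V, focus):
        C = focused_cross_attention(E, V, focus)    # cross-modal context features
        return self.fuse(torch.cat([E, C], dim=-1)) # (T, d) fused multimodal word features

M_words = CrossModalFusion()(torch.randn(12, 1024), torch.randn(36, 1024), torch.tensor(0.8))
print(M_words.shape)   # torch.Size([12, 1024])
```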
In step S3, a gated self-modal attention mechanism is used. Given the multi-modal features M = {m_0, m_1, …, m_T}, the features m_1, …, m_T can be regarded as local features, e.g., word features fused with visual information, and m_0 can be regarded as a global feature. Each local feature has a different degree of importance; for example, each word is of different importance, and the nouns in a sentence are generally more important than prepositions and the like. This embodiment therefore designs a gated self-modal attention mechanism that uses a gating signal to control the attention score matrix of the local features and thus the importance of the different pieces of local information. The attention score matrix is computed as follows:

    A = FC_q(M) FC_k(M)^T

where FC_q and FC_k denote two linear layers with different parameters.

The gating signal g can be computed as:

    g = σ( M u_g )

where σ is an activation function and u_g is a learnable parameter vector. The gating signal g contains one scalar per multi-modal feature, and each scalar element g_j can be regarded as the importance of the corresponding feature. Before the rows of the attention matrix A are normalized with softmax, the gating scores are separated into important and unimportant features by a threshold, i.e., each g_j is fixed to a hard score:

    g_j = β_low  if g_j < τ;    g_j = β_high  if g_j ≥ τ

where τ is the threshold (0 in this experiment), β_low is the score assigned to unimportant local features, and β_high is the score assigned to important local features.

The gating vector is expressed as g = [g_0, g_1, …, g_T]. The j-th gating signal g_j weights the j-th column of the attention score matrix A, expressed as:

    A'_ij = g_j · A_ij

where A_ij denotes an element of the attention score matrix A. The rows of the gated attention matrix A' are then normalized with the softmax function. Because the gating weighting is applied before the row-wise softmax, the attention distribution of each row is sharpened, so that each query focuses on the important features.

Finally, the updated global feature is obtained by a weighted sum of the multi-modal features with the attention scores, namely:

    Z = σ( softmax(A') FC_v(M) )

where σ is an activation function, FC_v is a linear layer, and M is the multi-modal feature matrix obtained in the previous step.

The feature matrix updated by the gated self-modal attention mechanism can be recorded as Z = {z_0, z_1, …, z_T}, where z_0 denotes the updated global feature and z_1, …, z_T denote the updated local features.
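The gated self-modal attention of step S3 could be sketched as below. The hard gating values (0.1 for unimportant, 1.0 for important), the 0.5 threshold applied after a sigmoid, the single value projection and the tanh output activation are assumptions consistent with, but not identical to, the description above.

```python
# Illustrative sketch of the gated self-modal attention (step S3); assumed details.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    def __init__(self, d=1024, low=0.1, high=1.0, threshold=0.5):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.gate = nn.Linear(d, 1)                 # learnable gating vector u_g
        self.low, self.high, self.threshold = low, high, threshold

    def forward(self, M):
        # M: (N, d) multi-modal features, row 0 = fused global feature
        A = self.q(M) @ self.k(M).t()               # (N, N) attention coefficient matrix
        g = torch.sigmoid(self.gate(M)).squeeze(-1) # (N,) soft importance per feature
        g_hard = torch.where(g >= self.threshold,   # hard gating scores beta_high / beta_low
                             torch.full_like(g, self.high),
                             torch.full_like(g, self.low))
        A = F.softmax(A * g_hard.unsqueeze(0), dim=-1)  # gate the columns, then row softmax
        return torch.tanh(self.v(A @ M))            # updated features (tanh as assumed activation)

M = torch.randn(13, 1024)                           # global feature + 12 fused word features
Z = GatedSelfAttention()(M)
print(Z.shape)                                      # torch.Size([13, 1024])
```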
In step S4, the image-text matching score is calculated and the model is trained. Based on the updated features Z obtained in step S3, the score of the current image-text pair is predicted with a linear layer, which can be expressed as:

    S(I, T) = σ( FC_s(z_0) )

where σ is the sigmoid activation function, FC_s denotes a linear layer, and S(I, T) denotes the matching score between the image I and the text description T. The above formula shows that the score of the image-text match is predicted from the updated global feature z_0.
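Step S4 thus reduces to a linear layer followed by a sigmoid applied to the updated global feature; a one-line sketch (dimension and names assumed):

```python
# Illustrative sketch of the matching-score head (step S4).
import torch
import torch.nn as nn

score_head = nn.Sequential(nn.Linear(1024, 1), nn.Sigmoid())  # linear layer + sigmoid
z0 = torch.randn(1, 1024)                                     # updated global feature from step S3
score = score_head(z0)                                        # matching score S(I, T) in (0, 1)
print(float(score))
```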
An image-text matching system based on a mixed focusing attention mechanism comprises the following functional modules:
and a characteristic extraction module of the image-text pair: extracting the characteristics of the salient region in the image and the characteristics of each word in the natural language description;
cross-modal attention mechanism module: the focused cross-modal attention mechanism is utilized to adaptively adjust the temperature coefficients of the attention mechanism to different pictures, so that effective and ineffective regional features are distinguished, and cross-modal context extraction and fusion of regional-level and word-level features are realized;
gated self-attention mechanism module: the intra-modal fusion of the regional features and the word features is realized by using a gating self-attention mechanism, the effective regional features and word features are adaptively selected by controlling the self-attention matrix through a gating signal, noise and redundant regions are covered, and the distinguishing degree of the different regional features and word features is enhanced;
and a matching score calculation module: the cross-modal and self-modal region features and word features are used to calculate the matching score for the entire image and sentence.
In the image-text pair feature extraction module, the following steps are executed.

Step S11, a pre-trained Faster R-CNN object detector is used to detect the K most salient regions in the image and extract the corresponding feature of each region; the features are then mapped into a d-dimensional hidden space, and the resulting region features are denoted V = {v_1, …, v_K}, where every element of a feature vector v_i is a real number, d denotes the feature dimension, R denotes the real number field, and R^d denotes the set of d-dimensional real vectors, so that v_i ∈ R^d.

Step S12, for a natural language description containing T words, a bidirectional gated recurrent unit (Bi-GRU) is adopted to extract the feature of each word. The forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state produced when reading each word:

    h_t^f = GRU_f(w_t, h_{t-1}^f)

where h_t^f denotes the hidden state of the forward pass, w_t denotes the one-hot code of the t-th word, and GRU_f denotes the forward pass of the Bi-GRU.

Then, the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state produced when reading each word:

    h_t^b = GRU_b(w_t, h_{t+1}^b)

where h_t^b denotes the hidden state of the backward pass and GRU_b denotes the backward pass of the Bi-GRU.

Finally, the word feature e_t is obtained by averaging the hidden state h_t^f of the forward pass and the hidden state h_t^b of the backward pass, namely:

    e_t = (h_t^f + h_t^b) / 2

A linear layer maps the word features into the d-dimensional hidden space; the result is denoted E = {e_1, …, e_T}, where d denotes the feature dimension.
A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as described above.
Example 2
The image-text matching method based on the mixed focusing attention mechanism further comprises, after executing step S1-step S4 in embodiment 1:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
In step S5, all the linear layers used in steps S1 to S4 are optimized with a triplet loss function, and steps S1 to S4 are executed again after the optimization. The triplet loss function L is expressed as:

    L = [γ − S(I, T) + S(I, T')]_+ + [γ − S(I, T) + S(I', T)]_+

where [x]_+ = max(x, 0) and γ is a threshold. T' and I' denote the first and second hardest negative samples, respectively: the text sample T' is the hardest negative description for the current query image I, i.e., among the descriptions that do not match I it obtains the highest matching score S(I, T'); similarly, the image sample I' is the hardest negative image for the current query text T, i.e., among the images that do not match T it obtains the highest matching score S(I', T).

The formula for optimizing all the linear layers with the triplet loss function is:

    θ_new = θ − η · ∂L/∂θ,    θ ∈ {θ_1, θ_2}

where θ_1 and θ_2 denote the first and second parameter scalars inside a linear layer, respectively, θ is a parameter scalar before optimization, θ_new is the optimized parameter scalar, η is the learning rate, and ∂L/∂θ denotes the gradient computation.
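Given a B×B matrix of pairwise matching scores for a training batch, the triplet loss with hardest negatives of step S5 can be sketched as below; the margin value and the batch layout (matching pairs on the diagonal) are assumptions.

```python
# Illustrative sketch of the triplet loss with hardest negatives (step S5); assumed layout.
import torch

def triplet_loss_hardest(scores, margin=0.2):
    """scores: (B, B) matrix, scores[i, j] = matching score of image i with text j;
    the diagonal holds the matching (positive) image-text pairs."""
    B = scores.size(0)
    pos = scores.diag().view(B, 1)                      # S(I, T) for the matching pairs
    mask = torch.eye(B, dtype=torch.bool, device=scores.device)
    neg = scores.masked_fill(mask, float('-inf'))       # exclude positives from the negatives
    hardest_text = neg.max(dim=1).values.view(B, 1)     # hardest negative text per image
    hardest_image = neg.max(dim=0).values.view(B, 1)    # hardest negative image per text
    loss = (margin - pos + hardest_text).clamp(min=0) \
         + (margin - pos + hardest_image).clamp(min=0)
    return loss.mean()

loss = triplet_loss_hardest(torch.rand(8, 8, requires_grad=True))
loss.backward()   # in the full model these gradients flow back into all linear layers
print(float(loss))
```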
Steps S1 to S4 are specifically performed in the same manner as in example 1.
Based on the image-text pair feature extraction module, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module of embodiment 1, the image-text matching system based on the mixed focusing attention mechanism further comprises:
a loss function optimization module: all linear layers in the image-text pair feature extraction module, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module are optimized with the triplet loss function, and after the optimization the working processes of these four modules are executed.
The specific execution process of each functional module is the same as that of embodiment 1.
The following is a description of specific experimental data.
Training, validation and testing are carried out on the Flickr30K dataset, which contains 31,783 pictures, each with 5 corresponding natural language descriptions; 29,783 pictures are used for model training, 1,000 for validation and 1,000 for testing. The results show that the invention achieves a good effect.
The effectiveness of the image-text matching method based on the mixed focusing attention mechanism proposed in this embodiment is measured by Recall@K (abbreviated R@K, where K = 1, 5, 10); Recall@K denotes the proportion of queries for which the correct answer appears in the top-K retrieval results. The overall performance of the image-text matching method is measured by rsum, which is obtained by adding the R@1, R@5 and R@10 of image-to-text retrieval and of text-to-image retrieval, namely:

    rsum = (R@1 + R@5 + R@10)_image→text + (R@1 + R@5 + R@10)_text→image

The first term in the above equation is the sum of R@1, R@5 and R@10 for image-to-text retrieval, and the second term is the sum of R@1, R@5 and R@10 for text-to-image retrieval.
Table 1 compares the method of the present invention with other image-text matching methods on the Flickr30K dataset. The comparison covers several classical methods in the field of image-text matching: SCAN (CVPR 2018), PFAN (IJCAI 2019), VSRN (ICCV 2019), DP-RNN (AAAI 2020), CVSE (ECCV 2020) and CAAN (CVPR 2020). As can be seen from the results, the method of the invention achieves a balanced retrieval effect for both image-to-text and text-to-image retrieval; compared with the existing methods, the proposed image-text matching method based on the mixed focusing attention mechanism has the best overall performance, i.e., the highest rsum of 489.3. Moreover, in the two sub-tasks of image-to-text retrieval and text-to-image retrieval, the proposed method performs best on Recall@1, demonstrating that its retrieval success rate is well above that of the other methods.
It should be noted that each step/component described in the present application may be split into more steps/components, or two or more steps/components or part of the operations of the steps/components may be combined into new steps/components, as needed for implementation, to achieve the object of the present invention.
The above-described method according to the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk, or as computer code originally stored on a remote recording medium or a non-transitory machine-readable medium and downloaded through a network for storage on a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be understood that a computer, processor, microprocessor controller or programmable hardware includes a memory component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the processing methods described herein. Furthermore, when a general-purpose computer accesses code for implementing the processes shown herein, the execution of the code converts the general-purpose computer into a special-purpose computer for executing the processes shown herein.
It will be readily appreciated by those skilled in the art that the foregoing is merely illustrative of the present invention and is not intended to limit the invention, but any modifications, equivalents, improvements or the like which fall within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. The image-text matching method based on the mixed focusing attention mechanism is characterized by comprising the following steps of:
s1, extracting characteristics of a salient region in an image and characteristics of each word in natural language description;
s2, utilizing a focused cross-modal attention mechanism to adaptively adjust temperature coefficients of the attention mechanism on different pictures, so as to distinguish effective and ineffective regional features and realize cross-modal context extraction and fusion of regional-level and word-level features;
s3, realizing intra-modal fusion of the regional features and the word features by using a gating self-attention mechanism, controlling a self-attention matrix to adaptively select effective regional features and word features by using a gating signal, masking noise and redundant regions, and enhancing the distinguishing degree of the different regional features and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
2. The method for matching graphics based on a hybrid focus attention mechanism of claim 1, further comprising:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
3. The image-text matching method based on the mixed focusing attention mechanism according to claim 1 or 2, characterized in that step S1 comprises two sub-steps:

step S11, a pre-trained Faster R-CNN object detector is used to detect the K most salient regions in the image and extract the corresponding feature of each region; the features are then mapped into a d-dimensional hidden space, and the resulting region features are denoted V = {v_1, …, v_K}, where every element of a feature vector v_i is a real number, d denotes the feature dimension, R denotes the real number field, and R^d denotes the set of d-dimensional real vectors, so that v_i ∈ R^d;

step S12, for a natural language description containing T words, a bidirectional gated recurrent unit Bi-GRU is adopted to extract the feature of each word; the forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state produced when reading each word:

    h_t^f = GRU_f(w_t, h_{t-1}^f)

where h_t^f denotes the hidden state of the forward pass, w_t denotes the one-hot code of the t-th word, and GRU_f denotes the forward pass of the Bi-GRU;

the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state produced when reading each word:

    h_t^b = GRU_b(w_t, h_{t+1}^b)

where h_t^b denotes the hidden state of the backward pass and GRU_b denotes the backward pass of the Bi-GRU;

the word feature e_t is obtained by averaging the hidden state h_t^f of the forward pass and the hidden state h_t^b of the backward pass, namely:

    e_t = (h_t^f + h_t^b) / 2

a linear layer maps the word features into the d-dimensional hidden space; the result is denoted E = {e_1, …, e_T}, where d denotes the feature dimension.
4. A method of matching text based on a mixed focus attention mechanism as claimed in claim 3, characterized in that in step S2, two sub-steps are included;
step S21, giving the image area characteristics
Figure QLYQS_25
And word feature of description->
Figure QLYQS_26
The average feature is determined separately and recorded as the image area average feature +.>
Figure QLYQS_27
And word average feature +.>
Figure QLYQS_28
Mean feature in image area->
Figure QLYQS_29
And word average feature +.>
Figure QLYQS_30
For the query object, the attention score for each region and word is calculated:
Figure QLYQS_31
Figure QLYQS_32
wherein ,
Figure QLYQS_35
representing the average feature of an image region->
Figure QLYQS_39
For->
Figure QLYQS_43
Individual image area features->
Figure QLYQS_36
Attention score of->
Figure QLYQS_37
Representing word average feature +.>
Figure QLYQS_41
For->
Figure QLYQS_45
Individual word feature +.>
Figure QLYQS_34
Attention score of->
Figure QLYQS_38
Figure QLYQS_42
and
Figure QLYQS_46
Figure QLYQS_33
The parameters are respectively a first parameter matrix, a second parameter matrix, a third parameter matrix, a fourth parameter matrix and a +.>
Figure QLYQS_40
and
Figure QLYQS_44
For the parameter vector +.>
Figure QLYQS_47
Representing element multiplication, and weighting and summing the region and word characteristics through the attention score to obtain global characteristics of the image and the text, namely:
Figure QLYQS_48
wherein ,
Figure QLYQS_49
representing global features of the image;
Figure QLYQS_50
Global features representing sentence descriptions;
for a size of
Figure QLYQS_51
Calculating the current text description for the lot size of the image +.>
Figure QLYQS_52
Focusing degree of sheet image->
Figure QLYQS_53
The method comprises the following steps:
Figure QLYQS_54
wherein ,
Figure QLYQS_55
for the parameter vector +.>
Figure QLYQS_56
Representing a concatenation operation of two feature vectors, +.>
Figure QLYQS_57
Activating the function for sigmoid, thereby obtaining the current text description pair +.>
Figure QLYQS_58
Focusing degree of sheet image->
Figure QLYQS_59
step S22, after obtaining the region features V = {v_1, ..., v_k} of the m-th image, the word features T = {t_1, ..., t_n} of the text description, and the focusing degree β_m of the text description on the m-th image, the similarity score s_{ij} of each word to each region is calculated through local word-region interaction, namely:

s_{ij} = v_i^T t_j

wherein ^T denotes the transpose; the similarity scores s_{ij} are subjected to L2 normalization, and the normalized similarity ŝ_{ij} denotes the degree of similarity between the j-th word and the i-th region;
the attention score is given by:

α_{ij} = exp(β_m ŝ_{ij}) / Σ_{i'=1}^{k} exp(β_m ŝ_{i'j})

the region features are weighted and summed by the attention scores of each word to obtain the cross-modal context feature c_j corresponding to each word, namely:

c_j = Σ_{i=1}^{k} α_{ij} v_i

the fusion of the j-th word feature t_j and its corresponding cross-modal context feature c_j is realized through a linear layer, namely:

m_j = FC([t_j ; c_j])

wherein m_j denotes the feature obtained after the information of the two modalities is fused, and FC is a linear layer;
the global feature v^glo of the image obtained in step S21 and the global feature t^glo of the sentence description are fused to obtain the fused global feature m^glo, namely:

m^glo = FC([v^glo ; t^glo])

and the fused global feature m^glo together with the fusion feature m_j corresponding to each word is recorded as the multi-modal feature M = {m^glo, m_1, ..., m_n}.
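By way of illustration only (not part of the claims), the following PyTorch-style sketch shows one possible reading of steps S21 and S22 above; the module names, tensor shapes, the use of the focusing degree as a softmax scale, and the fusion layers are assumptions introduced for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusedPooling(nn.Module):
    """Illustrative sketch of step S21: average-feature-guided attention pooling
    of regions/words plus a per-image focusing degree (shapes are assumptions)."""
    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)   # first parameter matrix
        self.W2 = nn.Linear(dim, dim, bias=False)   # second parameter matrix
        self.W3 = nn.Linear(dim, dim, bias=False)   # third parameter matrix
        self.W4 = nn.Linear(dim, dim, bias=False)   # fourth parameter matrix
        self.w_v = nn.Linear(dim, 1, bias=False)    # parameter vector for regions
        self.w_t = nn.Linear(dim, 1, bias=False)    # parameter vector for words
        self.w_f = nn.Linear(2 * dim, 1)            # parameter vector for the focusing degree

    def forward(self, regions, words):
        # regions: (k, dim) salient-region features; words: (n, dim) word features
        v_bar = regions.mean(dim=0, keepdim=True)   # image region average feature
        t_bar = words.mean(dim=0, keepdim=True)     # word average feature
        a_v = F.softmax(self.w_v(self.W1(regions) * self.W2(v_bar)), dim=0)  # (k, 1) region scores
        a_t = F.softmax(self.w_t(self.W3(words) * self.W4(t_bar)), dim=0)    # (n, 1) word scores
        v_glo = (a_v * regions).sum(dim=0)          # global image feature
        t_glo = (a_t * words).sum(dim=0)            # global text feature
        # focusing degree of the current text on this image, a scalar in (0, 1)
        beta = torch.sigmoid(self.w_f(torch.cat([v_glo, t_glo], dim=-1)))
        return v_glo, t_glo, beta

class FocusedCrossModalAttention(nn.Module):
    """Illustrative sketch of step S22: word-to-region attention whose sharpness is
    scaled by the focusing degree, followed by per-word and global fusion."""
    def __init__(self, dim):
        super().__init__()
        self.fc_word = nn.Linear(2 * dim, dim)      # fuses a word with its cross-modal context
        self.fc_glo = nn.Linear(2 * dim, dim)       # fuses the two global features

    def forward(self, regions, words, v_glo, t_glo, beta):
        sim = words @ regions.t()                   # (n, k) word-region similarity
        sim = F.normalize(sim, p=2, dim=-1)         # L2-normalized similarity
        attn = F.softmax(beta * sim, dim=-1)        # focusing degree modulates the softmax
        context = attn @ regions                    # (n, dim) cross-modal context per word
        m_words = self.fc_word(torch.cat([words, context], dim=-1))  # fused word features
        m_glo = self.fc_glo(torch.cat([v_glo, t_glo], dim=-1))       # fused global feature
        return torch.cat([m_glo.unsqueeze(0), m_words], dim=0)       # multi-modal features (n + 1, dim)
```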
5. The image-text matching method based on a mixed focusing attention mechanism according to claim 4, characterized in that in step S3, the attention coefficient matrix A is calculated by the following formula:

A = FC_1(M) · FC_2(M)^T

wherein FC_1 and FC_2 denote two linear layers with different parameters;
the gating signal g is calculated as:

g = σ( M w_g )

wherein σ is the activation function and w_g is a learnable parameter vector; each scalar element g_i of the gating signal g is regarded as the importance of the corresponding feature; before the softmax normalization of each row of elements of A, the gating scores are separated into important features and unimportant features by a threshold, namely each g_i is fixed as a hard score:

g_i = s_high if g_i ≥ τ ,  g_i = s_low if g_i < τ

wherein τ is the threshold, s_low is the score of an unimportant local feature, and s_high is the score of an important local feature;
the gating vector is expressed as g = (g_0, g_1, ..., g_n); the j-th column of elements of the attention score matrix A is weighted by the j-th gating signal g_j, expressed by:

Ã_{ij} = g_j A_{ij}

wherein A_{ij} denotes a single element of the attention score matrix A;
each row of elements of the gated attention matrix Ã is normalized by the softmax function;
the updated features are obtained by weighting and summing the multi-modal features M with the normalized attention scores, namely:

M̃ = σ( FC( softmax(Ã) M ) )

wherein σ is an activation function, FC is a linear layer, and M is the multi-modal feature matrix obtained in the previous step;
the feature matrix updated by the gated self-modal attention mechanism is recorded as M̃ = {m̃^glo, m̃_1, ..., m̃_n}, wherein m̃^glo denotes the updated global feature and m̃_j denotes the updated local features.
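Purely as an illustrative sketch (not the claimed implementation), step S3 might be realized along the following lines; the threshold tau = 0.5, the hard scores 0/1, and the ReLU activation are assumptions not stated in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hard_gate(m_feats, w_g, tau=0.5, s_low=0.0, s_high=1.0):
    """Gating signal of step S3: a sigmoid importance score per multi-modal feature,
    snapped to a hard low/high score by a threshold (values are illustrative)."""
    g = torch.sigmoid(m_feats @ w_g)                          # (n + 1,) soft importance
    return torch.where(g >= tau, torch.full_like(g, s_high), torch.full_like(g, s_low))

class GatedSelfAttention(nn.Module):
    """Sketch of step S3: self-attention over the multi-modal features, with each
    column of the attention matrix weighted by the hard gating signal."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)                        # first linear layer (queries)
        self.fc2 = nn.Linear(dim, dim)                        # second linear layer (keys)
        self.fc_out = nn.Linear(dim, dim)                     # output linear layer
        self.w_g = nn.Parameter(torch.randn(dim))             # learnable gating vector

    def forward(self, m_feats, tau=0.5):
        # m_feats: (n + 1, dim) multi-modal features (fused global feature + word features)
        attn = self.fc1(m_feats) @ self.fc2(m_feats).t()      # (n + 1, n + 1) attention coefficients
        gate = hard_gate(m_feats, self.w_g, tau)              # (n + 1,) hard gating signal
        attn = attn * gate.unsqueeze(0)                       # weight the j-th column by g_j
        attn = F.softmax(attn, dim=-1)                        # row-wise normalization
        return torch.relu(self.fc_out(attn @ m_feats))        # updated multi-modal features
```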
6. The image-text matching method based on a mixed focusing attention mechanism according to claim 5, characterized in that in step S4, the updated global feature m̃^glo obtained in step S3 is used to predict the score of the current image-text pair by means of a linear layer, expressed as:

S(I, T) = σ( FC( m̃^glo ) )

wherein σ is the sigmoid activation function, FC denotes a linear layer, and S(I, T) denotes the matching score between the image I and the text description T.
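A minimal sketch of the scoring head of step S4, assuming the updated global feature feeds a single linear layer followed by a sigmoid:

```python
import torch
import torch.nn as nn

class MatchingScore(nn.Module):
    """Sketch of step S4: predict the image-text matching score from the updated
    global feature with one linear layer and a sigmoid activation."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, m_glo_updated):
        # m_glo_updated: (dim,) updated global feature from the gated self-attention step
        return torch.sigmoid(self.fc(m_glo_updated)).squeeze(-1)   # scalar matching score in (0, 1)
```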
7. The image-text matching method based on a mixed focusing attention mechanism according to claim 6, characterized in that in step S5, the triplet loss function L is expressed as:

L = [γ − S(I, T) + S(I, T̂)]_+ + [γ − S(I, T) + S(Î, T)]_+

wherein [x]_+ = max(x, 0), γ is a threshold, and T̂ and Î are respectively the first hardest negative sample and the second hardest negative sample;
the formula for optimizing all the linear layers by using the triplet loss function is:

θ' = θ − η ∇_θ L ,  θ ∈ {W, b}

wherein W and b respectively denote the first parameter scalar and the second parameter scalar inside a linear layer, θ is the parameter before optimization, θ' is the optimized parameter, η is the learning rate, and ∇ denotes the gradient computation.
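For illustration, a hardest-negative triplet loss over a batch score matrix and a plain gradient-descent update of the linear-layer parameters might be sketched as below; the margin value, the batch score layout scores[i, j], and the learning rate are assumptions.

```python
import torch

def triplet_loss(scores, margin=0.2):
    """Sketch of step S5: hinge triplet loss with the hardest negatives in the batch.
    scores: (B, B) matrix where scores[i, j] is the matching score of image i with text j."""
    pos = scores.diag()                                                        # S(I, T) of matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    hardest_text = scores.masked_fill(mask, float('-inf')).max(dim=1).values   # hardest negative text per image
    hardest_image = scores.masked_fill(mask, float('-inf')).max(dim=0).values  # hardest negative image per text
    loss = (margin - pos + hardest_text).clamp(min=0) + (margin - pos + hardest_image).clamp(min=0)
    return loss.sum()

def sgd_step(linear_layers, lr=1e-3):
    """Plain gradient descent over every linear layer's weight and bias:
    theta' = theta - lr * grad(theta)."""
    with torch.no_grad():
        for layer in linear_layers:
            for p in layer.parameters():
                if p.grad is not None:
                    p -= lr * p.grad
                    p.grad.zero_()
```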
8. An image-text matching system based on a mixed focusing attention mechanism, characterized by comprising the following functional modules:
an image-text pair feature extraction module: extracting the features of the salient regions in the image and the features of each word in the natural language description;
a cross-modal attention mechanism module: using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different images, so that effective and ineffective region features are distinguished and cross-modal context extraction and fusion of region-level and word-level features are realized;
a gated self-attention mechanism module: realizing intra-modal fusion of the region features and the word features with a gated self-attention mechanism, adaptively selecting effective region features and word features by controlling the self-attention matrix through a gating signal, masking noisy and redundant regions, and enhancing the discriminability of different region features and word features;
a matching score calculation module: using the region features and word features after cross-modal and intra-modal interaction to calculate the matching score of the whole image and sentence.
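For orientation only, the functional modules of claim 8 could be composed roughly as in the sketch below; it reuses the hypothetical classes from the sketches after the method claims above (they must be in scope) and assumes the region and word features come from an external feature extraction module.

```python
import torch.nn as nn

class MixedFocusMatcher(nn.Module):
    """Rough composition of the system modules, reusing the illustrative sketches
    above; not the patented implementation."""
    def __init__(self, dim):
        super().__init__()
        self.pooling = FocusedPooling(dim)            # cross-modal attention module (pooling + focusing degree)
        self.cross = FocusedCrossModalAttention(dim)  # cross-modal attention module (context + fusion)
        self.self_attn = GatedSelfAttention(dim)      # gated self-attention mechanism module
        self.scorer = MatchingScore(dim)              # matching score calculation module

    def forward(self, regions, words):
        # regions / words: outputs of the image-text pair feature extraction module
        v_glo, t_glo, beta = self.pooling(regions, words)
        m = self.cross(regions, words, v_glo, t_glo, beta)
        m = self.self_attn(m)
        return self.scorer(m[0])                      # m[0] is the updated fused global feature
```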
9. The image-text matching system based on a mixed focusing attention mechanism according to claim 8, characterized by further comprising:
a loss function optimization module: optimizing all linear layers in the image-text pair feature extraction module, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module by using the triplet loss function, and executing the working processes of these modules after the optimization.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202310424288.4A 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism Active CN116150418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310424288.4A CN116150418B (en) 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism


Publications (2)

Publication Number Publication Date
CN116150418A true CN116150418A (en) 2023-05-23
CN116150418B CN116150418B (en) 2023-07-07

Family

ID=86352855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310424288.4A Active CN116150418B (en) 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism

Country Status (1)

Country Link
CN (1) CN116150418B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176017A1 (en) * 2017-03-24 2018-09-27 Revealit Corporation Method, system, and apparatus for identifying and revealing selected objects from video
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method
CN112966135A (en) * 2021-02-05 2021-06-15 华中科技大学 Image-text retrieval method and system based on attention mechanism and gate control mechanism
CN113065012A (en) * 2021-03-17 2021-07-02 山东省人工智能研究院 Image-text analysis method based on multi-mode dynamic interaction mechanism
CN114155429A (en) * 2021-10-09 2022-03-08 信阳学院 Reservoir earth surface temperature prediction method based on space-time bidirectional attention mechanism
CN114492646A (en) * 2022-01-28 2022-05-13 北京邮电大学 Image-text matching method based on cross-modal mutual attention mechanism
CN114461821A (en) * 2022-02-24 2022-05-10 中南大学 Cross-modal image-text inter-searching method based on self-attention reasoning
CN114691986A (en) * 2022-03-21 2022-07-01 合肥工业大学 Cross-modal retrieval method based on subspace adaptive spacing and storage medium
CN115017266A (en) * 2022-06-23 2022-09-06 天津理工大学 Scene text retrieval model and method based on text detection and semantic matching and computer equipment

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MENGXIAO TIAN等: "Adaptive Latent Graph Representation Learning for Image-Text Matching", 《IEEE TRANSACTIONS ON IMAGE PROCESSING》, vol. 32, pages 471 - 482, XP011931400, DOI: 10.1109/TIP.2022.3229631 *
XI SHAO等: "Automatic Scene Recognition Based on Constructed Knowledge Space Learning", 《IEEE ACCESS》, vol. 7, pages 102902 - 102910, XP011738447, DOI: 10.1109/ACCESS.2019.2919342 *
曲磊钢: "Research on Image-Text Retrieval Methods Based on Dynamic Modality Interaction Modeling", China Master's Theses Full-text Database, pages 138-1313 *
甘益波 et al.: "Image Style Transfer Jointly Modeling Inter-Domain and Intra-Domain Information", Journal of Computer-Aided Design & Computer Graphics, vol. 34, no. 10, pages 1489-1496 *
赵小虎 et al.: "Image Semantic Description Algorithm Based on Global-Local Features and an Adaptive Attention Mechanism", Journal of Zhejiang University (Engineering Science), vol. 54, no. 1, pages 126-134 *
邵曦 et al.: "Research on a Question Answering System Combining Bi-LSTM and an Attention Model", Computer Applications and Software, vol. 37, no. 10, pages 52-56 *

Also Published As

Publication number Publication date
CN116150418B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
US11093560B2 (en) Stacked cross-modal matching
CN109858555B (en) Image-based data processing method, device, equipment and readable storage medium
US8111923B2 (en) System and method for object class localization and semantic class based image segmentation
JP5351958B2 (en) Semantic event detection for digital content recording
CN111126069A (en) Social media short text named entity identification method based on visual object guidance
CN110717332B (en) News and case similarity calculation method based on asymmetric twin network
Li et al. Multimodal architecture for video captioning with memory networks and an attention mechanism
CN111460201B (en) Cross-modal retrieval method for modal consistency based on generative countermeasure network
CN111460247A (en) Automatic detection method for network picture sensitive characters
CN113239907B (en) Face recognition detection method and device, electronic equipment and storage medium
CN112487827B (en) Question answering method, electronic equipment and storage device
KR20190108378A (en) Method and System for Automatic Image Caption Generation
CN113392967A (en) Training method of domain confrontation neural network
CN115690534A (en) Image classification model training method based on transfer learning
CN115861995A (en) Visual question-answering method and device, electronic equipment and storage medium
CN118035416A (en) Method and system for streaming question-answer map
CN114332559B (en) RGB-D significance target detection method based on self-adaptive cross-mode fusion mechanism and deep attention network
CN117851826A (en) Model construction method, model construction device, apparatus, and storage medium
CN117668292A (en) Cross-modal sensitive information identification method
CN116150418B (en) Image-text matching method and system based on mixed focusing attention mechanism
CN117235605A (en) Sensitive information classification method and device based on multi-mode attention fusion
CN117152669A (en) Cross-mode time domain video positioning method and system
CN116894943A (en) Double-constraint camouflage target detection method and system
Rao et al. Deep learning-based image retrieval system with clustering on attention-based representations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant