CN116150418B - Image-text matching method and system based on mixed focusing attention mechanism - Google Patents

Image-text matching method and system based on mixed focusing attention mechanism

Info

Publication number
CN116150418B
CN116150418B (application CN202310424288.4A)
Authority
CN
China
Prior art keywords
word
features
image
feature
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310424288.4A
Other languages
Chinese (zh)
Other versions
CN116150418A (en)
Inventor
鲍秉坤
叶俊杰
邵曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310424288.4A priority Critical patent/CN116150418B/en
Publication of CN116150418A publication Critical patent/CN116150418A/en
Application granted granted Critical
Publication of CN116150418B publication Critical patent/CN116150418B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/532 — Information retrieval of still image data; query formulation, e.g. graphical querying
    • G06F16/332 — Information retrieval of unstructured textual data; query formulation
    • G06F16/3344 — Query execution using natural language analysis
    • G06F16/383 — Retrieval characterised by using metadata automatically derived from the content (textual data)
    • G06F16/583 — Retrieval characterised by using metadata automatically derived from the content (still image data)
    • G06V10/462 — Salient features, e.g. scale-invariant feature transforms [SIFT]
    • G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V10/82 — Image or video recognition or understanding using neural networks
    • G06V2201/07 — Target detection
    • Y02D10/00 — Energy-efficient computing, e.g. low-power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image-text matching method and system based on a mixed focusing attention mechanism. The method comprises the following steps: S1, extracting the features of the salient regions in an image and the features of each word in a natural language description; S2, using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, thereby distinguishing effective from ineffective region features; S3, using a gated self-attention mechanism to fuse the region features and word features within each modality, with a gating signal controlling the self-attention matrix to adaptively select the effective region features and word features; and S4, calculating the matching score of the whole image and sentence from the cross-modal and intra-modal region features and word features. The invention enables mutual retrieval between pictures and texts.

Description

Image-text matching method and system based on mixed focusing attention mechanism
Technical Field
The invention belongs to the intersection of computer vision and natural language processing, and in particular relates to a method for computing the matching between an image and a text.
Background
Images and text are the main media for spreading information on the Internet and fill people's daily lives. Although images, as visual data, are naturally different in modality from natural language data such as text, in many scenarios the content conveyed by an image and a text is closely related: an image and a sentence of natural language description usually share an intrinsic semantic association. Mining this association has great application prospects and value for achieving semantic alignment between images and natural language. By mining the similarity score between an image and a natural language text, semantically matched image-text pairs can be found, which greatly promotes current text-to-image and image-to-text search and helps users retrieve more valuable information from the Internet; this is the research value and significance of image-text matching.
An image-text matching method needs to score the degree of matching between a given image and a natural language description, so understanding the content of both the image and the description is the key to determining the matching score; only when the method understands the content of the image and the text can it judge their degree of matching accurately and comprehensively. To achieve fine-grained matching between images and text, traditional image-text matching methods often use a pre-trained object detector to extract the salient regions of the image and, for the natural language description, extract the feature of each word by sequence modeling. Matching the global information of the image and the description is thus converted into matching the local information of regions and words, and the matching degree of image and text is computed bottom-up.
The above methods still face the following two challenges. (a) Redundant/noisy information: conventional image-text matching models often use a fixed number (typically 36) of region features extracted from the image in advance, some of which contain no information related to the text (noise features), while others overlap to some degree (redundant features). (b) The image-text matching model cannot distinguish useful information from useless information: a single-modality self-attention mechanism does not always focus on whether a region is useful, and existing cross-modal attention mechanisms usually use a single temperature coefficient to discriminate the regions of all pictures and cannot assign different temperature coefficients to different pictures.
Disclosure of Invention
The technical problem to be solved by the invention is: in the process of mutual retrieval between pictures and texts, how to remove redundant/noisy region information in the image and how to construct cross-modal and intra-modal attention mechanisms, so that the image-text matching method does not pay excessive attention to redundant/noisy region information.
In order to solve the technical problems, the invention provides an image-text matching method based on a mixed focusing attention mechanism, which comprises the following steps:
s1, extracting characteristics of a salient region in an image and characteristics of each word in natural language description;
s2, utilizing a focused cross-modal attention mechanism to adaptively adjust temperature coefficients of the attention mechanism on different pictures, so as to distinguish effective and ineffective regional features and realize cross-modal context extraction and fusion of regional-level and word-level features;
s3, realizing intra-modal fusion of the regional features and the word features by using a gating self-attention mechanism, controlling a self-attention matrix to adaptively select effective regional features and word features by using a gating signal, masking noise and redundant regions, and enhancing the distinguishing degree of the different regional features and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
The image-text matching method based on the mixed focusing attention mechanism further comprises the following steps:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
In the foregoing image-text matching method based on the mixed focusing attention mechanism, step S1 comprises two sub-steps:
S11, detecting the m most salient regions in the image with a pre-trained Faster R-CNN object detector, extracting the feature corresponding to each region and mapping it to a D-dimensional hidden space through a linear layer; the obtained region features are denoted as V = {v_1, …, v_m}, v_i ∈ ℝ^D, where each element of the feature vector v_i is a real number, D denotes the dimension of the feature vector, ℝ denotes the real number field and ℝ^D denotes a D-dimensional real vector;
S12, for a natural language description containing n words, extracting the feature of each word with a bidirectional gated recurrent unit (Bi-GRU); the forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state at each word:

h_i^fwd = GRU_fwd(x_i, h_{i-1}^fwd)

where h_i^fwd denotes the hidden state of the forward pass, x_i denotes the one-hot encoding of the i-th word, and GRU_fwd denotes the forward pass of the Bi-GRU;
the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state at each word:

h_i^bwd = GRU_bwd(x_i, h_{i+1}^bwd)

where h_i^bwd denotes the hidden state of the backward pass and GRU_bwd denotes the backward pass of the Bi-GRU;
the word feature e_i is obtained by averaging the hidden state h_i^fwd of the forward pass and the hidden state h_i^bwd of the backward pass, namely:

e_i = (h_i^fwd + h_i^bwd) / 2

and the word features are mapped to the D-dimensional hidden space through a linear layer, denoted as {e_1, …, e_n}, e_i ∈ ℝ^D, where D denotes the dimension of the feature vector.
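By way of illustration, the following is a minimal PyTorch-style sketch of the feature extraction of step S1. It assumes the Faster R-CNN region features are pre-extracted, and the module names (RegionProjector, WordEncoder) and dimensions (2048, 300, 1024) are illustrative assumptions rather than values prescribed by the embodiment; an embedding layer is used in place of raw one-hot inputs.

```python
import torch
import torch.nn as nn

class RegionProjector(nn.Module):
    """S11: map pre-extracted Faster R-CNN region features to the D-dimensional hidden space."""
    def __init__(self, region_dim=2048, hidden_dim=1024):
        super().__init__()
        self.fc = nn.Linear(region_dim, hidden_dim)

    def forward(self, region_feats):          # (batch, m, region_dim)
        return self.fc(region_feats)          # V: (batch, m, D)

class WordEncoder(nn.Module):
    """S12: Bi-GRU word features, averaging forward and backward hidden states."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):              # (batch, n) integer word indices
        x = self.embed(word_ids)              # (batch, n, embed_dim)
        h, _ = self.bigru(x)                  # (batch, n, 2 * hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)         # forward / backward hidden states
        return (fwd + bwd) / 2                # e_i = (h_i^fwd + h_i^bwd) / 2 -> (batch, n, D)

# illustrative usage with random inputs
regions = RegionProjector()(torch.randn(2, 36, 2048))                     # V: (2, 36, 1024)
words = WordEncoder(vocab_size=10000)(torch.randint(0, 10000, (2, 12)))   # E: (2, 12, 1024)
```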
In the foregoing image-text matching method based on the mixed focusing attention mechanism, step S2 comprises two sub-steps:
Step S21, given the image region features {v_1, …, v_m} and the word features {e_1, …, e_n} of the description, their average features are computed separately and denoted as the image region average feature v̄ and the word average feature ē; with the image region average feature v̄ and the word average feature ē as query objects, the attention score of each region and each word is computed:

a_i^v = q_v^T (W_v v_i ⊙ U_v v̄)

a_i^e = q_e^T (W_e e_i ⊙ U_e ē)

where a_i^v denotes the attention score of the image region average feature v̄ for the i-th image region feature v_i, a_i^e denotes the attention score of the word average feature ē for the i-th word feature e_i, W_v, U_v and W_e, U_e are respectively a first, second, third and fourth parameter matrix, q_v and q_e are parameter vectors, and ⊙ denotes element-wise multiplication; weighting and summing the region and word features with the attention scores yields the global features of the image and of the text, namely:

v_g = Σ_{i=1}^{m} a_i^v v_i,  t_g = Σ_{i=1}^{n} a_i^e e_i

where v_g denotes the global feature of the image and t_g denotes the global feature of the sentence description;
for a batch of images of size b, the focusing degree f_i of the current text description on the i-th image in the batch is computed as:

f_i = σ(q^T (v_g^(i) ‖ t_g))

where q is a parameter vector, ‖ denotes the concatenation of two feature vectors, and σ(·) is the sigmoid activation function, thereby obtaining the focusing degrees {f_1, …, f_b} of the current text description on the b images;
Step S22, after obtaining the region features {v_1, …, v_m} of the i-th image, the word features {e_1, …, e_n} of the text description, and the focusing score f of the text description on the i-th image, the similarity score s_ij of each word to each region is computed through local word-region interaction, namely:

s_ij = e_i^T v_j

where (·)^T denotes transposition; L2-normalizing the similarity scores s_ij yields the normalized similarity ŝ_ij, which represents the similarity between the i-th word and the j-th region;
the attention score is obtained as:

α_ij = exp(λ f ŝ_ij) / Σ_{j=1}^{m} exp(λ f ŝ_ij)

where λ is the temperature coefficient and f is the focusing score; weighting and summing the regions with the attention scores of each word yields the cross-modal context feature c_i corresponding to each word, namely:

c_i = Σ_{j=1}^{m} α_ij v_j

the fusion of the i-th word feature e_i and the corresponding cross-modal context feature c_i is realized through a linear layer, namely:

ê_i = fc(e_i ‖ c_i)

where ê_i denotes the feature obtained by fusing the information of the two modalities and fc is a linear layer;
the global feature v_g of the image and the global feature t_g of the sentence description obtained in step S21 are fused to obtain the fused global feature e_g, namely:

e_g = fc(v_g ‖ t_g)

and the fused global feature e_g and the fusion features ê_1, …, ê_n corresponding to each word are merged and recorded as the multi-modal feature E = {e_g, ê_1, …, ê_n}.
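A minimal sketch of step S2 as described above is given below. The function names (global_pool, focus_scores, focused_cross_attention), the axis chosen for the L2 normalisation, the softmax applied inside global_pool and the value λ = 9.0 are assumptions chosen for illustration, not a definitive implementation of the patented method.

```python
import torch
import torch.nn.functional as F

def global_pool(feats, W, U, q):
    """Attention-pool a feature set (n, D) into one global feature, with its mean as query."""
    mean = feats.mean(dim=0, keepdim=True)                  # (1, D) average feature as query
    scores = ((feats @ W.T) * (mean @ U.T)) @ q             # a_i = q^T (W x_i * U x_mean)
    return scores.softmax(dim=0) @ feats                    # weighted sum -> (D,)

def focus_scores(image_globals, text_global, q_f):
    """Focusing degree f_i of the current text on each of the b images in the batch."""
    b = image_globals.size(0)
    joint = torch.cat([image_globals, text_global.expand(b, -1)], dim=1)  # v_g^(i) || t_g
    return torch.sigmoid(joint @ q_f)                       # (b,)

def focused_cross_attention(words, regions, f, lam=9.0):
    """Cross-modal context c_i for each word, with per-image temperature lam * f."""
    s = words @ regions.T                                   # s_ij = e_i^T v_j, (n, m)
    s_hat = F.normalize(s, p=2, dim=1)                      # L2-normalised similarities
    alpha = F.softmax(lam * f * s_hat, dim=1)               # focused attention over regions
    return alpha @ regions                                  # c_i, (n, D)

# illustrative usage: one image (m = 36 regions), one sentence (n = 12 words), D = 1024
D = 1024
regions, words = torch.randn(36, D), torch.randn(12, D)
W_v, U_v, q_v = torch.randn(D, D), torch.randn(D, D), torch.randn(D)
W_e, U_e, q_e = torch.randn(D, D), torch.randn(D, D), torch.randn(D)
q_f = torch.randn(2 * D)

v_g = global_pool(regions, W_v, U_v, q_v)                   # global image feature
t_g = global_pool(words, W_e, U_e, q_e)                     # global text feature
f = focus_scores(v_g.unsqueeze(0), t_g, q_f)[0]             # focus of this text on this image
c = focused_cross_attention(words, regions, f)              # cross-modal context per word
e_hat = torch.nn.Linear(2 * D, D)(torch.cat([words, c], 1)) # fused word features e_i_hat
e_g = torch.nn.Linear(2 * D, D)(torch.cat([v_g, t_g], 0))   # fused global feature e_g
E = torch.cat([e_g.unsqueeze(0), e_hat], dim=0)             # multi-modal feature E
```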
In the foregoing image-text matching method based on the mixed focusing attention mechanism, in step S3, the attention coefficient matrix is computed as:

A = fc_q(E) × [fc_k(E)]^T   (13)

where fc_q(·) and fc_k(·) denote two linear layers with different parameters;
the gating signal G is computed as:

G = tanh(q^T · E)   (14)

where tanh(·) is the activation function, q ∈ ℝ^D is a learnable parameter vector, and the gating signal G ∈ ℝ^{n+1}; each scalar element g_i, i ∈ [0, n], of G is regarded as the importance of the corresponding feature; before softmax normalization of each row of elements in the attention matrix A, the gating scores are separated into important and non-important features by a threshold, i.e., each g_i is fixed to a hard score:

g_i = h if g_i ≥ t, otherwise g_i = l   (15)

where t is the threshold, l is the score of a non-important local feature and h is the score of an important local feature;
the gating vector is expressed as G = (g_0, g_1, …, g_n); the i-th gating signal g_i weights the i-th column of the attention score matrix A, expressed as:

a^G_{j,i} = g_i · a_{j,i}   (16)

where a_{i,j}, i ∈ [0, n], j ∈ [0, n], denotes an element of the attention score matrix A;
each row of elements in the gated attention matrix A^G is normalized with the softmax function;
the updated global feature e_g^new (and likewise each updated local feature) is obtained by weighting and summing the multi-modal features E with the normalized attention scores, namely:

E^new = relu(fc_v(softmax(A^G) · E))   (17)

where relu(·) is the activation function, fc_v(·) is a linear layer, and E ∈ ℝ^{(n+1)×D} is the multi-modal feature matrix obtained in the previous step;
the feature matrix updated by the gated self-modal attention mechanism is recorded as E^new = {e_g^new, ê_1^new, …, ê_n^new}, where e_g^new denotes the updated global feature and ê_1^new, …, ê_n^new denote the updated local features.
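The sketch below illustrates the gated self-modal attention of step S3 described above. The hard-gating constants (t = 0.0, l = 0.1, h = 1.0) and the helper names are assumptions for illustration; the patent does not fix the values of l and h.

```python
import torch
import torch.nn.functional as F

def gated_self_attention(E, fc_q, fc_k, fc_v, q, t=0.0, l=0.1, h=1.0):
    """E: (n+1, D) multi-modal features [e_g, e_1_hat, ..., e_n_hat]."""
    A = fc_q(E) @ fc_k(E).T                      # (13) attention coefficient matrix, (n+1, n+1)
    G = torch.tanh(E @ q)                        # (14) soft gating signal, one scalar per feature
    G_hard = torch.where(G >= t, torch.full_like(G, h), torch.full_like(G, l))  # (15) hard scores
    A_gated = A * G_hard.unsqueeze(0)            # (16) gate the i-th column with g_i
    attn = F.softmax(A_gated, dim=1)             # row-wise softmax over the gated scores
    return torch.relu(fc_v(attn @ E))            # (17) updated features E_new

# illustrative usage
D, n = 1024, 12
E = torch.randn(n + 1, D)
fc_q, fc_k, fc_v = (torch.nn.Linear(D, D) for _ in range(3))
q = torch.randn(D)
E_new = gated_self_attention(E, fc_q, fc_k, fc_v, q)
e_g_new = E_new[0]                               # updated global feature
```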
In the foregoing image-text matching method based on the mixed focusing attention mechanism, in step S4, based on the updated features E^new = {e_g^new, ê_1^new, …, ê_n^new} obtained in step S3, the score of the current image-text pair is predicted through a linear layer, expressed as:

S(I, T) = σ(fc(e_g^new))   (18)

where σ(·) is the sigmoid activation function, fc(·) denotes a linear layer and S(I, T) denotes the matching score between the image I and the text description T.
In the foregoing image-text matching method based on the mixed focusing attention mechanism, in step S5, the triplet loss function L is expressed as:

L = [a − S(I, T) + S(I, T̂)]_+ + [a − S(I, T) + S(Î, T)]_+   (19)

where [x]_+ = max(x, 0), a is a threshold, and T̂ and Î are respectively the first and second most difficult negative samples;
the formula for optimizing all the linear layers with the triplet loss function is:

w_new = w_old − μ × grad(L)   (20)

where w_new and w_old denote a parameter scalar inside a linear layer after and before optimization respectively, μ is the learning rate, and grad(·) denotes the gradient computation.
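A sketch of the triplet loss of step S5 and the gradient update (20) is given below. It assumes a VSE++-style selection of the hardest negatives inside a batch score matrix and a margin of 0.2; the matrix layout and the use of torch.optim.SGD are illustrative assumptions, not details fixed by the embodiment.

```python
import torch

def triplet_loss(scores, margin=0.2):
    """scores: (b, b) matrix with scores[i, j] = S(image_i, text_j); diagonal entries are the matched pairs."""
    b = scores.size(0)
    pos = scores.diag().view(b, 1)                                   # S(I, T) for matched pairs
    mask = torch.eye(b, dtype=torch.bool, device=scores.device)
    neg = scores.masked_fill(mask, float('-inf'))
    hardest_text = neg.max(dim=1).values.view(b, 1)                  # S(I, T_hat): hardest negative text per image
    hardest_image = neg.max(dim=0).values.view(b, 1)                 # S(I_hat, T): hardest negative image per text
    loss = (margin - pos + hardest_text).clamp(min=0) + \
           (margin - pos + hardest_image).clamp(min=0)               # equation (19)
    return loss.sum()

# equation (20), w_new = w_old - mu * grad(L), corresponds to plain SGD over the linear layers:
# optimizer = torch.optim.SGD(model.parameters(), lr=mu)
# loss = triplet_loss(model(images, captions)); loss.backward(); optimizer.step()
```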
An image-text matching system based on a mixed focusing attention mechanism comprises the following functional modules:
and a characteristic extraction module of the image-text pair: extracting the characteristics of the salient region in the image and the characteristics of each word in the natural language description;
cross-modal attention mechanism module: the focused cross-modal attention mechanism is utilized to adaptively adjust the temperature coefficients of the attention mechanism to different pictures, so that effective and ineffective regional features are distinguished, and cross-modal context extraction and fusion of regional-level and word-level features are realized;
gated self-attention mechanism module: the intra-modal fusion of the regional features and the word features is realized by using a gating self-attention mechanism, the effective regional features and word features are adaptively selected by controlling the self-attention matrix through a gating signal, noise and redundant regions are covered, and the distinguishing degree of the different regional features and word features is enhanced;
and a matching score calculation module: the cross-modal and self-modal region features and word features are used to calculate the matching score for the entire image and sentence.
The foregoing image-text matching system based on the mixed focusing attention mechanism further comprises:
A loss function optimization module: optimizing all the linear layers in the feature extraction module for image-text pairs, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module with the triplet loss function, and executing the working processes of these modules after the optimization.
A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as described above.
The beneficial effects of the invention are as follows: the invention can automatically judge whether the content of a given image is consistent with a given natural language description and obtain a matching score; it can be used for cross-modal retrieval on the Internet, i.e., retrieving the image corresponding to a text or the text corresponding to an image; and during image-text matching it can adaptively filter and compress redundant or noisy region features, thereby better realizing mutual retrieval between pictures and texts.
Drawings
Fig. 1 is a flow chart of a graph-text matching method based on a mixed focus attention mechanism.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings.
Example 1
As shown in fig. 1, the present invention provides a graph-text matching method based on a hybrid focusing attention mechanism, which comprises the following steps:
s1, extracting characteristics of image-text pairs, namely extracting characteristics of a salient region in an image and characteristics of each word in natural language description;
s2, utilizing a focused cross-modal attention mechanism to adaptively adjust temperature coefficients of the attention mechanism on different pictures, so as to distinguish effective and ineffective regional features and realize cross-modal context extraction and fusion of regional-level and word-level features;
s3, realizing intra-modal fusion of the regional features and the word features by using a gating self-attention mechanism, controlling a self-attention matrix to adaptively select effective regional features and word features by using a gating signal, masking noise and redundant regions, and enhancing the distinguishing degree of the different regional features and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
In step S1, two sub-steps are included: extracting the salient region features of the image, and extracting the word features of the natural language description.
S11, extracting the region features of the image: the m most salient regions in the image are detected with a pre-trained Faster R-CNN object detector, the feature corresponding to each region is extracted and mapped to a D-dimensional hidden space through a linear layer, and the obtained region features are denoted as V = {v_1, …, v_m}, v_i ∈ ℝ^D, where each element of the feature vector v_i is a real number, D denotes the dimension of the feature vector, ℝ denotes the real number field and ℝ^D denotes a D-dimensional real vector;
Step S12, extracting the word features of the text: for a natural language description containing n words, the feature of each word is extracted with a bidirectional gated recurrent unit (Bi-GRU); the forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state at each word:

h_i^fwd = GRU_fwd(x_i, h_{i-1}^fwd)

where h_i^fwd denotes the hidden state of the forward pass, x_i denotes the one-hot encoding of the i-th word, and GRU_fwd denotes the forward pass of the Bi-GRU;
then, the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state at each word:

h_i^bwd = GRU_bwd(x_i, h_{i+1}^bwd)

where h_i^bwd denotes the hidden state of the backward pass and GRU_bwd denotes the backward pass of the Bi-GRU;
finally, the word feature e_i is obtained by averaging the hidden state h_i^fwd of the forward pass and the hidden state h_i^bwd of the backward pass, namely:

e_i = (h_i^fwd + h_i^bwd) / 2

and the word features are mapped to the D-dimensional hidden space through a linear layer, denoted as {e_1, …, e_n}, e_i ∈ ℝ^D, where D denotes the dimension of the feature vector.
In step S2, after the salient region features of the image and the word features of the text have been extracted, local interactions between the two modalities are performed with a focused cross-modal attention mechanism to obtain modality-complementary information. In order to distinguish the importance of the regions, a focusing operation is applied to the attention matrix obtained by the cross-modal attention mechanism, which increases the differences between the attention scores, so that redundant and noisy regions can be better filtered out while useful regions are retained. This step comprises two sub-steps: first the computation of the attention focusing score, and then the implementation flow of the focused cross-modal attention mechanism.
In step S21, the attention focusing score is computed. For a batch of images during training, a given description should match different images to different degrees: for a matching image-text sample pair, the current description should match the corresponding image more strongly, and otherwise more weakly. The attention focusing score is therefore computed from the overall content of the image and the text, yielding the focusing degree of the current text on the different images in a batch of samples.
This embodiment uses global information to distinguish the degree to which the text focuses on an image, complementary to the local nature of the cross-modal attention mechanism. Given the image region features {v_1, …, v_m} and the word features {e_1, …, e_n} of the description, their average features are computed and denoted as the image region average feature v̄ and the word average feature ē; with v̄ and ē as query objects, the attention score of each region and each word is computed:

a_i^v = q_v^T (W_v v_i ⊙ U_v v̄)

a_i^e = q_e^T (W_e e_i ⊙ U_e ē)

where a_i^v denotes the attention score of the image region average feature v̄ for the i-th image region feature v_i, a_i^e denotes the attention score of the word average feature ē for the i-th word feature e_i, W_v, U_v and W_e, U_e are respectively a first, second, third and fourth parameter matrix, q_v and q_e are parameter vectors, and ⊙ denotes element-wise multiplication; weighting and summing the region and word features with the attention scores yields the global features of the image and of the text, namely:

v_g = Σ_{i=1}^{m} a_i^v v_i,  t_g = Σ_{i=1}^{n} a_i^e e_i

where v_g denotes the global feature of the image and t_g denotes the global feature of the sentence description.
For a batch of images of size b, the focusing degree f_i of the current text description on the i-th image in the batch is computed as:

f_i = σ(q^T (v_g^(i) ‖ t_g))

where q is a parameter vector, ‖ denotes the concatenation of two feature vectors, and σ(·) is the sigmoid activation function, thereby obtaining the focusing degrees {f_1, …, f_b} of the current text description on the b images.
Step S22, the implementation flow of the focused cross-modal attention mechanism. After obtaining the region features {v_1, …, v_m} of the i-th image, the word features {e_1, …, e_n} of the text description, and the focusing score f of the text description on the i-th image, the similarity score s_ij of each word to each region is computed through local word-region interaction, namely:

s_ij = e_i^T v_j

where (·)^T denotes transposition; L2-normalizing the similarity scores s_ij yields the normalized similarity ŝ_ij, which represents the similarity between the i-th word and the j-th region.
Existing image-text matching methods control the similarity scores with a hyper-parameter temperature coefficient λ, thereby sharpening the attention of words over regions, and obtain the attention score α_ij as:

α_ij = exp(λ ŝ_ij) / Σ_{j=1}^{m} exp(λ ŝ_ij)

When the temperature coefficient λ increases, the attention scores become more concentrated and the i-th word tends to attend to only one or a few regions; when λ decreases, the attention scores become more dispersed and the attention of the i-th word tends to be uniform over all regions.
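For instance, the sharpening effect of the temperature coefficient can be seen with a small softmax experiment; the similarity values and the values of λ below are arbitrary examples chosen only for illustration.

```python
import torch

s_hat = torch.tensor([0.9, 0.5, 0.1])      # normalised word-region similarities for one word
for lam in (1.0, 4.0, 16.0):
    alpha = torch.softmax(lam * s_hat, dim=0)
    print(lam, alpha)                       # a larger lam concentrates the attention on the best region
```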
The hyper-parameter temperature coefficient λ above is fixed, and the same temperature coefficient is often used for the different images in a batch. In this embodiment the temperature coefficient is additionally controlled by the focusing score f, so that the text description can have a different temperature coefficient for each image, and the validity of the regions in different images, i.e., whether a region carries useful information or noisy/redundant information, can be better distinguished. The attention score in this embodiment is obtained by:

α_ij = exp(λ f ŝ_ij) / Σ_{j=1}^{m} exp(λ f ŝ_ij)

Through the focusing score f, the focused cross-modal attention mechanism of this embodiment can distinguish the different images more effectively.
Weighting and summing the regions with the attention scores of each word yields the cross-modal context feature c_i corresponding to each word, namely:

c_i = Σ_{j=1}^{m} α_ij v_j

The fusion of the i-th word feature e_i and the corresponding cross-modal context feature c_i is realized through a linear layer, namely:

ê_i = fc(e_i ‖ c_i)

where ê_i denotes the feature obtained by fusing the information of the two modalities and fc is a linear layer.
The global feature v_g of the image and the global feature t_g of the sentence description obtained in step S21 are fused to obtain the fused global feature e_g, namely:

e_g = fc(v_g ‖ t_g)

The fused global feature e_g and the fusion features ê_1, …, ê_n corresponding to each word are merged and recorded as the multi-modal feature E = {e_g, ê_1, …, ê_n}.
In the next step, the extraction and fusion of the intra-modal information are realized through a gating self-attention mechanism.
In step S3, a gated self-modal attention mechanism is applied to the given multi-modal feature E = {e_g, ê_1, …, ê_n}, where ê_1 to ê_n can be regarded as local features, e.g. word features fused with visual information, and e_g can be regarded as the global feature. The importance of the local features differs: for example, the importance of each word differs, and the nouns in a sentence are generally more important than the prepositions. This embodiment therefore designs a gated self-modal attention mechanism that uses a gating signal to control the attention score matrix of the local features and thereby controls the importance of the different local information. The attention score matrix is computed as:

A = fc_q(E) × [fc_k(E)]^T   (13)

where fc_q(·) and fc_k(·) denote two linear layers with different parameters;
the gating signal G can be computed as:

G = tanh(q^T · E)   (14)

where tanh(·) is the activation function, q ∈ ℝ^D is a learnable parameter vector, and the gating signal G ∈ ℝ^{n+1}; each scalar element g_i, i ∈ [0, n], of G can be regarded as the importance of the corresponding feature; before softmax normalization of each row of elements in the attention matrix A, the gating scores are separated into important and non-important features by a threshold, i.e., each g_i is fixed to a hard score:

g_i = h if g_i ≥ t, otherwise g_i = l   (15)

where t is the threshold (set to 0 in the experiments), l is the score of a non-important local feature and h is the score of an important local feature;
the gating vector is expressed as G = (g_0, g_1, …, g_n); the i-th gating signal g_i weights the i-th column of the attention score matrix A, expressed as:

a^G_{j,i} = g_i · a_{j,i}   (16)

where a_{i,j}, i ∈ [0, n], j ∈ [0, n], denotes an element of the attention score matrix A;
each row of the gated attention matrix A^G is then normalized with the softmax function; because the gating weighting is applied before the row-wise softmax, the attention distribution of each row is sharpened, so that each query focuses on the important features;
finally, the updated global feature e_g^new (and likewise each updated local feature) is obtained by weighting and summing the multi-modal features E with the normalized attention scores, namely:

E^new = relu(fc_v(softmax(A^G) · E))   (17)

where relu(·) is the activation function, fc_v(·) is a linear layer, and E ∈ ℝ^{(n+1)×D} is the multi-modal feature matrix obtained in the previous step;
the feature matrix updated by the gated self-modal attention mechanism can be recorded as E^new = {e_g^new, ê_1^new, …, ê_n^new}, where e_g^new denotes the updated global feature and ê_1^new, …, ê_n^new denote the updated local features.
In step S4, the image-text matching score is computed and the model is trained. Based on the updated features E^new = {e_g^new, ê_1^new, …, ê_n^new} obtained in step S3, the score of the current image-text pair is predicted through a linear layer, which can be expressed as:

S(I, T) = σ(fc(e_g^new))   (18)

where σ(·) is the sigmoid activation function, fc(·) denotes a linear layer and S(I, T) denotes the matching score between the image I and the text description T; the above formula indicates that the updated global feature e_g^new is used to predict the image-text matching score.
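A sketch of the score prediction of equation (18), together with an example of how such scores would be used for retrieval, is given below; the dimension D = 1024 and the ranking of a gallery of precomputed global features are illustrative assumptions, not steps prescribed by the embodiment.

```python
import torch

score_head = torch.nn.Linear(1024, 1)            # fc(.) in equation (18), D = 1024 assumed

def match_score(e_g_new):
    """S(I, T) = sigmoid(fc(e_g_new)) for an image-text pair (or a batch of them)."""
    return torch.sigmoid(score_head(e_g_new)).squeeze(-1)

# illustrative text-to-image retrieval: score the query text against every candidate image
# (assuming the updated global feature of each candidate pair has been precomputed)
gallery_globals = torch.randn(1000, 1024)
scores = match_score(gallery_globals)            # (1000,)
ranking = scores.argsort(descending=True)        # gallery indices sorted by decreasing score
```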
An image-text matching system based on a mixed focusing attention mechanism comprises the following functional modules:
and a characteristic extraction module of the image-text pair: extracting the characteristics of the salient region in the image and the characteristics of each word in the natural language description;
cross-modal attention mechanism module: the focused cross-modal attention mechanism is utilized to adaptively adjust the temperature coefficients of the attention mechanism to different pictures, so that effective and ineffective regional features are distinguished, and cross-modal context extraction and fusion of regional-level and word-level features are realized;
gated self-attention mechanism module: the intra-modal fusion of the regional features and the word features is realized by using a gating self-attention mechanism, the effective regional features and word features are adaptively selected by controlling the self-attention matrix through a gating signal, noise and redundant regions are covered, and the distinguishing degree of the different regional features and word features is enhanced;
and a matching score calculation module: the cross-modal and self-modal region features and word features are used to calculate the matching score for the entire image and sentence.
In the feature extraction module for image-text pairs, the following steps are executed:
S11, detecting the m most salient regions in the image with a pre-trained Faster R-CNN object detector, extracting the feature corresponding to each region and mapping it to a D-dimensional hidden space through a linear layer; the obtained region features are denoted as V = {v_1, …, v_m}, v_i ∈ ℝ^D, where each element of the feature vector v_i is a real number, D denotes the dimension of the feature vector, ℝ denotes the real number field and ℝ^D denotes a D-dimensional real vector;
S12, for a natural language description containing n words, extracting the feature of each word with a bidirectional gated recurrent unit (Bi-GRU); the forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state at each word:

h_i^fwd = GRU_fwd(x_i, h_{i-1}^fwd)

where h_i^fwd denotes the hidden state of the forward pass, x_i denotes the one-hot encoding of the i-th word, and GRU_fwd denotes the forward pass of the Bi-GRU;
then, the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state at each word:

h_i^bwd = GRU_bwd(x_i, h_{i+1}^bwd)

where h_i^bwd denotes the hidden state of the backward pass and GRU_bwd denotes the backward pass of the Bi-GRU;
finally, the word feature e_i is obtained by averaging the hidden state h_i^fwd of the forward pass and the hidden state h_i^bwd of the backward pass, namely:

e_i = (h_i^fwd + h_i^bwd) / 2

and the word features are mapped to the D-dimensional hidden space through a linear layer, denoted as {e_1, …, e_n}, e_i ∈ ℝ^D, where D denotes the dimension of the feature.
A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as described above.
Example 2
The image-text matching method based on the mixed focusing attention mechanism further comprises, after executing step S1-step S4 in embodiment 1:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
In step S5, all the linear layers in steps S1 to S4 are optimized with a triplet loss function, and steps S1 to S4 are executed after the optimization; the triplet loss function L is expressed as:

L = [a − S(I, T) + S(I, T̂)]_+ + [a − S(I, T) + S(Î, T)]_+   (19)

where [x]_+ = max(x, 0), a is a margin threshold, and T̂ and Î are respectively the first and second hardest negative samples. The hardest negative samples are defined as follows: among the descriptions that do not match the current query image I, the text sample T̂ is the one with the highest matching score S(I, T̂); similarly, among the images that do not match the current query text T, the image sample Î is the one with the highest matching score S(Î, T).
The formula for optimizing all the linear layers with the triplet loss function is:

w_new = w_old − μ × grad(L)   (20)

where w_new and w_old denote a parameter scalar inside a linear layer after and before optimization respectively, μ is the learning rate, and grad(·) denotes the gradient computation.
Steps S1 to S4 are specifically performed in the same manner as in example 1.
The image-text matching system based on the mixed focusing attention mechanism further comprises, in addition to the feature extraction module for image-text pairs, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module of embodiment 1:
A loss function optimization module: optimizing all the linear layers in the feature extraction module for image-text pairs, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module with the triplet loss function, and executing the working processes of these modules after the optimization.
The specific execution process of each functional module is the same as that of embodiment 1.
The following is a description of specific experimental data.
Training, validation and testing are carried out on the Flickr30K dataset, which contains 31,783 pictures, each with 5 corresponding natural language descriptions; 29,783 pictures are used for model training, 1,000 for validation and 1,000 for testing, and the invention achieves good results.
The effectiveness of the image-text matching method based on the mixed focusing attention mechanism proposed in this embodiment is measured by Recall@K (abbreviated R@K, with K = 1, 5, 10), which denotes the proportion of queries for which a correct answer appears in the top-K retrieval results; the overall performance of the image-text matching method is measured by rsum, obtained by adding the R@1, R@5 and R@10 of image-to-text retrieval and of text-to-image retrieval, namely:

rsum = (R@1 + R@5 + R@10)_image-to-text + (R@1 + R@5 + R@10)_text-to-image

where the left term is the sum of R@1, R@5 and R@10 for retrieving text with an image query, and the right term is the sum of R@1, R@5 and R@10 for retrieving an image with a text query.
Table 1 compares the method of the present invention with other image-text matching methods on the Flickr30K dataset, including several classical methods in the field: SCAN (CVPR 2018), PFAN (IJCAI 2019), VSRN (ICCV 2019), DP-RNN (AAAI 2020), CVSE (ECCV 2020) and CAAN (CVPR 2020). As can be seen from the results, the method of the invention achieves a more balanced retrieval effect for both image-to-text and text-to-image retrieval. Compared with the existing methods, the proposed image-text matching method based on the mixed focusing attention mechanism has the best overall performance, i.e., the highest rsum of 489.3. Moreover, in the two subtasks of image-to-text and text-to-image retrieval, the proposed method performs best on Recall@1, showing that its retrieval success rate is far higher than that of the other methods.
TABLE 1: comparison of the proposed method with other image-text matching methods on the Flickr30K dataset (reproduced as an image in the original publication).
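A simplified sketch of how Recall@K and rsum would be computed from a matrix of matching scores is given below; it assumes a single ground-truth candidate per query and illustrative function names, whereas the Flickr30K protocol with 5 captions per image counts a hit if any of the 5 correct captions appears in the top-K.

```python
import torch

def recall_at_k(scores, gt, ks=(1, 5, 10)):
    """scores: (num_queries, num_candidates); gt[i] is the index of the correct candidate for query i."""
    ranks = scores.argsort(dim=1, descending=True)
    hit = (ranks == gt.view(-1, 1))
    return {k: 100.0 * hit[:, :k].any(dim=1).float().mean().item() for k in ks}

def rsum(S, img2txt_gt, txt2img_gt):
    """S: (n_images, n_texts) matching scores; rsum = sum of R@1, R@5, R@10 in both directions."""
    i2t = recall_at_k(S, img2txt_gt)        # image-to-text retrieval
    t2i = recall_at_k(S.t(), txt2img_gt)    # text-to-image retrieval
    return sum(i2t.values()) + sum(t2i.values())
```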
It should be noted that each step/component described in the present application may be split into more steps/components, or two or more steps/components or part of the operations of the steps/components may be combined into new steps/components, as needed for implementation, to achieve the object of the present invention.
The above-described method according to the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA. It is understood that the computer, processor, microprocessor, controller or programmable hardware includes a memory component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processing shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown herein.
It will be readily appreciated by those skilled in the art that the foregoing is merely illustrative of the present invention and is not intended to limit the invention, but any modifications, equivalents, improvements or the like which fall within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. The image-text matching method based on the mixed focusing attention mechanism is characterized by comprising the following steps of:
S1, extracting the features of the salient regions in an image and the features of each word in a natural language description;
S2, utilizing a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, so as to distinguish effective and ineffective region features and realize cross-modal context extraction and fusion of region-level and word-level features;
S3, realizing intra-modal fusion of the region features and the word features with a gated self-attention mechanism, controlling the self-attention coefficient matrix with a gating signal to adaptively select the effective region features and word features, masking noisy and redundant regions, and enhancing the distinguishability of the different region features and word features;
S4, calculating the matching score of the whole image and the sentence by using the cross-modal and intra-modal region features and word features;
in step S2, two sub-steps are included:
Step S21, given the image region features {v_1, …, v_m} and the word features {e_1, …, e_n} of the description, their average features are computed separately and denoted as the image region average feature v̄ and the word average feature ē; with the image region average feature v̄ and the word average feature ē as query objects, the attention score of each region and each word is computed:

a_i^v = q_v^T (W_v v_i ⊙ U_v v̄)

a_i^e = q_e^T (W_e e_i ⊙ U_e ē)

where a_i^v denotes the attention score of the image region average feature v̄ for the i-th image region feature v_i, a_i^e denotes the attention score of the word average feature ē for the i-th word feature e_i, W_v, U_v and W_e, U_e are respectively a first, second, third and fourth parameter matrix, q_v and q_e are parameter vectors, and ⊙ denotes element-wise multiplication; weighting and summing the region and word features with the attention scores yields the global features of the image and of the text, namely:

v_g = Σ_{i=1}^{m} a_i^v v_i,  t_g = Σ_{i=1}^{n} a_i^e e_i

where v_g denotes the global feature of the image and t_g denotes the global feature of the sentence description;
for a batch of images of size b, the focusing degree f_i of the current text description on the i-th image in the batch is computed as:

f_i = σ(q^T (v_g^(i) ‖ t_g))

where q is a parameter vector, ‖ denotes the concatenation of two feature vectors, and σ(·) is the sigmoid activation function, thereby obtaining the focusing degrees {f_1, …, f_b} of the current text description on the b images;
Step S22, after obtaining the region features {v_1, …, v_m} of the i-th image, the word features {e_1, …, e_n} of the text description, and the focusing score f of the text description on the i-th image, the similarity score s_ij of each word to each region is computed through local word-region interaction, namely:

s_ij = e_i^T v_j

where (·)^T denotes transposition; L2-normalizing the similarity scores s_ij yields the normalized similarity ŝ_ij, which represents the similarity between the i-th word and the j-th region;
the attention score is obtained as:

α_ij = exp(λ f ŝ_ij) / Σ_{j=1}^{m} exp(λ f ŝ_ij)

where λ is a hyper-parameter temperature coefficient and f is the focusing score; weighting and summing the regions with the attention scores of each word yields the cross-modal context feature c_i corresponding to each word, namely:

c_i = Σ_{j=1}^{m} α_ij v_j

the fusion of the i-th word feature e_i and the corresponding cross-modal context feature c_i is realized through a linear layer, namely:

ê_i = fc(e_i ‖ c_i)

where ê_i denotes the feature obtained by fusing the information of the two modalities and fc is a linear layer;
the global feature v_g of the image and the global feature t_g of the sentence description obtained in step S21 are fused to obtain the fused global feature e_g, namely:

e_g = fc(v_g ‖ t_g)

and the fused global feature e_g and the fusion features ê_1, …, ê_n corresponding to each word are merged and recorded as the multi-modal feature E = {e_g, ê_1, …, ê_n}.
2. The method for matching graphics based on a hybrid focus attention mechanism of claim 1, further comprising:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
3. The image-text matching method based on the mixed focusing attention mechanism as claimed in claim 1 or 2, characterized in that step S1 comprises two sub-steps:
S11, detecting the m most salient regions in the image with a pre-trained Faster R-CNN object detector, extracting the feature corresponding to each region and mapping it to a D-dimensional hidden space through a linear layer; the obtained region features are denoted as V = {v_1, …, v_m}, v_i ∈ ℝ^D, where each element of the feature vector v_i is a real number, D denotes the dimension of the feature vector, ℝ denotes the real number field and ℝ^D denotes a D-dimensional real vector;
S12, for a natural language description containing n words, extracting the feature of each word with a bidirectional gated recurrent unit (Bi-GRU); the forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state at each word:

h_i^fwd = GRU_fwd(x_i, h_{i-1}^fwd)

where h_i^fwd denotes the hidden state of the forward pass, x_i denotes the one-hot encoding of the i-th word, and GRU_fwd denotes the forward pass of the Bi-GRU;
the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state at each word:

h_i^bwd = GRU_bwd(x_i, h_{i+1}^bwd)

where h_i^bwd denotes the hidden state of the backward pass and GRU_bwd denotes the backward pass of the Bi-GRU;
the word feature e_i is obtained by averaging the hidden state h_i^fwd of the forward pass and the hidden state h_i^bwd of the backward pass, namely:

e_i = (h_i^fwd + h_i^bwd) / 2

and the word features are mapped to the D-dimensional hidden space through a linear layer, denoted as {e_1, …, e_n}, e_i ∈ ℝ^D, where D denotes the dimension of the feature vector.
4. The method of claim 1, wherein in step S3, the attention coefficient matrix is calculated by the following formula:
$A = \mathrm{fc}_q(E) \times [\mathrm{fc}_k(E)]^{T}$   (13)

where $\mathrm{fc}_q(\cdot)$ and $\mathrm{fc}_k(\cdot)$ denote two linear layers with different parameters;
the gating signal G is calculated as:
$G = \tanh(q^{T} \cdot E)$   (14)
where $\tanh(\cdot)$ is the activation function, $q \in \mathbb{R}^{D}$ is a learnable parameter vector, and the gating signal $G \in \mathbb{R}^{n+1}$; each scalar element $g_i$ of G, $i \in [0, n]$, reflects the importance of the corresponding feature; before the rows of the attention coefficient matrix A are softmax-normalized, a threshold separates the gating scores into important and non-important features, i.e., each $g_i$ is fixed to a hard score:

$\bar{g}_i = \begin{cases} h, & g_i \geq t \\ l, & g_i < t \end{cases}$

where t is the threshold, l is the score assigned to non-important local features, and h is the score assigned to important local features;
the gating vector is denoted $\bar{G} = \{\bar{g}_0, \dots, \bar{g}_n\}$; weighting the i-th column of the attention coefficient matrix A with the i-th gating signal $\bar{g}_i$ gives:

$a^{G}_{j,i} = \bar{g}_i \cdot a_{j,i}$

where $a_{i,j}$, $i \in [0, n]$, $j \in [0, n]$, denotes each element of the attention coefficient matrix A, and $a^{G}_{j,i}$ denotes the corresponding element of the gated matrix $A^{G}$;
each row of the gated attention coefficient matrix $A^{G}$ is then normalized with the softmax function;
the updated features are obtained by weighting and summing the multi-modal feature E with the attention scores:

$\tilde{E} = \mathrm{fc}_v\big(\mathrm{relu}(\mathrm{softmax}(A^{G})\, E)\big)$

where $\mathrm{relu}(\cdot)$ is the activation function, $\mathrm{fc}_v(\cdot)$ is a linear layer, and $E \in \mathbb{R}^{(n+1) \times D}$ is the multi-modal feature matrix obtained in the previous step;
the feature matrix updated by the gated self-modal attention mechanism is recorded as $\tilde{E} = \{\tilde{e}_g, \tilde{e}_1, \dots, \tilde{e}_n\}$, where $\tilde{e}_g$ denotes the updated global feature and $\tilde{e}_i$ denotes the updated local features.
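The gated self-modal attention of step S3 can be sketched as follows; the threshold t, the hard scores l and h, and the exact placement of relu and fc_v around the weighted sum are assumptions for the example rather than values taken from the patent.

```python
import torch
import torch.nn.functional as F

def gated_self_attention(E, fc_q, fc_k, fc_v, q, t=0.0, l=0.1, h=1.0):
    """Gated self-modal attention over the multi-modal matrix E of shape (n+1, D)."""
    A = fc_q(E) @ fc_k(E).t()                    # attention coefficient matrix, as in (13)
    g = torch.tanh(E @ q)                        # gating signal G = tanh(q^T E), as in (14)
    g_hard = torch.where(g >= t,                 # hard gate: important vs non-important
                         torch.full_like(g, h),
                         torch.full_like(g, l))
    A_g = A * g_hard.unsqueeze(0)                # weight the i-th column of A by the i-th gate
    attn = F.softmax(A_g, dim=1)                 # row-wise softmax over the gated coefficients
    return fc_v(F.relu(attn @ E))                # updated features: attention-weighted sum of E

D, n = 1024, 12
E = torch.randn(n + 1, D)                        # [e_g, e'_1, ..., e'_n] from step S2
fc_q, fc_k, fc_v = (torch.nn.Linear(D, D) for _ in range(3))
q = torch.randn(D)
E_updated = gated_self_attention(E, fc_q, fc_k, fc_v, q)   # (n+1, D) updated feature matrix
```

The column-wise multiplication by the hard gate is what masks noisy or redundant features before the row-wise softmax is applied.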
5. The method of claim 4, wherein in step S4 the updated feature matrix $\tilde{E} = \{\tilde{e}_g, \tilde{e}_1, \dots, \tilde{e}_n\}$ obtained in step S3 is used, and the score of the current image-text pair is predicted through a linear layer, expressed as:
$S(I, Y) = \sigma\big(\mathrm{fc}(\tilde{e}_g)\big)$
where σ(·) is the sigmoid activation function, fc(·) denotes the linear layer, and S(I, Y) denotes the matching score between image I and text description Y.
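A minimal sketch of the score head in step S4, assuming it is the updated global feature (here the first row of the updated matrix) that is fed to the linear layer:

```python
import torch

D = 1024
fc_score = torch.nn.Linear(D, 1)
E_tilde = torch.randn(13, D)            # updated multi-modal features from step S3 (n + 1 rows)
e_g_tilde = E_tilde[0]                  # assumed: the updated global feature is the first row
S = torch.sigmoid(fc_score(e_g_tilde))  # matching score S(I, Y) in (0, 1)
```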
6. The method of claim 5, wherein in step S5, the triplet loss function L is expressed as:
$L = \big[a - S(I, Y) + S(\hat{I}, Y)\big]_{+} + \big[a - S(I, Y) + S(I, \hat{Y})\big]_{+}$

where $[x]_{+} = \max(x, 0)$, a is the margin threshold, and $\hat{I}$ and $\hat{Y}$ are the first and second hardest negative samples, respectively (the hardest negative image and the hardest negative sentence);
the formula for optimizing all the linear layers by using the triplet loss function is as follows:
$w_{\mathrm{new}} = w_{\mathrm{old}} - \mu \times \mathrm{grad}(L)$   (20)
where $w_{\mathrm{old}}$ and $w_{\mathrm{new}}$ are parameter scalars inside a linear layer, $w_{\mathrm{old}}$ being the parameter before optimization and $w_{\mathrm{new}}$ the parameter after optimization, μ is the learning rate, and grad(·) denotes the gradient computation.
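The triplet loss with hardest negatives and the plain gradient-descent update of equation (20) can be sketched as follows; the margin value, the batch construction and the toy score matrix are illustrative assumptions.

```python
import torch

def triplet_loss(scores, margin=0.2):
    """scores: (b, b) matrix with scores[i, j] = S(I_i, Y_j); the diagonal holds matched pairs.

    Uses the hardest negative image and hardest negative sentence for each positive pair.
    """
    b = scores.size(0)
    pos = scores.diag().unsqueeze(1)                     # S(I, Y) for matched pairs
    mask = torch.eye(b, dtype=torch.bool)
    neg = scores.masked_fill(mask, float('-inf'))        # exclude the positives
    hard_neg_sent = neg.max(dim=1).values.unsqueeze(1)   # hardest negative sentence per image
    hard_neg_img = neg.max(dim=0).values.unsqueeze(1)    # hardest negative image per sentence
    loss = (margin - pos + hard_neg_sent).clamp(min=0) + \
           (margin - pos + hard_neg_img).clamp(min=0)
    return loss.sum()

# Manual update w_new = w_old - mu * grad(L) for every linear-layer parameter, as in (20).
fc = torch.nn.Linear(8, 1)
scores = torch.sigmoid(fc(torch.randn(4, 8))) @ torch.ones(1, 4)   # toy (b, b) score matrix
L = triplet_loss(scores)
L.backward()
mu = 0.001
with torch.no_grad():
    for w in fc.parameters():
        w -= mu * w.grad                                 # plain gradient-descent step
```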
7. The image-text matching system based on the mixed focusing attention mechanism is characterized by comprising the following functional modules:
an image-text pair feature extraction module: extracting the features of the salient regions in the image and the features of each word in the natural language description;
a cross-modal attention mechanism module: using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different images, thereby distinguishing effective from ineffective region features and realizing cross-modal context extraction and fusion of region-level and word-level features;
a gated self-attention mechanism module: realizing intra-modal fusion of region features and word features with a gated self-attention mechanism, adaptively selecting effective region and word features by controlling the self-attention coefficient matrix with a gating signal, masking noisy and redundant regions, and enhancing the discrimination between different region and word features;
a matching score calculation module: calculating the matching score of the whole image and the sentence using the cross-modal and self-modal region features and word features;
in the cross-modal attention mechanism module, the following steps are performed:
step S21, given the image region features $\{v_1, \dots, v_m\}$ and the word features $\{e_1, \dots, e_n\}$ of the description, their average features are calculated separately and recorded as the image region average feature $\bar{v}$ and the word average feature $\bar{e}$; with the image region average feature $\bar{v}$ and the word average feature $\bar{e}$ as query objects, the attention score of each region and each word is calculated:
$\alpha^{v}_{i} = \dfrac{\exp\!\big(q_v^{T} \tanh(W_v v_i \odot U_v \bar{v})\big)}{\sum_{k=1}^{m} \exp\!\big(q_v^{T} \tanh(W_v v_k \odot U_v \bar{v})\big)}, \qquad \alpha^{e}_{i} = \dfrac{\exp\!\big(q_e^{T} \tanh(W_e e_i \odot U_e \bar{e})\big)}{\sum_{k=1}^{n} \exp\!\big(q_e^{T} \tanh(W_e e_k \odot U_e \bar{e})\big)}$
where $\alpha^{v}_{i}$ denotes the attention score of the image region average feature $\bar{v}$ for the i-th image region feature $v_i$, $\alpha^{e}_{i}$ denotes the attention score of the word average feature $\bar{e}$ for the i-th word feature $e_i$, $W_v$, $U_v$ and $W_e$, $U_e$ are the first, second, third and fourth parameter matrices, $q_v$ and $q_e$ are parameter vectors, and ⊙ denotes element-wise multiplication; weighting and summing the region and word features with the attention scores yields the global features of the image and the text:
$v^{g} = \sum_{i=1}^{m} \alpha^{v}_{i}\, v_i, \qquad e^{g} = \sum_{i=1}^{n} \alpha^{e}_{i}\, e_i$

where $v^{g}$ denotes the global feature of the image and $e^{g}$ denotes the global feature of the sentence description;
for a batch of b images, the focusing degree $f_i$ of the current text description on the i-th image is calculated as:

$f_i = \sigma\big(q^{T} (v^{g}_i \,\|\, e^{g})\big)$

where q is a parameter vector, $v^{g}_i$ is the global feature of the i-th image, ∥ denotes the concatenation of the two feature vectors, and σ(·) is the sigmoid activation function; this yields the focusing degrees $\{f_1, \dots, f_b\}$ of the current text description on the b images, as sketched below;
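A minimal sketch of step S21 follows, assuming an attention-pooling form built from the parameter names above (tanh of an element-wise interaction between W·feature and U·query, scored by q) and a separate parameter vector q_f for the focusing degree; these exact functional forms are assumptions for the illustration.

```python
import torch
import torch.nn.functional as F

def pooled_global(X, W, U, q):
    """Attention-pool features X of shape (k, D) using their mean as the query."""
    x_bar = X.mean(dim=0)                            # average feature used as the query
    scores = torch.tanh((X @ W) * (x_bar @ U)) @ q   # assumed element-wise interaction form
    alpha = F.softmax(scores, dim=0)                 # attention score over the k features
    return alpha @ X                                 # weighted sum -> global feature

D, m, n, b = 1024, 36, 12, 8
W_v, U_v, W_e, U_e = (torch.randn(D, D) * 0.01 for _ in range(4))
q_v, q_e = torch.randn(D), torch.randn(D)
q_f = torch.randn(2 * D)                             # parameter vector for the focusing degree

E = torch.randn(n, D)                                # word features of the current description
e_g = pooled_global(E, W_e, U_e, q_e)                # global text feature e^g

# Focusing degree of the current description on each of the b images in the batch.
f = []
for V in torch.randn(b, m, D):                       # region features of each image in the batch
    v_g = pooled_global(V, W_v, U_v, q_v)            # global image feature v^g
    f.append(torch.sigmoid(q_f @ torch.cat([v_g, e_g])))   # f_i = sigmoid(q^T [v^g ‖ e^g])
f = torch.stack(f)                                   # {f_1, ..., f_b}
```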
step S22, after obtaining the region features $\{v_1, \dots, v_m\}$ of the i-th image, the word features $\{e_1, \dots, e_n\}$ of the text description, and the focusing degree f of the i-th image, the similarity score $s_{ij}$ of each word to each region is calculated through local word-region interaction:

$s_{ij} = e_i^{T} v_j$
where $(\cdot)^{T}$ denotes the transpose and $v_j$ denotes the feature of the j-th region; the similarity score $s_{ij}$ is L2-normalized to obtain the normalized similarity $\bar{s}_{ij}$, which represents the similarity between the i-th word and the j-th region;
the attention score is given by:

$\alpha_{ij} = \dfrac{\exp(\lambda f\,\bar{s}_{ij})}{\sum_{j=1}^{m}\exp(\lambda f\,\bar{s}_{ij})}$

where λ is a hyper-parameter temperature coefficient; weighting and summing the region features with the attention scores of each word yields the cross-modal context feature $c_i$ corresponding to each word:

$c_i = \sum_{j=1}^{m} \alpha_{ij}\, v_j$
the i-th word feature $e_i$ and its corresponding cross-modal context feature $c_i$ are fused through a linear layer:

$e_i' = \mathrm{fc}(e_i \,\|\, c_i)$

where $e_i'$ denotes the feature obtained by fusing the information of the two modalities and fc is a linear layer;
the global image feature $v^{g}$ obtained in step S21 and the global sentence feature $e^{g}$ are fused to obtain the fused global feature $e_g$:

$e_g = \mathrm{fc}(v^{g} \,\|\, e^{g})$
the fused global feature $e_g$ and the fused features $e_i'$ of each word are stacked and recorded as the multi-modal feature $E = \{e_g, e_1', \dots, e_n'\} \in \mathbb{R}^{(n+1)\times D}$.
8. The image-text matching system based on the mixed focusing attention mechanism of claim 7, further comprising:
a loss function optimization module: optimizing all the linear layers in the image-text pair feature extraction module, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module with the triplet loss function, and executing the working processes of these four modules again after the optimization.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202310424288.4A 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism Active CN116150418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310424288.4A CN116150418B (en) 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism


Publications (2)

Publication Number Publication Date
CN116150418A CN116150418A (en) 2023-05-23
CN116150418B true CN116150418B (en) 2023-07-07

Family

ID=86352855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310424288.4A Active CN116150418B (en) 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism

Country Status (1)

Country Link
CN (1) CN116150418B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176017A1 (en) * 2017-03-24 2018-09-27 Revealit Corporation Method, system, and apparatus for identifying and revealing selected objects from video
CN114492646A (en) * 2022-01-28 2022-05-13 北京邮电大学 Image-text matching method based on cross-modal mutual attention mechanism
CN114691986A (en) * 2022-03-21 2022-07-01 合肥工业大学 Cross-modal retrieval method based on subspace adaptive spacing and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516085B (en) * 2019-07-11 2022-05-17 西安电子科技大学 Image text mutual retrieval method based on bidirectional attention
CN112966135B (en) * 2021-02-05 2022-03-29 华中科技大学 Image-text retrieval method and system based on attention mechanism and gate control mechanism
CN113065012B (en) * 2021-03-17 2022-04-22 山东省人工智能研究院 Image-text analysis method based on multi-mode dynamic interaction mechanism
CN114155429A (en) * 2021-10-09 2022-03-08 信阳学院 Reservoir earth surface temperature prediction method based on space-time bidirectional attention mechanism
CN114461821A (en) * 2022-02-24 2022-05-10 中南大学 Cross-modal image-text inter-searching method based on self-attention reasoning
CN115017266A (en) * 2022-06-23 2022-09-06 天津理工大学 Scene text retrieval model and method based on text detection and semantic matching and computer equipment




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant