CN114896438A - Image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism - Google Patents
Image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism
- Publication number: CN114896438A
- Application: CN202210504224.0A
- Authority: CN (China)
- Prior art keywords: feature vector, text, image, similarity
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/532 — Information retrieval of still image data; query formulation, e.g. graphical querying
- G06F16/583 — Retrieval of still image data characterised by using metadata automatically derived from the content
- G06F16/5846 — Retrieval of still image data using metadata automatically derived from the content, using extracted text
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/08 — Neural network learning methods
- G06V10/40 — Extraction of image or video features
- G06V10/761 — Image or video pattern matching; proximity, similarity or dissimilarity measures
- G06V10/82 — Image or video recognition or understanding using neural networks
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, which comprises the following steps: respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text; obtaining an image feature graph and a text feature graph according to the connection relations between different nodes of the initial image feature vector and the initial text feature vector; respectively inputting the image feature graph and the text feature graph into a combined graph attention and generalized pooling module to obtain the final image and text feature vectors; obtaining a comprehensive similarity based on the first, second and third similarities, calculating a loss function using the comprehensive similarity, and back-propagating the loss to update the network parameters; and obtaining the retrieval matching result using the comprehensive similarity. The invention alleviates the alignment difficulty of the retrieval task and yields more complete image and text feature vectors that better represent the image-text matching relationship, thereby improving retrieval accuracy.
Description
Technical Field
The invention belongs to the technical field of data mining, and relates to an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism.
Background
In recent years, with the rapid development of the internet, people receive large amounts of data every day, and researchers have focused on how to accurately retrieve the required information from this mass of data. Image-text retrieval offers a solution to this problem.
The essence of image-text retrieval is to encode samples of the two modalities, image and text, to obtain their semantic representation features, and to design a corresponding similarity calculation method that measures the similarity between image features and text features. With an image-text retrieval model, a user can quickly find the image corresponding to a given text description, and conversely can quickly obtain the text describing a given image. Existing hierarchical alignment approaches only consider semantic alignment between the whole image and the whole sentence and between image regions and words, ignoring non-object elements such as global background information. Such semantic alignment is susceptible to negative examples that contain similar object entities but slightly different backgrounds. Meanwhile, traditional feature aggregation methods adopt max pooling or average pooling and ignore the importance of the cooperative relationship between global and local multi-modal features.
Therefore, how to improve semantic alignment and how to enhance the global-local cooperative relationship of multi-modal features have become urgent problems to be solved.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism. The technical problem to be solved by the invention is achieved by the following technical scheme:
the embodiment of the invention provides an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, which comprises the following steps:
step 1, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by concatenating a global feature vector and a local feature vector;
step 2, correspondingly obtaining an image feature graph and a text feature graph according to the connection relations between different nodes of the initial image feature vector and the initial text feature vector;
step 3, respectively inputting the image feature graph and the text feature graph into a combined graph attention and generalized pooling module to obtain a final image feature vector and a final text feature vector;
step 4, obtaining a comprehensive similarity between the preset image and the preset text based on a first similarity between the global feature vector and the initial text feature vector, a second similarity between the local feature vector and the initial text feature vector, and a third similarity between the final image feature vector and the final text feature vector; calculating a loss function using the comprehensive similarity; and back-propagating the loss to update the network parameters, wherein the network parameters are located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module, respectively;
and step 5, obtaining a retrieval matching result using the final comprehensive similarity output by the model after the network parameters are updated.
In one embodiment of the present invention, step 1 comprises:
step 1.1, extracting the global feature vector $V_G$ and the local feature vector $V_L$ of the preset image;
step 1.2, concatenating the global feature vector $V_G$ and the local feature vector $V_L$ to obtain the initial image feature vector;
step 1.3, extracting the initial text feature vector $T_S$ of the preset text.
In one embodiment of the invention, the global feature vector $V_G$ is:
$V_G = W_g G + b_g$,
where $V_G$ denotes the global feature vector of the preset image, $W_g \in \mathbb{R}^{D \times D_0}$ denotes the first weight matrix, $D$ denotes the dimension of the output image feature vector, $D_0$ denotes the size of each pixel feature, $G \in \mathbb{R}^{D_0 \times m}$ denotes the first output feature, $m$ denotes the size of the reconstructed feature map, and $b_g$ denotes the first bias constant;
the local feature vector $V_L$ is:
$V_L = W_l L + b_l$,
where $V_L$ denotes the local feature vector of the preset image, $W_l \in \mathbb{R}^{D \times D_k}$ denotes the second weight matrix, $D_k$ denotes the dimension of each region feature, $L \in \mathbb{R}^{D_k \times k}$ denotes the second output feature, $k$ denotes the number of regions detected in the preset image, and $b_l$ denotes the second bias constant;
the initial image feature vector is:
$V_U = V_G \,\|\, V_L$,
where $V_U$ denotes the initial image feature vector, $\|$ denotes the concatenation operation, and $D_U$ denotes the dimension of the image feature vector;
the initial text feature vector is:
$T_S = W_S S + b_S$,
where $T_S$ denotes the initial text feature vector, $S \in \mathbb{R}^{D_1 \times l}$ denotes the output feature of the text encoder, $D_1$ denotes the dimension of each text feature, $l$ denotes the number of words in the text, $W_S$ denotes the weight matrix, and $b_S$ denotes the third bias constant.
In one embodiment of the present invention, step 2 comprises:
step 2.1, extracting from the initial image feature vector the first image feature vector $v_i$ of the $i$-th node and the second image feature vector $v_j$ of the $j$-th node;
step 2.2, performing a dot-product operation on the first image feature vector $v_i$ and the second image feature vector $v_j$ to obtain the first relation $E_U$;
step 2.3, constructing the image feature graph from the initial image feature vector and the first relation $E_U$;
step 2.4, extracting from the initial text feature vector the first text feature vector $t_{i1}$ of the $i1$-th node and the second text feature vector $t_{j1}$ of the $j1$-th node;
step 2.5, performing a dot-product operation on the first text feature vector $t_{i1}$ and the second text feature vector $t_{j1}$ to obtain the second relation $E_S$;
step 2.6, constructing the text feature graph from the initial text feature vector and the second relation $E_S$.
In one embodiment of the invention, the first relation $E_U$ is:
$E_U = \{ e_{ij} \}, \quad e_{ij} = v_i \cdot v_j$,
where $\cdot$ denotes the dot-product operation;
the image feature graph is:
$G_V = (V_U, E_U)$,
where $G_V$ denotes the image feature graph, whose nodes are the features in the initial image feature vector and whose edges are given by the first relation $E_U$;
the second relation $E_S$ is:
$E_S = \{ e_{i1j1} \}, \quad e_{i1j1} = t_{i1} \cdot t_{j1}$;
the text feature graph is:
$G_T = (T_S, E_S)$,
where $G_T$ denotes the text feature graph, whose nodes are the features in the initial text feature vector and whose edges are given by the second relation $E_S$.
In one embodiment of the present invention, step 3 comprises:
step 3.1, inputting the image feature graph into the graph attention network module, and propagating the initial image feature vector through a multi-head graph attention algorithm to obtain an updated image feature vector;
step 3.2, inputting the text feature graph into the graph attention network module, and propagating the initial text feature vector through a multi-head graph attention algorithm to obtain an updated text feature vector;
step 3.3, inputting the updated image feature vector into the generalized pooling module to obtain the final image feature vector;
and step 3.4, inputting the updated text feature vector into the generalized pooling module to obtain the final text feature vector.
In one embodiment of the invention, step 3.1 comprises:
step 3.11, inputting the initial image feature vector simultaneously into each parallel layer of the graph attention network module, and obtaining a first feature quantization result for the nodes of the $h$-th layer by calculating the dot product of the weight matrices and the input features;
step 3.12, regularizing the first feature quantization result to obtain a first multi-head attention weight matrix;
step 3.13, multiplying the first multi-head attention weight matrix and a learnable weight matrix with the initial image feature vector to obtain the first output feature of each layer;
step 3.14, splicing all first output features of the same image to obtain the spliced image feature;
step 3.15, passing the spliced image feature through a regularization network to obtain the updated image feature vector;
step 3.2 comprises:
step 3.21, inputting the initial text feature vector simultaneously into each parallel layer of the graph attention network module, and obtaining a second feature quantization result for the nodes of the $h$-th layer by calculating the dot product of the weight matrices and the input features;
step 3.22, regularizing the second feature quantization result to obtain a second multi-head attention weight matrix;
step 3.23, multiplying the second multi-head attention weight matrix and a learnable weight matrix with the initial text feature vector to obtain the second output feature of each layer;
step 3.24, splicing all second output features of the same text to obtain the spliced text feature;
and step 3.25, passing the spliced text feature through a regularization network to obtain the updated text feature vector.
In one embodiment of the invention, step 3.3 comprises:
step 3.31, vectorizing the position indices of the updated image feature vector through a trigonometric position-coding strategy to obtain a first position code;
step 3.32, after converting the first position code into a vector representation, generating the first pooling coefficients with a sequence model based on a bidirectional gated recurrent unit;
step 3.33, based on the first pooling coefficients, obtaining the final image feature vector from the updated image feature vector:
$\hat{v}_i = \sum_{k=1}^{N} \theta_k \bar{v}_{i,k}$,
where $\hat{v}_i$ denotes the $i$-th feature in the final image feature vector, $\theta_k$ denotes the first pooling coefficient, $\bar{v}_{i,k}$ denotes the sorted $k$-th value of the $i$-th feature in the updated image feature vector, and the value of $N$ equals $D_U$;
step 3.4 comprises:
step 3.41, vectorizing the position indices of the updated text feature vector through a trigonometric position-coding strategy to obtain a second position code;
step 3.42, after converting the second position code into a vector representation, generating the second pooling coefficients with a sequence model based on a bidirectional gated recurrent unit;
step 3.43, based on the second pooling coefficients, obtaining the final text feature vector from the updated text feature vector:
$\hat{t}_{i1} = \sum_{k1=1}^{N1} \theta_{k1} \bar{t}_{i1,k1}$,
where $\hat{t}_{i1}$ denotes the $i1$-th feature in the final text feature vector, $\theta_{k1}$ denotes the second pooling coefficient, $\bar{t}_{i1,k1}$ denotes the sorted $k1$-th value of the $i1$-th feature in the updated text feature vector, and the value of $N1$ equals $D_S$.
In one embodiment of the present invention, step 4 comprises:
step 4.1, performing a cosine-similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity;
step 4.2, performing a cosine-similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity;
step 4.3, performing a cosine-similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity;
step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first, second and third similarities;
and step 4.5, calculating a loss function using the comprehensive similarity, and back-propagating the loss to update the network parameters, which are located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module, respectively.
In one embodiment of the present invention, the first similarity is:
$S_1(V_L, T_S) = \dfrac{V_L \cdot T_S}{\|V_L\|\,\|T_S\|}$,
where $S_1(V_L, T_S)$ denotes the first similarity, $V_L$ denotes the local feature vector, $T_S$ denotes the initial text feature vector, and $\|\cdot\|$ denotes the norm of a feature vector;
the second similarity is:
$S_2(V_U, T_S) = \dfrac{V_U \cdot T_S}{\|V_U\|\,\|T_S\|}$,
where $S_2(V_U, T_S)$ denotes the second similarity and $V_U$ denotes the initial image feature vector; the third similarity is:
$S_3(\hat{v}, \hat{t}) = \dfrac{\hat{v} \cdot \hat{t}}{\|\hat{v}\|\,\|\hat{t}\|}$,
where $S_3(\hat{v}, \hat{t})$ denotes the third similarity, $\hat{v}$ denotes the final image feature vector, and $\hat{t}$ denotes the final text feature vector;
the comprehensive similarity is:
$S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(\hat{v}, \hat{t})$,
where $S(I, T)$ denotes the comprehensive similarity, $I$ denotes the input image to be matched, and $T$ denotes the input text to be matched;
the loss function is calculated as follows:
$L = [d + S(I', T) - S(I, T)]_+ + [d + S(I, T') - S(I, T)]_+$,
where $L$ denotes the loss function, $d$ denotes the margin parameter, $[x]_+ \equiv \max(x, 0)$, $\equiv$ denotes identity, and $I'$ and $T'$ denote the hardest mismatched image and text with respect to the matching image-text pair, satisfying $I' = \arg\max_{X \neq I} S(X, T)$ and $T' = \arg\max_{Y \neq T} S(I, Y)$, where $X$ denotes image information that does not match the given text information and $Y$ denotes text information that does not match the given image information.
Compared with the prior art, the invention has the following beneficial effects:
1. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism introduces a hierarchical comprehensive similarity calculation into intra-modal and inter-modal semantic alignment, computing the similarity between image feature vectors and text feature vectors extracted under different conditions. This enriches the learning of intra-modal and inter-modal interaction information, alleviates the alignment difficulty of retrieval tasks, and further improves retrieval accuracy.
2. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism replaces traditional max pooling, average pooling and similar schemes with generalized pooling, integrating the pooling operation into the graph attention mechanism instead of simply extracting the maximum value of the feature vector.
Other aspects and features of the present invention will become apparent from the following detailed description, which proceeds with reference to the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims. It should be further understood that the drawings are not necessarily drawn to scale and that, unless otherwise indicated, they are merely intended to conceptually illustrate the structures and procedures described herein.
Drawings
Fig. 1 is a schematic flowchart of an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a feature vector graph according to an embodiment of the present invention;
fig. 3 is a schematic diagram of the combined graph attention mechanism and generalized pooling module according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism provided in an embodiment of the present invention. The method comprises steps 1 to 5, wherein:
step 1, please refer to fig. 2, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by cascading a global feature vector and a local feature vector.
Specifically, the preset image is an image to be matched with texts, and the preset text is a text to be matched with images. For example, if there is one preset image and five candidate texts, the preset image is retrieved against the five candidate texts to find the text with the highest similarity; this text, which describes the content of the image, serves as the matching result.
In a specific embodiment, step 1 may specifically include:
step 1.1, extracting the global feature vector $V_G$ and the local feature vector $V_L$ of the preset image.
In the present embodiment, the global feature vector $V_G$ is:
$V_G = W_g G + b_g$,
where $V_G$ denotes the global feature vector of the preset image, $W_g \in \mathbb{R}^{D \times D_0}$ denotes the first weight matrix, $D$ denotes the dimension of the global feature vector of the output image, $D_0$ denotes the size of each pixel feature, $G \in \mathbb{R}^{D_0 \times m}$ denotes the first output feature, $m$ denotes the size of the reconstructed feature map, and $b_g$ denotes the first bias constant.
In the present embodiment, the local feature vector $V_L$ is:
$V_L = W_l L + b_l$,
where $V_L$ denotes the local feature vector of the preset image, $W_l \in \mathbb{R}^{D \times D_k}$ denotes the second weight matrix, $D_k$ denotes the dimension of each region feature, $L \in \mathbb{R}^{D_k \times k}$ denotes the second output feature, $k$ denotes the number of regions detected in the preset image, and $b_l$ denotes the second bias constant.
Step 1.2, concatenating the global feature vector $V_G$ and the local feature vector $V_L$ to obtain the initial image feature vector:
$V_U = V_G \,\|\, V_L$,
where $V_U$ denotes the initial image feature vector, $\|$ denotes the concatenation operation, and $D_U$ denotes the dimension of the image feature vector.
Step 1.3, extracting the initial text feature vector $T_S$ of the preset text:
$T_S = W_S S + b_S$,
where $T_S$ denotes the initial text feature vector, $S \in \mathbb{R}^{D_1 \times l}$ denotes the output feature of the text encoder, $D_1$ denotes the dimension of each text feature, $l$ denotes the number of words in the text, $W_S$ denotes the weight matrix, and $b_S$ denotes the third bias constant.
Optionally, the global feature vector extraction may use a ResNet152 encoder pre-trained on the ImageNet dataset to accurately extract pixel-level features of the image. For local feature vector extraction, a Faster R-CNN module pre-trained on the Visual Genome dataset may be used as the encoder. The image feature vector dimension is 2048, shared by the global and local features. The text feature extraction part uses a pre-trained BERT model with 12 layers, 12 heads, 768 hidden units and 110M parameters, and the dimension of the resulting text feature vector is 768.
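As an illustration only, a minimal PyTorch sketch of this step-1 pipeline follows; it assumes the region features and token features arrive precomputed from the encoders named above, and the common embedding dimension (here 1024) is a placeholder rather than a value fixed by the invention.

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Sketch of step 1: project global/local image features and text
    features, then concatenate the image branches (V_U = V_G || V_L)."""
    def __init__(self, d_img=2048, d_txt=768, d_common=1024):
        super().__init__()
        self.proj_global = nn.Linear(d_img, d_common)  # V_G = W_g G + b_g
        self.proj_local = nn.Linear(d_img, d_common)   # V_L = W_l L + b_l
        self.proj_text = nn.Linear(d_txt, d_common)    # T_S = W_S S + b_S

    def forward(self, global_feat, region_feats, token_feats):
        # global_feat: (1, 2048) pixel-level feature from ResNet152
        # region_feats: (k, 2048) region features from Faster R-CNN
        # token_feats: (l, 768) word features from BERT
        v_g = self.proj_global(global_feat)        # (1, d_common)
        v_l = self.proj_local(region_feats)        # (k, d_common)
        v_u = torch.cat([v_g, v_l], dim=0)         # (1 + k, d_common)
        t_s = self.proj_text(token_feats)          # (l, d_common)
        return v_u, t_s
```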
Step 2, correspondingly obtaining an image feature graph and a text feature graph according to the connection relations between different nodes of the initial image feature vector and the initial text feature vector.
In a specific embodiment, step 2 may specifically include:
step 2.1, extracting from the initial image feature vector the first image feature vector $v_i$ of the $i$-th node and the second image feature vector $v_j$ of the $j$-th node.
Step 2.2, performing a dot-product operation on the first image feature vector $v_i$ and the second image feature vector $v_j$ to obtain the first relation $E_U$:
$E_U = \{ e_{ij} \}, \quad e_{ij} = v_i \cdot v_j$,
where $\cdot$ denotes the dot-product operation.
Step 2.3, constructing the image feature graph from the initial image feature vector and the first relation $E_U$:
$G_V = (V_U, E_U)$,
where $G_V$ denotes the image feature graph, whose nodes are the features in the initial image feature vector and whose edges are given by the first relation $E_U$.
Step 2.4, extracting from the initial text feature vector the first text feature vector $t_{i1}$ of the $i1$-th node and the second text feature vector $t_{j1}$ of the $j1$-th node.
Step 2.5, performing a dot-product operation on the first text feature vector $t_{i1}$ and the second text feature vector $t_{j1}$ to obtain the second relation $E_S$:
$E_S = \{ e_{i1j1} \}, \quad e_{i1j1} = t_{i1} \cdot t_{j1}$.
Step 2.6, constructing the text feature graph from the initial text feature vector and the second relation $E_S$:
$G_T = (T_S, E_S)$,
where $G_T$ denotes the text feature graph, whose nodes are the features in the initial text feature vector and whose edges are given by the second relation $E_S$.
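A short sketch of this graph construction, under the assumption that every pair of nodes is connected and the dot product serves directly as the edge weight; the same routine applies to both the image and the text features.

```python
import torch

def build_feature_graph(feats: torch.Tensor):
    """Steps 2.1-2.6 in one routine: each row of `feats` is a node, and
    the edge weight between nodes i and j is the dot product f_i . f_j."""
    edges = feats @ feats.t()      # (n, n) matrix of pairwise dot products
    return feats, edges            # graph G = (nodes, edges)

# usage sketch: G_V = (V_U, E_U) and G_T = (T_S, E_S)
# v_nodes, e_u = build_feature_graph(v_u)
# t_nodes, e_s = build_feature_graph(t_s)
```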
Step 3, respectively inputting the image feature graph and the text feature graph into the combined graph attention and generalized pooling module to obtain the final image feature vector and the final text feature vector.
Specifically, the image feature graph and the text feature graph are input into the combined graph attention and generalized pooling module simultaneously, and the image and text feature vectors are iteratively updated. Referring to fig. 3, fig. 3 is a schematic diagram of the combined graph attention mechanism and generalized pooling module according to an embodiment of the present invention. As shown, in this embodiment the image feature vectors and text feature vectors are updated and aggregated by the constructed graph attention mechanism and a generalized pooling operation. Stacking multiple combined graph attention and generalized pooling modules enables better updating of the vectors.
In a particular embodiment, step 3 may particularly comprise steps 3.1-3.4, wherein:
and 3.1, inputting the image feature map into a map attention network module, and spreading the initial image feature vector through a multi-head map attention machine mechanism algorithm to obtain an updated image feature vector.
Specifically, step 3.1 may specifically include steps 3.11-3.15, wherein:
and 3.11, simultaneously inputting the initial image feature vectors into each parallel layer in the graph attention network module, and obtaining a first feature quantization result of the h-th layer node by calculating the vector dot product of the weight matrix and the input features.
Specifically, the initial image feature vector V of the image is obtained in step 1 U Is shown asAnd (3) introducing a multi-head self-attention mechanism to calculate the attention coefficient of each node, and setting the number of parallel layers of the multi-head attention mechanism as H, wherein H is more than or equal to 1 and less than or equal to H. Inputting the image feature vectors into each parallel layer at the same time, and calculating the weight matrixInputting the vector dot product of the features to obtain a first preliminary quantization result of the node features, wherein the calculation mode of the first feature quantization result of the h-th layer node is as follows:
wherein,represents the first preliminary quantization result, i.e., the importance of node i to node j in the h-th layer, D U Dimension, W, representing image features q And W k Each represents a learnable weight matrix.
And 3.12, regularizing the first characteristic quantization result to obtain a first multi-head attention weight matrix.
Specifically, the first feature quantization result is regularized to facilitate comparison of parameters between nodes, so as to obtain a first multi-start attention weight matrix α ij The specific calculation method is as follows:
wherein softmax represents a normalization function, Ν i A set of neighbor nodes representing node i.
Step 3.13, multiplying the first multi-head attention weight matrix, the learnable weight matrix and the initial image feature vector to obtain a first output feature of each layer, wherein the first output feature is as follows:
wherein the head h Representing a first output characteristic, W v h Representing a weight matrix that the h-th layer can learn.
Step 3.14, all the first output features of the same image are spliced (vector end-to-end connection), so as to obtain spliced image features, wherein the spliced image features are as follows:
wherein,representing features of the stitched image, W o Representing a learnable weight matrix and concat represents a splicing function.
And 3.15, obtaining an updated image feature vector by the spliced image features through a regularization network.
Specifically, the spliced image features are subjected to final output representation through a regularization network, namely, the updated image feature vectors are refined through an image attention machine mechanismIth feature of updated image feature vectorThe specific calculation method is as follows:
in which, ReLU is selected as the activation function, and BN layer is used to keep the input of each layer of neural network in the same distribution.
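The following sketch condenses steps 3.11-3.15 into one PyTorch module. It assumes a fully connected feature graph (every node attends to every node) and scales by the per-head dimension, where the formula above scales by the full feature dimension $D_U$; both are common conventions, so this is a sketch rather than the invention's exact layer.

```python
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    """Sketch of steps 3.11-3.15: multi-head dot-product graph attention,
    head concatenation, and a ReLU + BN regularization step."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.d_head = n_heads, dim // n_heads
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_v
        self.w_o = nn.Linear(dim, dim, bias=False)   # W_o
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):                            # x: (n, dim) node features
        n = x.size(0)
        q = self.w_q(x).view(n, self.n_heads, self.d_head)
        k = self.w_k(x).view(n, self.n_heads, self.d_head)
        v = self.w_v(x).view(n, self.n_heads, self.d_head)
        # step 3.11: importance of node j to node i in each head h
        e = torch.einsum('ihd,jhd->hij', q, k) / self.d_head ** 0.5
        alpha = torch.softmax(e, dim=-1)                 # step 3.12: normalize
        heads = torch.einsum('hij,jhd->ihd', alpha, v)   # step 3.13: aggregate
        out = self.w_o(heads.reshape(n, -1))             # step 3.14: concat heads
        return self.bn(torch.relu(out))                  # step 3.15: regularize
```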
Step 3.2, inputting the text feature graph into the graph attention network module, and propagating the initial text feature vector through a multi-head graph attention algorithm to obtain the updated text feature vector.
Specifically, step 3.2 may specifically include steps 3.21-3.25, wherein:
and 3.21, simultaneously inputting the initial text feature vectors into each parallel layer in the graph attention network module, and obtaining a second feature quantization result of the h-th layer node by calculating the vector dot product of the weight matrix and the input features.
Specifically, the initial text feature vector T of the image is obtained in step 1 S Is shown asAnd simultaneously inputting the text feature vectors into each parallel layer, and calculating the vector dot product of the weight matrix and the input features to obtain a second preliminary quantization result of the node features. The second feature quantization result of the h-th layer node is calculated as follows:
wherein,represents the second preliminary quantization result, i.e., the importance of node i1 to node j1 in the h-th level, D S Dimension, W, representing a feature of the text q And W k Each represents a learnable weight matrix.
And 3.22, regularizing the second characteristic quantization result to obtain a second multi-head attention weight matrix.
Specifically, the second feature quantization result is regularized to facilitate comparison of parameters between nodes, so as to obtain a second multi-headed attention weight matrix α i1j1 The specific calculation method is as follows:
wherein N is i1 Representing a set of neighbor nodes for node i 1.
Step 3.23, multiplying the second multi-head attention weight matrix, the learnable weight matrix and the initial text feature vector to obtain a second output feature of each layer, wherein the second output feature is as follows:
wherein, head 1 h Representing a second output characteristic, W s h Representing a weight matrix that the h-th layer can learn.
Step 3.24, all second output features of the same text are spliced (vectors are connected end to end) to obtain spliced text features, wherein the spliced text features are as follows:
And 3.25, obtaining an updated text feature vector by the spliced text features through a regularization network.
Specifically, the spliced text features are subjected to final output representation through a regularization network, namely, the updated text feature vectors are refined through an attention machine mechanism. Ith 1 feature of updated text feature vectorThe specific calculation method of (2) is as follows:
and 3.3, inputting the updated image feature vector into a generalized pooling module to obtain a final image feature vector.
Specifically, step 3.3 may specifically include steps 3.31-3.33, wherein:
and 3.31, vectorizing the position index by the updated image feature vector through a triangular position coding strategy to obtain a first position code.
Specifically, the generalized pooling module consists of a triangular position coding strategy and a sequence model based on a bidirectional gating cyclic unit. Firstly, vectorizing a position index by an updated image feature vector through a triangular position coding strategy, wherein the specific calculation mode is as follows:
wherein p is k Representing a first position code, d p Representing a first given vector dimension, j a Code representing the first position, j a Is given as d p 1/2, d of p Is equal to D U 。
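A small helper implementing this trigonometric coding; the even/odd sine-cosine interleaving follows the standard Transformer convention, which is how the formula above is read here, and it assumes $d_p$ is even.

```python
import math
import torch

def trig_position_codes(n: int, d_p: int) -> torch.Tensor:
    """Step 3.31: p[k, 2j] = sin(k / 10000^(2j/d_p)) and
    p[k, 2j+1] = cos(k / 10000^(2j/d_p)) for positions k = 0..n-1."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)        # (n, 1)
    freq = torch.exp(-math.log(10000.0)
                     * torch.arange(0, d_p, 2, dtype=torch.float32) / d_p)
    p = torch.zeros(n, d_p)
    p[:, 0::2] = torch.sin(pos * freq)
    p[:, 1::2] = torch.cos(pos * freq)
    return p
```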
Step 3.32, after converting the first position code into a vector representation, generating the first pooling coefficients with a sequence model based on a bidirectional gated recurrent unit:
$\{\theta_k\} = \operatorname{MLP}(\operatorname{BiGRU}(p_1, \dots, p_N))$,
where $\{\theta_k\}$ denotes the set of first pooling coefficients, MLP denotes a multi-layer neural network unit, and BiGRU denotes the bidirectional gated recurrent unit.
Step 3.33, based on the first pooling coefficients, obtaining the final image feature vector from the updated image feature vector.
Specifically, when the image passes through the generalized pooling module, the module sorts the vectors, learns a pooling coefficient for each vector, and takes their weighted sum, finally outputting the node feature vector $\hat{v}$ of the image:
$\hat{v}_i = \sum_{k=1}^{N} \theta_k \bar{v}_{i,k}, \quad \theta_k = f(k, N), \quad k = 1, 2, \dots, N$,
where $\hat{v}$ denotes the final image feature vector, $f$ corresponds to the process of generating the pooling coefficients, $\theta_k$ denotes the first pooling coefficient, i.e., the pooling coefficient of the sorted $k$-th vector, $\bar{v}_{i,k}$ denotes the sorted $k$-th value of the $i$-th feature in the updated image feature vector, and the value of $N$ equals $D_U$.
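Putting steps 3.31-3.33 together, a sketch of the generalized pooling module follows; it reuses `trig_position_codes` from the sketch above. The BiGRU hidden size and the softmax normalization of the coefficients are assumptions, while the sort-then-weight structure mirrors the description in the text.

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Sketch of steps 3.31-3.33: position-code the feature ranks, run a
    BiGRU + MLP to produce pooling coefficients theta_k, and return the
    coefficient-weighted sum of the per-dimension sorted features."""
    def __init__(self, d_p, hidden=256):
        super().__init__()
        self.d_p = d_p
        self.bigru = nn.GRU(d_p, hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats):                              # feats: (n, dim)
        n = feats.size(0)
        p = trig_position_codes(n, self.d_p).unsqueeze(0)  # (1, n, d_p)
        h, _ = self.bigru(p)                               # (1, n, 2 * hidden)
        theta = torch.softmax(self.mlp(h).squeeze(-1), dim=-1).squeeze(0)  # (n,)
        # sort every feature dimension in descending order, then weight by theta
        sorted_feats, _ = feats.sort(dim=0, descending=True)     # (n, dim)
        return (theta.unsqueeze(1) * sorted_feats).sum(dim=0)    # (dim,)
```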
Step 3.4, inputting the updated text feature vector into the generalized pooling module to obtain the final text feature vector.
Specifically, step 3.4 may specifically include steps 3.41 to 3.43, where:
and 3.41, vectorizing the position index by the updated text feature vector through a triangular position coding strategy to obtain a second position code.
Specifically, vectorization is performed on the position index by the updated text feature vector through a triangle position coding strategy, and the specific calculation mode is as follows:
wherein p is k1 Representing a second position code, d q Representing a second given vector dimension, j b Code representing the second position, j b Is given as d q 1/2, d of q Is equal to D S 。
Step 3.42, after the second position code is converted into vector representation, generating a second pooling coefficient by adopting a sequence model based on a bidirectional gating circulation unit, wherein the specific calculation mode is as follows:
And 3.43, based on the second pooling coefficient, obtaining a final text feature vector according to the updated text feature vector.
Specifically, when the image passes through the generalized pooling module, the generalized pooling module performs sorting operation on the vectors, learns the pooling coefficient of each vector, performs weighted sum on the vectors, and finally outputs the node feature vector of the imageThe specific calculation method is as follows:
θ k1 =f(k1,N1),k1=1,2,…,N1
Wherein,the feature vector of the final text is represented,represents the i1 th feature, θ, in the final text feature vector k1 The second pooling coefficient is represented as a function of,representing the i1 th feature in the updated text feature vector, wherein the value of N1 is equal to D S 。
Step 4, obtaining the comprehensive similarity between the preset image and the preset text based on the first similarity between the global feature vector and the initial text feature vector, the second similarity between the local feature vector and the initial text feature vector, and the third similarity between the final image feature vector and the final text feature vector; calculating a loss function using the comprehensive similarity; and back-propagating the loss to update the network parameters.
In one embodiment, step 4 comprises:
step 4.1, performing a cosine-similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity:
$S_1(V_L, T_S) = \dfrac{V_L \cdot T_S}{\|V_L\|\,\|T_S\|}$,
where $S_1(V_L, T_S)$ denotes the first similarity, $V_L$ denotes the local feature vector, $T_S$ denotes the initial text feature vector, and $\|\cdot\|$ denotes the norm of a feature vector.
Step 4.2, performing a cosine-similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity:
$S_2(V_U, T_S) = \dfrac{V_U \cdot T_S}{\|V_U\|\,\|T_S\|}$,
where $S_2(V_U, T_S)$ denotes the second similarity and $V_U$ denotes the initial image feature vector.
Step 4.3, performing a cosine-similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity:
$S_3(\hat{v}, \hat{t}) = \dfrac{\hat{v} \cdot \hat{t}}{\|\hat{v}\|\,\|\hat{t}\|}$,
where $S_3(\hat{v}, \hat{t})$ denotes the third similarity, $\hat{v}$ denotes the final image feature vector, and $\hat{t}$ denotes the final text feature vector.
Step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first, second and third similarities:
$S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(\hat{v}, \hat{t})$,
where $S(I, T)$ denotes the comprehensive similarity, $I$ denotes the input image to be matched, and $T$ denotes the input text to be matched.
and 4.5, calculating a loss function by utilizing the comprehensive similarity, and reversely propagating the loss function to update network parameters, wherein the network parameters are respectively positioned in the image characteristic vector extraction part, the text characteristic vector extraction part and the drawing attention and generalized pooling combined module.
Specifically, a loss function training model is introduced, so that the matched image-text pair has a higher similarity score than the unmatched image-text pair, and the specific calculation method is as follows:
L=[d+S(I′,T)-S(I,T)] + +[d+S(I,T′)-S(I,T)] +
wherein L represents a loss function, d represents a deficit parameter, [ x ]] + ≡ max (x,0), ≡ denotes identity, I ' and T ' denote the opposite cases of the most mismatch with respect to matching image and text pairs, respectively, and both satisfy I ' ═ arg max X≠I S (X, T) and T ═ argmax Y≠T S (I, Y), X denotes image information that does not match the given text information, and Y denotes text information that does not match the given image information. The above-described loss function calculation is introduced such that the overall similarity score between matching image and text pairs is higher compared to unmatched image and text pairs.
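A compact sketch of the hierarchical similarity and the loss above, written for a batch where matching image-text pairs lie on the diagonal of the similarity matrix; the margin value 0.2 is an assumption, not a value given by the invention.

```python
import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two flattened feature vectors."""
    a, b = a.flatten(), b.flatten()
    return (a @ b) / (a.norm() * b.norm() + 1e-8)

def comprehensive_sim(s1, s2, s3):
    """Step 4.4: S(I, T) = S1 + S2 + S3."""
    return s1 + s2 + s3

def hardest_negative_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Step 4.5: L = [d + S(I',T) - S(I,T)]+ + [d + S(I,T') - S(I,T)]+,
    where sim[i, j] is the comprehensive similarity of image i and text j."""
    n = sim.size(0)
    pos = sim.diag()                                # S(I, T) for matched pairs
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))      # exclude matched pairs
    hardest_img = neg.max(dim=0).values             # I' = argmax_{X != I} S(X, T)
    hardest_txt = neg.max(dim=1).values             # T' = argmax_{Y != T} S(I, Y)
    loss = torch.relu(margin + hardest_img - pos) + \
           torch.relu(margin + hardest_txt - pos)
    return loss.mean()
```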
Step 5, obtaining the retrieval matching result using the final comprehensive similarity output by the model after the network parameters are updated, the model comprising the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
Specifically, in the image-to-text retrieval task, for a preset image to be matched, the candidate preset texts are ranked according to the comprehensive similarity obtained in step 4, yielding the text retrieval result for the preset image: the preset text with the highest comprehensive similarity score is taken as the final matching result. Similarly, in the text-to-image retrieval task, for a preset text to be matched, the candidate preset images are ranked according to the final comprehensive similarity output by the model after the network parameters are updated, yielding the image retrieval result for the preset text: the preset image with the highest comprehensive similarity score is taken as the final matching result.
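As a usage illustration, retrieval then reduces to ranking candidates by the comprehensive similarity; `score` below stands for whatever routine produces $S(I, T)$ for a query-candidate pair (a hypothetical helper, not an API defined by the invention).

```python
import torch

def rank_candidates(query, candidates, score, topk=5):
    """Step 5: rank candidates (texts for an image query, or images for a
    text query) by comprehensive similarity and return the top-k indices."""
    scores = torch.stack([score(query, c) for c in candidates])
    return scores.topk(min(topk, len(candidates))).indices
```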
Compared with retrieval models in the prior art, the image-text retrieval method of this embodiment, based on hierarchical alignment and a generalized pooling graph attention mechanism, uses a hierarchical comprehensive similarity calculation over image and text feature vectors extracted under different conditions to perform semantic alignment within and between modalities. This enriches the learning of intra-modal and inter-modal interaction information, alleviates the alignment difficulty of retrieval tasks, and further improves retrieval accuracy. The method also strengthens the local object semantic relations and the global context information of images and texts, yielding more complete image and text feature vectors that better represent the image-text matching relationship, thereby improving retrieval accuracy.
Example two
In this embodiment, simulation experiments are performed on the image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism of the first embodiment, and the effect of the invention is further explained by comparison with existing image-text retrieval methods.
1. Simulation experiment conditions:
Operating system: Ubuntu 16.04, Python 3.6
Experiment platform: PyTorch 1.7.1
Processor: Intel Xeon Gold 6226R CPU, 64GB RAM, 1TB SSD
Graphics card: NVIDIA Tesla A100 GPU
Memory: 64GB
2. Simulation experiment contents:
Simulation experiment I: accuracy on the image-to-text retrieval task and the text-to-image retrieval task
The following experiments were all performed in the same experimental environment. Dataset 1 and dataset 2 are both classic image-text retrieval datasets, and the proposed image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, as well as the baseline methods, are image-text retrieval algorithms with various semantic alignment schemes.
Table 1: recall comparison of the baseline methods and the proposed method on dataset 1
Table 2: recall comparison of the baseline methods and the proposed method on dataset 2
From tables 1 and 2 it can be seen that, on different datasets, the proposed method performs well on both the image-to-text and text-to-image retrieval tasks, exceeding the baseline methods especially on the R@1 and Rsum metrics. On dataset 1, the results reach 81.1, 67.4 and 533.2 respectively, improvements of 2.3%, 0.8% and 3.8% over the existing retrieval method (baseline method 1 in the table); on dataset 2, the results on the R@1 metric improve by 2.3% and 2% respectively over the existing retrieval method (baseline method 2 in the table). These results show that introducing the generalized pooling method into the image-text retrieval task, and guiding its update with feature-vector similarity, can greatly improve retrieval accuracy.
Simulation experiment II: visualization experiment comparing the importance of the combined graph attention and generalized pooling module in the model
The following experiments were all performed in the same experimental environment. The module-removed alternative does not use the combined graph attention and generalized pooling module; instead, it forms a comparison model from a conventional graph attention mechanism with max pooling.
Table 3: visualization experiment comparing the importance of the combined graph attention and generalized pooling module
As can be seen from table 3, for the same image, the first five text descriptions retrieved by the proposed method are all correct, while the third of the first five results of the alternative method is wrong. These results again show that introducing the generalized pooling method into the image-text retrieval task, and guiding its update with feature-vector similarity, can greatly improve retrieval accuracy.
In the description of the invention, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples", etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art can combine different embodiments or examples described in this specification.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (10)
1. An image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, characterized by comprising the following steps:
step 1, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by concatenating a global feature vector and a local feature vector;
step 2, correspondingly obtaining an image feature graph and a text feature graph according to the connection relations between different nodes of the initial image feature vector and the initial text feature vector;
step 3, respectively inputting the image feature graph and the text feature graph into a combined graph attention and generalized pooling module to obtain a final image feature vector and a final text feature vector;
step 4, obtaining a comprehensive similarity between the preset image and the preset text based on a first similarity between the global feature vector and the initial text feature vector, a second similarity between the local feature vector and the initial text feature vector, and a third similarity between the final image feature vector and the final text feature vector; calculating a loss function using the comprehensive similarity; and back-propagating the loss to update the network parameters, wherein the network parameters are located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module, respectively;
and step 5, obtaining a retrieval matching result using the final comprehensive similarity output by the model after the network parameters are updated.
2. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 1 comprises:
step 1.1, extracting the global feature vector $V_G$ and the local feature vector $V_L$ of the preset image;
step 1.2, concatenating the global feature vector $V_G$ and the local feature vector $V_L$ to obtain the initial image feature vector;
step 1.3, extracting the initial text feature vector $T_S$ of the preset text.
3. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 2, wherein the global feature vector $V_G$ is:

$$V_G = W_g G + b_g,$$

wherein $V_G$ represents the global feature vector of the preset image, $W_g \in \mathbb{R}^{D \times D_0}$ represents the first weight matrix, $D$ represents the dimension of the output image feature vector, $D_0$ represents the size of each pixel, $G \in \mathbb{R}^{D_0 \times m^2}$ represents the first output feature, $m$ represents the size of the reconstructed feature map, and $b_g$ represents a first bias constant;

the local feature vector $V_L$ is:

$$V_L = W_l L + b_l,$$

wherein $V_L$ represents the local feature vector of the preset image, $W_l \in \mathbb{R}^{D \times D_k}$ represents the second weight matrix, $D_k$ represents the dimension of each region feature, $L \in \mathbb{R}^{D_k \times k}$ represents the second output feature, $k$ represents the number of regions detected in the preset image, and $b_l$ represents a second bias constant;

the initial image feature vector is:

$$V_U = V_G \,\|\, V_L,$$

wherein $V_U$ represents the initial image feature vector, $\|$ represents the cascading (concatenation) operation, and $D_U$ represents the dimension of the initial image feature vector;

the initial text feature vector is:

$$T_S = W_S S + b_S,$$

wherein $T_S$ represents the initial text feature vector, $S \in \mathbb{R}^{D_1 \times l}$ represents the output feature, $D_1$ represents the dimension of the text features, $l$ represents the number of words in the text, $W_S$ represents a weight matrix, and $b_S$ represents a third bias constant.
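As a concrete reading of the three projections in claim 3, here is a minimal PyTorch sketch; the dimensions (`D`, `D0`, `Dk`, `D1`), the grid size, the region count, and the word count are placeholder values chosen for illustration, not values from the patent:

```python
import torch
import torch.nn as nn

D, D0, Dk, D1 = 1024, 2048, 2048, 300   # assumed: output dim, pixel dim, region dim, word dim

proj_global = nn.Linear(D0, D)   # V_G = W_g G + b_g
proj_local  = nn.Linear(Dk, D)   # V_L = W_l L + b_l
proj_text   = nn.Linear(D1, D)   # T_S = W_S S + b_S

G = torch.randn(49, D0)          # m*m grid features from a CNN backbone (m = 7 assumed)
L = torch.randn(36, Dk)          # k = 36 detected regions (e.g. from an object detector)
S = torch.randn(12, D1)          # l = 12 word embeddings

V_G, V_L, T_S = proj_global(G), proj_local(L), proj_text(S)
V_U = torch.cat([V_G, V_L], dim=0)   # cascade: initial image feature vector (m*m + k nodes)
```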
4. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 2 comprises:
step 2.1, extracting, from the initial image feature vector, the first image feature vector $v_i^U$ of the $i$-th node and the second image feature vector $v_j^U$ of the $j$-th node;
step 2.2, performing a dot-product operation on the first image feature vector $v_i^U$ and the second image feature vector $v_j^U$ to obtain a first relation $E_U$;
step 2.3, constructing the image feature graph from the initial image feature vector and the first relation $E_U$;
step 2.4, extracting, from the initial text feature vector, the first text feature vector $t_{i1}^S$ of the $i1$-th node and the second text feature vector $t_{j1}^S$ of the $j1$-th node;
step 2.5, performing a dot-product operation on the first text feature vector $t_{i1}^S$ and the second text feature vector $t_{j1}^S$ to obtain a second relation $E_S$;
step 2.6, constructing the text feature graph from the initial text feature vector and the second relation $E_S$.
5. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 4, wherein the first relation $E_U$ is:

$$E_U = \{e_{ij}^U\}, \qquad e_{ij}^U = v_i^U \odot v_j^U,$$

wherein $\odot$ denotes the dot-product operation;

the image feature graph is:

$$G_V = (V_U, E_U),$$

wherein $G_V$ represents the image feature graph, with the features of the initial image feature vector as nodes and the first relation $E_U$ as edges;

the second relation $E_S$ is:

$$E_S = \{e_{i1 j1}^S\}, \qquad e_{i1 j1}^S = t_{i1}^S \odot t_{j1}^S;$$

the text feature graph is:

$$G_T = (T_S, E_S),$$

wherein $G_T$ represents the text feature graph, with the features of the initial text feature vector as nodes and the second relation $E_S$ as edges.
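Claims 4-5 describe fully connected feature graphs whose edge weights are pairwise dot products. A minimal sketch, assuming node features are stored row-wise:

```python
import torch

def build_feature_graph(nodes: torch.Tensor):
    """nodes: (num_nodes, dim) feature matrix (V_U or T_S).
    Returns the node features and the relation matrix E whose
    (i, j) entry is the dot product of node i and node j."""
    E = nodes @ nodes.t()     # e_ij = x_i . x_j  (claims 4-5)
    return nodes, E           # graph G = (nodes, E), fully connected, weighted edges

V_U = torch.randn(85, 1024)   # assumed 49 grid nodes + 36 region nodes
G_V = build_feature_graph(V_U)
```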
6. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 3 comprises:
step 3.1, inputting the image feature graph into a graph attention network module and propagating the initial image feature vector through a multi-head graph attention algorithm to obtain an updated image feature vector;
step 3.2, inputting the text feature graph into a graph attention network module and propagating the initial text feature vector through a multi-head graph attention algorithm to obtain an updated text feature vector;
step 3.3, inputting the updated image feature vector into a generalized pooling module to obtain a final image feature vector;
and 3.4, inputting the updated text feature vector into a generalized pooling module to obtain a final text feature vector.
7. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 6, wherein step 3.1 comprises:
step 3.11, inputting the initial image feature vector simultaneously into each parallel layer of the graph attention network module, and obtaining a first feature quantization result for the nodes of the $h$-th layer by computing the vector dot product of the weight matrix and the input features;
step 3.12, regularizing the first feature quantization result to obtain a first multi-head attention weight matrix;
step 3.13, multiplying the first multi-head attention weight matrix, the learnable weight matrix, and the initial image feature vector to obtain the first output feature of each layer;
step 3.14, splicing (concatenating) all first output features of the same image to obtain spliced image features;
step 3.15, passing the spliced image features through a regularization network to obtain the updated image feature vector;
step 3.2 comprises:
step 3.21, inputting the initial text feature vector simultaneously into each parallel layer of the graph attention network module, and obtaining a second feature quantization result for the nodes of the $h$-th layer by computing the vector dot product of the weight matrix and the input features;
step 3.22, regularizing the second feature quantization result to obtain a second multi-head attention weight matrix;
step 3.23, multiplying the second multi-head attention weight matrix, the learnable weight matrix, and the initial text feature vector to obtain the second output feature of each layer;
step 3.24, splicing (concatenating) all second output features of the same text to obtain spliced text features;
and step 3.25, passing the spliced text features through a regularization network to obtain the updated text feature vector.
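Claim 7 reads like a standard multi-head graph-attention layer followed by concatenation and normalization. Below is a minimal sketch under that reading; the head count, the use of softmax for the "regularization" of the quantization results, and the LayerNorm-based "regularization network" are all assumptions, not details taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadGraphAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # one learnable weight matrix per parallel layer/head (steps 3.11 / 3.21)
        self.W = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(heads)])
        self.out = nn.Sequential(nn.Linear(heads * dim, dim), nn.LayerNorm(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_nodes, dim)
        outputs = []
        for W_h in self.W:
            q = W_h(x)
            scores = q @ x.t()                 # feature quantization: dot products per head
            attn = F.softmax(scores, dim=-1)   # regularize into an attention weight matrix
            outputs.append(attn @ W_h(x))      # attention weights x learnable matrix x features
        h = torch.cat(outputs, dim=-1)         # splice all per-head output features
        return self.out(h)                     # regularization network -> updated features
```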
8. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 6, wherein step 3.3 comprises:
step 3.31, vectorizing the position indices of the updated image feature vector with a trigonometric (sinusoidal) position-encoding strategy to obtain a first position code;
step 3.32, after converting the first position code into a vector representation, generating a first pooling coefficient using a sequence model based on a bidirectional gated recurrent unit;
step 3.33, obtaining, based on the first pooling coefficient, the final image feature vector from the updated image feature vector, the final image feature vector being:

$$\bar{v} = \sum_{k=1}^{N} \theta_k v_k,$$

wherein $\bar{v}$ represents the final image feature vector, $\theta_k$ represents the first pooling coefficient, $v_k$ represents the $k$-th feature of the updated image feature vector, and the value of $N$ is equal to $D_U$;

step 3.4 comprises:
step 3.41, vectorizing the position indices of the updated text feature vector with a trigonometric (sinusoidal) position-encoding strategy to obtain a second position code;
step 3.42, after converting the second position code into a vector representation, generating a second pooling coefficient using a sequence model based on a bidirectional gated recurrent unit;
step 3.43, obtaining, based on the second pooling coefficient, the final text feature vector from the updated text feature vector, the final text feature vector being:

$$\bar{t} = \sum_{k1=1}^{N1} \theta_{k1} t_{k1},$$

wherein $\bar{t}$ represents the final text feature vector, $\theta_{k1}$ represents the second pooling coefficient, $t_{k1}$ represents the $k1$-th feature of the updated text feature vector, and the value of $N1$ is equal to $D_S$.
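Claim 8's generalized pooling can be sketched as follows, assuming sinusoidal position codes, a small bidirectional GRU that emits one pooling coefficient per position, and a softmax over the coefficients (the normalization choice is an assumption):

```python
import math
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    def __init__(self, pe_dim: int = 32, hidden: int = 32):
        super().__init__()
        self.pe_dim = pe_dim
        self.gru = nn.GRU(pe_dim, hidden, bidirectional=True, batch_first=True)
        self.coef = nn.Linear(2 * hidden, 1)

    def positional_encoding(self, n: int) -> torch.Tensor:
        # trigonometric position codes for the n positions (steps 3.31 / 3.41)
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.pe_dim, 2).float()
                        * (-math.log(10000.0) / self.pe_dim))
        pe = torch.zeros(n, self.pe_dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, dim) updated features
        pe = self.positional_encoding(x.size(0)).unsqueeze(0)    # (1, N, pe_dim)
        h, _ = self.gru(pe)                                      # BiGRU over positions
        theta = torch.softmax(self.coef(h).squeeze(-1), dim=-1)  # pooling coefficients theta_k
        return theta.squeeze(0) @ x                              # weighted sum -> final vector

pool = GeneralizedPooling()
v_bar = pool(torch.randn(85, 1024))   # (1024,) final image feature vector
```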
9. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 4 comprises:
step 4.1, performing a cosine similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity;
step 4.2, performing a cosine similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity;
step 4.3, performing a cosine similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity;
step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first similarity, the second similarity, and the third similarity;
and step 4.5, calculating a loss function using the comprehensive similarity and back-propagating it to update the network parameters, wherein the network parameters are located, respectively, in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
10. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 9, wherein the first similarity is:

$$S_1(V_L, T_S) = \frac{V_L \cdot T_S}{\|V_L\|\,\|T_S\|},$$

wherein $S_1(V_L, T_S)$ represents the first similarity, $V_L$ represents the local feature vector, $T_S$ represents the initial text feature vector, and $\|\cdot\|$ represents the modulus (norm) of a feature vector;

the second similarity is:

$$S_2(V_U, T_S) = \frac{V_U \cdot T_S}{\|V_U\|\,\|T_S\|},$$

wherein $S_2(V_U, T_S)$ represents the second similarity and $V_U$ represents the initial image feature vector;

the third similarity is:

$$S_3(\bar{v}, \bar{t}) = \frac{\bar{v} \cdot \bar{t}}{\|\bar{v}\|\,\|\bar{t}\|},$$

wherein $S_3(\bar{v}, \bar{t})$ represents the third similarity, $\bar{v}$ represents the final image feature vector, and $\bar{t}$ represents the final text feature vector;

the comprehensive similarity is:

$$S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(\bar{v}, \bar{t}),$$

wherein $S(I, T)$ represents the comprehensive similarity, $I$ represents the input image to be matched, and $T$ represents the input text to be matched;

the loss function is calculated as follows:

$$L = [d + S(I', T) - S(I, T)]_+ + [d + S(I, T') - S(I, T)]_+,$$

wherein $L$ represents the loss function, $d$ represents the margin parameter, $[x]_+ \equiv \max(x, 0)$, where $\equiv$ denotes identity, and $I'$ and $T'$ represent the hardest non-matching image and text, respectively, with respect to a matched image-text pair, satisfying $I' = \arg\max_{X \neq I} S(X, T)$ and $T' = \arg\max_{Y \neq T} S(I, Y)$, wherein $X$ represents image information that does not match the given text information and $Y$ represents text information that does not match the given image information.
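Claims 9-10 amount to summed cosine similarities plus a bidirectional hardest-negative triplet loss. A minimal batch-level sketch, assuming each row i of the two embedding matrices is a matched image-text pair and d is the margin:

```python
import torch
import torch.nn.functional as F

def hardest_negative_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, d: float = 0.2):
    """img_emb, txt_emb: (batch, dim) final embeddings, row i matches row i.
    Implements L = [d + S(I',T) - S(I,T)]_+ + [d + S(I,T') - S(I,T)]_+
    with I', T' the hardest non-matching image/text within the batch."""
    sims = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t()  # cosine S(I, T)
    pos = sims.diag()                                   # S(I, T) for matched pairs
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    neg = sims.masked_fill(mask, float("-inf"))
    neg_img = neg.max(dim=0).values                     # S(I', T): hardest wrong image per text
    neg_txt = neg.max(dim=1).values                     # S(I, T'): hardest wrong text per image
    loss = F.relu(d + neg_img - pos) + F.relu(d + neg_txt - pos)   # [x]_+ = max(x, 0)
    return loss.mean()
```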
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210504224.0A | 2022-05-10 | 2022-05-10 | Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism
Publications (2)

Publication Number | Publication Date
---|---
CN114896438A | 2022-08-12
CN114896438B | 2024-06-28
Family

ID=82722248

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210504224.0A | Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism (granted as CN114896438B, active) | 2022-05-10 | 2022-05-10

Country Status (1)

Country | Link
---|---
CN | CN114896438B (en)
Patent Citations (5)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN108647350A | 2018-05-16 | 2018-10-12 | Army Engineering University of PLA | Image-text associated retrieval method based on two-channel network
CN109903314A | 2019-03-13 | 2019-06-18 | Tencent Technology (Shenzhen) Co., Ltd. | Method for image-region positioning, method for model training, and related apparatus
US20210150373A1 | 2019-11-15 | 2021-05-20 | International Business Machines Corporation | Capturing the global structure of logical formulae with graph long short-term memory
US20210303921A1 | 2020-03-30 | 2021-09-30 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Cross-modality processing method and apparatus, and computer storage medium
CN114168784A | 2021-12-10 | 2022-03-11 | Guilin University of Electronic Technology | Layered supervision cross-modal image-text retrieval method
Non-Patent Citations (2)

Title
---
GUO JIE et al.: "HGAN: hierarchical graph alignment network for image-text retrieval", IEEE Transactions on Multimedia, vol. 25, 28 February 2023, pages 9189-9202
ZHANG TIAN; JIN CONG; TIE YUN; LI XIAOBING: "Research on audio database content matching method for cross-modal retrieval" (in Chinese), Journal of Signal Processing, vol. 36, no. 06, 12 June 2020, pages 966-976
Cited By (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN115510193A | 2022-10-10 | 2022-12-23 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Query result vectorization method, query result determination method, and related apparatus
CN115510193B | 2022-10-10 | 2024-04-16 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Query result vectorization method, query result determination method, and related apparatuses
CN115985509A | 2022-12-14 | 2023-04-18 | Guangdong Provincial People's Hospital | Medical imaging data retrieval system, method, apparatus, and storage medium
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |