CN114896438A - Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism - Google Patents

Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism

Info

Publication number: CN114896438A (granted as CN114896438B)
Application number: CN202210504224.0A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Inventors: 郭洁, 王孟瀛, 周妍, 高雅, 宋彬, 池育浩
Applicant and current assignee: Xidian University
Prior art keywords: feature vector, text, image, representing, similarity

Classifications

    • G06F16/532 Query formulation, e.g. graphical querying (information retrieval of still image data)
    • G06F16/5846 Retrieval using metadata automatically derived from the content, using extracted text
    • G06F18/22 Matching criteria, e.g. proximity measures (pattern recognition)
    • G06N3/045 Combinations of networks (neural networks based on biological models)
    • G06N3/08 Learning methods (neural networks)
    • G06V10/40 Extraction of image or video features
    • G06V10/761 Proximity, similarity or dissimilarity measures (image or video pattern matching)
    • G06V10/82 Image or video recognition or understanding using neural networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, comprising the following steps: respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text; obtaining an image feature graph and a text feature graph according to the relations between different nodes in the initial image feature vector and the initial text feature vector; respectively inputting the image feature graph and the text feature graph into a combined graph attention and generalized pooling module to obtain the final image and text feature vectors; obtaining a comprehensive similarity based on the first similarity, the second similarity and the third similarity, calculating a loss function from the comprehensive similarity, and back-propagating the loss function to update the network parameters; and obtaining the retrieval matching result from the comprehensive similarity. The invention alleviates the alignment difficulty of the retrieval task and obtains more complete image and text feature vectors that better represent the image-text matching relationship, thereby improving retrieval accuracy.

Description

Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism
Technical Field
The invention belongs to the technical field of data mining, and relates to an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism.
Background
In recent years, with the rapid development of the internet, people receive a large amount of data every day, and researchers have focused on how to accurately retrieve the required information from massive amounts of data. Image-text retrieval offers a solution to this problem.
The essence of image-text retrieval is to encode samples of the two modalities, image and text, to obtain their semantic representation features, and to design a corresponding similarity calculation method to compute the similarity between image features and text features. With an image-text retrieval model, a user can quickly find the image corresponding to a given text description, and conversely can quickly obtain the text description corresponding to a given image. Existing hierarchical alignment approaches only consider semantic alignment between the whole image and the whole sentence and between image regions and words, ignoring non-object elements such as global background information. Such semantic alignment is susceptible to negative examples that contain similar object entities but slightly different backgrounds. Meanwhile, traditional feature aggregation methods adopt max pooling or average pooling, ignoring the importance of the global-local cooperative relationship of multi-modal features.
Therefore, how to improve semantic alignment and how to strengthen the global-local cooperative relationship of multi-modal features has become an urgent problem.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism. The technical problem to be solved by the invention is realized by the following technical solutions:
An embodiment of the invention provides an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, comprising the following steps:
Step 1, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by concatenating a global feature vector and a local feature vector;
Step 2, obtaining an image feature graph and a text feature graph according to the relations between different nodes in the initial image feature vector and the initial text feature vector;
Step 3, respectively inputting the image feature graph and the text feature graph into a combined graph attention and generalized pooling module to obtain a final image feature vector and a final text feature vector;
Step 4, obtaining a comprehensive similarity between the preset image and the preset text based on the first similarity between the local feature vector and the initial text feature vector, the second similarity between the initial image feature vector and the initial text feature vector, and the third similarity between the final image feature vector and the final text feature vector; calculating a loss function from the comprehensive similarity and back-propagating it to update the network parameters, wherein the network parameters are respectively located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module;
Step 5, obtaining the retrieval matching result from the final comprehensive similarity output by the model after the network parameters are updated.
In one embodiment of the present invention, step 1 comprises:
Step 1.1, extracting the global feature vector V_G and the local feature vector V_L of the preset image;
Step 1.2, concatenating the global feature vector V_G and the local feature vector V_L to obtain the initial image feature vector;
Step 1.3, extracting the initial text feature vector T_S of the preset text.
In one embodiment of the invention, the global feature vector V_G is:

V_G = W_g G + b_g,

where V_G represents the global feature vector of the preset image, W_g represents a first weight matrix with W_g ∈ ℝ^(D×D_0), D represents the dimension of the output image feature vector, D_0 represents the size of each pixel, G represents the first output feature and satisfies G ∈ ℝ^(D_0×m²), m represents the size of the reconstructed feature map, and b_g represents a first bias constant;
the local feature vector V_L is:

V_L = W_l L + b_l,

where V_L represents the local feature vector of the preset image, W_l represents a second weight matrix with W_l ∈ ℝ^(D×D_k), D_k represents the dimension of each region feature, L represents the second output feature and satisfies L ∈ ℝ^(D_k×k), k represents the number of regions detected from the preset image, and b_l represents a second bias constant;
the initial image feature vector is:

V_U = V_G || V_L,

where V_U represents the initial image feature vector and || represents the concatenation operation; the dimension of the image feature vector is denoted D_U;
the initial text feature vector is:

T_S = W_S S + b_S,

where T_S represents the initial text feature vector, S represents the output feature and satisfies S ∈ ℝ^(D_1×l), D_1 represents the dimension of the text feature, l represents the number of words in the text, W_S represents a weight matrix with W_S ∈ ℝ^(D×D_1), and b_S represents a third bias constant.
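As a minimal sketch of the projections and concatenation above (a hedged illustration only: the module name, the use of nn.Linear, and all default dimensions are assumptions, not the claimed implementation):

```python
import torch
import torch.nn as nn

class FeatureProjection(nn.Module):
    """Illustrative sketch of step 1: linear projections V_G = W_g G + b_g,
    V_L = W_l L + b_l, T_S = W_S S + b_S, then node-wise concatenation
    V_U = V_G || V_L. All dimensions are placeholder assumptions."""
    def __init__(self, d_pixel=2048, d_region=2048, d_word=768,
                 d_img=2048, d_txt=768):
        super().__init__()
        self.proj_g = nn.Linear(d_pixel, d_img)   # V_G = W_g G + b_g
        self.proj_l = nn.Linear(d_region, d_img)  # V_L = W_l L + b_l
        self.proj_s = nn.Linear(d_word, d_txt)    # T_S = W_S S + b_S

    def forward(self, grid_feats, region_feats, word_feats):
        # grid_feats:   (m*m, d_pixel) pixel-level features G
        # region_feats: (k, d_region)  region features L
        # word_feats:   (l, d_word)    word features S
        v_g = self.proj_g(grid_feats)
        v_l = self.proj_l(region_feats)
        v_u = torch.cat([v_g, v_l], dim=0)        # V_U = V_G || V_L
        t_s = self.proj_s(word_feats)
        return v_u, t_s
```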
In one embodiment of the present invention, step 2 comprises:
Step 2.1, extracting the first image feature vector v_i^U of the i-th node and the second image feature vector v_j^U of the j-th node from the initial image feature vector;
Step 2.2, performing a dot product operation on the first image feature vector v_i^U and the second image feature vector v_j^U to obtain the first relation E_U;
Step 2.3, constructing the image feature graph from the initial image feature vector and the first relation E_U;
Step 2.4, extracting the first text feature vector t_{i1}^S of the i1-th node and the second text feature vector t_{j1}^S of the j1-th node from the initial text feature vector;
Step 2.5, performing a dot product operation on the first text feature vector t_{i1}^S and the second text feature vector t_{j1}^S to obtain the second relation E_S;
Step 2.6, constructing the text feature graph from the initial text feature vector and the second relation E_S.
In one embodiment of the invention, the first relation E_U is:

e_ij^U = v_i^U · v_j^U,

where · represents the dot product operation;
the image feature graph is:

G_V = (V_U, E_U),

where G_V represents the image feature graph, the features in the initial image feature vector serve as nodes, and the first relation E_U serves as edges;
the second relation E_S is:

e_{i1j1}^S = t_{i1}^S · t_{j1}^S;

the text feature graph is:

G_T = (T_S, E_S),

where G_T represents the text feature graph, the features in the initial text feature vector serve as nodes, and the second relation E_S serves as edges.
In one embodiment of the present invention, step 3 comprises:
Step 3.1, inputting the image feature graph into a graph attention network module, and propagating the initial image feature vector through a multi-head graph attention mechanism to obtain an updated image feature vector;
Step 3.2, inputting the text feature graph into a graph attention network module, and propagating the initial text feature vector through a multi-head graph attention mechanism to obtain an updated text feature vector;
Step 3.3, inputting the updated image feature vector into a generalized pooling module to obtain the final image feature vector;
Step 3.4, inputting the updated text feature vector into a generalized pooling module to obtain the final text feature vector.
In one embodiment of the invention, step 3.1 comprises:
Step 3.11, simultaneously inputting the initial image feature vector into each parallel layer in the graph attention network module, and obtaining the first feature quantization result of the h-th layer node by computing the vector dot product of the weight matrices and the input features;
Step 3.12, regularizing the first feature quantization result to obtain the first multi-head attention weight matrix;
Step 3.13, multiplying the first multi-head attention weight matrix and the learnable weight matrix with the initial image feature vector to obtain the first output feature of each layer;
Step 3.14, splicing all the first output features of the same image to obtain the spliced image feature;
Step 3.15, passing the spliced image feature through a regularization network to obtain the updated image feature vector.
Step 3.2 comprises:
Step 3.21, simultaneously inputting the initial text feature vector into each parallel layer in the graph attention network module, and obtaining the second feature quantization result of the h-th layer node by computing the vector dot product of the weight matrices and the input features;
Step 3.22, regularizing the second feature quantization result to obtain the second multi-head attention weight matrix;
Step 3.23, multiplying the second multi-head attention weight matrix and the learnable weight matrix with the initial text feature vector to obtain the second output feature of each layer;
Step 3.24, splicing all the second output features of the same text to obtain the spliced text feature;
Step 3.25, passing the spliced text feature through a regularization network to obtain the updated text feature vector.
In one embodiment of the invention, step 3.3 comprises:
Step 3.31, vectorizing the position indices of the updated image feature vector through a triangular position coding strategy to obtain the first position code;
Step 3.32, after the first position code is converted into a vector representation, generating the first pooling coefficients with a sequence model based on a bidirectional gated recurrent unit;
Step 3.33, based on the first pooling coefficients, obtaining the final image feature vector from the updated image feature vector:

v̂_i^U = Σ_{k=1}^{N} θ_k ṽ_{i,k}^U, θ_k = f(k, N), k = 1, 2, …, N,

where v̂_i^U represents the i-th feature in the final image feature vector, θ_k represents the first pooling coefficient, ṽ_{i,k}^U denotes the k-th sorted value of the i-th feature in the updated image feature vector, and the value of N is equal to D_U.
Step 3.4 comprises:
Step 3.41, vectorizing the position indices of the updated text feature vector through the triangular position coding strategy to obtain the second position code;
Step 3.42, after the second position code is converted into a vector representation, generating the second pooling coefficients with the sequence model based on the bidirectional gated recurrent unit;
Step 3.43, based on the second pooling coefficients, obtaining the final text feature vector from the updated text feature vector:

t̂_{i1}^S = Σ_{k1=1}^{N1} θ_{k1} t̃_{i1,k1}^S, θ_{k1} = f(k1, N1), k1 = 1, 2, …, N1,

where t̂_{i1}^S represents the i1-th feature in the final text feature vector, θ_{k1} represents the second pooling coefficient, t̃_{i1,k1}^S denotes the k1-th sorted value of the i1-th feature in the updated text feature vector, and the value of N1 is equal to D_S.
In one embodiment of the present invention, step 4 comprises:
Step 4.1, performing cosine similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity;
Step 4.2, performing cosine similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity;
Step 4.3, performing cosine similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity;
Step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first, second and third similarities;
Step 4.5, calculating the loss function from the comprehensive similarity and back-propagating it to update the network parameters, which are respectively located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
In one embodiment of the present invention, the first similarity is:

S_1(V_L, T_S) = (V_L · T_S) / (||V_L|| ||T_S||),

where S_1(V_L, T_S) represents the first similarity, V_L represents the local feature vector, T_S represents the initial text feature vector, and ||·|| represents the norm of a feature vector;
the second similarity is:

S_2(V_U, T_S) = (V_U · T_S) / (||V_U|| ||T_S||),

where S_2(V_U, T_S) represents the second similarity and V_U represents the initial image feature vector;
the third similarity is:

S_3(V̂_U, T̂_S) = (V̂_U · T̂_S) / (||V̂_U|| ||T̂_S||),

where S_3(V̂_U, T̂_S) represents the third similarity, V̂_U represents the final image feature vector, and T̂_S represents the final text feature vector;
the comprehensive similarity is:

S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(V̂_U, T̂_S),

where S(I, T) represents the comprehensive similarity, I represents the input image to be matched, and T represents the input text to be matched;
the loss function is calculated as follows:

L = [d + S(I′, T) - S(I, T)]_+ + [d + S(I, T′) - S(I, T)]_+,

where L represents the loss function, d represents a margin parameter, [x]_+ ≡ max(x, 0), ≡ denotes identity, I′ and T′ denote the hardest mismatched image and text with respect to a matched image-text pair, satisfying I′ = argmax_{X≠I} S(X, T) and T′ = argmax_{Y≠T} S(I, Y), X denotes image information that does not match the given text information, and Y denotes text information that does not match the given image information.
Compared with the prior art, the invention has the following beneficial effects:
1. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism introduces a hierarchical composite similarity calculation into intra-modal and inter-modal semantic alignment, computing the similarity between image feature vectors and text feature vectors extracted under different conditions. This enriches the learning of intra-modal and inter-modal interaction information, alleviates the alignment difficulty of the retrieval task, and further improves retrieval accuracy.
2. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism replaces traditional aggregation schemes such as max pooling and average pooling with generalized pooling, integrating the pooling operation into the graph attention mechanism instead of simply extracting the maximum value of the feature vector.
Other aspects and features of the present invention will become apparent from the following detailed description, which proceeds with reference to the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims. It should be further understood that the drawings are not necessarily drawn to scale and that, unless otherwise indicated, they are merely intended to conceptually illustrate the structures and procedures described herein.
Drawings
fig. 1 is a schematic flowchart of an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to an embodiment of the present invention;
fig. 2 is a schematic diagram of the feature vector graph according to an embodiment of the present invention;
fig. 3 is a schematic diagram of the graph attention mechanism and generalized pooling module according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of an image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to an embodiment of the present invention. The method comprises steps 1 to 5, wherein:
step 1, please refer to fig. 2, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by cascading a global feature vector and a local feature vector.
Specifically, the preset image is an image which needs to be matched with the text, the preset text is a text which needs to be matched with the image, if the preset image is 1 and the number of the initial texts is 5, the preset image and the 5 initial texts need to be retrieved to obtain a text with the highest similarity, and the text is used for describing the content of the image and serves as a matching result.
In a specific embodiment, step 1 may specifically include:
Step 1.1, extracting the global feature vector V_G and the local feature vector V_L of the preset image.
In this embodiment, the global feature vector V_G is:

V_G = W_g G + b_g,

where V_G represents the global feature vector of the preset image, W_g represents the first weight matrix with W_g ∈ ℝ^(D×D_0), D represents the dimension of the global feature vector of the output image, D_0 represents the size of each pixel, G represents the first output feature and satisfies G ∈ ℝ^(D_0×m²), m represents the size of the reconstructed feature map, and b_g represents the first bias constant.
In this embodiment, the local feature vector V_L is:

V_L = W_l L + b_l,

where V_L represents the local feature vector of the preset image, W_l represents the second weight matrix with W_l ∈ ℝ^(D×D_k), D_k represents the dimension of each region feature, L represents the second output feature and satisfies L ∈ ℝ^(D_k×k), k represents the number of regions detected from the preset image, and b_l represents the second bias constant.
Step 1.2, concatenating the global feature vector V_G and the local feature vector V_L to obtain the initial image feature vector:

V_U = V_G || V_L,

where V_U represents the initial image feature vector and || represents the concatenation operation; the dimension of the image feature vector is denoted D_U.
Step 1.3, extracting the initial text feature vector T_S of the preset text:

T_S = W_S S + b_S,

where T_S represents the initial text feature vector, S represents the output feature and satisfies S ∈ ℝ^(D_1×l), D_1 represents the dimension of the text feature, l represents the number of words in the text, W_S represents a weight matrix with W_S ∈ ℝ^(D×D_1), and b_S represents the third bias constant.
Optionally, the global feature extraction may use a ResNet152 encoder pre-trained on the ImageNet dataset to accurately extract pixel-level features of the image. For local feature extraction, a Faster R-CNN module pre-trained on the Visual Genome dataset may be used as the encoder. The image feature vector dimension is 2048 and is shared by the global and local features. The text feature extraction part uses a BERT pre-trained model with 12 layers, 12 heads, 768 hidden units and 110M parameters; the dimension of the resulting text feature vector is 768.
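A sketch of how the optional encoders above might be instantiated; torchvision only ships COCO detection weights, so the Visual Genome Faster R-CNN weights are assumed to come from an external checkpoint:

```python
import torchvision.models as models
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from transformers import BertModel

# Global feature encoder: ResNet152 pre-trained on ImageNet
resnet152 = models.resnet152(pretrained=True)

# Local feature encoder: a Faster R-CNN detector (COCO weights shown here;
# Visual Genome weights would be loaded from a separate checkpoint)
detector = fasterrcnn_resnet50_fpn(pretrained=True)

# Text encoder: BERT-base (12 layers, 12 heads, 768 hidden units, ~110M params)
bert = BertModel.from_pretrained('bert-base-uncased')
```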
Step 2, obtaining an image feature graph and a text feature graph according to the relations between different nodes in the initial image feature vector and the initial text feature vector.
In a specific embodiment, step 2 may specifically include:
Step 2.1, extracting the first image feature vector v_i^U of the i-th node and the second image feature vector v_j^U of the j-th node from the initial image feature vector.
Step 2.2, performing a dot product operation on the first image feature vector v_i^U and the second image feature vector v_j^U to obtain the first relation E_U:

e_ij^U = v_i^U · v_j^U,

where · indicates the dot product operation.
Step 2.3, constructing the image feature graph from the initial image feature vector and the first relation E_U:

G_V = (V_U, E_U),

where G_V represents the image feature graph, with the features in the initial image feature vector as nodes and the first relation E_U as edges.
Step 2.4, extracting the first text feature vector t_{i1}^S of the i1-th node and the second text feature vector t_{j1}^S of the j1-th node from the initial text feature vector.
Step 2.5, performing a dot product operation on the first text feature vector t_{i1}^S and the second text feature vector t_{j1}^S to obtain the second relation E_S:

e_{i1j1}^S = t_{i1}^S · t_{j1}^S.

Step 2.6, constructing the text feature graph from the initial text feature vector and the second relation E_S:

G_T = (T_S, E_S),

where G_T represents the text feature graph, with the features in the initial text feature vector as nodes and the second relation E_S as edges.
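A minimal sketch of this graph construction (the function name is an illustrative assumption; it treats the graph as fully connected, with pairwise dot products as edge weights):

```python
import torch

def build_feature_graph(nodes: torch.Tensor):
    """Build a feature graph G = (nodes, edges) in which the edge weight
    between nodes i and j is the dot product of their feature vectors,
    mirroring E_U and E_S above. nodes: (n, d) feature matrix."""
    edges = nodes @ nodes.t()  # e_ij = x_i . x_j
    return nodes, edges

# Usage: G_V = build_feature_graph(v_u); G_T = build_feature_graph(t_s)
```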
Step 3, respectively inputting the image feature graph and the text feature graph into the combined graph attention and generalized pooling module to obtain the final image feature vector and the final text feature vector.
Specifically, the image feature graph and the text feature graph are respectively and simultaneously input into the combined graph attention and generalized pooling module, which iteratively updates the image and text feature vectors. Referring to fig. 3, fig. 3 is a schematic diagram of the graph attention mechanism and generalized pooling module according to an embodiment of the present invention. As shown, in this embodiment the image and text feature vectors are updated and aggregated by the constructed graph attention mechanism and the generalized pooling operation. Stacking multiple combined graph attention and generalized pooling modules enables better updating of the vectors.
In a specific embodiment, step 3 may specifically include steps 3.1-3.4, wherein:
and 3.1, inputting the image feature map into a map attention network module, and spreading the initial image feature vector through a multi-head map attention machine mechanism algorithm to obtain an updated image feature vector.
Specifically, step 3.1 may specifically include steps 3.11-3.15, wherein:
and 3.11, simultaneously inputting the initial image feature vectors into each parallel layer in the graph attention network module, and obtaining a first feature quantization result of the h-th layer node by calculating the vector dot product of the weight matrix and the input features.
Specifically, the initial image feature vector V of the image is obtained in step 1 U Is shown as
Figure BDA0003636741670000141
And (3) introducing a multi-head self-attention mechanism to calculate the attention coefficient of each node, and setting the number of parallel layers of the multi-head attention mechanism as H, wherein H is more than or equal to 1 and less than or equal to H. Inputting the image feature vectors into each parallel layer at the same time, and calculating the weight matrixInputting the vector dot product of the features to obtain a first preliminary quantization result of the node features, wherein the calculation mode of the first feature quantization result of the h-th layer node is as follows:
Figure BDA0003636741670000142
wherein,
Figure BDA0003636741670000143
represents the first preliminary quantization result, i.e., the importance of node i to node j in the h-th layer, D U Dimension, W, representing image features q And W k Each represents a learnable weight matrix.
And 3.12, regularizing the first characteristic quantization result to obtain a first multi-head attention weight matrix.
Specifically, the first feature quantization result is regularized to facilitate comparison of parameters between nodes, so as to obtain a first multi-start attention weight matrix α ij The specific calculation method is as follows:
Figure BDA0003636741670000144
wherein softmax represents a normalization function, Ν i A set of neighbor nodes representing node i.
Step 3.13, multiplying the first multi-head attention weight matrix, the learnable weight matrix and the initial image feature vector to obtain a first output feature of each layer, wherein the first output feature is as follows:
Figure BDA0003636741670000151
wherein the head h Representing a first output characteristic, W v h Representing a weight matrix that the h-th layer can learn.
Step 3.14, all the first output features of the same image are spliced (vector end-to-end connection), so as to obtain spliced image features, wherein the spliced image features are as follows:
Figure BDA0003636741670000152
wherein,
Figure BDA0003636741670000153
representing features of the stitched image, W o Representing a learnable weight matrix and concat represents a splicing function.
And 3.15, obtaining an updated image feature vector by the spliced image features through a regularization network.
Specifically, the spliced image features are subjected to final output representation through a regularization network, namely, the updated image feature vectors are refined through an image attention machine mechanism
Figure BDA0003636741670000154
Ith feature of updated image feature vector
Figure BDA0003636741670000155
The specific calculation method is as follows:
Figure BDA0003636741670000156
in which, ReLU is selected as the activation function, and BN layer is used to keep the input of each layer of neural network in the same distribution.
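A minimal PyTorch sketch of steps 3.11-3.15 under the formulas above (scaled dot-product attention per head, softmax normalization, concatenation of heads, then ReLU + BatchNorm); treating the graph as fully connected and all layer shapes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadGraphAttention(nn.Module):
    """Sketch of the multi-head graph attention update (steps 3.11-3.15)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):
        # x: (n, dim) node features; the graph is treated as fully connected
        n = x.size(0)
        dh = self.d // self.h
        q = self.w_q(x).view(n, self.h, dh).transpose(0, 1)  # (h, n, dh)
        k = self.w_k(x).view(n, self.h, dh).transpose(0, 1)
        v = self.w_v(x).view(n, self.h, dh).transpose(0, 1)
        # step 3.11: e_ij = (W_q v_i) . (W_k v_j) / sqrt(D)
        e = q @ k.transpose(1, 2) / (self.d ** 0.5)          # (h, n, n)
        alpha = F.softmax(e, dim=-1)                          # step 3.12
        heads = alpha @ v                                     # step 3.13
        cat = heads.transpose(0, 1).reshape(n, self.d)        # step 3.14
        return self.bn(F.relu(self.w_o(cat)))                 # step 3.15
```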
Step 3.2, inputting the text feature graph into the graph attention network module, and propagating the initial text feature vector through the multi-head graph attention mechanism to obtain the updated text feature vector.
Specifically, step 3.2 may specifically include steps 3.21-3.25, wherein:
Step 3.21, simultaneously inputting the initial text feature vector into each parallel layer in the graph attention network module, and obtaining the second feature quantization result of the h-th layer node by computing the vector dot product of the weight matrices and the input features.
Specifically, the initial text feature vector T_S obtained in step 1 is expressed as a set of node features {t_{i1}^S}. The text feature vectors are input into every parallel layer simultaneously, and the vector dot product of the weight matrices and the input features is computed to obtain the second preliminary quantization result of the node features. The second feature quantization result of the h-th layer node is computed as:

e_{i1j1}^h = (W_q t_{i1}^S) · (W_k t_{j1}^S) / √(D_S),

where e_{i1j1}^h represents the second preliminary quantization result, i.e., the importance of node i1 to node j1 in the h-th layer, D_S represents the dimension of the text feature, and W_q and W_k each represent a learnable weight matrix.
Step 3.22, regularizing the second feature quantization result to obtain the second multi-head attention weight matrix.
Specifically, the second feature quantization result is normalized to make the parameters comparable across nodes, yielding the second multi-head attention weight matrix α_{i1j1}:

α_{i1j1} = softmax(e_{i1j1}^h) = exp(e_{i1j1}^h) / Σ_{j1'∈N_{i1}} exp(e_{i1j1'}^h),

where N_{i1} represents the set of neighbor nodes of node i1.
Step 3.23, multiplying the second multi-head attention weight matrix and the learnable weight matrix with the initial text feature vector to obtain the second output feature of each layer:

head1_h = Σ_{j1∈N_{i1}} α_{i1j1} W_s^h t_{j1}^S,

where head1_h represents the second output feature and W_s^h represents the learnable weight matrix of the h-th layer.
Step 3.24, splicing (connecting end-to-end) all the second output features of the same text to obtain the spliced text feature:

t̄_{i1}^S = W_o · concat(head1_1, …, head1_H),

where t̄_{i1}^S represents the spliced text feature.
Step 3.25, passing the spliced text feature through a regularization network to obtain the updated text feature vector.
Specifically, the spliced text feature is passed through a regularization network to produce the final output representation, i.e., the text feature vector refined by the graph attention mechanism. The i1-th feature t̃_{i1}^S of the updated text feature vector is computed as:

t̃_{i1}^S = BN(ReLU(t̄_{i1}^S)).
and 3.3, inputting the updated image feature vector into a generalized pooling module to obtain a final image feature vector.
Specifically, step 3.3 may specifically include steps 3.31-3.33, wherein:
and 3.31, vectorizing the position index by the updated image feature vector through a triangular position coding strategy to obtain a first position code.
Specifically, the generalized pooling module consists of a triangular position coding strategy and a sequence model based on a bidirectional gating cyclic unit. Firstly, vectorizing a position index by an updated image feature vector through a triangular position coding strategy, wherein the specific calculation mode is as follows:
Figure BDA0003636741670000173
Figure BDA0003636741670000174
wherein p is k Representing a first position code, d p Representing a first given vector dimension, j a Code representing the first position, j a Is given as d p 1/2, d of p Is equal to D U
Step 3.32, after the first position code is converted into vector representation, generating a first pooling coefficient by adopting a sequence model based on a bidirectional gating circulation unit, wherein the specific calculation mode is as follows:
Figure BDA0003636741670000175
wherein,
Figure BDA0003636741670000176
representing a set of first pooling coefficients, MLP representing a multi-layer neural network unit, BiGRU representing a bi-directional gated cyclic unit.
And 3.33, based on the first pooling coefficient, obtaining a final image feature vector according to the updated image feature vector.
Specifically, when the image passes through the generalized pooling module, the generalized pooling module performs sorting operation on the vectors, learns the pooling coefficient of each vector, performs weighted sum on the vectors, and finally outputs the node feature vector of the image
Figure BDA0003636741670000181
The specific calculation method is as followsThe following:
Figure BDA0003636741670000182
θ k =f(k,N),k=1,2,…,N
aggregating all image nodes to obtain a final image feature vector
Figure BDA0003636741670000183
Wherein,
Figure BDA0003636741670000184
representing the final image feature vector, f corresponding to the process of generating pooling coefficients, θ k Representing the first pooling coefficient, i.e. theta k Representing pooled coefficients of the classified kth vector and satisfying
Figure BDA0003636741670000185
Figure BDA0003636741670000186
Representing the updated image feature vector, with the value of N equal to D U
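A minimal sketch of the generalized pooling of steps 3.31-3.33 (sinusoidal position codes fed to a BiGRU and an MLP that produce the coefficients θ_k, then a weighted sum over per-dimension sorted features); the softmax normalization of θ and all sizes are assumptions:

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Sketch of steps 3.31-3.33: learn pooling coefficients theta_k from
    triangular (sinusoidal) position codes via a BiGRU + MLP, then aggregate
    the n node vectors by a weighted sum of their sorted values."""
    def __init__(self, d_pos=32, hidden=32):
        super().__init__()
        self.d_pos = d_pos
        self.gru = nn.GRU(d_pos, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def positional_encoding(self, n):
        # p_{k,2j} = sin(k / 10000^{2j/d}), p_{k,2j+1} = cos(k / 10000^{2j/d})
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        j = torch.arange(0, self.d_pos, 2, dtype=torch.float32)
        div = torch.pow(10000.0, j / self.d_pos)
        pe = torch.zeros(n, self.d_pos)
        pe[:, 0::2] = torch.sin(pos / div)
        pe[:, 1::2] = torch.cos(pos / div)
        return pe

    def forward(self, x):
        # x: (n, d) updated node features
        n = x.size(0)
        pe = self.positional_encoding(n).unsqueeze(0)             # (1, n, d_pos)
        out, _ = self.gru(pe)                                     # (1, n, 2*hidden)
        theta = torch.softmax(self.mlp(out).squeeze(-1), dim=-1)  # coefficients sum to 1
        x_sorted, _ = torch.sort(x, dim=0, descending=True)       # per-dimension sort
        return (theta.squeeze(0).unsqueeze(1) * x_sorted).sum(0)  # pooled vector (d,)
```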
Step 3.4, inputting the updated text feature vector into the generalized pooling module to obtain the final text feature vector.
Specifically, step 3.4 may specifically include steps 3.41-3.43, wherein:
Step 3.41, vectorizing the position indices of the updated text feature vector through the triangular position coding strategy to obtain the second position code:

p_{k1,2j_b} = sin(k1 / 10000^(2j_b/d_q)),
p_{k1,2j_b+1} = cos(k1 / 10000^(2j_b/d_q)),

where p_{k1} represents the second position code, d_q represents the second given vector dimension, j_b indexes the components of the second position code and runs up to d_q/2, and the value of d_q is equal to D_S.
Step 3.42, after the second position code is converted into a vector representation, the sequence model based on the bidirectional gated recurrent unit generates the second pooling coefficients:

{θ_1, …, θ_{N1}} = MLP(BiGRU(p_1, …, p_{N1})),

where {θ_{k1}} represents the set of second pooling coefficients.
Step 3.43, based on the second pooling coefficients, obtaining the final text feature vector from the updated text feature vector.
Specifically, when the text passes through the generalized pooling module, the module sorts the vectors, learns the pooling coefficient of each vector, and takes a weighted sum, finally outputting the node feature vector of the text:

t̂_{i1}^S = Σ_{k1=1}^{N1} θ_{k1} t̃_{i1,k1}^S,
θ_{k1} = f(k1, N1), k1 = 1, 2, …, N1,

and all text nodes are aggregated to obtain the final text feature vector T̂_S, where t̂_{i1}^S represents the i1-th feature of the final text feature vector, θ_{k1} represents the second pooling coefficient, t̃_{i1,k1}^S denotes the k1-th sorted value of the i1-th feature of the updated text feature vector, and the value of N1 is equal to D_S.
Step 4, obtaining the comprehensive similarity between the preset image and the preset text based on the first similarity between the local feature vector and the initial text feature vector, the second similarity between the initial image feature vector and the initial text feature vector, and the third similarity between the final image feature vector and the final text feature vector; calculating the loss function from the comprehensive similarity and back-propagating it to update the network parameters.
In one embodiment, step 4 comprises:
Step 4.1, performing cosine similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity:

S_1(V_L, T_S) = (V_L · T_S) / (||V_L|| ||T_S||),

where S_1(V_L, T_S) represents the first similarity, V_L represents the local feature vector, T_S represents the initial text feature vector, and ||·|| represents the norm of a feature vector.
Step 4.2, performing cosine similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity:

S_2(V_U, T_S) = (V_U · T_S) / (||V_U|| ||T_S||),

where S_2(V_U, T_S) represents the second similarity and V_U represents the initial image feature vector.
Step 4.3, performing cosine similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity:

S_3(V̂_U, T̂_S) = (V̂_U · T̂_S) / (||V̂_U|| ||T̂_S||),

where S_3(V̂_U, T̂_S) represents the third similarity, V̂_U represents the final image feature vector, and T̂_S represents the final text feature vector.
Step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first, second and third similarities:

S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(V̂_U, T̂_S),

where S(I, T) represents the comprehensive similarity, I represents the input image to be matched, and T represents the input text to be matched.
Step 4.5, calculating the loss function from the comprehensive similarity and back-propagating it to update the network parameters, which are respectively located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
Specifically, a loss function is introduced to train the model so that matched image-text pairs receive higher similarity scores than unmatched pairs:

L = [d + S(I′, T) - S(I, T)]_+ + [d + S(I, T′) - S(I, T)]_+,

where L represents the loss function, d represents a margin parameter, [x]_+ ≡ max(x, 0), ≡ denotes identity, I′ and T′ denote the hardest mismatched image and text with respect to a matched image-text pair, satisfying I′ = argmax_{X≠I} S(X, T) and T′ = argmax_{Y≠T} S(I, Y), X denotes image information that does not match the given text information, and Y denotes text information that does not match the given image information. With this loss, the comprehensive similarity score of matched image-text pairs becomes higher than that of unmatched pairs.
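A sketch of step 4 under the formulas above; it assumes all vectors have already been projected or pooled to a common dimension, and the batch-wise hardest-negative mining is an assumed implementation detail:

```python
import torch
import torch.nn.functional as F

def composite_similarity(v_l, v_u, v_hat, t_s, t_hat):
    """S(I,T) = S1(V_L,T_S) + S2(V_U,T_S) + S3(V_hat,T_hat), each a cosine
    similarity; all inputs are assumed to share the same last dimension."""
    s1 = F.cosine_similarity(v_l, t_s, dim=-1)
    s2 = F.cosine_similarity(v_u, t_s, dim=-1)
    s3 = F.cosine_similarity(v_hat, t_hat, dim=-1)
    return s1 + s2 + s3

def triplet_loss(sim_matrix, margin=0.2):
    """Hinge loss with the hardest negatives:
    L = [d + S(I',T) - S(I,T)]_+ + [d + S(I,T') - S(I,T)]_+ .
    sim_matrix: (B, B) similarities; diagonal entries are matched pairs."""
    b = sim_matrix.size(0)
    pos = sim_matrix.diag()
    mask = torch.eye(b, dtype=torch.bool, device=sim_matrix.device)
    neg = sim_matrix.masked_fill(mask, float('-inf'))
    hardest_img = neg.max(dim=0).values   # S(I',T): hardest mismatched image
    hardest_txt = neg.max(dim=1).values   # S(I,T'): hardest mismatched text
    return (F.relu(margin + hardest_img - pos) +
            F.relu(margin + hardest_txt - pos)).mean()
```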
Step 5, obtaining the retrieval matching result from the final comprehensive similarity output by the model after the network parameters are updated, wherein the model comprises the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module.
Specifically, in the image-to-text retrieval task, for a preset image to be matched, the candidate preset texts are ranked according to the comprehensive similarity obtained in step 4, giving the text retrieval matching result for the preset image, i.e., the preset text with the highest comprehensive similarity score is taken as the final retrieval matching result. Similarly, in the text-to-image retrieval task, for a preset text to be matched, the candidate preset images are ranked according to the final comprehensive similarity output by the model after the network parameters are updated, giving the image retrieval matching result for the preset text, i.e., the preset image with the highest comprehensive similarity score is taken as the final retrieval matching result.
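A sketch of this ranking step (the function name is illustrative):

```python
import torch

def retrieve(query_to_candidates_sim: torch.Tensor, top_k: int = 5):
    """Rank candidates for one query by comprehensive similarity and return
    the indices of the top-k matches (index 0 is the final matching result)."""
    scores, indices = torch.sort(query_to_candidates_sim, descending=True)
    return indices[:top_k], scores[:top_k]
```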
Compared with the retrieval models of the prior art, the image-text retrieval method of this embodiment, based on hierarchical alignment and a generalized pooling graph attention mechanism, uses a hierarchical composite similarity calculation to perform semantic alignment between modalities on image and text feature vectors extracted under different conditions. This enriches the learning of intra-modal and inter-modal interaction information, alleviates the alignment difficulty of the retrieval task, and further improves retrieval accuracy. Compared with the prior art, the method also strengthens the local object semantic relations and the global context information of images and texts, obtaining more complete image and text feature vectors that better represent the image-text matching relationship and improving retrieval accuracy.
Example two
In this embodiment, a simulation experiment is performed on the image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism of the first embodiment, and the effect of the invention is further illustrated by comparison with existing image-text retrieval methods.
1. Simulation experiment conditions:
Operating system: Ubuntu 16.04, Python 3.6
Experiment platform: PyTorch 1.7.1
Processor: Intel Xeon Gold 6226R CPU, 64GB RAM, 1TB SSD
Graphics card: NVIDIA Tesla A100 GPU
Memory: 64GB
2. Simulation experiment contents:
simulation experiment I: image retrieval text task and accuracy rate experiment of text retrieval image task
The following experiments were all performed in the same experimental environment. The data set 1 and the data set 2 are both image-text retrieval task classic data sets, and the image-text retrieval method based on hierarchical alignment and generalized pooling image attention machine provided by the invention and the reference method belong to image-text retrieval algorithms in various semantic alignment modes.
Table 1 baseline method under data set 1 and method recall comparison proposed by the present invention
Figure BDA0003636741670000231
TABLE 2 Baseline method under data set 2 and recall comparison of methods proposed by the invention
Figure BDA0003636741670000232
From table 1 and table 2, it can be seen that under different data sets, the method provided by the present invention has good performance on both the image retrieval text task and the text retrieval image task, and especially exceeds the baseline method on both the R @1 and Rsum indexes. The experimental results carried out on the data set 1 can respectively reach 81.1, 67.4 and 533.2, and compared with the existing retrieval method (baseline method 1 in the figure), the improvement of 2.3%, 0.8% and 3.8% is respectively realized; in the experimental results performed on the data set 2, the results corresponding to the R @1 index achieved 2.3% and 2% improvement, respectively, compared to the existing retrieval method (baseline method 2 in the figure). The above experimental results show that the retrieval accuracy can be greatly improved by introducing the generalized pooling method into the image-text retrieval task and guiding the updating of the generalized pooling method by utilizing the similarity of the feature vectors.
Simulation experiment II: visualization experiment comparing the importance of the combined graph attention and generalized pooling module in the model
The following experiments were all performed in the same experimental environment. The ablated alternative removes the combined graph attention and generalized pooling module and instead builds the model from a conventional graph attention mechanism and max pooling.
Table 3: visualization experiment comparing the importance of the combined graph attention and generalized pooling module (the table is reproduced as an image in the original publication)
As can be seen from Table 3, for the same image, the top five text descriptions retrieved by the proposed method are all correct, while the third of the top five results of the alternative method is wrong. These experimental results show that introducing the generalized pooling method into the image-text retrieval task and guiding its update with feature-vector similarity greatly improves retrieval accuracy.
In the description of the invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine different embodiments or examples described in this specification.
The foregoing is a detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all such modifications shall be considered to fall within the protection scope of the invention.

Claims (10)

1. An image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism, characterized by comprising the following steps:
step 1, respectively extracting an initial image feature vector of a preset image and an initial text feature vector of a preset text, wherein the initial image feature vector is obtained by cascading a global feature vector and a local feature vector;
step 2, obtaining an image feature graph and a text feature graph according to the connection relations between different nodes in the initial image feature vector and the initial text feature vector, respectively;
step 3, inputting the image feature graph and the text feature graph respectively into a combined graph attention and generalized pooling module to obtain a final image feature vector and a final text feature vector;
step 4, obtaining a comprehensive similarity between the preset image and the preset text based on a first similarity between the global feature vector and the initial text feature vector, a second similarity between the local feature vector and the initial text feature vector, and a third similarity between the final image feature vector and the final text feature vector; calculating a loss function using the comprehensive similarity, and back-propagating the loss function to update network parameters, wherein the network parameters are located in an image feature vector extraction part, a text feature vector extraction part, and the combined graph attention and generalized pooling module, respectively;
step 5, obtaining a retrieval matching result using the final comprehensive similarity output by the model after the network parameters are updated.
2. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 1 comprises:
step 1.1, extracting the global feature vector $V_G$ and the local feature vector $V_L$ of the preset image;
step 1.2, cascading the global feature vector $V_G$ and the local feature vector $V_L$ to obtain the initial image feature vector;
step 1.3, extracting the initial text feature vector $T_S$ of the preset text.
3. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 2, wherein the global feature vector $V_G$ is:

$$V_G = W_g G + b_g,$$

where $V_G$ denotes the global feature vector of the preset image, $W_g$ denotes the first weight matrix with $W_g \in \mathbb{R}^{D \times D_0}$, $D$ denotes the dimension of the output image feature vector, $D_0$ denotes the size of each pixel, $G$ denotes the first output feature with $G \in \mathbb{R}^{D_0 \times m^2}$, $m$ denotes the size of the reconstructed feature map, and $b_g$ denotes the first bias constant;

the local feature vector $V_L$ is:

$$V_L = W_l L + b_l,$$

where $V_L$ denotes the local feature vector of the preset image, $W_l$ denotes the second weight matrix with $W_l \in \mathbb{R}^{D \times D_k}$, $D_k$ denotes the dimension of each region feature, $L$ denotes the second output feature with $L \in \mathbb{R}^{D_k \times k}$, $k$ denotes the number of regions detected from the preset image, and $b_l$ denotes the second bias constant;

the initial image feature vector is:

$$V_U = V_G \,\|\, V_L,$$

where $V_U$ denotes the initial image feature vector, $\|$ denotes the cascading operation, and $V_U \in \mathbb{R}^{D_U}$, with $D_U$ denoting the dimension of the image feature vector;

the initial text feature vector is:

$$T_S = W_S S + b_S,$$

where $T_S$ denotes the initial text feature vector, $S$ denotes the output feature with $S \in \mathbb{R}^{D_1 \times l}$, $D_1$ denotes the dimension of the text features, $l$ denotes the number of words in the text, $W_S$ denotes the weight matrix with $W_S \in \mathbb{R}^{D \times D_1}$, and $b_S$ denotes the third bias constant.
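By way of illustration only, the following PyTorch sketch shows one possible realization of the projections in claim 3. All dimension values are placeholders, the backbone features G, L and S are random stand-ins for a CNN grid, a region detector and a word encoder, and the mean-pooling used to reduce the grid to a single global vector is an assumption not fixed by the claim:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the claim fixes none of these values.
D, D0, Dk, D1 = 1024, 2048, 2048, 300   # output dim, pixel dim, region dim, word dim
m, k, l = 7, 36, 12                     # feature-map size, region count, word count

W_g = nn.Linear(D0, D)                  # first weight matrix + first bias (V_G = W_g G + b_g)
W_l = nn.Linear(Dk, D)                  # second weight matrix + second bias (V_L = W_l L + b_l)
W_s = nn.Linear(D1, D)                  # text weight matrix + third bias (T_S = W_S S + b_S)

G = torch.randn(m * m, D0)              # first output feature: one vector per grid cell
L = torch.randn(k, Dk)                  # second output feature: one vector per detected region
S = torch.randn(l, D1)                  # text output feature: one vector per word

V_G = W_g(G).mean(dim=0, keepdim=True)  # global feature vector (1 x D); pooling is assumed
V_L = W_l(L)                            # local feature vectors (k x D)
V_U = torch.cat([V_G, V_L], dim=0)      # initial image features: cascade of global and local
T_S = W_s(S)                            # initial text feature vectors (l x D)
```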
4. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 2 comprises:
step 2.1, extracting, from the initial image feature vector, the first image feature vector $v_i$ of the $i$-th node and the second image feature vector $v_j$ of the $j$-th node;
step 2.2, performing a dot-product operation on the first image feature vector $v_i$ and the second image feature vector $v_j$ to obtain the first relation $E_U$;
step 2.3, constructing the image feature graph according to the initial image feature vector and the first relation $E_U$;
step 2.4, extracting, from the initial text feature vector, the first text feature vector $t_{i1}$ of the $i1$-th node and the second text feature vector $t_{j1}$ of the $j1$-th node;
step 2.5, performing a dot-product operation on the first text feature vector $t_{i1}$ and the second text feature vector $t_{j1}$ to obtain the second relation $E_S$;
step 2.6, constructing the text feature graph according to the initial text feature vector and the second relation $E_S$.
5. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 4, wherein the first relation $E_U$ is:

$$E_U = v_i \odot v_j,$$

where $\odot$ denotes the dot-product operation;

the image feature graph is:

$$G_V = (V_U, E_U),$$

where $G_V$ denotes the image feature graph, with the features in the initial image feature vector as nodes and the first relation $E_U$ as edges;

the second relation $E_S$ is:

$$E_S = t_{i1} \odot t_{j1};$$

the text feature graph is:

$$G_T = (T_S, E_S),$$

where $G_T$ denotes the text feature graph, with the features in the initial text feature vector as nodes and the second relation $E_S$ as edges.
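A minimal sketch of the graph construction in claims 4 and 5, assuming the node features are stacked row-wise so that all pairwise dot products can be taken at once; the node counts and dimension are placeholders:

```python
import torch

def build_feature_graph(nodes: torch.Tensor):
    """nodes: (num_nodes, D) node features. Returns (nodes, edges), where
    edges[i, j] = v_i . v_j is the pairwise dot-product relation of claim 5."""
    edges = nodes @ nodes.t()          # E_U or E_S: (num_nodes, num_nodes)
    return nodes, edges

V_U = torch.randn(37, 1024)            # initial image feature vectors (k + 1 nodes, assumed)
T_S = torch.randn(12, 1024)            # initial text feature vectors (l word nodes, assumed)
G_V = build_feature_graph(V_U)         # image feature graph G_V = (V_U, E_U)
G_T = build_feature_graph(T_S)         # text feature graph  G_T = (T_S, E_S)
```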
6. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 3 comprises:
step 3.1, inputting the image feature graph into a graph attention network module, and propagating the initial image feature vector through a multi-head graph attention algorithm to obtain an updated image feature vector;
step 3.2, inputting the text feature graph into a graph attention network module, and propagating the initial text feature vector through a multi-head graph attention algorithm to obtain an updated text feature vector;
step 3.3, inputting the updated image feature vector into a generalized pooling module to obtain the final image feature vector;
step 3.4, inputting the updated text feature vector into a generalized pooling module to obtain the final text feature vector.
7. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 6, wherein step 3.1 comprises:
step 3.11, simultaneously inputting the initial image feature vector into each parallel layer in the graph attention network module, and obtaining a first feature quantization result for the nodes of the $h$-th layer by computing the vector dot product of the weight matrix and the input features;
step 3.12, regularizing the first feature quantization result to obtain a first multi-head attention weight matrix;
step 3.13, multiplying the first multi-head attention weight matrix, the learnable weight matrix, and the initial image feature vector to obtain the first output feature of each layer;
step 3.14, splicing all the first output features of the same image to obtain the spliced image features;
step 3.15, passing the spliced image features through a regularization network to obtain the updated image feature vector;
step 3.2 comprises:
step 3.21, simultaneously inputting the initial text feature vector into each parallel layer in the graph attention network module, and obtaining a second feature quantization result for the nodes of the $h$-th layer by computing the vector dot product of the weight matrix and the input features;
step 3.22, regularizing the second feature quantization result to obtain a second multi-head attention weight matrix;
step 3.23, multiplying the second multi-head attention weight matrix, the learnable weight matrix, and the initial text feature vector to obtain the second output feature of each layer;
step 3.24, splicing all the second output features of the same text to obtain the spliced text features;
step 3.25, passing the spliced text features through a regularization network to obtain the updated text feature vector.
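The multi-head propagation of claim 7 could look roughly as follows. The bilinear score, softmax as the "regularization" of the quantization results, LayerNorm as the regularization network, and the output projection after splicing are all assumptions the claim does not fix, and the sketch attends over all node pairs rather than masking by the relation graph:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadGraphAttention(nn.Module):
    """Sketch of steps 3.11-3.15 (and, symmetrically, 3.21-3.25)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.W_score = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(heads))
        self.W_value = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(heads))
        self.proj = nn.Linear(heads * dim, dim)   # maps spliced heads back to dim
        self.norm = nn.LayerNorm(dim)             # "regularization network" (assumed LayerNorm)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_nodes, dim)
        outs = []
        for W_s, W_v in zip(self.W_score, self.W_value):
            scores = W_s(x) @ x.t()              # per-head feature quantization (dot products)
            attn = F.softmax(scores, dim=-1)     # regularized into an attention weight matrix
            outs.append(attn @ W_v(x))           # attention x learnable weights x features
        spliced = torch.cat(outs, dim=-1)        # splice all per-head output features
        return self.norm(self.proj(spliced))     # updated feature vectors

x = torch.randn(37, 1024)                        # e.g. the image feature-graph nodes
updated = MultiHeadGraphAttention(dim=1024, heads=8)(x)   # (37, 1024)
```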
8. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 6, wherein step 3.3 comprises:
step 3.31, vectorizing the position indexes of the updated image feature vector through a triangular position coding strategy to obtain a first position code;
step 3.32, after converting the first position code into a vector representation, generating the first pooling coefficients using a sequence model based on a bidirectional gated recurrent unit;
step 3.33, based on the first pooling coefficients, obtaining the final image feature vector from the updated image feature vector, wherein the final image feature vector is:

$$\hat{v}_i = \sum_{k=1}^{N} \theta_k \, v_{(k),i}, \qquad \sum_{k=1}^{N} \theta_k = 1,$$

where $\hat{V}$ denotes the final image feature vector, $\hat{v}_i$ denotes the $i$-th feature of the final image feature vector, $\theta_k$ denotes the first pooling coefficient, $v_{(k),i}$ denotes the $i$-th feature of the $k$-th rank-ordered updated image feature vector, and the value of $N$ is equal to $D_U$;
step 3.4 comprises:
step 3.41, vectorizing the position indexes of the updated text feature vector through a triangular position coding strategy to obtain a second position code;
step 3.42, after converting the second position code into a vector representation, generating the second pooling coefficients using a sequence model based on a bidirectional gated recurrent unit;
step 3.43, based on the second pooling coefficients, obtaining the final text feature vector from the updated text feature vector, wherein the final text feature vector is:

$$\hat{t}_{i1} = \sum_{k1=1}^{N1} \theta_{k1} \, t_{(k1),i1}, \qquad \sum_{k1=1}^{N1} \theta_{k1} = 1,$$

where $\hat{T}$ denotes the final text feature vector, $\hat{t}_{i1}$ denotes the $i1$-th feature of the final text feature vector, $\theta_{k1}$ denotes the second pooling coefficient, $t_{(k1),i1}$ denotes the $i1$-th feature of the $k1$-th rank-ordered updated text feature vector, and the value of $N1$ is equal to $D_S$.
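A sketch of the generalized pooling of claim 8, reading "triangular position coding" as sinusoidal encoding and following a GPO-style interpretation in which features are rank-ordered per dimension before the coefficient-weighted sum; both readings, and all sizes, are assumptions:

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Sketch of steps 3.31-3.33 (and 3.41-3.43): sinusoidal position codes feed a
    bidirectional GRU that emits one pooling coefficient per position."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.hidden = hidden
        self.gru = nn.GRU(input_size=hidden, hidden_size=hidden,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)

    def position_code(self, n: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)      # position indexes
        i = torch.arange(self.hidden // 2, dtype=torch.float32)
        freq = torch.exp(-i * torch.log(torch.tensor(10000.0)) / (self.hidden // 2))
        angles = pos * freq                                          # (n, hidden/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:              # x: (N, D)
        pe = self.position_code(x.size(0)).unsqueeze(0)              # (1, N, hidden)
        h, _ = self.gru(pe)                                          # BiGRU over positions
        theta = torch.softmax(self.out(h).squeeze(-1), dim=-1)       # coefficients sum to 1
        x_sorted, _ = torch.sort(x, dim=0, descending=True)          # rank per dimension
        return (theta.squeeze(0).unsqueeze(1) * x_sorted).sum(dim=0) # final feature (D,)

v_hat = GeneralizedPooling()(torch.randn(37, 1024))                  # final image feature vector
```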
9. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 1, wherein step 4 comprises:
step 4.1, performing a cosine similarity calculation on the local feature vector and the initial text feature vector to obtain the first similarity;
step 4.2, performing a cosine similarity calculation on the initial image feature vector and the initial text feature vector to obtain the second similarity;
step 4.3, performing a cosine similarity calculation on the final image feature vector and the final text feature vector to obtain the third similarity;
step 4.4, obtaining the comprehensive similarity between the preset image and the preset text as the sum of the first similarity, the second similarity, and the third similarity;
step 4.5, calculating a loss function using the comprehensive similarity, and back-propagating the loss function to update the network parameters, wherein the network parameters are located in the image feature vector extraction part, the text feature vector extraction part, and the combined graph attention and generalized pooling module, respectively.
10. The image-text retrieval method based on hierarchical alignment and a generalized pooling graph attention mechanism according to claim 9, wherein the first similarity is:

$$S_1(V_L, T_S) = \frac{V_L \cdot T_S}{\|V_L\|\,\|T_S\|},$$

where $S_1(V_L, T_S)$ denotes the first similarity, $V_L$ denotes the local feature vector, $T_S$ denotes the initial text feature vector, and $\|\cdot\|$ denotes the modulus of a feature vector;

the second similarity is:

$$S_2(V_U, T_S) = \frac{V_U \cdot T_S}{\|V_U\|\,\|T_S\|},$$

where $S_2(V_U, T_S)$ denotes the second similarity and $V_U$ denotes the initial image feature vector;

the third similarity is:

$$S_3(\hat{V}, \hat{T}) = \frac{\hat{V} \cdot \hat{T}}{\|\hat{V}\|\,\|\hat{T}\|},$$

where $S_3(\hat{V}, \hat{T})$ denotes the third similarity, $\hat{V}$ denotes the final image feature vector, and $\hat{T}$ denotes the final text feature vector;

the comprehensive similarity is:

$$S(I, T) = S_1(V_L, T_S) + S_2(V_U, T_S) + S_3(\hat{V}, \hat{T}),$$

where $S(I, T)$ denotes the comprehensive similarity, $I$ denotes the input image to be matched, and $T$ denotes the input text to be matched;

the loss function is calculated as follows:

$$L = [d + S(I', T) - S(I, T)]_+ + [d + S(I, T') - S(I, T)]_+,$$

where $L$ denotes the loss function, $d$ denotes the margin parameter, $[x]_+ \equiv \max(x, 0)$, with $\equiv$ denoting identity, and $I'$ and $T'$ denote the hardest mismatched counterparts of the matching image-text pair, satisfying $I' = \arg\max_{X \neq I} S(X, T)$ and $T' = \arg\max_{Y \neq T} S(I, Y)$, where $X$ denotes image information that does not match the given text information and $Y$ denotes text information that does not match the given image information.
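For claims 9 and 10, a minimal sketch of the similarity and loss computation; the mean-pooling that reduces the local and initial feature sets to single vectors and the margin value d = 0.2 are assumptions, and the hardest negatives I' and T' would in practice be mined over a training batch:

```python
import torch
import torch.nn.functional as F

def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(a, b, dim=-1)

def comprehensive_similarity(V_L, T_S, V_U, V_hat, T_hat):
    """S(I, T) = S1 + S2 + S3 per claim 10; the mean-pooled reduction of the
    feature sets to single vectors is an assumption."""
    s1 = cosine(V_L.mean(0), T_S.mean(0))   # local features vs. initial text features
    s2 = cosine(V_U.mean(0), T_S.mean(0))   # initial image vs. initial text features
    s3 = cosine(V_hat, T_hat)               # final image vs. final text feature vector
    return s1 + s2 + s3

def hinge_loss(S_pos, S_neg_img, S_neg_txt, d: float = 0.2):
    """L = [d + S(I', T) - S(I, T)]_+ + [d + S(I, T') - S(I, T)]_+, where S_neg_img
    and S_neg_txt are the scores of the hardest mismatched image I' and text T'."""
    zero = torch.tensor(0.0)
    return torch.maximum(d + S_neg_img - S_pos, zero) + \
           torch.maximum(d + S_neg_txt - S_pos, zero)
```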
CN202210504224.0A 2022-05-10 2022-05-10 Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism Active CN114896438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210504224.0A CN114896438B (en) 2022-05-10 2022-05-10 Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism

Publications (2)

Publication Number Publication Date
CN114896438A true CN114896438A (en) 2022-08-12
CN114896438B CN114896438B (en) 2024-06-28

Family

ID=82722248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210504224.0A Active CN114896438B (en) Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism

Country Status (1)

Country Link
CN (1) CN114896438B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 Image-text associated retrieval method based on two-channel network
CN109903314A (en) * 2019-03-13 2019-06-18 腾讯科技(深圳)有限公司 A kind of method, the method for model training and the relevant apparatus of image-region positioning
US20210150373A1 (en) * 2019-11-15 2021-05-20 International Business Machines Corporation Capturing the global structure of logical formulae with graph long short-term memory
US20210303921A1 (en) * 2020-03-30 2021-09-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Cross-modality processing method and apparatus, and computer storage medium
CN114168784A (en) * 2021-12-10 2022-03-11 桂林电子科技大学 Layered supervision cross-modal image-text retrieval method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUO Jie et al.: "HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval", IEEE Transactions on Multimedia, vol. 25, 28 February 2023 (2023-02-28), pages 9189-9202 *
ZHANG Tian; JIN Cong; TIE Yun; LI Xiaobing: "Research on a content matching method for audio databases oriented to cross-modal retrieval", Signal Processing, vol. 36, no. 06, 12 June 2020 (2020-06-12), pages 966-976 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115510193A (en) * 2022-10-10 2022-12-23 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related device
CN115510193B (en) * 2022-10-10 2024-04-16 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related devices
CN115985509A (en) * 2022-12-14 2023-04-18 广东省人民医院 Medical imaging data retrieval system, method, device and storage medium

Also Published As

Publication number Publication date
CN114896438B (en) 2024-06-28

Similar Documents

Publication Publication Date Title
Wu et al. Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval.
CN108733742B (en) Global normalized reader system and method
CN110162593B (en) Search result processing and similarity model training method and device
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111581401B (en) Local citation recommendation system and method based on depth correlation matching
CN109753589A (en) A kind of figure method for visualizing based on figure convolutional network
CN114896438B (en) Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention mechanism
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN111291188B (en) Intelligent information extraction method and system
CN111209415B (en) Image-text cross-modal Hash retrieval method based on mass training
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112015868A (en) Question-answering method based on knowledge graph completion
Diallo et al. Auto-attention mechanism for multi-view deep embedding clustering
Liu et al. Auto-weighted collective matrix factorization with graph dual regularization for multi-view clustering
CN113191357A (en) Multilevel image-text matching method based on graph attention network
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN112084312B (en) Intelligent customer service system constructed based on knowledge graph
Li et al. Multi-view clustering via adversarial view embedding and adaptive view fusion
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN110598022A (en) Image retrieval system and method based on robust deep hash network
CN117494051A (en) Classification processing method, model training method and related device
CN110674293B (en) Text classification method based on semantic migration
Yuan Emotional tendency of online legal course review texts based on SVM algorithm and network data acquisition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant