CN112861882B - Image-text matching method and system based on frequency self-adaption - Google Patents

Image-text matching method and system based on frequency self-adaption

Info

Publication number
CN112861882B
CN112861882B (application CN202110260146.XA)
Authority
CN
China
Prior art keywords
image
text
global
feature
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110260146.XA
Other languages
Chinese (zh)
Other versions
CN112861882A (en)
Inventor
赵晶
秦宥煊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202110260146.XA priority Critical patent/CN112861882B/en
Publication of CN112861882A publication Critical patent/CN112861882A/en
Application granted granted Critical
Publication of CN112861882B publication Critical patent/CN112861882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image-text matching method and system based on frequency adaptation. The method adds context information to image regions and adaptively aggregates low-frequency and high-frequency signals through graph convolution, realizing semantic reasoning among salient object regions. An attention interaction method is then provided: global features are generated through an iteration mechanism, and semantic alignment is achieved step by step as words and image regions are aggregated. Finally, a loss function is used to obtain the final matching result.

Description

Image-text matching method and system based on frequency self-adaption
Technical Field
The invention belongs to the field of image-text matching, and particularly relates to an image-text matching method and system based on frequency self-adaption.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
1. Matching methods: earlier matching methods embed images and texts into a common space for direct comparison, or analyze the visual-textual hierarchy by learning mappings between the modalities to obtain a matching result. With the rapid development of the Internet, users' demands on matching precision keep rising, so attention mechanisms are now widely used in cross-modal matching. For example, a dual attention mechanism collects the similar parts of each modality for similarity measurement, so that the same salient content is found across modalities. On this basis, researchers have improved matching by improving feature extraction; for example, adding the relative position information of entities in an image improves the accuracy of the image representation.
2. Attention mechanism: to focus accurately on the important information in an image or text and filter out irrelevant information, the attention mechanism plays a key role in image-text matching. At present, the bottom-up attention mechanism, which is close to human perception, is used as an image extraction method and has shown strong performance: it captures the salient targets of an image and thereby yields a better matching effect.
3. Semantic reasoning: the purpose of reasoning is to let machine learning analyze the latent relations among targets in a knowledge graph from known conditions, which is a popular research topic. Early inference represented relations between symbols by extrapolation and lacked interpretability. The path ranking algorithm replaces logic rules with abstract relation paths, turning relation reasoning into a supervised learning problem on a graph, and is another method for relation reasoning. Many scholars have since proposed improvements to the path ranking algorithm, greatly improving reasoning accuracy and computational efficiency. In recent years, relational reasoning models based on deep learning have become a research hotspot: researchers combine earlier reasoning methods with deep learning and exploit its memory and reasoning capabilities to find new breakthroughs in natural language processing and visual information processing.
The inventors find that the image-text matching models proposed so far lack fine-grained semantic relations between the different modalities, so it is difficult for them to simulate the matching behavior of people in the real world. Intra-modal associations for complex semantics (e.g., associations between entities and attributes in images) also remain to be improved. For the feature representation of an image, existing methods only focus on the features of individual targets and ignore the associations among multiple targets, which is not conducive to learning an accurate representation of the whole image. The GCN currently in use learns parameters greater than 0 and focuses on aggregating low-frequency signals; under certain conditions this blurs the node representations and fails to give the desired effect when applied to image processing.
Disclosure of Invention
In order to solve the above problems, a first aspect of the present invention provides a frequency-adaptive image-text matching method, which adaptively adds context information to the internal regions of a picture by using the high- and low-frequency signals of the nodes in a graph convolution, and at the same time efficiently aligns the semantics of heterogeneous image and text data by using an iteration mechanism, so as to generate a global feature expression and improve the matching rate.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an image-text matching method based on frequency adaptation, comprising:
acquiring data, wherein the data comprises an image and a text matched with the image;
training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; and calculating a loss function, and optimizing the loss function by using an optimizer.
Further, the initial characterization of the image comprises the following specific calculation steps:
obtaining each regional characteristic of the image through a convolutional neural network;
performing linear transformation on each region characteristic;
and carrying out normalization processing on each region characteristic after linear transformation to obtain the region characteristic after normalization processing of each region, and forming an initial representation of the image.
Further, the initial characterization of the text comprises the following specific calculation steps:
encoding each word in the text using one-hot;
computing an embedded representation of each word;
summarizing context information from both directions;
and obtaining word characteristics with the enhanced context information by adopting an average value mode, and forming an initial representation of the text.
Further, the calculating the image area set with the global context enhancement semantic relation comprises the following specific steps:
constructing an undirected graph for the image;
and (3) adaptively aggregating high-low frequency information of all associated nodes for each node in the undirected graph to obtain nodes subjected to semantic reasoning, and forming an image region set with global context enhanced semantic relations.
Further, the specific steps of obtaining the semantic enhanced image global feature and the semantic enhanced text global feature are as follows:
selecting any one of the image and the text as a query modality, and the other one as another modality;
iterative calculation is carried out by using the attention interaction function to obtain the global features of the query mode and the global features of another mode;
if the image is in a query mode, taking the global feature of the query mode as the semantically enhanced image global feature, and taking the global feature of the other mode as the semantically enhanced text global feature; if the text is in the query mode, the global feature of the query mode is taken as the text global feature with enhanced semantics, and the global feature of the other mode is taken as the image global feature with enhanced semantics.
Further, the loss function is a triplet loss function.
Still further, the attention interaction function has different attention degrees to different segments of the query modality under the guidance of another modality.
In order to solve the above problems, a second aspect of the present invention provides a frequency-adaptive image-text matching system, which adaptively adds context information to the internal regions of a picture by using the high- and low-frequency signals of the nodes in a graph convolution, and at the same time efficiently aligns the semantics of heterogeneous image and text data by using an iteration mechanism, so as to generate a global feature expression and improve the matching rate.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a frequency-adaptive based image-text matching system, comprising:
a data acquisition module configured to: acquiring data, wherein the data comprises an image and a text matched with the image;
a model training module configured to: training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; and calculating a loss function, and optimizing the loss function by using an optimizer.
A third aspect of the invention provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.
A fourth aspect of the invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
The beneficial effects of the invention are as follows:
the invention connects image areas as nodes for complex visual information processing, and establishes connection between a significant area and relevant easily neglected parts thereof by adaptively aggregating high and low frequency information of the nodes.
The invention adopts iterative attention network to dynamically align segment information, achieves interaction of heterogeneous modes between vision and text, and improves matching precision; and efficient semantic alignment of heterogeneous images and text data is achieved by using an iteration mechanism, and global feature expression is generated to improve the matching rate.
According to the generated global features, the invention adopts the triplet loss as an objective function to enable the image-text matching to realize end-to-end optimization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is an image-text matching framework of an embodiment of the present invention;
FIG. 2 is a diagram of an image-text matching model architecture in accordance with an embodiment of the present invention;
FIG. 3 is an image matching text ablation experiment on an MS-COCO 1K dataset according to an embodiment of the invention;
FIG. 4 is an ablation experiment of a text-matched image on an MS-COCO 1K dataset according to an embodiment of the invention;
FIG. 5 is a graph showing the trend of recall values over MS-COCO 1K as a function of the number of iterations in the iterative attention interaction module, in accordance with an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
The embodiment provides an image-text matching method based on frequency self-adaption.
As shown in FIGS. 1-2, the frequency-adaptive image-text matching method first extracts region-level information of the image and word-level information of the text as the initialized feature expressions. Image extraction uses bottom-up attention, which is close to human perception, and text encoding uses the currently advanced bi-directional recurrent neural network GRU. For complex visual information processing, the image regions are connected as nodes, and the salient regions are linked to their related, easily neglected parts by adaptively aggregating the high- and low-frequency information of the nodes. An iterative attention network then dynamically aligns the segment information, achieving interaction between the heterogeneous visual and textual modalities. From the generated global features, the model adopts the triplet loss as the objective function so that image-text matching is optimized end to end. The specific steps are as follows:
S1: a data set is acquired and divided into a training set and a test set, each comprising images and the texts matched with the images; either one of image and text is taken as the query modality and the other as the other modality. For example, the image is taken as the query modality and the text as the other modality, or the text is taken as the query modality and the image as the other modality;
S2: the image-text matching model based on frequency adaptation and iterative attention interaction is trained with the training set;
S3: the data of the query modality is input into the image-text matching model, which retrieves the matching representation in the other modality (an illustrative sketch is given below).
The step S2 of training the image-text matching model based on frequency adaptation and iterative attention interaction with the training set is as follows: first, a state-of-the-art feature representation method is selected for initialization; then, frequency adaptation is introduced into the image-region semantic reasoning; an iterative attention interaction module is proposed, which aligns the heterogeneous features step by step and generates the global semantic expression; finally, the model is optimized for training by the defined objective function. Specifically:
s201, carrying out initial feature representation on images and texts in a training set to obtain initial representation V of the images and initial representation S of the texts:
the initial characterization of the image comprises the following specific calculation steps: obtaining each regional characteristic of the image I through a convolutional neural network; performing linear transformation on each region characteristic to obtain a D-dimensional region characteristic; normalizing each region characteristic after line transformation to obtain a region characteristic v after normalization of each region i The initial characterization of image I is then v= { V 1 ,v 2 ,...,v n },v i ∈R D I=1, 2,..n. Specific:
image extraction uses Fast R-CNN, which is a pre-trained, closely to human realism, capable of representing salient content in an image with region vi, image II being denoted V i Is set of (a)
Figure BDA0002969607150000081
Figure BDA0002969607150000082
Set representation V for image I 0 By convolution neural network we can get the vector f of 2048 dimension after pooling i It represents each regional feature of the image I; f for subsequent operations i Linear transformation is required as in equation (1):
v i =W I f i +b I (1)
wherein W is I And b I Representing the learned parameters, let f i A region feature Vi that becomes a D-dimension; then, each region feature is normalized, and the normalized set v= { V 1 ,v 2 ,...,v n },v i ∈R D Is used as an initial characterization of image I.
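As an illustration of equation (1), the following minimal PyTorch sketch projects the 2048-dimensional region features to D dimensions and normalizes them; the class name RegionEncoder and the dimension values are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionEncoder(nn.Module):
    """Illustrative sketch: project 2048-d region features f_i from a
    pre-trained detector to D-dimensional vectors v_i (equation (1))
    and L2-normalize each region feature."""
    def __init__(self, feat_dim: int = 2048, embed_dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)   # W_I, b_I

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, 2048) pooled detector features
        v = self.fc(regions)                       # v_i = W_I f_i + b_I
        return F.normalize(v, p=2, dim=-1)         # per-region normalization

# usage: V = RegionEncoder()(torch.randn(8, 36, 2048))  ->  (8, 36, 1024)
```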
The initial characterization of the text is calculated as follows: each word in text T is encoded with one-hot; the embedded representation of each word is computed; context information is summarized from both directions; and the context-enhanced word feature $s_j$ is obtained by averaging, so that the initial characterization of text T is $S = [s_j \mid j = 1, \ldots, m,\ s_j \in \mathbb{R}^D]$. Specifically:
A sentence sequence representation is obtained with an Encoder-Decoder architecture. First, each word of a sentence T of m words is encoded as a one-hot vector $w_j$, where $w_j$ is the vector representation of the j-th word. Subsequently, an embedding matrix $W_e$ is learned, and the vector $t_j = W_e w_j$, $j \in [1, m]$, is used as the embedded representation of word $w_j$. To obtain a word-sense-enhanced sentence representation, a bi-directional GRU with a forward GRU and a backward GRU summarizes context information from both directions:
$$\overrightarrow{h_j} = \overrightarrow{\mathrm{GRU}}(t_j), \qquad \overleftarrow{h_j} = \overleftarrow{\mathrm{GRU}}(t_j) \qquad (2)$$
where $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ denote the GRUs of the two directions, into which the words are input in turn. The context-enhanced word feature is then defined by averaging the two directions:
$$s_j = \frac{\overrightarrow{h_j} + \overleftarrow{h_j}}{2}$$
Finally, the enhanced word feature $s_j$ represents each word $w_j$, and $S = [s_j \mid j = 1, \ldots, m,\ s_j \in \mathbb{R}^D]$ is used as the initial characterization of sentence T.
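The word-level encoding described above can be sketched as follows; this is an illustrative PyTorch example, assuming a torch.nn.GRU bi-directional encoder and averaging of the forward and backward states, with the class name TextEncoder and the dimension values chosen for the example.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative sketch of the text branch: word embedding followed by a
    bi-directional GRU; the two directions are averaged to give s_j."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # t_j = W_e w_j
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, m) integer word indices (one-hot encoding is implied by Embedding)
        t = self.embed(tokens)            # (batch, m, embed_dim)
        h, _ = self.gru(t)                # (batch, m, 2 * hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)     # forward / backward hidden states
        return (fwd + bwd) / 2            # s_j, shape (batch, m, hidden_dim)
```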
S202, the image region set V' with global-context-enhanced semantic relations is calculated by the frequency-adaptive region semantic reasoning method. The specific steps are as follows: an undirected graph $G = (V, E)$ is constructed for image I, where the initial characterization $V = \{v_1, v_2, \ldots, v_n\}$ of image I is the node set consisting of all image regions and E is the set of edges; for each node $v_i$ in the undirected graph, the high- and low-frequency information of all associated nodes $v_j$ is adaptively aggregated to obtain the node $v'_i$ after semantic reasoning, and $V' = [v'_i \mid i = 1, \ldots, n,\ v'_i \in \mathbb{R}^D]$ is the set of image regions with global-context-enhanced semantic relations. Specifically:
An undirected graph $G = (V, E)$ is constructed with each image region as a node of the graph, where $V = \{v_1, v_2, \ldots, v_n\}$, $v_i \in \mathbb{R}^D$, is the node set consisting of all image regions and E is the set of edges. A modified Graph Convolutional Network (GCN) learns a frequency-adaptive parameter $W_{ij}$ ($W_{ij} \in [-1, 1]$) that represents the proportion of high- and low-frequency information between adjacent nodes; in fact, the low-frequency signal corresponds to the sum of a node's feature and its neighbor's feature, while the high-frequency signal corresponds to their difference, and low-frequency and high-frequency coefficients are associated with node i and its neighbor node j. Through equation (3), a coefficient $W_{ij}$ with a value in $[-1, 1]$ is learned:
$$W_{ij} = \tanh\!\left(g^{T}\,[\,v_i \,\|\, v_j\,]\right) \qquad (3)$$
where $\|$ is the concatenation operation on the nodes, $g^{T}$ can be regarded as a shared convolution kernel used for the mapping, $v_j$ denotes a neighbor node of node $v_i$, and $v_i$ is the normalized region feature; the hyperbolic tangent is used so that the value of $W_{ij}$ is limited to $[-1, 1]$. In this way, $W_{ij}$ adaptively learns the high-to-low-frequency ratio between each node and its neighboring nodes. Subsequently, for each node $v_i$, the high- and low-frequency information of its adjacent nodes is aggregated; in this process, node $v_i$ infers an enhanced node $v'_i$ by adding the information of all associated nodes. This is achieved by:
$$v_i^{(l)} = \phi\!\left(\varepsilon\, v_i + \frac{1}{n-1}\sum_{j \neq i} W_{ij}\, v_j^{(l-1)}\right) \qquad (4)$$
$$v'_i = v_i^{(L)} \qquad (5)$$
where $\phi$ is the activation function, $l$ ($l \in [1, 5]$) is the number of graph convolution layers and denotes the number of node aggregations, $v_i^{(l)}$ is the output of node $v_i$ at layer $l$, and $v'_i$ is the output of node $v_i$ at the last layer $L$; $\varepsilon$ is a hyper-parameter in the range $[0, 1]$, and $\varepsilon = 0.3$ in our experiments; to prevent the aggregated content from becoming too large, $n-1$ is introduced as a regularization in the aggregation. The output of the last layer, $v'_i \in \mathbb{R}^D$, is the semantic reasoning node obtained by aggregating the high- and low-frequency information, and $V' = [v'_i \mid i = 1, \ldots, n,\ v'_i \in \mathbb{R}^D]$ is used as the set of image regions with global-context-enhanced semantic relations.
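A minimal sketch of the frequency-adaptive aggregation of equations (3)-(5) is given below, assuming a fully connected region graph, tanh as the activation $\phi$, and inclusion of the self term in the aggregation for simplicity; the class name and dimension values are illustrative.

```python
import torch
import torch.nn as nn

class FrequencyAdaptiveReasoning(nn.Module):
    """Sketch of frequency-adaptive region reasoning: a shared vector g scores
    each node pair with tanh, giving a coefficient W_ij in [-1, 1] that mixes
    low- and high-frequency information, aggregated over several layers."""
    def __init__(self, dim: int = 1024, num_layers: int = 3, eps: float = 0.3):
        super().__init__()
        self.g = nn.Linear(2 * dim, 1, bias=False)   # shared "convolution kernel" g
        self.num_layers = num_layers
        self.eps = eps

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, n, dim) normalized region features; fully connected graph assumed
        b, n, d = v.shape
        pairs = torch.cat([v.unsqueeze(2).expand(b, n, n, d),
                           v.unsqueeze(1).expand(b, n, n, d)], dim=-1)
        w = torch.tanh(self.g(pairs)).squeeze(-1)    # W_ij in [-1, 1], shape (batch, n, n)
        h = v
        for _ in range(self.num_layers):
            agg = torch.bmm(w, h) / (n - 1)          # 1/(n-1) regularized aggregation (self term kept)
            h = torch.tanh(self.eps * v + agg)       # phi = tanh assumed; eps * v_i residual
        return h                                     # v'_i: context-enhanced region features
```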
S203, the image region set V' and the initial characterization S of the text are input into the iterative attention interaction layer to obtain the semantically enhanced image global feature $V^{*}$ and the semantically enhanced text global feature $S^{*}$ generated by the iterative attention interaction layer. The specific steps are as follows:
the image region set V' is taken as the segment feature set of the image, and the initial characterization S of the text is taken as the segment feature set of the text;
either one of the image and the text is taken as the query modality X and the other as the other modality Y; the input Q is the segment-level feature set of the query modality, and the input P is the segment feature set of the other modality; $p_0$ is set equal to Y, and the number of iterations t is initialized;
with $p_{t-1}$ as the prior guidance, the global feature $q_t$ of Q after one semantic alignment is calculated with the attention interaction function; in standardized form this is defined as:
$$q_t = A(Q, p_{t-1})$$
with $q_t$ as the prior guidance, the global feature representation $p_t$ of P after one semantic alignment is calculated with the attention interaction function; in standardized form this is defined as:
$$p_t = A(P, q_t)$$
The process of generating $q_t$ and $p_t$ constitutes one iteration, and T iterations are carried out to obtain $q_T$ and $p_T$. If the image is the query modality, $q_T$ is the semantically enhanced image global feature generated by the iterative attention interaction layer, and $p_T$ is the semantically enhanced text global feature; if the text is the query modality, $q_T$ is the semantically enhanced text global feature, and $p_T$ is the semantically enhanced image global feature.
The attention interaction function $Z = A(X, Y)$ is specifically defined as follows:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k \qquad (6)$$
where $U_X, U_Y \in \mathbb{R}^{D \times k}$ and $b_a, u_a \in \mathbb{R}^{D}$ are learned parameters of the attention interaction function A(); $\mathbf{1}$ is a vector whose elements are all 1; $a_k$ denotes the degree of attention paid to the k-th segment feature $X_k$ under the guidance of Y; Z is the global feature of X after one semantic alignment using Y; and X, Y denote the feature sets of the two input modalities.
Specifically: the attention interaction module is defined as $Z = A(X, Y)$, where the input X is the segment-level feature set $X = [X_k \mid k = 1, \ldots, K,\ X_k \in \mathbb{R}^D]$ of the query modality; when X represents the image region set $V' = [v'_i \mid i = 1, \ldots, n,\ v'_i \in \mathbb{R}^D]$, the number of segment-level features is $K = n$; when X represents the text word set $S = [s_j \mid j = 1, \ldots, m,\ s_j \in \mathbb{R}^D]$, $K = m$. The input Y is the other modality in the cross-modal matching and represents the global representation of the modality opposite to X; Y is used as the attention guidance of the attention interaction module and is initialized by average pooling. For example, when X is the image region set, Y is the pooled sentence-level global semantic vector at initialization, and the output Z is the global semantic representation of X after one semantic alignment. In practice, the attention interaction function A() is defined as:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k \qquad (6)$$
where $U_X, U_Y \in \mathbb{R}^{D \times k}$ and $b_a, u_a \in \mathbb{R}^{D}$ are learned parameters of the attention interaction function A(), $\mathbf{1}$ is a vector whose elements are all 1, and $a$ is the attention weight of Z. When X represents the image region set, $a_k$ can be regarded as the image attention weight, i.e., the degree of attention paid to the k-th image region $X_k$ under the guidance of the whole sentence; Z is the global semantic representation of X after one semantic alignment with Y.
When X represents the image region set, the word-level features are first initialized and a sentence-level feature vector is generated by average pooling as the representation of Y, with $p_0$ equal to Y; when X represents the word-level features $s_j$, Y is the image-level feature vector.
In fact, the text-to-image and image-to-text matching models are symmetric. Taking text matching an image as an example, with $p_0$ as the prior guidance, a picture-level global feature is generated by attention weighting over V', denoted $q_1$, $q_1 \in \mathbb{R}^D$; subsequently, with $q_1$ as the prior guidance, the updated text global-level feature $p_1$, $p_1 \in \mathbb{R}^D$, is generated by attention weighting over S. Generating $q_1$ and $p_1$ constitutes one iteration, and T iterations are carried out in total. The standardized definition of this process is:
$$q_t = A(V', p_{t-1}), \qquad p_t = A(S, q_t) \qquad (7)$$
where t is the t-th iteration, $q_t$ and $p_t$ are the global semantic representations of the image and the text after semantic alignment, and V' and S denote the sets of semantically enhanced region features and word features, respectively. Therefore, after t iterations, the image global semantic representation (image level) focuses more on the specific region content related to the sentence description, and the text global semantic representation (sentence level) focuses more on the specific words related to the image description.
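The attention interaction function A() and the iterative alignment of equation (7) can be sketched as follows; this is an illustrative PyTorch example in which the two directions use separate (unshared) attention modules, an assumption not fixed by the description above, and the dimension values are chosen for the example.

```python
import torch
import torch.nn as nn

class AttentionInteraction(nn.Module):
    """Sketch of the attention interaction function Z = A(X, Y) of equation (6):
    the global vector Y of one modality guides attention weights over the K
    segment features of the other modality X."""
    def __init__(self, dim: int = 1024, att_dim: int = 512):
        super().__init__()
        self.U_x = nn.Linear(dim, att_dim, bias=False)   # U_X
        self.U_y = nn.Linear(dim, att_dim, bias=False)   # U_Y
        self.b_a = nn.Parameter(torch.zeros(att_dim))    # b_a
        self.u_a = nn.Linear(att_dim, 1, bias=False)     # u_a

    def forward(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        # X: (batch, K, dim) segment features; Y: (batch, dim) guiding global vector
        H = torch.tanh(self.U_x(X) + (self.U_y(Y) + self.b_a).unsqueeze(1))
        a = torch.softmax(self.u_a(H).squeeze(-1), dim=1)   # attention weights a_k
        return (a.unsqueeze(-1) * X).sum(dim=1)             # Z = sum_k a_k X_k


def iterative_alignment(att_img, att_txt, V_prime, S, num_iters: int = 3):
    """Sketch of the iterative alignment of equation (7): p_0 is the average-pooled
    sentence feature; each step refines one modality's global vector under the
    guidance of the other's."""
    p = S.mean(dim=1)              # p_0: sentence-level feature from average pooling
    for _ in range(num_iters):
        q = att_img(V_prime, p)    # q_t = A(V', p_{t-1}): image global feature
        p = att_txt(S, q)          # p_t = A(S, q_t): text global feature
    return q, p                    # (V*, S*) = (q_T, p_T)
```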
S204, the loss function is calculated and optimized with an optimizer:
The text and the image are each represented by D-dimensional features in the embedding space. The triplet loss is used as the loss function; instead of attending to all negatives in training as before, the negative samples of interest are taken from within the mini-batch. The loss function is expressed as:
$$L = \left[\alpha - Q(V^{*}, S^{*}) + Q(V^{*}, \hat{S})\right]_{+} + \left[\alpha - Q(V^{*}, S^{*}) + Q(\hat{V}, S^{*})\right]_{+}$$
where $\alpha$ is the margin parameter of the loss; $[\,\cdot\,]_{+}$ means that when the enclosed value is greater than zero it is taken as the loss, and when it is less than zero the loss is zero; Q() is a function realized by the inner product and computes semantic similarity; $V^{*}$ denotes the semantically enhanced image global feature generated by the iterative attention interaction module ($V^{*} = q_T$), i.e., the image global semantic representation $q_T$ obtained after T iterations is taken as $V^{*}$; $S^{*}$ denotes the semantically enhanced text global feature generated by the iterative attention interaction module ($S^{*} = p_T$), i.e., the text global semantic representation $p_T$ obtained after T iterations is taken as $S^{*}$; $\hat{V}$ and $\hat{S}$ denote negative samples in the mini-batch: when the loss is calculated, a small batch of image-text pairs is treated as the mini-batch, in which a paired image and text form a positive sample and unpaired ones are negative samples. To enable the matching model to achieve fine-grained semantic alignment at each iteration, an optimizer is used to optimize the objective function so that the image-text matching model is optimized end to end.
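A minimal sketch of this mini-batch triplet loss is given below, assuming the hardest in-batch negative is used for each anchor and an illustrative margin value; Q() is the inner product as stated above.

```python
import torch

def triplet_loss(V_star, S_star, margin: float = 0.2):
    """Sketch of the mini-batch triplet loss: Q(.,.) is the inner product, and
    negatives are the non-matching pairs inside the same mini-batch (here the
    hardest in-batch negative is selected; the margin value is an assumption)."""
    scores = V_star @ S_star.t()                       # (B, B) similarity matrix Q(V_i, S_j)
    pos = scores.diag().unsqueeze(1)                   # Q(V*, S*) for the matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = (margin - pos + scores).clamp(min=0).masked_fill(mask, 0)       # image anchor, negative texts
    cost_v = (margin - pos.t() + scores).clamp(min=0).masked_fill(mask, 0)   # text anchor, negative images
    return cost_s.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
```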
We compare our model with the current state-of-the-art models using the 1K (1000-image) and 5K (5000-image) test splits of MS-COCO as test data. The results show that our model is highly competitive with the other models. The ablation experiments (shown in FIGS. 3-4) show that the model with the frequency-adaptive semantic reasoning module and the iterative attention interaction module improves greatly over the baseline model, which directly indicates that the two proposed modules significantly improve matching performance.
In FIGS. 3 and 4, ablation experiments are performed on MS-COCO 1K. Baseline denotes the baseline model; Baseline+FA replaces the average pooling of image regions with the frequency-adaptive region semantic reasoning module; Baseline+IAM adds the iterative attention interaction module to the baseline model; FA-IATI is the complete cross-modal matching model we propose. The tests cover image-to-text matching and text-to-image matching. R@K (K = 1, 5, 10) in FIGS. 3 and 4 denotes recall, an evaluation metric of the matching model: the proportion of queries for which the correct item appears among the K results closest to the query.
FIG. 5 analyzes the number of iterations of the iterative attention interaction module on MS-COCO 1K. The experiment contains the Recall@1 results for image queries and text queries; Recall@1 is the proportion of queries for which the correct item is the single closest result to the query.
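For reference, Recall@K as used in FIGS. 3-5 can be computed as in the following illustrative sketch, where ranked_lists holds the retrieved indices per query and ground_truth the index of the correct item; the function name is an assumption made for the example.

```python
def recall_at_k(ranked_lists, ground_truth, k: int) -> float:
    """Fraction of queries whose correct item appears among the K results closest to the query."""
    hits = sum(1 for ranks, gt in zip(ranked_lists, ground_truth) if gt in ranks[:k])
    return hits / len(ground_truth)

# e.g. recall_at_k([[2, 0, 7], [1, 5, 3]], [0, 3], k=3) -> 1.0
```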
Example 2
The embodiment provides an image-text matching system based on frequency adaptation, which comprises:
a data acquisition module configured to: acquiring data, wherein the data comprises an image and a text matched with the image;
a model training module configured to: training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; and calculating a loss function, and optimizing the loss function by using an optimizer.
Example 3
The present embodiment also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of embodiment 1.
Example 4
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the method of embodiment 1.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. An image-text matching method based on frequency adaptation, comprising: acquiring data, wherein the data comprises an image and a text matched with the image;
training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; calculating a loss function, and optimizing the loss function by using an optimizer;
the specific steps of obtaining the semantic enhanced image global feature and the semantic enhanced text global feature are as follows: selecting any one of the image and the text as a query modality, and the other one as another modality; iterative calculation is carried out by using the attention interaction function to obtain the global features of the query mode and the global features of another mode; if the image is in a query mode, taking the global feature of the query mode as the semantically enhanced image global feature, and taking the global feature of the other mode as the semantically enhanced text global feature; if the text is in the query mode, taking the global feature of the query mode as the semantically enhanced text global feature, and taking the global feature of the other mode as the semantically enhanced image global feature;
wherein the attention interaction function $Z = A(X, Y)$ is specifically defined as follows:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k$$
wherein X, Y denote the feature sets of the two input modalities, $U_X$, $U_Y$, $b_a$, $u_a$ are parameters of the attention interaction function, $\mathbf{1}$ is a vector whose elements are all 1, and $a_k$ denotes the degree of attention paid to the k-th segment feature $X_k$ under the guidance of Y;
the calculation of the image region set with global-context-enhanced semantic relations specifically comprises: constructing an undirected graph for the image; for each node in the undirected graph, adaptively aggregating the high- and low-frequency information of all associated nodes to obtain the nodes after semantic reasoning, which form the image region set with global-context-enhanced semantic relations; wherein $W_{ij}$ adaptively learns the high-to-low-frequency ratio between each node and its neighboring nodes; for each node $v_i$, the high- and low-frequency information of its adjacent nodes is aggregated, and in this process node $v_i$ infers the enhanced node $v'_i$ by adding the information of all associated nodes:
$$v_i^{(l)} = \phi\!\left(\varepsilon\, v_i + \frac{1}{n-1}\sum_{j \neq i} W_{ij}\, v_j^{(l-1)}\right)$$
$$v'_i = v_i^{(L)}$$
wherein $\phi$ is the activation function, $l$ is the number of graph convolution layers, $v_i^{(l)}$ denotes the output of node $v_i$ at layer $l$, $v'_i$ is the output of node $v_i$ at the last layer, $\varepsilon$ is a hyper-parameter, $W_{ij} = \tanh(g^{T}[v_i \,\|\, v_j])$, $\|$ is the concatenation operation on the nodes, $g^{T}$ is a shared convolution kernel used for the mapping, $v_j$ denotes a neighbor node of node $v_i$, and $v_i$ is the normalized region feature.
2. The image-text matching method based on frequency adaptation according to claim 1, wherein the initial characterization of the image comprises the following specific calculation steps:
obtaining each regional characteristic of the image through a convolutional neural network;
performing linear transformation on each region characteristic;
and carrying out normalization processing on each region characteristic after linear transformation to obtain the region characteristic after normalization processing of each region, and forming an initial representation of the image.
3. The frequency-adaptive image-text matching method as claimed in claim 1, wherein the initial representation of the text comprises the following steps:
encoding each word in the text using one-hot;
computing an embedded representation of each word;
summarizing context information from both directions;
and obtaining word characteristics with the enhanced context information by adopting an average value mode, and forming an initial representation of the text.
4. A frequency-adaptive image-text matching method as recited in claim 1, wherein the loss function is a triplet loss function.
5. A frequency-adaptive image-text matching method as claimed in claim 1, wherein the attention-interaction function is directed by another modality to have different degrees of attention to different segments of the query modality.
6. A frequency-adaptive based image-text matching system, comprising: a data acquisition module configured to: acquiring data, wherein the data comprises an image and a text matched with the image;
a model training module configured to: training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; calculating a loss function, and optimizing the loss function by using an optimizer;
the specific steps of obtaining the semantic enhanced image global feature and the semantic enhanced text global feature are as follows: selecting any one of the image and the text as a query modality, and the other one as another modality; iterative calculation is carried out by using the attention interaction function to obtain the global features of the query mode and the global features of another mode; if the image is in a query mode, taking the global feature of the query mode as the semantically enhanced image global feature, and taking the global feature of the other mode as the semantically enhanced text global feature; if the text is in the query mode, taking the global feature of the query mode as the semantically enhanced text global feature, and taking the global feature of the other mode as the semantically enhanced image global feature;
wherein the attention interaction function $Z = A(X, Y)$ is specifically defined as follows:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k$$
wherein X, Y denote the feature sets of the two input modalities, $U_X$, $U_Y$, $b_a$, $u_a$ are parameters of the attention interaction function, $\mathbf{1}$ is a vector whose elements are all 1, and $a_k$ denotes the degree of attention paid to the k-th segment feature $X_k$ under the guidance of Y;
the calculation of the image region set with global-context-enhanced semantic relations specifically comprises: constructing an undirected graph for the image; for each node in the undirected graph, adaptively aggregating the high- and low-frequency information of all associated nodes to obtain the nodes after semantic reasoning, which form the image region set with global-context-enhanced semantic relations; wherein $W_{ij}$ adaptively learns the high-to-low-frequency ratio between each node and its neighboring nodes; for each node $v_i$, the high- and low-frequency information of its adjacent nodes is aggregated, and in this process node $v_i$ infers the enhanced node $v'_i$ by adding the information of all associated nodes:
$$v_i^{(l)} = \phi\!\left(\varepsilon\, v_i + \frac{1}{n-1}\sum_{j \neq i} W_{ij}\, v_j^{(l-1)}\right)$$
$$v'_i = v_i^{(L)}$$
wherein $\phi$ is the activation function, $l$ is the number of graph convolution layers, $v_i^{(l)}$ denotes the output of node $v_i$ at layer $l$, $v'_i$ is the output of node $v_i$ at the last layer, $\varepsilon$ is a hyper-parameter, $W_{ij} = \tanh(g^{T}[v_i \,\|\, v_j])$, $\|$ is the concatenation operation on the nodes, $g^{T}$ is a shared convolution kernel used for the mapping, $v_j$ denotes a neighbor node of node $v_i$, and $v_i$ is the normalized region feature.
7. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of any of claims 1-5.
8. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any of claims 1-5.
CN202110260146.XA 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption Active CN112861882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110260146.XA CN112861882B (en) 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110260146.XA CN112861882B (en) 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption

Publications (2)

Publication Number Publication Date
CN112861882A CN112861882A (en) 2021-05-28
CN112861882B (en) 2023-05-09

Family

ID=75993861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110260146.XA Active CN112861882B (en) 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption

Country Status (1)

Country Link
CN (1) CN112861882B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419351A (en) * 2022-01-28 2022-04-29 深圳市腾讯计算机系统有限公司 Image-text pre-training model training method and device and image-text prediction model training method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN108960330A (en) * 2018-07-09 2018-12-07 西安电子科技大学 Remote sensing images semanteme generation method based on fast area convolutional neural networks
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147457B (en) * 2019-02-28 2023-07-25 腾讯科技(深圳)有限公司 Image-text matching method, device, storage medium and equipment
CN109933802B (en) * 2019-03-25 2023-05-26 腾讯科技(深圳)有限公司 Image-text matching method, image-text matching device and storage medium
CN111026894B (en) * 2019-12-12 2021-11-26 清华大学 Cross-modal image text retrieval method based on credibility self-adaptive matching network
CN111428801B (en) * 2020-03-30 2022-09-27 新疆大学 Image-text matching method for improving alternate updating of fusion layer and loss function
CN111737458B (en) * 2020-05-21 2024-05-21 深圳赛安特技术服务有限公司 Attention mechanism-based intention recognition method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN108960330A (en) * 2018-07-09 2018-12-07 西安电子科技大学 Remote sensing images semanteme generation method based on fast area convolutional neural networks
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Also Published As

Publication number Publication date
CN112861882A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
Liu et al. CNN-enhanced graph convolutional network with pixel-and superpixel-level feature fusion for hyperspectral image classification
Liu et al. Connecting image denoising and high-level vision tasks via deep learning
CN111061856B (en) Knowledge perception-based news recommendation method
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
Peng et al. Research on image feature extraction and retrieval algorithms based on convolutional neural network
CN111753116B (en) Image retrieval method, device, equipment and readable storage medium
CN114048331A (en) Knowledge graph recommendation method and system based on improved KGAT model
CN112861936B (en) Graph node classification method and device based on graph neural network knowledge distillation
Zhai et al. One-shot object affordance detection in the wild
WO2022042043A1 (en) Machine learning model training method and apparatus, and electronic device
Gao et al. Self-attention driven adversarial similarity learning network
Chen et al. Multi-SVM based Dempster–Shafer theory for gesture intention understanding using sparse coding feature
Zhang et al. Dual-constrained deep semi-supervised coupled factorization network with enriched prior
Cai et al. A robust interclass and intraclass loss function for deep learning based tongue segmentation
Meng et al. Few-shot image classification algorithm based on attention mechanism and weight fusion
Ning et al. Conditional generative adversarial networks based on the principle of homologycontinuity for face aging
Li et al. Robustness comparison between the capsule network and the convolutional network for facial expression recognition
Yuan et al. Modeling spatial layout for scene image understanding via a novel multiscale sum-product network
Liao et al. FERGCN: facial expression recognition based on graph convolution network
CN112861882B (en) Image-text matching method and system based on frequency self-adaption
Miao et al. Research on visual question answering based on GAT relational reasoning
Liu et al. Image feature selection embedded distribution differences between classes for convolutional neural network
CN117972138A (en) Training method and device for pre-training model and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant