CN112861882B - Image-text matching method and system based on frequency self-adaption - Google Patents

Image-text matching method and system based on frequency self-adaption

Info

Publication number
CN112861882B
CN112861882B (application CN202110260146.XA)
Authority
CN
China
Prior art keywords
image
text
global
feature
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110260146.XA
Other languages
Chinese (zh)
Other versions
CN112861882A (en)
Inventor
赵晶
秦宥煊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202110260146.XA priority Critical patent/CN112861882B/en
Publication of CN112861882A publication Critical patent/CN112861882A/en
Application granted granted Critical
Publication of CN112861882B publication Critical patent/CN112861882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image-text matching method and system based on frequency adaptation. The method adds context information to image regions and adaptively aggregates low-frequency and high-frequency signals through graph convolution, realizing semantic reasoning among salient object regions. An attention interaction method is then provided: global features are generated through an iteration mechanism, and semantic alignment is achieved step by step as words and image regions are aggregated. Finally, a loss function is used to obtain the final matching result.

Description

Image-text matching method and system based on frequency self-adaption
Technical Field
The invention belongs to the field of image-text matching, and particularly relates to an image-text matching method and system based on frequency self-adaption.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
1. Matching methods: earlier matching methods embed images and texts into a common space for direct comparison, or analyze the visual-textual hierarchy by learning mappings between the modalities to obtain a matching result. With the rapid development of the Internet, users' demands on matching precision keep rising, so attention mechanisms are now widely used in cross-modal matching. For example, a dual attention mechanism collects the similar parts of each modality for similarity measurement, so that the same salient content is found across modalities. On this basis, researchers have improved matching by improving feature extraction; for example, adding the relative position information of entities in an image improves the accuracy of the image representation.
2. Attention mechanism: to focus accurately on the important information in an image or text and filter out irrelevant information, the attention mechanism plays a key role in image-text matching. At present, the bottom-up attention mechanism, which is close to human perception, is used as an image extraction method and has shown strong performance: it captures the salient targets of an image and thereby yields a better matching effect.
3. Semantic reasoning: the purpose of reasoning is to let machine learning analyze the latent relations among targets in a knowledge graph from known conditions, which is a popular research topic. Early inference represented relations between symbols by extrapolation and lacked interpretability. The path ranking algorithm replaces logic rules with abstract relation paths, turning relation reasoning into a supervised learning problem on a graph, and is another method for relation reasoning. Many scholars have since proposed improvements to the path ranking algorithm, greatly improving reasoning accuracy and computational efficiency. In recent years, relational reasoning models based on deep learning have become a research hotspot: researchers combine earlier reasoning methods with deep learning and exploit its memory and reasoning capabilities to find new breakthroughs in natural language processing and visual information processing.
The inventors find that the image-text matching models proposed so far lack fine-grained semantic relations between the different modalities, so it is difficult for them to simulate the matching behavior of people in the real world. Intra-modal associations for complex semantics (e.g., associations between entities and attributes in images) also remain to be improved. For the feature representation of an image, existing methods only focus on the features of individual targets and ignore the associations among multiple targets, which is not conducive to learning an accurate representation of the whole image. The GCN currently in use learns parameters greater than 0 and focuses on aggregating low-frequency signals; under certain conditions this blurs the node representations and fails to give the desired effect when applied to image processing.
Disclosure of Invention
In order to solve the above problems, a first aspect of the present invention provides a frequency-adaptive image-text matching method, which adaptively adds context information to the internal regions of a picture by using the high- and low-frequency signals of the nodes in a graph convolution, and at the same time efficiently aligns the semantics of heterogeneous image and text data by using an iteration mechanism, so as to generate a global feature expression and improve the matching rate.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an image-text matching method based on frequency adaptation, comprising:
acquiring data, wherein the data comprises an image and a text matched with the image;
training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; and calculating a loss function, and optimizing the loss function by using an optimizer.
Further, the initial characterization of the image comprises the following specific calculation steps:
obtaining each regional characteristic of the image through a convolutional neural network;
performing linear transformation on each region characteristic;
and carrying out normalization processing on each region characteristic after linear transformation to obtain the region characteristic after normalization processing of each region, and forming an initial representation of the image.
Further, the initial characterization of the text comprises the following specific calculation steps:
encoding each word in the text using one-hot;
computing an embedded representation of each word;
summarizing context information from both directions;
and obtaining word characteristics with the enhanced context information by adopting an average value mode, and forming an initial representation of the text.
Further, the calculating the image area set with the global context enhancement semantic relation comprises the following specific steps:
constructing an undirected graph for the image;
and (3) adaptively aggregating high-low frequency information of all associated nodes for each node in the undirected graph to obtain nodes subjected to semantic reasoning, and forming an image region set with global context enhanced semantic relations.
Further, the specific steps of obtaining the semantic enhanced image global feature and the semantic enhanced text global feature are as follows:
selecting any one of the image and the text as a query modality, and the other one as another modality;
iterative calculation is carried out by using the attention interaction function to obtain the global features of the query mode and the global features of another mode;
if the image is in a query mode, taking the global feature of the query mode as the semantically enhanced image global feature, and taking the global feature of the other mode as the semantically enhanced text global feature; if the text is in the query mode, the global feature of the query mode is taken as the text global feature with enhanced semantics, and the global feature of the other mode is taken as the image global feature with enhanced semantics.
Further, the loss function is a triplet loss function.
Still further, the attention interaction function has different attention degrees to different segments of the query modality under the guidance of another modality.
In order to solve the above problems, a second aspect of the present invention provides a frequency-adaptive image-text matching system, which adaptively adds context information to the internal regions of a picture by using the high- and low-frequency signals of the nodes in a graph convolution, and at the same time efficiently aligns the semantics of heterogeneous image and text data by using an iteration mechanism, so as to generate a global feature expression and improve the matching rate.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a frequency-adaptive based image-text matching system, comprising:
a data acquisition module configured to: acquiring data, wherein the data comprises an image and a text matched with the image;
a model training module configured to: training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; and calculating a loss function, and optimizing the loss function by using an optimizer.
A third aspect of the invention provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.
A fourth aspect of the invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
The beneficial effects of the invention are as follows:
the invention connects image areas as nodes for complex visual information processing, and establishes connection between a significant area and relevant easily neglected parts thereof by adaptively aggregating high and low frequency information of the nodes.
The invention adopts iterative attention network to dynamically align segment information, achieves interaction of heterogeneous modes between vision and text, and improves matching precision; and efficient semantic alignment of heterogeneous images and text data is achieved by using an iteration mechanism, and global feature expression is generated to improve the matching rate.
According to the generated global features, the invention adopts the triplet loss as an objective function to enable the image-text matching to realize end-to-end optimization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is an image-text matching framework of an embodiment of the present invention;
FIG. 2 is a diagram of an image-text matching model architecture in accordance with an embodiment of the present invention;
FIG. 3 is an image matching text ablation experiment on an MS-COCO 1K dataset according to an embodiment of the invention;
FIG. 4 is an ablation experiment of a text-matched image on an MS-COCO 1K dataset according to an embodiment of the invention;
FIG. 5 is a graph showing the trend of recall values over MS-COCO 1K as a function of the number of iterations in the iterative attention interaction module, in accordance with an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
The embodiment provides an image-text matching method based on frequency self-adaption.
As shown in FIGS. 1-2, the frequency-adaptive image-text matching method first extracts region-level information of the image and word-level information of the text as the initialized feature expressions. Image extraction uses bottom-up attention, which is close to human perception, and text encoding uses the currently advanced bi-directional recurrent neural network GRU. For complex visual information processing, the image regions are connected as nodes, and the salient regions are linked to their related, easily neglected parts by adaptively aggregating the high- and low-frequency information of the nodes. An iterative attention network then dynamically aligns the segment information, achieving interaction between the heterogeneous visual and textual modalities. From the generated global features, the model adopts the triplet loss as the objective function so that image-text matching is optimized end to end. The specific steps are as follows:
S1: a data set is acquired and divided into a training set and a test set, each comprising images and the texts matched with the images; either one of image and text is taken as the query modality and the other as the other modality. For example, the image is taken as the query modality and the text as the other modality, or the text is taken as the query modality and the image as the other modality;
S2: the image-text matching model based on frequency adaptation and iterative attention interaction is trained with the training set;
S3: the data of the query modality is input into the image-text matching model, which retrieves the matching representation in the other modality (an illustrative sketch is given below).
The step S2 of training the image-text matching model based on frequency adaptation and iterative attention interaction with the training set is as follows: first, a state-of-the-art feature representation method is selected for initialization; then, frequency adaptation is introduced into the image-region semantic reasoning; an iterative attention interaction module is proposed, which aligns the heterogeneous features step by step and generates the global semantic expression; finally, the model is optimized for training by the defined objective function. Specifically:
s201, carrying out initial feature representation on images and texts in a training set to obtain initial representation V of the images and initial representation S of the texts:
the initial characterization of the image comprises the following specific calculation steps: obtaining each regional characteristic of the image I through a convolutional neural network; performing linear transformation on each region characteristic to obtain a D-dimensional region characteristic; normalizing each region characteristic after line transformation to obtain a region characteristic v after normalization of each region i The initial characterization of image I is then v= { V 1 ,v 2 ,...,v n },v i ∈R D I=1, 2,..n. Specific:
image extraction uses Fast R-CNN, which is a pre-trained, closely to human realism, capable of representing salient content in an image with region vi, image II being denoted V i Is set of (a)
Figure BDA0002969607150000081
Figure BDA0002969607150000082
Set representation V for image I 0 By convolution neural network we can get the vector f of 2048 dimension after pooling i It represents each regional feature of the image I; f for subsequent operations i Linear transformation is required as in equation (1):
v i =W I f i +b I (1)
wherein W is I And b I Representing the learned parameters, let f i A region feature Vi that becomes a D-dimension; then, each region feature is normalized, and the normalized set v= { V 1 ,v 2 ,...,v n },v i ∈R D Is used as an initial characterization of image I.
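As an illustration of equation (1), the following minimal PyTorch sketch projects the 2048-dimensional region features to D dimensions and normalizes them; the class name RegionEncoder and the dimension values are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionEncoder(nn.Module):
    """Illustrative sketch: project 2048-d region features f_i from a
    pre-trained detector to D-dimensional vectors v_i (equation (1))
    and L2-normalize each region feature."""
    def __init__(self, feat_dim: int = 2048, embed_dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)   # W_I, b_I

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, 2048) pooled detector features
        v = self.fc(regions)                       # v_i = W_I f_i + b_I
        return F.normalize(v, p=2, dim=-1)         # per-region normalization

# usage: V = RegionEncoder()(torch.randn(8, 36, 2048))  ->  (8, 36, 1024)
```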
The initial characterization of the text is calculated as follows: each word in text T is encoded with one-hot; the embedded representation of each word is computed; context information is summarized from both directions; and the context-enhanced word feature $s_j$ is obtained by averaging, so that the initial characterization of text T is $S = [s_j \mid j = 1, \ldots, m,\ s_j \in \mathbb{R}^D]$. Specifically:
A sentence sequence representation is obtained with an Encoder-Decoder architecture. First, each word of a sentence T of m words is encoded as a one-hot vector $w_j$, where $w_j$ is the vector representation of the j-th word. Subsequently, an embedding matrix $W_e$ is learned, and the vector $t_j = W_e w_j$, $j \in [1, m]$, is used as the embedded representation of word $w_j$. To obtain a word-sense-enhanced sentence representation, a bi-directional GRU with a forward GRU and a backward GRU summarizes context information from both directions:
$$\overrightarrow{h_j} = \overrightarrow{\mathrm{GRU}}(t_j), \qquad \overleftarrow{h_j} = \overleftarrow{\mathrm{GRU}}(t_j) \qquad (2)$$
where $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ denote the GRUs of the two directions, into which the words are input in turn. The context-enhanced word feature is then defined by averaging the two directions:
$$s_j = \frac{\overrightarrow{h_j} + \overleftarrow{h_j}}{2}$$
Finally, the enhanced word feature $s_j$ represents each word $w_j$, and $S = [s_j \mid j = 1, \ldots, m,\ s_j \in \mathbb{R}^D]$ is used as the initial characterization of sentence T.
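The word-level encoding described above can be sketched as follows; this is an illustrative PyTorch example, assuming a torch.nn.GRU bi-directional encoder and averaging of the forward and backward states, with the class name TextEncoder and the dimension values chosen for the example.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative sketch of the text branch: word embedding followed by a
    bi-directional GRU; the two directions are averaged to give s_j."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # t_j = W_e w_j
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, m) integer word indices (one-hot encoding is implied by Embedding)
        t = self.embed(tokens)            # (batch, m, embed_dim)
        h, _ = self.gru(t)                # (batch, m, 2 * hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)     # forward / backward hidden states
        return (fwd + bwd) / 2            # s_j, shape (batch, m, hidden_dim)
```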
S202, the image region set V' with global-context-enhanced semantic relations is calculated by the frequency-adaptive region semantic reasoning method. The specific steps are as follows: an undirected graph $G = (V, E)$ is constructed for image I, where the initial characterization $V = \{v_1, v_2, \ldots, v_n\}$ of image I is the node set consisting of all image regions and E is the set of edges; for each node $v_i$ in the undirected graph, the high- and low-frequency information of all associated nodes $v_j$ is adaptively aggregated to obtain the node $v'_i$ after semantic reasoning, and $V' = [v'_i \mid i = 1, \ldots, n,\ v'_i \in \mathbb{R}^D]$ is the set of image regions with global-context-enhanced semantic relations. Specifically:
An undirected graph $G = (V, E)$ is constructed with each image region as a node of the graph, where $V = \{v_1, v_2, \ldots, v_n\}$, $v_i \in \mathbb{R}^D$, is the node set consisting of all image regions and E is the set of edges. A modified Graph Convolutional Network (GCN) learns a frequency-adaptive parameter $W_{ij}$ ($W_{ij} \in [-1, 1]$) that represents the proportion of high- and low-frequency information between adjacent nodes; in fact, the low-frequency signal corresponds to the sum of a node's feature and its neighbor's feature, while the high-frequency signal corresponds to their difference, and low-frequency and high-frequency coefficients are associated with node i and its neighbor node j. Through equation (3), a coefficient $W_{ij}$ with a value in $[-1, 1]$ is learned:
$$W_{ij} = \tanh\!\left(g^{T}\,[\,v_i \,\|\, v_j\,]\right) \qquad (3)$$
where $\|$ is the concatenation operation on the nodes, $g^{T}$ can be regarded as a shared convolution kernel used for the mapping, $v_j$ denotes a neighbor node of node $v_i$, and $v_i$ is the normalized region feature; the hyperbolic tangent is used so that the value of $W_{ij}$ is limited to $[-1, 1]$. In this way, $W_{ij}$ adaptively learns the high-to-low-frequency ratio between each node and its neighboring nodes. Subsequently, for each node $v_i$, the high- and low-frequency information of its adjacent nodes is aggregated; in this process, node $v_i$ infers an enhanced node $v'_i$ by adding the information of all associated nodes. This is achieved by:
$$v_i^{(l)} = \phi\!\left(\varepsilon\, v_i + \frac{1}{n-1}\sum_{j \neq i} W_{ij}\, v_j^{(l-1)}\right) \qquad (4)$$
$$v'_i = v_i^{(L)} \qquad (5)$$
where $\phi$ is the activation function, $l$ ($l \in [1, 5]$) is the number of graph convolution layers and denotes the number of node aggregations, $v_i^{(l)}$ is the output of node $v_i$ at layer $l$, and $v'_i$ is the output of node $v_i$ at the last layer $L$; $\varepsilon$ is a hyper-parameter in the range $[0, 1]$, and $\varepsilon = 0.3$ in our experiments; to prevent the aggregated content from becoming too large, $n-1$ is introduced as a regularization in the aggregation. The output of the last layer, $v'_i \in \mathbb{R}^D$, is the semantic reasoning node obtained by aggregating the high- and low-frequency information, and $V' = [v'_i \mid i = 1, \ldots, n,\ v'_i \in \mathbb{R}^D]$ is used as the set of image regions with global-context-enhanced semantic relations.
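A minimal sketch of the frequency-adaptive aggregation of equations (3)-(5) is given below, assuming a fully connected region graph, tanh as the activation $\phi$, and inclusion of the self term in the aggregation for simplicity; the class name and dimension values are illustrative.

```python
import torch
import torch.nn as nn

class FrequencyAdaptiveReasoning(nn.Module):
    """Sketch of frequency-adaptive region reasoning: a shared vector g scores
    each node pair with tanh, giving a coefficient W_ij in [-1, 1] that mixes
    low- and high-frequency information, aggregated over several layers."""
    def __init__(self, dim: int = 1024, num_layers: int = 3, eps: float = 0.3):
        super().__init__()
        self.g = nn.Linear(2 * dim, 1, bias=False)   # shared "convolution kernel" g
        self.num_layers = num_layers
        self.eps = eps

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, n, dim) normalized region features; fully connected graph assumed
        b, n, d = v.shape
        pairs = torch.cat([v.unsqueeze(2).expand(b, n, n, d),
                           v.unsqueeze(1).expand(b, n, n, d)], dim=-1)
        w = torch.tanh(self.g(pairs)).squeeze(-1)    # W_ij in [-1, 1], shape (batch, n, n)
        h = v
        for _ in range(self.num_layers):
            agg = torch.bmm(w, h) / (n - 1)          # 1/(n-1) regularized aggregation (self term kept)
            h = torch.tanh(self.eps * v + agg)       # phi = tanh assumed; eps * v_i residual
        return h                                     # v'_i: context-enhanced region features
```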
S203, the image region set V' and the initial characterization S of the text are input into the iterative attention interaction layer to obtain the semantically enhanced image global feature $V^{*}$ and the semantically enhanced text global feature $S^{*}$ generated by the iterative attention interaction layer. The specific steps are as follows:
the image region set V' is taken as the segment feature set of the image, and the initial characterization S of the text is taken as the segment feature set of the text;
either one of the image and the text is taken as the query modality X and the other as the other modality Y; the input Q is the segment-level feature set of the query modality, and the input P is the segment feature set of the other modality; $p_0$ is set equal to Y, and the number of iterations t is initialized;
with $p_{t-1}$ as the prior guidance, the global feature $q_t$ of Q after one semantic alignment is calculated with the attention interaction function; in standardized form this is defined as:
$$q_t = A(Q, p_{t-1})$$
with $q_t$ as the prior guidance, the global feature representation $p_t$ of P after one semantic alignment is calculated with the attention interaction function; in standardized form this is defined as:
$$p_t = A(P, q_t)$$
The process of generating $q_t$ and $p_t$ constitutes one iteration, and T iterations are carried out to obtain $q_T$ and $p_T$. If the image is the query modality, $q_T$ is the semantically enhanced image global feature generated by the iterative attention interaction layer, and $p_T$ is the semantically enhanced text global feature; if the text is the query modality, $q_T$ is the semantically enhanced text global feature, and $p_T$ is the semantically enhanced image global feature.
The attention interaction function $Z = A(X, Y)$ is specifically defined as follows:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k \qquad (6)$$
where $U_X, U_Y \in \mathbb{R}^{D \times k}$ and $b_a, u_a \in \mathbb{R}^{D}$ are learned parameters of the attention interaction function A(); $\mathbf{1}$ is a vector whose elements are all 1; $a_k$ denotes the degree of attention paid to the k-th segment feature $X_k$ under the guidance of Y; Z is the global feature of X after one semantic alignment using Y; and X, Y denote the feature sets of the two input modalities.
Specifically: the attention interaction module is defined as $Z = A(X, Y)$, where the input X is the segment-level feature set $X = [X_k \mid k = 1, \ldots, K,\ X_k \in \mathbb{R}^D]$ of the query modality; when X represents the image region set $V' = [v'_i \mid i = 1, \ldots, n,\ v'_i \in \mathbb{R}^D]$, the number of segment-level features is $K = n$; when X represents the text word set $S = [s_j \mid j = 1, \ldots, m,\ s_j \in \mathbb{R}^D]$, $K = m$. The input Y is the other modality in the cross-modal matching and represents the global representation of the modality opposite to X; Y is used as the attention guidance of the attention interaction module and is initialized by average pooling. For example, when X is the image region set, Y is the pooled sentence-level global semantic vector at initialization, and the output Z is the global semantic representation of X after one semantic alignment. In practice, the attention interaction function A() is defined as:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k \qquad (6)$$
where $U_X, U_Y \in \mathbb{R}^{D \times k}$ and $b_a, u_a \in \mathbb{R}^{D}$ are learned parameters of the attention interaction function A(), $\mathbf{1}$ is a vector whose elements are all 1, and $a$ is the attention weight of Z. When X represents the image region set, $a_k$ can be regarded as the image attention weight, i.e., the degree of attention paid to the k-th image region $X_k$ under the guidance of the whole sentence; Z is the global semantic representation of X after one semantic alignment with Y.
When X represents the image region set, the word-level features are first initialized and a sentence-level feature vector is generated by average pooling as the representation of Y, with $p_0$ equal to Y; when X represents the word-level features $s_j$, Y is the image-level feature vector.
In fact, the text-to-image and image-to-text matching models are symmetric. Taking text matching an image as an example, with $p_0$ as the prior guidance, a picture-level global feature is generated by attention weighting over V', denoted $q_1$, $q_1 \in \mathbb{R}^D$; subsequently, with $q_1$ as the prior guidance, the updated text global-level feature $p_1$, $p_1 \in \mathbb{R}^D$, is generated by attention weighting over S. Generating $q_1$ and $p_1$ constitutes one iteration, and T iterations are carried out in total. The standardized definition of this process is:
$$q_t = A(V', p_{t-1}), \qquad p_t = A(S, q_t) \qquad (7)$$
where t is the t-th iteration, $q_t$ and $p_t$ are the global semantic representations of the image and the text after semantic alignment, and V' and S denote the sets of semantically enhanced region features and word features, respectively. Therefore, after t iterations, the image global semantic representation (image level) focuses more on the specific region content related to the sentence description, and the text global semantic representation (sentence level) focuses more on the specific words related to the image description.
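The attention interaction function A() and the iterative alignment of equation (7) can be sketched as follows; this is an illustrative PyTorch example in which the two directions use separate (unshared) attention modules, an assumption not fixed by the description above, and the dimension values are chosen for the example.

```python
import torch
import torch.nn as nn

class AttentionInteraction(nn.Module):
    """Sketch of the attention interaction function Z = A(X, Y) of equation (6):
    the global vector Y of one modality guides attention weights over the K
    segment features of the other modality X."""
    def __init__(self, dim: int = 1024, att_dim: int = 512):
        super().__init__()
        self.U_x = nn.Linear(dim, att_dim, bias=False)   # U_X
        self.U_y = nn.Linear(dim, att_dim, bias=False)   # U_Y
        self.b_a = nn.Parameter(torch.zeros(att_dim))    # b_a
        self.u_a = nn.Linear(att_dim, 1, bias=False)     # u_a

    def forward(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        # X: (batch, K, dim) segment features; Y: (batch, dim) guiding global vector
        H = torch.tanh(self.U_x(X) + (self.U_y(Y) + self.b_a).unsqueeze(1))
        a = torch.softmax(self.u_a(H).squeeze(-1), dim=1)   # attention weights a_k
        return (a.unsqueeze(-1) * X).sum(dim=1)             # Z = sum_k a_k X_k


def iterative_alignment(att_img, att_txt, V_prime, S, num_iters: int = 3):
    """Sketch of the iterative alignment of equation (7): p_0 is the average-pooled
    sentence feature; each step refines one modality's global vector under the
    guidance of the other's."""
    p = S.mean(dim=1)              # p_0: sentence-level feature from average pooling
    for _ in range(num_iters):
        q = att_img(V_prime, p)    # q_t = A(V', p_{t-1}): image global feature
        p = att_txt(S, q)          # p_t = A(S, q_t): text global feature
    return q, p                    # (V*, S*) = (q_T, p_T)
```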
S204, the loss function is calculated and optimized with an optimizer:
The text and the image are each represented by D-dimensional features in the embedding space. The triplet loss is used as the loss function; instead of attending to all negatives in training as before, the negative samples of interest are taken from within the mini-batch. The loss function is expressed as:
$$L = \left[\alpha - Q(V^{*}, S^{*}) + Q(V^{*}, \hat{S})\right]_{+} + \left[\alpha - Q(V^{*}, S^{*}) + Q(\hat{V}, S^{*})\right]_{+}$$
where $\alpha$ is the margin parameter of the loss; $[\,\cdot\,]_{+}$ means that when the enclosed value is greater than zero it is taken as the loss, and when it is less than zero the loss is zero; Q() is a function realized by the inner product and computes semantic similarity; $V^{*}$ denotes the semantically enhanced image global feature generated by the iterative attention interaction module ($V^{*} = q_T$), i.e., the image global semantic representation $q_T$ obtained after T iterations is taken as $V^{*}$; $S^{*}$ denotes the semantically enhanced text global feature generated by the iterative attention interaction module ($S^{*} = p_T$), i.e., the text global semantic representation $p_T$ obtained after T iterations is taken as $S^{*}$; $\hat{V}$ and $\hat{S}$ denote negative samples in the mini-batch: when the loss is calculated, a small batch of image-text pairs is treated as the mini-batch, in which a paired image and text form a positive sample and unpaired ones are negative samples. To enable the matching model to achieve fine-grained semantic alignment at each iteration, an optimizer is used to optimize the objective function so that the image-text matching model is optimized end to end.
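A minimal sketch of this mini-batch triplet loss is given below, assuming the hardest in-batch negative is used for each anchor and an illustrative margin value; Q() is the inner product as stated above.

```python
import torch

def triplet_loss(V_star, S_star, margin: float = 0.2):
    """Sketch of the mini-batch triplet loss: Q(.,.) is the inner product, and
    negatives are the non-matching pairs inside the same mini-batch (here the
    hardest in-batch negative is selected; the margin value is an assumption)."""
    scores = V_star @ S_star.t()                       # (B, B) similarity matrix Q(V_i, S_j)
    pos = scores.diag().unsqueeze(1)                   # Q(V*, S*) for the matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = (margin - pos + scores).clamp(min=0).masked_fill(mask, 0)       # image anchor, negative texts
    cost_v = (margin - pos.t() + scores).clamp(min=0).masked_fill(mask, 0)   # text anchor, negative images
    return cost_s.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
```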
We compare our model with the current state-of-the-art models using the 1K (1000-image) and 5K (5000-image) test splits of MS-COCO as test data. The results show that our model is highly competitive with the other models. The ablation experiments (shown in FIGS. 3-4) show that the model with the frequency-adaptive semantic reasoning module and the iterative attention interaction module improves greatly over the baseline model, which directly indicates that the two proposed modules significantly improve matching performance.
In FIGS. 3 and 4, ablation experiments are performed on MS-COCO 1K. Baseline denotes the baseline model; Baseline+FA replaces the average pooling of image regions with the frequency-adaptive region semantic reasoning module; Baseline+IAM adds the iterative attention interaction module to the baseline model; FA-IATI is the complete cross-modal matching model we propose. The tests cover image-to-text matching and text-to-image matching. R@K (K = 1, 5, 10) in FIGS. 3 and 4 denotes recall, an evaluation metric of the matching model: the proportion of queries for which the correct item appears among the K results closest to the query.
FIG. 5 analyzes the number of iterations of the iterative attention interaction module on MS-COCO 1K. The experiment contains the Recall@1 results for image queries and text queries; Recall@1 is the proportion of queries for which the correct item is the single closest result to the query.
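For reference, Recall@K as used in FIGS. 3-5 can be computed as in the following illustrative sketch, where ranked_lists holds the retrieved indices per query and ground_truth the index of the correct item; the function name is an assumption made for the example.

```python
def recall_at_k(ranked_lists, ground_truth, k: int) -> float:
    """Fraction of queries whose correct item appears among the K results closest to the query."""
    hits = sum(1 for ranks, gt in zip(ranked_lists, ground_truth) if gt in ranks[:k])
    return hits / len(ground_truth)

# e.g. recall_at_k([[2, 0, 7], [1, 5, 3]], [0, 3], k=3) -> 1.0
```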
Example 2
The embodiment provides an image-text matching system based on frequency adaptation, which comprises:
a data acquisition module configured to: acquiring data, wherein the data comprises an image and a text matched with the image;
a model training module configured to: training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; and calculating a loss function, and optimizing the loss function by using an optimizer.
Example 3
The present embodiment also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of embodiment 1.
Example 4
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the method of embodiment 1.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. An image-text matching method based on frequency adaptation, comprising: acquiring data, wherein the data comprises an image and a text matched with the image;
training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; calculating a loss function, and optimizing the loss function by using an optimizer;
the specific steps of obtaining the semantic enhanced image global feature and the semantic enhanced text global feature are as follows: selecting any one of the image and the text as a query modality, and the other one as another modality; iterative calculation is carried out by using the attention interaction function to obtain the global features of the query mode and the global features of another mode; if the image is in a query mode, taking the global feature of the query mode as the semantically enhanced image global feature, and taking the global feature of the other mode as the semantically enhanced text global feature; if the text is in the query mode, taking the global feature of the query mode as the semantically enhanced text global feature, and taking the global feature of the other mode as the semantically enhanced image global feature;
wherein the attention interaction function $Z = A(X, Y)$ is specifically defined as follows:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k$$
wherein X, Y denote the feature sets of the two input modalities, $U_X$, $U_Y$, $b_a$, $u_a$ are parameters of the attention interaction function, $\mathbf{1}$ is a vector whose elements are all 1, and $a_k$ denotes the degree of attention paid to the k-th segment feature $X_k$ under the guidance of Y;
the calculation of the image region set with global-context-enhanced semantic relations specifically comprises: constructing an undirected graph for the image; for each node in the undirected graph, adaptively aggregating the high- and low-frequency information of all associated nodes to obtain the nodes after semantic reasoning, which form the image region set with global-context-enhanced semantic relations; wherein $W_{ij}$ adaptively learns the high-to-low-frequency ratio between each node and its neighboring nodes; for each node $v_i$, the high- and low-frequency information of its adjacent nodes is aggregated, and in this process node $v_i$ infers the enhanced node $v'_i$ by adding the information of all associated nodes:
$$v_i^{(l)} = \phi\!\left(\varepsilon\, v_i + \frac{1}{n-1}\sum_{j \neq i} W_{ij}\, v_j^{(l-1)}\right)$$
$$v'_i = v_i^{(L)}$$
wherein $\phi$ is the activation function, $l$ is the number of graph convolution layers, $v_i^{(l)}$ denotes the output of node $v_i$ at layer $l$, $v'_i$ is the output of node $v_i$ at the last layer, $\varepsilon$ is a hyper-parameter, $W_{ij} = \tanh(g^{T}[v_i \,\|\, v_j])$, $\|$ is the concatenation operation on the nodes, $g^{T}$ is a shared convolution kernel used for the mapping, $v_j$ denotes a neighbor node of node $v_i$, and $v_i$ is the normalized region feature.
2. The image-text matching method based on frequency adaptation according to claim 1, wherein the initial characterization of the image comprises the following specific calculation steps:
obtaining each regional characteristic of the image through a convolutional neural network;
performing linear transformation on each region characteristic;
and carrying out normalization processing on each region characteristic after linear transformation to obtain the region characteristic after normalization processing of each region, and forming an initial representation of the image.
3. The frequency-adaptive image-text matching method as claimed in claim 1, wherein the initial representation of the text comprises the following steps:
encoding each word in the text using one-hot;
computing an embedded representation of each word;
summarizing context information from both directions;
and obtaining word characteristics with the enhanced context information by adopting an average value mode, and forming an initial representation of the text.
4. A frequency-adaptive image-text matching method as recited in claim 1, wherein the loss function is a triplet loss function.
5. A frequency-adaptive image-text matching method as claimed in claim 1, wherein the attention-interaction function is directed by another modality to have different degrees of attention to different segments of the query modality.
6. A frequency-adaptive based image-text matching system, comprising: a data acquisition module configured to: acquiring data, wherein the data comprises an image and a text matched with the image;
a model training module configured to: training an image-text matching model based on frequency adaptation and iterative attention interaction by using the acquired data, wherein the specific steps comprise: carrying out initial feature representation on the image and the text in the data to obtain initial representation of the image and initial representation of the text; calculating an image region set with global context enhancement semantic relation based on a frequency self-adaptive region semantic reasoning method; inputting the image region set and the initial characterization of the text into an iterative attention interaction layer to obtain semantic enhanced image global features and semantic enhanced text global features; calculating a loss function, and optimizing the loss function by using an optimizer;
the specific steps of obtaining the semantic enhanced image global feature and the semantic enhanced text global feature are as follows: selecting any one of the image and the text as a query modality, and the other one as another modality; iterative calculation is carried out by using the attention interaction function to obtain the global features of the query mode and the global features of another mode; if the image is in a query mode, taking the global feature of the query mode as the semantically enhanced image global feature, and taking the global feature of the other mode as the semantically enhanced text global feature; if the text is in the query mode, taking the global feature of the query mode as the semantically enhanced text global feature, and taking the global feature of the other mode as the semantically enhanced image global feature;
wherein the attention interaction function $Z = A(X, Y)$ is specifically defined as follows:
$$H = \tanh\!\left(U_X X + (U_Y Y)\mathbf{1}^{T} + b_a \mathbf{1}^{T}\right)$$
$$a = \mathrm{softmax}\!\left(u_a^{T} H\right)$$
$$Z = \sum_{k=1}^{K} a_k X_k$$
wherein X, Y denote the feature sets of the two input modalities, $U_X$, $U_Y$, $b_a$, $u_a$ are parameters of the attention interaction function, $\mathbf{1}$ is a vector whose elements are all 1, and $a_k$ denotes the degree of attention paid to the k-th segment feature $X_k$ under the guidance of Y;
the calculation of the image region set with global-context-enhanced semantic relations specifically comprises: constructing an undirected graph for the image; for each node in the undirected graph, adaptively aggregating the high- and low-frequency information of all associated nodes to obtain the nodes after semantic reasoning, which form the image region set with global-context-enhanced semantic relations; wherein $W_{ij}$ adaptively learns the high-to-low-frequency ratio between each node and its neighboring nodes; for each node $v_i$, the high- and low-frequency information of its adjacent nodes is aggregated, and in this process node $v_i$ infers the enhanced node $v'_i$ by adding the information of all associated nodes:
$$v_i^{(l)} = \phi\!\left(\varepsilon\, v_i + \frac{1}{n-1}\sum_{j \neq i} W_{ij}\, v_j^{(l-1)}\right)$$
$$v'_i = v_i^{(L)}$$
wherein $\phi$ is the activation function, $l$ is the number of graph convolution layers, $v_i^{(l)}$ denotes the output of node $v_i$ at layer $l$, $v'_i$ is the output of node $v_i$ at the last layer, $\varepsilon$ is a hyper-parameter, $W_{ij} = \tanh(g^{T}[v_i \,\|\, v_j])$, $\|$ is the concatenation operation on the nodes, $g^{T}$ is a shared convolution kernel used for the mapping, $v_j$ denotes a neighbor node of node $v_i$, and $v_i$ is the normalized region feature.
7. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of any of claims 1-5.
8. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any of claims 1-5.
CN202110260146.XA 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption Active CN112861882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110260146.XA CN112861882B (en) 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110260146.XA CN112861882B (en) 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption

Publications (2)

Publication Number Publication Date
CN112861882A CN112861882A (en) 2021-05-28
CN112861882B (en) 2023-05-09

Family

ID=75993861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110260146.XA Active CN112861882B (en) 2021-03-10 2021-03-10 Image-text matching method and system based on frequency self-adaption

Country Status (1)

Country Link
CN (1) CN112861882B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419351A (en) * 2022-01-28 2022-04-29 深圳市腾讯计算机系统有限公司 Image-text pre-training model training method and device and image-text prediction model training method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN108960330A (en) * 2018-07-09 2018-12-07 西安电子科技大学 Remote sensing images semanteme generation method based on fast area convolutional neural networks
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147457B (en) * 2019-02-28 2023-07-25 腾讯科技(深圳)有限公司 Image-text matching method, device, storage medium and equipment
CN109933802B (en) * 2019-03-25 2023-05-26 腾讯科技(深圳)有限公司 Image-text matching method, image-text matching device and storage medium
CN111026894B (en) * 2019-12-12 2021-11-26 清华大学 Cross-modal image text retrieval method based on credibility self-adaptive matching network
CN111428801B (en) * 2020-03-30 2022-09-27 新疆大学 Image-text matching method for improving alternate updating of fusion layer and loss function
CN111737458B (en) * 2020-05-21 2024-05-21 深圳赛安特技术服务有限公司 Attention mechanism-based intention recognition method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN108960330A (en) * 2018-07-09 2018-12-07 西安电子科技大学 Remote sensing images semanteme generation method based on fast area convolutional neural networks
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Also Published As

Publication number Publication date
CN112861882A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
Liu et al. CNN-enhanced graph convolutional network with pixel-and superpixel-level feature fusion for hyperspectral image classification
Liu et al. Connecting image denoising and high-level vision tasks via deep learning
CN111061856B (en) Knowledge perception-based news recommendation method
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
Peng et al. Research on image feature extraction and retrieval algorithms based on convolutional neural network
CN111753116B (en) Image retrieval method, device, equipment and readable storage medium
CN114048331A (en) Knowledge graph recommendation method and system based on improved KGAT model
CN112861936B (en) Graph node classification method and device based on graph neural network knowledge distillation
Zhai et al. One-shot object affordance detection in the wild
WO2022042043A1 (en) Machine learning model training method and apparatus, and electronic device
Gao et al. Self-attention driven adversarial similarity learning network
Chen et al. Multi-SVM based Dempster–Shafer theory for gesture intention understanding using sparse coding feature
Zhang et al. Dual-constrained deep semi-supervised coupled factorization network with enriched prior
Cai et al. A robust interclass and intraclass loss function for deep learning based tongue segmentation
Meng et al. Few-shot image classification algorithm based on attention mechanism and weight fusion
Ning et al. Conditional generative adversarial networks based on the principle of homologycontinuity for face aging
Li et al. Robustness comparison between the capsule network and the convolutional network for facial expression recognition
Yuan et al. Modeling spatial layout for scene image understanding via a novel multiscale sum-product network
Liao et al. FERGCN: facial expression recognition based on graph convolution network
CN112861882B (en) Image-text matching method and system based on frequency self-adaption
Miao et al. Research on visual question answering based on GAT relational reasoning
Liu et al. Image feature selection embedded distribution differences between classes for convolutional neural network
CN117972138A (en) Training method and device for pre-training model and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant