CN116150418A - Image-text matching method and system based on mixed focusing attention mechanism - Google Patents

Image-text matching method and system based on mixed focusing attention mechanism

Info

Publication number
CN116150418A
Authority
CN
China
Prior art keywords: word, features, image, attention mechanism, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310424288.4A
Other languages
Chinese (zh)
Other versions
CN116150418B (en)
Inventor
鲍秉坤
叶俊杰
邵曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310424288.4A priority Critical patent/CN116150418B/en
Publication of CN116150418A publication Critical patent/CN116150418A/en
Application granted granted Critical
Publication of CN116150418B publication Critical patent/CN116150418B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F 16/532: Query formulation, e.g. graphical querying (information retrieval of still image data)
    • G06F 16/3344: Query execution using natural language analysis (information retrieval of unstructured textual data)
    • G06F 16/383: Retrieval characterised by using metadata automatically derived from the content (unstructured textual data)
    • G06F 16/583: Retrieval characterised by using metadata automatically derived from the content (still image data)
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 2201/07: Target detection
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an image-text matching method and system based on a mixed focusing attention mechanism. The method comprises the following steps: S1, extracting the features of the salient regions in an image and the features of each word in a natural language description; S2, using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, so as to distinguish effective from ineffective region features; S3, realizing intra-modal fusion of the region features and the word features with a gated self-attention mechanism, where a gating signal controls the self-attention matrix to adaptively select effective region and word features; and S4, calculating the matching score of the whole image and the sentence from the cross-modal and intra-modal region and word features. The invention enables mutual retrieval between pictures and texts.

Description

Image-text matching method and system based on mixed focusing attention mechanism
Technical Field
The invention belongs to the intersection of computer vision and natural language processing, and particularly relates to a method for computing the matching between images and text.
Background
Images and text are the main media for spreading information on the Internet and fill people's daily lives. As visual data, images naturally differ from natural language data such as text. Although the two modalities are different, in many scenarios the contents they convey are closely related: an image and a sentence of natural language description usually share an intrinsic semantic association. Mining this association has great application prospects and value for achieving semantic alignment between images and natural language. By mining the similarity score between an image and a natural language text, semantically matched image-text pairs can be found, which greatly promotes the development of text-to-image and image-to-text search and helps users find more valuable information on the Internet; this is the research value and significance of image-text matching.
An image-text matching method needs to score the degree of matching between a given image and a natural language description, so understanding the content of both is the key to determining the matching score: only when the method understands the image and the text can the degree of matching be judged accurately and comprehensively. To achieve fine-grained matching between images and texts, traditional image-text matching methods often use a pre-trained object detector to extract the salient regions of the image, and extract the feature of each word in the sentence by sequence modeling. Matching the global information of the whole image and the whole description is thereby converted into matching the local information of regions and words, and the degree of image-text matching is computed bottom-up.
The above methods still face the following two challenges. (a) Redundant/noisy information: conventional image-text matching models often use a fixed number (typically 36) of region features extracted from the image in advance, some of which contain no information related to the text (noise features), while others overlap to some degree (redundant features). (b) The image-text matching model cannot distinguish useful from useless information: a single-modality self-attention mechanism does not explicitly consider whether a region is useful, and existing cross-modal attention mechanisms usually apply a single temperature coefficient to all regions of all pictures and cannot assign different temperature coefficients to different pictures.
Disclosure of Invention
The technical problem to be solved by the invention is: in the process of mutual retrieval between pictures and texts, how to remove redundant/noisy region information in the image and how to construct cross-modal and intra-modal attention mechanisms, so that the image-text matching method does not pay excessive attention to the redundant/noisy region information.
In order to solve the technical problems, the invention provides an image-text matching method based on a mixed focusing attention mechanism, which comprises the following steps:
S1, extracting the features of the salient regions in the image and the features of each word in the natural language description;
S2, using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, so as to distinguish effective from ineffective region features and realize cross-modal context extraction and fusion of region-level and word-level features;
S3, realizing intra-modal fusion of the region features and the word features with a gated self-attention mechanism, where a gating signal controls the self-attention matrix to adaptively select effective region and word features, masks noisy and redundant regions, and enhances the discrimination between different region and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
The image-text matching method based on the mixed focusing attention mechanism further comprises the following steps:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
In the foregoing image-text matching method based on the mixed focusing attention mechanism, step S1 comprises two sub-steps:

Step S11, a pre-trained Faster R-CNN object detector is used to detect the $k$ most salient regions in the image and extract the feature of each region; the features are then mapped into a $d$-dimensional hidden space, and the resulting region features are denoted $V=\{v_1, v_2, \dots, v_k\}$, $v_i\in\mathbb{R}^{d}$, where every element of a feature vector $v_i$ is a real number, $d$ denotes the dimension of the feature vector, $\mathbb{R}$ denotes the real number field, and $\mathbb{R}^{d}$ denotes the space of $d$-dimensional real vectors;
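For illustration only, a minimal PyTorch-style sketch of this region-feature projection is given below; the dimensions (2048-dimensional Faster R-CNN pooled features projected to a 1024-dimensional hidden space) and all names are assumptions, not values fixed by the patent.

    import torch
    import torch.nn as nn

    # Sketch of step S11: precomputed Faster R-CNN region features (k regions per image)
    # are projected into the d-dimensional hidden space. Dimensions are assumed.
    region_fc = nn.Linear(2048, 1024)

    def encode_regions(rcnn_feats):          # rcnn_feats: (B, k, 2048) pooled region features
        return region_fc(rcnn_feats)         # (B, k, d) region features V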
Step S12, for a natural language description containing $n$ words, a bidirectional gated recurrent unit (Bi-GRU) is used to extract the feature of each word. The forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state when reading each word:

$$\overrightarrow{h}_i=\overrightarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overrightarrow{h}_i$ denotes the hidden state of the forward pass, $x_i$ denotes the one-hot code of the $i$-th word, and $\overrightarrow{\mathrm{GRU}}$ denotes the forward pass of the Bi-GRU;

the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state when reading each word:

$$\overleftarrow{h}_i=\overleftarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overleftarrow{h}_i$ denotes the hidden state of the backward pass and $\overleftarrow{\mathrm{GRU}}$ denotes the backward pass of the Bi-GRU;

the word feature $e_i$ is obtained by averaging the hidden state of the forward pass $\overrightarrow{h}_i$ and the hidden state of the backward pass $\overleftarrow{h}_i$, namely:

$$e_i=\frac{\overrightarrow{h}_i+\overleftarrow{h}_i}{2}$$

a linear layer then maps the word features into the $d$-dimensional hidden space, and the result is denoted $E=\{e_1, e_2, \dots, e_n\}$, $e_i\in\mathbb{R}^{d}$, where $d$ denotes the dimension of the feature vector.
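As an illustration only, a minimal PyTorch-style sketch of the word-feature branch of step S12 might look as follows; the embedding layer, the hidden size and the module name are assumptions (the patent feeds one-hot codes to the Bi-GRU), not part of the original disclosure.

    import torch
    import torch.nn as nn

    class WordEncoder(nn.Module):
        """Sketch of step S12: Bi-GRU word features, averaged over the two directions."""
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)      # dense stand-in for one-hot input
            self.bigru = nn.GRU(embed_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.fc = nn.Linear(hidden_dim, hidden_dim)           # map to the d-dimensional hidden space

        def forward(self, word_ids):                              # word_ids: (B, n) token indices
            h, _ = self.bigru(self.embed(word_ids))               # (B, n, 2 * hidden_dim)
            fwd, bwd = h.chunk(2, dim=-1)                         # forward / backward hidden states
            e = (fwd + bwd) / 2                                   # average the two directions
            return self.fc(e)                                     # (B, n, d) word features E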
In the foregoing image-text matching method based on the mixed focusing attention mechanism, step S2 comprises two sub-steps.

Step S21, given the image region features $V=\{v_1,\dots,v_k\}$ and the word features $E=\{e_1,\dots,e_n\}$ of the description, their average features are computed separately and recorded as the image region average feature $\bar v$ and the word average feature $\bar e$. With $\bar v$ and $\bar e$ as the query objects, the attention score of each region and each word is calculated:

$$a^{v}_{i}=\frac{\exp\big(w_1^{\top}(W_1 v_i \odot W_2\bar v)\big)}{\sum_{i'=1}^{k}\exp\big(w_1^{\top}(W_1 v_{i'} \odot W_2\bar v)\big)}$$

$$a^{e}_{j}=\frac{\exp\big(w_2^{\top}(W_3 e_j \odot W_4\bar e)\big)}{\sum_{j'=1}^{n}\exp\big(w_2^{\top}(W_3 e_{j'} \odot W_4\bar e)\big)}$$

where $a^{v}_{i}$ denotes the attention score of the image region average feature $\bar v$ for the $i$-th image region feature $v_i$, $a^{e}_{j}$ denotes the attention score of the word average feature $\bar e$ for the $j$-th word feature $e_j$, $W_1$, $W_2$ and $W_3$, $W_4$ are the first, second, third and fourth parameter matrices, $w_1$ and $w_2$ are parameter vectors, and $\odot$ denotes element-wise multiplication. The region and word features are weighted and summed with the attention scores to obtain the global features of the image and the text, namely:

$$v^{glo}=\sum_{i=1}^{k}a^{v}_{i}v_i,\qquad e^{glo}=\sum_{j=1}^{n}a^{e}_{j}e_j$$

where $v^{glo}$ denotes the global feature of the image and $e^{glo}$ denotes the global feature of the sentence description;

for a batch of images of size $B$, the focusing degree $\beta_m$ of the current text description on the $m$-th image is calculated as:

$$\beta_m=\sigma\big(w_3^{\top}[v^{glo}_m;\,e^{glo}]\big)$$

where $w_3$ is a parameter vector, $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation of two feature vectors, and $\sigma$ is the sigmoid activation function, thereby obtaining the focusing degree $\beta_m$ of the current text description on the $m$-th image.
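A minimal sketch of the focusing-degree computation of step S21, assuming the concatenation-plus-sigmoid form written above; the tensor shapes and the function name are illustrative only.

    import torch

    def focusing_degree(v_glo, e_glo, w3):
        """Focusing degree of the current sentence on every image in a batch.
        v_glo: (B, d) global image features, e_glo: (d,) global sentence feature,
        w3: (2*d,) learnable parameter vector."""
        e_rep = e_glo.unsqueeze(0).expand_as(v_glo)     # repeat the sentence feature for each image
        joint = torch.cat([v_glo, e_rep], dim=-1)       # concatenation [v_glo_m ; e_glo]
        return torch.sigmoid(joint @ w3)                # (B,) focusing degrees beta_m in (0, 1)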
Step S22, after obtaining the region features $V_m$ of the $m$-th image, the word features $E$ of the text description, and the focusing score $\beta_m$ of the text on the $m$-th image, the similarity score $s_{ij}$ of each word to each region is calculated through local word-region interaction, namely:

$$s_{ij}=e_i^{\top}v_j$$

where $\top$ denotes the transpose; L2 normalization of the similarity scores $s_{ij}$ yields the normalized similarity $\bar s_{ij}$, which represents the degree of similarity between the $i$-th word and the $j$-th region;

the attention score is given by:

$$\alpha_{ij}=\frac{\exp(\lambda\,\beta_m\,\bar s_{ij})}{\sum_{j'=1}^{k}\exp(\lambda\,\beta_m\,\bar s_{ij'})}$$

where $\lambda$ is a temperature hyperparameter whose effect on the $m$-th image is adapted by the focusing score $\beta_m$;

the region features are weighted and summed with the attention scores of each word to obtain the cross-modal context feature $c_i$ corresponding to each word, namely:

$$c_i=\sum_{j=1}^{k}\alpha_{ij}v_j$$

the fusion of the $i$-th word feature $e_i$ and the corresponding cross-modal context feature $c_i$ is realized by a linear layer, namely:

$$m_i=\mathrm{FC}_1\big([e_i;\,c_i]\big)$$

where $m_i$ denotes the feature obtained after fusing the information of the two modalities and $\mathrm{FC}_1$ is a linear layer;

the global feature of the image $v^{glo}$ and the global feature of the sentence description $e^{glo}$ obtained in step S21 are fused into the fused global feature $m^{glo}$, namely:

$$m^{glo}=\mathrm{FC}_1\big([v^{glo};\,e^{glo}]\big)$$

the fused global feature $m^{glo}$ and the fusion features $m_i$ corresponding to each word are merged and recorded as the multimodal features $M=\{m^{glo}, m_1, \dots, m_n\}$.
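The following sketch illustrates the focused cross-modal attention of step S22 for one image-sentence pair. The SCAN-style column-wise L2 normalization and the base temperature value are assumptions; only the idea of scaling the softmax temperature by the focusing degree beta_m comes from the text above.

    import torch
    import torch.nn.functional as F

    def focused_cross_attention(words, regions, beta_m, lam=9.0):
        """words: (n, d) word features E, regions: (k, d) region features V_m,
        beta_m: scalar focusing degree, lam: base temperature (assumed value)."""
        s = words @ regions.t()                           # (n, k) word-region similarities
        s_norm = F.normalize(s.clamp(min=0), dim=0)       # L2 normalisation (assumed SCAN-style)
        alpha = F.softmax(lam * beta_m * s_norm, dim=1)   # focused attention over the regions
        return alpha @ regions                            # (n, d) cross-modal contexts c_i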
In the foregoing image-text matching method based on the mixed focusing attention mechanism, in step S3 the attention coefficient matrix is calculated as follows:

$$A=\frac{\mathrm{FC}_q(M)\,\mathrm{FC}_k(M)^{\top}}{\sqrt{d}}$$

where $\mathrm{FC}_q$ and $\mathrm{FC}_k$ denote two linear layers with different parameters;

the gating signal $g$ is calculated as:

$$g=\sigma(M w_g)$$

where $\sigma$ is the activation function and $w_g$ is a learnable parameter vector; the gating signal $g\in\mathbb{R}^{n+1}$, and each scalar element $g_j$ of $g$ is regarded as the importance of the corresponding feature. Before softmax normalization of each row of the attention matrix $A$, the gating scores are separated into important and unimportant features by a threshold, i.e. each $g_j$ is fixed to a hard score:

$$\hat g_j=\begin{cases}s_0, & g_j<\tau\\ s_1, & g_j\ge\tau\end{cases}$$

where $\tau$ is the threshold, $s_0$ is the score of an unimportant local feature, and $s_1$ is the score of an important local feature;

the gating vector is expressed as $\hat g=[\hat g_1,\dots,\hat g_{n+1}]$, and the $j$-th column of the attention score matrix $A$ is weighted by the $j$-th gating signal $\hat g_j$, expressed as:

$$\tilde A_{ij}=\hat g_j\,A_{ij}$$

where $A_{ij}$ denotes a single element of the attention score matrix $A$;

each row of the gated attention matrix $\tilde A$ is then normalized with the softmax function;

the updated features are obtained as the weighted sum of the multimodal features $M$ by the attention scores, namely:

$$\tilde M=\phi\big(\mathrm{FC}_o\big(\operatorname{softmax}(\tilde A)\,M\big)\big)$$

where $\phi$ is an activation function, $\mathrm{FC}_o$ is a linear layer, and $M$ is the multimodal feature matrix obtained in the previous step;

the feature matrix updated by the gated self-modal attention mechanism is recorded as $\tilde M=\{\tilde m^{glo}, \tilde m_1, \dots, \tilde m_n\}$, where $\tilde m^{glo}$ denotes the updated global feature and $\tilde m_i$ denotes the updated local features.
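A minimal sketch of the gated self-attention of step S3, under the formulas reconstructed above; the threshold, the two hard scores and the tanh output activation are assumed values, not prescribed by the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedSelfAttention(nn.Module):
        """Self-attention whose columns are re-weighted by a hard gating signal."""
        def __init__(self, d, tau=0.5, s_low=0.1, s_high=1.0):
            super().__init__()
            self.fc_q, self.fc_k, self.fc_o = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
            self.w_g = nn.Parameter(torch.randn(d))
            self.tau, self.s_low, self.s_high = tau, s_low, s_high

        def forward(self, M):                                         # M: (n+1, d) multimodal features
            A = self.fc_q(M) @ self.fc_k(M).t() / M.size(-1) ** 0.5   # attention coefficient matrix
            g = torch.sigmoid(M @ self.w_g)                           # (n+1,) soft gating signal
            g_hard = torch.where(g >= self.tau,                       # hard scores: important vs. not
                                 torch.full_like(g, self.s_high),
                                 torch.full_like(g, self.s_low))
            A = F.softmax(A * g_hard.unsqueeze(0), dim=1)             # gate the columns, then row softmax
            return torch.tanh(self.fc_o(A @ M))                       # updated global + local features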
In the foregoing image-text matching method based on the mixed focusing attention mechanism, in step S4, based on the updated features $\tilde M$ obtained in step S3, the score of the current image-text pair is predicted with a linear layer, expressed as:

$$S(I,T)=\sigma\big(\mathrm{FC}_s(\tilde m^{glo})\big)$$

where $\sigma$ is the sigmoid activation function, $\mathrm{FC}_s$ denotes a linear layer, and $S(I,T)$ denotes the matching score between the image $I$ and the text description $T$.
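A sketch of the score prediction of step S4; the feature dimension is an assumption, and the head is simply a linear layer followed by a sigmoid, as described above.

    import torch
    import torch.nn as nn

    score_head = nn.Sequential(nn.Linear(1024, 1), nn.Sigmoid())    # FC_s + sigmoid, d = 1024 assumed

    def match_score(m_glo_updated):                     # (B, d) updated global features
        return score_head(m_glo_updated).squeeze(-1)    # (B,) matching scores S(I, T) in (0, 1)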
In the foregoing image-text matching method based on the mixed focusing attention mechanism, in step S5, the triplet loss function $L_{tri}$ is expressed as:

$$L_{tri}=\big[\gamma-S(I,T)+S(I,\hat T)\big]_+ + \big[\gamma-S(I,T)+S(\hat I,T)\big]_+$$

where $[x]_+=\max(x,0)$, $\gamma$ is a threshold (margin), and $\hat T$ and $\hat I$ are the first and second hardest negative samples, respectively;

the formula for optimizing all the linear layers with the triplet loss function is:

$$\theta_2=\theta_1-\eta\,\nabla_{\theta_1}L_{tri}$$

where $\theta_1$ and $\theta_2$ denote a parameter scalar inside a linear layer before and after optimization, respectively, $\eta$ is the learning rate, and $\nabla$ denotes the gradient operation.
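A sketch of the hardest-negative triplet loss of step S5 over a batch of scores; the margin value is an assumption, and the parameter update itself would normally be delegated to a standard optimizer (e.g. SGD), which performs the gradient step written above.

    import torch

    def hardest_triplet_loss(scores, margin=0.2):
        """scores: (B, B) matrix with scores[i, j] = S(image_i, text_j); the diagonal
        holds the matched pairs. Returns the mean triplet loss with hardest negatives."""
        pos = scores.diag()                                                   # S(I, T) of matched pairs
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        neg_txt = scores.masked_fill(mask, float('-inf')).max(dim=1).values   # hardest negative text per image
        neg_img = scores.masked_fill(mask, float('-inf')).max(dim=0).values   # hardest negative image per text
        loss = (margin - pos + neg_txt).clamp(min=0) + (margin - pos + neg_img).clamp(min=0)
        return loss.mean()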
An image-text matching system based on a mixed focusing attention mechanism comprises the following functional modules:
Image-text pair feature extraction module: extracts the features of the salient regions in the image and the features of each word in the natural language description;
Cross-modal attention mechanism module: uses the focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, thereby distinguishing effective from ineffective region features and realizing cross-modal context extraction and fusion of region-level and word-level features;
Gated self-attention mechanism module: realizes intra-modal fusion of the region features and the word features with a gated self-attention mechanism, where a gating signal controls the self-attention matrix to adaptively select effective region and word features, masks noisy and redundant regions, and enhances the discrimination between different region and word features;
Matching score calculation module: uses the cross-modal and intra-modal region and word features to calculate the matching score of the whole image and the sentence.
The foregoing image-text matching system based on the mixed focusing attention mechanism further comprises:
loss function optimizing module: and optimizing all linear layers in the feature extraction module, the modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module of the image-text pair by using the triple loss function, and executing the working processes of the feature extraction module, the modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module of the image-text pair after optimizing.
A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as described above.
The invention has the following beneficial effects: it can automatically judge whether the contents of a given image and a natural language description are consistent and obtain a matching score; it can be used for cross-modal retrieval on the Internet, i.e. retrieving the image corresponding to a text or the text corresponding to an image; and in the process of image-text matching it can adaptively filter and compress redundant or noisy region features, thereby better realizing mutual retrieval between pictures and texts.
Drawings
Fig. 1 is a flow chart of the image-text matching method based on the mixed focusing attention mechanism.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings.
Example 1
As shown in fig. 1, the present invention provides an image-text matching method based on a mixed focusing attention mechanism, which comprises the following steps:
S1, extracting the features of the image-text pair, namely the features of the salient regions in the image and the features of each word in the natural language description;
S2, using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, so as to distinguish effective from ineffective region features and realize cross-modal context extraction and fusion of region-level and word-level features;
S3, realizing intra-modal fusion of the region features and the word features with a gated self-attention mechanism, where a gating signal controls the self-attention matrix to adaptively select effective region and word features, masks noisy and redundant regions, and enhances the discrimination between different region and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
In step S1, two sub-steps are included: extracting the salient region features in the image and extracting the word features in the natural language description.

Step S11, region features are extracted from the image: a pre-trained Faster R-CNN object detector is used to detect the $k$ most salient regions in the image (a typical value of $k$ is 36) and extract the feature of each region; the features are then mapped into a $d$-dimensional hidden space, and the resulting region features are denoted $V=\{v_1, v_2, \dots, v_k\}$, $v_i\in\mathbb{R}^{d}$, where every element of a feature vector $v_i$ is a real number, $d$ denotes the dimension of the feature vector, $\mathbb{R}$ denotes the real number field, and $\mathbb{R}^{d}$ denotes the space of $d$-dimensional real vectors;

Step S12, word features are extracted from the text: for a natural language description containing $n$ words, a bidirectional gated recurrent unit (Bi-GRU) is used to extract the feature of each word. The forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state when reading each word:

$$\overrightarrow{h}_i=\overrightarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overrightarrow{h}_i$ denotes the hidden state of the forward pass, $x_i$ denotes the one-hot code of the $i$-th word, and $\overrightarrow{\mathrm{GRU}}$ denotes the forward pass of the Bi-GRU;

then, the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state when reading each word:

$$\overleftarrow{h}_i=\overleftarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overleftarrow{h}_i$ denotes the hidden state of the backward pass and $\overleftarrow{\mathrm{GRU}}$ denotes the backward pass of the Bi-GRU;

finally, the word feature $e_i$ is obtained by averaging the hidden state of the forward pass $\overrightarrow{h}_i$ and the hidden state of the backward pass $\overleftarrow{h}_i$, namely:

$$e_i=\frac{\overrightarrow{h}_i+\overleftarrow{h}_i}{2}$$

a linear layer then maps the word features into the $d$-dimensional hidden space, and the result is denoted $E=\{e_1, e_2, \dots, e_n\}$, $e_i\in\mathbb{R}^{d}$, where $d$ denotes the dimension of the feature vector.
In step S2, after the salient region features in the image and the word features in the text are extracted, local interactions between the different modalities are performed with the focused cross-modal attention mechanism to obtain complementary information across modalities. In order to distinguish the importance of regions, a focusing operation is applied to the attention matrix obtained by the cross-modal attention mechanism, which increases the differences between attention scores so that redundant and noisy regions can be filtered out better while useful regions are retained. This step comprises two sub-steps: first the calculation of the attention focusing score, and then the implementation of the focused cross-modal attention mechanism.
Step S21, calculation of the attention focusing score: for a batch of images during training, a given description should match different images to different degrees; for a matched image-text sample pair, the matching degree of the current description with the corresponding image should be stronger, and otherwise weaker. The attention focusing score is therefore calculated from the content of the whole image and text, i.e. the focusing degree of the current text on the different images in a batch of samples.
This embodiment distinguishes the focusing degree of the text on each image through global information, complementing the local nature of the cross-modal attention mechanism. Given the image region features $V=\{v_1,\dots,v_k\}$ and the word features $E=\{e_1,\dots,e_n\}$ of the description, their average features are computed and recorded as the image region average feature $\bar v$ and the word average feature $\bar e$. With $\bar v$ and $\bar e$ as the query objects, the attention score of each region and each word is calculated:

$$a^{v}_{i}=\frac{\exp\big(w_1^{\top}(W_1 v_i \odot W_2\bar v)\big)}{\sum_{i'=1}^{k}\exp\big(w_1^{\top}(W_1 v_{i'} \odot W_2\bar v)\big)}$$

$$a^{e}_{j}=\frac{\exp\big(w_2^{\top}(W_3 e_j \odot W_4\bar e)\big)}{\sum_{j'=1}^{n}\exp\big(w_2^{\top}(W_3 e_{j'} \odot W_4\bar e)\big)}$$

where $a^{v}_{i}$ denotes the attention score of the image region average feature $\bar v$ for the $i$-th image region feature $v_i$, $a^{e}_{j}$ denotes the attention score of the word average feature $\bar e$ for the $j$-th word feature $e_j$, $W_1$, $W_2$ and $W_3$, $W_4$ are the first, second, third and fourth parameter matrices, $w_1$ and $w_2$ are parameter vectors, and $\odot$ denotes element-wise multiplication. The region and word features are weighted and summed with the attention scores to obtain the global features of the image and the text, namely:

$$v^{glo}=\sum_{i=1}^{k}a^{v}_{i}v_i,\qquad e^{glo}=\sum_{j=1}^{n}a^{e}_{j}e_j$$

where $v^{glo}$ denotes the global feature of the image and $e^{glo}$ denotes the global feature of the sentence description.

For a batch of images of size $B$, the focusing degree $\beta_m$ of the current text description on the $m$-th image is calculated as:

$$\beta_m=\sigma\big(w_3^{\top}[v^{glo}_m;\,e^{glo}]\big)$$

where $w_3$ is a parameter vector, $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation of two feature vectors, and $\sigma$ is the sigmoid activation function, thereby obtaining the focusing degree $\beta_m$ of the current text description on the $m$-th image.
Step S22, implementation of the focused cross-modal attention mechanism: after obtaining the region features $V_m$ of the $m$-th image, the word features $E$ of the text description, and the focusing score $\beta_m$ of the text on the $m$-th image, the similarity score $s_{ij}$ of each word to each region is calculated through local word-region interaction, namely:

$$s_{ij}=e_i^{\top}v_j$$

where $\top$ denotes the transpose; L2 normalization of the similarity scores $s_{ij}$ yields the normalized similarity $\bar s_{ij}$, which represents the degree of similarity between the $i$-th word and the $j$-th region.

Existing image-text matching methods control the similarity score through a hyperparameter temperature coefficient $\lambda$ to sharpen the degree of attention of a word to the regions, obtaining the attention score $\alpha_{ij}$ as:

$$\alpha_{ij}=\frac{\exp(\lambda\,\bar s_{ij})}{\sum_{j'=1}^{k}\exp(\lambda\,\bar s_{ij'})}$$

When the temperature coefficient $\lambda$ rises, the attention scores become more concentrated and the $i$-th word attends to only one or a few regions; when the hyperparameter temperature coefficient $\lambda$ decreases, the attention scores become more dispersed and the degree of attention of the $i$-th word to all regions tends to be uniform.

In the above approach the hyperparameter temperature coefficient $\lambda$ is fixed, and the same temperature coefficient is applied to all the different images in a batch. In this embodiment the temperature coefficient is controlled by the focusing score $\beta_m$, so that the text description can have different temperature coefficients for different images and the validity of the regions in different images (i.e. whether the regions of different images carry useful information or are noisy or redundant) can be distinguished better. The attention score in this embodiment is obtained by the following formula:

$$\alpha_{ij}=\frac{\exp(\lambda\,\beta_m\,\bar s_{ij})}{\sum_{j'=1}^{k}\exp(\lambda\,\beta_m\,\bar s_{ij'})}$$

Through the focusing score $\beta_m$, the focused cross-modal attention mechanism in this embodiment can distinguish different images more effectively.
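The effect described above can be verified with a quick numerical check (the similarity values are made up): the same word-region similarities produce a nearly uniform attention distribution at a low temperature and a sharply focused one at a high temperature.

    import torch
    import torch.nn.functional as F

    s = torch.tensor([0.1, 0.3, 0.6])        # similarities of one word to three regions (made up)
    print(F.softmax(1.0 * s, dim=0))         # low temperature  -> roughly uniform attention
    print(F.softmax(20.0 * s, dim=0))        # high temperature -> concentrated on one region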
The region features are weighted and summed with the attention scores of each word to obtain the cross-modal context feature $c_i$ corresponding to each word, namely:

$$c_i=\sum_{j=1}^{k}\alpha_{ij}v_j$$

the fusion of the $i$-th word feature $e_i$ and the corresponding cross-modal context feature $c_i$ is realized by a linear layer, namely:

$$m_i=\mathrm{FC}_1\big([e_i;\,c_i]\big)$$

where $m_i$ denotes the feature obtained after fusing the information of the two modalities and $\mathrm{FC}_1$ is a linear layer;

the global feature of the image $v^{glo}$ and the global feature of the sentence description $e^{glo}$ obtained in step S21 are fused into the fused global feature $m^{glo}$, namely:

$$m^{glo}=\mathrm{FC}_1\big([v^{glo};\,e^{glo}]\big)$$

the fused global feature $m^{glo}$ and the fusion features $m_i$ corresponding to each word are merged and recorded as the multimodal features $M=\{m^{glo}, m_1, \dots, m_n\}$. In the next step, intra-modal information extraction and fusion are realized through the gated self-attention mechanism.
In step S3, a gated self-modal attention mechanism is used. For the given multimodal features $M=\{m^{glo}, m_1, \dots, m_n\}$, the features $m_1,\dots,m_n$ can be regarded as local features (e.g. word features fused with visual information) and $m^{glo}$ can be regarded as the global feature. The importance of each local feature differs; for example, the importance of each word differs, and nouns in a sentence are generally more important than prepositions. This embodiment therefore designs a gated self-modal attention mechanism in which a gating signal controls the attention score matrix of the local features and thereby controls the importance of different local information. The attention score matrix is calculated as:

$$A=\frac{\mathrm{FC}_q(M)\,\mathrm{FC}_k(M)^{\top}}{\sqrt{d}}$$

where $\mathrm{FC}_q$ and $\mathrm{FC}_k$ denote two linear layers with different parameters;

the gating signal $g$ can be calculated as:

$$g=\sigma(M w_g)$$

where $\sigma$ is the activation function and $w_g$ is a learnable parameter vector; the gating signal $g\in\mathbb{R}^{n+1}$, and each scalar element $g_j$ of $g$ can be regarded as the importance of the corresponding feature. Before softmax normalization of each row of the attention matrix $A$, the gating scores are separated into important and unimportant features by a threshold, i.e. each $g_j$ is fixed to a hard score:

$$\hat g_j=\begin{cases}s_0, & g_j<\tau\\ s_1, & g_j\ge\tau\end{cases}$$

where $\tau$ is the threshold (0 in this experiment), $s_0$ is the score of an unimportant local feature, and $s_1$ is the score of an important local feature;

the gating vector is expressed as $\hat g=[\hat g_1,\dots,\hat g_{n+1}]$, and the $j$-th column of the attention score matrix $A$ is weighted by the $j$-th gating signal $\hat g_j$, expressed as:

$$\tilde A_{ij}=\hat g_j\,A_{ij}$$

where $A_{ij}$ denotes a single element of the attention score matrix $A$; each row of the gated attention matrix $\tilde A$ is then normalized with the softmax function. Because the gating weighting is applied before the row-wise softmax, the attention distribution is sharpened so that each query focuses on the important features;

finally, the updated features are obtained as the weighted sum of the multimodal features $M$ by the attention scores, namely:

$$\tilde M=\phi\big(\mathrm{FC}_o\big(\operatorname{softmax}(\tilde A)\,M\big)\big)$$

where $\phi$ is an activation function, $\mathrm{FC}_o$ is a linear layer, and $M$ is the multimodal feature matrix obtained in the previous step;

the feature matrix updated by the gated self-modal attention mechanism can be recorded as $\tilde M=\{\tilde m^{glo}, \tilde m_1, \dots, \tilde m_n\}$, where $\tilde m^{glo}$ denotes the updated global feature and $\tilde m_i$ denotes the updated local features.
In step S4, the image-text matching score is calculated and the model is trained. Based on the updated features $\tilde M$ obtained in step S3, the score of the current image-text pair is predicted with a linear layer, which can be expressed as:

$$S(I,T)=\sigma\big(\mathrm{FC}_s(\tilde m^{glo})\big)$$

where $\sigma$ is the sigmoid activation function, $\mathrm{FC}_s$ denotes a linear layer, and $S(I,T)$ denotes the matching score between the image $I$ and the text description $T$. The above formula shows that the image-text matching score is predicted from the updated global feature $\tilde m^{glo}$.
An image-text matching system based on a mixed focusing attention mechanism comprises the following functional modules:
Image-text pair feature extraction module: extracts the features of the salient regions in the image and the features of each word in the natural language description;
Cross-modal attention mechanism module: uses the focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, thereby distinguishing effective from ineffective region features and realizing cross-modal context extraction and fusion of region-level and word-level features;
Gated self-attention mechanism module: realizes intra-modal fusion of the region features and the word features with a gated self-attention mechanism, where a gating signal controls the self-attention matrix to adaptively select effective region and word features, masks noisy and redundant regions, and enhances the discrimination between different region and word features;
Matching score calculation module: uses the cross-modal and intra-modal region and word features to calculate the matching score of the whole image and the sentence.
The image-text pair feature extraction module performs the following steps:

Step S11, a pre-trained Faster R-CNN object detector is used to detect the $k$ most salient regions in the image and extract the feature of each region; the features are then mapped into a $d$-dimensional hidden space, and the resulting region features are denoted $V=\{v_1, v_2, \dots, v_k\}$, $v_i\in\mathbb{R}^{d}$, where every element of a feature vector $v_i$ is a real number, $d$ denotes the dimension of the feature vector, $\mathbb{R}$ denotes the real number field, and $\mathbb{R}^{d}$ denotes the space of $d$-dimensional real vectors;

Step S12, for a natural language description containing $n$ words, a bidirectional gated recurrent unit (Bi-GRU) is used to extract the feature of each word. The forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state when reading each word:

$$\overrightarrow{h}_i=\overrightarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overrightarrow{h}_i$ denotes the hidden state of the forward pass, $x_i$ denotes the one-hot code of the $i$-th word, and $\overrightarrow{\mathrm{GRU}}$ denotes the forward pass of the Bi-GRU;

then, the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state when reading each word:

$$\overleftarrow{h}_i=\overleftarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overleftarrow{h}_i$ denotes the hidden state of the backward pass and $\overleftarrow{\mathrm{GRU}}$ denotes the backward pass of the Bi-GRU;

finally, the word feature $e_i$ is obtained by averaging the hidden state of the forward pass $\overrightarrow{h}_i$ and the hidden state of the backward pass $\overleftarrow{h}_i$, namely:

$$e_i=\frac{\overrightarrow{h}_i+\overleftarrow{h}_i}{2}$$

a linear layer then maps the word features into the $d$-dimensional hidden space, and the result is denoted $E=\{e_1, e_2, \dots, e_n\}$, $e_i\in\mathbb{R}^{d}$, where $d$ denotes the dimension of the feature vector.
A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method as described above.
Example 2
The image-text matching method based on the mixed focusing attention mechanism further comprises, after executing step S1-step S4 in embodiment 1:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
In step S5, all the linear layers in steps S1-S4 are optimized with a triplet loss function, and steps S1-S4 are executed after the optimization. The triplet loss function $L_{tri}$ is expressed as:

$$L_{tri}=\big[\gamma-S(I,T)+S(I,\hat T)\big]_+ + \big[\gamma-S(I,T)+S(\hat I,T)\big]_+$$

where $[x]_+=\max(x,0)$, $\gamma$ is a threshold (margin), and $\hat T$ and $\hat I$ are the first and second hardest negative samples, respectively. The hardest negatives are defined as follows: the text sample $\hat T$ is the non-matching description whose matching score with the current query image $I$ is the highest, i.e. $\hat T=\operatorname{argmax}_{T'\neq T}S(I,T')$; similarly, the image sample $\hat I$ is the non-matching image whose matching score with the current query text $T$ is the highest, i.e. $\hat I=\operatorname{argmax}_{I'\neq I}S(I',T)$;

the formula for optimizing all the linear layers with the triplet loss function is:

$$\theta_2=\theta_1-\eta\,\nabla_{\theta_1}L_{tri}$$

where $\theta_1$ and $\theta_2$ denote a parameter scalar inside a linear layer before and after optimization, respectively, $\eta$ is the learning rate, and $\nabla$ denotes the gradient operation.
Steps S1 to S4 are specifically performed in the same manner as in example 1.
The image-text matching system based on the mixed focusing attention mechanism further comprises, in addition to the image-text pair feature extraction module, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module of embodiment 1:
Loss function optimization module: all linear layers in the image-text pair feature extraction module, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module are optimized with the triplet loss function, and after optimization the working processes of these modules are executed.
The specific execution process of each functional module is the same as that of embodiment 1.
The following is a description of specific experimental data.
Training, validation and testing are carried out on the Flickr30K dataset, which contains 31,783 pictures, each with 5 corresponding natural language descriptions; 29,783 pictures are used for model training, 1,000 for validation and 1,000 for testing, and the invention achieves a good effect.
The effectiveness of the image-text matching method based on the mixed focusing attention mechanism proposed in this embodiment is measured by Recall@K (abbreviated R@K, with K = 1, 5, 10), which represents the proportion of queries for which the correct answer appears in the top-K retrieval results. The overall performance is measured by rsum, obtained by adding the R@1, R@5 and R@10 of image-to-text retrieval and text-to-image retrieval, namely:

$$\mathrm{rsum}=(R@1+R@5+R@10)_{\text{image-to-text}}+(R@1+R@5+R@10)_{\text{text-to-image}}$$

where the left term represents the sum of R@1, R@5 and R@10 for retrieving text from an image, and the right term represents the sum of R@1, R@5 and R@10 for retrieving an image from text.
Table 1 compares the method of the invention with other image-text matching methods on the Flickr30K dataset, including several classical methods in the field: SCAN (CVPR 2018), PFAN (IJCAI 2019), VSRN (ICCV 2019), DP-RNN (AAAI 2020), CVSE (ECCV 2020) and CAAN (CVPR 2020). As can be seen from the results, the method of the invention achieves a more balanced retrieval effect in both image-to-text and text-to-image retrieval. Compared with the existing methods, the image-text matching method based on the mixed focusing attention mechanism has the best overall performance, i.e. the highest rsum of 489.3. Moreover, in the two subtasks of image-to-text retrieval and text-to-image retrieval, the proposed method performs best on Recall@1, which shows that its retrieval success rate is far higher than that of the other methods.
[Table 1: Recall@1/5/10 comparison of the proposed method with SCAN, PFAN, VSRN, DP-RNN, CVSE and CAAN on the Flickr30K dataset; the table is provided as an image in the original publication.]
It should be noted that each step/component described in the present application may be split into more steps/components, or two or more steps/components or part of the operations of the steps/components may be combined into new steps/components, as needed for implementation, to achieve the object of the present invention.
The above-described method according to the present invention may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and to be stored in a local recording medium downloaded through a network, so that the method described herein may be stored on such software process on a recording medium using a general purpose computer, special purpose processor, or programmable or special purpose hardware such as an ASIC or FPGA. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes a memory component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the processing methods described herein. Further, when the general-purpose computer accesses code for implementing the processes shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the processes shown herein.
It will be readily appreciated by those skilled in the art that the foregoing is merely illustrative of the present invention and is not intended to limit the invention, but any modifications, equivalents, improvements or the like which fall within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. The image-text matching method based on the mixed focusing attention mechanism is characterized by comprising the following steps of:
S1, extracting the features of the salient regions in the image and the features of each word in the natural language description;
S2, using a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, so as to distinguish effective from ineffective region features and realize cross-modal context extraction and fusion of region-level and word-level features;
S3, realizing intra-modal fusion of the region features and the word features with a gated self-attention mechanism, where a gating signal controls the self-attention matrix to adaptively select effective region and word features, masks noisy and redundant regions, and enhances the discrimination between different region and word features;
and S4, calculating the matching score of the whole image and the sentence by using the cross-modal and self-modal regional features and the word features.
2. The method for matching graphics based on a hybrid focus attention mechanism of claim 1, further comprising:
and S5, optimizing all the linear layers in the steps S1-S4 by using a triplet loss function, and executing the steps S1-S4 after optimizing.
3. The image-text matching method based on the mixed focusing attention mechanism according to claim 1 or 2, characterized in that step S1 comprises two sub-steps:

step S11, a pre-trained Faster R-CNN object detector is used to detect the $k$ most salient regions in the image and extract the feature of each region; the features are then mapped into a $d$-dimensional hidden space, and the resulting region features are denoted $V=\{v_1, v_2, \dots, v_k\}$, $v_i\in\mathbb{R}^{d}$, where every element of a feature vector $v_i$ is a real number, $d$ denotes the dimension of the feature vector, $\mathbb{R}$ denotes the real number field, and $\mathbb{R}^{d}$ denotes the space of $d$-dimensional real vectors;

step S12, for a natural language description containing $n$ words, a bidirectional gated recurrent unit (Bi-GRU) is used to extract the feature of each word; the forward pass of the Bi-GRU reads from the first word to the last word and records the hidden state when reading each word:

$$\overrightarrow{h}_i=\overrightarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overrightarrow{h}_i$ denotes the hidden state of the forward pass, $x_i$ denotes the one-hot code of the $i$-th word, and $\overrightarrow{\mathrm{GRU}}$ denotes the forward pass of the Bi-GRU;

the backward pass of the Bi-GRU reads from the last word to the first word and records the hidden state when reading each word:

$$\overleftarrow{h}_i=\overleftarrow{\mathrm{GRU}}(x_i),\quad i\in[1,n]$$

where $\overleftarrow{h}_i$ denotes the hidden state of the backward pass and $\overleftarrow{\mathrm{GRU}}$ denotes the backward pass of the Bi-GRU;

the word feature $e_i$ is obtained by averaging the hidden state of the forward pass $\overrightarrow{h}_i$ and the hidden state of the backward pass $\overleftarrow{h}_i$, namely:

$$e_i=\frac{\overrightarrow{h}_i+\overleftarrow{h}_i}{2}$$

a linear layer then maps the word features into the $d$-dimensional hidden space, and the result is denoted $E=\{e_1, e_2, \dots, e_n\}$, $e_i\in\mathbb{R}^{d}$, where $d$ denotes the dimension of the feature vector.
4. A method of matching text based on a mixed focus attention mechanism as claimed in claim 3, characterized in that in step S2, two sub-steps are included;
step S21, giving the image area characteristics
Figure QLYQS_25
And word feature of description->
Figure QLYQS_26
The average feature is determined separately and recorded as the image area average feature +.>
Figure QLYQS_27
And word average feature +.>
Figure QLYQS_28
Mean feature in image area->
Figure QLYQS_29
And word average feature +.>
Figure QLYQS_30
For the query object, the attention score for each region and word is calculated:
Figure QLYQS_31
Figure QLYQS_32
wherein ,
Figure QLYQS_35
representing the average feature of an image region->
Figure QLYQS_39
For->
Figure QLYQS_43
Individual image area features->
Figure QLYQS_36
Attention score of->
Figure QLYQS_37
Representing word average feature +.>
Figure QLYQS_41
For->
Figure QLYQS_45
Individual word feature +.>
Figure QLYQS_34
Attention score of->
Figure QLYQS_38
、/>
Figure QLYQS_42
and />
Figure QLYQS_46
、/>
Figure QLYQS_33
The parameters are respectively a first parameter matrix, a second parameter matrix, a third parameter matrix, a fourth parameter matrix and a +.>
Figure QLYQS_40
and />
Figure QLYQS_44
For the parameter vector +.>
Figure QLYQS_47
Representing element multiplication, and weighting and summing the region and word characteristics through the attention score to obtain global characteristics of the image and the text, namely:
Figure QLYQS_48
wherein ,
Figure QLYQS_49
representing global features of the image; />
Figure QLYQS_50
Global features representing sentence descriptions;
for a size of
Figure QLYQS_51
Calculating the current text description for the lot size of the image +.>
Figure QLYQS_52
Focusing degree of sheet image->
Figure QLYQS_53
The method comprises the following steps:
Figure QLYQS_54
wherein ,
Figure QLYQS_55
for the parameter vector +.>
Figure QLYQS_56
Representing a concatenation operation of two feature vectors, +.>
Figure QLYQS_57
Activating the function for sigmoid, thereby obtaining the current text description pair +.>
Figure QLYQS_58
Focusing degree of sheet image->
Figure QLYQS_59
Step S22: having obtained the region features v_1, ..., v_k of the t-th image, the word features w_1, ..., w_n of the text description and the focusing degree β_t of the description on the t-th image, the similarity score s_ij of each word to each region is calculated through local word-region interaction:

    s_ij = w_i^T v_j,

where ^T denotes transposition. The similarity scores are L2-normalized to give the normalized similarity s'_ij, which represents the degree of similarity between the i-th word and the j-th region.

The attention score α_ij of the i-th word over the j-th region is then obtained from the normalized similarity s'_ij together with the focusing degree β_t. Weighting and summing the region features with the attention scores of each word gives the cross-modal context feature c_i corresponding to each word:

    c_i = Σ_j α_ij · v_j.

A linear layer FC_1 then fuses the i-th word feature w_i with its corresponding cross-modal context feature c_i into m_i, the feature obtained after fusing the information of the two modalities.

The global feature v_t^g of the image obtained in step S21 and the global feature w^g of the sentence description are likewise fused into the fused global feature m^g.

The fused global feature m^g and the fused features m_1, ..., m_n corresponding to the words are recorded together as the multi-modal fusion features M.
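A minimal sketch of step S22 for one image-text pair, assuming the focusing degree acts as a temperature on a softmax over the normalized word-region similarities; the clamping of negative similarities (borrowed from SCAN-style attention), the concatenation-based fusion and the function name are our assumptions, not the patented formula.

import torch
import torch.nn.functional as F

def focused_cross_modal_context(words, regions, beta, fuse_fc):
    """words: (n, d) word features; regions: (k, d) region features of one image;
    beta: scalar focusing degree; fuse_fc: nn.Linear(2*d, d) fusing word and context."""
    sim = words @ regions.t()                        # (n, k) word-region similarity
    sim = F.normalize(sim.clamp(min=0), p=2, dim=1)  # L2-normalise similarities per word (assumption)
    attn = F.softmax(beta * sim, dim=1)              # focusing degree sharpens or flattens the attention
    context = attn @ regions                         # (n, d) cross-modal context per word
    fused = fuse_fc(torch.cat([words, context], dim=-1))  # (n, d) fused word features
    return fused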
5. The image-text matching method based on a mixed focusing attention mechanism as claimed in claim 4, characterized in that in step S3 the attention coefficient matrix A is calculated from the multi-modal features M through two linear layers FC_q and FC_k with different parameters.

The gating signal g is calculated as

    g = σ( M u_g ),

where σ is the activation function and u_g is a learnable parameter vector. Each scalar element g_j of the gating signal g is regarded as the importance of the j-th feature. Before the softmax normalization of each row of A, the gating scores are separated by a threshold into important and unimportant features, i.e. each g_j is fixed to a hard score:

    g'_j = γ_2  if g_j ≥ τ,    g'_j = γ_1  if g_j < τ,

where τ is the threshold, γ_1 is the score assigned to unimportant local features and γ_2 is the score assigned to important local features; the gating vector is expressed as g' = (g'_1, ..., g'_m).

The j-th gating signal g'_j weights the j-th column of the attention score matrix A, expressed as

    A'_ij = g'_j · A_ij,

where A_ij denotes a single element of the attention score matrix.

Each row of the gated attention matrix A' is then normalized by the softmax function.

The updated global feature m'^g is obtained by weighting and summing the multi-modal features with the normalized attention scores and passing the result through a linear layer FC_2 and an activation function σ, where M is the multi-modal feature matrix obtained in the previous step.

The feature matrix updated by the gated self-attention mechanism is recorded as M', in which m'^g denotes the updated global feature and m'_1, ..., m'_n denote the updated local features.
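An illustrative PyTorch sketch of the gated self-attention of step S3; the scaled dot-product form of the attention coefficient matrix, the sigmoid gate, and the values of tau, gamma_low and gamma_high are assumptions rather than the patented formulas.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    """Sketch of step S3: self-attention over the multi-modal features, with a
    hard gating signal that keeps important columns and suppresses the rest."""
    def __init__(self, dim=1024, tau=0.5, gamma_low=0.1, gamma_high=1.0):
        super().__init__()
        self.fc_q = nn.Linear(dim, dim)
        self.fc_k = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)       # learnable parameter vector for the gating signal
        self.fc_out = nn.Linear(dim, dim)
        self.tau, self.gamma_low, self.gamma_high = tau, gamma_low, gamma_high

    def forward(self, feats):
        # feats: (m, dim) multi-modal features (fused global feature + per-word fused features)
        q, k = self.fc_q(feats), self.fc_k(feats)
        attn = q @ k.t() / feats.shape[-1] ** 0.5           # (m, m) attention coefficient matrix
        g = torch.sigmoid(self.gate(feats)).squeeze(-1)     # (m,) soft gating signal
        g_hard = torch.where(g >= self.tau,
                             torch.full_like(g, self.gamma_high),
                             torch.full_like(g, self.gamma_low))
        attn = attn * g_hard.unsqueeze(0)                   # weight the j-th column by the j-th gate
        attn = F.softmax(attn, dim=-1)                      # row-wise normalisation after gating
        return torch.relu(self.fc_out(attn @ feats))        # updated features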
6. The image-text matching method based on a mixed focusing attention mechanism as claimed in claim 5, characterized in that in step S4 the updated features M' obtained in step S3 are used to predict the score of the current image-text pair with a linear layer FC_3 followed by the sigmoid activation function σ, yielding the matching score s(I, T) between the image I and the text description T.
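A sketch of the step-S4 matching score head, assuming the linear layer is applied to the updated global feature; the class name and feature dimension are ours.

import torch
import torch.nn as nn

class MatchScore(nn.Module):
    """Sketch of step S4: predict the image-text matching score from the updated
    global feature with one linear layer followed by a sigmoid."""
    def __init__(self, dim=1024):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, global_feat):
        return torch.sigmoid(self.fc(global_feat)).squeeze(-1)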
7. The image-text matching method based on a mixed focusing attention mechanism as claimed in claim 6, characterized in that in step S5 the triplet loss function L is expressed as

    L = [ γ − s(I, T) + s(I, T_h) ]_+ + [ γ − s(I, T) + s(I_h, T) ]_+ ,

where [x]_+ = max(x, 0), γ is a threshold (margin), and I_h and T_h are the first and second most difficult (hardest) negative samples, respectively.

All linear layers are optimized with the triplet loss function according to

    θ' = θ − η · ∂L/∂θ,

where θ_1 and θ_2 denote the first and second parameter scalars inside a linear layer, θ is a parameter scalar before optimization, θ' is the optimized parameter scalar, η is the learning rate and ∂L/∂θ denotes the gradient computation.
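A sketch of the step-S5 triplet loss with the hardest negatives taken from within a batch, assuming a batch score matrix scores[i, j] = s(I_i, T_j) whose diagonal holds the matched pairs; the function name and the margin value are illustrative.

import torch
import torch.nn.functional as F

def hardest_negative_triplet_loss(scores, margin=0.2):
    """scores: (B, B) matching scores for all image-text pairs in the batch."""
    pos = scores.diag()                                                 # scores of matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg_t = scores.masked_fill(mask, float('-inf')).max(dim=1).values  # hardest negative text per image
    neg_i = scores.masked_fill(mask, float('-inf')).max(dim=0).values  # hardest negative image per text
    return (F.relu(margin - pos + neg_t) + F.relu(margin - pos + neg_i)).mean()

The parameter update in the claim is plain gradient descent, which in PyTorch corresponds to, for example, torch.optim.SGD(model.parameters(), lr=eta) together with loss.backward() and optimizer.step().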
8. An image-text matching system based on a mixed focusing attention mechanism, characterized by comprising the following functional modules:

a feature extraction module for image-text pairs, which extracts the features of the salient regions in the image and the features of each word in the natural-language description;

a cross-modal attention mechanism module, which uses a focused cross-modal attention mechanism to adaptively adjust the temperature coefficient of the attention mechanism for different pictures, thereby distinguishing effective from ineffective region features and realizing cross-modal context extraction and fusion of region-level and word-level features;

a gated self-attention mechanism module, which realizes the intra-modal fusion of region features and word features with a gated self-attention mechanism; a gating signal controls the self-attention matrix so that effective region and word features are selected adaptively, noisy and redundant regions are masked, and the discrimination between different region and word features is enhanced;

a matching score calculation module, which uses the cross-modal and intra-modal region and word features to calculate the matching score of the entire image and sentence.
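Purely for illustration, the four modules of this claim could be wired together as below; model.pool, model.fuse_fc, model.fuse_global, model.gated_attn and model.score are our own names referring to the sketches given after the earlier claims, and model.fuse_global is assumed to be a linear layer mapping the concatenated global features back to the feature dimension.

import torch

def match_one_pair(image_regions, word_feats, model):
    """image_regions: (k, d) salient-region features; word_feats: (n, d) word features."""
    v_g, w_g, beta = model.pool(image_regions.unsqueeze(0), word_feats)   # cross-modal attention module, global part
    fused_words = focused_cross_modal_context(word_feats, image_regions, beta[0], model.fuse_fc)
    m_g = model.fuse_global(torch.cat([v_g[0], w_g], dim=-1))             # fused global feature
    feats = torch.cat([m_g.unsqueeze(0), fused_words], dim=0)             # multi-modal feature matrix
    updated = model.gated_attn(feats)                                     # gated self-attention module
    return model.score(updated[0])                                        # matching score module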
9. The image-text matching system based on a mixed focusing attention mechanism as claimed in claim 8, characterized by further comprising:

a loss function optimization module, which optimizes all linear layers in the feature extraction module for image-text pairs, the cross-modal attention mechanism module, the gated self-attention mechanism module and the matching score calculation module with the triplet loss function, and then executes the working processes of these modules with the optimized parameters.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
CN202310424288.4A 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism Active CN116150418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310424288.4A CN116150418B (en) 2023-04-20 2023-04-20 Image-text matching method and system based on mixed focusing attention mechanism

Publications (2)

Publication Number Publication Date
CN116150418A true CN116150418A (en) 2023-05-23
CN116150418B CN116150418B (en) 2023-07-07

Family

ID=86352855


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176017A1 (en) * 2017-03-24 2018-09-27 Revealit Corporation Method, system, and apparatus for identifying and revealing selected objects from video
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method
CN112966135A (en) * 2021-02-05 2021-06-15 华中科技大学 Image-text retrieval method and system based on attention mechanism and gate control mechanism
CN113065012A (en) * 2021-03-17 2021-07-02 山东省人工智能研究院 Image-text analysis method based on multi-mode dynamic interaction mechanism
CN114155429A (en) * 2021-10-09 2022-03-08 信阳学院 Reservoir earth surface temperature prediction method based on space-time bidirectional attention mechanism
CN114461821A (en) * 2022-02-24 2022-05-10 中南大学 Cross-modal image-text inter-searching method based on self-attention reasoning
CN114492646A (en) * 2022-01-28 2022-05-13 北京邮电大学 Image-text matching method based on cross-modal mutual attention mechanism
CN114691986A (en) * 2022-03-21 2022-07-01 合肥工业大学 Cross-modal retrieval method based on subspace adaptive spacing and storage medium
CN115017266A (en) * 2022-06-23 2022-09-06 天津理工大学 Scene text retrieval model and method based on text detection and semantic matching and computer equipment

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MENGXIAO TIAN et al.: "Adaptive Latent Graph Representation Learning for Image-Text Matching", IEEE Transactions on Image Processing, vol. 32, pages 471 - 482, XP011931400, DOI: 10.1109/TIP.2022.3229631 *
XI SHAO et al.: "Automatic Scene Recognition Based on Constructed Knowledge Space Learning", IEEE Access, vol. 7, pages 102902 - 102910, XP011738447, DOI: 10.1109/ACCESS.2019.2919342 *
QU Leigang: "Research on Image-Text Retrieval Based on Dynamic Modality Interaction Modeling", China Masters' Theses Full-text Database, pages 138 - 1313 *
GAN Yibo et al.: "Image Style Transfer with Joint Modeling of Inter-domain and Intra-domain Information", Journal of Computer-Aided Design & Computer Graphics, vol. 34, no. 10, pages 1489 - 1496 *
ZHAO Xiaohu et al.: "Image Semantic Description Algorithm Based on Global-Local Features and an Adaptive Attention Mechanism", Journal of Zhejiang University (Engineering Science), vol. 54, no. 1, pages 126 - 134 *
SHAO Xi et al.: "Research on a Question Answering System Combining Bi-LSTM and an Attention Model", Computer Applications and Software, vol. 37, no. 10, pages 52 - 56 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant