CN110147457B - Image-text matching method, device, storage medium and equipment


Info

Publication number
CN110147457B
Authority
CN
China
Prior art keywords
feature
text
image
candidate
vector
Prior art date
Legal status
Active
Application number
CN201910152063.1A
Other languages
Chinese (zh)
Other versions
CN110147457A (en)
Inventor
贲有成
吴航昊
袁春
周杰
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910152063.1A
Publication of CN110147457A
Application granted
Publication of CN110147457B
Status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application discloses an image-text matching method, an image-text matching device, a storage medium and equipment, and belongs to the technical field of computers. The method comprises the following steps: acquiring an image and a text to be matched; generating a candidate instance feature set according to the image; aggregating candidate instance features in the candidate instance feature set by using a self-attention mechanism to obtain an instance feature set, wherein each instance feature in the instance feature set corresponds to an object in the image; encoding the text to obtain a text vector; and calculating a degree of matching between the image and the text according to the instance feature set and the text vector. The embodiment of the application can reduce the implementation difficulty of image-text matching and improve the accuracy of image-text matching.

Description

Image-text matching method, device, storage medium and equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an image-text matching method, an image-text matching device, a storage medium and image-text matching equipment.
Background
Cross-modal retrieval is a new type of retrieval that enables data retrieval across different modalities. Taking the mutual retrieval of images and text as an example, a user may input an image to retrieve descriptive text for the image, or input a text to retrieve the images described by that sentence.
Taking the example of retrieving text from an image, the server may generate a retrieval result according to the degree of matching between the retrieved text and the image. When calculating the matching degree of the text and the image, the server extracts an example feature set of the image by using a trained object detector; generating a text vector of the text by using the recurrent neural network; a degree of matching between the image and the text is calculated from the set of instance features and the text vector using a matching model.
The difficulty of training the object detector is high because the class and position information of all the examples in the image need to be marked on each image when the object detector is trained; in addition, the object detector and the matching model are trained separately, so that example features identified by the object detector may not be suitable for matching text by the matching model, thereby affecting the accuracy of the image-text matching.
Disclosure of Invention
The embodiment of the application provides an image-text matching method, an image-text matching device, a storage medium and image-text matching equipment, which are used for solving the problems that the training difficulty of an object detector is high, the identified example features are not suitable for matching texts, and the accuracy of image-text matching is affected. The technical scheme is as follows:
In one aspect, a method for matching graphics and text is provided, the method comprising:
acquiring an image and a text to be matched;
generating a candidate instance feature set according to the image;
aggregating candidate instance features in the candidate instance feature set by using a self-attention mechanism to obtain an instance feature set, wherein each instance feature in the instance feature set corresponds to an object or region in the image;
encoding the text to obtain a text vector;
a degree of matching between the image and the text is calculated from the set of example features and the text vector.
In one aspect, there is provided a graphic matching apparatus, the apparatus comprising:
the acquisition module is used for acquiring the images and the texts to be matched;
the generation module is used for generating a candidate instance feature set according to the image obtained by the acquisition module;
the aggregation module is used for aggregating the candidate example features in the candidate example feature set generated by the generation module by utilizing a self-attention mechanism to obtain an example feature set, and each example feature in the example feature set corresponds to an object or region in the image;
The coding module is used for coding the text obtained by the obtaining module to obtain a text vector;
and the calculating module is used for calculating the matching degree between the image and the text according to the example feature set obtained by the aggregation module and the text vector obtained by the encoding module.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a code set, or an instruction set is stored, the instruction, program, code set, or instruction set being loaded and executed by a processor to implement the image-text matching method described above.
In one aspect, an image-text matching device is provided, comprising a processor and a memory, in which at least one instruction is stored, the instruction being loaded and executed by the processor to implement the image-text matching method described above.
The beneficial effects of the technical scheme provided by the embodiment of the application at least comprise:
According to the method, a candidate instance feature set is generated from the image, the candidate instance features in the set are aggregated using a self-attention mechanism to obtain an instance feature set, and the matching degree between the image and the text is then calculated from the instance feature set and the text vector. Because the self-attention mechanism aggregates instance features through the correlation between candidate instance features, the instance feature set of the image does not need to be obtained through an object detector. This avoids the problem that training an object detector is difficult because the category and position information of all instances must be annotated on every training image, and thus achieves the effect of simplifying the implementation of image-text matching. It also avoids the problem that an object detector outputs position information in addition to semantic information, and since this position information does not help image-text matching, the instance features recognized by the object detector are not well suited for matching text, which affects the accuracy of image-text matching; the method thus achieves the effect of improving the accuracy of image-text matching.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a retrieval result according to some exemplary embodiments;
FIG. 2 is a schematic diagram of an image-text matching system according to some exemplary embodiments;
FIG. 3 is a method flowchart of an image-text matching method provided in one embodiment of the present application;
FIG. 4 is a method flowchart of an image-text matching method provided in another embodiment of the present application;
FIG. 5 is a block diagram of an image-text matching model according to another embodiment of the present application;
FIG. 6 is a block diagram of an image-text matching device according to one embodiment of the present application;
FIG. 7 is a schematic structural diagram of a server according to another embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Visual content recognition and natural language understanding are two major challenges in the field of artificial intelligence. A currently popular research direction is to find the connection between images and text and then build applications on top of it, for example generating descriptive text from images, visual question answering, generating images from text, mutual retrieval of images and text, and so on.
The present application relates to the mutual retrieval of images and text, whose primary purpose is to search for matching images given a text, or to query matching text given an image. Several possible application scenarios are described below according to the different presentation forms of images and text.
1) Mutual retrieval of images and text
The text may be one sentence or a combination of multiple sentences with complete semantics. The sentence referred to herein may be a sentence in any natural language.
When retrieving text using an image, an image may be input, and text matching the visual semantics of the image may be retrieved from a text library containing at least one text. For ease of understanding, 4 images from the Flickr30K dataset can be used as input, the 5 texts closest to the visual semantics of each image can be queried, and each image can be displayed together with the 5 texts retrieved for it, which gives the retrieval result shown in FIG. 1. It should be noted that a text found by the server may match the image (i.e., the retrieval result is accurate) or may not match the image (i.e., the retrieval result is wrong); in FIG. 1, texts that match the image are marked with a check mark, and texts that do not match the image are marked with a cross.
When retrieving images using text, a text may be entered and an image matching the text semantics of the text may be retrieved from an image library comprising at least one image.
2) Mutual retrieval of images and tags
The tag may be a word or a combination of words. The vocabulary here may be vocabulary in any natural language.
When retrieving tags using images, an image may be input and tags matching the visual semantics of the image may be retrieved from a tag library containing at least one tag. If the first image in fig. 1 is used as input, the retrieved tag may be beach volleyball, bikini, sports, etc.
When retrieving an image using a tag, a tag may be entered and an image matching the tag's tag semantics may be retrieved from an image library containing at least one image.
3) Mutual retrieval of video and text
The text may be one sentence or a combination of multiple sentences with complete semantics. The sentence referred to herein may be a sentence in any natural language.
When retrieving text using video, a video may be input, each image frame extracted from the video, each image frame taken as an input image, and text matching the visual semantics of the image retrieved from a text library containing at least one text.
When retrieving video using text, a text may be entered and video containing image frames that match the text semantics of the text may be retrieved from a video library containing at least one piece of video.
4) Mutual retrieval of video and tags
The tag may be a word or a combination of words. The vocabulary here may be vocabulary in any natural language.
When the labels are retrieved by using the video, a video may be input, each image frame is extracted from the video, each image frame is taken as an input image, and the labels matching the visual semantics of the image are retrieved from a label library containing at least one label.
When retrieving video using tags, a tag may be entered and video containing image frames that match the tag semantics of the tag may be retrieved from a video library containing at least one piece of video.
It should be noted that the embodiments of the present application may be implemented in a terminal, implemented in a server, or implemented by the terminal and the server together. As shown in FIG. 2, taking retrieval of images from text as an example, the terminal 21 is configured to generate a text and send it to the server 22, and the server 22 retrieves images based on the text and sends the retrieved images to the terminal 21 for display. Optionally, the terminal 21 and the server 22 are connected through a communication network, which may be a wired network or a wireless network; this is not limited in the embodiments of the present application.
Illustratively, a machine learning model for matching graphics and texts is stored in the server 22, after a user inputs a text "A woman is playing volleyball" to be searched in the terminal 21, the terminal 21 sends the text to the server 22, the server 22 reads each image from the image library, the matching degree of each image and the text is calculated through the machine learning model, and the image matched with the text is sent to the terminal 21 for displaying.
Referring to fig. 3, a method flowchart of a graph-text matching method according to an embodiment of the present application is shown. The image-text matching method comprises the following steps:
in step 301, an image and text to be matched are acquired.
Corresponding to the four application scenarios described above, the image and text to be matched may be acquired in the following four ways.
1) Mutual retrieval of images and text
When retrieving text using images, one text may be sequentially retrieved from a text library containing at least one text, and for each text retrieved, the text and the input image are taken as a set of images and texts to be matched.
When retrieving images using text, one image may be sequentially acquired from an image library containing at least one image, and for each acquired image, the image and the input text are used as a set of images and texts to be matched.
2) Mutual retrieval of images and tags
When searching for a tag by using an image, one tag can be sequentially obtained from a tag library containing at least one tag, and for each obtained tag, the tag and the input image are used as a group of images and texts to be matched.
When retrieving images using tags, one image may be sequentially acquired from an image library containing at least one image, and for each acquired image, the image and the input tag are used as a set of images and text to be matched.
3) Mutual retrieval of video and text
When the text is searched by using the video, since the content of f (f is a positive integer) continuous video frames in the video is not greatly different, the input video can be sampled every f video frames to obtain each image frame, when each image frame is taken as an input image, one text is sequentially obtained from a text library containing at least one text, and for each obtained text, the text and the input image are taken as a group of images and texts to be matched. Subsequently, the average or maximum value of the matching degree of all image frames in the video and a text can be used as the matching degree of the video and the text.
When retrieving video using text, a video segment may be acquired in sequence from a video library containing at least one video segment. For each acquired video segment, since the content of f consecutive video frames (f is a positive integer) does not differ greatly, the video may be sampled every f video frames to obtain image frames, and each image frame together with the input text is used as a group of image and text to be matched. Subsequently, the average or maximum value of the matching degrees between all image frames in the video and the text may be used as the matching degree between the video and the text.
4) Mutual retrieval of video and tags
When the labels are searched by using the video, since the content of f (f is a positive integer) continuous video frames in the video is not greatly different, the input video can be sampled every f video frames to obtain each image frame, when each image frame is taken as an input image, one label is sequentially obtained from a label library containing at least one label, and for each obtained label, the label and the input image are taken as a group of images and texts to be matched. Subsequently, the average value or the maximum value of the matching degree of all the image frames in the video and one label can be used as the matching degree of the video and the label.
When retrieving video using a tag, a video segment may be acquired in sequence from a video library containing at least one video segment. For each acquired video segment, since the content of f consecutive video frames (f is a positive integer) does not differ greatly, the video may be sampled every f video frames to obtain image frames, and each image frame together with the input tag is used as a group of image and text to be matched. Subsequently, the average or maximum value of the matching degrees between all image frames in the video and the tag may be used as the matching degree between the video and the tag.
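The frame-sampling and score-aggregation procedure described in scenarios 3) and 4) can be illustrated with a short sketch. This is a minimal Python illustration under assumed interfaces: frames stands for a list of decoded video frames, match_fn stands for the image-text matching model of this application, and the choice between mean and maximum aggregation follows the alternatives mentioned above.

```python
def video_level_matching_degree(frames, query, f, match_fn, use_max=False):
    """Sample one frame every f frames, score each sampled frame against the
    query (a text or a tag), and aggregate the frame-level matching degrees
    into a video-level matching degree by mean or by maximum."""
    sampled = frames[::f]                                   # one image frame every f frames
    scores = [match_fn(frame, query) for frame in sampled]  # per-frame matching degrees
    return max(scores) if use_max else sum(scores) / len(scores)
```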
After obtaining a set of images and texts to be matched, the images may be processed through steps 302-303, and the texts may be processed through step 304, which is not limited in the sequence of processing the images and texts in this embodiment, that is, the sequence of execution between steps 302-303 and step 304 in this embodiment is not limited.
When a text is searched by using one image, the image may be processed once, and the processing result may be matched with the processing result of each text; that is, steps 301-305 are performed when the present method is first implemented, and steps 301, 304-305 are performed when the present method is subsequently implemented. When searching an image by using a text, the text can be processed once, and the processing result is matched with the processing result of each image; that is, steps 301-305 are performed when the present method is first implemented, and steps 301-303 and 305 are performed when the present method is subsequently implemented.
Step 302, a candidate instance feature set is generated from the image.
The set of candidate example features includes at least one candidate example feature, and each candidate example feature corresponds to a region in a feature map of the image.
In this embodiment, the image may be input into a convolutional neural network to obtain a feature map, and the candidate instance feature set may then be obtained from the feature map, as described in detail below. The convolutional neural network may be ResNet (Residual Network), VGGNet (Visual Geometry Group Network), GoogLeNet, AlexNet, and the like, which is not limited in this embodiment.
Step 303, aggregating candidate instance features in the candidate instance feature set using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image.
The set of example features includes at least one example feature, and each example feature corresponds to one object or region in the image. Taking the first image in fig. 1 as an example, example features may be people, nets, auditoriums, etc.
Before explaining the self-attention mechanism, the attention mechanism is explained. The attention mechanism imitates the human visual mechanism: human vision rapidly scans the global image to obtain the target region that needs attention, i.e., the focus of attention, and then devotes more attention resources to this target region to obtain more detailed information about the target of interest while suppressing other useless information. The attention mechanism is thus a mechanism that aligns internal experience with external perception to increase the fineness of observation of a target region; it can rapidly extract important features from sparse data and is therefore widely used. The self-attention mechanism is an improvement of the attention mechanism that reduces the dependence on external information and is better at capturing the internal dependencies of data or features.
Since each candidate instance feature corresponds to a region in the feature map of the image, similar visual semantics can be aggregated based on the correlation between the candidate instance features using a self-attention mechanism to obtain individual objects or regions in the image that need attention, i.e., the set of instance features.
And 304, encoding the text to obtain a text vector.
In this embodiment, the text may be input into a recurrent neural network to obtain the text vector. The recurrent neural network may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, a GRU (Gated Recurrent Unit), and the like, which is not limited in this embodiment.
Step 305, calculating the matching degree between the image and the text according to the instance feature set and the text vector.
In this embodiment, the instance feature set and the text vector may be mapped into a common semantic space, and then global similarity between the instance feature set and the text vector is calculated in the semantic space, and the matching degree between the image and the text is measured according to the global similarity.
After the matching degree between the image and the text is obtained, the matching degrees larger than a preset threshold may be selected, and the images or texts corresponding to those matching degrees determined as the retrieval result; alternatively, the matching degrees of all images or texts may be ranked, and the top-ranked images or texts determined as the retrieval result.
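As a minimal illustration of the two result-selection strategies just described (threshold filtering versus ranking), the following Python sketch may be used; the threshold value and the number of retained results are placeholders rather than values prescribed by this application.

```python
def select_by_threshold(candidates, matching_degrees, threshold):
    """Keep every candidate whose matching degree exceeds the preset threshold."""
    return [c for c, d in zip(candidates, matching_degrees) if d > threshold]

def select_top_ranked(candidates, matching_degrees, top_k):
    """Rank all candidates by matching degree and keep the top-ranked ones."""
    ranked = sorted(zip(candidates, matching_degrees), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```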
In summary, in the image-text matching method provided by the embodiment of the application, a candidate instance feature set is generated from the image, the candidate instance features in the set are aggregated using a self-attention mechanism to obtain an instance feature set, and the matching degree between the image and the text is then calculated from the instance feature set and the text vector. Because the self-attention mechanism aggregates instance features through the correlation between candidate instance features, the instance feature set of the image does not need to be obtained through an object detector. This avoids the problem that training an object detector is difficult because the category and position information of all instances must be annotated on every training image, and thus achieves the effect of simplifying the implementation of image-text matching. It also avoids the problem that an object detector outputs position information in addition to semantic information, and since this position information does not help image-text matching, the instance features recognized by the object detector are not well suited for matching text, which affects the accuracy of image-text matching; the method thus achieves the effect of improving the accuracy of image-text matching.
In addition, the fact that an object detector outputs position information that is useless for matching shows that it does not fully take the characteristics of cross-modal retrieval into account; recognizing that the object detector is problematic for cross-modal retrieval is itself a non-trivial insight.
In addition, because the receptive field of a convolution kernel has a fixed size, self-attention mechanisms are usually used to capture long-range dependencies between image features. In this application, the self-attention mechanism is instead used to aggregate similar semantic information to obtain the instance feature set, which differs from its conventional role, so introducing a self-attention mechanism into cross-modal retrieval is not an obvious step from this point of view.
Referring to fig. 4, a method flowchart of a graph-text matching method according to another embodiment of the present application is shown. The image-text matching method comprises the following steps:
in step 401, an image and text to be matched are acquired.
Corresponding to the above four application scenarios, the image and text to be matched may be obtained in four ways, which are detailed in step 301 and are not described herein.
After obtaining a set of images and texts to be matched, the images may be processed through steps 402-404, and the texts may be processed through step 405, which is not limited in the sequence of processing the images and texts in this embodiment, that is, the sequence of execution between steps 402-404 and step 405 is not limited in this embodiment.
When a text is searched by using one image, the image may be processed once, and the processing result may be matched with the processing result of each text; that is, steps 401-406 are performed when the method is first implemented, and steps 401, 405-406 are performed when the method is subsequently implemented. When searching an image by using a text, the text can be processed once, and the processing result is matched with the processing result of each image; that is, steps 401-406 are performed when the present method is first implemented, and steps 401-404 and 405 are performed when the present method is subsequently implemented.
Step 402, inputting the image into a convolutional neural network, and acquiring a feature map output by the convolutional neural network.
The convolutional neural network may be ResNet, VGGNet, GoogLeNet, AlexNet, etc., and the present embodiment is not limited thereto.
Optionally, the convolutional neural network may be trained using the data set, and then the image is input into the trained convolutional neural network to obtain the feature map. For example, the convolutional neural network may be trained using an ImageNet dataset, and the embodiment is not limited.
In this embodiment, the output result of the convolutional layer of the convolutional neural network is referred to as a feature map.
In this embodiment, the output of at least one convolutional layer in the convolutional neural network may be extracted to obtain at least one feature map. The present embodiment does not limit the number of feature maps nor the convolution layers that output the feature maps.
Referring to FIG. 5, the convolutional neural network is ResNet-152, whose convolutional layers are conv1, conv2_x, conv3_x, conv4_x and conv5_x. Assuming the outputs of conv3_x, conv4_x and conv5_x are selected, conv3_x outputs a feature map with a scale of 28×28×512, conv4_x outputs a feature map with a scale of 14×14×1024, and conv5_x outputs a feature map with a scale of 7×7×2048.
It should be noted that, before the image is input into the convolutional neural network, the image may be preprocessed so that it satisfies the input conditions of the convolutional neural network. Taking the images in the MS-COCO and Flickr30K datasets as examples, each image may be randomly cropped and scaled to a size of 224×224.
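The extraction of multi-scale feature maps from ResNet-152 can be sketched as follows. This is a minimal PyTorch/torchvision illustration, not the exact implementation of the application; it assumes torchvision's ResNet naming, in which layer2, layer3 and layer4 correspond to conv3_x, conv4_x and conv5_x, and it omits loading ImageNet-pretrained weights and the random cropping mentioned above.

```python
import torch
import torchvision.models as models

resnet = models.resnet152()  # in practice, initialize with ImageNet-pretrained weights

def extract_feature_maps(images):
    """images: (B, 3, 224, 224) tensor that has already been cropped and normalized.
    Returns the conv3_x / conv4_x / conv5_x feature maps used as candidate instance features."""
    x = resnet.conv1(images)
    x = resnet.bn1(x)
    x = resnet.relu(x)
    x = resnet.maxpool(x)
    x = resnet.layer1(x)      # (B, 256, 56, 56)
    c3 = resnet.layer2(x)     # (B, 512, 28, 28)   ~ conv3_x
    c4 = resnet.layer3(c3)    # (B, 1024, 14, 14)  ~ conv4_x
    c5 = resnet.layer4(c4)    # (B, 2048, 7, 7)    ~ conv5_x
    return c3, c4, c5
```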
Step 403, dividing the feature map, and forming the candidate instance feature obtained after division into a candidate instance feature set.
The set of candidate example features includes at least one candidate example feature, and each candidate example feature corresponds to a region in a feature map of the image.
If the convolutional neural network outputs a feature map, the feature map can be uniformly divided into k candidate instance areas, each candidate instance area is a candidate instance feature, a candidate instance feature set is obtained, and k is a positive integer greater than or equal to 2. Wherein each candidate instance feature in the candidate instance feature set corresponds to a region in the image. If the convolutional neural network outputs at least two feature maps of different scales, a candidate example feature set can be obtained for each feature map in the manner described above.
When the feature map is divided into k candidate instance areas in the above manner, the space occupied by one object in the image may correspond to several candidate instance areas, and the instance feature corresponding to that object then also involves several candidate instance areas; for this reason, the feature corresponding to a candidate instance area is referred to as a candidate instance feature.
Given an input image I, candidate instance features may be defined on the feature map as follows: for a feature map with a spatial size of M×N and a channel number of C, the feature values at each spatial position are taken along the channel dimension, giving k = M×N feature vectors, i.e., U = {u_1, …, u_k}, u_i ∈ R^C. Since each of these feature vectors corresponds to a particular region in image I, they can be regarded as candidate instance features. C may be 2048 or another value, which is not limited in this embodiment.
For ease of understanding, taking a feature map with a scale of 3×3 and a channel number of 512 as an example, if the feature map is equally divided into 9 candidate instance areas based on spatial dimensions (width and height), each candidate instance area is a feature vector of 512 dimensions.
Step 404, aggregating candidate instance features in the candidate instance feature set using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image.
The set of example features includes at least one example feature, and each example feature corresponds to one object or region in the image. Taking the first image in fig. 1 as an example, example features may be people, nets, auditoriums, etc.
The explanation of the self-attention mechanism is detailed in step 303 and is not described here.
Since each candidate instance feature corresponds to a region in the feature map of the image, similar visual semantics can be aggregated based on the correlation between the candidate instance features using a self-attention mechanism to obtain individual objects or regions in the image that need attention, i.e., the set of instance features. That is, for an ith candidate example feature in the set of candidate example features, a correlation between the ith candidate example feature and the remaining candidate example features is calculated using a self-attention mechanism, and example features based on the ith candidate example feature are calculated from the correlation.
In this embodiment, the instance features may be calculated based on each candidate instance feature. For each candidate instance feature, the similarity between itself and other candidate instance features may be calculated using a self-attention mechanism and converted to weights to represent the correlation between the candidate instance features by the weights, such that the aggregate instance feature is essentially a weighted sum operation of all candidate instance features.
How to design a suitable self-attention mechanism is an implementation difficulty of the present application. Two implementations of the self-attention mechanism are described below; in both, the correlation is represented by weights.
In a first implementation, for an ith candidate example feature in the candidate example feature set, calculating a cosine similarity between the ith candidate example feature and the jth candidate example feature, and calculating a weight of the jth candidate example feature according to the cosine similarity, where the weight is used to represent a degree of attention to the jth candidate example feature when aggregating other candidate examples based on the ith candidate example feature, and i and j are positive integers; multiplying each candidate instance feature in the candidate instance feature set by a corresponding weight, and adding the obtained products to obtain the instance feature based on the ith candidate instance feature.
For the input image I, the candidate instance feature set U corresponding to the feature map is obtained as described in step 403, and the cosine similarity between all candidate instance features in U is calculated, i.e.,

s_ij = (u_i^T u_j) / (‖u_i‖ ‖u_j‖), i, j ∈ [1, k],

where s_ij represents the cosine similarity between the i-th candidate instance feature and the j-th candidate instance feature.

The weight of the j-th candidate instance feature is then calculated from this cosine similarity. Before calculating the weight, the cosine similarity may be normalized, i.e.,

ŝ_ij = [s_ij]_+ / ( Σ_{i=1..k} [s_ij]_+² )^{1/2},

where [x]_+ ≡ max(x, 0).

The weight of the j-th candidate instance feature is calculated from the normalized cosine similarity; it represents the degree of attention paid to the j-th candidate instance feature when aggregating around the i-th candidate instance feature, i.e.,

w_ij = exp(λ_1 ŝ_ij) / Σ_{j=1..k} exp(λ_1 ŝ_ij),

where λ_1 is a hyper-parameter for controlling the effect of the aggregation.

In one possible implementation, λ_1 = 9.

For the i-th candidate instance feature u_i, the other related candidate instance features can be aggregated by weighted summation to obtain the instance feature aggregated around u_i, i.e.,

a_i = Σ_{j=1..k} w_ij u_j.
In this embodiment, the self-Attention mechanism in the first implementation may be referred to as a deterministic self-Attention mechanism (DSA), which measures similarity between candidate instance features using cosine similarity between the candidate instance features, without introducing additional learning parameters.
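The deterministic self-attention aggregation described above can be sketched in PyTorch as follows. The sketch follows the cosine-similarity, normalization, weighting and weighted-sum steps of this implementation; the batch dimension, the flattening of the feature map into k candidate instance features, and the small epsilon added for numerical stability are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def deterministic_self_attention(feature_map, lambda1=9.0):
    """Aggregate candidate instance features with cosine-similarity-based weights (DSA).

    feature_map: (B, C, M, N) tensor, e.g. one convolutional layer output.
    Returns instance features of shape (B, M*N, C)."""
    B, C, M, N = feature_map.shape
    u = feature_map.flatten(2).transpose(1, 2)                  # (B, k, C), k = M*N candidates
    u_norm = F.normalize(u, dim=-1)
    s = torch.bmm(u_norm, u_norm.transpose(1, 2))               # (B, k, k) cosine similarities
    s = s.clamp(min=0)                                          # [x]_+
    s = s / (s.pow(2).sum(dim=1, keepdim=True).sqrt() + 1e-8)   # normalize over i for each j
    w = torch.softmax(lambda1 * s, dim=-1)                      # weights over j for each i
    return torch.bmm(w, u)                                      # a_i = sum_j w_ij * u_j
```

As noted above, this variant introduces no additional learnable parameters.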
In a second implementation, mapping each candidate instance feature in the candidate instance feature set into a first feature space, a second feature space, and a third feature space, respectively; for the ith candidate example feature in the candidate example feature set, calculating the weight of the jth candidate example feature according to the jth candidate example feature in the first feature space and the ith candidate example feature in the second feature space, wherein the weight is used for representing the attention degree of the jth candidate example feature when other candidate examples are aggregated based on the ith candidate example feature, and i and j are positive integers; multiplying each candidate example feature in the third feature space by a corresponding weight, adding the obtained products, and performing residual fitting to obtain an example feature based on the ith candidate example feature.
For the input image I, the feature map x ∈ R^{C×M×N} can be reshaped into a two-dimensional matrix u ∈ R^{C×k}, k = M×N, which corresponds to the k candidate instance features described in step 403. To model the weight matrix β ∈ R^{k×k} between the candidate instance features, u is first mapped into the first feature space θ and the second feature space φ: θ(u) = W_θ u, φ(u) = W_φ u, where W_θ and W_φ are learnable model parameters. In one possible implementation, these mappings can be realized with 1×1 convolution operations.

For the i-th candidate instance feature u_i, the weight of the j-th candidate instance feature is calculated; it represents the degree of attention paid to the j-th candidate instance feature when aggregating around the i-th candidate instance feature, i.e.,

β_ji = exp(s_ji) / Σ_{j=1..k} exp(s_ji),

where s_ji = θ(u_j)^T φ(u_i), i, j ∈ [1, k]; θ(u_j) is the j-th candidate instance feature in the first feature space and φ(u_i) is the i-th candidate instance feature in the second feature space.

Next, u is mapped into the third feature space g: g(u) = W_g u, where W_g ∈ R^{C×C} is a learnable model parameter. In one possible implementation, this mapping can also be realized with a 1×1 convolution operation.

For the i-th candidate instance feature u_i, the other related candidate instance features can be aggregated by weighted summation to obtain the instance feature aggregated around u_i, i.e.,

a_i = Σ_{j=1..k} β_ji g(u_j).

In addition, residual fitting can be applied to the instance feature a_i obtained by the weighted summation, i.e., a residual fitting module is added on top of a_i to obtain the final instance feature

y_i = η a_i + u_i,

where η is a learnable model parameter.
In this embodiment, the Self-Attention mechanism in the second implementation may be referred to as an Adaptive Self-Attention mechanism (ASA), which adaptively models the correlation between candidate instances based on a neural network, and requires the introduction of additional learning parameters.
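The adaptive self-attention aggregation can be sketched in PyTorch as follows. The 1×1 convolutions for the θ, φ and g mappings follow the possible implementation mentioned above; the reduced channel width of the θ/φ spaces and the zero initialization of the residual scale η are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class AdaptiveSelfAttention(nn.Module):
    """Adaptive self-attention (ASA) aggregation over candidate instance features."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 8, 1)     # assumed channel reduction for theta/phi
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)
        self.eta = nn.Parameter(torch.zeros(1))        # learnable residual-fitting parameter

    def forward(self, x):
        """x: (B, C, M, N) feature map; returns instance features of the same shape."""
        B, C, M, N = x.shape
        u = x.flatten(2)                               # (B, C, k), k = M*N
        theta_u = self.theta(x).flatten(2)             # (B, C', k)
        phi_u = self.phi(x).flatten(2)                 # (B, C', k)
        g_u = self.g(x).flatten(2)                     # (B, C, k)
        s = torch.bmm(theta_u.transpose(1, 2), phi_u)  # s[b, j, i] = theta(u_j)^T phi(u_i)
        beta = torch.softmax(s, dim=1)                 # normalize over j for each i
        a = torch.bmm(g_u, beta)                       # a[b, :, i] = sum_j beta_ji * g(u_j)
        y = self.eta * a + u                           # residual fitting: y_i = eta * a_i + u_i
        return y.view(B, C, M, N)
```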
It should be noted that, when the convolutional neural network outputs one feature map, an instance feature set may be obtained by either of the two implementations described above; when the convolutional neural network outputs n feature maps of different scales, n ≥ 2, the self-attention mechanism does not change the scale of the feature maps, so the n feature maps of different scales need to be fused.
The scale of a feature map determines the number of candidate instance features, and the instance features are aggregated from the candidate instance features, so the scale of the feature map indirectly determines the number of instance features. That is, the larger the scale of the feature map, the larger the number of encoded instance features. The more instance features there are, the harder it is to align them with the word features of the text in the semantic space; therefore, in this embodiment, the number of instance features needs to be reduced when feature maps of different scales are fused.
In one possible implementation, the feature maps of different scales may be fused by downsampling. In this case, for the m-th feature map of the n feature maps, 1 ≤ m < n, the scale of the (m+1)-th feature map is obtained; the instance feature set generated from the m-th feature map is downsampled according to the scale of the (m+1)-th feature map, and the result is combined with the instance feature set generated from the (m+1)-th feature map; the combined instance feature set is determined as the instance feature set finally generated from the (m+1)-th feature map.
Referring to fig. 5, the dimensions of the three feature maps are 28×28×512, 14×14×1024 and 7×7×2048, respectively, for the first feature map, the feature map of 28×28×512 can be downsampled (maxpool) to 14×14×512, and then combined (Concat) with the feature map of 14×14×1024 in the channel dimension to obtain a feature map of 14×14×1536; the 14×14×1536 feature map is then downsampled (maxpool) to 7×7×1536 and then concatenated (Concat) with the 7×7×2048 feature map in the channel dimension to yield a 7×7×3584 feature map.
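The downsample-and-concatenate fusion of the 28×28×512, 14×14×1024 and 7×7×2048 feature maps can be sketched as follows (a minimal PyTorch illustration; it assumes the feature maps are (B, C, H, W) tensors whose spatial sizes halve exactly between scales).

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(c3, c4, c5):
    """Fuse three feature maps of decreasing scale by max-pool downsampling and
    channel-wise concatenation, as in the example above."""
    x = F.max_pool2d(c3, kernel_size=2)   # 28x28x512  -> 14x14x512
    x = torch.cat([x, c4], dim=1)         # concat on channels -> 14x14x1536
    x = F.max_pool2d(x, kernel_size=2)    # 14x14x1536 -> 7x7x1536
    x = torch.cat([x, c5], dim=1)         # concat on channels -> 7x7x3584
    return x
```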
The first point to be described is that, because the feature map corresponds to the instance feature set, downsampling the feature map amounts to downsampling the instance feature set.
The second point to be described is that FIG. 5 illustrates downsampling by max pooling (maxpool); downsampling may also be performed by average pooling, convolution, or the like, which is not limited in this embodiment.
The third point to be described is that, after scale fusion, the instance feature set can be mapped into a shared semantic space of dimension D through a fully connected layer, i.e., v_i = W_v a_i + b_v, where W_v and b_v are the model parameters of the fully connected layer. D may be 1024 or another value, which is not limited in this embodiment.
The fourth point to be described is that, if an object detector is used to extract instance features, the multi-scale feature maps do not need to be fused, because the object detector already has a multi-scale design in its detection framework. Because multi-scale feature map fusion is not considered when an object detector is used to extract instance features, it is not an obvious step to add a multi-scale feature map fusion process to enhance visual semantics when the self-attention mechanism is used to obtain instance features.
The fifth point to be described is that, after the example feature set of the image is obtained by using the self-attention mechanism, the example feature set may be applied to other applications, for example, generating descriptive text from the image, visual questions and answers, generating the image from the text, and the like, which is not limited in this embodiment.
The sixth point to be described is that, when the image is an image frame in a video, in addition to extracting the feature map of the image frame, dynamic association information between the image frames may also be extracted, and the instance feature set may then be generated from the feature map and the dynamic association information, which is not limited in this embodiment.
And step 405, encoding the text to obtain a text vector.
The recurrent neural network may be RNN, LSTM, GRU, etc., and the embodiment is not limited thereto. Taking the bidirectional GRU as an example, the encoding process includes the following steps:
1) Segment the sentence to obtain r words, where r is a positive integer.
Given an input sentence S = {w_1, …, w_r}, most of the potential instances are nouns or noun phrases, and the goal here is to learn a vector for each word. Therefore, word segmentation is applied to the sentence to obtain r words. There are many ways to implement word segmentation, which is not limited here.
2) For the t-th of the r words, encode it according to its position in the vocabulary to obtain the t-th word feature vector, where 1 ≤ t ≤ r.
After word segmentation, the word vectors can be embedded using the bidirectional GRU in combination with the context of the sentence. For the t-th word w_t in the sentence, a one-hot encoding Π_t is first used to identify its position in the vocabulary, and Π_t is then mapped to a word feature vector x_t of a predetermined dimension, i.e., x_t = W_x Π_t, t ∈ [1, r]. The vocabulary consists of the words whose frequency of occurrence, after all sentences are segmented, is greater than a preset frequency. In one possible implementation, the predetermined dimension is 300 and the embedding is initialized by default with pre-trained GloVe (word embedding) features.
For example, if the vocabulary contains 300 words and the position of w_t in the vocabulary is 29, then the one-hot encoding of the t-th word is a 300-dimensional vector whose 29th dimension has a value of 1 and whose remaining dimensions have a value of 0.
3) Input the t-th word feature vector into the bidirectional gated recurrent unit at the t-th time step, and determine the t-th word vector from the bidirectional hidden states of the bidirectional gated recurrent unit at the t-th time step.
The bidirectional GRU comprises a forward GRU and a backward GRU. The r word feature vectors may be fed into the forward GRU in order from w_1 to w_r, and fed into the backward GRU in order from w_r to w_1.
For the forward GRU, the forward hidden state is h_t^fwd = GRU_fwd(x_t, h_{t-1}^fwd), t ∈ [1, r]; for the backward GRU, the backward hidden state is h_t^bwd = GRU_bwd(x_t, h_{t+1}^bwd), t ∈ [1, r].
The t-th word vector e_t can be determined as the average of the forward hidden state h_t^fwd and the backward hidden state h_t^bwd, which describes the information of the whole sentence around the word w_t, i.e., e_t = (h_t^fwd + h_t^bwd) / 2, t ∈ [1, r].
4) Form the obtained r word vectors into the text vector.
Referring to FIG. 5, by inputting the text into the bidirectional GRU, the text vector can be obtained from the hidden states h.
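The text encoding steps 1)-4) can be sketched with PyTorch as follows. The embedding layer stands for the one-hot-to-word-feature mapping x_t = W_x Π_t, and the forward and backward hidden states of the bidirectional GRU are averaged into one word vector per position; the vocabulary size is an assumption, while the 300-dimensional embedding and the 1024-dimensional hidden state follow the possible implementation mentioned above.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Word embedding + bidirectional GRU; returns one word vector e_t per word."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # equivalent to x_t = W_x * one_hot(w_t)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        """word_ids: (B, r) integer tensor of vocabulary positions."""
        x = self.embed(word_ids)        # (B, r, embed_dim) word feature vectors
        h, _ = self.gru(x)              # (B, r, 2 * hidden_dim) bidirectional hidden states
        fwd, bwd = h.chunk(2, dim=-1)   # forward / backward hidden states
        return (fwd + bwd) / 2          # (B, r, hidden_dim) word vectors forming the text vector
```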
Step 406, calculating the matching degree between the image and the text according to the example feature set and the text vector.
In this embodiment, the instance feature set and the text vector may be mapped into a common semantic space, and then global similarity between the instance feature set and the text vector is calculated in the semantic space, and the matching degree between the image and the text is measured according to the global similarity.
For a given input image I and sentence S, the instance feature set V = {v_1, …, v_k}, v_i ∈ R^D, and the text vector E = {e_1, …, e_r}, e_t ∈ R^D, can be obtained according to the above steps. To measure the global similarity between image I and sentence S, a stacked cross attention method (Stacked Cross Attention) may be employed, which evaluates the final global similarity by aggregating local similarities between the instance feature set and the text vector. The local similarities can be aggregated in two directions: in the Image-Text (i-t for short) direction, the local similarities with the text vector are aggregated based on the instance feature set; in the Text-Image (t-i for short) direction, the local similarities with the instance feature set are aggregated based on the text vector. These two aggregation modes are described below.
Taking i-t direction aggregation as an example, calculating the similarity between the p-th example feature and the q-th vocabulary vector in the text vector for the p-th example feature in the example feature set, and calculating the weight of the q-th vocabulary vector according to the similarity, wherein p and q are positive integers; multiplying each vocabulary vector in the text vector by a corresponding weight, and adding the obtained products to obtain a text semantic vector based on the p-th example feature; calculating cosine similarity between the p-th instance feature and the text semantic vector; and calculating the global similarity between the image and the text according to the cosine similarity between all feature examples in the example feature set and the corresponding text semantic vector, wherein the global similarity is used for indicating the matching degree between the image and the text.
For the p-th instance feature v_p, the similarities between v_p and all word vectors {e_1, …, e_r} in the text vector are first calculated; all word vectors are then weighted by these similarities and summed to obtain a text semantic vector aggregated around v_p. The local similarity based on v_p can be quantified by the cosine similarity between v_p and this text semantic vector, and the global similarity between image I and sentence S may be aggregated from the local similarities of all instance features with the LogSumExp function.
Alternatively, the global similarity between image I and sentence S may be aggregated with a mean function, i.e., as the average of the local similarities of all instance features.
Taking the aggregation in the t-i direction as an example, calculating the similarity between the p-th vocabulary vector and the q-th example feature in the example feature set for the p-th vocabulary vector in the text vector, and calculating the weight of the q-th example feature according to the similarity, wherein p and q are positive integers; multiplying each instance feature in the instance feature set by a corresponding weight, and adding the obtained products to obtain an image semantic vector based on the p-th vocabulary vector; calculating cosine similarity between the p-th vocabulary vector and the image semantic vector; and calculating the global similarity between the texts and the images according to cosine similarity between all vocabulary vectors in the text vectors and corresponding image semantic vectors, wherein the global similarity is used for indicating the matching degree between the images and the texts.
The global similarity aggregated in the t-i direction is calculated in the same way as the global similarity aggregated in the i-t direction, and is not described again here.
Alternatively, an average value of the global similarity aggregated in the t-i direction and the global similarity aggregated in the i-t direction may be calculated, and the average value is used as the global similarity of the image and the text.
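The i-t direction of the stacked cross attention can be sketched as follows: each instance feature attends over the word vectors, a text semantic vector is aggregated for it, and the local cosine similarities are averaged into the global similarity (the mean-aggregation variant above). The softmax temperature and the clamping of negative similarities are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def image_to_text_similarity(v, e, temperature=9.0):
    """i-t direction of stacked cross attention.

    v: (k, D) instance features; e: (r, D) word vectors of one sentence.
    Returns a scalar global similarity between the image and the sentence."""
    v_n = F.normalize(v, dim=-1)
    e_n = F.normalize(e, dim=-1)
    s = (v_n @ e_n.t()).clamp(min=0)            # (k, r) instance-to-word similarities
    w = torch.softmax(temperature * s, dim=-1)  # attention over words for each instance
    t = w @ e                                   # (k, D) text semantic vector per instance
    local = F.cosine_similarity(v, t, dim=-1)   # (k,) local similarities
    return local.mean()                         # global similarity (mean aggregation)
```

The t-i direction is symmetric, with the roles of the instance features and the word vectors exchanged.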
In this embodiment, the global similarity between the image and the text is taken as the matching degree between the image and the text. After the matching degree between the image and the text is obtained, the matching degrees larger than a preset threshold may be selected, and the images or texts corresponding to those matching degrees determined as the retrieval result; alternatively, the matching degrees of all images or texts may be ranked, and the top-ranked images or texts determined as the retrieval result. Referring to FIG. 5, after the stacked cross attention, the top-ranked images or texts can also be determined through the ranking loss to obtain the retrieval result.
In summary, in the image-text matching method provided by the embodiment of the application, a candidate instance feature set is generated from the image, the candidate instance features in the set are aggregated using a self-attention mechanism to obtain an instance feature set, and the matching degree between the image and the text is then calculated from the instance feature set and the text vector. Because the self-attention mechanism aggregates instance features through the correlation between candidate instance features, the instance feature set of the image does not need to be obtained through an object detector. This avoids the problem that training an object detector is difficult because the category and position information of all instances must be annotated on every training image, and thus achieves the effect of simplifying the implementation of image-text matching. It also avoids the problem that an object detector outputs position information in addition to semantic information, and since this position information does not help image-text matching, the instance features recognized by the object detector are not well suited for matching text, which affects the accuracy of image-text matching; the method thus achieves the effect of improving the accuracy of image-text matching.
In addition, the fact that an object detector outputs position information that is useless for matching shows that it does not fully take the characteristics of cross-modal retrieval into account; recognizing that the object detector is problematic for cross-modal retrieval is itself a non-trivial insight.
In addition, because the receptive field of a convolution kernel has a fixed size, self-attention mechanisms are usually used to capture long-range dependencies between image features. In this application, the self-attention mechanism is instead used to aggregate similar semantic information to obtain the instance feature set, which differs from its conventional role, so introducing a self-attention mechanism into cross-modal retrieval is not an obvious step from this point of view.
By fusing feature maps of multiple scales, the visual semantics can be enhanced. In addition, because multi-scale feature map fusion is not considered when an object detector is used to extract instance features, it is not an obvious step to add a multi-scale feature map fusion process to enhance visual semantics when the self-attention mechanism is used to obtain instance features.
The feature maps of multiple scales are fused by downsampling, which reduces the number of instance features while fusing the feature maps, thereby reducing the difficulty of aligning the instance features with the words of the text in the semantic space and improving the accuracy of image-text matching.
The above method can be realized by a model for image-text matching; if the method is called the SAVE method, the model can be called the SAVE model. The SAVE model attempts to extract instance features (which may also be referred to as instance-level visual features) from potential objects or regions in an image in an end-to-end fashion, and then performs cross-modal retrieval based on these instance features. Inspired by the acquisition of instance features with an object detector, this embodiment uses the self-attention mechanism to replace the object detector and explores its effect on instance feature extraction. In order to obtain instance features of different levels, feature maps of different scales can be extracted with the convolutional neural network, and the self-attention mechanism is then applied to each of them, with the expectation of extracting detailed features of low-level objects or regions from the high-resolution feature maps and aggregating high-level semantic concept information from the low-resolution feature maps.
In addition, the SAVE model includes, in addition to a convolutional neural network, a recurrent neural network for extracting text vectors and a matching model for extracting an instance feature set based on the feature map and matching the instance feature set with the text vectors.
When training the matching model, attention is paid to the negative samples in the training process. For a pair of matched image-text samples (i, s), the hardest negative samples can be defined as i_h = argmax_{x≠i} S(x, s) and s_h = argmax_{y≠s} S(i, y). A triplet ranking loss function in hinge form can then be defined as

L(i, s) = [m − S(i, s) + S(i, s_h)]_+ + [m − S(i, s) + S(i_h, s)]_+,

where m is the margin parameter of the hinge loss and [x]_+ ≡ max(x, 0).
When training the SAVE model, the margin of the loss function may be set to 0.2, and the maximum value for gradient clipping during training may be set to 2.0. Optionally, the SAVE model may be trained with the Adam optimizer, with the number of training samples per batch set to 128. The training process is divided into two stages: the first stage fixes the parameters of ResNet-152, with the initial learning rate set to 5e-4; the second stage trains ResNet-152 together with the rest of the SAVE model, with the initial learning rate set to 2e-5. The learning rate is related to the dataset; the learning rates given here are based on the MS-COCO dataset.
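A minimal PyTorch sketch of the hinge-form triplet ranking loss with hardest negatives is given below. It assumes that the global similarities of one training batch are arranged in a (B, B) matrix whose diagonal holds the matched image-text pairs; the margin of 0.2 follows the training setting above, and the mean reduction over the batch is an assumption.

```python
import torch

def hard_negative_triplet_loss(sim, margin=0.2):
    """sim: (B, B) matrix with sim[i][j] = S(image_i, sentence_j); diagonal = matched pairs."""
    B = sim.size(0)
    pos = sim.diag().view(B, 1)                                                # S(i, s) for matched pairs
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg_s = sim.masked_fill(mask, float("-inf")).max(dim=1).values.view(B, 1)  # hardest negative sentence
    neg_i = sim.masked_fill(mask, float("-inf")).max(dim=0).values.view(B, 1)  # hardest negative image
    loss = (margin - pos + neg_s).clamp(min=0) + (margin - pos + neg_i).clamp(min=0)
    return loss.mean()
```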
It should be noted that one implementation difficulty of the present application lies in tuning the parameters of the SAVE model, which is closely related to the learning rate, the choice of training method, and the training schedule. Selecting these parameters requires observing how the training loss changes and then adjusting them empirically. Hyper-parameter selection typically uses a grid search, which is relatively time-consuming.
The image-text matching method described above can be applied to the following four application scenarios; the overall flow of cross-modal retrieval in each scenario is introduced below.
1) Mutual retrieval of images and text
When text is retrieved using an image, the SAVE model obtains an input image and performs steps 402-404 described above to calculate the example feature set of the image; it then reads the z-th text from a text library containing at least one text and performs steps 405-406 to calculate the matching degree between the image and the z-th text; it updates z to z+1 and continues the step of reading the z-th text from the text library until the matching degrees between the image and all texts in the text library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the texts by their matching degree with the image, determines the top-ranked texts as the retrieval result, and outputs the retrieval result.
When an image is retrieved using text, the SAVE model obtains an input text and performs step 405 described above to calculate a text vector; it then reads the z-th image from an image library containing at least one image and performs steps 402-404 and 406 to calculate the matching degree between the text and the z-th image; it updates z to z+1 and continues the step of reading the z-th image from the image library until the matching degrees between the text and all images in the image library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the images by their matching degree with the text, determines the top-ranked images as the retrieval result, and outputs the retrieval result.
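A minimal sketch of this ranking loop; the helper names compute_instance_features, compute_text_vector and match_score stand in for steps 402-404, 405 and 406 and are assumptions.

```python
def retrieve_texts(save_model, image, text_library, top_k=5):
    """Rank every text in the library against one query image and return the
    top-ranked texts, mirroring the loop described above."""
    instance_features = save_model.compute_instance_features(image)            # steps 402-404
    scored = []
    for z, text in enumerate(text_library):
        text_vector = save_model.compute_text_vector(text)                     # step 405
        score = float(save_model.match_score(instance_features, text_vector))  # step 406
        scored.append((score, z))
    scored.sort(reverse=True)                                                  # sort by matching degree
    return [text_library[z] for _, z in scored[:top_k]]
```

The image-retrieval direction is the same loop with the roles of the image and text libraries exchanged.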
2) Mutual retrieval of images and tags
When a tag is retrieved using an image, the SAVE model obtains an input image and performs steps 402-404 described above to calculate the example feature set of the image; it then reads the z-th tag from a tag library containing at least one tag and performs steps 405-406 to calculate the matching degree between the image and the z-th tag; it updates z to z+1 and continues the step of reading the z-th tag from the tag library until the matching degrees between the image and all tags in the tag library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the tags by their matching degree with the image, determines the top-ranked tags as the retrieval result, and outputs the retrieval result.
When an image is retrieved using a tag, the SAVE model obtains an input tag and performs step 405 described above to calculate a text vector; it then reads the z-th image from an image library containing at least one image and performs steps 402-404 and 406 to calculate the matching degree between the tag and the z-th image; it updates z to z+1 and continues the step of reading the z-th image from the image library until the matching degrees between the tag and all images in the image library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the images by their matching degree with the tag, determines the top-ranked images as the retrieval result, and outputs the retrieval result.
3) Mutual retrieval of video and text
When text is retrieved using a video, the SAVE model obtains an input video. Because the content of f consecutive video frames (f being a positive integer) differs little, the input video can be sampled once every f video frames to obtain a set of image frames. For each image frame, the SAVE model takes the frame as an input image and performs steps 402-404 to calculate the example feature set of that image; it then reads the z-th text from a text library containing at least one text, performs steps 405-406 to calculate the matching degree between each image and the z-th text, and determines the matching degree between the z-th text and the video as the average or maximum of the matching degrees between the z-th text and all sampled images of the video; it updates z to z+1 and continues the step of reading the z-th text from the text library until the matching degrees between the video and all texts in the text library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the texts by their matching degree with the video, determines the top-ranked texts as the retrieval result, and outputs the retrieval result.
When a video is retrieved using text, the SAVE model obtains an input text and performs step 405 described above to calculate a text vector; it then reads the z-th video from a video library containing at least one video, samples the z-th video once every f video frames to obtain a set of image frames, takes each image frame as an image, performs steps 402-404 and 406 to calculate the matching degree between the text and each image, and determines the matching degree between the text and the z-th video as the average or maximum of the matching degrees between the text and all sampled images of the z-th video; it updates z to z+1 and continues the step of reading the z-th video from the video library until the matching degrees between the text and all videos in the video library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the videos by their matching degree with the text, determines the top-ranked videos as the retrieval result, and outputs the retrieval result.
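The frame-sampling and score-aggregation idea can be sketched as follows; the helper names are the same assumed ones as above, and the choice between mean and max follows the description.

```python
def video_text_score(save_model, video_frames, text_vector, f=16, use_max=False):
    """Sample one frame every f frames, score each sampled frame against the
    text vector, and aggregate the per-frame scores by mean or max."""
    sampled = video_frames[::f]                                             # one frame per f frames
    scores = []
    for frame in sampled:
        feats = save_model.compute_instance_features(frame)                 # steps 402-404
        scores.append(float(save_model.match_score(feats, text_vector)))    # step 406
    return max(scores) if use_max else sum(scores) / len(scores)
```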
4) Mutual retrieval of video and tags
When a tag is retrieved using a video, the SAVE model obtains an input video. Because the content of f consecutive video frames (f being a positive integer) differs little, the input video can be sampled once every f video frames to obtain a set of image frames. For each image frame, the SAVE model takes the frame as an input image and performs steps 402-404 to calculate the example feature set of that image; it then reads the z-th tag from a tag library containing at least one tag, performs steps 405-406 to calculate the matching degree between each image and the z-th tag, and determines the matching degree between the z-th tag and the video as the average or maximum of the matching degrees between the z-th tag and all sampled images of the video; it updates z to z+1 and continues the step of reading the z-th tag from the tag library until the matching degrees between the video and all tags in the tag library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the tags by their matching degree with the video, determines the top-ranked tags as the retrieval result, and outputs the retrieval result.
When a video is retrieved using a tag, the SAVE model obtains an input tag and performs step 405 described above to calculate a text vector; it then reads the z-th video from a video library containing at least one video, samples the z-th video once every f video frames to obtain a set of image frames, takes each image frame as an image, performs steps 402-404 and 406 to calculate the matching degree between the tag and each image, and determines the matching degree between the tag and the z-th video as the average or maximum of the matching degrees between the tag and all sampled images of the z-th video; it updates z to z+1 and continues the step of reading the z-th video from the video library until the matching degrees between the tag and all videos in the video library have been obtained, at which point the loop stops, z being a positive integer. The SAVE model sorts the videos by their matching degree with the tag, determines the top-ranked videos as the retrieval result, and outputs the retrieval result.
The SAVE model was run in the following hardware environment to verify its technical effects:
CPU: 8 cores
Memory: 128 GB
GPU: Nvidia Tesla P40
To verify the technical effect of the SAVE method, two cross-modal retrieval benchmark data sets, MS-COCO and Flickr30K, are used. The Flickr30K data set contains 31783 images gathered from the Flickr website, each paired with 5 texts. We used 29000 images as the training set, 1014 images as the validation set, and 1000 images as the test set. The MS-COCO data set contains 123287 images, likewise with 5 texts per image. We used 82783 images as the training set, 5000 images as the validation set, and 5000 images as the test set. In addition, the remaining 30504 original validation-set images of MS-COCO can be added to the training set used for retrieval, expanding the training set to 113287 images while the other splits remain unchanged. Testing on the MS-COCO data set is done in two ways: 1) testing directly on the 5000 test images; 2) dividing the 5000 images into 5 folds of 1000 images each, testing on each fold, and averaging the results.
R@K (K = 1, 5, 10) is an evaluation index commonly used for cross-modal retrieval tasks; it is defined as the percentage of queries for which at least one correct result appears among the top K retrieved results. Med r is another evaluation index, defined as the median rank of the first correct result over all queries. In addition, a Sum index can be defined to measure the overall effect of cross-modal retrieval, obtained by summing the R@K values.
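A sketch of how these indices can be computed from the rank of the first correct result per query; taking Sum as the total of the R@K values is an assumption consistent with its stated role.

```python
import numpy as np

def recall_at_k(ranks, k):
    """ranks[i] is the 1-based rank of the first correct result for query i;
    R@K is the percentage of queries whose first correct result is in the top K."""
    return 100.0 * np.mean(np.asarray(ranks) <= k)

def evaluate(ranks):
    r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
    med_r = float(np.median(np.asarray(ranks)))          # Med r: median rank of the first hit
    return {"R@1": r1, "R@5": r5, "R@10": r10, "Med r": med_r,
            "Sum": r1 + r5 + r10}                         # overall score (assumed definition)
```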
Table 1 below compares the experimental results of the SAVE method with those of several methods in the related art on the Flickr30K data set. The results are arranged in three columns according to the implementation principles of the methods, which are described below.
The first column contains end-to-end trainable methods: visual features are extracted directly by VGG or ResNet and mapped into the semantic space either one-to-one or many-to-many. One-to-one mapping means mapping the global feature vectors of the image and the text into the semantic space and measuring the matching degree between image and text by the distance between these global feature vectors. However, the higher-order semantic information contained in a global feature vector is entangled, so the content of an image (or text) retrieved on this basis often fails to correspond in detail to the given text (or image); for example, objects in the image may not correspond one-to-one to nouns in the text. Many-to-many mapping means mapping the local feature set of the image and the vocabulary feature set of the text into the semantic space and aggregating the local similarities between image and text into a global similarity, which yields the matching degree between the image and the text.
The methods in the second column extract the instance feature set with an object detector.
The third column contains the SAVE method described above, subdivided according to the self-attention mechanism used and the number of feature-map scales: ms2 means the SAVE method extracts feature maps at the two scales 14×14 and 7×7, and ms3 means it extracts feature maps at the three scales 28×28, 14×14 and 7×7.
It can be seen that the SAVE method is significantly better than the end-to-end trainable methods in the first column under all evaluation criteria. Our best result is achieved by fusing the example feature sets of 3 layers aggregated by adaptive self-attention, and it is comparable to the best existing retrieval method (the SCAN method). It is worth mentioning that our SAVE method does not require additional manually annotated data to train an object detector.
TABLE 1
Table 2 shows the experimental results of the various methods on the 1K and 5K test sets of the MS-COCO data set; the results are again divided into six columns according to the implementation principles described for Table 1. Our best result is achieved by fusing the example feature sets of 2 layers aggregated by adaptive self-attention, and it is better than all of the end-to-end trainable methods in the first and fourth columns. However, there is still a gap between the SAVE method and the best existing retrieval method (the SCAN method), especially on the 5K test set. This may be because the SCAN method uses an additional data set (Visual Genome), with additional annotation data, to train its object detector; the features extracted by the trained detector may therefore generalize better, giving the SCAN method better versatility.
TABLE 2
Referring to fig. 6, a block diagram of an image-text matching apparatus according to an embodiment of the present application is shown. The image-text matching apparatus comprises:
an obtaining module 610, configured to obtain an image and a text to be matched;
a generating module 620, configured to generate a candidate instance feature set according to the image obtained by the obtaining module 610;
an aggregation module 630, configured to aggregate the candidate instance features in the candidate instance feature set generated by the generation module 620 by using a self-attention mechanism, to obtain an instance feature set, where each instance feature in the instance feature set corresponds to an object or region in the image;
the encoding module 640 is configured to encode the text obtained by the obtaining module 610 to obtain a text vector;
the calculating module 650 is configured to calculate a matching degree between the image and the text according to the example feature set obtained by the aggregating module 630 and the text vector obtained by the encoding module 640.
In one possible implementation, the aggregation module 630 is further configured to:
for an ith candidate example feature in the set of candidate example features, calculating a correlation between the ith candidate example feature and the remaining candidate example features using a self-attention mechanism, and calculating an example feature based on the ith candidate example feature from the correlation.
In one possible implementation, when the correlation is a weight, the aggregation module 630 is further configured to:
for the ith candidate example feature in the candidate example feature set, calculating cosine similarity between the ith candidate example feature and the jth candidate example feature, and calculating the weight of the jth candidate example feature according to the cosine similarity, wherein the weight is used for representing the attention degree of the jth candidate example feature when other candidate examples are aggregated based on the ith candidate example feature, and i and j are positive integers;
multiplying each candidate instance feature in the candidate instance feature set by a corresponding weight, and adding the obtained products to obtain the instance feature based on the ith candidate instance feature.
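A sketch of this cosine-similarity based aggregation; normalizing the similarities into weights with a softmax is an assumption.

```python
import torch
import torch.nn.functional as F

def aggregate_by_cosine_attention(candidates):
    """candidates: (N, D) candidate example features. For each i, weight every
    candidate j by its (normalized) cosine similarity to candidate i and return
    the N weighted sums as the example features."""
    normed = F.normalize(candidates, dim=-1)     # unit-length rows
    cos_sim = normed @ normed.t()                # (N, N) cosine similarities
    weights = F.softmax(cos_sim, dim=-1)         # attention weights over candidates
    return weights @ candidates                  # (N, D) aggregated example features
```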
In one possible implementation, when the correlation is a weight, the aggregation module 630 is further configured to:
mapping each candidate instance feature in the candidate instance feature set into a first feature space, a second feature space and a third feature space respectively;
for an ith candidate example feature in the candidate example feature set, calculating the weight of the jth candidate example feature according to the ith candidate example feature in the first feature space and the jth candidate example feature in the second feature space, wherein the weight is used for representing the attention degree of the jth candidate example feature when other candidate examples are aggregated based on the ith candidate example feature, and i and j are positive integers;
Multiplying each candidate example feature in the third feature space by a corresponding weight, adding the obtained products, and performing residual fitting to obtain an example feature based on the ith candidate example feature.
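A sketch of this three-feature-space variant with residual fitting; the linear mappings, the channel-reduction factor and the learnable residual scale are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionAggregation(nn.Module):
    """Maps candidates into three feature spaces, computes attention weights
    from the first two, aggregates the third, and adds a residual connection."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.to_first = nn.Linear(dim, dim // reduction)    # first feature space
        self.to_second = nn.Linear(dim, dim // reduction)   # second feature space
        self.to_third = nn.Linear(dim, dim)                 # third feature space
        self.gamma = nn.Parameter(torch.zeros(1))           # residual scale

    def forward(self, candidates):                # candidates: (B, N, D)
        f1 = self.to_first(candidates)            # (B, N, D/r)
        f2 = self.to_second(candidates)           # (B, N, D/r)
        f3 = self.to_third(candidates)            # (B, N, D)
        weights = F.softmax(f1 @ f2.transpose(1, 2), dim=-1)   # (B, N, N) attention weights
        aggregated = weights @ f3                                # weighted sum of mapped features
        return self.gamma * aggregated + candidates              # residual fitting
```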
In one possible implementation, the generating module 620 is further configured to:
inputting the image into a convolutional neural network, and obtaining a feature map output by the convolutional neural network;
dividing the feature map, and forming the candidate instance features obtained after division into a candidate instance feature set.
In one possible implementation manner, when the convolutional neural network outputs n feature maps of different scales and n is greater than or equal to 2, the obtaining module 610 is further configured to obtain, for the m-th feature map among the n feature maps, the scale of the (m+1)-th feature map, where m is greater than or equal to 1 and less than n;
the apparatus further comprises: a downsampling module, configured to downsample the example feature set generated based on the m-th feature map according to the scale of the (m+1)-th feature map obtained by the obtaining module 610, and to combine the resulting example feature set with the example feature set generated based on the (m+1)-th feature map;
and a determining module, configured to determine the example feature set combined by the downsampling module as the example feature set finally generated based on the (m+1)-th feature map.
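A sketch of this downsample-and-combine step; pooling with adaptive average pooling, element-wise addition as the combination, and a common feature dimension across scales are all assumptions.

```python
import torch
import torch.nn.functional as F

def merge_scales(fine_feats, fine_hw, coarse_feats, coarse_hw):
    """fine_feats: (B, Hf*Wf, D) example features from the m-th feature map;
    coarse_feats: (B, Hc*Wc, D) from the (m+1)-th map. Downsample the fine set
    to the (m+1)-th scale and combine it with the coarse set."""
    b, _, d = fine_feats.shape
    hf, wf = fine_hw
    hc, wc = coarse_hw
    grid = fine_feats.transpose(1, 2).reshape(b, d, hf, wf)   # back to a spatial map
    pooled = F.adaptive_avg_pool2d(grid, (hc, wc))            # downsample to (Hc, Wc)
    pooled = pooled.flatten(2).transpose(1, 2)                # (B, Hc*Wc, D)
    return pooled + coarse_feats                              # combined set for the (m+1)-th map
```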
In one possible implementation, when the text is a sentence, the computing module 650 is further configured to:
for the p-th example feature in the example feature set, calculating the similarity between the p-th example feature and the q-th vocabulary vector in the text vector, and calculating the weight of the q-th vocabulary vector according to the similarity, wherein p and q are positive integers;
multiplying each vocabulary vector in the text vector by a corresponding weight, and adding the obtained products to obtain a text semantic vector based on the p-th example feature;
calculating cosine similarity between the p-th instance feature and the text semantic vector;
and calculating the global similarity between the image and the text according to the cosine similarities between all example features in the example feature set and their corresponding text semantic vectors, wherein the global similarity is used for indicating the matching degree between the image and the text.
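A sketch of this instance-to-word attention and similarity aggregation; the softmax temperature and averaging the per-instance cosine similarities into the global similarity are assumptions.

```python
import torch
import torch.nn.functional as F

def image_to_text_similarity(instance_feats, word_vectors, temperature=9.0):
    """instance_feats: (N, D) example features of one image; word_vectors: (T, D)
    vocabulary vectors of one sentence. Attend over the words for each example
    feature, build text semantic vectors, and aggregate the cosine similarities."""
    sim = F.normalize(instance_feats, dim=-1) @ F.normalize(word_vectors, dim=-1).t()
    weights = F.softmax(temperature * sim, dim=-1)                       # (N, T) word weights
    text_semantic = weights @ word_vectors                               # (N, D) per-feature text vectors
    cos = F.cosine_similarity(instance_feats, text_semantic, dim=-1)     # (N,)
    return cos.mean()                                                    # global similarity
```

The symmetric word-to-instance direction described next swaps the roles of the two feature sets.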
In one possible implementation, when the text is a sentence, the computing module 650 is further configured to:
for the p-th vocabulary vector in the text vector, calculating the similarity between the p-th vocabulary vector and the q-th example feature in the example feature set, and calculating the weight of the q-th example feature according to the similarity, wherein p and q are positive integers;
Multiplying each instance feature in the instance feature set by a corresponding weight, and adding the obtained products to obtain an image semantic vector based on the p-th vocabulary vector;
calculating cosine similarity between the p-th vocabulary vector and the image semantic vector;
and calculating the global similarity between the texts and the images according to cosine similarity between all vocabulary vectors in the text vectors and corresponding image semantic vectors, wherein the global similarity is used for indicating the matching degree between the images and the texts.
In one possible implementation, when the text is a sentence, the encoding module 640 is further configured to:
word segmentation is carried out on the sentences to obtain r vocabularies, wherein r is a positive integer;
for the t-th vocabulary among the r vocabularies, encoding the t-th vocabulary according to its position in a vocabulary table to obtain the t-th vocabulary feature vector, wherein t is greater than or equal to 1 and less than or equal to r;
inputting the t-th vocabulary feature vector into a bidirectional gated recurrent unit at the t-th time step, and determining the t-th vocabulary vector according to the bidirectional hidden state of the bidirectional gated recurrent unit at the t-th time step;
and forming the obtained r vocabulary vectors into text vectors.
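A sketch of this word encoding with a bidirectional gated recurrent unit; the embedding size, hidden size, and averaging of the two directional hidden states are assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embed each word by its index in the vocabulary table, run a bidirectional
    GRU, and form each word vector from the two hidden directions at that step."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):                   # word_ids: (B, r) vocabulary indices
        embedded = self.embed(word_ids)            # (B, r, embed_dim)
        states, _ = self.gru(embedded)             # (B, r, 2 * hidden_dim)
        forward_h, backward_h = states.chunk(2, dim=-1)
        return (forward_h + backward_h) / 2        # (B, r, hidden_dim) vocabulary vectors
```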
In summary, the image-text matching apparatus provided in the embodiments of the present application generates a candidate example feature set from the image, aggregates the candidate example features in that set with a self-attention mechanism to obtain an example feature set, and calculates the matching degree between the image and the text from the example feature set and the text vector. Example features can thus be aggregated by the self-attention mechanism according to the relevance between candidate example features, avoiding the use of an object detector to acquire the example feature set of the image. This solves the problem that training an object detector is difficult because the category and position of every instance must be annotated in each image, thereby simplifying the implementation of image-text matching. It also solves the problem that an object detector outputs position information in addition to semantic information; since this position information does not help image-text matching, the example features recognized by an object detector are not well suited to matching text and affect matching accuracy. The apparatus therefore improves the accuracy of image-text matching.
In addition, the output of the object detector provides useless position information, which shows that using an object detector does not fully take the characteristics of cross-modal retrieval itself into account; recognizing that the object detector is problematic for cross-modal retrieval is itself difficult.
In addition, because the receptive field of a convolution kernel has a fixed size, self-attention mechanisms are conventionally used to capture long-range dependencies between image features. In this application, by contrast, the self-attention mechanism is used to aggregate similar semantic information to obtain the example feature set, which differs from the conventional role of self-attention; from this point of view, introducing the self-attention mechanism into cross-modal retrieval is not an obvious step.
Fusing feature maps of multiple scales enhances visual semantics. Moreover, because the fusion of multi-scale feature maps is not considered when an object detector is used to extract example features, conceiving of adding a multi-scale fusion process to enhance visual semantics when example features are obtained with the self-attention mechanism is inherently difficult.
The feature maps of multiple scales are fused by downsampling, which reduces the number of example features while fusing the feature maps. This lowers the difficulty of aligning the example features with the words of the text in the semantic space, and thus improves the accuracy of image-text matching.
The application also provides a server, which comprises a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to realize the image-text matching method provided by each method embodiment. It should be noted that the server may be a server as provided in fig. 7 below.
Referring to fig. 7, a schematic structural diagram of a server according to an exemplary embodiment of the present application is shown. Specifically, the server 700 includes a Central Processing Unit (CPU) 701, a system memory 704 including a Random Access Memory (RAM) 702 and a Read Only Memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The server 700 also includes a basic input/output system (I/O system) 706 for aiding the transfer of information between the various devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 708 and the input device 709 are coupled to the central processing unit 701 through an input output controller 710 coupled to a system bus 705. The basic input/output system 706 may also include an input/output controller 710 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 710 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer readable storage medium provide non-volatile storage for the server 700. That is, the mass storage device 707 may include a computer readable storage medium (not shown) such as a hard disk or CD-ROM drive.
The computer-readable storage medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 704 and mass storage device 707 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 701, the one or more programs containing instructions for implementing the image-text matching method described above, and the central processing unit 701 executes the one or more programs to implement the image-text matching method provided in the respective method embodiments described above.
According to various embodiments of the present invention, the server 700 may also operate by being connected, via a network such as the Internet, to a remote computer on the network. That is, the server 700 may be connected to the network 712 via a network interface unit 711 connected to the system bus 705, or the network interface unit 711 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs stored therein, the one or more programs including steps to be executed by the server in performing the image-text matching method provided by the embodiments of the present invention.
Embodiments of the present application also provide a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which are loaded and executed by the processor to implement the image-text matching method described above.
The application also provides a computer program product which, when run on a computer, causes the computer to execute the image-text matching method provided by the method embodiments.
One embodiment of the present application provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the image-text matching method described above.
An embodiment of the present application provides an image-text matching device, where the image-text matching device includes a processor and a memory, where at least one instruction is stored in the memory, where the instruction is loaded and executed by the processor to implement the image-text matching method described above.
It should be noted that: in the image-text matching device provided in the above embodiment, only the division of the functional modules is used for illustration, and in practical application, the above-mentioned functional allocation can be completed by different functional modules according to the needs, i.e. the internal structure of the image-text matching device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the image-text matching device provided in the above embodiment and the image-text matching method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is not intended to limit the embodiments of the present application, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the embodiments of the present application are intended to be included within the scope of the embodiments of the present application.

Claims (10)

1. The image-text matching method is characterized by comprising the following steps of:
acquiring an image and a text to be matched;
generating a candidate instance feature set according to the image;
for an ith candidate example feature in the candidate example feature set, calculating cosine similarity between the ith candidate example feature and a jth candidate example feature, and calculating a weight of the jth candidate example feature according to the cosine similarity, wherein the weight is used for representing the attention degree of the jth candidate example feature when other candidate examples are aggregated based on the ith candidate example feature, and i and j are positive integers; multiplying each candidate instance feature in the candidate instance feature set by a corresponding weight, and adding the obtained products to obtain an instance feature based on the ith candidate instance feature;
Or mapping each candidate instance feature in the candidate instance feature set into a first feature space, a second feature space and a third feature space respectively; for an ith candidate example feature in the candidate example feature set, calculating a weight of the jth candidate example feature according to the jth candidate example feature in the first feature space and the ith candidate example feature in the second feature space, wherein the weight is used for representing the attention degree of the jth candidate example feature when other candidate examples are aggregated based on the ith candidate example feature, and i and j are positive integers; multiplying each candidate example feature in the third feature space by a corresponding weight, adding the obtained products, and performing residual fitting to obtain example features based on the ith candidate example feature;
determining a set of example features, each example feature in the set of example features corresponding to an object or region in the image;
encoding the text to obtain a text vector;
for the p-th example feature in the example feature set, calculating the similarity between the p-th example feature and the q-th vocabulary vector in the text vector, and calculating the weight of the q-th vocabulary vector according to the similarity, wherein p and q are positive integers; multiplying each vocabulary vector in the text vector by a corresponding weight, and adding the obtained products to obtain a text semantic vector based on the p-th example feature; calculating cosine similarity between the p-th example feature and the text semantic vector; calculating global similarity between the image and the text according to cosine similarity between all feature examples in the example feature set and corresponding text semantic vectors, wherein the global similarity is used for indicating matching degree between the image and the text;
Or, for the p-th vocabulary vector in the text vector, calculating the similarity between the p-th vocabulary vector and the q-th example feature in the example feature set, and calculating the weight of the q-th example feature according to the similarity, wherein p and q are positive integers; multiplying each instance feature in the instance feature set by a corresponding weight, and adding the obtained products to obtain an image semantic vector based on the p-th vocabulary vector; calculating cosine similarity between the p-th vocabulary vector and the image semantic vector; and calculating global similarity between the text and the image according to cosine similarity between all vocabulary vectors in the text vector and corresponding image semantic vectors, wherein the global similarity is used for indicating the matching degree between the image and the text.
2. The method of claim 1, wherein the generating a candidate instance feature set from the image comprises:
inputting the image into a convolutional neural network, and obtaining a feature map output by the convolutional neural network;
dividing the feature map, and forming the candidate instance feature set by the candidate instance feature obtained after dividing.
3. The method of claim 2, wherein when the convolutional neural network outputs n feature maps of different scales and n is greater than or equal to 2, the method further comprises:
for the m-th feature map among the n feature maps, obtaining the scale of the (m+1)-th feature map, wherein m is greater than or equal to 1 and less than n;
downsampling the example feature set generated based on the m-th feature map according to the scale of the (m+1)-th feature map, and combining the obtained example feature set with the example feature set generated based on the (m+1)-th feature map;
and determining the combined instance feature set as an instance feature set finally generated based on the (m+1)-th feature map.
4. The method of claim 1, wherein when the text is a sentence, the encoding the text to obtain a text vector comprises:
word segmentation is carried out on the sentences to obtain r vocabularies, wherein r is a positive integer;
for the t-th vocabulary among the r vocabularies, encoding the t-th vocabulary according to its position in a vocabulary table to obtain the t-th vocabulary feature vector, wherein t is greater than or equal to 1 and less than or equal to r;
inputting the t-th vocabulary feature vector into a bidirectional gated recurrent unit at the t-th time step, and determining the t-th vocabulary vector according to the bidirectional hidden state of the bidirectional gated recurrent unit at the t-th time step;
And forming the obtained r vocabulary vectors into the text vector.
5. A graphic matching apparatus, the apparatus comprising:
the acquisition module is used for acquiring the images and the texts to be matched;
the generation module is used for generating a candidate instance feature set according to the image obtained by the acquisition module;
an aggregation module, configured to calculate, for an ith candidate example feature in the candidate example feature set, a cosine similarity between the ith candidate example feature and a jth candidate example feature, and calculate a weight of the jth candidate example feature according to the cosine similarity, where the weight is used to represent a degree of attention to the jth candidate example feature when aggregating other candidate examples based on the ith candidate example feature, and i and j are positive integers; multiplying each candidate instance feature in the candidate instance feature set by a corresponding weight, and adding the obtained products to obtain an instance feature based on the ith candidate instance feature; or mapping each candidate instance feature in the candidate instance feature set into a first feature space, a second feature space and a third feature space respectively; for an ith candidate example feature in the candidate example feature set, calculating a weight of the jth candidate example feature according to the jth candidate example feature in the first feature space and the ith candidate example feature in the second feature space, wherein the weight is used for representing the attention degree of the jth candidate example feature when other candidate examples are aggregated based on the ith candidate example feature, and i and j are positive integers; multiplying each candidate example feature in the third feature space by a corresponding weight, adding the obtained products, and performing residual fitting to obtain example features based on the ith candidate example feature; determining a set of example features, each example feature in the set of example features corresponding to an object or region in the image;
The coding module is used for coding the text obtained by the obtaining module to obtain a text vector;
the computing module is used for computing the similarity between the p-th example feature and the q-th vocabulary vector in the text vector for the p-th example feature in the example feature set, and computing the weight of the q-th vocabulary vector according to the similarity, wherein p and q are positive integers; multiplying each vocabulary vector in the text vector by a corresponding weight, and adding the obtained products to obtain a text semantic vector based on the p-th example feature; calculating cosine similarity between the p-th example feature and the text semantic vector; calculating global similarity between the image and the text according to cosine similarity between all feature examples in the example feature set and corresponding text semantic vectors, wherein the global similarity is used for indicating matching degree between the image and the text; or, for the p-th vocabulary vector in the text vector, calculating the similarity between the p-th vocabulary vector and the q-th example feature in the example feature set, and calculating the weight of the q-th example feature according to the similarity, wherein p and q are positive integers; multiplying each instance feature in the instance feature set by a corresponding weight, and adding the obtained products to obtain an image semantic vector based on the p-th vocabulary vector; calculating cosine similarity between the p-th vocabulary vector and the image semantic vector; and calculating global similarity between the text and the image according to cosine similarity between all vocabulary vectors in the text vector and corresponding image semantic vectors, wherein the global similarity is used for indicating the matching degree between the image and the text.
6. The apparatus of claim 5, wherein the generating module is further configured to:
inputting the image into a convolutional neural network, and obtaining a feature map output by the convolutional neural network;
dividing the feature map, and forming the candidate instance feature set by the candidate instance feature obtained after dividing.
7. The apparatus of claim 6, wherein the aggregation module is further configured to:
when the convolutional neural network outputs n feature maps of different scales and n is greater than or equal to 2, obtaining, for the m-th feature map among the n feature maps, the scale of the (m+1)-th feature map, wherein m is greater than or equal to 1 and less than n;
downsampling the example feature set generated based on the m-th feature map according to the scale of the (m+1)-th feature map, and combining the obtained example feature set with the example feature set generated based on the (m+1)-th feature map;
and determining the combined instance feature set as an instance feature set finally generated based on the (m+1)-th feature map.
8. The apparatus of claim 5, wherein the encoding module is further configured to:
when the text is a sentence, word segmentation is carried out on the sentence to obtain r vocabularies, wherein r is a positive integer;
For the t-th vocabulary among the r vocabularies, encoding the t-th vocabulary according to its position in a vocabulary table to obtain the t-th vocabulary feature vector, wherein t is greater than or equal to 1 and less than or equal to r;
inputting the t-th vocabulary feature vector into a bidirectional gated recurrent unit at the t-th time step, and determining the t-th vocabulary vector according to the bidirectional hidden state of the bidirectional gated recurrent unit at the t-th time step;
and forming the obtained r vocabulary vectors into the text vector.
9. A computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the image-text matching method as claimed in any one of claims 1 to 4.
10. An image-text matching device, comprising a processor and a memory, wherein at least one instruction is stored in the memory, the instruction being loaded and executed by the processor to implement the image-text matching method according to any one of claims 1 to 4.
CN201910152063.1A 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment Active CN110147457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910152063.1A CN110147457B (en) 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910152063.1A CN110147457B (en) 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN110147457A CN110147457A (en) 2019-08-20
CN110147457B true CN110147457B (en) 2023-07-25

Family

ID=67588603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910152063.1A Active CN110147457B (en) 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN110147457B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688508B (en) * 2019-09-03 2022-09-02 北京字节跳动网络技术有限公司 Image-text data expansion method and device and electronic equipment
CN112926616B (en) * 2019-12-06 2024-03-05 顺丰科技有限公司 Image matching method and device, electronic equipment and computer readable storage medium
CN111046966B (en) * 2019-12-18 2022-04-05 江南大学 Image subtitle generating method based on measurement attention mechanism
CN111209899B (en) * 2019-12-31 2023-06-02 科大讯飞股份有限公司 Rescue material delivery method, system, device and storage medium
CN111782921A (en) * 2020-03-25 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for searching target
CN111428801B (en) * 2020-03-30 2022-09-27 新疆大学 Image-text matching method for improving alternate updating of fusion layer and loss function
CN111949824B (en) * 2020-07-08 2023-11-03 合肥工业大学 Visual question-answering method and system based on semantic alignment and storage medium
CN112650868B (en) * 2020-12-29 2023-01-20 苏州科达科技股份有限公司 Image retrieval method, device and storage medium
CN112861882B (en) * 2021-03-10 2023-05-09 齐鲁工业大学 Image-text matching method and system based on frequency self-adaption
CN112926671B (en) * 2021-03-12 2024-04-19 云知声智能科技股份有限公司 Image text matching method and device, electronic equipment and storage medium
CN112925935B (en) * 2021-04-13 2022-05-06 电子科技大学 Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN113220919B (en) * 2021-05-17 2022-04-22 河海大学 Dam defect image text cross-modal retrieval method and model
CN113657425B (en) * 2021-06-28 2023-07-04 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN113824989B (en) * 2021-07-13 2024-02-27 腾讯科技(深圳)有限公司 Video processing method, device and computer readable storage medium
CN113239237B (en) * 2021-07-13 2021-11-30 北京邮电大学 Cross-media big data searching method and device
CN113590854B (en) * 2021-09-29 2021-12-31 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and computer readable storage medium
CN113901330B (en) * 2021-12-09 2022-10-11 北京达佳互联信息技术有限公司 Video searching method and device, electronic equipment and storage medium
CN114723986A (en) * 2022-03-16 2022-07-08 平安科技(深圳)有限公司 Text image matching method, device, equipment and storage medium
CN114913403B (en) * 2022-07-18 2022-09-20 南京信息工程大学 Visual question-answering method based on metric learning
CN115186775B (en) * 2022-09-13 2022-12-16 北京远鉴信息技术有限公司 Method and device for detecting matching degree of image description characters and electronic equipment
CN116842418B (en) * 2023-05-31 2024-01-05 浙江中屹纺织机械科技有限公司 Intelligent water-jet loom and control system thereof
CN116975318B (en) * 2023-08-03 2024-01-23 四川大学 Half-pairing image-text retrieval method based on cross-correlation mining

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108491817A (en) * 2018-03-30 2018-09-04 国信优易数据有限公司 A kind of event detection model training method, device and event detecting method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8478052B1 (en) * 2009-07-17 2013-07-02 Google Inc. Image classification
US9836671B2 (en) * 2015-08-28 2017-12-05 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108491817A (en) * 2018-03-30 2018-09-04 国信优易数据有限公司 A kind of event detection model training method, device and event detecting method

Also Published As

Publication number Publication date
CN110147457A (en) 2019-08-20

Similar Documents

Publication Publication Date Title
CN110147457B (en) Image-text matching method, device, storage medium and equipment
CN111476294B (en) Zero sample image identification method and system based on generation countermeasure network
Wang et al. M3: Multimodal memory modelling for video captioning
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Wang et al. Weakly supervised patchnets: Describing and aggregating local patches for scene recognition
US20200097604A1 (en) Stacked cross-modal matching
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
Qi et al. Semantics-aware spatial-temporal binaries for cross-modal video retrieval
CN112650886B (en) Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN111105013B (en) Optimization method of countermeasure network architecture, image description generation method and system
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN110990596B (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN114612767B (en) Scene graph-based image understanding and expressing method, system and storage medium
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
Cheng et al. A semi-supervised deep learning image caption model based on Pseudo Label and N-gram
CN111832440A (en) Construction method of human face feature extraction model, computer storage medium and equipment
Puscasiu et al. Automated image captioning
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Xia et al. Evaluation of saccadic scanpath prediction: Subjective assessment database and recurrent neural network based metric
Zhang et al. An attention-based word-level interaction model: Relation detection for knowledge base question answering
CN115168579A (en) Text classification method based on multi-head attention mechanism and two-dimensional convolution operation
Wang et al. Multi-scale interactive transformer for remote sensing cross-modal image-text retrieval
CN116434058A (en) Image description generation method and system based on visual text alignment
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant