CN110147457A - Image-text matching method, apparatus, storage medium and device - Google Patents

Image-text matching method, apparatus, storage medium and device

Info

Publication number
CN110147457A
CN110147457A
Authority
CN
China
Prior art keywords
text
candidate instance
feature
instance
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910152063.1A
Other languages
Chinese (zh)
Other versions
CN110147457B (en)
Inventor
贲有成
吴航昊
袁春
周杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910152063.1A
Publication of CN110147457A
Application granted
Publication of CN110147457B
Active legal status
Anticipated expiration


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose an image-text matching method, apparatus, storage medium and device, belonging to the field of computer technology. The method includes: obtaining an image and a text to be matched; generating a candidate instance feature set from the image; aggregating the candidate instance features in the candidate instance feature set using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object in the image; encoding the text to obtain a text vector; and computing the matching degree between the image and the text from the instance feature set and the text vector. The embodiments of the present application can simplify the implementation of image-text matching and improve the accuracy of image-text matching.

Description

Image-text matching method, apparatus, storage medium and device
Technical field
The embodiments of the present application relate to the field of computer technology, and in particular to an image-text matching method, apparatus, storage medium and device.
Background art
Cross-modal retrieval is a relatively new retrieval mode that enables retrieval of data across different modalities. Taking mutual image-text retrieval as an example, a user may input an image to retrieve texts that describe the image, or input a text to retrieve the images that the text describes.
When retrieving texts by image, a server can generate search results according to the matching degree between the image and the retrieved texts. To compute the matching degree between a text and an image, the server extracts the instance feature set of the image with a trained object detector, generates the text vector of the text with a recurrent neural network, and then uses a matching model to compute the matching degree between the image and the text from the instance feature set and the text vector.
Training such an object detector requires annotating the class and location of every instance in every training image, which makes the detector difficult to train. Moreover, the object detector and the matching model are trained separately, so the instance features the detector identifies may not be well suited for the matching model to match against text, which degrades the accuracy of image-text matching.
Summary of the invention
The embodiments of the present application provide an image-text matching method, apparatus, storage medium and device, which address the problems that an object detector is difficult to train and that the instance features it identifies are not well suited to matching text, degrading the accuracy of image-text matching. The technical solution is as follows:
In one aspect, an image-text matching method is provided, the method comprising:
obtaining an image and a text to be matched;
generating a candidate instance feature set from the image;
aggregating the candidate instance features in the candidate instance feature set using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image;
encoding the text to obtain a text vector;
computing the matching degree between the image and the text from the instance feature set and the text vector.
In one aspect, an image-text matching apparatus is provided, the apparatus comprising:
an obtaining module, configured to obtain an image and a text to be matched;
a generation module, configured to generate a candidate instance feature set from the image obtained by the obtaining module;
an aggregation module, configured to aggregate, using a self-attention mechanism, the candidate instance features in the candidate instance feature set generated by the generation module, obtaining an instance feature set in which each instance feature corresponds to an object or region in the image;
an encoding module, configured to encode the text obtained by the obtaining module, obtaining a text vector;
a computing module, configured to compute the matching degree between the image and the text from the instance feature set obtained by the aggregation module and the text vector obtained by the encoding module.
In one aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the image-text matching method described above.
In one aspect, an image-text matching device is provided, the device comprising a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the image-text matching method described above.
The beneficial effects of the technical solution provided by the embodiments of the present application include at least the following:
A candidate instance feature set is generated from the image, and a self-attention mechanism then aggregates the candidate instance features in the candidate instance feature set to obtain an instance feature set; the matching degree between the image and the text is then computed from the instance feature set and the text vector. In this way, self-attention exploits the correlations among candidate instance features to aggregate them into instance features, avoiding the use of an object detector to obtain the instance feature set of the image. This solves the problem that training an object detector requires annotating the class and location of every instance in every image, which makes the detector difficult to train, thereby simplifying the implementation of image-text matching. It also solves the problem that an object detector outputs location information in addition to semantic information, even though location information does not help image-text matching, so the instance features the detector identifies are not well suited to matching text and hurt matching accuracy; the accuracy of image-text matching is thus improved.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of search results according to an exemplary embodiment;
Fig. 2 is a structural diagram of an image-text matching system according to an exemplary embodiment;
Fig. 3 is a flowchart of the image-text matching method provided by one embodiment of the present application;
Fig. 4 is a flowchart of the image-text matching method provided by another embodiment of the present application;
Fig. 5 is a block diagram of the image-text matching system provided by another embodiment of the present application;
Fig. 6 is a structural block diagram of the image-text matching apparatus provided by one embodiment of the present application;
Fig. 7 is a structural diagram of the server provided by another embodiment of the present application.
Detailed description of embodiments
To make the purpose, technical solution and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in further detail below with reference to the drawings.
Visual content recognition and natural language understanding are two major challenges in the field of artificial intelligence. A currently popular research direction is to determine the intersection between images and texts and then realize applications on top of that intersection, for example generating a description text from an image, visual question answering, generating an image from a text, mutual retrieval between images and texts, and so on.
The present application concerns the mutual retrieval of images and texts, whose main purpose is to query matching images with a given text or to query matching texts with a given image. According to the different forms the image and the text may take, several possible application scenarios are described below.
1) Mutual retrieval of images and texts
The text may be a single sentence or a combination of several sentences with complete semantics; a sentence here may be a sentence in any natural language.
When retrieving texts by image, an image can be input, and texts whose semantics match the visual semantics of the image are then retrieved from a text library containing at least one text. For ease of understanding, 4 images from the Flickr30K dataset can each be used as input to query the 5 texts most similar to the visual semantics of that image, and each image is displayed together with the 5 texts retrieved for it, yielding the search results shown in Fig. 1. It should be noted that a text found by the server may match the image (an accurate search result) or mismatch it (an erroneous search result); in Fig. 1, texts that match the image are marked with "√" and texts that mismatch the image with "×".
When retrieving images by text, a text can be input, and images whose visual semantics match the text semantics are then retrieved from an image library containing at least one image.
2) Mutual retrieval of images and labels
A label may be a single word or a combination of several words; a word here may be a word in any natural language.
When retrieving labels by image, an image can be input, and labels whose semantics match the visual semantics of the image are then retrieved from a label library containing at least one label. For example, with the first image in Fig. 1 as input, the retrieved labels may be beach volleyball, bikini, sports and so on.
When retrieving images by label, a label can be input, and images whose visual semantics match the label semantics are then retrieved from an image library containing at least one image.
3) Mutual retrieval of videos and texts
The text may be a single sentence or a combination of several sentences with complete semantics; a sentence here may be a sentence in any natural language.
When retrieving texts by video, a video can be input and its image frames extracted; with each image frame as an input image, texts whose semantics match the visual semantics of the image are then retrieved from a text library containing at least one text.
When retrieving videos by text, a text can be input, and videos containing image frames that match the text semantics are then retrieved from a video library containing at least one video.
4) Mutual retrieval of videos and labels
A label may be a single word or a combination of several words; a word here may be a word in any natural language.
When retrieving labels by video, a video can be input and its image frames extracted; with each image frame as an input image, labels whose semantics match the visual semantics of the image are then retrieved from a label library containing at least one label.
When retrieving videos by label, a label can be input, and videos containing image frames that match the label semantics are then retrieved from a video library containing at least one video.
It is worth noting that the embodiments of the present application may be implemented in a terminal, in a server, or jointly by a terminal and a server. As shown in Fig. 2, taking retrieving images by text as an example, the terminal 21 generates a text and sends it to the server 22; the server 22 retrieves images based on the text and sends the retrieved images to the terminal 21 for display. Optionally, the terminal 21 and the server 22 are connected through a communication network, which may be a wired network or a wireless network; the embodiments of the present application do not limit this.
Schematically, a machine learning model for image-text matching is stored in the server 22. After the user inputs the text to be retrieved, "A woman is playing volleyball", at the terminal 21, the terminal 21 sends the text to the server 22; the server 22 reads each image from the image library, computes the matching degree between each image and the text with the machine learning model, and sends the images that match the text to the terminal 21 for display.
Referring to Fig. 3, which shows a flowchart of the image-text matching method provided by one embodiment of the present application, the method comprises:
Step 301: obtain an image and a text to be matched.
Corresponding to the four application scenarios above, the image and the text to be matched can be obtained in the following four ways.
1) Mutual retrieval of images and texts
When retrieving texts by image, texts can be fetched one by one from a text library containing at least one text; each fetched text is paired with the input image as one group of image and text to be matched.
When retrieving images by text, images can be fetched one by one from an image library containing at least one image; each fetched image is paired with the input text as one group of image and text to be matched.
2) Mutual retrieval of images and labels
When retrieving labels by image, labels can be fetched one by one from a label library containing at least one label; each fetched label is paired with the input image as one group of image and text to be matched.
When retrieving images by label, images can be fetched one by one from an image library containing at least one image; each fetched image is paired with the input label as one group of image and text to be matched.
3) Mutual retrieval of videos and texts
When retrieving texts by video, since f (f is a positive integer) consecutive video frames differ little in content, the input video can be sampled every f frames to obtain image frames. With each image frame as the input image, texts are fetched one by one from a text library containing at least one text, and each fetched text is paired with the input image as one group of image and text to be matched. Subsequently, the average or maximum of the matching degrees between all sampled frames of the video and a text can be taken as the matching degree between the video and that text.
When retrieving videos by text, videos can be fetched one by one from a video library containing at least one video. For each fetched video, since f (f is a positive integer) consecutive video frames differ little in content, the video is sampled every f frames to obtain image frames, and each image frame is paired with the input text as one group of image and text to be matched. Subsequently, the average or maximum of the matching degrees between all sampled frames of the video and the text can be taken as the matching degree between the video and the text.
4) Mutual retrieval of videos and labels
When retrieving labels by video, since f (f is a positive integer) consecutive video frames differ little in content, the input video can be sampled every f frames to obtain image frames. With each image frame as the input image, labels are fetched one by one from a label library containing at least one label, and each fetched label is paired with the input image as one group of image and text to be matched. Subsequently, the average or maximum of the matching degrees between all sampled frames of the video and a label can be taken as the matching degree between the video and that label.
When retrieving videos by label, labels and videos are handled in the same way: videos are fetched one by one from a video library containing at least one video, each fetched video is sampled every f frames to obtain image frames, and each image frame is paired with the input label as one group of image and text to be matched. Subsequently, the average or maximum of the matching degrees between all sampled frames of the video and the label can be taken as the matching degree between the video and the label, as illustrated in the sketch below.
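To make the frame-sampling and score-aggregation logic above concrete, here is a minimal Python sketch; the function name and the match_score callable are hypothetical stand-ins for the matching model described later, and the sketch assumes the frames are already decoded:

```python
from statistics import mean

def video_text_score(frames, text, match_score, f=8, reduce="mean"):
    """Score a video against a text (or label) by sampling every f-th frame.

    frames: sequence of decoded image frames
    match_score: callable (image, text) -> float, e.g. the matching model
    reduce: "mean" or "max" over the per-frame matching degrees
    """
    sampled = frames[::f]                       # keep one frame out of every f
    scores = [match_score(img, text) for img in sampled]
    return mean(scores) if reduce == "mean" else max(scores)
```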
After a group of image and text to be matched is obtained, the image can be processed through steps 302-303 and the text through step 304. This embodiment does not limit the processing order of the image and the text; that is, it does not limit the execution order between steps 302-303 and step 304.
It should be noted that when retrieving texts by a single image, the image can be processed once and the processing result matched against the processing result of each text; that is, steps 301-305 are executed the first time the method is performed, and steps 301 and 304-305 in subsequent runs. When retrieving images by a single text, the text can be processed once and the processing result matched against the processing result of each image; that is, steps 301-305 are executed the first time the method is performed, and steps 301-303 and 305 in subsequent runs.
Step 302: generate a candidate instance feature set from the image.
The candidate instance feature set contains at least one candidate instance feature, each corresponding to a region in the feature map of the image.
In this embodiment, the image is first input into a convolutional neural network to obtain a feature map, and the candidate instance feature set is then obtained from the feature map, as detailed in the description below. The convolutional neural network may be ResNet (residual network), VGGNet (Visual Geometry Group network), GoogLeNet, AlexNet and so on; this embodiment does not limit it.
Step 303: aggregate the candidate instance features in the candidate instance feature set using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image.
The instance feature set contains at least one instance feature, each corresponding to an object or region in the image. Taking the first image in Fig. 1 as an example, the instance features may be the person, the net, the auditorium and so on.
Before explaining the self-attention mechanism, the attention mechanism itself is explained first. The attention mechanism imitates the human visual mechanism: visual perception quickly scans the global image to obtain the target region that needs attention, generally called the focus of attention, then devotes more attention resources to this target region to obtain more detailed information about the target of interest while suppressing other useless information. The attention mechanism is thus a mechanism that aligns internal experience with external sensation to increase the fineness of observation of the target region; it can rapidly extract the important features of sparse data and is therefore widely used. The self-attention mechanism is an improvement of the attention mechanism that reduces the dependence on external information and is better at capturing the internal correlations of data or features.
Since each candidate instance feature corresponds to a region in the feature map of the image, the self-attention mechanism can aggregate similar visual semantics based on the correlations among these candidate instance features, obtaining each object or region in the image that needs attention, namely the instance feature set.
Step 304: encode the text to obtain a text vector.
In this embodiment, the text can be input into a recurrent neural network to obtain the text vector. The recurrent neural network may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), a GRU (Gated Recurrent Unit) and so on; this embodiment does not limit it.
Step 305: compute the matching degree between the image and the text from the instance feature set and the text vector.
In this embodiment, the instance feature set and the text vector can be mapped into a common semantic space; the global similarity between the instance feature set and the text vector is then computed in this semantic space, and the matching degree between the image and the text is measured by this global similarity.
After the matching degree between the image and the text is obtained, the matching degrees greater than a predetermined threshold can be selected and the corresponding images or texts taken as the search results; alternatively, the matching degrees of all images or texts can be sorted, and the top-ranked images or texts taken as the search results.
In conclusion picture and text matching process provided by the embodiments of the present application, by generating candidate translation example feature according to image Set, recycling polymerize the candidate translation example feature in candidate translation example characteristic set from attention mechanism, and reality can be obtained Example characteristic set calculates the matching degree between image and text further according to example aspects set and text vector, in this way, can be with Using example aspects are polymerize by the relevance between candidate translation example feature from attention mechanism, object inspection is avoided passing through Device is surveyed to obtain the example aspects set of image, needed to mark institute on every image when both having solved trained object detector There are the classification and location information of example, the problem for causing the training difficulty of object detector big, to reach simplified picture and text The effect for the realization difficulty matched;Also it solves object detector and also exports corresponding position letter other than exporting semantic information Breath, and location information matches picture and text not helpful, the example aspects for causing object detector to identify are not particularly suited for With text, the problem of influencing picture and text matched accuracy rate, to achieve the effect that improve the matched accuracy rate of picture and text.
In addition, the output of object detector can provide useless location information, this does not fully consider that cross-module state is retrieved The characteristic of itself, for inventor it can be found that the problem of object detector is in the presence of cross-module state is retrieved, this inherently has difficulty Degree.
In addition, since the size of the receptive field of convolution kernel is fixed, so, it is frequently used to grab from attention mechanism It takes the long-term dependence between characteristics of image, and polymerize similar semantic information using from attention mechanism in the application To obtain example aspects set, this is different from the conventional effect from attention mechanism, so, it, will be from attention for this point Power mechanism introduces the retrieval of cross-module state and does not allow to be readily conceivable that.
Referring to Fig. 4, which shows a flowchart of the image-text matching method provided by another embodiment of the present application, the method comprises:
Step 401: obtain an image and a text to be matched.
Corresponding to the four application scenarios above, the image and the text to be matched can be obtained in four ways, as detailed in the description of step 301 and not repeated here.
After a group of image and text to be matched is obtained, the image can be processed through steps 402-404 and the text through step 405. This embodiment does not limit the processing order of the image and the text; that is, it does not limit the execution order between steps 402-404 and step 405.
It should be noted that when retrieving texts by a single image, the image can be processed once and the processing result matched against the processing result of each text; that is, steps 401-406 are executed the first time the method is performed, and steps 401 and 405-406 in subsequent runs. When retrieving images by a single text, the text can be processed once and the processing result matched against the processing result of each image; that is, steps 401-406 are executed the first time the method is performed, and steps 401-404 and 406 in subsequent runs.
Step 402: input the image into a convolutional neural network and obtain the feature map output by the convolutional neural network.
The convolutional neural network may be ResNet, VGGNet, GoogLeNet, AlexNet and so on; this embodiment does not limit it.
Optionally, the convolutional neural network can be trained on a dataset, and the image is then input into the trained convolutional neural network to obtain the feature map. For example, the convolutional neural network can be trained on the ImageNet dataset; this embodiment does not limit it.
In this embodiment, the output of a convolutional layer of the convolutional neural network is called a feature map.
The outputs of at least one convolutional layer in the convolutional neural network can be extracted, yielding at least one feature map. This embodiment limits neither the number of feature maps nor which convolutional layers output them.
Referring to Fig. 5, the convolutional neural network there is ResNet-152, whose convolutional stages are Conv1, Conv2_x, Conv3_x, Conv4_x and Conv5_x. Suppose the outputs of Conv3_x, Conv4_x and Conv5_x are chosen; then Conv3_x yields a feature map of scale 28 × 28 × 512, Conv4_x a feature map of scale 14 × 14 × 1024, and Conv5_x a feature map of scale 7 × 7 × 2048.
It should be noted that the image can also be pre-processed before being input into the convolutional neural network, so that the image satisfies the input conditions of the network. Taking the images in the MS-COCO and Flickr30K datasets as an example, the image can be randomly cropped and scaled so that its size is 224 × 224.
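The following is a minimal sketch of this pre-processing and multi-stage feature extraction, assuming PyTorch and torchvision (the patent does not prescribe a framework; the normalization constants are the usual ImageNet values and are an assumption):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-processing: random crop and scale to 224 x 224, as in the embodiment.
preprocess = T.Compose([
    T.RandomResizedCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1).eval()

def multi_scale_feature_maps(image):
    """Return the Conv3_x, Conv4_x and Conv5_x outputs of ResNet-152."""
    x = preprocess(image).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        x = resnet.conv1(x)
        x = resnet.bn1(x)
        x = resnet.relu(x)
        x = resnet.maxpool(x)
        x = resnet.layer1(x)                    # Conv2_x
        c3 = resnet.layer2(x)                   # Conv3_x: (1, 512, 28, 28)
        c4 = resnet.layer3(c3)                  # Conv4_x: (1, 1024, 14, 14)
        c5 = resnet.layer4(c4)                  # Conv5_x: (1, 2048, 7, 7)
    return c3, c4, c5
```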
Step 403: divide the feature map, and form the candidate instance features obtained after the division into a candidate instance feature set.
The candidate instance feature set contains at least one candidate instance feature, each corresponding to a region in the feature map of the image.
If the convolutional neural network outputs one feature map, the feature map can be evenly divided into k candidate instance regions, each candidate instance region being one candidate instance feature, yielding one candidate instance feature set, k being a positive integer greater than or equal to 2. Each candidate instance feature in the candidate instance feature set corresponds to a region in the image. If the convolutional neural network outputs at least two feature maps of different scales, one candidate instance feature set can be obtained for each feature map in the manner described above.
If the feature map is divided into k candidate instance regions in this way, and the space occupied by an object in the image corresponds to multiple candidate instance regions, then the instance feature corresponding to that object may also involve multiple candidate instance regions; for this reason, the feature corresponding to a candidate instance region is called a candidate instance feature.
Given an input image I, the candidate instance features can be defined on the feature map as follows: for a feature map of spatial scale M × N with C channels, taking the feature values along the channel dimension at each spatial position yields k = M × N feature vectors, i.e. U = {u_1, …, u_k}, u_i ∈ R^C. Since these feature vectors correspond to given regions in image I, they can be regarded as candidate instance features. C may be 2048 or another value; this embodiment does not limit it.
For ease of understanding, take a feature map of spatial scale 3 × 3 with 512 channels: if the feature map is divided into 9 candidate instance regions along the spatial dimensions (width and height), each candidate instance region is a 512-dimensional feature vector.
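The division into k = M × N candidate instance features amounts to reading the C-dimensional vector at each spatial position. A minimal sketch, assuming a PyTorch tensor layout of (1, C, M, N):

```python
def candidate_instance_features(fmap):
    """Flatten a (1, C, M, N) feature map into k = M * N candidate
    instance features of dimension C, one per spatial position."""
    _, C, M, N = fmap.shape
    return fmap.reshape(C, M * N).t()           # (k, C), row i holds u_i
```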
Step 404: aggregate the candidate instance features in the candidate instance feature set using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image.
The instance feature set contains at least one instance feature, each corresponding to an object or region in the image. Taking the first image in Fig. 1 as an example, the instance features may be the person, the net, the auditorium and so on.
The explanation of the self-attention mechanism is detailed in step 303 and not repeated here.
Since each candidate instance feature corresponds to a region in the feature map of the image, the self-attention mechanism can aggregate similar visual semantics based on the correlations among these candidate instance features, obtaining each object or region in the image that needs attention, namely the instance feature set. That is, for the i-th candidate instance feature in the candidate instance feature set, the self-attention mechanism computes the correlations between the i-th candidate instance feature and the remaining candidate instance features, and the instance feature based on the i-th candidate instance feature is computed from these correlations.
In this embodiment, an instance feature can be computed based on each candidate instance feature. For each candidate instance feature, the self-attention mechanism computes its similarity to the other candidate instance features and converts the similarity into a weight, so that the correlations among candidate instance features are expressed by weights; aggregating an instance feature in this way is essentially a weighted summation over all candidate instance features.
Designing a suitable self-attention mechanism is a difficult point in realizing the present application. Two implementations of self-attention are introduced below, both of which express the correlations with weights.
In the first implementation, for the i-th candidate instance feature in the candidate instance feature set, the cosine similarity between the i-th candidate instance feature and the j-th candidate instance feature is computed, and the weight of the j-th candidate instance feature is computed from the cosine similarity; the weight indicates how much attention is paid to the j-th candidate instance feature when the other candidate instances are aggregated around the i-th candidate instance feature, i and j being positive integers. Each candidate instance feature in the candidate instance feature set is multiplied by its corresponding weight, and the resulting products are summed, yielding the instance feature based on the i-th candidate instance feature.
For the input image I, the candidate instance feature set U corresponding to the feature map is obtained according to step 403, and the cosine similarity between all candidate instance features in U is computed:

$$s_{ij} = \frac{u_i^\top u_j}{\lVert u_i \rVert \, \lVert u_j \rVert}, \quad i, j \in [1, k], \qquad (1)$$

where s_ij denotes the cosine similarity between the i-th candidate instance feature and the j-th candidate instance feature.
The weight of the j-th candidate instance feature is computed from the cosine similarity below; before computing the weight, the cosine similarity can also be normalized:

$$\bar{s}_{ij} = \frac{[s_{ij}]_+}{\sqrt{\sum_{i=1}^{k} [s_{ij}]_+^2}}, \qquad (2)$$

where [x]_+ ≡ max(x, 0).
The weight of the j-th candidate instance feature is then computed from the normalized cosine similarity; the weight indicates how much attention is paid to the j-th candidate instance feature when the other candidate instances are aggregated around the i-th candidate instance feature:

$$\alpha_{ij} = \frac{\exp(\lambda_1 \bar{s}_{ij})}{\sum_{j=1}^{k} \exp(\lambda_1 \bar{s}_{ij})}, \qquad (3)$$

where λ_1 is a hyper-parameter controlling the effect of the aggregation.
In one possible implementation, λ_1 = 9.
For the i-th candidate instance feature u_i, the other relevant candidate instance features can be gathered by weighted summation, yielding the instance feature aggregated from u_i:

$$a_i = \sum_{j=1}^{k} \alpha_{ij} u_j. \qquad (4)$$

In this embodiment, the self-attention mechanism in the first implementation above can be called Deterministic Self-Attention (DSA); it measures the similarity between candidate instance features with cosine similarity and does not need to introduce additional learnable parameters.
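A minimal sketch of DSA under the reconstruction of equations (1)-(4) above, assuming PyTorch tensors; as stated, it introduces no learnable parameters:

```python
import torch
import torch.nn.functional as F

def deterministic_self_attention(U, lambda1=9.0):
    """DSA: aggregate candidate instance features by cosine similarity.

    U: (k, C) tensor of candidate instance features u_1..u_k.
    Returns a (k, C) tensor of instance features a_1..a_k.
    """
    s = F.cosine_similarity(U.unsqueeze(1), U.unsqueeze(0), dim=-1)   # (k, k)
    s = s.clamp(min=0)                                                # [s_ij]_+
    s = s / s.pow(2).sum(dim=0, keepdim=True).sqrt().clamp(min=1e-8)  # eq. (2)
    alpha = F.softmax(lambda1 * s, dim=1)                             # eq. (3)
    return alpha @ U                                                  # eq. (4)
```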
In the second implementation, each candidate instance feature in the candidate instance feature set is respectively mapped into a first feature space, a second feature space and a third feature space. For the i-th candidate instance feature in the candidate instance feature set, the weight of the j-th candidate instance feature is computed from the j-th candidate instance feature in the first feature space and the i-th candidate instance feature in the second feature space; the weight indicates how much attention is paid to the j-th candidate instance feature when the other candidate instances are aggregated around the i-th candidate instance feature, i and j being positive integers. Each candidate instance feature in the third feature space is multiplied by its weight, the resulting products are summed, and residual fitting is applied, yielding the instance feature based on the i-th candidate instance feature.
For the feature map x ∈ R^{C×M×N} of the input image I, x can be flattened into a two-dimensional matrix u ∈ R^{C×k}, k = M × N, corresponding to the k candidate instance features described in step 403. To model the weight matrix β ∈ R^{k×k} among the candidate instance features, u is first mapped into the first feature space θ and the second feature space φ, giving θ(u) = W_θ u and φ(u) = W_φ u, where W_θ and W_φ are learnable model parameters. In one possible implementation, these mappings are realized with 1 × 1 convolutions, and the reduced output channel dimension is a value chosen in the confirmatory experiments.
For the i-th candidate instance feature u_i, the weight of the j-th candidate instance feature is computed; the weight indicates how much attention is paid to the j-th candidate instance feature when the other candidate instances are aggregated around the i-th candidate instance feature:

$$\beta_{ji} = \frac{\exp(s_{ji})}{\sum_{j=1}^{k} \exp(s_{ji})}, \qquad (5)$$

where s_{ji} = θ(u_j)^⊤ φ(u_i), i, j ∈ [1, k]; θ(u_j) is the j-th candidate instance feature in the first feature space and φ(u_i) is the i-th candidate instance feature in the second feature space.
u is then mapped into the third feature space g, giving g(u) = W_g u, where W_g ∈ R^{C×C} is a learnable model parameter. In one possible implementation, this mapping is realized with a 1 × 1 convolution.
For the i-th candidate instance feature u_i, the other relevant candidate instance features can be gathered by weighted summation, yielding the instance feature aggregated from u_i:

$$a_i = \sum_{j=1}^{k} \beta_{ji} \, g(u_j). \qquad (6)$$

Furthermore, residual fitting can be applied to the instance feature a_i obtained by the weighted summation, i.e. a residual fitting module is added on top of a_i to obtain the final instance feature

$$y_i = \eta a_i + u_i, \qquad (7)$$

where η is a learnable model parameter.
In this embodiment, the self-attention mechanism in the second implementation above can be called Adaptive Self-Attention (ASA); it adaptively models the correlations among candidate instances with a neural network and needs to introduce additional learnable parameters.
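A minimal PyTorch sketch of ASA as described by equations (5)-(7); the reduced channel dimension C // 8 is an assumption, since the patent leaves the experimentally chosen value unstated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSelfAttention(nn.Module):
    """ASA: learned self-attention over candidate instance features."""
    def __init__(self, C, c_bar=None):
        super().__init__()
        c_bar = c_bar or C // 8                  # assumed reduced dimension
        self.theta = nn.Conv1d(C, c_bar, 1)      # first feature space
        self.phi = nn.Conv1d(C, c_bar, 1)        # second feature space
        self.g = nn.Conv1d(C, C, 1)              # third feature space
        self.eta = nn.Parameter(torch.zeros(1))  # residual fitting weight

    def forward(self, u):
        # u: (1, C, k) flattened feature map, k = M * N
        s = self.theta(u).transpose(1, 2) @ self.phi(u)  # (1, k, k), entry [j, i] = s_ji
        beta = F.softmax(s, dim=1)                       # eq. (5): normalize over j
        a = self.g(u) @ beta                             # eq. (6): a_i = sum_j beta_ji g(u_j)
        return self.eta * a + u                          # eq. (7): y_i = eta * a_i + u_i
```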
It should be noted that when the convolutional neural network outputs one feature map, one instance feature set can be obtained through either of the two implementations above. When the convolutional neural network outputs feature maps of n different scales, n ≥ 2, the feature maps of the n different scales need to be fused, because the self-attention mechanism does not change the spatial scale of a feature map.
Since the spatial scale of a feature map determines the number of candidate instance features, and instance features are aggregated from candidate instance features, the spatial scale of the feature map indirectly determines the number of instance features: the larger the feature map, the more instance features are encoded. The more instance features there are, the harder it is to align them with the lexical features of the text in the semantic space; hence, in this embodiment, the number of instance features needs to be reduced when fusing feature maps of different scales.
In one possible implementation, the feature maps of different scales can be fused by down-sampling. For the m-th of the n feature maps, the scale of the (m+1)-th feature map is obtained, 1 ≤ m < n; the instance feature set generated from the m-th feature map is down-sampled to the scale of the (m+1)-th feature map and fused with the instance feature set generated from the (m+1)-th feature map; the fused instance feature set is taken as the instance feature set ultimately generated from the (m+1)-th feature map.
Referring to Fig. 5, the scales of the three feature maps are 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048. Starting with the first feature map, the 28 × 28 × 512 feature map is down-sampled (maxpool) to 14 × 14 × 512 and concatenated (Concat) with the 14 × 14 × 1024 feature map along the channel dimension, yielding a 14 × 14 × 1536 feature map; the 14 × 14 × 1536 feature map is then down-sampled (maxpool) to 7 × 7 × 1536 and concatenated (Concat) with the 7 × 7 × 2048 feature map along the channel dimension, yielding a 7 × 7 × 3584 feature map.
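A minimal sketch of this coarse-to-fine fusion, assuming PyTorch tensors; adaptive max-pooling stands in here for the maxpool down-sampling in Fig. 5:

```python
import torch
import torch.nn.functional as F

def fuse_multi_scale(maps):
    """Fuse feature maps from the largest spatial scale to the smallest
    by maxpool down-sampling followed by channel concatenation.

    maps: list of (1, C_m, M_m, N_m) tensors ordered from largest to
    smallest spatial scale, e.g. 28x28x512, 14x14x1024, 7x7x2048.
    """
    fused = maps[0]
    for nxt in maps[1:]:
        fused = F.adaptive_max_pool2d(fused, nxt.shape[-2:])  # down-sample
        fused = torch.cat([fused, nxt], dim=1)                # concat on channels
    return fused   # (1, 3584, 7, 7) for the Fig. 5 example
```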
The first point to note is that, since the feature maps correspond to the instance feature sets, down-sampling a feature map is equivalent to down-sampling the instance feature set.
The second point to note is that Fig. 5 illustrates down-sampling by maxpool; down-sampling can also be performed by average pooling, convolution and the like, which this embodiment does not limit.
The third point to note is that, after scale fusion, the instance feature set can be mapped by a fully connected layer into a shared semantic space of dimension D:

$$v_i = W_v a_i + b_v, \qquad (8)$$

where W_v and b_v are the model parameters of the fully connected layer. D may be 1024 or another value; this embodiment does not limit it.
The fourth point to note is that, if instance features are extracted with an object detector, there is no need to fuse multi-scale feature maps, because a multi-scale design is already built into the detection boxes of the object detector. Since multi-scale fusion need not be considered when extracting instance features with an object detector, it was not obvious for the inventors to add a multi-scale feature map fusion process when obtaining instance features with self-attention so as to enhance the visual semantics; this is inherently difficult.
The fifth point to note is that, after the instance feature set of an image is obtained with the self-attention mechanism, the instance feature set can also be applied to other applications, for example generating a description text from the image, visual question answering, generating an image from a text and so on, which this embodiment does not limit.
The sixth point to note is that when the image is an image frame in a video, besides extracting the feature map of the image frame, the dynamic association information among the image frames can also be extracted, and the instance feature set is then generated from the feature map and this dynamic association information, which this embodiment does not limit.
Step 405: encode the text to obtain a text vector.
The recurrent neural network may be an RNN, an LSTM, a GRU and so on; this embodiment does not limit it. Taking a bidirectional GRU as an example, the encoding process includes the following steps:
1) Segment the sentence into words, obtaining r words, r being a positive integer.
Given an input sentence S = {w_1, …, w_r}, most potential instances are nouns or noun phrases, and the goal here is to learn the word vectors. The sentence therefore needs to be segmented into r words. Many word segmentation methods exist, and none is prescribed here.
2) For the t-th of the r words, encode the t-th word according to its position in the vocabulary, obtaining the t-th lexical feature vector, 1 ≤ t ≤ r.
After word segmentation, the bidirectional GRU can embed the word vectors in combination with the context of the sentence. For the t-th word w_t in the sentence, a one-hot encoding Π_t first identifies its position in the whole vocabulary, and Π_t is then mapped, based on this position, to a lexical feature vector of a predetermined dimension:

$$x_t = W \Pi_t, \quad t \in [1, r],$$

where the vocabulary consists of the words whose frequency of occurrence exceeds a preset frequency after all sentences are segmented. In one possible implementation, the predetermined dimension is 300, and the embedding is initialized by default with pre-trained GloVe (word embedding) features.
For example, suppose the vocabulary contains 300 words and the position of w_t in the vocabulary is 29; then the one-hot encoding of the t-th word is a 300-dimensional vector whose 29th component is 1 and whose remaining components are all 0.
3) At the t-th time step, input the t-th lexical feature vector into the bidirectional gated recurrent unit, and determine the t-th word vector from the bidirectional hidden states of the bidirectional gated recurrent unit at the t-th time step.
The bidirectional GRU consists of a forward GRU and a backward GRU: the r lexical feature vectors can be fed into the forward GRU in the order w_1 to w_r, and into the backward GRU in the order w_r to w_1.
For the forward GRU,

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{GRU}}(x_t, \overrightarrow{h}_{t-1}), \quad t \in [1, r].$$

For the backward GRU,

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{GRU}}(x_t, \overleftarrow{h}_{t+1}), \quad t \in [1, r].$$

The t-th word vector can be determined by the average of the forward hidden state and the backward hidden state, which summarizes the information of the entire sentence around the word w_t:

$$e_t = \frac{\overrightarrow{h}_t + \overleftarrow{h}_t}{2}.$$

4) Combine the r word vectors obtained into the text vector.
Referring to Fig. 5, the text is input into the bidirectional GRU, and the text vector can be obtained from the hidden states h.
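A minimal PyTorch sketch of the text encoder in steps 1)-4); the hidden dimension here is an assumption (any value works, provided the word vectors are ultimately comparable with the D-dimensional shared semantic space):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embed word indices and encode them with a bidirectional GRU;
    the word vector e_t is the mean of the forward and backward
    hidden states at step t."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # x_t = W * one-hot
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (1, r) indices of the segmented words in the vocabulary
        x = self.embed(word_ids)              # (1, r, 300)
        h, _ = self.gru(x)                    # (1, r, 2 * hidden_dim)
        fwd, bwd = h.chunk(2, dim=-1)         # forward / backward hidden states
        return (fwd + bwd) / 2                # text vector E = {e_1, ..., e_r}
```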
Step 406: compute the matching degree between the image and the text from the instance feature set and the text vector.
In this embodiment, the instance feature set and the text vector can be mapped into a common semantic space; the global similarity between the instance feature set and the text vector is then computed in this semantic space, and the matching degree between the image and the text is measured by this global similarity.
For the given input image I and sentence S, the instance feature set V = {v_1, …, v_k}, v_i ∈ R^D, and the text vector E = {e_1, …, e_r}, e_t ∈ R^D, are obtained according to the steps above. To measure the global similarity between image I and sentence S, Stacked Cross Attention can be used, which assesses the final global similarity by aggregating local similarities between the instance feature set and the text vector. The local similarities can be aggregated in two directions: aggregating the text vector with respect to the instance feature set (Image-Text, abbreviated i-t) and aggregating the instance feature set with respect to the text vector (Text-Image, abbreviated t-i). The two aggregation directions are introduced separately below.
Taking aggregation in the i-t direction as an example: for the p-th instance feature in the instance feature set, the similarity between the p-th instance feature and the q-th word vector in the text vector is computed, and the weight of the q-th word vector is computed from this similarity, p and q being positive integers. Each word vector in the text vector is multiplied by its weight and the resulting products are summed, yielding the text semantic vector based on the p-th instance feature. The cosine similarity between the p-th instance feature and the text semantic vector is computed, and the global similarity between the image and the text is computed from the cosine similarities between all instance features in the instance feature set and their corresponding text semantic vectors; this global similarity indicates the matching degree between the image and the text.
For the p-th instance feature v_p, its similarities to all word vectors {e_1, …, e_r} in the text vector are first computed and taken as weights, and a weighted summation over all word vectors then yields the text semantic vector $\hat{e}_p$ aggregated with respect to v_p. The local similarity based on v_p can be quantified by the cosine similarity R(v_p, $\hat{e}_p$) between v_p and $\hat{e}_p$, and the global similarity between image I and sentence S can be aggregated with the LogSumExp function:

$$S_{\mathrm{LSE}}(I, S) = \log\Big( \sum_{p=1}^{k} \exp\big( \lambda_2 R(v_p, \hat{e}_p) \big) \Big)^{1/\lambda_2},$$

where λ_2 is a hyper-parameter. Alternatively, the global similarity between image I and sentence S can be aggregated with the mean function:

$$S_{\mathrm{AVG}}(I, S) = \frac{1}{k} \sum_{p=1}^{k} R(v_p, \hat{e}_p).$$
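A minimal sketch of the i-t aggregation with both pooling options, assuming PyTorch tensors; turning similarities into weights via a plain softmax over cosine similarities, and the value of λ2, are assumptions here (the patent does not fix either):

```python
import torch
import torch.nn.functional as F

def global_similarity_it(V, E, lambda2=6.0, reduce="lse"):
    """i-t stacked cross attention: attend over word vectors for each
    instance feature, then pool the local cosine similarities.

    V: (k, D) instance features; E: (r, D) word vectors.
    """
    Vn, En = F.normalize(V, dim=-1), F.normalize(E, dim=-1)
    attn = F.softmax(Vn @ En.t(), dim=-1)        # (k, r) weights over words
    e_hat = attn @ E                             # text semantic vectors e_hat_p
    R = F.cosine_similarity(V, e_hat, dim=-1)    # (k,) local similarities
    if reduce == "lse":                          # S_LSE
        return torch.logsumexp(lambda2 * R, dim=0) / lambda2
    return R.mean()                              # S_AVG
```

The t-i direction is symmetric: swap the roles of V and E.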
Taking aggregation in the t-i direction as an example: for the p-th word vector in the text vector, the similarity between the p-th word vector and the q-th instance feature in the instance feature set is computed, and the weight of the q-th instance feature is computed from this similarity, p and q being positive integers. Each instance feature in the instance feature set is multiplied by its weight and the resulting products are summed, yielding the image semantic vector based on the p-th word vector. The cosine similarity between the p-th word vector and the image semantic vector is computed, and the global similarity between the text and the image is computed from the cosine similarities between all word vectors in the text vector and their corresponding image semantic vectors; this global similarity indicates the matching degree between the image and the text.
The global similarity aggregated in the t-i direction is computed in the same way as the global similarity aggregated in the i-t direction, and is not repeated here.
Optionally, the average of the global similarity aggregated in the t-i direction and the global similarity aggregated in the i-t direction can also be computed and taken as the global similarity between the image and the text.
In this embodiment, the global similarity between the image and the text is the matching degree between the image and the text. After the matching degree between the image and the text is obtained, the matching degrees greater than a predetermined threshold can be selected and the corresponding images or texts taken as the search results; alternatively, the matching degrees of all images or texts can be sorted, and the top-ranked images or texts taken as the search results. Referring to Fig. 5, after the stacked cross attention, a ranking step can also determine the top-ranked images or texts, yielding the search results.
In conclusion picture and text matching process provided by the embodiments of the present application, by generating candidate translation example feature according to image Set, recycling polymerize the candidate translation example feature in candidate translation example characteristic set from attention mechanism, and reality can be obtained Example characteristic set calculates the matching degree between image and text further according to example aspects set and text vector, in this way, can be with Using example aspects are polymerize by the relevance between candidate translation example feature from attention mechanism, object inspection is avoided passing through Device is surveyed to obtain the example aspects set of image, needed to mark institute on every image when both having solved trained object detector There are the classification and location information of example, the problem for causing the training difficulty of object detector big, to reach simplified picture and text The effect for the realization difficulty matched;Also it solves object detector and also exports corresponding position letter other than exporting semantic information Breath, and location information matches picture and text not helpful, the example aspects for causing object detector to identify are not particularly suited for With text, the problem of influencing picture and text matched accuracy rate, to achieve the effect that improve the matched accuracy rate of picture and text.
In addition, the output of object detector can provide useless location information, this does not fully consider that cross-module state is retrieved The characteristic of itself, for inventor it can be found that the problem of object detector is in the presence of cross-module state is retrieved, this inherently has difficulty Degree.
In addition, since the size of the receptive field of convolution kernel is fixed, so, it is frequently used to grab from attention mechanism It takes the long-term dependence between characteristics of image, and polymerize similar semantic information using from attention mechanism in the application To obtain example aspects set, this is different from the conventional effect from attention mechanism, so, it, will be from attention for this point Power mechanism introduces the retrieval of cross-module state and does not allow to be readily conceivable that.
It is merged by the characteristic pattern to multiple scales, can be used to enhance vision semanteme.In addition, due to utilizing object Detector extract example aspects when do not need to consider the fusion of multiple dimensioned characteristic pattern, so, inventor it is conceivable that Using the fusion process for when obtaining example aspects, increasing multiple dimensioned characteristic pattern from attention mechanism, to enhance vision language Justice, this inherently has difficulty.
Realizing the fusion of the multi-scale feature maps by down-sampling reduces the number of instance features while fusing the feature maps, which lowers the difficulty of aligning the instance features with the words of the text in the semantic space and improves the accuracy of picture-text matching.
The above method may be implemented by a picture-text matching model; if the above method is named the SAVE method, the model may be named the SAVE model. The SAVE model attempts to extract instance features (also called instance-level visual features) from potential objects or regions in an image in an end-to-end manner, and then performs cross-modal retrieval based on those instance features. Inspired by approaches that obtain instance features from an object detector, the present embodiment substitutes a self-attention mechanism for the object detector and explores its effect on instance-feature extraction. To obtain instance features of different levels, feature maps of different scales may first be extracted with a convolutional neural network, and the self-attention mechanism then applied to each of these feature maps separately, with the expectation of extracting the fine details of low-order objects or regions from the high-resolution feature maps and aggregating high-order semantic concept information from the low-resolution feature maps.
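To make the aggregation step concrete, the following is a minimal sketch, assuming a PyTorch implementation, of applying self-attention to one feature-map scale to aggregate candidate instance features. The class name, the channel-reduction factor of 8, and the use of 1×1 convolutions for the three feature-space mappings are illustrative assumptions, not the patent's reference code.

```python
import torch
import torch.nn as nn

class SelfAttentionAggregation(nn.Module):
    """Aggregates candidate instance features from one feature-map scale."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions map candidates into the first, second and third
        # feature spaces (query / key / value in common terminology).
        self.to_first = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.to_second = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.to_third = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feature_map.shape
        # Each spatial position of the feature map is one candidate instance
        # feature; flatten the map into a set of h*w candidate vectors.
        q = self.to_first(feature_map).flatten(2).transpose(1, 2)   # (b, hw, c//8)
        k = self.to_second(feature_map).flatten(2)                  # (b, c//8, hw)
        v = self.to_third(feature_map).flatten(2).transpose(1, 2)   # (b, hw, c)
        weights = torch.softmax(q @ k, dim=-1)                      # (b, hw, hw)
        # Weighted sum of the candidates plus residual fitting.
        out = weights @ v + feature_map.flatten(2).transpose(1, 2)
        return out  # (b, hw, c): one instance feature per position
```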
In addition to a convolutional neural network, the SAVE model further includes a recurrent neural network and a matching model. The recurrent neural network is used to extract the text vector, and the matching model is used to extract the instance feature set based on the feature maps and to match the instance feature set against the text vector.
When training the matching model, particular attention must be paid to the negative samples in the training process. For a matched picture-text pair $(i, s)$, the hardest negative samples can be defined as $i_h = \arg\max_{x \neq i} S(x, s)$ and $s_h = \arg\max_{y \neq s} S(i, y)$. Optionally, a triplet ranking loss of hinge form to be minimized can also be defined as
$$L(i, s) = \left[m - S(i, s) + S(i, s_h)\right]_+ + \left[m - S(i, s) + S(i_h, s)\right]_+$$
where $m$ is the margin parameter of the hinge loss and $[x]_+ \equiv \max(x, 0)$.
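As a hedged illustration, this hard-negative hinge loss could be implemented as follows over a batch of matched pairs; treating the diagonal of the batch similarity matrix as the matched pairs is an assumption of this sketch.

```python
import torch

def triplet_hinge_loss(scores: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """scores[i, j] = S(image_i, text_j); diagonal entries are the matched pairs."""
    pos = scores.diag()                                   # S(i, s)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    masked = scores.masked_fill(mask, float('-inf'))      # exclude matched pairs
    s_h = masked.max(dim=1).values                        # hardest negative text per image
    i_h = masked.max(dim=0).values                        # hardest negative image per text
    loss = (margin - pos + s_h).clamp(min=0) + (margin - pos + i_h).clamp(min=0)
    return loss.mean()
```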
When training the SAVE model, the margin of the loss function may be set to 0.2, and the maximum gradient-clipping value during training to 2.0. Optionally, the SAVE model may be trained with the Adam optimizer, with the number of training samples per batch set to 128. The training process is divided into two stages: in the first stage the parameters of ResNet-152 are frozen and the initial learning rate is set to 5e-4; in the second stage ResNet-152 is trained together with the other parts of the SAVE model, with an initial learning rate of 2e-5. It should be noted that the learning rate is dataset-dependent; the learning rates listed here are those for the MS-COCO dataset.
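A sketch of this two-stage schedule under the stated hyperparameters follows; `save_model` and its `backbone` attribute are placeholder names, not the patent's identifiers.

```python
import torch

def make_optimizer(save_model, stage: int) -> torch.optim.Adam:
    backbone = save_model.backbone          # assumed attribute holding ResNet-152
    if stage == 1:
        for p in backbone.parameters():
            p.requires_grad = False         # first stage: freeze ResNet-152
        params = [p for p in save_model.parameters() if p.requires_grad]
        return torch.optim.Adam(params, lr=5e-4)
    for p in backbone.parameters():
        p.requires_grad = True              # second stage: train the whole model
    return torch.optim.Adam(save_model.parameters(), lr=2e-5)

# Each update then clips gradients to the stated maximum of 2.0:
# torch.nn.utils.clip_grad_norm_(save_model.parameters(), max_norm=2.0)
```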
It should be noted that one implementation difficulty of the present application lies in tuning the parameters of the SAVE model, which is closely related to the learning rate, the choice of training method, and the training settings discussed above. Selecting these parameters requires observing the variation of the training loss and then adjusting the parameters based on experience. Hyperparameter selection generally uses grid search, which is rather time-consuming.
The picture-text matching method is applied below to the four application scenarios described above, and the overall flow of cross-modal retrieval in these four scenarios is introduced.
1) Mutual retrieval of images and texts
When retrieving texts with an image, the SAVE model obtains an input image and executes steps 402-404 above to calculate the instance feature set of the image; it then reads the z-th text from a text library containing at least one text and executes steps 405-406 above to calculate the matching degree between the image and the z-th text; z is updated to z+1 and the step of reading the z-th text from the text library is repeated, the loop stopping once the matching degrees between the image and all texts in the text library have been obtained, z being a positive integer. The SAVE model then sorts the texts by their matching degree with the image, determines the top-ranked texts as the search results, and outputs the search results.
When retrieving images with a text, the SAVE model obtains an input text and executes step 405 above to calculate the text vector; it then reads the z-th image from an image library containing at least one image and executes steps 402-404 and 406 above to calculate the matching degree between the text and the z-th image; z is updated to z+1 and the step of reading the z-th image from the image library is repeated, the loop stopping once the matching degrees between the text and all images in the image library have been obtained, z being a positive integer. The SAVE model then sorts the images by their matching degree with the text, determines the top-ranked images as the search results, and outputs the search results.
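The image-to-text direction of this loop might look as follows; `instance_features`, `encode_text`, and `match` are placeholder names standing in for steps 402-404, 405, and 406 respectively, and are not the patent's identifiers. The text-to-image direction is symmetric.

```python
def retrieve_texts(save_model, image, text_library, top_k=5):
    feats = save_model.instance_features(image)            # steps 402-404
    scored = []
    for z, text in enumerate(text_library):                # z runs over the library
        vec = save_model.encode_text(text)                 # step 405
        score = float(save_model.match(feats, vec))        # step 406
        scored.append((score, z))
    scored.sort(reverse=True)                              # sort by matching degree
    return [text_library[z] for _, z in scored[:top_k]]    # top-ranked texts
```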
2) Mutual retrieval of images and labels
When retrieving labels with an image, the SAVE model obtains an input image and executes steps 402-404 above to calculate the instance feature set of the image; it then reads the z-th label from a label library containing at least one label and executes steps 405-406 above to calculate the matching degree between the image and the z-th label; z is updated to z+1 and the step of reading the z-th label from the label library is repeated, the loop stopping once the matching degrees between the image and all labels in the label library have been obtained, z being a positive integer. The SAVE model then sorts the labels by their matching degree with the image, determines the top-ranked labels as the search results, and outputs the search results.
When retrieving images with a label, the SAVE model obtains an input label and executes step 405 above to calculate the text vector; it then reads the z-th image from an image library containing at least one image and executes steps 402-404 and 406 above to calculate the matching degree between the label and the z-th image; z is updated to z+1 and the step of reading the z-th image from the image library is repeated, the loop stopping once the matching degrees between the label and all images in the image library have been obtained, z being a positive integer. The SAVE model then sorts the images by their matching degree with the label, determines the top-ranked images as the search results, and outputs the search results.
3) Mutual retrieval of videos and texts
When retrieving texts with a video, the SAVE model obtains an input video segment. Since the content of f consecutive video frames in the video (f being a positive integer) differs little, the input video may be sampled every f frames to obtain image frames; taking each image frame as an input image, the SAVE model executes steps 402-404 above to calculate the instance feature set of the image. It then reads the z-th text from a text library containing at least one text, executes steps 405-406 above to calculate the matching degree between each image and the z-th text, and determines the matching degree between the z-th text and the video as the average or maximum of the matching degrees between the z-th text and all images in the video; z is updated to z+1 and the step of reading the z-th text from the text library is repeated, the loop stopping once the matching degrees between the video and all texts in the text library have been obtained, z being a positive integer. The SAVE model then sorts the texts by their matching degree with the video, determines the top-ranked texts as the search results, and outputs the search results.
When retrieving videos with a text, the SAVE model obtains an input text and executes step 405 above to calculate the text vector; it then reads the z-th video segment from a video library containing at least one video segment. Since the content of f consecutive video frames in the z-th video segment (f being a positive integer) differs little, the z-th video segment may be sampled every f frames to obtain image frames; taking each image frame as an image, the SAVE model executes steps 402-404 and 406 above to calculate the matching degree between the text and each image, and determines the matching degree between the text and the z-th video segment as the average or maximum of the matching degrees between the text and all images in the z-th video segment; z is updated to z+1 and the step of reading the z-th video segment from the video library is repeated, the loop stopping once the matching degrees between the text and all video segments in the video library have been obtained, z being a positive integer. The SAVE model then sorts the video segments by their matching degree with the text, determines the top-ranked video segments as the search results, and outputs the search results.
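The frame-sampling and score-aggregation idea can be sketched as follows; the helper names are placeholders, and the choice between average and maximum follows the two options named above.

```python
def video_text_score(save_model, video_frames, text, f=10, use_max=False):
    text_vec = save_model.encode_text(text)                # step 405
    scores = []
    for frame in video_frames[::f]:                        # sample every f frames
        feats = save_model.instance_features(frame)        # steps 402-404
        scores.append(float(save_model.match(feats, text_vec)))  # step 406
    return max(scores) if use_max else sum(scores) / len(scores)
```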
4) Mutual retrieval of videos and labels
When retrieving labels with a video, the SAVE model obtains an input video segment. Since the content of f consecutive video frames in the video (f being a positive integer) differs little, the input video may be sampled every f frames to obtain image frames; taking each image frame as an input image, the SAVE model executes steps 402-404 above to calculate the instance feature set of the image. It then reads the z-th label from a label library containing at least one label, executes steps 405-406 above to calculate the matching degree between each image and the z-th label, and determines the matching degree between the z-th label and the video as the average or maximum of the matching degrees between the z-th label and all images in the video; z is updated to z+1 and the step of reading the z-th label from the label library is repeated, the loop stopping once the matching degrees between the video and all labels in the label library have been obtained, z being a positive integer. The SAVE model then sorts the labels by their matching degree with the video, determines the top-ranked labels as the search results, and outputs the search results.
When retrieving videos with a label, the SAVE model obtains an input label and executes step 405 above to calculate the text vector; it then reads the z-th video segment from a video library containing at least one video segment. Since the content of f consecutive video frames in the z-th video segment (f being a positive integer) differs little, the z-th video segment may be sampled every f frames to obtain image frames; taking each image frame as an image, the SAVE model executes steps 402-404 and 406 above to calculate the matching degree between the label and each image, and determines the matching degree between the label and the z-th video segment as the average or maximum of the matching degrees between the label and all images in the z-th video segment; z is updated to z+1 and the step of reading the z-th video segment from the video library is repeated, the loop stopping once the matching degrees between the label and all video segments in the video library have been obtained, z being a positive integer. The SAVE model then sorts the video segments by their matching degree with the label, determines the top-ranked video segments as the search results, and outputs the search results.
The SAVE model was run in the following hardware environment to verify its technical effect:
CPU: 8 cores
Memory: 128 GB
GPU: 2 × Nvidia Tesla P40
Before verifying the technical effect of the SAVE method, the two datasets used for cross-modal retrieval are introduced. MS-COCO and Flickr30K are two benchmark datasets of cross-modal retrieval. The Flickr30K dataset contains 31783 images collected from the Flickr website, each image having 5 texts. We use 29000 images as the training set, 1014 images as the validation set, and 1000 images as the test set. The MS-COCO dataset contains 123287 images, each likewise corresponding to 5 texts. We use 82783 images as the training set, 5000 images as the validation set, and 5000 images as the test set. Additionally, the 30504 original validation images of the MS-COCO dataset may be added to the retrieval training set, in which case the training set expands to 113287 images while everything else remains unchanged. Testing on the MS-COCO dataset is divided into two settings: 1) testing directly on the 5000 images; 2) splitting the 5000 images into 5 parts, testing separately on each part of 1000 images, and averaging the results.
R@K (K = 1, 5, 10) is a common evaluation metric for cross-modal retrieval tasks; it is defined as the percentage of queries in which at least one correct result is contained among the top K retrieved results. Med r is another evaluation metric: it is the median, over all queries, of the rank of the highest-ranked correct result. Additionally, an evaluation metric Sum, which sums the R@K values, can be defined to measure the overall effect of cross-modal retrieval.
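For clarity, R@K and Med r can be computed as in the sketch below, assuming `ranks[q]` holds the 1-based rank of the first correct result for query q.

```python
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    """Percentage of queries with at least one correct result in the top K."""
    return 100.0 * float(np.mean(ranks <= k))

def med_r(ranks: np.ndarray) -> float:
    """Median rank of the first correct result over all queries."""
    return float(np.median(ranks))

# Example: ranks = np.array([1, 3, 12, 2]) gives R@1 = 25.0, R@5 = 75.0, Med r = 2.5.
```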
The experimental results of the SAVE method are compared below with those of several methods in the related art on the Flickr30K dataset; please refer to Table 1. The experimental results of the various methods in Table 1 are divided into three groups, and the realization principle of the methods in each group is introduced below.
The methods in the first group are end-to-end trainable: visual features are extracted directly by VGG or ResNet, and the extracted visual features are then mapped into the semantic space either one-to-one or many-to-many. One-to-one mapping means mapping the global feature vectors of the image and the text into the semantic space and measuring the matching degree between image and text by the distance between the global feature vectors in that space. However, the high-order semantic information contained in a global feature vector is entangled, so the content of an image (or text) retrieved on this basis is often inconsistent in some details with the given text (or image); for example, the objects in the image may not correspond to the nouns in the text. Many-to-many mapping means mapping the local feature set of the image and the lexical feature set of the text into the semantic space, and aggregating the local similarities between image and text to measure their global similarity, thereby obtaining the matching degree between image and text.
The methods in the second group extract the instance feature set using an object detector.
The methods in the third group are variants of the SAVE method described above, classified by the self-attention mechanism used and the number of feature-map scales: ms2 means the SAVE method extracts feature maps at the two scales 14 × 14 and 7 × 7, and ms3 means it extracts feature maps at the three scales 28 × 28, 14 × 14, and 7 × 7.
It can be seen that the SAVE method substantially outperforms all of the methods in the first group (end-to-end trainable) on all evaluation metrics. Our best experimental result is achieved by fusing the instance feature sets aggregated by self-attention at 3 levels; this result is comparable to the best existing retrieval method (the SCAN method). It is worth noting that our SAVE method does not require additional manually annotated data to train an object detector.
Table 1
Table 2 shows the experimental results of the various methods on the 1K and 5K test sets of the MS-COCO dataset; as in Table 1, the experimental results are divided into six groups according to the realization principles of the methods. Here our best experimental result is achieved by fusing the instance feature sets aggregated by self-attention at 2 levels, and all evaluation metrics are better than those of the methods in the first and fourth groups (end-to-end trainable). However, there is still some gap between the SAVE method and the best existing retrieval method (the SCAN method), especially on the 5K test set. This may be because the SCAN method, besides using additional annotated data to train its object detector, also trains the object detector with an additional dataset (Visual Genome), so the features extracted by the trained object detector may generalize better, giving the SCAN method better versatility.
Table 2
Referring to FIG. 6, which shows a structural block diagram of a picture-text matching apparatus provided by one embodiment of the present application. The picture-text matching apparatus comprises:
an acquisition module 610, configured to acquire an image and a text to be matched;
a generation module 620, configured to generate a candidate instance feature set according to the image acquired by the acquisition module 610;
an aggregation module 630, configured to aggregate the candidate instance features in the candidate instance feature set generated by the generation module 620 using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image;
an encoding module 640, configured to encode the text acquired by the acquisition module 610 to obtain a text vector;
a calculation module 650, configured to calculate the matching degree between the image and the text according to the instance feature set obtained by the aggregation module 630 and the text vector obtained by the encoding module 640.
In one possible implementation, the aggregation module 630 is further configured to:
for the i-th candidate instance feature in the candidate instance feature set, calculate the correlations between the i-th candidate instance feature and the remaining candidate instance features using the self-attention mechanism, and calculate the instance feature based on the i-th candidate instance feature according to the correlations.
In one possible implementation, when the correlation is a weight, the aggregation module 630 is further configured to:
for the i-th candidate instance feature in the candidate instance feature set, calculate the cosine similarity between the i-th candidate instance feature and the j-th candidate instance feature, and calculate the weight of the j-th candidate instance feature according to the cosine similarity, the weight indicating the degree of attention paid to the j-th candidate instance feature when aggregating the other candidate instance features based on the i-th candidate instance feature, i and j being positive integers;
multiply each candidate instance feature in the candidate instance feature set by its corresponding weight, and add the resulting products to obtain the instance feature based on the i-th candidate instance feature.
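A minimal sketch of this cosine-similarity variant follows; normalizing the weights with a softmax is an assumption here, since the text only states that the weight is calculated from the cosine similarity.

```python
import torch
import torch.nn.functional as F

def aggregate_by_cosine(candidates: torch.Tensor) -> torch.Tensor:
    """candidates: (n, d) matrix of candidate instance features."""
    normed = F.normalize(candidates, dim=1)
    sims = normed @ normed.t()               # pairwise cosine similarity, (n, n)
    weights = torch.softmax(sims, dim=1)     # weight of the j-th candidate per i
    return weights @ candidates              # (n, d): one instance feature per i
```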
In one possible implementation, when the correlation is a weight, the aggregation module 630 is further configured to:
map each candidate instance feature in the candidate instance feature set into a first feature space, a second feature space, and a third feature space respectively;
for the i-th candidate instance feature in the candidate instance feature set, calculate the weight of the j-th candidate instance feature according to the j-th candidate instance feature in the first feature space and the i-th candidate instance feature in the second feature space, the weight indicating the degree of attention paid to the j-th candidate instance feature when aggregating the other candidate instance features based on the i-th candidate instance feature, i and j being positive integers;
multiply each candidate instance feature in the third feature space by its corresponding weight, add the resulting products, and perform residual fitting to obtain the instance feature based on the i-th candidate instance feature.
In one possible implementation, the generation module 620 is further configured to:
input the image into a convolutional neural network and obtain the feature map output by the convolutional neural network;
divide the feature map, and form the candidate instance features obtained after the division into the candidate instance feature set.
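Under the common reading that each spatial position of the feature map becomes one candidate instance feature, the division step reduces to a reshape; this interpretation is an assumption of the sketch below.

```python
import torch

def to_candidates(feature_map: torch.Tensor) -> torch.Tensor:
    """(b, c, h, w) feature map -> (b, h*w, c) set of candidate instance features."""
    return feature_map.flatten(2).transpose(1, 2)
```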
In one possible implementation, when the convolutional neural network outputs feature maps of n different scales and n ≥ 2, the acquisition module 610 is further configured to obtain, for the m-th feature map of the n feature maps, the scale of the (m+1)-th feature map, 1 ≤ m < n;
the apparatus further comprises: a down-sampling module, configured to down-sample, according to the scale of the (m+1)-th feature map obtained by the acquisition module 610, the instance feature set generated based on the m-th feature map, and to merge the resulting instance feature set with the instance feature set generated based on the (m+1)-th feature map;
a determination module, configured to determine the instance feature set merged by the down-sampling module as the instance feature set finally generated based on the (m+1)-th feature map.
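A sketch of the down-sampling fusion follows; using average pooling as the down-sampling operator and element-wise addition as the merge are assumptions, since the text specifies only "down-sampling" and "merging".

```python
import torch
import torch.nn.functional as F

def fuse_scales(feats_m: torch.Tensor, feats_m1: torch.Tensor,
                side_m: int, side_m1: int) -> torch.Tensor:
    """feats_m: (b, side_m**2, c) instance features from the m-th (finer) map;
    feats_m1: (b, side_m1**2, c) from the (m+1)-th (coarser) map."""
    b, _, c = feats_m.shape
    grid = feats_m.transpose(1, 2).reshape(b, c, side_m, side_m)
    pooled = F.adaptive_avg_pool2d(grid, side_m1)     # down-sample to the coarser scale
    pooled = pooled.flatten(2).transpose(1, 2)        # (b, side_m1**2, c)
    return pooled + feats_m1                          # merged instance feature set
```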
In one possible implementation, when the text is a sentence, the calculation module 650 is further configured to:
for the p-th instance feature in the instance feature set, calculate the similarity between the p-th instance feature and the q-th vocabulary vector in the text vector, and calculate the weight of the q-th vocabulary vector according to the similarity, p and q being positive integers;
multiply each vocabulary vector in the text vector by its corresponding weight, and add the resulting products to obtain the text semantic vector based on the p-th instance feature;
calculate the cosine similarity between the p-th instance feature and the text semantic vector;
calculate the global similarity between the image and the text according to the cosine similarities between all instance features in the instance feature set and their corresponding text semantic vectors, the global similarity indicating the matching degree between the image and the text.
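A sketch of this instance-to-word cross attention follows; the softmax weighting and the averaging of the per-instance cosine similarities into the global similarity are assumptions of the sketch. The word-to-instance direction described next is symmetric.

```python
import torch
import torch.nn.functional as F

def global_similarity(instances: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """instances: (n, d) instance features; words: (r, d) vocabulary vectors."""
    sims = F.normalize(instances, dim=1) @ F.normalize(words, dim=1).t()  # (n, r)
    weights = torch.softmax(sims, dim=1)       # weight of the q-th word per instance
    text_sem = weights @ words                 # (n, d) text semantic vectors
    cos = F.cosine_similarity(instances, text_sem, dim=1)  # per-instance similarity
    return cos.mean()                          # global similarity = matching degree
```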
In one possible implementation, when the text is a sentence, the calculation module 650 is further configured to:
for the p-th vocabulary vector in the text vector, calculate the similarity between the p-th vocabulary vector and the q-th instance feature in the instance feature set, and calculate the weight of the q-th instance feature according to the similarity, p and q being positive integers;
multiply each instance feature in the instance feature set by its corresponding weight, and add the resulting products to obtain the image semantic vector based on the p-th vocabulary vector;
calculate the cosine similarity between the p-th vocabulary vector and the image semantic vector;
calculate the global similarity between the text and the image according to the cosine similarities between all vocabulary vectors in the text vector and their corresponding image semantic vectors, the global similarity indicating the matching degree between the image and the text.
In one possible implementation, when the text is a sentence, the encoding module 640 is further configured to:
segment the sentence to obtain r words, r being a positive integer;
for the t-th word of the r words, encode the t-th word according to its position in the vocabulary to obtain the t-th lexical feature vector, 1 ≤ t ≤ r;
input the t-th lexical feature vector into a bidirectional gated recurrent unit at the t-th time step, and determine the t-th vocabulary vector according to the bidirectional hidden state of the bidirectional gated recurrent unit at the t-th time step;
form the r vocabulary vectors thus obtained into the text vector.
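A sketch of this sentence encoder, assuming PyTorch's `nn.GRU`, follows; the embedding and hidden sizes are arbitrary, and averaging the forward and backward hidden states into the t-th vocabulary vector is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a segmented sentence into a text vector (one vector per word)."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word -> lexical feature vector
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        """word_ids: (b, r) positions of the r words in the vocabulary."""
        states, _ = self.gru(self.embed(word_ids))         # (b, r, 2*hidden)
        fwd, bwd = states.chunk(2, dim=-1)                 # bidirectional hidden states
        return (fwd + bwd) / 2                             # (b, r, hidden): vocabulary vectors
```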
In conclusion picture and text coalignment provided by the embodiments of the present application, by generating candidate translation example feature according to image Set, recycling polymerize the candidate translation example feature in candidate translation example characteristic set from attention mechanism, and reality can be obtained Example characteristic set calculates the matching degree between image and text further according to example aspects set and text vector, in this way, can be with Using example aspects are polymerize by the relevance between candidate translation example feature from attention mechanism, object inspection is avoided passing through Device is surveyed to obtain the example aspects set of image, needed to mark institute on every image when both having solved trained object detector There are the classification and location information of example, the problem for causing the training difficulty of object detector big, to reach simplified picture and text The effect for the realization difficulty matched;Also it solves object detector and also exports corresponding position letter other than exporting semantic information Breath, and location information matches picture and text not helpful, the example aspects for causing object detector to identify are not particularly suited for With text, the problem of influencing picture and text matched accuracy rate, to achieve the effect that improve the matched accuracy rate of picture and text.
In addition, the output of an object detector includes location information that is useless for cross-modal retrieval; the object detector thus does not fully account for the characteristics of cross-modal retrieval itself. Discovering that object detectors are problematic in the cross-modal retrieval setting was itself non-trivial for the inventors.
In addition, since the receptive field of a convolution kernel has a fixed size, the self-attention mechanism is usually used to capture long-range dependencies between image features, whereas in the present application the self-attention mechanism is used to aggregate similar semantic information to obtain the instance feature set. This differs from the conventional use of self-attention, so introducing the self-attention mechanism into cross-modal retrieval for this purpose is not readily conceivable.
Fusing the feature maps of multiple scales can enhance the visual semantics. Moreover, since the fusion of multi-scale feature maps need not be considered when instance features are extracted with an object detector, it was non-trivial for the inventors to conceive of adding a multi-scale feature-map fusion step when obtaining instance features with the self-attention mechanism in order to enhance the visual semantics.
Realizing the fusion of the multi-scale feature maps by down-sampling reduces the number of instance features while fusing the feature maps, which lowers the difficulty of aligning the instance features with the words of the text in the semantic space and improves the accuracy of picture-text matching.
The present application further provides a server comprising a processor and a memory; the memory stores at least one instruction, which is loaded and executed by the processor to implement the picture-text matching method provided by each of the above method embodiments. It should be noted that the server may be the server provided in FIG. 7 below.
Referring to FIG. 7, which shows a structural schematic diagram of a server provided by an exemplary embodiment of the present application. Specifically, the server 700 includes a central processing unit (CPU) 701, a system memory 704 including a random access memory (RAM) 702 and a read-only memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The server 700 further includes a basic input/output system (I/O system) 706 that helps transfer information between the devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse or keyboard, for the user to input information. The display 708 and the input device 709 are both connected to the central processing unit 701 through an input/output controller 710 connected to the system bus 705. The basic input/output system 706 may also include the input/output controller 710 for receiving and processing input from multiple other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 710 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer-readable storage medium provide non-volatile storage for the server 700. That is, the mass storage device 707 may include a computer-readable storage medium (not shown) such as a hard disk or a CD-ROM drive.
Without loss of generality, the computer-readable storage medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media are not limited to the above. The system memory 704 and the mass storage device 707 may be collectively referred to as memory.
The memory stores one or more programs, which are configured to be executed by one or more central processing units 701 and contain instructions for implementing the above picture-text matching method; the central processing unit 701 executes the one or more programs to implement the picture-text matching method provided by each of the above method embodiments.
According to various embodiments of the present invention, the server 700 may also operate through a remote computer connected to a network such as the Internet. That is, the server 700 may be connected to a network 712 through a network interface unit 711 connected to the system bus 705; in other words, the network interface unit 711 may also be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs stored therein, the one or more programs including instructions for performing the steps performed by the server in the picture-text matching method provided by the embodiments of the present invention.
The embodiments of the present application further provide a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor 710 to implement the picture-text matching method described above.
The present application further provides a computer program product which, when run on a computer, causes the computer to execute the picture-text matching method provided by each of the above method embodiments.
One embodiment of the present application provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the picture-text matching method described above.
One embodiment of the present application provides a picture-text matching device comprising a processor and a memory; the memory stores at least one instruction, which is loaded and executed by the processor to implement the picture-text matching method described above.
It should be understood that when the picture-text matching apparatus provided by the above embodiments performs picture-text matching, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the picture-text matching apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the picture-text matching apparatus provided by the above embodiments belongs to the same concept as the picture-text matching method embodiments; its specific implementation process is detailed in the method embodiments and is not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above is not intended to limit the embodiments of the present application; any modification, equivalent substitution, or improvement made within the spirit and principles of the embodiments of the present application shall be included within the protection scope of the embodiments of the present application.

Claims (15)

1. A picture-text matching method, characterized in that the method comprises:
acquiring an image and a text to be matched;
generating a candidate instance feature set according to the image;
aggregating candidate instance features in the candidate instance feature set using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image;
encoding the text to obtain a text vector; and
calculating a matching degree between the image and the text according to the instance feature set and the text vector.
2. The method according to claim 1, characterized in that the aggregating candidate instance features in the candidate instance feature set using the self-attention mechanism to obtain the instance feature set comprises:
for an i-th candidate instance feature in the candidate instance feature set, calculating correlations between the i-th candidate instance feature and the remaining candidate instance features using the self-attention mechanism, and calculating an instance feature based on the i-th candidate instance feature according to the correlations.
3. The method according to claim 2, characterized in that, when the correlation is a weight, the calculating the correlations between the i-th candidate instance feature and the remaining candidate instance features using the self-attention mechanism and calculating the instance feature based on the i-th candidate instance feature according to the correlations comprises:
for the i-th candidate instance feature in the candidate instance feature set, calculating a cosine similarity between the i-th candidate instance feature and a j-th candidate instance feature, and calculating a weight of the j-th candidate instance feature according to the cosine similarity, the weight indicating a degree of attention paid to the j-th candidate instance feature when aggregating the other candidate instance features based on the i-th candidate instance feature, i and j being positive integers; and
multiplying each candidate instance feature in the candidate instance feature set by its corresponding weight, and adding the resulting products to obtain the instance feature based on the i-th candidate instance feature.
4. The method according to claim 2, characterized in that, when the correlation is a weight, the calculating the correlations between the i-th candidate instance feature and the remaining candidate instance features using the self-attention mechanism and calculating the instance feature based on the i-th candidate instance feature according to the correlations comprises:
mapping each candidate instance feature in the candidate instance feature set into a first feature space, a second feature space, and a third feature space respectively;
for the i-th candidate instance feature in the candidate instance feature set, calculating a weight of a j-th candidate instance feature according to the j-th candidate instance feature in the first feature space and the i-th candidate instance feature in the second feature space, the weight indicating a degree of attention paid to the j-th candidate instance feature when aggregating the other candidate instance features based on the i-th candidate instance feature, i and j being positive integers; and
multiplying each candidate instance feature in the third feature space by its corresponding weight, adding the resulting products, and performing residual fitting to obtain the instance feature based on the i-th candidate instance feature.
5. The method according to claim 1, characterized in that the generating the candidate instance feature set according to the image comprises:
inputting the image into a convolutional neural network, and obtaining a feature map output by the convolutional neural network; and
dividing the feature map, and forming candidate instance features obtained after the division into the candidate instance feature set.
6. The method according to claim 5, characterized in that, when the convolutional neural network outputs feature maps of n different scales and n ≥ 2, the method further comprises:
for an m-th feature map of the n feature maps, obtaining a scale of an (m+1)-th feature map, 1 ≤ m < n;
down-sampling, according to the scale of the (m+1)-th feature map, an instance feature set generated based on the m-th feature map, and merging the resulting instance feature set with an instance feature set generated based on the (m+1)-th feature map; and
determining the merged instance feature set as the instance feature set finally generated based on the (m+1)-th feature map.
7. The method according to any one of claims 1 to 6, characterized in that, when the text is a sentence, the calculating the matching degree between the image and the text according to the instance feature set and the text vector comprises:
for a p-th instance feature in the instance feature set, calculating a similarity between the p-th instance feature and a q-th vocabulary vector in the text vector, and calculating a weight of the q-th vocabulary vector according to the similarity, p and q being positive integers;
multiplying each vocabulary vector in the text vector by its corresponding weight, and adding the resulting products to obtain a text semantic vector based on the p-th instance feature;
calculating a cosine similarity between the p-th instance feature and the text semantic vector; and
calculating a global similarity between the image and the text according to cosine similarities between all instance features in the instance feature set and their corresponding text semantic vectors, the global similarity indicating the matching degree between the image and the text.
8. The method according to any one of claims 1 to 6, characterized in that, when the text is a sentence, the calculating the matching degree between the image and the text according to the instance feature set and the text vector comprises:
for a p-th vocabulary vector in the text vector, calculating a similarity between the p-th vocabulary vector and a q-th instance feature in the instance feature set, and calculating a weight of the q-th instance feature according to the similarity, p and q being positive integers;
multiplying each instance feature in the instance feature set by its corresponding weight, and adding the resulting products to obtain an image semantic vector based on the p-th vocabulary vector;
calculating a cosine similarity between the p-th vocabulary vector and the image semantic vector; and
calculating a global similarity between the text and the image according to cosine similarities between all vocabulary vectors in the text vector and their corresponding image semantic vectors, the global similarity indicating the matching degree between the image and the text.
9. The method according to claim 1, characterized in that, when the text is a sentence, the encoding the text to obtain the text vector comprises:
segmenting the sentence to obtain r words, r being a positive integer;
for a t-th word of the r words, encoding the t-th word according to a position of the t-th word in a vocabulary to obtain a t-th lexical feature vector, 1 ≤ t ≤ r;
inputting the t-th lexical feature vector into a bidirectional gated recurrent unit at a t-th time step, and determining a t-th vocabulary vector according to a bidirectional hidden state of the bidirectional gated recurrent unit at the t-th time step; and
forming the r vocabulary vectors thus obtained into the text vector.
10. A picture-text matching apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire an image and a text to be matched;
a generation module, configured to generate a candidate instance feature set according to the image acquired by the acquisition module;
an aggregation module, configured to aggregate candidate instance features in the candidate instance feature set generated by the generation module using a self-attention mechanism to obtain an instance feature set, each instance feature in the instance feature set corresponding to an object or region in the image;
an encoding module, configured to encode the text acquired by the acquisition module to obtain a text vector; and
a calculation module, configured to calculate a matching degree between the image and the text according to the instance feature set obtained by the aggregation module and the text vector obtained by the encoding module.
11. The apparatus according to claim 10, characterized in that the aggregation module is further configured to:
for an i-th candidate instance feature in the candidate instance feature set, calculate correlations between the i-th candidate instance feature and the remaining candidate instance features using the self-attention mechanism, and calculate an instance feature based on the i-th candidate instance feature according to the correlations.
12. The apparatus according to claim 11, characterized in that, when the correlation is a weight, the aggregation module is further configured to:
for the i-th candidate instance feature in the candidate instance feature set, calculate a cosine similarity between the i-th candidate instance feature and a j-th candidate instance feature, and calculate a weight of the j-th candidate instance feature according to the cosine similarity, the weight indicating a degree of attention paid to the j-th candidate instance feature when aggregating the other candidate instance features based on the i-th candidate instance feature, i and j being positive integers; and
multiply each candidate instance feature in the candidate instance feature set by its corresponding weight, and add the resulting products to obtain the instance feature based on the i-th candidate instance feature.
13. The apparatus according to claim 11, characterized in that, when the correlation is a weight, the aggregation module is further configured to:
map each candidate instance feature in the candidate instance feature set into a first feature space, a second feature space, and a third feature space respectively;
for the i-th candidate instance feature in the candidate instance feature set, calculate a weight of a j-th candidate instance feature according to the j-th candidate instance feature in the first feature space and the i-th candidate instance feature in the second feature space, the weight indicating a degree of attention paid to the j-th candidate instance feature when aggregating the other candidate instance features based on the i-th candidate instance feature, i and j being positive integers; and
multiply each candidate instance feature in the third feature space by its corresponding weight, add the resulting products, and perform residual fitting to obtain the instance feature based on the i-th candidate instance feature.
14. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the picture-text matching method according to any one of claims 1 to 9.
15. A picture-text matching device, characterized in that the picture-text matching device comprises a processor and a memory, the memory storing at least one instruction which is loaded and executed by the processor to implement the picture-text matching method according to any one of claims 1 to 9.
CN201910152063.1A 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment Active CN110147457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910152063.1A CN110147457B (en) 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910152063.1A CN110147457B (en) 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN110147457A true CN110147457A (en) 2019-08-20
CN110147457B CN110147457B (en) 2023-07-25

Family

ID=67588603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910152063.1A Active CN110147457B (en) 2019-02-28 2019-02-28 Image-text matching method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN110147457B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161170A1 (en) * 2009-07-17 2015-06-11 Google Inc. Image classification
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108491817A (en) * 2018-03-30 2018-09-04 国信优易数据有限公司 A kind of event detection model training method, device and event detecting method

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688508A (en) * 2019-09-03 2020-01-14 北京字节跳动网络技术有限公司 Image-text data expansion method and device and electronic equipment
CN112926616B (en) * 2019-12-06 2024-03-05 顺丰科技有限公司 Image matching method and device, electronic equipment and computer readable storage medium
CN112926616A (en) * 2019-12-06 2021-06-08 顺丰科技有限公司 Image matching method and device, electronic equipment and computer-readable storage medium
CN111046966A (en) * 2019-12-18 2020-04-21 江南大学 Image subtitle generating method based on measurement attention mechanism
CN111209899A (en) * 2019-12-31 2020-05-29 科大讯飞股份有限公司 Rescue material delivery method, system, device and storage medium
CN111209899B (en) * 2019-12-31 2023-06-02 科大讯飞股份有限公司 Rescue material delivery method, system, device and storage medium
EP4131030A4 (en) * 2020-03-25 2024-03-27 Beijing Wodong Tianjun Information Technology Co Ltd Method and apparatus for searching for target
CN111428801A (en) * 2020-03-30 2020-07-17 新疆大学 Image-text matching method for improving alternate updating of fusion layer and loss function
CN111428801B (en) * 2020-03-30 2022-09-27 新疆大学 Image-text matching method for improving alternate updating of fusion layer and loss function
CN111949824A (en) * 2020-07-08 2020-11-17 合肥工业大学 Visual question answering method and system based on semantic alignment and storage medium
CN111949824B (en) * 2020-07-08 2023-11-03 合肥工业大学 Visual question-answering method and system based on semantic alignment and storage medium
CN112650868B (en) * 2020-12-29 2023-01-20 苏州科达科技股份有限公司 Image retrieval method, device and storage medium
CN112650868A (en) * 2020-12-29 2021-04-13 苏州科达科技股份有限公司 Image retrieval method, device and storage medium
CN112861882A (en) * 2021-03-10 2021-05-28 齐鲁工业大学 Image-text matching method and system based on frequency self-adaption
CN112926671A (en) * 2021-03-12 2021-06-08 云知声智能科技股份有限公司 Image text matching method and device, electronic equipment and storage medium
CN112926671B (en) * 2021-03-12 2024-04-19 云知声智能科技股份有限公司 Image text matching method and device, electronic equipment and storage medium
CN112925935A (en) * 2021-04-13 2021-06-08 电子科技大学 Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN112925935B (en) * 2021-04-13 2022-05-06 电子科技大学 Image menu retrieval method based on intra-modality and inter-modality mixed fusion
WO2022242388A1 (en) * 2021-05-17 2022-11-24 河海大学 Dam defect image-text cross-modal retrieval method and model
CN113657425A (en) * 2021-06-28 2021-11-16 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN113657425B (en) * 2021-06-28 2023-07-04 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN113824989A (en) * 2021-07-13 2021-12-21 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113239237B (en) * 2021-07-13 2021-11-30 北京邮电大学 Cross-media big data searching method and device
CN113239237A (en) * 2021-07-13 2021-08-10 北京邮电大学 Cross-media big data searching method and device
CN113824989B (en) * 2021-07-13 2024-02-27 腾讯科技(深圳)有限公司 Video processing method, device and computer readable storage medium
CN113590854B (en) * 2021-09-29 2021-12-31 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and computer readable storage medium
CN113590854A (en) * 2021-09-29 2021-11-02 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and computer readable storage medium
CN113901330A (en) * 2021-12-09 2022-01-07 北京达佳互联信息技术有限公司 Video searching method and device, electronic equipment and storage medium
WO2023173547A1 (en) * 2022-03-16 2023-09-21 平安科技(深圳)有限公司 Text image matching method and apparatus, device, and storage medium
CN114913403A (en) * 2022-07-18 2022-08-16 南京信息工程大学 Visual question-answering method based on metric learning
CN115186775B (en) * 2022-09-13 2022-12-16 北京远鉴信息技术有限公司 Method and device for detecting matching degree of image description characters and electronic equipment
CN115186775A (en) * 2022-09-13 2022-10-14 北京远鉴信息技术有限公司 Method and device for detecting matching degree of image description characters and electronic equipment
CN116842418B (en) * 2023-05-31 2024-01-05 浙江中屹纺织机械科技有限公司 Intelligent water-jet loom and control system thereof
CN116842418A (en) * 2023-05-31 2023-10-03 浙江中屹纺织机械科技有限公司 Intelligent water-jet loom and control system thereof
CN116975318B (en) * 2023-08-03 2024-01-23 四川大学 Half-pairing image-text retrieval method based on cross-correlation mining
CN116975318A (en) * 2023-08-03 2023-10-31 四川大学 Half-pairing image-text retrieval method based on cross-correlation mining

Also Published As

Publication number Publication date
CN110147457B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN110147457A (en) Picture and text matching process, device, storage medium and equipment
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN111476294B (en) Zero sample image identification method and system based on generation countermeasure network
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
Varma et al. Snuba: Automating weak supervision to label training data
Li et al. Zero-shot object detection with textual descriptions
Zhang et al. DRr-Net: Dynamic re-read network for sentence semantic matching
Huang et al. Image and sentence matching via semantic concepts and order learning
Zhu et al. Object-difference drived graph convolutional networks for visual question answering
Shen et al. Local self-attention in transformer for visual question answering
Gao et al. A hierarchical recurrent approach to predict scene graphs from a visual-attention-oriented perspective
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
Zhu et al. Using deep learning based natural language processing techniques for clinical decision-making with EHRs
Yu et al. Sentence pair modeling based on semantic feature map for human interaction with IoT devices
Nie et al. Knowledge-aware named entity recognition with alleviating heterogeneity
Li et al. Graph convolutional network meta-learning with multi-granularity POS guidance for video captioning
Chen et al. Auxiliary learning with joint task and data scheduling
Li et al. A Bi-level representation learning model for medical visual question answering
Wang et al. Multi-scale interactive transformer for remote sensing cross-modal image-text retrieval
Yang et al. Adaptive syncretic attention for constrained image captioning
Feng et al. Ontology semantic integration based on convolutional neural network
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
Wang et al. Distant supervised relation extraction with position feature attention and selective bag attention
Han et al. Document-level relation extraction with relation correlations
Zan et al. S²QL: Retrieval augmented zero-shot question answering over knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant