CN108228686A - Method, apparatus and electronic device for implementing image-text matching - Google Patents

Method, apparatus and electronic device for implementing image-text matching

Info

Publication number
CN108228686A
CN108228686A (application number CN201710453664.7A)
Authority
CN
China
Prior art keywords
image
text
feature
word
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710453664.7A
Other languages
Chinese (zh)
Other versions
CN108228686B (en)
Inventor
李爽
肖桐
李鸿升
杨巍
王晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201710453664.7A priority Critical patent/CN108228686B/en
Publication of CN108228686A publication Critical patent/CN108228686A/en
Application granted granted Critical
Publication of CN108228686B publication Critical patent/CN108228686B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval of still image data; Database structures therefor; File system structures therefor
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 - Retrieval characterised by using metadata using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval of still image data; Database structures therefor; File system structures therefor
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422 - Global feature extraction for representing the structure of the pattern or shape of an object
    • G06V10/424 - Syntactic representation, e.g. by using alphabets or grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention disclose a method, apparatus, electronic device, and computer-readable medium for implementing image-text matching. The method for implementing image-text matching mainly includes: obtaining a group consisting of an image and a text; obtaining image features of the image using a first convolutional neural network, and obtaining each word feature in the text using a first recurrent neural network; performing semantic attention processing on the image features and each word feature to obtain semantic attention values; and calculating a matching degree between the image and the text according to the semantic attention values. Embodiments of the present invention improve the accuracy of image-text matching to a certain extent.

Description

Method, apparatus and electronic device for implementing image-text matching
Technical field
The present invention relates to computer vision technology, and in particular to a method for implementing image-text matching, a medium, an apparatus for implementing image-text matching, and an electronic device.
Background technology
Image-text matching technology can identify mutually matching images and texts according to image features and text features. Because it can be widely applied in fields such as visual question answering and image caption generation, image-text matching has become an important technology in the field of computer vision.
Existing image-text matching techniques generally include: extracting image features of an input image using a convolutional neural network and calculating the correlation between the image features of the input image and the text features of all texts; or extracting text features of an input text using a recurrent neural network and calculating the correlation between the text features of the input text and the image features of all images. For example, the Euclidean distance, Mahalanobis distance, or vector inner product between a text feature and an image feature is calculated to obtain their correlation. The text matching the input image, or the image matching the input text, is then determined according to the calculated correlation.
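As a minimal sketch of the correlation computation described above, the following compares an image feature against candidate text features with an inner product or (negated) Euclidean distance. The feature vectors and text labels here are illustrative assumptions, not the patent's actual embodiment:

```python
import numpy as np

def correlation(img_feat, txt_feat, metric="inner"):
    """Correlation between an image feature and a text feature."""
    if metric == "inner":       # vector inner product
        return float(np.dot(img_feat, txt_feat))
    if metric == "euclidean":   # negate so larger always means more correlated
        return -float(np.linalg.norm(img_feat - txt_feat))
    raise ValueError(metric)

img = np.array([1.0, 0.0, 1.0])
texts = {"a cat": np.array([0.9, 0.1, 0.8]),
         "a car": np.array([0.0, 1.0, 0.0])}
best = max(texts, key=lambda t: correlation(img, texts[t]))
```

Here `best` is the text whose feature is most correlated with the image feature, which is the selection rule the background section describes.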
Invention content
Embodiments of the present invention provide a technical solution for image-text matching.
According to one aspect of embodiments of the present invention, a method for implementing image-text matching is provided, including: obtaining a group consisting of an image and a text; obtaining image features of the image using a first convolutional neural network, and obtaining each word feature in the text using a first recurrent neural network; performing semantic attention processing on the image features and each word feature to obtain semantic attention values; and calculating a matching degree between the image and the text according to the semantic attention values.
In one embodiment of the present invention, the step of obtaining a group consisting of an image and a text includes: obtaining an input image and selecting any text from a text set, and using the input image and the selected text as the group; or obtaining an input text and selecting any image from an image set, and using the input text and the selected image as the group. The text set is formed by multiple texts obtained by screening and filtering the texts in a text library, and the image set is formed by multiple images obtained by screening and filtering the images in an image library.
In another embodiment of the present invention, the step of screening and filtering the texts in the text library includes: obtaining image features of the input image using a second convolutional neural network, and obtaining text features of each text in the text library using a second recurrent neural network; calculating the correlation between the image features of the input image and the text features of each text; and selecting multiple texts from the texts according to the ranking of the correlations, the selected texts serving as the text set.
In yet another embodiment of the present invention, the step of screening and filtering the images in the image library includes: obtaining text features of the input text using the second recurrent neural network, and obtaining image features of each image in the image library using the second convolutional neural network; calculating the correlation between the text features of the input text and the image features of each image; and selecting multiple images from the images according to the ranking of the correlations, the selected images forming the image set.
In yet another embodiment of the present invention, the method further includes a step of training the second convolutional neural network and the second recurrent neural network using image samples carrying identity labels and text samples carrying identity labels.
In yet another embodiment of the present invention, the training step includes: obtaining image features of an image sample carrying an identity label using the second convolutional neural network, and obtaining text features of a text sample carrying an identity label using the second recurrent neural network; calculating first matching degrees between the image features of the image sample and the text features of each text sample in a text feature set, and calculating second matching degrees between the text features of the text sample and the image features of each image sample in an image feature set; and updating the parameters of the second convolutional neural network and the second recurrent neural network according to a cross-entropy loss function over the first matching degrees and the second matching degrees.
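One plausible reading of the cross-entropy loss over matching degrees is a softmax over one sample's scores against the whole feature set, penalizing a low probability for the same-identity entry. This is a sketch under that assumption; the patent does not spell out the exact loss form:

```python
import numpy as np

def cross_entropy_over_matches(match_degrees, target_idx):
    """Softmax over one sample's matching degrees against a feature set,
    then the negative log-likelihood of the same-identity entry."""
    s = np.asarray(match_degrees, dtype=float)
    p = np.exp(s - s.max())     # numerically stable softmax
    p /= p.sum()
    return -np.log(p[target_idx])

# First matching degrees: an image sample scored against every text feature
# in the set; index 0 holds the text features with the same identity label.
good = cross_entropy_over_matches([2.0, 0.1, -1.0], target_idx=0)
bad = cross_entropy_over_matches([0.1, 2.0, -1.0], target_idx=0)
```

The loss is small when the same-identity pair scores highest and large otherwise, which is the gradient signal used to update both networks.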
In yet another embodiment of the present invention, in the image feature set, the image features of different image samples carrying the same identity label share the image feature storage space of that identity label; and/or, in the text feature set, the text features of different text samples carrying the same identity label share the text feature storage space of that identity label.
In yet another embodiment of the present invention, the method further includes: when it is determined that the image feature set does not contain the image features of the image sample carrying the identity label, adding the image features of the image sample carrying the identity label to the image feature set; and when it is determined that the text feature set does not contain the text features of the text sample carrying the identity label, adding the text features of the text sample carrying the identity label to the text feature set.
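The two paragraphs above describe an identity-keyed feature bank: one storage slot per identity label, filled on first sight and shared by all later samples with the same label. A minimal sketch, where the blending rule for a shared slot is an assumption (the patent only says the storage space is shared), with scalar features standing in for vectors:

```python
image_feature_set = {}

def store_feature(feature_set, identity, feat, momentum=0.5):
    """One storage slot per identity label; add on first sight, otherwise
    blend into the shared slot (the blending rule is an assumption)."""
    if identity not in feature_set:       # set does not yet contain this identity
        feature_set[identity] = feat      # add the sample's feature
    else:                                 # same identity label: share the slot
        feature_set[identity] = momentum * feature_set[identity] + (1 - momentum) * feat
    return feature_set[identity]

store_feature(image_feature_set, "id_7", 1.0)
store_feature(image_feature_set, "id_7", 3.0)   # a different sample, same identity
```

After both calls, `image_feature_set` still holds a single slot for `"id_7"`, which is what "sharing the storage space of the identity label" amounts to.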
In yet another embodiment of the present invention, the method further includes: using the trained second convolutional neural network as the initialization of the first convolutional neural network, and using the trained second recurrent neural network as the initialization of the first recurrent neural network.
In yet another embodiment of the present invention, the step of obtaining each word feature in the text using the first recurrent neural network includes: obtaining a one-hot vector of each word in the text; inputting the one-hot vector of each word into a fully connected layer for encoding; and sequentially inputting the encoding corresponding to each word into the first recurrent neural network, and obtaining each word feature according to the output of the first recurrent neural network.
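The one-hot, fully-connected-encoding, recurrent-network pipeline above can be sketched with a plain vanilla RNN. The vocabulary, dimensions, and random weights are all illustrative assumptions; the patent does not fix the recurrent cell type:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"a": 0, "person": 1, "in": 2, "red": 3}
V, E, H = len(vocab), 8, 16                   # vocab size, encoding dim, hidden dim

W_embed = rng.standard_normal((V, E)) * 0.1   # fully connected encoding layer
W_xh = rng.standard_normal((E, H)) * 0.1      # input-to-hidden weights
W_hh = rng.standard_normal((H, H)) * 0.1      # hidden-to-hidden weights

def word_features(sentence):
    """One-hot each word, encode it with the FC layer, run the RNN step by
    step, and return one feature per word (the hidden state at that step)."""
    h = np.zeros(H)
    feats = []
    for w in sentence.split():
        one_hot = np.zeros(V)
        one_hot[vocab[w]] = 1.0
        enc = one_hot @ W_embed               # FC encoding of the one-hot vector
        h = np.tanh(enc @ W_xh + h @ W_hh)    # recurrent update
        feats.append(h.copy())                # word feature from the RNN output
    return feats

feats = word_features("a person in red")
```

Each element of `feats` is the per-word feature that the later attention steps consume.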
In yet another embodiment of the present invention, before the step of performing semantic attention processing on the image features and each word feature to obtain the semantic attention values, the method further includes: revising the image features to obtain revised image features. The step of performing semantic attention processing on the image features and each word feature accordingly includes: performing semantic attention processing on the revised image features and each word feature to obtain the semantic attention values.
In yet another embodiment of the present invention, the step of revising the image features includes: obtaining spatial attention values according to the image features of each region in the image and each word feature; selecting a target region from the regions according to the spatial attention values; and obtaining the image features corresponding to the target region as the revised image features.
In yet another embodiment of the present invention, the regions in the image are of the same size, and each image region contains the same number of image features.
In yet another embodiment of the present invention, the step of obtaining spatial attention values according to the image features of each region in the image and each word feature includes: calculating, with a spatial attention model, the affinity between the image features of each region and each word feature, and normalizing each affinity; and calculating, from the normalized affinities and the image features of the regions, the image feature of the image with respect to each word.
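The spatial attention step above can be sketched as a dot-product affinity per region, a softmax normalization, and a weighted sum over region features. The affinity form and the example features are assumptions; the patent only requires an affinity plus normalization:

```python
import numpy as np

def spatial_attention(region_feats, word_feat):
    """Affinity of each region to one word, softmax-normalized, then a
    weighted sum giving the image feature attended for that word."""
    affinity = region_feats @ word_feat   # affinity per region
    a = np.exp(affinity - affinity.max())
    a /= a.sum()                          # normalized spatial attention values
    attended = a @ region_feats           # image feature with respect to this word
    return a, attended

regions = np.array([[1.0, 0.0],          # three equal-size regions, one feature each
                    [0.0, 1.0],
                    [0.5, 0.5]])
word = np.array([1.0, 0.0])              # feature of one word
attn, feat = spatial_attention(regions, word)
```

The region with the highest attention value is the one the revision step would select as the target region for this word.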
In yet another embodiment of the present invention, the step of performing semantic attention processing on the image features and each word feature includes: concatenating the image feature of the image with respect to each word with the feature of the corresponding word, and inputting each concatenation into a semantic attention model, the semantic attention model calculating the contribution of each word to the image under different concepts.
In yet another embodiment of the present invention, the step of calculating the matching degree between the image and the text according to the result of the semantic attention processing includes: determining, according to the contribution of each word to the image under the different concepts and the concatenations, the contributions under the different concepts of the image features with respect to each word; decoding, using a recurrent neural network, the contributions under the different concepts of the image features with respect to each word; and determining, using a fully connected layer and a binary classifier applied to the decoded information, the matching degree between the image and the text.
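The two paragraphs above can be sketched end to end: concatenate each per-word attended image feature with its word feature, score the concatenation per concept, aggregate, and apply a fully connected layer with a sigmoid binary classifier. The RNN decoding is replaced here with a simple mean over words, and the number of concepts, dimensions, and weights are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                       # feature dimension (illustrative)

def matching_degree(attended_feats, word_feats, W_sem, w_fc):
    """Concatenate each per-word attended image feature with its word feature,
    score the concatenations per concept, pool them (a stand-in for the RNN
    decoding), and apply an FC layer with a sigmoid binary classifier."""
    scores = []
    for img_f, word_f in zip(attended_feats, word_feats):
        concat = np.concatenate([img_f, word_f])  # concatenation of the two features
        scores.append(concat @ W_sem)             # contribution under each concept
    pooled = np.mean(scores, axis=0)              # simplified decoding step
    logit = pooled @ w_fc                         # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))           # binary match probability

W_sem = rng.standard_normal((2 * D, 3)) * 0.1     # 3 assumed "concepts"
w_fc = rng.standard_normal(3)
imgs = [rng.standard_normal(D) for _ in range(2)]   # per-word attended image features
words = [rng.standard_normal(D) for _ in range(2)]  # word features
p = matching_degree(imgs, words, W_sem, w_fc)
```

The returned probability plays the role of the matching degree between the image and the text.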
According to another aspect of embodiments of the present invention, an apparatus for implementing image-text matching is provided, including: an image-text obtaining module for obtaining a group consisting of an image and a text; a first feature obtaining module for obtaining image features of the image using a first convolutional neural network; a second feature obtaining module for obtaining each word feature in the text using a first recurrent neural network; a processing module for performing semantic attention processing on the image features and each word feature to obtain semantic attention values; and a judgment module for calculating the matching degree between the image and the text according to the semantic attention values.
According to another aspect of embodiments of the present invention, an electronic device is provided, including: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein, when the computer program is executed, the following instructions are run: an instruction for obtaining a group consisting of an image and a text; an instruction for obtaining image features of the image using a first convolutional neural network; an instruction for obtaining each word feature in the text using a first recurrent neural network; an instruction for performing semantic attention processing on the image features and each word feature to obtain semantic attention values; and an instruction for calculating the matching degree between the image and the text according to the semantic attention values.
In one embodiment of the present invention, the instruction for obtaining a group consisting of an image and a text includes: an instruction for obtaining an input image, selecting any text from a text set, and using the input image and the selected text as the group; or an instruction for obtaining an input text, selecting any image from an image set, and using the input text and the selected image as the group; wherein the text set is formed by multiple texts obtained by screening and filtering the texts in a text library, and the image set is formed by multiple images obtained by screening and filtering the images in an image library.
In another embodiment of the present invention, the device further includes an instruction for screening and filtering the texts in the text library, which specifically includes: an instruction for obtaining image features of the input image using a second convolutional neural network; an instruction for obtaining text features of each text in the text library using a second recurrent neural network; an instruction for calculating the correlation between the image features of the input image and the text features of each text; and an instruction for selecting multiple texts from the texts according to the ranking of the correlations, the selected texts serving as the text set.
In yet another embodiment of the present invention, the device further includes an instruction for screening and filtering the images in the image library, which specifically includes: an instruction for obtaining text features of the input text using the second recurrent neural network; an instruction for obtaining image features of each image in the image library using the second convolutional neural network; an instruction for calculating the correlation between the text features of the input text and the image features of each image; and an instruction for selecting multiple images from the images according to the ranking of the correlations, the selected images forming the image set.
In yet another embodiment of the present invention, the device further includes an instruction for training the second convolutional neural network and the second recurrent neural network using image samples carrying identity labels and text samples carrying identity labels.
In yet another embodiment of the present invention, the instruction for training the second convolutional neural network and the second recurrent neural network using the image samples carrying identity labels and the text samples carrying identity labels includes: an instruction for obtaining image features of an image sample carrying an identity label using the second convolutional neural network; an instruction for obtaining text features of a text sample carrying an identity label using the second recurrent neural network; an instruction for calculating first matching degrees between the image features of the image sample and the text features of each text sample in a text feature set; an instruction for calculating second matching degrees between the text features of the text sample and the image features of each image sample in an image feature set; and an instruction for updating the parameters of the second convolutional neural network and the second recurrent neural network according to a cross-entropy loss function over the first matching degrees and the second matching degrees.
In yet another embodiment of the present invention, in the image feature set, the image features of different image samples carrying the same identity label share the image feature storage space of that identity label; and/or, in the text feature set, the text features of different text samples carrying the same identity label share the text feature storage space of that identity label.
In yet another embodiment of the present invention, the device further includes: an instruction for adding the image features of the image sample carrying the identity label to the image feature set when it is determined that the image feature set does not contain them; and an instruction for adding the text features of the text sample carrying the identity label to the text feature set when it is determined that the text feature set does not contain them.
In yet another embodiment of the present invention, the device further includes: an instruction for using the trained second convolutional neural network as the initialization of the first convolutional neural network, and an instruction for using the trained second recurrent neural network as the initialization of the first recurrent neural network.
In yet another embodiment of the present invention, the instruction for obtaining each word feature in the text using the first recurrent neural network includes: an instruction for obtaining a one-hot vector of each word in the text; an instruction for inputting the one-hot vector of each word into a fully connected layer for encoding; and an instruction for sequentially inputting the encoding corresponding to each word into the first recurrent neural network and obtaining each word feature according to the output of the first recurrent neural network.
In yet another embodiment of the present invention, before the instruction for performing semantic attention processing on the image features and each word feature to obtain the semantic attention values, the device further includes: an instruction for revising the image features to obtain revised image features; and the instruction for performing semantic attention processing on the image features and each word feature is specifically: an instruction for performing semantic attention processing on the revised image features and each word feature to obtain the semantic attention values.
In yet another embodiment of the present invention, the instruction for revising the image features to obtain revised image features includes: an instruction for obtaining spatial attention values according to the image features of each region in the image and each word feature; an instruction for selecting a target region from the regions according to the spatial attention values; and an instruction for obtaining the image features corresponding to the target region as the revised image features.
In yet another embodiment of the present invention, the regions in the image are of the same size, and each image region contains the same number of image features.
In yet another embodiment of the present invention, the instruction for obtaining spatial attention values according to the image features of each region in the image and each word feature includes: an instruction for calculating, with a spatial attention model, the affinity between the image features of each region and each word feature and normalizing each affinity; and an instruction for calculating, from the normalized affinities and the image features of the regions, the image feature of the image with respect to each word.
In yet another embodiment of the present invention, the instruction for performing semantic attention processing on the image features and each word feature to obtain the semantic attention values includes: an instruction for concatenating the image feature of the image with respect to each word with the feature of the corresponding word, inputting each concatenation into a semantic attention model, and calculating, by the semantic attention model, the contribution of each word to the image under different concepts.
In yet another embodiment of the present invention, the instruction for calculating the matching degree between the image and the text according to the result of the semantic attention processing includes: an instruction for determining, according to the contribution of each word to the image under the different concepts and the concatenations, the contributions under the different concepts of the image features with respect to each word; an instruction for decoding, using a recurrent neural network, the contributions under the different concepts of the image features with respect to each word; and an instruction for determining, using a fully connected layer and a binary classifier applied to the decoded information, the matching degree between the image and the text.
According to another aspect of embodiments of the present invention, a computer storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the method embodiments of the present invention are performed, for example: obtaining a group consisting of an image and a text; obtaining image features of the image using a first convolutional neural network, and obtaining each word feature in the text using a first recurrent neural network; performing semantic attention processing on the image features and each word feature to obtain semantic attention values; and calculating the matching degree between the image and the text according to the semantic attention values.
With the method, apparatus, electronic device, and computer storage medium for implementing image-text matching provided by the above embodiments of the present invention, semantic attention is introduced into the image-text matching process, and semantic attention processing is performed on the image features of each region in the image and each word feature in the text. This allows each region of the image to be associated more accurately with each word in the text, thereby avoiding, to a certain extent, the image-text mismatches caused by considering only the overall correlation between the image features and the text features. The technical solution provided by embodiments of the present invention can therefore improve the accuracy of image-text matching.
The technical solution of the present invention is described in further detail below through the drawings and embodiments.
Description of the drawings
The drawings, which constitute a part of the specification, describe embodiments of the present invention and, together with the description, serve to explain the principles of the present invention.
The present invention can be understood more clearly from the following detailed description with reference to the drawings, in which:
Fig. 1 is a flowchart of one embodiment of the method of the present invention;
Fig. 2 is a schematic diagram of a specific example of training the second convolutional neural network and the second recurrent neural network according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a specific example of one embodiment of the method of the present invention;
Fig. 4 is a structural diagram of one embodiment of the apparatus of the present invention;
Fig. 5 is a schematic diagram of one embodiment of a computer-readable storage medium of the present invention;
Fig. 6 is a structural diagram of one embodiment of an electronic device of the present invention.
Specific embodiment
Various exemplary embodiments of the present invention are now described in detail with reference to the drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention. It should also be understood that, for ease of description, the sizes of the various parts shown in the drawings are not drawn according to actual proportional relationships. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, such techniques, methods, and devices should be considered part of the specification. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
An embodiment of the present invention can be applied to a computer system/server, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments including any of the above systems, and the like.
The computer system/server can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. In general, program modules can include routines, programs, target programs, components, logic, data structures, and so on, which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules can be located on local or remote computing system storage media including storage devices.
Various non-limiting embodiments of the present invention are described in detail below.
Embodiment one: a method for implementing image-text matching.
Fig. 1 is a flow chart of the method of embodiment one of the present invention. As shown in Fig. 1, the method of the present embodiment mainly includes step S100, step S110, step S130, and step S140; optionally, the method can also include step S120. Each step in Fig. 1 is described below.
S100: obtain a group of images and texts.
As an example, the group of images and texts in an embodiment of the present invention is a set of image-text objects whose matching status needs to be judged. Whether the group includes one image and multiple texts, multiple images and one text, or multiple images and multiple texts, it can form at least one pair of an image and a text, and an embodiment of the present invention judges whether each such image and text match. The image is usually a picture, and the text is usually natural language, such as multiple descriptive statements about the same person.
As an example, in an application scenario where a text matching an image needs to be obtained, the image is usually obtained as an input image, and the text is usually obtained by choosing a text from a text collection; for example, a text is randomly selected from the text collection, or a text is chosen from the text collection according to the sorting order of the texts.
As an example, in an application scenario where an image matching a text needs to be obtained, the text is usually obtained as input text, and the image is usually obtained by choosing an image from an image collection; for example, an image is randomly selected from the image collection, or an image is chosen from the image collection according to the sorting order of the images.
As an example, the above text collection can specifically be a collection formed by the texts that remain after all texts in a text library have been screened and filtered; the texts in the text library can be obtained by crawling a network, by manual collection, or by other means. Since an embodiment of the present invention screens and filters all texts in the text library in advance, a part of the texts that differ greatly from the input image (i.e., texts that match the input image poorly) can be removed before step S100 is performed. Therefore, by screening and filtering all texts in the text library, an embodiment of the present invention can reduce the processing amount of the subsequent spatial attention and semantic attention, and can thereby avoid phenomena such as the spatial attention and semantic attention processing consuming more resources and more time. The spatial attention processing in an embodiment of the present invention is mainly used to associate regions in the image with corresponding words. The semantic attention processing in an embodiment of the present invention is mainly used to adjust the structure of the text by learning the meaning of words, so that different descriptions of the same specific meaning receive essentially identical processing, thereby enhancing the robustness of image-text matching.
As an example, the above image collection can specifically be a collection formed by the images that remain after all images in an image library have been screened and filtered; the images in the image library can be obtained by crawling a network, by manual collection, or by other means. Since an embodiment of the present invention screens and filters all images in the image library in advance, a part of the images that differ greatly from the input text (i.e., images that match the input text poorly) can be removed before step S100 is performed. Therefore, by screening and filtering all images in the image library, an embodiment of the present invention can reduce the processing amount of the subsequent spatial attention and semantic attention, and can thereby avoid phenomena such as the spatial attention and semantic attention processing consuming more resources and more time.
As an example, a specific example of screening and filtering all texts in the text library in an embodiment of the present invention is as follows. First, the text collection is emptied, the input image is input into the second convolutional neural network, which extracts and outputs the image feature of the input image, and each text in the text library is separately input into the second recurrent neural network, which extracts and outputs the text feature of each text. Second, the correlation between the image feature of the input image and the text feature of each text is calculated; for example, the vector inner product of the image feature of the input image and the text feature of each text is calculated, and the inner product result is taken as the correlation between the two (the vector inner product may also be replaced by a Euclidean distance or a Mahalanobis distance, etc.). Afterwards, each calculated correlation is judged, and the texts whose correlation meets a first correlation requirement (i.e., the correlation requirement set for the text collection) are added to the text collection; for example, the texts in the text library are sorted in descending order of the calculated correlation, and the first N (e.g., N = 100) texts are added to the text collection as the screened texts. Through the above correlation calculation and the judgment of whether the first correlation requirement is met, at least one text can usually be selected (the text collection generally includes multiple texts), and the selected texts can be sorted according to their corresponding correlations. However, such an ordering is likely to be inaccurate; if the text matching the input image is determined according to such an ordering, the matching accuracy of the determined text and the input image is difficult to guarantee. The prior art determines the text matching the input image by using this correlation-based ordering of texts, which makes the image-text matching accuracy of the prior art poor. An embodiment of the present invention can adjust this ordering through the following steps S110-S140, thereby improving the accuracy of image-text matching.
As an example, a specific example of screening and filtering all images in the image library in an embodiment of the present invention is as follows. First, the image collection is emptied, the input text is input into the second recurrent neural network, which extracts and outputs the text feature of the input text, and each image in the image library is separately input into the second convolutional neural network, which extracts and outputs the image feature of each image. Second, the correlation between the text feature of the input text and the image feature of each image is calculated; for example, the vector inner product of the text feature of the input text and the image feature of each image is calculated, and the inner product result is taken as the correlation between the two (the vector inner product may also be replaced by a Euclidean distance or a Mahalanobis distance, etc.). Afterwards, each calculated correlation is judged, and the images whose correlation meets a second correlation requirement (i.e., the correlation requirement set for the image collection; the first correlation requirement and the second correlation requirement may be the same or different) are added to the image collection; for example, the images in the image library are sorted in descending order of the calculated correlation, and the first N images are added to the image collection as the screened images. Through the above correlation calculation and the judgment of whether the second correlation requirement is met, at least one image can usually be selected (the image collection generally includes multiple images), and the selected images can be sorted according to their corresponding correlations. However, such an ordering is likely to be inaccurate; if the image matching the input text is determined according to such an ordering, the matching accuracy of the determined image and the input text is difficult to guarantee. The prior art determines the image matching the input text by using this correlation-based ordering of images, which makes the image-text matching accuracy of the prior art poor. An embodiment of the present invention can adjust this ordering through the following steps S110-S140, thereby improving the accuracy of image-text matching.
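The correlation-based pre-filtering described above, for the text library and, symmetrically, for the image library, can be sketched in a few lines. This is only a minimal illustration with toy random features; the function name, the feature dimension, and N = 3 are assumptions for the sketch, not the patent's implementation.

```python
import numpy as np

def top_n_by_correlation(query_feat, candidate_feats, n):
    """Rank candidates by inner-product correlation with the query and
    keep the top n, i.e., the screening-and-filtering step above."""
    scores = candidate_feats @ query_feat   # one inner product per candidate
    order = np.argsort(-scores)             # indices in descending correlation
    return order[:n], scores[order[:n]]

# Toy illustration: one image feature, five candidate text features.
rng = np.random.default_rng(0)
image_feat = rng.standard_normal(8)
text_feats = rng.standard_normal((5, 8))
keep, kept_scores = top_n_by_correlation(image_feat, text_feats, n=3)
```

Swapping the roles of the query and the candidates gives the image-library variant; replacing the inner product with a negated Euclidean or Mahalanobis distance fits the alternatives mentioned in the text.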
It should be particularly noted that, if the present invention has pre-stored the text feature of each text in the text library (for example, each text and its text feature are stored in the text library), then an embodiment of the present invention may skip obtaining the text features with the second recurrent neural network during text screening and directly use the pre-stored text features. Similarly, if the present invention has pre-stored the image feature of each image in the image library (for example, each image and its image feature are stored in the image library), then an embodiment of the present invention may skip obtaining the image features with the second convolutional neural network during image screening and directly use the pre-stored image features.
As an example, the second convolutional neural network and the second recurrent neural network in an embodiment of the present invention are trained neural networks. An embodiment of the present invention can train the second convolutional neural network and the second recurrent neural network using image samples carrying individual markings and text samples carrying individual markings. A specific example of training the second convolutional neural network and the second recurrent neural network in an embodiment of the present invention is shown in Fig. 2; the training method of the two networks is described below with reference to Fig. 2 and steps a to h.
In Fig. 2, an image feature set and a text feature set are provided in advance, and both can be set to empty at initialization.
Step a: obtain an image sample carrying an individual marking (which may be called an input image sample) and a text sample carrying an individual marking (which may be called an input text sample). The individual marking is mainly used to characterize a unique individual, and the individual markings carried by image samples and text samples are typically annotated manually in advance. In the manual annotation process, the image samples and text samples of the same individual should be given the same individual marking; for example, pictures of the same target taken from slightly different shooting angles (i.e., image samples) should be given the same unique individual marking, and different textual descriptions of the same target (i.e., text samples) should be given the same unique individual marking. In addition, if a textual description describes a picture, the textual description and the picture should be given the same unique individual marking, and so on.
In Fig. 2, the picture in the upper-left corner is the obtained image sample, whose individual marking is 2, and the textual description in the dotted box in the lower-right corner (i.e., "The model wears a bright orange dress. She ...") is the obtained text sample, whose individual marking is also 2; that is, this obtained image sample and text sample have the same individual marking. However, the image sample and text sample obtained in an embodiment of the present invention may also have completely different individual markings.
Step b: input the obtained image sample into the second convolutional neural network (i.e., the Visual CNN in Fig. 2, a Visual Convolutional Neural Network), which extracts the image feature of the image sample (i.e., the Visual Feature in Fig. 2).
Step c: input the obtained text sample into the second recurrent neural network (i.e., the LSTM network in Fig. 2; a Long Short-Term Memory network is a kind of time-recurrent neural network), which extracts the text feature of the text sample (i.e., the Textual Feature in Fig. 2). Note that an embodiment of the present invention does not limit the order in which step b and step c are performed.
Step d: calculate the matching degree between the image feature of the image sample obtained in step b and the text feature corresponding to each individual marking in the text feature set. For example, an embodiment of the present invention can obtain each matching degree on the basis of the Euclidean distance, the Mahalanobis distance, or the vector inner product between the image feature of the image sample obtained in step b and the text feature corresponding to each individual marking in the text feature set. As a specific example, the following formula (1) can be used to calculate the matching degree between the image feature of the image sample obtained in step b and the text feature corresponding to each individual marking in the text feature set:

P(i|v) = exp(v^T S_i / σ_v) / Σ_{j=1}^{N} exp(v^T S_j / σ_v)   (1)

In the above formula (1), P(i|v) represents the probability that the image feature of the input image sample v matches the text feature corresponding to the i-th individual marking among all text features S (i.e., the text feature set S); this is the matching degree between the two, and, to distinguish it from the matching degree in the following step e, the matching degree calculated in step d is called the first matching degree. S represents the text features of the text samples of all individual markings, v represents the input image sample, v^T S_i represents the correlation between the image feature of the input image sample v and the text feature corresponding to the i-th individual marking in all text features S, v^T S_j represents the correlation between the image feature of the input image sample v and the text feature corresponding to the j-th individual marking in all text features S, N represents the total quantity of individual markings corresponding to all text features in the text feature set, σ_v represents a first temperature hyperparameter for controlling the probability distribution, exp(*) represents the exponential operation for *, and T represents matrix transposition.
Step e: calculate the matching degree between the text feature of the text sample obtained in step c and the image feature corresponding to each individual marking in the image feature set. For example, an embodiment of the present invention can obtain each matching degree on the basis of the Euclidean distance, the Mahalanobis distance, or the vector inner product between the text feature of the text sample obtained in step c and the image feature corresponding to each individual marking in the image feature set. As a specific example, the following formula (2) can be used to calculate the matching degree between the text feature of the text sample obtained in step c and the image feature corresponding to each individual marking in the image feature set:

P(k|s) = exp(s^T V_k / σ_s) / Σ_{j=1}^{N} exp(s^T V_j / σ_s)   (2)

In the above formula (2), P(k|s) represents the probability that the text feature of the input text sample s matches the image feature corresponding to the k-th individual marking among all image features V (i.e., the image feature set V); this is the matching degree between the two, and, to distinguish it from the matching degree in the above step d, the matching degree calculated in step e is called the second matching degree. V represents the image features of the image samples of all individual markings, s^T V_k represents the correlation between the text feature of the input text sample s and the image feature corresponding to the k-th individual marking in all image features V, σ_s represents a second temperature hyperparameter for controlling the probability distribution, s^T V_j represents the correlation between the text feature of the input text sample s and the image feature corresponding to the j-th individual marking in all image features V, N represents the total quantity of individual markings corresponding to all image features in the image feature set, exp(*) represents the exponential operation for *, and T represents matrix transposition.
Step f: update the parameters of the second convolutional neural network and the second recurrent neural network according to the cross-entropy loss function of the first matching degree and the second matching degree obtained by the above calculation.

As a specific example, the following formula (3) can be used to represent the cross-entropy loss function of the first matching degree and the second matching degree (i.e., the cross-modal cross-entropy loss function); further, formula (4) can be used to update the parameters of the second convolutional neural network and formula (5) can be used to update the parameters of the second recurrent neural network, i.e., the two networks are updated by gradient descent on the loss of formula (3) with respect to their respective parameters:

L = -log P(t_v | v) - log P(t_s | s)   (3)

where P(t_v | v) = exp(v^T S_{t_v} / σ_v) / Σ_{j=1}^{N} exp(v^T S_j / σ_v) and P(t_s | s) = exp(s^T V_{t_s} / σ_s) / Σ_{j=1}^{N} exp(s^T V_j / σ_s).

In the above formula (3), formula (4), and formula (5), t_s represents the individual marking of the input text sample s, t_v represents the individual marking of the input image sample v, P(t_v | v) represents the probability that the image feature of the input image sample v matches the text feature corresponding to the individual marking t_v in all text features S (i.e., the text feature set S), P(t_s | s) represents the probability that the text feature of the input text sample s matches the image feature corresponding to the individual marking t_s in all image features V (i.e., the image feature set V), S_{t_v} represents the text feature corresponding to the individual marking t_v in all text features S, V_{t_s} represents the image feature corresponding to the individual marking t_s in all image features V, S_j represents the text feature corresponding to the individual marking j in all text features, V_j represents the image feature corresponding to the individual marking j in all image features, σ_v represents the first temperature hyperparameter for controlling the probability distribution, σ_s represents the second temperature hyperparameter for controlling the probability distribution, and N represents the total quantity of individual markings in the image feature set or the text feature set.
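As a hedged sketch, formulas (1) to (3) amount to two softmax classifications over the stored feature sets followed by a negative log-likelihood. The toy buffers, dimensions, and random features below are assumptions for illustration only, not the patent's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cmce_loss(v, s, t_v, t_s, S_buf, V_buf, sigma_v=1.0, sigma_s=1.0):
    """Cross-modal cross-entropy loss in the spirit of formula (3): the image
    feature v should match the stored text feature of its individual marking
    t_v, and the text feature s should match the stored image feature of t_s."""
    p_img2txt = softmax(S_buf @ v / sigma_v)   # formula (1): first matching degrees
    p_txt2img = softmax(V_buf @ s / sigma_s)   # formula (2): second matching degrees
    return -np.log(p_img2txt[t_v]) - np.log(p_txt2img[t_s])

# Toy feature sets: one stored feature per individual marking (N = 4 identities).
rng = np.random.default_rng(1)
S_buf = rng.standard_normal((4, 16))     # text feature set
V_buf = rng.standard_normal((4, 16))     # image feature set
v, s = 0.9 * S_buf[2], 0.9 * V_buf[2]    # sample features close to identity 2
loss_match = cmce_loss(v, s, 2, 2, S_buf, V_buf)
loss_mismatch = cmce_loss(v, s, 0, 1, S_buf, V_buf)
```

Minimizing this loss pulls each sample's feature toward the stored feature of its own individual marking, which matches the intent of step f; steps g and h then refresh the stored features.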
In Fig. 2, the arrow marked ① pointing to the image feature and the arrow marked ① pointing to the text feature represent the calculation of the cross-entropy loss function.
Step g: if the individual marking of the input image sample obtained in the above step a does not belong to the individual markings in the image feature set, add the image feature of the input image sample obtained in step b to the image feature set (for example, fill the image feature storage space corresponding to that individual marking in the image feature set with the image feature of the input image sample); otherwise, determine the image feature storage space in the image feature set corresponding to the individual marking of the input image sample, and update the content of that image feature storage space with the image feature of the input image sample. The arrow marked ② pointing to the image feature in Fig. 2 represents step g.
Step h: if the individual marking of the input text sample obtained in the above step a does not belong to the individual markings in the text feature set, add the text feature of the input text sample obtained in step c to the text feature set (for example, fill the text feature storage space corresponding to that individual marking in the text feature set with the text feature of the input text sample); otherwise, determine the text feature storage space in the text feature set corresponding to the individual marking of the input text sample, and update the content of that text feature storage space with the text feature of the input text sample. The arrow marked ② pointing to the text feature in Fig. 2 represents step h.
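Steps g and h apply the same insert-or-update rule to the two feature sets. As a sketch, using a dictionary keyed by individual marking (an assumed simplification of the fixed storage spaces described above) collapses the two branches into one assignment:

```python
def update_feature_set(feature_set, marking, feature):
    """Step g / step h: if the individual marking is new, fill its slot;
    otherwise overwrite the stored feature with the fresh one."""
    feature_set[marking] = feature   # insert-or-overwrite covers both branches
    return feature_set

image_feature_set = {}
update_feature_set(image_feature_set, 2, [0.1, 0.2])   # marking 2 absent: fill
update_feature_set(image_feature_set, 2, [0.3, 0.4])   # marking 2 present: update
```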
After the second convolutional neural network and the second recurrent neural network have been trained using the above steps a to h, the trained second convolutional neural network can be used as the initialized first convolutional neural network (alternatively, another trained convolutional neural network can be used as the initialized first convolutional neural network), and the trained second recurrent neural network can be used as the initialized first recurrent neural network (alternatively, another trained recurrent neural network can be used as the initialized first recurrent neural network); after initialization, the first convolutional neural network and the first recurrent neural network can continue to be trained.
S110: obtain the image feature of each region in the image using the first convolutional neural network, and obtain each word feature in the text using the first recurrent neural network.
As an example, an embodiment of the present invention can input the image (such as the picture in the lower-left corner of Fig. 3) into the first convolutional neural network (the Visual CNN in Fig. 3), which extracts the image feature of each region in the image. For example, the extracted region image features can be expressed as 7 × 7 × 512; that is, the image features are extracted from 49 image regions of identical size, and each image region is expressed as a 512-dimensional image feature.
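For concreteness, a 7 × 7 × 512 feature map can be viewed as 49 region vectors of 512 dimensions each. The numpy reshape below only illustrates that bookkeeping under assumed array conventions, not the CNN itself:

```python
import numpy as np

# A stand-in 7x7x512 convolutional feature map (values are arbitrary).
feature_map = np.arange(7 * 7 * 512, dtype=np.float32).reshape(7, 7, 512)

# Flatten the two spatial axes: one row per region, one 512-d feature each.
region_feats = feature_map.reshape(-1, 512)
```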
As an example, the word features in an embodiment of the present invention are different from the text feature of the above text sample: the text feature of a text sample is a vector for the entire text sample, whereas a word feature is a vector for a word. The process by which an embodiment of the present invention obtains each word feature in the text using the first recurrent neural network is as follows. First, perform word segmentation on the text, and map each word obtained after segmentation into a dictionary; for example, after performing word segmentation on "The model wears a bright orange dress. She ..." in Fig. 2, the words "The", "model", "wears", "a", "bright", "orange", "dress", "She", and so on are obtained, and each word is mapped into the dictionary to obtain the one-hot vector of each word. Then, the one-hot vectors representing the positions of the words in the dictionary are separately input into a fully connected layer; for example, in Fig. 3, the one-hot vectors of the words "The", "model", "wears", ..., and "dress" are all input into word-fc (the fully connected layer for words), which encodes the one-hot vector of each word. Afterwards, the encodings corresponding to the words are sequentially input into the first recurrent neural network (the Encoder LSTM in Fig. 3), which can memorize the encoding of each input word (e.g., cache the encodings of the input words) and can learn the associations between different words; the first recurrent neural network outputs each word feature (which may also be called a word feature vector). An embodiment of the present invention may use existing methods to implement the word mapping that obtains the one-hot vectors, the encoding of the one-hot vectors, and the obtaining of the word features; the specific implementation process is not described in detail here. A word feature in an embodiment of the present invention is a feature that represents a word as a whole, whereas the text feature represents the text as a whole; it follows that the word features in an embodiment of the present invention have a finer granularity than the text feature.
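The segmentation-and-dictionary-lookup step above can be sketched as follows. The toy dictionary and the "<unk>" fallback are assumptions for illustration, since the text does not fix a vocabulary; each returned index is the position of the 1 in the word's one-hot vector.

```python
# Hypothetical toy dictionary; a real system would use a large vocabulary.
vocab = {"<unk>": 0, "the": 1, "model": 2, "wears": 3, "a": 4,
         "bright": 5, "orange": 6, "dress": 7}

def to_one_hot_indices(sentence):
    """Segment the text and map every word to its dictionary position."""
    words = sentence.lower().rstrip(".").split()
    return [vocab.get(w, vocab["<unk>"]) for w in words]

indices = to_one_hot_indices("The model wears a bright orange dress.")
```

In the described pipeline, these indices (as one-hot vectors) would then pass through the word-fc layer and the Encoder LSTM to produce the word feature vectors.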
As an example, an embodiment of the present invention can use H = {h_1, ..., h_T} to represent the word feature vectors in the text, wherein h_1 represents the hidden-layer state of the Encoder LSTM at moment 1 (i.e., the feature vector of the word in the text corresponding to moment 1), h_T represents the hidden-layer state of the Encoder LSTM at moment T (i.e., the feature vector of the word in the text corresponding to moment T), H ∈ R^{D_H × T}, D_H represents the dimension of the hidden-layer state, and T in D_H × T represents the word length of the text.
S120: correct the image feature to obtain a corrected image feature.
As an example, an embodiment of the present invention can obtain spatial attention values according to the image feature of each region in the image and each word feature, then select a target region among the regions according to the spatial attention values, and take the image feature corresponding to the target region as the corrected image feature. Specifically, one example of obtaining the spatial attention values in an embodiment of the present invention can be: calculate the affinity between the image feature of each region and each word feature using a spatial attention model, normalize each affinity, and then calculate the image feature of the image for each word according to the normalized affinities and the image features of the regions. The affinity between the image feature of a region and the feature of a word may also be called a degree of correlation, a degree of association, or a degree of closeness, etc. As a specific example, the affinity between the image feature of a cap region and the feature of the word "cap" would usually be higher than the affinity between the image feature of the cap region and the feature of the word "glasses".
As a specific example of calculating the affinity between the image feature of each region and each word feature using the spatial attention model, each word feature output by the Encoder LSTM in Fig. 3 and the image feature of each region output by the Visual CNN in Fig. 3 are taken as the inputs of the spatial attention model (i.e., the Spatial Attention Module in Fig. 3); the spatial attention model can calculate the affinity between the image feature of each region in the image and each word feature using the following formula (6), and normalize each affinity using the following formula (7):
e_{t,k} = W_P { tanh[ W_I i_k + (W_H h_t + b_H) ] } + b_P   (6)

a_{t,k} = exp(e_{t,k}) / Σ_{j=1}^{L} exp(e_{t,j})   (7)

In the above formula (6) and formula (7), W_I, W_H, and W_P represent matrix parameters (with W_P ∈ R^{1×K}), b_H and b_P represent offset parameters, e_{t,k} is an intermediate variable representing the affinity between the feature of the word at moment t and the image feature of the k-th region in the image, tanh[*] represents the hyperbolic tangent function for *, i_k represents the image feature of the k-th region in the image, h_t represents the hidden-layer state of the Encoder LSTM at moment t (namely, the feature vector of the word in the text corresponding to moment t), exp(*) represents the exponential function for *, a_{t,k} represents the normalized affinity between the feature of the word at moment t and the image feature of the k-th region in the image, and L represents the total quantity of regions included in the image.
An embodiment of the present invention can regard the a_{t,k} output by the spatial attention model as, for the word at moment t, the weight value assigned to the image feature of each region. A specific example of calculating the image feature of the image for each word according to the normalized affinities and the image features of the regions is shown in the following formula (8):

ṽ_t = Σ_{k=1}^{L} a_{t,k} i_k   (8)

In the above formula (8), ṽ_t represents the image feature of the image for the word at moment t (if the image features of the regions obtained in step S110 are expressed as 7 × 7 × 512-dimensional image features, then the ṽ_t in this step is a 512-dimensional image feature), a_{t,k} represents the normalized affinity between the feature of the word at moment t and the image feature of the k-th region in the image, i_k represents the image feature of the k-th region in the image, and L represents the total quantity of image regions included in the image.
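Formulas (6) to (8) together form one attention pass: score each region against the word feature, normalize the scores with a softmax, then pool the region features by those weights. The numpy sketch below uses assumed dimensions (K = 64 hidden units) and random parameters purely for illustration:

```python
import numpy as np

def spatial_attention(h_t, I, W_I, W_H, W_P, b_H, b_P):
    """Formulas (6)-(8): score every region against the word feature h_t,
    normalize over the L regions, and pool the region features."""
    e = np.array([W_P @ np.tanh(W_I @ i_k + (W_H @ h_t + b_H)) + b_P
                  for i_k in I])                  # formula (6): one score per region
    a = np.exp(e - e.max()); a /= a.sum()         # formula (7): softmax over regions
    return a, a @ I                               # formula (8): attended image feature

rng = np.random.default_rng(2)
L_regions, D_img, D_hid, K = 49, 512, 256, 64     # assumed dimensions
I = rng.standard_normal((L_regions, D_img))       # region features i_k
h_t = rng.standard_normal(D_hid)                  # word feature at moment t
W_I = rng.standard_normal((K, D_img)) * 0.01
W_H = rng.standard_normal((K, D_hid)) * 0.01
W_P = rng.standard_normal(K) * 0.01
b_H, b_P = np.zeros(K), 0.0
a, v_t = spatial_attention(h_t, I, W_I, W_H, W_P, b_H, b_P)
```

The weights a sum to 1 across the 49 regions, and v_t is the 512-dimensional per-word image feature described above.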
Embodiments of the present invention may adopt an existing spatial attention model; the specific implementation of the spatial attention model is not described in detail herein.
S130: performing semantic attention processing on the image feature and each word feature to obtain semantic attention values. In the case where the method of the embodiment of the present invention includes step S120, step S130 may specifically be: performing semantic attention processing on the corrected image feature and each word feature to obtain semantic attention values.
As an example, embodiments of the present invention may cascade the image feature of the image with respect to each word with the feature of the corresponding word, and input each cascade result into the semantic attention model, so that the semantic attention model calculates the contribution of each word to the image on different concepts (which may also be referred to as the latent semantic attention of each word to the image on different concepts). Concepts in embodiments of the present invention may include various forms such as colors, clothing, and prepositions.
As an example, the cascade result in embodiments of the present invention may be represented as x_t = [î_t, h_t], where t = {1, ..., T}, T denotes time T, and x_1, x_2, x_t and x_T in Fig. 3 are the above x_t.
As an example, the semantic attention model may calculate the contribution of each word to the image on different concepts, i.e., the semantic attention model output a'_{m,t}, by the following formula (9) and formula (10):
e'_{m,t} = f(c_{m-1}, x_t)    (9)

a'_{m,t} = exp(e'_{m,t}) / Σ_{j=1}^{T} exp(e'_{m,j})    (10)
In the above formulas (9) and (10), f(·) is a weighting function for determining importance, i.e., this function can measure the importance of the t-th word with respect to concept m (m may also be referred to as decoding time step m); f(·) may be a two-layer convolutional neural network built for this purpose; c_{m-1} denotes the hidden state of the decoder LSTM at decoding time m-1; x_t denotes the cascade result of the image feature of the image with respect to the word at time t and the feature of the word at time t; a'_{m,t} denotes the contribution of the word at time t to the image on concept m; e'_{m,t} and e'_{m,j} are intermediate variables; and T denotes time T. In embodiments of the present invention, the a'_{m,t} output by the semantic attention model may be regarded as the weight value assigned to the word in the cascade result on concept m.
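A sketch of formulas (9) and (10) follows. Here f is modelled as a small two-layer fully connected network over the concatenation of c_{m-1} and x_t; this is an assumption, since the patent only states that f may be a purpose-built two-layer network:

```python
import numpy as np

def semantic_attention_weights(X, c_prev, W1, b1, W2, b2):
    """Formulas (9)-(10): score each of the T cascaded features x_t for
    concept m given the decoder state c_{m-1}, then softmax over words.
    The two-layer form of f and all shapes are assumptions."""
    e = np.empty(len(X))
    for t, x_t in enumerate(X):
        hidden = np.tanh(W1 @ np.concatenate([c_prev, x_t]) + b1)
        e[t] = float(W2 @ hidden + b2)      # e'_{m,t} = f(c_{m-1}, x_t)
    a = np.exp(e - e.max())                 # a'_{m,t} via softmax (formula (10))
    return a / a.sum()
```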
Embodiments of the present invention may adopt an existing semantic attention model; the specific implementation of the semantic attention model is not described in detail herein.
S140: calculating the matching degree of the image and the text according to the semantic attention values.
As an example, embodiments of the present invention may determine, according to the contribution of each word to the image on different concepts and the above cascade results, the contribution of the image feature with respect to each word on different concepts; then decode, using a recurrent neural network, the contribution of the words to the image on different concepts; and thereafter process the decoded result using a fully connected layer and a binary classifier, so as to determine the matching degree of the image and the text.
As an example, a specific example in which embodiments of the present invention determine the contribution of the image feature with respect to each word on different concepts is: performing a weighted summation on the cascade results obtained in the above step S130 and the results of the semantic attention processing using the following formula (11):

x̂_m = Σ_{j=1}^{T} a'_{m,j}·x_j    (11)
In the above formula (11), x̂_m denotes the contribution of the image feature with respect to each word on concept m, a'_{m,j} denotes the contribution of the word at time j to the image on concept m, and x_j denotes the cascade result of the image feature of the image with respect to the word at time j and the feature of the word at time j, which may be expressed as x_j = [î_j, h_j], where î_j denotes the image feature of the image with respect to the word at time j and h_j denotes the feature of the word at time j.
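Formula (11) can be computed for all M concepts at once as a matrix product; a sketch:

```python
import numpy as np

def concept_features(A, X):
    """Formula (11) vectorised over concepts: row m of the result is
    x_hat_m = sum_j A[m, j] * X[j], with A the (M, T) semantic-attention
    weights and X the (T, D) cascade results."""
    return np.asarray(A) @ np.asarray(X)
```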
As an example, embodiments of the present invention may decode the x̂_m obtained by the above calculation using a recurrent neural network (such as the Decoder LSTM in Fig. 3), perform similarity calculation on the decoded result using a fully connected layer and a binary classifier, and then calculate the matching degree of the image and the text in step S100 according to the similarity calculation result.
As an example, embodiments of the present invention may re-rank the texts in the text collection or the images in the image collection in descending order of matching degree, so that the text matching the input image, or the image matching the input text, can be determined from the re-ranked text collection or image collection. Since embodiments of the present invention introduce the spatial attention model and the semantic attention model into the image-text matching process, regions in the image can be effectively associated with words in the text, which is conducive to improving the accuracy of image-text matching.
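The re-ranking step described above amounts to sorting candidates by matching degree, highest first; a minimal sketch with hypothetical names:

```python
def rerank_by_matching_degree(candidates, scores):
    """Sketch of the retrieval step: reorder the text (or image) collection
    by matching degree in descending order. Function and argument names
    are illustrative, not the patent's API."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [candidates[i] for i in order]
```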
In addition, in the process of training the first convolutional neural network (the Visual CNN in Fig. 3) and the first recurrent neural network (the Encoder LSTM in Fig. 3), embodiments of the present invention may also use a binary cross-entropy loss function to supervise the training of the first convolutional neural network and the first recurrent neural network; the binary cross-entropy loss function may be as shown in the following formula (12):

Loss = −(1/N') Σ_{i=1}^{N'} [y_i·log(C_i) + (1 − y_i)·log(1 − C_i)]    (12)
In the above formula (12), N' denotes the number of text-image pairs used for training, C_i denotes the matching probability calculated for the i-th text-image pair, y_i denotes the target label, y_i = 1 indicates that the text and the image belong to the same individual, and y_i = 0 indicates that the text and the image belong to different individuals.
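A sketch of this loss in plain Python (assuming the standard mean binary cross-entropy form, with C_i as predicted match probabilities):

```python
import math

def binary_cross_entropy(C, y):
    """Sketch of formula (12): mean binary cross-entropy over the N'
    training text-image pairs; C_i is the predicted match probability,
    y_i = 1 for a same-individual pair and 0 otherwise."""
    n = len(C)
    return -sum(yi * math.log(ci) + (1 - yi) * math.log(1 - ci)
                for ci, yi in zip(C, y)) / n
```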
Embodiment two: an apparatus for implementing image-text matching.
Fig. 4 is a structural diagram of an embodiment of the apparatus of the present invention. The apparatus of this embodiment may be used to implement each of the above method embodiments of the present invention. As shown in Fig. 4, the apparatus of this embodiment includes: an image-text obtaining module 400, a first feature obtaining module 410, a second feature obtaining module 420, a processing module 440 and a judgment module 450. Optionally, the apparatus may further include: a correction processing module 430, a first screening-filtering module (not shown in Fig. 4), a second screening-filtering module (not shown in Fig. 4), a first training module (not shown in Fig. 4) and a second training module (not shown in Fig. 4).
The image-text obtaining module 400 is mainly configured to obtain one group of an image and a text. Specifically, the image-text obtaining module 400 may obtain an input image and select any one text from the text collection (for example, sequentially or randomly select a text), and take the input image and the selected text as one group of image and text; the image-text obtaining module 400 may also obtain an input text and select one image from the image collection (for example, sequentially or randomly select an image), and take the selected image and the input text as one group of image and text. Here, the text collection is formed by multiple texts obtained after the first screening-filtering module performs screening-filtering on the texts in the text library, and the image collection is formed by multiple images obtained after the second screening-filtering module performs screening-filtering on the images in the image library.
The first screening-filtering module is specifically configured to obtain the image feature of the input image using the second convolutional neural network, obtain the text feature of each text in the text library using the second recurrent neural network, calculate the correlation between the image feature of the input image and the text feature of each text, and select multiple texts from the texts according to the order of correlation, the selected multiple texts serving as the text collection.
The second screening-filtering module is specifically configured to obtain the text feature of the input text using the second recurrent neural network, obtain the image feature of each image in the image library using the second convolutional neural network, calculate the correlation between the text feature of the input text and the image feature of each image, and select multiple images from the images according to the order of correlation, the selected multiple images forming the image collection.
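Both screening-filtering modules reduce to the same pattern: rank gallery features by their correlation with a query feature and keep the best candidates. A sketch, using cosine similarity as the correlation measure (an assumption; the patent does not fix the measure):

```python
import numpy as np

def screen_top_k(query_feat, gallery_feats, k):
    """Sketch of the screening-filter: rank gallery features (texts or
    images) by correlation with the query feature, keep the top k."""
    q = query_feat / np.linalg.norm(query_feat)
    G = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = G @ q                            # cosine correlation per gallery item
    top = np.argsort(-sims)[:k]             # indices, best first
    return top, sims[top]
```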
The specific screening operations performed by the first screening-filtering module and the second screening-filtering module may refer to the description of the two examples in step S100 of the above method embodiments, and are not described in detail herein.
The above second convolutional neural network and second recurrent neural network are formed by training of the first training module. Specifically, the first training module may train the second convolutional neural network and the second recurrent neural network using image samples with individual identifiers and text samples with individual identifiers. In a specific example, the first training module may obtain the image feature of an image sample with an individual identifier using the second convolutional neural network, and obtain the text feature of a text sample with an individual identifier using the second recurrent neural network; thereafter, the first training module calculates a first matching degree between the image feature of the image sample and the text feature of each text sample in the text feature set, and calculates a second matching degree between the text feature of the text sample and the image feature of each image sample in the image feature set; thereafter, the first training module updates the parameters of the second convolutional neural network and the second recurrent neural network according to a cross-entropy loss function of the first matching degree and the second matching degree. In addition, when determining that the image feature set does not include the image feature of the image sample with the individual identifier, the first training module adds the image feature of the image sample with the individual identifier into the image feature set; and when determining that the text feature set does not include the text feature of the text sample with the individual identifier, the first training module adds the text feature of the text sample with the individual identifier into the text feature set. In the image feature set, the image features of different image samples having the same individual identifier share the image feature storage space of that individual identifier; in the text feature set, the text features of different text samples having the same individual identifier share the text feature storage space of that individual identifier. The specific operations of the first training module in training the second convolutional neural network and the second recurrent neural network may refer to the related description of steps a to h in the above method embodiments, and are not repeated herein. In addition, the first training module may also take the trained second convolutional neural network as the initialized first convolutional neural network, and take the trained second recurrent neural network as the initialized first recurrent neural network.
The first feature obtaining module 410 is mainly configured to obtain the image feature of each region in the image using the first convolutional neural network. Specifically, the first feature obtaining module 410 may input the image into the first convolutional neural network, and the first convolutional neural network extracts the image feature of each region in the image. The image features of the regions obtained by the first feature obtaining module 410 may be expressed as 7 × 7 × 512, i.e., the image features obtained by the first feature obtaining module 410 belong to 49 image regions of the same size, and each image region includes a 512-dimensional image feature.
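The 7 × 7 × 512 representation described above can be viewed as L = 49 region feature vectors; a sketch of that reshaping (the random array stands in for the Visual CNN output):

```python
import numpy as np

# The 7 x 7 x 512 CNN output is treated as L = 49 regions, each carrying
# a 512-dimensional feature i_k; region k is then regions[k].
feature_map = np.random.rand(7, 7, 512)     # stand-in for the Visual CNN output
regions = feature_map.reshape(49, 512)
```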
The second feature obtaining module 420 is mainly configured to obtain each word feature in the text using the first recurrent neural network. Specifically, the second feature obtaining module 420 may first obtain the one-hot vector of each word in the text, input the one-hot vector of each word into a fully connected layer for encoding, then sequentially input the encoding corresponding to each word into the first recurrent neural network, and obtain each word feature according to the output of the first recurrent neural network. More specific content may refer to the description of S420 in the above method embodiments, and is not repeated herein.
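The one-hot encoding step can be sketched as follows; `vocab` is a hypothetical word-to-index map introduced for illustration:

```python
import numpy as np

def one_hot_encode(words, vocab):
    """Sketch of the word-encoding step: each word in the text becomes a
    one-hot vector over the vocabulary before being fed to the fully
    connected embedding layer. `vocab` maps word -> index (assumed)."""
    out = np.zeros((len(words), len(vocab)))
    for t, w in enumerate(words):
        out[t, vocab[w]] = 1.0
    return out
```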
The correction processing module 430 is mainly configured to correct the image feature to obtain a corrected image feature. For example, the correction processing module 430 obtains spatial attention values according to the image feature of each region in the image and each word feature; selects a target region from the regions according to the spatial attention values; and obtains the image feature corresponding to the target region as the corrected image feature.
Specifically, the correction processing module 430 may calculate the affinity between the image feature of each region and each word feature using the spatial attention model, and normalize each affinity; the correction processing module 430 may calculate the image feature of the image with respect to each word according to each normalized affinity and the image feature of each region. The operations specifically performed by the correction processing module 430 may refer to the description of step S430 in the above method embodiments, and are not repeated herein.
The processing module 440 is mainly configured to perform semantic attention processing on the image feature and each word feature to obtain semantic attention values. In the case where the apparatus of the embodiment of the present invention includes the correction processing module 430, the processing module 440 may perform semantic attention processing on the corrected image feature and each word feature to obtain the semantic attention values. Specifically, the processing module 440 may cascade the image feature of the image with respect to each word with the feature of the corresponding word, input the cascade results into the semantic attention model, and calculate, by the semantic attention model, the contribution of each word to the image on different concepts. The operations specifically performed by the processing module 440 may refer to the description of step S440 in the above method embodiments, and are not repeated herein.
The judgment module 450 is mainly configured to calculate the matching degree of the image and the text according to the semantic attention values. Specifically, the judgment module 450 may determine, according to the contribution of each word to the image on different concepts and the cascade results, the contribution of the image feature with respect to each word on different concepts; thereafter, the judgment module 450 may decode, using a recurrent neural network, the contribution of the image feature with respect to each word on different concepts; thereafter, the judgment module 450 determines the matching degree of the image and the text from the decoded information using a fully connected layer and a binary classifier. The operations specifically performed by the judgment module 450 may refer to the description of step S450 in the above method embodiments, and are not repeated herein.
The second training module is mainly configured to, in the process of training the first convolutional neural network and the first recurrent neural network, supervise the training of the first convolutional neural network and the first recurrent neural network using the binary cross-entropy loss function. The second training module may supervise the training of the first convolutional neural network and the first recurrent neural network using the above formula (12); see the description of formula (12) in the above method embodiments, which is not repeated herein.
Embodiment three: a computer-readable storage medium.
A specific example of the computer-readable storage medium of the embodiment of the present invention is shown in Fig. 5.
The computer-readable storage medium of Fig. 5 is an optical disc 500 on which a computer program (i.e., a program product) is stored. When the program is executed by a processor, each step recorded in the above method embodiments can be implemented, for example: obtaining one group of an image and a text; obtaining the image feature of the image using the first convolutional neural network, and obtaining each word feature in the text using the first recurrent neural network; performing semantic attention processing on the image feature and each word feature to obtain semantic attention values; and calculating the matching degree of the image and the text according to the semantic attention values. The specific implementation of each step may refer to the related description in the above method embodiments, and is not repeated herein.
Embodiment four: an electronic device.
The electronic device provided by the embodiment of the present invention may be a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Referring now to Fig. 6, there is shown a structural diagram of an electronic device 600 suitable for implementing the terminal device or server of the embodiments of the present application. As shown in Fig. 6, the computer system 600 includes one or more processors and a communication unit, the one or more processors being, for example: one or more central processing units (CPU) 601 and/or one or more graphics processing units (GPU) 613. The processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 602 or executable instructions loaded from a storage part 608 into a random access memory (RAM) 603. The communication unit 612 may include but is not limited to a network card, and the network card may include but is not limited to an IB (InfiniBand) network card.
The processor may communicate with the read-only memory 602 and/or the random access memory 603 to execute the executable instructions, is connected with the communication unit 612 through a bus 604, and communicates with other target devices through the communication unit 612, so as to complete operations corresponding to any method provided by the embodiments of the present application, for example: obtaining one group of an image and a text; obtaining the image feature of the image using the first convolutional neural network, and obtaining each word feature in the text using the first recurrent neural network; performing semantic attention processing on the image feature and each word feature to obtain semantic attention values; and calculating the matching degree of the image and the text according to the semantic attention values.
In addition, the RAM 603 may also store various programs and data required for the operation of the apparatus. The CPU 601, the ROM 602 and the RAM 603 are connected with each other through the bus 604. Where the RAM 603 is present, the ROM 602 is an optional module. The RAM 603 stores executable instructions, or executable instructions are written into the ROM 602 at runtime, and the executable instructions cause the processor 601 to perform the operations corresponding to the above method. An input/output (I/O) interface 605 is also connected to the bus 604. The communication unit 612 may be integrally provided, or may be provided with multiple sub-modules (e.g., multiple IB network cards) and linked on the bus.
The following components are connected to the I/O interface 605: an input part 606 including a keyboard, a mouse, and the like; an output part 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage part 608 including a hard disk and the like; and a communication part 609 including a network interface card such as a LAN card, a modem, and the like. The communication part 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the driver 610 as needed, so that a computer program read therefrom is installed into the storage part 608 as needed.
It should be noted that the architecture shown in Fig. 6 is only an optional implementation. In specific practice, the number and types of the components in Fig. 6 may be selected, deleted, added or replaced according to actual needs; in the arrangement of different functional components, separate arrangement or integrated arrangement may also be adopted, for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU; the communication unit may be arranged separately, or may be integrally arranged on the CPU or the GPU; and so on. These alternative implementations all fall within the protection scope disclosed by the present invention.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, the computer program including program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: an instruction for obtaining one group of an image and a text (which may be referred to as a first instruction); an instruction for obtaining the image feature of the image using the first convolutional neural network (which may be referred to as a second instruction); an instruction for obtaining each word feature in the text using the first recurrent neural network (which may be referred to as a third instruction); an instruction for correcting the image feature to obtain a corrected image feature (which may be referred to as a fourth instruction); an instruction for performing semantic attention processing on the image feature and each word feature to obtain semantic attention values (which may be referred to as a fifth instruction); and an instruction for calculating the matching degree of the image and the text according to the semantic attention values (which may be referred to as a sixth instruction). In such embodiments, the computer program may be downloaded and installed from a network through the communication part 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above functions defined in the method of the present application are performed.
The above first instruction may include: an instruction for obtaining an input image, selecting any one text from the text collection (for example, sequentially or randomly selecting a text), and taking the input image and the selected text as one group of image and text (which may be referred to as a seventh instruction); the first instruction may also include: an instruction for obtaining an input text, selecting any one image from the image collection (for example, sequentially or randomly selecting an image), and taking the selected image and the input text as one group of image and text (which may be referred to as an eighth instruction). Here, the text collection is formed by multiple texts obtained after a ninth instruction performs screening-filtering on the texts in the text library, and the image collection is formed by multiple images obtained after a tenth instruction performs screening-filtering on the images in the image library.
The above ninth instruction is mainly used to perform screening-filtering on the texts in the text library, and the ninth instruction includes: an instruction for obtaining the image feature of the input image using the second convolutional neural network, an instruction for obtaining the text feature of each text in the text library using the second recurrent neural network, an instruction for calculating the correlation between the image feature of the input image and the text feature of each text, and an instruction for selecting multiple texts from the texts according to the order of correlation, the selected multiple texts serving as the text collection.
The above tenth instruction is mainly used to perform screening-filtering on the images in the image library, and the tenth instruction includes: an instruction for obtaining the text feature of the input text using the second recurrent neural network, an instruction for obtaining the image feature of each image in the image library using the second convolutional neural network, an instruction for calculating the correlation between the text feature of the input text and the image feature of each image, and an instruction for selecting multiple images from the images according to the order of correlation, the selected multiple images forming the image collection.
The specific screening operations performed by the ninth instruction and the tenth instruction may refer to the description of the two examples in step S100 of the above method embodiments, and are not described in detail herein.
The above second convolutional neural network and second recurrent neural network may be formed by training of an eleventh instruction. Specifically, the eleventh instruction is mainly used to train the second convolutional neural network and the second recurrent neural network using image samples with individual identifiers and text samples with individual identifiers; and the eleventh instruction may specifically include: an instruction for obtaining the image feature of an image sample with an individual identifier using the second convolutional neural network, an instruction for obtaining the text feature of a text sample with an individual identifier using the second recurrent neural network, an instruction for calculating a first matching degree between the image feature of the image sample and the text feature of each text sample in the text feature set, an instruction for calculating a second matching degree between the text feature of the text sample and the image feature of each image sample in the image feature set, and an instruction for updating the parameters of the second convolutional neural network and the second recurrent neural network according to a cross-entropy loss function of the first matching degree and the second matching degree. In addition, a twelfth instruction is mainly used to, when determining that the image feature set does not include the image feature of the image sample with the individual identifier, add the image feature of the image sample with the individual identifier into the image feature set; a thirteenth instruction is mainly used to, when determining that the text feature set does not include the text feature of the text sample with the individual identifier, add the text feature of the text sample with the individual identifier into the text feature set. In the image feature set, the image features of different image samples having the same individual identifier share the image feature storage space of that individual identifier; in the text feature set, the text features of different text samples having the same individual identifier share the text feature storage space of that individual identifier. The specific operations of the eleventh instruction in training the second convolutional neural network and the second recurrent neural network may refer to the related description of steps a to h in the above method embodiments, and are not repeated herein. In addition, a fourteenth instruction is mainly used to take the trained second convolutional neural network as the initialized first convolutional neural network, and a fifteenth instruction is mainly used to take the trained second recurrent neural network as the initialized first recurrent neural network.
As an example, the second instruction may input the image into the first convolutional neural network, and the first convolutional neural network extracts the image feature of each region in the image. The image features of the regions obtained by the second instruction may be expressed as 7 × 7 × 512, i.e., the image features obtained by the second instruction belong to 49 image regions of the same size, and each image region includes a 512-dimensional image feature.
The third instruction may specifically include: an instruction for obtaining the one-hot vector of each word in the text, an instruction for inputting the one-hot vector of each word into a fully connected layer for encoding, and an instruction for sequentially inputting the encoding corresponding to each word into the first recurrent neural network and obtaining each word feature according to the output of the first recurrent neural network. The operations specifically performed by the instructions included in the third instruction may refer to the description of S420 in the above method embodiments, and are not repeated herein.
The fourth instruction may specifically include: an instruction for obtaining spatial attention values according to the image feature of each region in the image and each word feature; an instruction for selecting a target region from the regions according to the spatial attention values; and an instruction for obtaining the image feature corresponding to the target region as the corrected image feature. The above instruction for obtaining the spatial attention values according to the image feature of each region in the image and each word feature may specifically be: an instruction for calculating the affinity between the image feature of each region and each word feature using the spatial attention model and normalizing each affinity, and an instruction for calculating the image feature of the image with respect to each word according to each normalized affinity and the image feature of each region. The operations specifically performed by the instructions included in the fourth instruction may refer to the description of step S430 in the above method embodiments, and are not repeated herein.
The fifth instruction may include: an instruction for cascading the image feature obtained for each word with the feature of the corresponding word, inputting each cascaded feature into the semantic attention model, and computing, by the semantic attention model, each word's contribution to the image under different concepts. For the operations performed by the instructions comprised in the fifth instruction, refer to the description of step S440 in the method embodiment above; details are not repeated here.
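The cascade-then-attend step can be sketched as below. The linear projection onto concept scores is a placeholder assumption — the patent leaves the internal form of the semantic attention model open:

```python
import numpy as np

rng = np.random.default_rng(2)
n_words, img_dim, word_dim, n_concepts = 4, 512, 256, 10

per_word_img = rng.standard_normal((n_words, img_dim))  # from spatial attention
word_feats = rng.standard_normal((n_words, word_dim))

# Hypothetical projection of the cascaded feature onto concept scores.
W_sem = rng.standard_normal((img_dim + word_dim, n_concepts)) * 0.01

# Cascade each word's image feature with the word feature itself.
cascaded = np.concatenate([per_word_img, word_feats], axis=1)
scores = cascaded @ W_sem                                # (words, concepts)

# Softmax over words: each word's contribution to the image per concept.
e = np.exp(scores - scores.max(axis=0, keepdims=True))
contribution = e / e.sum(axis=0, keepdims=True)

print(contribution.shape)  # (4, 10); each concept column sums to 1
```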
The sixth instruction may specifically include: an instruction for determining, according to each word's contribution to the image under different concepts and the cascade result, the contribution of the image features for each word under different concepts; an instruction for decoding, with a recurrent neural network, the contributions of the image features for each word under different concepts; and an instruction for determining, with a fully connected layer and a binary classifier, the matching degree of the image and the text from the decoded information. For the operations performed by the instructions comprised in the sixth instruction, refer to the description of step S450 in the method embodiment above; details are not repeated here.
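A minimal sketch of the decode-and-classify step: a plain RNN stands in for the decoding recurrent network, and a linear layer with a sigmoid stands in for the fully connected layer and binary classifier. All shapes and weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n_words, feat_dim, hidden_dim = 4, 64, 32

# Per-word contributions of the image features under different concepts,
# flattened to one vector per word (shapes are illustrative assumptions).
contributions = rng.standard_normal((n_words, feat_dim))

W_xh = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
w_out = rng.standard_normal(hidden_dim) * 0.1

def matching_degree(contributions):
    """Decode the contribution sequence with an RNN, then map the final
    state through an FC layer and a sigmoid (binary classifier)."""
    h = np.zeros(hidden_dim)
    for x in contributions:
        h = np.tanh(x @ W_xh + h @ W_hh)
    logit = h @ w_out
    return 1.0 / (1.0 + np.exp(-logit))  # matching degree in (0, 1)

p = matching_degree(contributions)
print(0.0 < p < 1.0)  # True
```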
The program code in the embodiments of the present invention may further include an instruction for supervising, with a binary cross-entropy loss function, the training of the first convolutional neural network and the first recurrent neural network during their training. This instruction may supervise the training of the first convolutional neural network and the first recurrent neural network using the foregoing formula (12); for details, refer to the description of formula (12) in the method embodiment above, which is not repeated here.
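The binary cross-entropy loss named above has a standard form, sketched here in numpy (this is the textbook definition, not the patent's formula (12) verbatim):

```python
import numpy as np

def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy between the predicted matching degree p
    and the ground-truth label y (1 = matched pair, 0 = mismatched)."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# A matched pair scored 0.9 incurs a small loss;
# a mismatched pair scored 0.9 incurs a large one.
print(round(float(bce_loss(0.9, 1)), 4))  # 0.1054
print(round(float(bce_loss(0.9, 0)), 4))  # 2.3026
```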
The methods, apparatus, electronic devices, and computer-readable storage media of the present invention may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only; the steps of the methods of the present invention are not limited to the order specifically described above, unless specifically stated otherwise. In addition, in some embodiments the present invention may also be embodied as programs recorded on a recording medium, the programs comprising machine-readable instructions for implementing the methods according to the present invention. Thus, the present invention also covers recording media that store programs for executing the methods according to the present invention.
The description of the present invention is provided for the purposes of example and description and is not intended to be exhaustive or to limit the invention to the disclosed form. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were selected and described to better illustrate the principles and practical applications of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments with various modifications suited to particular uses.

Claims (10)

1. A method for implementing image-text matching, characterized by comprising:
obtaining a group comprising an image and a text;
obtaining image features of the image using a first convolutional neural network, and obtaining each word feature in the text using a first recurrent neural network;
performing semantic attention processing on the image features and each word feature to obtain semantic attention values; and
calculating a matching degree between the image and the text according to the semantic attention values.
2. The method according to claim 1, characterized in that the step of obtaining a group comprising an image and a text comprises:
obtaining an input image, selecting any text from a text set, and taking the input image and the selected text as the group comprising an image and a text; or
obtaining an input text, selecting any image from an image set, and taking the input text and the selected image as the group comprising an image and a text;
wherein the text set is formed by multiple texts obtained by screening and filtering the texts in a text library, and the image set is formed by multiple images obtained by screening and filtering the images in an image library.
3. The method according to claim 2, characterized in that the step of screening and filtering the texts in the text library comprises:
obtaining image features of the input image using a second convolutional neural network, and obtaining a text feature of each text in the text library using a second recurrent neural network;
calculating the correlation between the image features of the input image and the text feature of each text; and
selecting multiple texts from the texts in order of the correlation, the selected multiple texts serving as the text set.
4. The method according to claim 2, characterized in that the step of screening and filtering the images in the image library comprises:
obtaining a text feature of the input text using a second recurrent neural network, and obtaining image features of each image in the image library using a second convolutional neural network;
calculating the correlation between the text feature of the input text and the image features of each image; and
selecting multiple images from the images in order of the correlation, the selected multiple images forming the image set.
5. The method according to claim 3 or 4, characterized in that the method further comprises:
a step of training the second convolutional neural network and the second recurrent neural network using image samples with individual labels and text samples with individual labels.
6. The method according to claim 5, characterized in that the training step comprises:
obtaining image features of the image samples with individual labels using the second convolutional neural network, and obtaining text features of the text samples with individual labels using the second recurrent neural network;
calculating a first matching degree between the image features of an image sample and the text feature of each text sample in a text feature set, and calculating a second matching degree between the text feature of a text sample and the image features of each image sample in an image feature set; and
updating the parameters of the second convolutional neural network and the second recurrent neural network according to a cross-entropy loss function over the first matching degree and the second matching degree.
7. The method according to claim 5 or 6, characterized in that:
in the image feature set, the image features of different image samples bearing the same individual label share that individual label's image feature storage space;
and/or
in the text feature set, the text features of different text samples bearing the same individual label share that individual label's text feature storage space.
8. An apparatus for implementing image-text matching, characterized by comprising:
an image-text obtaining module, configured to obtain a group comprising an image and a text;
a first feature obtaining module, configured to obtain image features of the image using a first convolutional neural network;
a second feature obtaining module, configured to obtain each word feature in the text using a first recurrent neural network;
a processing module, configured to perform semantic attention processing on the image features and each word feature to obtain semantic attention values; and
a judgment module, configured to calculate a matching degree between the image and the text according to the semantic attention values.
9. An electronic device, comprising:
a memory for storing a computer program; and
a processor for executing the computer program stored in the memory, wherein when the computer program is executed, the following instructions are run:
an instruction for obtaining a group comprising an image and a text;
an instruction for obtaining image features of the image using a first convolutional neural network;
an instruction for obtaining each word feature in the text using a first recurrent neural network;
an instruction for performing semantic attention processing on the image features and each word feature to obtain semantic attention values; and
an instruction for calculating a matching degree between the image and the text according to the semantic attention values.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN201710453664.7A 2017-06-15 2017-06-15 Method and device for realizing image-text matching and electronic equipment Active CN108228686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710453664.7A CN108228686B (en) 2017-06-15 2017-06-15 Method and device for realizing image-text matching and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710453664.7A CN108228686B (en) 2017-06-15 2017-06-15 Method and device for realizing image-text matching and electronic equipment

Publications (2)

Publication Number Publication Date
CN108228686A true CN108228686A (en) 2018-06-29
CN108228686B CN108228686B (en) 2021-03-23

Family

ID=62658078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710453664.7A Active CN108228686B (en) 2017-06-15 2017-06-15 Method and device for realizing image-text matching and electronic equipment

Country Status (1)

Country Link
CN (1) CN108228686B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960338A (en) * 2018-07-18 2018-12-07 苏州科技大学 The automatic sentence mask method of image based on attention-feedback mechanism
CN109033321A (en) * 2018-07-18 2018-12-18 成都快眼科技有限公司 It is a kind of that image is with natural language feature extraction and the language based on keyword indicates image partition method
CN109472360A (en) * 2018-10-30 2019-03-15 北京地平线机器人技术研发有限公司 Update method, updating device and the electronic equipment of neural network
CN109543714A (en) * 2018-10-16 2019-03-29 北京达佳互联信息技术有限公司 Acquisition methods, device, electronic equipment and the storage medium of data characteristics
CN109614613A (en) * 2018-11-30 2019-04-12 北京市商汤科技开发有限公司 The descriptive statement localization method and device of image, electronic equipment and storage medium
CN109885796A (en) * 2019-01-25 2019-06-14 内蒙古工业大学 A kind of Internet news figure matching detection method based on deep learning
CN109886326A (en) * 2019-01-31 2019-06-14 深圳市商汤科技有限公司 A kind of cross-module state information retrieval method, device and storage medium
CN109933802A (en) * 2019-03-25 2019-06-25 腾讯科技(深圳)有限公司 Picture and text matching process, device and storage medium
CN110032658A (en) * 2019-03-19 2019-07-19 深圳壹账通智能科技有限公司 Text matching technique, device, equipment and storage medium based on image analysis
CN110097010A (en) * 2019-05-06 2019-08-06 北京达佳互联信息技术有限公司 Picture and text detection method, device, server and storage medium
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110348535A (en) * 2019-07-17 2019-10-18 北京金山数字娱乐科技有限公司 A kind of vision Question-Answering Model training method and device
CN110516085A (en) * 2019-07-11 2019-11-29 西安电子科技大学 The mutual search method of image text based on two-way attention
CN110704665A (en) * 2019-08-30 2020-01-17 北京大学 Image feature expression method and system based on visual attention mechanism
CN110851641A (en) * 2018-08-01 2020-02-28 杭州海康威视数字技术股份有限公司 Cross-modal retrieval method and device and readable storage medium
WO2020073700A1 (en) * 2018-10-08 2020-04-16 腾讯科技(深圳)有限公司 Image description model training method and device, and storage medium
US20210042474A1 (en) * 2019-03-29 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
CN112529986A (en) * 2019-09-19 2021-03-19 百度在线网络技术(北京)有限公司 Image-text correlation calculation model establishing method, calculation method and device
WO2021135816A1 (en) * 2019-12-30 2021-07-08 华为技术有限公司 Method, apparatus and system for identifying text in image
CN113343664A (en) * 2021-06-29 2021-09-03 京东数科海益信息科技有限公司 Method and device for determining matching degree between image texts
CN114580577A (en) * 2022-05-05 2022-06-03 天津大学 Multi-mode-oriented interactive data annotation method and system
US11481625B2 (en) * 2017-08-04 2022-10-25 Nokia Technologies Oy Artificial neural network
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN117194652A (en) * 2023-11-08 2023-12-08 泸州友信达智能科技有限公司 Information recommendation system based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105718555A (en) * 2016-01-19 2016-06-29 中国人民解放军国防科学技术大学 Hierarchical semantic description based image retrieving method
WO2016197381A1 (en) * 2015-06-12 2016-12-15 Sensetime Group Limited Methods and apparatus for recognizing text in an image
CN106446782A (en) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image identification method and device
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to iamge description


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HYEONSEOB NAM ; JUNG-WOO HA ; JEONGHEE KIM: "Dual Attention Networks for Multimodal Reasoning and Matching", 《HTTPS://ARXIV.ORG/ABS/1611.00471》 *
JIANG WANG ; YI YANG ; JUNHUA MAO ; ZHIHENG HUANG ; CHANG HUANG: "CNN-RNN: A Unified Framework for Multi-label Image Classification", 《COMPUTER VISION AND PATTERN RECOGNITION》 *
QI WU 等: "What Value Do Explicit High Level Concepts Have in Vision to Language Problems?", 《COMPUTER VISION AND PATTERN RECOGNITION》 *
STANISLAW ANTOL ; AISHWARYA AGRAWAL ; JIASEN LU ; MARGARET MITCH: "VQA: Visual Question Answering", 《COMPUTER VISION》 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11481625B2 (en) * 2017-08-04 2022-10-25 Nokia Technologies Oy Artificial neural network
CN109033321A (en) * 2018-07-18 2018-12-18 成都快眼科技有限公司 It is a kind of that image is with natural language feature extraction and the language based on keyword indicates image partition method
CN108960338B (en) * 2018-07-18 2021-10-08 苏州科技大学 Image automatic statement marking method based on attention feedback mechanism
CN109033321B (en) * 2018-07-18 2021-12-17 成都快眼科技有限公司 Image and natural language feature extraction and keyword-based language indication image segmentation method
CN108960338A (en) * 2018-07-18 2018-12-07 苏州科技大学 The automatic sentence mask method of image based on attention-feedback mechanism
CN110851641A (en) * 2018-08-01 2020-02-28 杭州海康威视数字技术股份有限公司 Cross-modal retrieval method and device and readable storage medium
WO2020073700A1 (en) * 2018-10-08 2020-04-16 腾讯科技(深圳)有限公司 Image description model training method and device, and storage medium
CN109543714A (en) * 2018-10-16 2019-03-29 北京达佳互联信息技术有限公司 Acquisition methods, device, electronic equipment and the storage medium of data characteristics
US11328180B2 (en) 2018-10-30 2022-05-10 Beijing Horizon Robotics Technology Research And Development Co., Ltd. Method for updating neural network and electronic device
CN109472360A (en) * 2018-10-30 2019-03-15 北京地平线机器人技术研发有限公司 Update method, updating device and the electronic equipment of neural network
CN109472360B (en) * 2018-10-30 2020-09-04 北京地平线机器人技术研发有限公司 Neural network updating method and updating device and electronic equipment
CN109614613A (en) * 2018-11-30 2019-04-12 北京市商汤科技开发有限公司 The descriptive statement localization method and device of image, electronic equipment and storage medium
US11455788B2 (en) 2018-11-30 2022-09-27 Beijing Sensetime Technology Development Co., Ltd. Method and apparatus for positioning description statement in image, electronic device, and storage medium
CN109614613B (en) * 2018-11-30 2020-07-31 北京市商汤科技开发有限公司 Image description statement positioning method and device, electronic equipment and storage medium
CN109885796A (en) * 2019-01-25 2019-06-14 内蒙古工业大学 A kind of Internet news figure matching detection method based on deep learning
JP2022509327A (en) * 2019-01-31 2022-01-20 シェンチェン センスタイム テクノロジー カンパニー リミテッド Cross-modal information retrieval method, its device, and storage medium
CN109886326B (en) * 2019-01-31 2022-01-04 深圳市商汤科技有限公司 Cross-modal information retrieval method and device and storage medium
JP7164729B2 (en) 2019-01-31 2022-11-01 シェンチェン センスタイム テクノロジー カンパニー リミテッド CROSS-MODAL INFORMATION SEARCH METHOD AND DEVICE THEREOF, AND STORAGE MEDIUM
CN109886326A (en) * 2019-01-31 2019-06-14 深圳市商汤科技有限公司 A kind of cross-module state information retrieval method, device and storage medium
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110147457B (en) * 2019-02-28 2023-07-25 腾讯科技(深圳)有限公司 Image-text matching method, device, storage medium and equipment
CN110032658A (en) * 2019-03-19 2019-07-19 深圳壹账通智能科技有限公司 Text matching technique, device, equipment and storage medium based on image analysis
CN109933802B (en) * 2019-03-25 2023-05-26 腾讯科技(深圳)有限公司 Image-text matching method, image-text matching device and storage medium
CN109933802A (en) * 2019-03-25 2019-06-25 腾讯科技(深圳)有限公司 Picture and text matching process, device and storage medium
US20210042474A1 (en) * 2019-03-29 2021-02-11 Beijing Sensetime Technology Development Co., Ltd. Method for text recognition, electronic device and storage medium
CN110097010A (en) * 2019-05-06 2019-08-06 北京达佳互联信息技术有限公司 Picture and text detection method, device, server and storage medium
CN110516085A (en) * 2019-07-11 2019-11-29 西安电子科技大学 The mutual search method of image text based on two-way attention
CN110516085B (en) * 2019-07-11 2022-05-17 西安电子科技大学 Image text mutual retrieval method based on bidirectional attention
CN110348535A (en) * 2019-07-17 2019-10-18 北京金山数字娱乐科技有限公司 A kind of vision Question-Answering Model training method and device
CN110704665A (en) * 2019-08-30 2020-01-17 北京大学 Image feature expression method and system based on visual attention mechanism
CN112529986A (en) * 2019-09-19 2021-03-19 百度在线网络技术(北京)有限公司 Image-text correlation calculation model establishing method, calculation method and device
CN112529986B (en) * 2019-09-19 2023-09-22 百度在线网络技术(北京)有限公司 Graph-text correlation calculation model establishment method, graph-text correlation calculation method and graph-text correlation calculation device
WO2021135816A1 (en) * 2019-12-30 2021-07-08 华为技术有限公司 Method, apparatus and system for identifying text in image
CN113343664A (en) * 2021-06-29 2021-09-03 京东数科海益信息科技有限公司 Method and device for determining matching degree between image texts
CN113343664B (en) * 2021-06-29 2023-08-08 京东科技信息技术有限公司 Method and device for determining matching degree between image texts
CN114580577B (en) * 2022-05-05 2022-09-13 天津大学 Multi-mode-oriented interactive data annotation method and system
CN114580577A (en) * 2022-05-05 2022-06-03 天津大学 Multi-mode-oriented interactive data annotation method and system
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN117194652A (en) * 2023-11-08 2023-12-08 泸州友信达智能科技有限公司 Information recommendation system based on deep learning
CN117194652B (en) * 2023-11-08 2024-01-23 泸州友信达智能科技有限公司 Information recommendation system based on deep learning

Also Published As

Publication number Publication date
CN108228686B (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN108228686A (en) It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN111191791B (en) Picture classification method, device and equipment based on machine learning model
CN107977665A (en) The recognition methods of key message and computing device in a kind of invoice
CN109800821A (en) Method, image processing method, device, equipment and the medium of training neural network
CN108960036A (en) 3 D human body attitude prediction method, apparatus, medium and equipment
CN108229479A (en) The training method and device of semantic segmentation model, electronic equipment, storage medium
CN110232183A (en) Keyword extraction model training method, keyword extracting method, device and storage medium
CN109522557A (en) Training method, device and the readable storage medium storing program for executing of text Relation extraction model
CN109359538A (en) Training method, gesture identification method, device and the equipment of convolutional neural networks
CN109388807A (en) The method, apparatus and storage medium of electronic health record name Entity recognition
CN108229296A (en) The recognition methods of face skin attribute and device, electronic equipment, storage medium
CN108229303A (en) Detection identification and the detection identification training method of network and device, equipment, medium
CN106485773B (en) A kind of method and apparatus for generating animation data
CN109165645A (en) A kind of image processing method, device and relevant device
CN106548192A (en) Based on the image processing method of neutral net, device and electronic equipment
CN110363213A (en) The cognitive analysis of image of clothing and classification
CN109816438B (en) Information pushing method and device
CN108280451A (en) Semantic segmentation and network training method and device, equipment, medium, program
CN110363084A (en) A kind of class state detection method, device, storage medium and electronics
CN108229485A (en) For testing the method and apparatus of user interface
CN109345553A (en) A kind of palm and its critical point detection method, apparatus and terminal device
CN109165563A (en) Pedestrian recognition methods and device, electronic equipment, storage medium, program product again
CN108229287A (en) Image-recognizing method and device, electronic equipment and computer storage media
CN108898639A (en) A kind of Image Description Methods and system
CN110457677A (en) Entity-relationship recognition method and device, storage medium, computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant