CN109359196A - Text Multimodal presentation method and device - Google Patents


Info

Publication number
CN109359196A
CN109359196A (application CN201811230363.9A)
Authority
CN
China
Prior art keywords
text
image
vector
text object
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811230363.9A
Other languages
Chinese (zh)
Other versions
CN109359196B (en)
Inventor
黄苹苹
乔敏
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811230363.9A priority Critical patent/CN109359196B/en
Publication of CN109359196A publication Critical patent/CN109359196A/en
Application granted granted Critical
Publication of CN109359196B publication Critical patent/CN109359196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The present invention proposes a text multimodal representation method and device. The method includes: obtaining a text to be processed, recognizing the text, and obtaining the text object set corresponding to the text and the text vector corresponding to each text object; for each text object, obtaining an image set related to the text object; determining, according to the image vector of each image in the image set, an image vector corresponding to the text object; determining, according to the text object's text vector and image vector, the multimodal vector corresponding to the text object, and in turn the multimodal vector corresponding to the text. The text is thus represented by text vectors and image vectors simultaneously, matching the nature of multimodal tasks. Because of this multimodal representation of the text, the fusion-classification model or image description model in a multimodal task can be trained with fewer images and texts while still ensuring a given accuracy, reducing training cost and improving the execution accuracy and efficiency of the multimodal task.

Description

Text Multimodal presentation method and device
Technical field
The present invention relates to the technical field of data processing, and more particularly to a text multimodal representation method and device.
Background art
A multimodal task is one that conducts human-computer interaction through multiple channels such as text, speech, video, action, and environment, simulating the way people interact with each other. In a current multimodal task such as visual question answering (VQA), the input image and question text are first obtained, the image vector corresponding to the image and the text vector corresponding to the question text are computed, and the two are fused and classified to determine the answer corresponding to the question text. As another example, in an image captioning (IC) task, the input image and its image vector are first obtained; the image vector is input into an image description model to obtain the first output word; then the text vector of the first word and the image vector are input into the model to obtain the second word; then the text vectors of the first and second words together with the image vector are input into the model, and so on, until the description sentence of the image is obtained.
In both of these multimodal tasks, the vector representations of the image and the question text are unimodal: the image is represented only by an image vector and the text only by a text vector, which does not match the multimodal nature of the task. Because of this unimodal representation of image and text, the fusion-classification model and the image description model require large amounts of images and text during training to reach a given accuracy, which raises training cost and lowers the execution accuracy and efficiency of the multimodal task.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, the first object of the present invention is to propose a text multimodal representation method, to solve the problem of poor execution accuracy and efficiency of multimodal tasks in the prior art.
The second object of the present invention is to propose a text multimodal representation device.
The third object of the present invention is to propose another text multimodal representation device.
The fourth object of the present invention is to propose a non-transitory computer-readable storage medium.
The fifth object of the present invention is to propose a computer program product.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a text multimodal representation method, comprising:
obtaining a text to be processed, recognizing the text, and obtaining a text object set corresponding to the text and a text vector corresponding to each text object in the set;
for each text object in the text object set, obtaining an image set related to the text object;
determining, according to the image vector corresponding to each image in the image set, an image vector corresponding to the text object;
determining, according to the text vector and image vector corresponding to the text object, a multimodal vector corresponding to the text object;
determining, according to the multimodal vector corresponding to each text object in the set, a multimodal vector corresponding to the text.
Further, obtaining an image set related to each text object in the text object set comprises:
obtaining images associated with the text object;
clustering the associated images to obtain an image set corresponding to each sense of the text object;
determining, according to the text object and the text, the current sense of the text object;
and taking the image set corresponding to the current sense as the image set related to the text object.
Further, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text;
the method further comprises:
obtaining an input image corresponding to the question text;
performing image recognition on the input image to obtain an image vector corresponding to the input image;
fusing and classifying the image vector of the input image and the multimodal vector of the question text to determine the answer corresponding to the question text.
Further, fusing and classifying the image vector of the input image and the multimodal vector of the question text to determine the answer corresponding to the question text comprises:
inputting the image vector of the input image and the multimodal vector of the question text into a preset classification model to obtain the probability of each candidate answer output by the classification model;
and taking the candidate answers whose probability meets a preset condition as the answer corresponding to the question text.
Further, the text is a description text composed of the description words output by an image description model;
before obtaining the text to be processed, the method further comprises:
obtaining an input image to be described;
performing image recognition on the image to be described to obtain an image vector corresponding to it;
inputting the image vector of the image to be described into the image description model to obtain the first description word output by the model;
taking the first description word as the description text;
and, after determining the multimodal vector corresponding to the text, the method further comprises:
inputting the multimodal vector of the description text and the image vector of the image to be described into the image description model to obtain the second description word output by the model, and integrating the first and second description words into the description text, repeating until the image description model has output all description words.
With the text multimodal representation method of the embodiment of the present invention, the text to be processed is obtained and recognized, yielding the text object set and the text vector of each text object in the set; for each text object an image set is obtained, and an image vector for the object is determined from the image vectors of that set; the object's multimodal vector is determined from its text vector and image vector; and the text's multimodal vector is determined from the multimodal vectors of the objects. The text is thus represented by text vectors and image vectors simultaneously, matching the multimodal task, and because of this multimodal representation the fusion-classification model or image description model in a multimodal task can be trained with fewer images and texts while still ensuring a given accuracy, reducing training cost and improving the execution accuracy and efficiency of the multimodal task.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a text multimodal representation device, comprising:
an obtaining module, configured to obtain a text to be processed, recognize the text, and obtain the text object set corresponding to the text and the text vector corresponding to each text object in the set;
the obtaining module being further configured to obtain, for each text object in the set, an image set related to the text object;
a determining module, configured to determine, according to the image vector corresponding to each image in the image set, an image vector corresponding to the text object;
the determining module being further configured to determine, according to the text vector and image vector of the text object, the multimodal vector corresponding to the text object;
the determining module being further configured to determine, according to the multimodal vector of each text object in the set, the multimodal vector corresponding to the text.
Further, the obtaining module is specifically configured to:
obtain images associated with the text object;
cluster the associated images to obtain an image set corresponding to each sense of the text object;
determine, according to the text object and the text, the current sense of the text object;
and take the image set corresponding to the current sense as the image set related to the text object.
Further, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text;
the device further includes a first image recognition module and a fusion-classification module;
the obtaining module is further configured to obtain an input image corresponding to the question text;
the first image recognition module is configured to perform image recognition on the input image to obtain the image vector corresponding to the input image;
the fusion-classification module is configured to fuse and classify the image vector of the input image and the multimodal vector of the question text to determine the answer corresponding to the question text.
Further, the fusion-classification module is specifically configured to:
input the image vector of the input image and the multimodal vector of the question text into a preset classification model to obtain the probability of each candidate answer output by the classification model;
and take the candidate answers whose probability meets a preset condition as the answer corresponding to the question text.
Further, the text is a description text composed of the description words output by an image description model;
the device further includes a second image recognition module and an input module;
the obtaining module is further configured to obtain an input image to be described;
the second image recognition module is configured to perform image recognition on the image to be described to obtain the image vector corresponding to it;
the input module is configured to input the image vector of the image to be described into the image description model to obtain the first description word output by the model;
the determining module is further configured to take the first description word as the description text;
the input module is further configured to input the multimodal vector of the description text and the image vector of the image to be described into the image description model to obtain the second description word output by the model, and to integrate the first and second description words into the description text, repeating until the image description model has output all description words.
With the text multimodal representation device of the embodiment of the present invention, the text to be processed is obtained and recognized, yielding the text object set and the text vector of each text object in the set; for each text object an image set is obtained, and an image vector for the object is determined from the image vectors of that set; the object's multimodal vector is determined from its text vector and image vector; and the text's multimodal vector is determined from the multimodal vectors of the objects. The text is thus represented by text vectors and image vectors simultaneously, matching the multimodal task, and because of this multimodal representation the fusion-classification model or image description model in a multimodal task can be trained with fewer images and texts while still ensuring a given accuracy, reducing training cost and improving the execution accuracy and efficiency of the multimodal task.
To achieve the above objects, an embodiment of the third aspect of the present invention proposes another text multimodal representation device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the text multimodal representation method described above.
To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the text multimodal representation method described above.
To achieve the above objects, an embodiment of the fifth aspect of the present invention proposes a computer program product; when the instructions in the computer program product are executed by a processor, the text multimodal representation method described above is implemented.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from it, or will be learned through practice of the invention.
Detailed description of the invention
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow diagram of a text multimodal representation method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of another text multimodal representation method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the visual question-answering task;
Fig. 4 is an execution schematic of the visual question-answering task;
Fig. 5 is a flow diagram of another text multimodal representation method provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the image captioning task;
Fig. 7 is a structural schematic of a text multimodal representation device provided by an embodiment of the present invention;
Fig. 8 is a structural schematic of another text multimodal representation device provided by an embodiment of the present invention;
Fig. 9 is a structural schematic of another text multimodal representation device provided by an embodiment of the present invention;
Fig. 10 is a structural schematic of another text multimodal representation device provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and are not to be construed as limiting it.
The text multimodal representation method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flow diagram of a text multimodal representation method provided by an embodiment of the present invention. As shown in Fig. 1, the text multimodal representation method includes the following steps:
S101: obtain a text to be processed, recognize the text, and obtain the text object set corresponding to the text and the text vector corresponding to each text object in the set.
The executing subject of the text multimodal representation method provided by the invention is a text multimodal representation device, which may be a hardware device such as a terminal device or a server, or software installed on such a device. The text to be processed may be a single word, several words, or a sentence composed of several words. The text objects may be all the words in the text, the words with concrete meaning, or the words relevant to the current multimodal task. In this embodiment, the device recognizes the text as follows: it segments the text to obtain the words it contains; selects the words relevant to the current multimodal task and combines them into the text object set; and, for each text object, inputs the text object and the text into a preset word-vector model, taking the model's output vector as the text object's text vector. The word-vector model may, for example, be a convolutional neural network (CNN) model or a bag-of-words (BOW) model, and can be trained on a large text corpus.
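The recognition step just described (segment, filter to task-relevant words, look up a vector per text object) can be sketched as follows. This is a toy illustration under stated assumptions, not the patent's implementation: a whitespace tokenizer stands in for a real segmenter, and the `STOPWORDS` and `EMBEDDINGS` tables are hypothetical stand-ins for a trained word-vector model.

```python
# Toy stand-ins for a relevance filter and a trained word-vector model.
STOPWORDS = {"what", "color", "is", "the", "a"}
EMBEDDINGS = {
    "shirt":  [0.2, 0.7, 0.1],
    "batter": [0.9, 0.1, 0.3],
}

def text_to_objects(text):
    """Tokenize the text and keep only task-relevant words (the text objects)."""
    words = text.lower().replace("?", "").split()
    return [w for w in words if w not in STOPWORDS and w in EMBEDDINGS]

def text_vectors(objects):
    """Look up the text vector corresponding to each text object."""
    return {obj: EMBEDDINGS[obj] for obj in objects}

objs = text_to_objects("What color shirt is the batter wearing?")
vecs = text_vectors(objs)
print(objs)  # ['shirt', 'batter']
```

In practice the lookup table would be replaced by an actual model trained on a large corpus, as the description notes.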
In this embodiment, when the multimodal task is a VQA task, the text to be processed may be a question text input by the user, either typed directly or obtained by speech recognition of the user's spoken question. When the multimodal task is an IC task, the text to be processed may be the single word, several words, or description text composed of several words output by the image description model.
S102: for each text object in the text object set, obtain an image set related to the text object.
In this embodiment, the device executes step 102 as follows: for each text object in the set, it carries the text object in a search request and sends the request to a search engine, so that the search engine retrieves multiple images relevant to the text object; the retrieved images are combined to obtain the image set related to that object. The search engine may, for example, be Flickr or ImageNet.
Further, on the basis of the above embodiment, some words have multiple senses that differ considerably; for example, the noun "batter" can mean a baseball batter or a cooking batter (paste). The device may therefore execute step 102 as follows instead: obtain the images associated with the text object; cluster the associated images to obtain one image set per sense of the text object; determine the current sense of the text object from the text object and the text; and take the image set of the current sense as the image set related to the text object.
In this embodiment, the device clusters the associated images by computing the similarity between any two of them and aggregating any pair whose similarity exceeds a preset threshold, which yields one image set per sense. To determine the current sense, the device may input each sense of the text object together with the text into a preset sense model to obtain the probability of each sense, and take the sense whose probability exceeds a preset value as the current sense. The sense model may be a CNN model or the like, and can be trained on a large corpus of texts together with the current senses of their text objects.
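The threshold-based aggregation of associated images can be sketched greedily as below. This is a minimal sketch under assumptions: images are represented by toy feature vectors, cosine similarity stands in for whatever similarity measure the device uses, and each image joins the first cluster whose representative it resembles.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_images(images, threshold=0.9):
    """Greedy aggregation: an image joins the first cluster whose first
    member it is similar enough to, otherwise it starts a new cluster."""
    clusters = []
    for img in images:
        for cluster in clusters:
            if cosine(img, cluster[0]) > threshold:
                cluster.append(img)
                break
        else:
            clusters.append([img])
    return clusters

# "batter": two similar cooking-batter photos and one baseball photo (toy vectors)
images = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(len(cluster_images(images)))  # 2 clusters, one per sense
```

A production system would likely use a proper clustering algorithm over learned image features, but the threshold idea is the same.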
S103: determine, according to the image vector corresponding to each image in the image set, an image vector corresponding to the text object.
In this embodiment, the image vectors of the images in the set generally have the same dimension. The image vector corresponding to the text object may be determined, for example, as a weighted sum of the per-image vectors, or as their average.
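The reduction of the image set to one vector can be sketched as a plain average over same-dimension vectors; a weighted sum would work the same way with per-image weights. The vectors here are illustrative, not model outputs.

```python
def average_vectors(vectors):
    """Average a list of same-dimension vectors component-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Toy image vectors for the images in one text object's image set.
image_set = [[1.0, 2.0], [3.0, 4.0]]
print(average_vectors(image_set))  # [2.0, 3.0]
```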
S104: determine, according to the text vector and image vector corresponding to the text object, the multimodal vector corresponding to the text object.
In this embodiment, the device may determine the multimodal vector of a text object by concatenating its text vector and image vector; the dimension of the concatenated vector is the sum of the dimensions of the text vector and the image vector.
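The concatenation step is simple enough to show directly; the vectors below are toy values, and list concatenation stands in for vector splicing.

```python
def multimodal_vector(text_vec, image_vec):
    """Concatenate text vector and image vector into one multimodal vector."""
    return text_vec + image_vec  # list concatenation == vector splicing

tv = [0.2, 0.7, 0.1]  # text vector (dimension 3)
iv = [2.0, 3.0]       # aggregated image vector (dimension 2)
mm = multimodal_vector(tv, iv)
print(len(mm))  # 5, the sum of the two dimensions
```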
S105: determine, according to the multimodal vector of each text object in the text object set, the multimodal vector corresponding to the text.
In this embodiment, the multimodal vector of the text may be the vector matrix formed by the multimodal vectors of the text objects, or the vector obtained by concatenating those multimodal vectors.
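Both text-level representations mentioned above (matrix of per-object vectors, or one long concatenated vector) can be sketched with toy values:

```python
# Hypothetical per-object multimodal vectors (names and values illustrative).
object_vectors = {
    "shirt":  [0.2, 0.7, 0.1, 2.0, 3.0],
    "batter": [0.9, 0.1, 0.3, 1.0, 0.0],
}

matrix = list(object_vectors.values())     # option 1: one row per text object
flat = [x for row in matrix for x in row]  # option 2: concatenated variant
print(len(matrix), len(flat))  # 2 10
```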
With the text multimodal representation method of the embodiment of the present invention, the text to be processed is obtained and recognized, yielding the text object set and the text vector of each text object; for each text object an image set is obtained, and an image vector for the object is determined from the image vectors of that set; the object's multimodal vector is determined from its text vector and image vector; and the text's multimodal vector is determined from the multimodal vectors of the objects. The text is thus represented by text vectors and image vectors simultaneously, matching the multimodal task, and because of this multimodal representation the fusion-classification model in a multimodal task can be trained with fewer images and texts while still ensuring a given accuracy, reducing training cost and improving the execution accuracy and efficiency of the multimodal task.
Fig. 2 is a flow diagram of another text multimodal representation method provided by an embodiment of the present invention. As shown in Fig. 2, on the basis of the embodiment of Fig. 1, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text.
Correspondingly, the method further includes the following steps:
S106: obtain an input image corresponding to the question text.
In this embodiment, the multimodal task is specifically visual question answering (VQA). In this task the input may be a question text and a corresponding input image. Fig. 3 is a schematic diagram of the visual question-answering task: the input image shows a batter playing baseball, and the question text is "What color shirt is the batter wearing?". The task-relevant words in the question text are "shirt" and "batter", and the candidate answers corresponding to the question text may include, for example, "blue" and "red". The text object set corresponding to this question text may therefore be, for example, {shirt, batter, blue, red, ...}.
Correspondingly, in a visual question-answering task the candidate answers generally have a single sense rather than several, so after the images associated with a candidate answer are obtained they need not be clustered; they can be combined directly into the image set related to that candidate answer. The device therefore obtains the image set for each text object as follows: it judges whether the text object is a word in the question text; if so, it obtains the associated images, clusters them into per-sense image sets, determines the current sense from the text object and the text, and takes the current sense's image set as the image set related to the text object; if not, it obtains the associated images and combines them directly into the image set related to the text object.
S107: perform image recognition on the input image to obtain the image vector corresponding to the input image.
In this embodiment, the device executes step 107 by inputting the input image into a preset image recognition model and taking the model's output vector as the image vector of the input image. The image recognition model is trained on a large number of images and may, for example, be a convolutional neural network (CNN) or a deep residual network (ResNet).
S108: fuse and classify the image vector of the input image and the multimodal vector of the question text to determine the answer corresponding to the question text.
In this embodiment, the device executes step 108 by inputting the input image's image vector and the question text's multimodal vector into a preset classification model to obtain the probability of each candidate answer output by the model, and taking the candidate answers whose probability meets a preset condition, for example exceeding a preset probability threshold, as the answer corresponding to the question text. The classification model can be trained on a large number of images, questions, and corresponding answers.
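The selection of answers whose probability meets a preset condition can be sketched as below. The softmax over toy logits is a hypothetical stand-in for the trained classification model operating on the fused vectors; only the thresholding logic reflects the description.

```python
import math

def softmax(scores):
    """Turn per-candidate logits into probabilities."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def answer(candidate_scores, threshold=0.5):
    """Keep the candidate answers whose probability exceeds the threshold."""
    probs = softmax(candidate_scores)
    return [c for c, p in probs.items() if p > threshold]

# Toy logits the classification model might emit for the candidate answers.
scores = {"red": 3.0, "blue": 0.5, "green": 0.2}
print(answer(scores))  # ['red']
```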
Fig. 4 is an execution schematic of the visual question-answering task. In Fig. 4, image recognition is performed on the input image to obtain its image vector; the question text is processed to obtain its multimodal vector; the two are fused and classified, and the answer corresponding to the question text is determined to be red. The image set related to the text object "shirt" may contain two images, such as the two T-shirts in Fig. 4; the image set related to "batter" may contain three images, such as the three batter images in Fig. 4.
In this embodiment, the above text multimodal representation method is applied to the visual question-answering task: the question text and its corresponding input image are obtained; the task-relevant words in the question text and the candidate answers form the text object set; the multimodal vector of the question text is then obtained by the above method; and the image vector of the input image and the multimodal vector of the question text are fused and classified to determine the answer corresponding to the question text. This improves the accuracy of the obtained answer, reduces the training cost of the classification model, and improves the execution accuracy and efficiency of the visual question-answering task.
Fig. 5 is a flow diagram of another text multimodal representation method provided by an embodiment of the present invention. As shown in Fig. 5, on the basis of the embodiment shown in Fig. 1, the text is a description text composed of the description words output by an image description model;
Correspondingly, before step 101, the method further includes the following steps:
S109. Obtain the input image to be described.
In the present embodiment, the multimodal task may specifically be the image captioning (IC) task, whose input may be an image to be described. Fig. 6 is a diagram of the image captioning task: the input image shows elephants walking, and the corresponding description text is "Two elephants and a baby elephant walking together".
S110. Perform image recognition on the image to be described to obtain its corresponding image vector.
In the present embodiment, the process by which the text multimodal representation device executes step 110 may specifically be: input the image to be described into a preset image recognition model, obtain the image vector output by the model, and determine that image vector as the one corresponding to the image to be described. The image recognition model is trained on a large number of images and may be, for example, a convolutional neural network (CNN) or a deep residual network (ResNet).
S111. Input the image vector corresponding to the image to be described into the image description model, and obtain the first description word output by the image description model.
In the present embodiment, the image description model may be, for example, an end-to-end neural network model whose input is an image and whose output is the description text corresponding to that image. The model may be trained on a large number of images and their corresponding description texts.
S112. Determine the first description word as the description text.
Correspondingly, in the case where the description text is the first description word, after step 105 the method may further include the following steps:
S113. Input the multimodal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, obtain the second description word output by the image description model, and combine the first description word and the second description word into the description text; steps 101-105 are repeated until the image description model has output all description words.
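Steps S110-S113 describe a word-by-word decoding loop. The sketch below mimics that loop with a stubbed image description model and text encoder (both are hypothetical stand-ins; the patent's model is an end-to-end neural network):

```python
def generate_caption(image_vec, step_model, encode_text, max_len=20):
    """Decode per steps S110-S113: the first word comes from the image
    vector alone; each later word is conditioned on the multimodal vector
    of the caption so far; stop at an <end> token or a length cap."""
    caption = [step_model(image_vec, None)]
    while caption[-1] != "<end>" and len(caption) < max_len:
        multimodal_vec = encode_text(" ".join(caption))
        caption.append(step_model(image_vec, multimodal_vec))
    return [w for w in caption if w != "<end>"]

# Stub model: emits a fixed word sequence regardless of the vectors.
script = iter(["two", "elephants", "walking", "<end>"])
caption = generate_caption([0.1, 0.2], lambda img, mm: next(script),
                           lambda text: [float(len(text))])
print(caption)  # ['two', 'elephants', 'walking']
```

The key structural point is that `encode_text` (the text multimodal representation method of steps 101-105) is re-run on the growing caption before every decoding step.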
In the present embodiment, the above text multimodal representation method is applied to the image captioning task: the input image to be described is obtained and image recognition is performed on it to obtain the corresponding image vector; that image vector is input into the image description model to obtain the first description word, which is determined as the description text; the multimodal vector corresponding to the description text is then obtained using the above text multimodal representation method, and that vector together with the image vector corresponding to the image to be described is input into the image description model to obtain the second description word; the first and second description words are combined into the description text, and the process repeats until the image description model has output all description words. This improves the accuracy of the obtained description text, reduces the training cost of the image description model, and improves the execution accuracy and efficiency of the image captioning task.
Fig. 7 is a structural diagram of a text multimodal representation device provided by an embodiment of the present invention. As shown in Fig. 7, the device includes: an obtaining module 71 and a determining module 72.
The obtaining module 71 is configured to obtain the text to be processed and identify it, obtaining the text object set corresponding to the text and the text vector corresponding to each text object in the text object set;
The obtaining module 71 is further configured to obtain, for each text object in the text object set, an image set relevant to that text object;
The determining module 72 is configured to determine, according to the image vector corresponding to each image in the image set, the image vector corresponding to the text object;
The determining module 72 is further configured to determine, according to the text vector and image vector corresponding to the text object, the multimodal vector corresponding to that text object;
The determining module 72 is further configured to determine, according to the multimodal vector corresponding to each text object in the text object set, the multimodal vector corresponding to the text.
The text multimodal representation device provided by the present invention may be a hardware device such as a terminal device or a server, or software installed on a hardware device. The text to be processed may be a single word, several words, or a sentence composed of several words. In the present embodiment, the process by which the device identifies the text may specifically be: segment the text to obtain the words it contains; select from those words the ones relevant to the current multimodal task, and combine them into the text object set corresponding to the text; then, for each text object in the set, input the text object into a preset word vector model, obtain the vector output by the word vector model, and determine that vector as the text vector corresponding to the text object. The word vector model may be, for example, a convolutional neural network (CNN) or a bag-of-words (BOW) model, and may be trained on a large amount of text.
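A minimal sketch of this identification step, assuming whitespace segmentation and a hash-like toy embedding in place of the preset word vector model (a real implementation would use a trained CNN or bag-of-words model):

```python
def build_text_objects(text, task_words, dim=4):
    """Segment the text, keep words relevant to the current multimodal task,
    and attach a toy text vector to each resulting text object."""
    def embed(word):
        # Deterministic stand-in for the word vector model: one value
        # per character, padded to a fixed dimension.
        return [(ord(c) % 7) / 7.0 for c in (word + "____")[:dim]]
    tokens = [w.strip("?.,!").lower() for w in text.split()]
    return {w: embed(w) for w in tokens if w in task_words}

objs = build_text_objects("What color shirt is the batter wearing?",
                          {"shirt", "batter"})
print(sorted(objs))  # ['batter', 'shirt']
```

For Chinese input the split step would instead use a proper word segmenter, but the select-then-embed flow is unchanged.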
In the present embodiment, when the multimodal task is the VQA task, the text to be processed may be a question text input by the user, entered directly or obtained by performing speech recognition on the question the user speaks. When the multimodal task is the IC task, the text to be processed may be the single word, several words, or description text composed of several words output by the image description model.
Further, on the basis of the above embodiments, some words have multiple senses that differ considerably; for example, when the word "batter" is used as a noun, its sense may be a baseball batter or a paste. Therefore, the obtaining module 71 may specifically be configured to: obtain the associated images of the text object; aggregate the associated images to obtain an image set corresponding to each sense of the text object; determine the current sense of the text object according to the text object and the text; and determine the image set corresponding to the current sense as the image set relevant to the text object.
In the present embodiment, the process by which the obtaining module 71 aggregates the associated images may specifically be: compute the similarity between any two associated images, and aggregate together any two associated images whose similarity is greater than a preset threshold, so as to obtain the image set corresponding to each sense. In addition, the process by which the obtaining module 71 determines the current sense of the text object according to the text object and the text may be, for example: the text multimodal representation device inputs each sense of the text object and the text into a preset sense model, obtains the probability of each sense, and determines the sense whose probability is greater than a preset value as the current sense of the text object. The sense model may be, for example, a CNN model, and may be trained on a large amount of text and the current sense of each text object in that text.
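The pairwise-similarity aggregation can be sketched as greedy single-link grouping over image feature vectors. Cosine similarity and the greedy strategy are assumptions for illustration; the patent only fixes "similarity greater than a preset threshold":

```python
def aggregate_by_sense(image_vecs, threshold=0.9):
    """Group associated images: a vector whose cosine similarity with a
    group's first member exceeds the threshold joins that group, giving
    roughly one image set per word sense."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return dot / (norm(a) * norm(b))
    groups = []
    for vec in image_vecs:
        for group in groups:
            if cosine(vec, group[0]) > threshold:
                group.append(vec)
                break
        else:  # no group was similar enough: start a new sense
            groups.append([vec])
    return groups

# Two near-identical "batter (baseball)" images and one "batter (paste)".
vecs = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(len(aggregate_by_sense(vecs)))  # 2
```

A production system would more likely use agglomerative clustering over real image embeddings, but the threshold-driven grouping is the same idea.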
The text multimodal representation device of the embodiment of the present invention obtains the text to be processed and identifies it, obtaining the text object set corresponding to the text and the text vector corresponding to each text object in the set; for each text object in the text object set, it obtains an image set relevant to the text object; according to the image vector corresponding to each image in the image set, it determines the image vector corresponding to the text object; according to the text vector and image vector corresponding to the text object, it determines the multimodal vector corresponding to the text object; and according to the multimodal vector corresponding to each text object in the text object set, it determines the multimodal vector corresponding to the text. The text can thus be represented by text vectors and image vectors simultaneously, matching the multimodal task. Because of this multimodal representation of the text, the fusion-classification model in the multimodal task can be trained with fewer images and texts while still ensuring a certain accuracy, thereby reducing the training cost and improving the execution accuracy and efficiency of the multimodal task.
With reference to Fig. 8, on the basis of the embodiment shown in Fig. 7, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text;
Correspondingly, the device may further include: a first image recognition module 73 and a fusion-classification module 74;
The obtaining module 71 is further configured to obtain the input image corresponding to the question text;
The first image recognition module 73 is configured to perform image recognition on the input image to obtain the image vector corresponding to the input image;
The fusion-classification module 74 is configured to fuse and classify the image vector corresponding to the input image and the multimodal vector corresponding to the question text, and determine the answer corresponding to the question text.
In the present embodiment, the multimodal task may specifically be the visual question answering (VQA) task, whose input may be a question text and an input image corresponding to the question text. Fig. 3 is a diagram of the visual question answering task: the input image shows a batter swinging at a baseball, and the question text is "What color shirt is the batter wearing?". The words in the question text relevant to the visual question answering task are "shirt" and "batter", and the candidate answers corresponding to the question text may be, for example, "blue", "red", and so on. The text object set corresponding to the question text may therefore be, for example, "shirt, batter, blue, red, ...".
Correspondingly, in the visual question answering task, the candidate answers corresponding to the question text generally have a single sense rather than multiple senses, so for each candidate answer there is no need to aggregate its associated images after they are obtained; the associated images can be combined directly into the image set relevant to the candidate answer. Therefore, the process by which the obtaining module 71 obtains, for each text object in the text object set, the image set relevant to the text object may specifically be: for each text object, judge whether the text object is a word in the question text; if so, obtain the associated images of the text object, aggregate them to obtain an image set corresponding to each sense of the text object, determine the current sense of the text object according to the text object and the text, and determine the image set corresponding to the current sense as the image set relevant to the text object; if not, obtain the associated images of the text object and combine them directly into the image set relevant to the text object.
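The branch described above (disambiguate question words, use candidate answers' images directly) can be sketched as follows; `fetch_images` and `disambiguate` are hypothetical callables standing in for the retrieval and sense-selection steps:

```python
def image_set_for(text_object, question_words, fetch_images, disambiguate):
    """VQA branch: question words may be ambiguous, so their associated
    images are grouped per sense and only the current sense's set is kept;
    candidate answers are treated as single-sense and used as-is."""
    images = fetch_images(text_object)
    if text_object in question_words:
        return disambiguate(text_object, images)
    return images

# Toy stubs: two images per object; disambiguation keeps only the first.
fetch = lambda obj: [obj + "_img1", obj + "_img2"]
disamb = lambda obj, imgs: imgs[:1]
print(image_set_for("batter", {"shirt", "batter"}, fetch, disamb))
# ['batter_img1']
print(image_set_for("red", {"shirt", "batter"}, fetch, disamb))
# ['red_img1', 'red_img2']
```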
In the present embodiment, the process by which the first image recognition module 73 performs image recognition on the input image to obtain its corresponding image vector may specifically be: input the input image into a preset image recognition model, obtain the image vector output by the model, and determine that image vector as the one corresponding to the input image. The image recognition model is trained on a large number of images and may be, for example, a convolutional neural network (CNN) or a deep residual network (ResNet).
In the present embodiment, the process by which the fusion-classification module 74 determines the answer corresponding to the question text may specifically be: input the image vector corresponding to the input image and the multimodal vector corresponding to the question text into a preset classification model, and obtain the probability of each candidate answer output by the classification model; a candidate answer whose probability meets a preset condition, for example a preset probability threshold, is determined as the answer corresponding to the question text. The classification model may be trained on a large number of images, questions, and corresponding answers.
In the present embodiment, the above text multimodal representation method is applied to the visual question answering task: the question text and the input image corresponding to it are obtained; a text object set is determined from the words in the question text relevant to the visual question answering task together with the candidate answers; the multimodal vector corresponding to the question text is then obtained using the above text multimodal representation method; and the image vector corresponding to the input image and the multimodal vector corresponding to the question text are fused and classified to determine the answer corresponding to the question text. This improves the accuracy of the obtained answer, reduces the training cost of the classification model, and improves the execution accuracy and efficiency of the visual question answering task.
With reference to Fig. 9, on the basis of the embodiment shown in Fig. 7, the text is a description text composed of the description words output by the image description model;
Correspondingly, the device may further include: a second image recognition module 75 and an input module 76;
The obtaining module 71 is further configured to obtain the input image to be described;
The second image recognition module 75 is configured to perform image recognition on the image to be described to obtain the image vector corresponding to the image to be described;
The input module 76 is configured to input the image vector corresponding to the image to be described into the image description model, and obtain the first description word output by the image description model;
The determining module 72 is further configured to determine the first description word as the description text;
The input module 76 is further configured to input the multimodal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, obtain the second description word output by the image description model, and combine the first description word and the second description word into the description text, until the image description model has output all description words.
In the present embodiment, the multimodal task may specifically be the image captioning (IC) task, whose input may be an image to be described. Fig. 6 is a diagram of the image captioning task: the input image shows elephants walking, and the corresponding description text is "Two elephants and a baby elephant walking together".
In the present embodiment, the process by which the second image recognition module 75 obtains the image vector corresponding to the image to be described may specifically be: input the image to be described into a preset image recognition model, obtain the image vector output by the model, and determine that image vector as the one corresponding to the image to be described. The image recognition model is trained on a large number of images and may be, for example, a convolutional neural network (CNN) or a deep residual network (ResNet).
In the present embodiment, the image description model may be, for example, an end-to-end neural network model whose input is an image and whose output is the description text corresponding to that image. The model may be trained on a large number of images and their corresponding description texts.
In the present embodiment, the above text multimodal representation method is applied to the image captioning task: the input image to be described is obtained and image recognition is performed on it to obtain the corresponding image vector; that image vector is input into the image description model to obtain the first description word, which is determined as the description text; the multimodal vector corresponding to the description text is then obtained using the above text multimodal representation method, and that vector together with the image vector corresponding to the image to be described is input into the image description model to obtain the second description word; the first and second description words are combined into the description text, and the process repeats until the image description model has output all description words. This improves the accuracy of the obtained description text, reduces the training cost of the image description model, and improves the execution accuracy and efficiency of the image captioning task.
Fig. 10 is a structural diagram of another text multimodal representation device provided by an embodiment of the present invention. The text multimodal representation device includes:
a memory 1001, a processor 1002, and a computer program stored on the memory 1001 and runnable on the processor 1002.
When executing the program, the processor 1002 implements the text multimodal representation method provided in the above embodiments.
Further, the text multimodal representation device further includes:
a communication interface 1003 for communication between the memory 1001 and the processor 1002.
The memory 1001 is configured to store a computer program runnable on the processor 1002.
The memory 1001 may include high-speed RAM, and may also include non-volatile memory, such as at least one magnetic disk memory.
The processor 1002 is configured to implement, when executing the program, the text multimodal representation method described in the above embodiments.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, the communication interface 1003, the memory 1001, and the processor 1002 may be connected to each other through a bus and complete communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in Fig. 10, but this does not mean there is only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on one chip, they may complete communication with each other through an internal interface.
The processor 1002 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the text multimodal representation method described above is implemented.
The present invention also provides a computer program product; when the instructions in the computer program product are executed by a processor, the text multimodal representation method described above is implemented.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in conjunction with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the features of different embodiments or examples described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. A feature defined with "first" or "second" may thus explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, such as two, three, and so on, unless otherwise specifically defined.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for realizing a custom logic function or step of the process, and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for realizing logic functions, may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, device, or apparatus (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, device, or apparatus and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, device, or apparatus. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wirings, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optical scanning of the paper or other medium followed by editing, interpretation, or other suitable processing if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be realized in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be realized by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if realized in hardware, as in another embodiment, any one of the following techniques known in the art, or a combination thereof, may be used: a discrete logic circuit with logic gate circuits for realizing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those skilled in the art will understand that all or part of the steps carried by the above embodiment methods may be completed by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium and which, when executed, includes one of the steps of the method embodiment or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The above integrated module may be realized in the form of hardware or in the form of a software function module. If the integrated module is realized in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be understood as limiting the present invention; those skilled in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (13)

1. A text multimodal representation method, characterized by comprising:
obtaining a text to be processed and identifying the text, to obtain a text object set corresponding to the text and a text vector corresponding to each text object in the text object set;
for each text object in the text object set, obtaining an image set relevant to the text object;
determining, according to an image vector corresponding to each image in the image set, an image vector corresponding to the text object;
determining, according to the text vector and the image vector corresponding to the text object, a multimodal vector corresponding to the text object;
determining, according to the multimodal vector corresponding to each text object in the text object set, a multimodal vector corresponding to the text.
2. The method according to claim 1, characterized in that the obtaining, for each text object in the text object set, an image set relevant to the text object comprises:
obtaining associated images of the text object;
aggregating the associated images to obtain an image set corresponding to each sense of the text object;
determining, according to the text object and the text, a current sense of the text object;
determining the image set corresponding to the current sense as the image set relevant to the text object.
3. The method according to claim 1, characterized in that the text is an input question text; the text object set further comprises: candidate answers corresponding to the question text;
the method further comprises:
obtaining an input image corresponding to the question text;
performing image recognition on the input image to obtain an image vector corresponding to the input image;
fusing and classifying the image vector corresponding to the input image and the multimodal vector corresponding to the question text, to determine an answer corresponding to the question text.
4. The method according to claim 3, characterized in that the fusing and classifying the image vector corresponding to the input image and the multimodal vector corresponding to the question text to determine an answer corresponding to the question text comprises:
inputting the image vector corresponding to the input image and the multimodal vector corresponding to the question text into a preset classification model, and obtaining a probability of each candidate answer output by the classification model;
determining a candidate answer whose probability meets a preset condition as the answer corresponding to the question text.
5. The method according to claim 1, wherein the text is a description text composed of the description words output by an image description model;
before the obtaining of the text to be processed, the method further comprises:
obtaining an input image to be described;
performing image recognition on the image to be described, to obtain an image vector corresponding to the image to be described;
inputting the image vector corresponding to the image to be described into the image description model, to obtain a first description word output by the image description model; and
determining the first description word as the description text;
and after the determining of the multi-modal vector corresponding to the text according to the multi-modal vector corresponding to each text object in the text object set, the method further comprises:
inputting the multi-modal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, to obtain a second description word output by the image description model, and integrating the first description word and the second description word to obtain the description text, until the image description model has output all description words.
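Claim 5 describes a word-by-word captioning loop: the first word is produced from the image vector alone, and each subsequent word is produced from the image vector plus the multi-modal vector of the description so far. The loop can be sketched as below; the model and embedding are toy stand-ins (all names are illustrative), since the claim does not specify the image description model itself.

```python
def describe(image_vec, model, embed):
    """Iteratively build the description text, as in claim 5."""
    words = []
    # The first description word comes from the image vector alone.
    word = model(image_vec, None)
    while word is not None:
        words.append(word)
        # Multi-modal vector of the partial description drives the next step.
        desc_vec = embed(words)
        word = model(image_vec, desc_vec)
    return " ".join(words)

# Toy stand-ins so the loop is runnable:
VOCAB = ["a", "dog", "on", "grass"]

def toy_model(image_vec, desc_vec):
    # Emits the next vocabulary word; None signals that all words are output.
    i = 0 if desc_vec is None else len(desc_vec)
    return VOCAB[i] if i < len(VOCAB) else None

def toy_embed(words):
    # Placeholder for the multi-modal vector of the partial description.
    return [hash(w) % 97 for w in words]

caption = describe([0.1, 0.2], toy_model, toy_embed)
```

A real image description model would be a learned decoder (e.g. an RNN or Transformer) that emits an end-of-sequence token instead of `None`.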
6. A text multi-modal representation apparatus, comprising:
an obtaining module, configured to obtain a text to be processed and identify the text, so as to obtain a text object set corresponding to the text and a text vector corresponding to each text object in the text object set;
wherein the obtaining module is further configured to obtain, for each text object in the text object set, an image set related to the text object;
a determining module, configured to determine an image vector corresponding to the text object according to an image vector corresponding to each image in the image set;
wherein the determining module is further configured to determine a multi-modal vector corresponding to the text object according to the text vector and the image vector corresponding to the text object; and
the determining module is further configured to determine a multi-modal vector corresponding to the text according to the multi-modal vector corresponding to each text object in the text object set.
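The core representation pipeline of claim 6 can be sketched end to end. The claims leave open how the per-image vectors are reduced to one image vector, how the two modalities are combined, and how the per-object vectors are pooled into a text-level vector; the sketch below assumes averaging for both reductions and concatenation for the modality combination, purely for illustration.

```python
import numpy as np

def object_image_vector(image_vectors):
    # The object's image vector is determined from the vectors of every
    # image in its related image set; averaging is one simple choice.
    return np.mean(image_vectors, axis=0)

def multimodal_vector(text_vec, image_vec):
    # Combine the two modalities; concatenation keeps both intact.
    return np.concatenate([text_vec, image_vec])

def text_multimodal_vector(object_vectors):
    # Text-level multi-modal vector pooled over the per-object vectors.
    return np.mean(object_vectors, axis=0)

# Two text objects, each with a text vector and a related image set.
text_vecs = [np.ones(4), np.zeros(4)]
image_sets = [[np.full(4, 2.0), np.full(4, 4.0)], [np.full(4, 6.0)]]

obj_vecs = [
    multimodal_vector(t, object_image_vector(imgs))
    for t, imgs in zip(text_vecs, image_sets)
]
text_vec = text_multimodal_vector(obj_vecs)
```

The resulting `text_vec` carries both the textual and the visual signal for every object in the text, which is what the downstream question-answering and captioning claims consume.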
7. The apparatus according to claim 6, wherein the obtaining module is specifically configured to:
obtain associated images of the text object;
aggregate the associated images to obtain an image set corresponding to each word sense of the text object;
determine a current word sense of the text object according to the text object and the text; and
determine the image set corresponding to the current word sense as the image set related to the text object.
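Claim 7's sense-aware image selection can be sketched as follows. The per-sense image sets and the keyword-based disambiguation are illustrative stand-ins (real systems would cluster images and disambiguate with a learned model); every name below is an assumption, not part of the claim.

```python
def related_image_set(text_object, context, sense_images, sense_keywords):
    """Pick the image set for the text object's current word sense.

    sense_images:   word sense -> aggregated image set (one set per sense)
    sense_keywords: word sense -> context words that signal that sense
    """
    for sense, keywords in sense_keywords.items():
        if any(k in context for k in keywords):
            return sense_images[sense]
    # Fall back to an arbitrary sense if the context is uninformative.
    return next(iter(sense_images.values()))

# Toy data: "apple" has a fruit sense and a company sense.
sense_images = {
    "fruit": ["apple_fruit_1.jpg", "apple_fruit_2.jpg"],
    "company": ["apple_logo.jpg"],
}
sense_keywords = {"fruit": ["eat", "tree"], "company": ["iPhone", "stock"]}

images = related_image_set(
    "apple", "the stock of Apple rose", sense_images, sense_keywords
)
```

Here the word "stock" in the surrounding text selects the company sense, so only the company-sense images contribute to the object's image vector.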
8. The apparatus according to claim 6, wherein the text is an input question text, and the text object set further comprises candidate answers corresponding to the question text;
the apparatus further comprising a first image recognition module and a fusion-classification module, wherein:
the obtaining module is further configured to obtain an input image corresponding to the question text;
the first image recognition module is configured to perform image recognition on the input image, to obtain an image vector corresponding to the input image; and
the fusion-classification module is configured to fuse and classify the image vector corresponding to the input image and the multi-modal vector corresponding to the question text, so as to determine an answer corresponding to the question text.
9. The apparatus according to claim 8, wherein the fusion-classification module is specifically configured to:
input the image vector corresponding to the input image and the multi-modal vector corresponding to the question text into a preset classification model, to obtain a probability of each candidate answer output by the classification model; and
determine a candidate answer whose probability satisfies a preset condition as the answer corresponding to the question text.
10. The apparatus according to claim 6, wherein the text is a description text composed of the description words output by an image description model;
the apparatus further comprising a second image recognition module and an input module, wherein:
the obtaining module is further configured to obtain an input image to be described;
the second image recognition module is configured to perform image recognition on the image to be described, to obtain an image vector corresponding to the image to be described;
the input module is configured to input the image vector corresponding to the image to be described into the image description model, to obtain a first description word output by the image description model;
the determining module is further configured to determine the first description word as the description text; and
the input module is further configured to input the multi-modal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, to obtain a second description word output by the image description model, and to integrate the first description word and the second description word to obtain the description text, until the image description model has output all description words.
11. A text multi-modal representation apparatus, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the text multi-modal representation method according to any one of claims 1 to 5.
12. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the text multi-modal representation method according to any one of claims 1 to 5.
13. A computer program product, wherein when instructions in the computer program product are executed by a processor, the text multi-modal representation method according to any one of claims 1 to 5 is implemented.
CN201811230363.9A 2018-10-22 2018-10-22 Text multi-modal representation method and device Active CN109359196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811230363.9A CN109359196B (en) 2018-10-22 2018-10-22 Text multi-modal representation method and device

Publications (2)

Publication Number Publication Date
CN109359196A true CN109359196A (en) 2019-02-19
CN109359196B CN109359196B (en) 2020-11-17

Family

ID=65346011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811230363.9A Active CN109359196B (en) 2018-10-22 2018-10-22 Text multi-modal representation method and device

Country Status (1)

Country Link
CN (1) CN109359196B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9720934B1 (en) * 2014-03-13 2017-08-01 A9.Com, Inc. Object recognition of feature-sparse or texture-limited subject matter
CN107076567A (en) * 2015-05-21 2017-08-18 百度(美国)有限责任公司 Multilingual image question and answer
CN108536735A (en) * 2018-03-05 2018-09-14 中国科学院自动化研究所 Multi-modal lexical representation method and system based on multichannel self-encoding encoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI DEZHI: "A Visual Grammar Analysis of Multimodality in Advertising Hypertext", Foreign Language Research (《外语学刊》) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362823A (en) * 2019-06-21 2019-10-22 北京百度网讯科技有限公司 Training method and device for description text generation model
CN110362823B (en) * 2019-06-21 2023-07-28 北京百度网讯科技有限公司 Training method and device for descriptive text generation model
CN113139121A (en) * 2020-01-20 2021-07-20 阿里巴巴集团控股有限公司 Query method, model training method, device, equipment and storage medium
CN111581335A (en) * 2020-05-14 2020-08-25 腾讯科技(深圳)有限公司 Text representation method and device
CN111581335B (en) * 2020-05-14 2023-11-24 腾讯科技(深圳)有限公司 Text representation method and device
WO2022033208A1 (en) * 2020-08-12 2022-02-17 腾讯科技(深圳)有限公司 Visual dialogue method and apparatus, model training method and apparatus, electronic device, and computer readable storage medium
CN112819052A (en) * 2021-01-25 2021-05-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN113177115A (en) * 2021-06-30 2021-07-27 中移(上海)信息通信科技有限公司 Conversation content processing method and device and related equipment
CN113177115B (en) * 2021-06-30 2021-10-26 中移(上海)信息通信科技有限公司 Conversation content processing method and device and related equipment
CN116778011A (en) * 2023-05-22 2023-09-19 阿里巴巴(中国)有限公司 Image generating method

Also Published As

Publication number Publication date
CN109359196B (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN109359196A (en) Text multi-modal representation method and device
Ilievski et al. A focused dynamic attention model for visual question answering
Reed et al. Learning deep representations of fine-grained visual descriptions
CN106649825B (en) Voice interaction system and creation method and device thereof
CN109858555A (en) Image-based data processing method, device, equipment and readable storage medium
CN108595410A (en) Automatic correction method and device for handwritten compositions
CN109871828A (en) Video recognition method, recognition device and storage medium
CN107679033A (en) Text sentence-break position recognition method and device
CN107832432A (en) Search result ranking method, device, server and storage medium
CN110444199A (en) Voice keyword recognition method, device, terminal and server
CN110210021A (en) Reading comprehension method and device
CN107680589A (en) Voice information interaction method, device and equipment
CN109255126A (en) Article recommendation method and device
CN110188350A (en) Text coherence calculation method and device
EP3937076A1 (en) Activity detection device, activity detection system, and activity detection method
CN113780486B (en) Visual question answering method, device and medium
CN108182246A (en) Sensitive word detection and filtering method, device and computer equipment
CN108681541A (en) Image search method, device and computer equipment
CN105975557A (en) Topic search method and device applied to electronic equipment
CN110598763A (en) Image recognition method and device and terminal equipment
CN112560506A (en) Text semantic parsing method and device, terminal equipment and storage medium
CN106169065A (en) Information processing method and electronic equipment
CN115344805A (en) Material auditing method, computing device and storage medium
CN110046340A (en) Training method and device of text classification model
CN108985289A (en) Garbled text detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant