CN109359196A - Text multimodal representation method and device - Google Patents
Text multimodal representation method and device
- Publication number
- CN109359196A (application number CN201811230363.9)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- vector
- text object
- description
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present invention proposes a text multimodal representation method and device. The method includes: obtaining a text to be processed and recognizing it, to obtain a text object set corresponding to the text and a text vector corresponding to each text object; for each text object, obtaining an image set related to that text object; determining an image vector for the text object according to the image vectors of the images in the set; and determining a multimodal vector for the text object according to its text vector and image vector, and thereby the multimodal vector for the text. The text is thus represented by a text vector and an image vector simultaneously, which matches multimodal tasks. Because the text has a multimodal representation, the fusion-classification model or image description model in a multimodal task can be trained with fewer images and texts while still ensuring a certain accuracy, which reduces training cost and improves the execution accuracy and efficiency of the multimodal task.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a text multimodal representation method and device.
Background art
A multimodal task is one that carries out human-computer interaction through multiple channels such as text, speech, video, motion, and environment, simulating the way people interact with each other. In a current multimodal task such as visual question answering (Visual Question Answering, VQA), the input image and question text are first obtained; the image vector corresponding to the image and the text vector corresponding to the question text are computed, then fused and classified to determine the answer corresponding to the question text. As another example, in an image captioning task (Image Caption, IC), the input image and its image vector are first obtained; the image vector is input into an image description model to obtain the first output word; the text vector of the first word and the image vector are then input into the image description model to obtain the second word; the text vectors of the first and second words and the image vector are input next, and so on, until the full description sentence is obtained.
In both of these multimodal tasks, the vector representations of the image and the question text are unimodal: the image is represented only by an image vector and the text only by a text vector, which does not match the multimodal nature of the task. Because of these unimodal representations, the fusion-classification model and the image description model need a large number of images and texts during training to ensure a certain accuracy, which raises training cost and lowers the execution accuracy and efficiency of the multimodal task.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a text multimodal representation method, to solve the problem of poor execution accuracy and efficiency of multimodal tasks in the prior art.
A second object of the present invention is to propose a text multimodal representation device.
A third object of the present invention is to propose another text multimodal representation device.
A fourth object of the present invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the present invention is to propose a computer program product.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a text multimodal representation method, comprising:
obtaining a text to be processed and recognizing the text, to obtain a text object set corresponding to the text and a text vector corresponding to each text object in the text object set;
for each text object in the text object set, obtaining an image set related to the text object;
determining an image vector corresponding to the text object according to the image vector of each image in the image set;
determining a multimodal vector corresponding to the text object according to the text vector and the image vector of the text object;
determining a multimodal vector corresponding to the text according to the multimodal vector of each text object in the text object set.
Further, obtaining, for each text object in the text object set, an image set related to the text object comprises:
obtaining related images of the text object;
clustering the related images, to obtain an image set corresponding to each word sense of the text object;
determining the current word sense of the text object according to the text object and the text;
determining the image set corresponding to the current word sense as the image set related to the text object.
Further, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text;
the method further comprises:
obtaining an input image corresponding to the question text;
performing image recognition on the input image, to obtain an image vector corresponding to the input image;
fusing and classifying the image vector of the input image and the multimodal vector of the question text, to determine an answer corresponding to the question text.
Further, fusing and classifying the image vector of the input image and the multimodal vector of the question text to determine the answer corresponding to the question text comprises:
inputting the image vector of the input image and the multimodal vector of the question text into a preset classification model, to obtain the probability of each candidate answer output by the classification model;
determining a candidate answer whose probability meets a preset condition as the answer corresponding to the question text.
Further, the text is a description text composed of the description words output by an image description model;
before obtaining the text to be processed, the method further comprises:
obtaining an input image to be described;
performing image recognition on the image to be described, to obtain an image vector corresponding to it;
inputting the image vector of the image to be described into the image description model, to obtain the first description word output by the model;
determining the first description word as the description text;
and after determining the multimodal vector corresponding to the text according to the multimodal vector of each text object in the text object set, the method further comprises:
inputting the multimodal vector of the description text and the image vector of the image to be described into the image description model, obtaining the second description word output by the model, and integrating the first description word with the second description word to obtain the description text, and so on until the image description model has output all description words.
According to the text multimodal representation method of the embodiment of the present invention, a text to be processed is obtained and recognized, to obtain the corresponding text object set and the text vector of each text object in the set; for each text object, a related image set is obtained; an image vector for the text object is determined from the image vectors of the images in the set; a multimodal vector for the text object is determined from its text vector and image vector; and the multimodal vector for the text is determined from the multimodal vectors of the text objects in the set. The text is thus represented by a text vector and an image vector simultaneously, matching multimodal tasks, and because of this multimodal representation the fusion-classification model or image description model in a multimodal task can be trained with fewer images and texts while still ensuring a certain accuracy, thereby reducing training cost and improving the execution accuracy and efficiency of the multimodal task.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a text multimodal representation device, comprising:
an obtaining module, configured to obtain a text to be processed and recognize the text, to obtain a text object set corresponding to the text and a text vector corresponding to each text object in the text object set;
the obtaining module is further configured to obtain, for each text object in the text object set, an image set related to the text object;
a determining module, configured to determine an image vector corresponding to the text object according to the image vector of each image in the image set;
the determining module is further configured to determine a multimodal vector corresponding to the text object according to the text vector and the image vector of the text object;
the determining module is further configured to determine a multimodal vector corresponding to the text according to the multimodal vector of each text object in the text object set.
Further, the obtaining module is specifically configured to:
obtain related images of the text object;
cluster the related images, to obtain an image set corresponding to each word sense of the text object;
determine the current word sense of the text object according to the text object and the text;
determine the image set corresponding to the current word sense as the image set related to the text object.
Further, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text;
the device further comprises a first image recognition module and a fusion-classification module;
the obtaining module is further configured to obtain an input image corresponding to the question text;
the first image recognition module is configured to perform image recognition on the input image, to obtain an image vector corresponding to the input image;
the fusion-classification module is configured to fuse and classify the image vector of the input image and the multimodal vector of the question text, to determine an answer corresponding to the question text.
Further, the fusion-classification module is specifically configured to:
input the image vector of the input image and the multimodal vector of the question text into a preset classification model, to obtain the probability of each candidate answer output by the classification model;
determine a candidate answer whose probability meets a preset condition as the answer corresponding to the question text.
Further, the text is a description text composed of the description words output by an image description model;
the device further comprises a second image recognition module and an input module;
the obtaining module is further configured to obtain an input image to be described;
the second image recognition module is configured to perform image recognition on the image to be described, to obtain an image vector corresponding to it;
the input module is configured to input the image vector of the image to be described into the image description model, to obtain the first description word output by the model;
the determining module is further configured to determine the first description word as the description text;
the input module is further configured to input the multimodal vector of the description text and the image vector of the image to be described into the image description model, to obtain the second description word output by the model, and to integrate the first description word with the second description word to obtain the description text, and so on until the image description model has output all description words.
According to the text multimodal representation device of the embodiment of the present invention, a text to be processed is obtained and recognized, to obtain the corresponding text object set and the text vector of each text object in the set; for each text object, a related image set is obtained; an image vector for the text object is determined from the image vectors of the images in the set; a multimodal vector for the text object is determined from its text vector and image vector; and the multimodal vector for the text is determined from the multimodal vectors of the text objects in the set. The text is thus represented by a text vector and an image vector simultaneously, matching multimodal tasks, and because of this multimodal representation the fusion-classification model or image description model in a multimodal task can be trained with fewer images and texts while still ensuring a certain accuracy, thereby reducing training cost and improving the execution accuracy and efficiency of the multimodal task.
To achieve the above objects, an embodiment of the third aspect of the present invention proposes another text multimodal representation device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the text multimodal representation method described above.
To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the text multimodal representation method described above.
To achieve the above objects, an embodiment of the fifth aspect of the present invention proposes a computer program product; when the instructions in the computer program product are executed by a processor, the text multimodal representation method described above is implemented.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and will in part become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a text multimodal representation method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another text multimodal representation method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a visual question answering task;
Fig. 4 is a schematic diagram of the execution of a visual question answering task;
Fig. 5 is a schematic flowchart of another text multimodal representation method provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of an image captioning task;
Fig. 7 is a schematic structural diagram of a text multimodal representation device provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another text multimodal representation device provided by an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of another text multimodal representation device provided by an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of another text multimodal representation device provided by an embodiment of the present invention.
Specific embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numbers throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and are not to be construed as limiting it.
The text multimodal representation method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a text multimodal representation method provided by an embodiment of the present invention. As shown in Fig. 1, the text multimodal representation method comprises the following steps:
S101: obtain a text to be processed, recognize the text, and obtain a text object set corresponding to the text and a text vector corresponding to each text object in the text object set.
The executing subject of the text multimodal representation method provided by the present invention is a text multimodal representation device, which may be a hardware device such as a terminal device or a server, or software installed on a hardware device. The text to be processed may be a single word, multiple words, or a sentence composed of multiple words. The text objects may be all the words in the text, the words with concrete meaning, or the words in the text relevant to the current multimodal task. In this embodiment, the process by which the text multimodal representation device recognizes the text may specifically be: segmenting the text to obtain at least one word contained in it; selecting from these words those relevant to the current multimodal task and combining them to obtain the text object set corresponding to the text; and, for each text object in the set, inputting the text object and the text into a preset word vector model, obtaining the vector output by the model, and determining that vector as the text vector corresponding to the text object. The word vector model may be, for example, a convolutional neural network model (Convolutional Neural Networks, CNN) or a bag-of-words model (Bag-of-words model, BOW), and may be trained with a large amount of text.
In this embodiment, when the multimodal task is a VQA task, the text to be processed may be a question text input by the user, either entered directly or obtained by performing speech recognition on the user's spoken question. When the multimodal task is an IC task, the text to be processed may be the single word, multiple words, or description text composed of multiple words output by the image description model.
S102: for each text object in the text object set, obtain an image set related to the text object.
In this embodiment, the process by which the text multimodal representation device performs step S102 may specifically be: for each text object in the text object set, sending a search request carrying the text object to a search engine, so that the search engine searches for multiple images related to the text object; the multiple images are then combined to obtain the image set related to the text object. The search engine may be, for example, Flickr, ImageNet, or a similar image source.
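Step S102 amounts to one query per text object. The sketch below uses a hypothetical `search_images` function with a hard-coded fake index in place of a real Flickr- or ImageNet-style client (the patent does not specify an API, so everything inside it is an illustrative assumption):

```python
def search_images(query, limit=5):
    # Hypothetical stand-in for an image search engine; a real client
    # would issue a network query here and return image references.
    fake_index = {
        "shirt": ["shirt_01.jpg", "shirt_02.jpg"],
        "batter": ["batter_01.jpg", "batter_02.jpg", "batter_03.jpg"],
    }
    return fake_index.get(query, [])[:limit]

def image_sets(text_object_set):
    # One image set per text object, as in step S102.
    return {obj: search_images(obj) for obj in text_object_set}

sets = image_sets(["shirt", "batter"])
```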
Further, on the basis of the above embodiment, since some words have multiple senses that differ considerably — for example, when the word "batter" is used as a noun, its sense can be a baseball batter or a cooking paste — the process by which the text multimodal representation device performs step S102 may also specifically be: obtaining related images of the text object; clustering the related images to obtain an image set corresponding to each sense of the text object; determining the current sense of the text object according to the text object and the text; and determining the image set corresponding to the current sense as the image set related to the text object.
In this embodiment, the process by which the text multimodal representation device clusters the related images may specifically be: computing the similarity between any two related images, and aggregating into one group any two images whose similarity is greater than a preset threshold, thereby obtaining the image set corresponding to each sense. In addition, the process by which the device determines the current sense of the text object according to the text object and the text may be, for example: inputting each sense of the text object together with the text into a preset sense model to obtain a probability for each sense, and determining the sense whose probability is greater than a preset value as the current sense of the text object. The sense model may be, for example, a CNN model, and may be trained with a large number of texts in which the current sense of each text object is labeled.
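The threshold-based grouping described above can be sketched as a greedy single-link clustering over image feature vectors. The toy 2-dimensional vectors and cosine similarity are assumptions for illustration; the patent only requires a pairwise similarity and a preset threshold.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_by_similarity(vectors, threshold=0.9):
    # Greedy single-link grouping: an image joins the first cluster that
    # already contains an image more similar than `threshold`; otherwise
    # it starts a new cluster (one cluster per candidate word sense).
    clusters = []
    for v in vectors:
        for cluster in clusters:
            if any(cosine(v, u) > threshold for u in cluster):
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters

# Two baseball-player-like vectors and one paste-like vector (toy data).
imgs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.0, 1.0])]
sense_sets = cluster_by_similarity(imgs, threshold=0.9)
```

The two similar vectors land in one cluster and the dissimilar one in another, mirroring the "batter (baseball)" versus "batter (paste)" example.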
S103: determine the image vector corresponding to the text object according to the image vector of each image in the image set.
In this embodiment, the image vectors of the images in an image set generally have the same dimension. The image vector corresponding to the text object may be determined by, for example, computing a weighted sum of the image vectors of the images, or by summing them and taking the average.
S104: determine the multimodal vector corresponding to the text object according to the text vector and the image vector of the text object.
In this embodiment, the text multimodal representation device may determine the multimodal vector of the text object by, for example, concatenating the text vector and the image vector of the text object. The dimension of the concatenated vector is the sum of the dimension of the text vector and the dimension of the image vector.
S105: determine the multimodal vector corresponding to the text according to the multimodal vector of each text object in the text object set.
In this embodiment, the multimodal vector of the text may be a vector matrix composed of the multimodal vectors of the text objects, or a vector obtained by concatenating the multimodal vectors of the text objects.
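Steps S103–S105 can be sketched together in a few lines of numpy. Averaging the image vectors, concatenation, and the matrix form are each one of the options the embodiment names, chosen here for illustration; the dimensions are toy values.

```python
import numpy as np

def object_image_vector(image_vectors):
    # S103: average the image vectors of the images in the object's image set.
    return np.mean(image_vectors, axis=0)

def multimodal_vector(text_vec, image_vec):
    # S104: concatenate; dimension = len(text_vec) + len(image_vec).
    return np.concatenate([text_vec, image_vec])

def text_multimodal_matrix(per_object_vectors):
    # S105: stack each object's multimodal vector into a vector matrix.
    return np.stack(per_object_vectors)

# Toy data: 3-dim text vectors, 2-dim image vectors, two text objects.
objs = {
    "shirt":  (np.array([1.0, 0.0, 0.0]),
               [np.array([0.2, 0.4]), np.array([0.4, 0.6])]),
    "batter": (np.array([0.0, 1.0, 0.0]),
               [np.array([0.8, 0.1])]),
}
mm = [multimodal_vector(t, object_image_vector(ims)) for t, ims in objs.values()]
text_mm = text_multimodal_matrix(mm)   # shape (2, 5): one row per text object
```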
According to the text multimodal representation method of the embodiment of the present invention, a text to be processed is obtained and recognized, to obtain the corresponding text object set and the text vector of each text object in the set; for each text object, a related image set is obtained; an image vector for the text object is determined from the image vectors of the images in the set; a multimodal vector for the text object is determined from its text vector and image vector; and the multimodal vector for the text is determined from the multimodal vectors of the text objects in the set. The text is thus represented by a text vector and an image vector simultaneously, matching multimodal tasks, and because of this multimodal representation the fusion-classification model in a multimodal task can be trained with fewer images and texts while still ensuring a certain accuracy, thereby reducing training cost and improving the execution accuracy and efficiency of the multimodal task.
Fig. 2 is a schematic flowchart of another text multimodal representation method provided by an embodiment of the present invention. As shown in Fig. 2, on the basis of the embodiment shown in Fig. 1, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text.
Correspondingly, the method further comprises the following steps:
S106: obtain an input image corresponding to the question text.
In this embodiment, the multimodal task may specifically be the visual question answering task VQA. In this task, the input may be a question text and an input image corresponding to it. Fig. 3 is a schematic diagram of a visual question answering task. In Fig. 3, the input image shows a batter hitting a baseball, and the question text is "What color shirt is the batter wearing?". The words in the question text relevant to the visual question answering task are "shirt" and "batter", and the candidate answers corresponding to the question text may be, for example, "blue", "red", and so on. The text object set corresponding to this question text may therefore be, for example, {shirt, batter, blue, red, ...}.
Correspondingly, in the visual question answering task, since the candidate answers of a question text generally have only a single sense rather than multiple senses, for each candidate answer there is no need to cluster its related images after they are obtained; the related images can be combined directly to obtain the image set related to the candidate answer. Therefore, the process by which the text multimodal representation device obtains, for each text object in the text object set, the image set related to the text object may specifically be: for each text object, judging whether the text object is a word in the question text; if so, obtaining the related images of the text object, clustering them to obtain an image set for each sense, determining the current sense of the text object according to the text object and the text, and determining the image set of the current sense as the image set related to the text object; if not, obtaining the related images of the text object and combining them to obtain the image set related to the text object.
S107: perform image recognition on the input image to obtain the image vector corresponding to the input image.
In this embodiment, the process by which the text multimodal representation device performs step S107 may specifically be: inputting the input image into a preset image recognition model, obtaining the image vector output by the model, and determining that vector as the image vector corresponding to the input image. The image recognition model is trained with a large number of images, and may be, for example, a convolutional neural network model (Convolutional Neural Networks, CNN) or a deep residual network (Deep Residual Network, ResNet).
S108: fuse and classify the image vector of the input image and the multimodal vector of the question text, and determine the answer corresponding to the question text.
In this embodiment, the process by which the text multimodal representation device performs step S108 may specifically be: inputting the image vector of the input image and the multimodal vector of the question text into a preset classification model, to obtain the probability of each candidate answer output by the model; and determining a candidate answer whose probability meets a preset condition as the answer corresponding to the question text. The classification model may be trained with a large number of images, questions, and corresponding answers. The preset condition may be, for example, a probability threshold.
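The fuse-and-classify step can be sketched as concatenation followed by a linear layer and a softmax over the candidate answers. The random weights stand in for a trained classification model, and the choice of concatenation as the fusion operator is an assumption, since the patent does not fix it.

```python
import numpy as np

rng = np.random.default_rng(0)
CANDIDATES = ["blue", "red", "white"]

def fuse_and_classify(image_vec, question_mm, W, b):
    # Fusion by concatenation, then a linear layer and a softmax giving
    # one probability per candidate answer.
    fused = np.concatenate([image_vec, question_mm.ravel()])
    logits = W @ fused + b
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

image_vec = rng.normal(size=4)               # stand-in for a CNN/ResNet feature
question_mm = rng.normal(size=(2, 5))        # one multimodal vector per text object
W = rng.normal(size=(len(CANDIDATES), 4 + 10))
b = np.zeros(len(CANDIDATES))

probs = fuse_and_classify(image_vec, question_mm, W, b)
answer = CANDIDATES[int(np.argmax(probs))]   # answer with the highest probability
```

Taking the arg-max corresponds to the "probability meets a preset condition" rule with the threshold realized as "largest probability wins".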
Fig. 4 is a schematic diagram of the execution of a visual question answering task. In Fig. 4, image recognition is performed on the input image to obtain its image vector; the question text is processed to obtain its multimodal vector; the image vector of the input image and the multimodal vector of the question text are fused and classified, and the answer corresponding to the question text is determined to be "red". The image set related to the text object "shirt" may contain two images, such as the two short-sleeved T-shirts in Fig. 4; the image set related to the text object "batter" may contain three images, such as the three batter images in Fig. 4.
In this embodiment, the above text multimodal representation method is applied to the visual question answering task: the question text and the corresponding input image are obtained; the text object set is determined from the words in the question text relevant to the task together with the candidate answers; the multimodal vector of the question text is then obtained with the above method; and the image vector of the input image and the multimodal vector of the question text are fused and classified to determine the answer corresponding to the question text. This improves the accuracy of the obtained answer, reduces the training cost of the classification model, and improves the execution accuracy and efficiency of the visual question answering task.
Fig. 5 is a flow diagram of another text multi-modal representation method provided by an embodiment of the present invention. As shown in Fig. 5, building on the embodiment shown in Fig. 1, the text is a description text composed of the description words output by an image description model.
Correspondingly, before step 101, the method further includes the following steps:
S109: obtaining an input image to be described.
In this embodiment, the multi-modal task may specifically be the image captioning task (IC). In this task, the input may be an image to be described. Fig. 6 is a schematic diagram of the image captioning task. In Fig. 6, the input image shows elephants walking, and the corresponding description text is "Two elephants and a baby elephant walking together".
S110: performing image recognition on the image to be described to obtain the image vector corresponding to the image to be described.
In this embodiment, the process by which the text multi-modal representation device executes step S110 may specifically be: inputting the image to be described into a preset image recognition model, obtaining the image vector output by the image recognition model, and determining that vector as the image vector corresponding to the image to be described. The image recognition model is trained using a large number of images and may be, for example, a convolutional neural network (CNN) or a deep residual network (ResNet).
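The image-to-vector interface of step S110 can be illustrated with a toy feature extractor. A real embodiment would use a trained CNN or ResNet; the 2x2 average-pooling "extractor" below over a grayscale grid is a hypothetical stand-in that only shows the shape of the mapping from image to fixed-length vector.

```python
def extract_image_vector(image):
    """image: 2D list of pixel intensities; returns a 4-dim pooled feature vector.

    Each output component is the mean intensity of one image quadrant --
    a toy stand-in for the features a trained CNN/ResNet would produce.
    """
    h, w = len(image), len(image[0])
    quads = []
    for r0, r1 in [(0, h // 2), (h // 2, h)]:
        for c0, c1 in [(0, w // 2), (w // 2, w)]:
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            quads.append(sum(block) / len(block))
    return quads

image = [[0, 0, 10, 10],
         [0, 0, 10, 10],
         [5, 5,  0,  0],
         [5, 5,  0,  0]]
print(extract_image_vector(image))  # [0.0, 10.0, 5.0, 0.0]
```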
S111: inputting the image vector corresponding to the image to be described into the image description model, and obtaining the first description word output by the image description model.
In this embodiment, the image description model may be, for example, an end-to-end neural network model. Its input may be an image, and its output may be the description text corresponding to the image. The model may be trained using a large number of images and their corresponding description texts.
S112: determining the first description word as the description text.
Correspondingly, when the description text is the first description word, after step 105 the method may further include the following step:
S113: inputting the multi-modal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, obtaining the second description word output by the image description model, integrating the first description word and the second description word into the description text, and repeating steps 101-105 until the image description model has output all description words.
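The S111-S113 loop can be sketched as follows. The sketch is hypothetical: `describe_step` stands in for the image description model and simply walks a canned word list (emitting `None` once all words are produced), and `multimodal_of` stands in for steps 101-105, which would compute the multi-modal vector of the description text so far.

```python
CANNED = ["Two", "elephants", "walking"]  # hypothetical model output

def describe_step(image_vec, multimodal_vec, step):
    """Stand-in for the image description model's next-word output."""
    return CANNED[step] if step < len(CANNED) else None

def multimodal_of(words):
    """Stand-in for steps 101-105: represent the text so far as a vector."""
    return [float(len(w)) for w in words]

def caption(image_vec):
    words, step = [], 0
    while True:
        word = describe_step(image_vec, multimodal_of(words), step)
        if word is None:        # the model has output all description words
            break
        words.append(word)      # integrate the new word into the description text
        step += 1
    return " ".join(words)

print(caption([0.3, 0.7]))  # Two elephants walking
```

The key property the loop preserves is that each new word is generated from both the image vector and the multi-modal representation of everything generated so far.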
In this embodiment, the above text multi-modal representation method is applied to the image captioning task: an input image to be described is obtained, and image recognition is performed on it to obtain its image vector; the image vector is input into the image description model to obtain the first description word, which is determined as the description text; the multi-modal vector corresponding to the description text is then obtained using the above text multi-modal representation method, and this multi-modal vector together with the image vector of the image to be described is input into the image description model to obtain the second description word; the first and second description words are integrated into the description text, and the process repeats until the image description model has output all description words. This improves the accuracy of the obtained description text, reduces the training cost of the image description model, and improves the execution accuracy and execution efficiency of the image captioning task.
Fig. 7 is a structural schematic diagram of a text multi-modal representation device provided by an embodiment of the present invention. As shown in Fig. 7, the device includes an acquisition module 71 and a determination module 72.
The acquisition module 71 is configured to obtain a text to be processed, identify the text, and obtain the text object set corresponding to the text and the text vector corresponding to each text object in the text object set.
The acquisition module 71 is further configured to obtain, for each text object in the text object set, an image set related to the text object.
The determination module 72 is configured to determine, according to the image vector corresponding to each image in the image set, an image vector corresponding to the text object.
The determination module 72 is further configured to determine, according to the text vector and the image vector corresponding to the text object, the multi-modal vector corresponding to the text object.
The determination module 72 is further configured to determine, according to the multi-modal vector corresponding to each text object in the text object set, the multi-modal vector corresponding to the text.
The text multi-modal representation device provided by the present invention may be a hardware device such as a terminal device or a server, or software installed on a hardware device. The text to be processed may be a single word, multiple words, or a sentence composed of multiple words. In this embodiment, the process by which the device identifies the text may specifically be: segmenting the text to obtain at least one word contained in it; selecting, from the at least one word, the words relevant to the current multi-modal task and combining them to obtain the text object set corresponding to the text; and, for each text object in the text object set, inputting the text object and the text into a preset term-vector model, obtaining the vector output by the model, and determining that vector as the text vector corresponding to the text object. The term-vector model may be, for example, a convolutional neural network (CNN) or a bag-of-words (BOW) model, and may be trained using a large amount of text.
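The term-vector step can be illustrated with a toy bag-of-words model. The vocabulary, tokenization, and resulting vectors below are all hypothetical; the patent only requires that some preset term-vector model (e.g. a CNN or BOW model) map a text object to a vector.

```python
# Hypothetical five-term vocabulary for the running VQA example.
VOCAB = ["what", "color", "shirt", "batter", "wearing"]

def text_vector(text_object):
    """Map a text object to its count vector over the fixed vocabulary."""
    tokens = text_object.lower().split()
    return [tokens.count(term) for term in VOCAB]

print(text_vector("shirt"))       # [0, 0, 1, 0, 0]
print(text_vector("What color"))  # [1, 1, 0, 0, 0]
```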
In this embodiment, when the multi-modal task is a VQA task, the text to be processed may be a question text input by the user; the question text may be entered directly by the user, or obtained by performing speech recognition on a question voice input by the user. When the multi-modal task is an IC task, the text to be processed may be a single word, multiple words, or a description text composed of multiple words output by the image description model.
Further, on the basis of the above embodiments, some words have multiple senses whose meanings differ considerably; for example, when the word "batter" is used as a noun, its sense may be a ballplayer or a paste. Therefore, the acquisition module 71 may specifically be configured to: obtain the related images of the text object; aggregate the related images to obtain an image set corresponding to each sense of the text object; determine the current sense of the text object according to the text object and the text; and determine the image set corresponding to the current sense as the image set related to the text object.
In this embodiment, the process by which the acquisition module 71 aggregates the related images may specifically be: calculating the similarity between any two related images, and grouping together any two related images whose similarity is greater than a preset threshold, thereby obtaining the image set corresponding to each sense. In addition, the process by which the acquisition module 71 determines the current sense of the text object according to the text object and the text may be, for example: the text multi-modal representation device inputs each sense of the text object and the text into a preset sense model, obtains the probability of each sense, and determines the sense whose probability is greater than a preset value as the current sense of the text object. The sense model may be a CNN model or the like, and may be trained using a large number of texts and the current sense of each text object in those texts.
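The per-sense aggregation can be sketched with a greedy grouping over image feature vectors: an image joins the first existing group whose representative it resembles beyond the preset threshold, otherwise it starts a new group. Cosine similarity, the threshold value, and the feature vectors are illustrative choices; the patent only requires grouping images whose pairwise similarity exceeds a preset threshold.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def aggregate(image_vectors, threshold=0.9):
    """Greedily group images; each group approximates one sense's image set."""
    groups = []
    for vec in image_vectors:
        for group in groups:
            if cosine(vec, group[0]) > threshold:  # compare to group representative
                group.append(vec)
                break
        else:
            groups.append([vec])  # no similar group: start a new sense cluster
    return groups

# Two "ballplayer"-like vectors and one "paste"-like vector for the word "batter":
images = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
print(len(aggregate(images)))  # 2
```

With these made-up vectors the first two images cluster together and the third forms its own group, mirroring the two senses of "batter".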
The text multi-modal representation device of the embodiment of the present invention obtains a text to be processed and identifies it to obtain the text object set corresponding to the text and the text vector corresponding to each text object in the set; for each text object in the set, it obtains an image set related to the text object; according to the image vector corresponding to each image in the image set, it determines the image vector corresponding to the text object; according to the text vector and image vector corresponding to the text object, it determines the multi-modal vector corresponding to the text object; and according to the multi-modal vector corresponding to each text object in the set, it determines the multi-modal vector corresponding to the text. The text can thus be represented by text vectors and image vectors simultaneously, matching the multi-modal task; and because of the multi-modal representation of the text, the classification model in the multi-modal task can be trained with fewer images and texts while still guaranteeing a certain accuracy, thereby reducing training cost and improving the execution accuracy and execution efficiency of the multi-modal task.
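The core combination can be sketched as follows. The patent does not fix the combination operation; taking the text object's image vector as the element-wise mean of its image set and concatenating it with the text vector is one plausible reading, shown here with made-up two-dimensional vectors.

```python
def mean_image_vector(image_vectors):
    """Element-wise mean of the image set's vectors: the text object's image vector."""
    n = len(image_vectors)
    return [sum(col) / n for col in zip(*image_vectors)]

def multimodal_vector(text_vec, image_vectors):
    """Combine text vector and image vector; concatenation is an assumed choice."""
    return text_vec + mean_image_vector(image_vectors)

text_vec = [0.5, 0.5]                    # hypothetical text vector
image_set = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical image set vectors
print(multimodal_vector(text_vec, image_set))  # [0.5, 0.5, 0.5, 0.5]
```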
With reference to Fig. 8, on the basis of the embodiment shown in Fig. 7, the text is an input question text, and the text object set further includes candidate answers corresponding to the question text.
Correspondingly, the device may further include a first image recognition module 73 and a fusion-classification module 74.
The acquisition module 71 is further configured to obtain an input image corresponding to the question text.
The first image recognition module 73 is configured to perform image recognition on the input image to obtain the image vector corresponding to the input image.
The fusion-classification module 74 is configured to fuse and classify the image vector corresponding to the input image and the multi-modal vector corresponding to the question text, and determine the answer corresponding to the question text.
In this embodiment, the multi-modal task may specifically be the visual question answering task (VQA). In this task, the input may be a question text and an input image corresponding to the question text. Fig. 3 is a schematic diagram of the visual question answering task. In Fig. 3, the input image shows a batter playing baseball, and the question text is "What color shirt is the batter wearing?". The words in the question text relevant to the visual question answering task are "shirt" and "batter", and the candidate answers corresponding to the question text may be, for example, "blue", "red", etc. Therefore, the text object set corresponding to the question text may be, for example, {shirt, batter, blue, red, ...}.
Correspondingly, in the visual question answering task, the candidate answers corresponding to the question text generally have a single sense rather than multiple senses; for each candidate answer, after its related images are obtained, they do not need to be aggregated and can be integrated directly to obtain the image set related to that candidate answer. Therefore, for each text object in the text object set, the process by which the acquisition module 71 obtains the image set related to the text object may specifically be: judging whether the text object is a word in the question text; if so, obtaining the related images of the text object, aggregating them to obtain an image set corresponding to each sense of the text object, determining the current sense of the text object according to the text object and the text, and determining the image set corresponding to the current sense as the image set related to the text object; if not, obtaining the related images of the text object and integrating them directly to obtain the image set related to the text object.
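This branch can be sketched as follows. Everything here is a hypothetical stand-in: `related_images` is a toy lookup in place of real image retrieval, and the returned flags merely mark which path (per-sense aggregation vs. direct integration) a real embodiment would take.

```python
def related_images(text_object):
    """Toy lookup standing in for related-image retrieval (hypothetical data)."""
    return {"batter": ["batter_1.jpg", "batter_2.jpg"],
            "red": ["red_1.jpg"]}.get(text_object, [])

def image_set_for(text_object, question_words):
    imgs = related_images(text_object)
    if text_object in question_words:
        # Question word: images would be aggregated per sense and the
        # current sense's image set kept.
        return {"images": imgs, "aggregated": True}
    # Candidate answer: single sense assumed, so integrate images directly.
    return {"images": imgs, "aggregated": False}

print(image_set_for("batter", {"shirt", "batter"})["aggregated"])  # True
print(image_set_for("red", {"shirt", "batter"})["aggregated"])     # False
```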
In this embodiment, the process by which the first image recognition module 73 performs image recognition on the input image to obtain its image vector may specifically be: inputting the input image into a preset image recognition model, obtaining the image vector output by the model, and determining that vector as the image vector corresponding to the input image. The image recognition model is trained using a large number of images and may be, for example, a convolutional neural network (CNN) or a deep residual network (ResNet).
In this embodiment, the process by which the fusion-classification module 74 determines the answer corresponding to the question text may specifically be: inputting the image vector corresponding to the input image and the multi-modal vector corresponding to the question text into a preset classification model, and obtaining the probability of each candidate answer output by the classification model; the candidate answer whose probability meets a preset condition is determined as the answer corresponding to the question text. The classification model may be trained using a large number of images, questions, and corresponding answers; the preset condition may be, for example, a preset probability threshold.
In this embodiment, the above text multi-modal representation method is applied to the visual question answering task: a question text and an input image corresponding to the question text are obtained; a text object set is determined from the words in the question text that are relevant to the visual question answering task, together with the candidate answers; the multi-modal vector corresponding to the question text is then obtained using the above text multi-modal representation method; and the image vector corresponding to the input image and the multi-modal vector corresponding to the question text are fused and classified to determine the answer corresponding to the question text. This improves the accuracy of the obtained answer, reduces the training cost of the classification model, and improves the execution accuracy and execution efficiency of the visual question answering task.
With reference to Fig. 9, on the basis of the embodiment shown in Fig. 7, the text is a description text composed of the description words output by an image description model.
Correspondingly, the device may further include a second image recognition module 75 and an input module 76.
The acquisition module 71 is further configured to obtain an input image to be described.
The second image recognition module 75 is configured to perform image recognition on the image to be described to obtain the image vector corresponding to the image to be described.
The input module 76 is configured to input the image vector corresponding to the image to be described into the image description model, and obtain the first description word output by the image description model.
The determination module 72 is further configured to determine the first description word as the description text.
The input module 76 is further configured to input the multi-modal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, obtain the second description word output by the image description model, and integrate the first description word and the second description word into the description text, until the image description model has output all description words.
In this embodiment, the multi-modal task may specifically be the image captioning task (IC). In this task, the input may be an image to be described. Fig. 6 is a schematic diagram of the image captioning task. In Fig. 6, the input image shows elephants walking, and the corresponding description text is "Two elephants and a baby elephant walking together".
In this embodiment, the process by which the second image recognition module 75 obtains the image vector corresponding to the image to be described may specifically be: inputting the image to be described into a preset image recognition model, obtaining the image vector output by the model, and determining that vector as the image vector corresponding to the image to be described. The image recognition model is trained using a large number of images and may be, for example, a convolutional neural network (CNN) or a deep residual network (ResNet).
In this embodiment, the image description model may be, for example, an end-to-end neural network model. Its input may be an image, and its output may be the description text corresponding to the image. The model may be trained using a large number of images and their corresponding description texts.
In this embodiment, the above text multi-modal representation method is applied to the image captioning task: an input image to be described is obtained, and image recognition is performed on it to obtain its image vector; the image vector is input into the image description model to obtain the first description word, which is determined as the description text; the multi-modal vector corresponding to the description text is then obtained using the above text multi-modal representation method, and this multi-modal vector together with the image vector of the image to be described is input into the image description model to obtain the second description word; the first and second description words are integrated into the description text, and the process repeats until the image description model has output all description words. This improves the accuracy of the obtained description text, reduces the training cost of the image description model, and improves the execution accuracy and execution efficiency of the image captioning task.
Fig. 10 is a structural schematic diagram of another text multi-modal representation device provided by an embodiment of the present invention. The text multi-modal representation device includes:
a memory 1001, a processor 1002, and a computer program stored in the memory 1001 and executable on the processor 1002.
When executing the program, the processor 1002 implements the text multi-modal representation method provided in the above embodiments.
Further, the text multi-modal representation device further includes:
a communication interface 1003, for communication between the memory 1001 and the processor 1002; and
the memory 1001, for storing the computer program executable on the processor 1002.
The memory 1001 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory.
The processor 1002 is configured to implement, when executing the program, the text multi-modal representation method described in the above embodiments.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, they may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in Fig. 10, which does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on one chip, they may communicate with each other through internal interfaces.
The processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the text multi-modal representation method described above is implemented.
The present invention also provides a computer program product; when instructions in the computer program product are executed by a processor, the text multi-modal representation method described above is implemented.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine and integrate the features of the different embodiments or examples described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, such as two, three, etc., unless otherwise specifically defined.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing custom logic functions or steps of a process; and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, according to the functions involved, as should be understood by those skilled in the art to which embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logic functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic apparatus) having one or more wirings, a portable computer diskette (magnetic apparatus), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic apparatus, and a portable compact disk read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented by hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented by hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), etc.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments may be completed by instructing relevant hardware through a program, which may be stored in a computer-readable storage medium; when executed, the program performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (13)
1. A text multi-modal representation method, characterized by comprising:
obtaining a text to be processed, identifying the text, and obtaining a text object set corresponding to the text and a text vector corresponding to each text object in the text object set;
for each text object in the text object set, obtaining an image set related to the text object;
determining, according to an image vector corresponding to each image in the image set, an image vector corresponding to the text object;
determining, according to the text vector and the image vector corresponding to the text object, a multi-modal vector corresponding to the text object;
determining, according to the multi-modal vector corresponding to each text object in the text object set, a multi-modal vector corresponding to the text.
2. The method according to claim 1, characterized in that, for each text object in the text object set, obtaining the image set related to the text object comprises:
obtaining related images of the text object;
aggregating the related images to obtain an image set corresponding to each sense of the text object;
determining, according to the text object and the text, a current sense of the text object;
determining the image set corresponding to the current sense as the image set related to the text object.
3. The method according to claim 1, characterized in that the text is an input question text; the text object set further comprises: candidate answers corresponding to the question text;
the method further comprises:
obtaining an input image corresponding to the question text;
performing image recognition on the input image to obtain an image vector corresponding to the input image;
fusing and classifying the image vector corresponding to the input image and the multi-modal vector corresponding to the question text, and determining an answer corresponding to the question text.
4. The method according to claim 3, characterized in that fusing and classifying the image vector corresponding to the input image and the multi-modal vector corresponding to the question text, and determining the answer corresponding to the question text, comprises:
inputting the image vector corresponding to the input image and the multi-modal vector corresponding to the question text into a preset classification model, and obtaining a probability of each candidate answer output by the classification model;
determining a candidate answer whose probability meets a preset condition as the answer corresponding to the question text.
5. The method according to claim 1, characterized in that the text is a description text composed of description words output by an image description model;
before obtaining the text to be processed, the method further comprises:
obtaining an input image to be described;
performing image recognition on the image to be described to obtain an image vector corresponding to the image to be described;
inputting the image vector corresponding to the image to be described into the image description model, and obtaining a first description word output by the image description model;
determining the first description word as the description text;
after determining, according to the multi-modal vector corresponding to each text object in the text object set, the multi-modal vector corresponding to the text, the method further comprises:
inputting the multi-modal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, obtaining a second description word output by the image description model, and integrating the first description word and the second description word into the description text, until the image description model has output all description words.
6. A text multi-modal representation apparatus, comprising:
An acquisition module, configured to acquire a text to be processed and identify the text, obtaining a text object set corresponding to the text and a text vector corresponding to each text object in the text object set;
The acquisition module is further configured to acquire, for each text object in the text object set, an image set related to the text object;
A determination module, configured to determine an image vector corresponding to the text object according to the image vector corresponding to each image in the image set;
The determination module is further configured to determine a multi-modal vector corresponding to the text object according to the text vector and the image vector corresponding to the text object;
The determination module is further configured to determine a multi-modal vector corresponding to the text according to the multi-modal vector corresponding to each text object in the text object set.
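A minimal sketch of the determination module's two fusion steps follows. Averaging and concatenation are assumptions: the claims only require deriving one image vector from the object's image set, combining it with the text vector, and then combining the per-object vectors into a text-level vector, without fixing the operators.

```python
import numpy as np

def object_image_vector(image_vecs):
    # One image vector per text object: here, the mean over its image set.
    return np.mean(np.stack(image_vecs), axis=0)

def object_multimodal_vector(text_vec, image_vecs):
    # Multi-modal vector of a text object: its text vector joined with the
    # aggregated image vector (concatenation assumed).
    return np.concatenate([text_vec, object_image_vector(image_vecs)])

def text_multimodal_vector(object_mm_vecs):
    # Multi-modal vector of the whole text from its objects' vectors
    # (mean pooling assumed).
    return np.mean(np.stack(object_mm_vecs), axis=0)
```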
7. The apparatus according to claim 6, wherein the acquisition module is specifically configured to:
Acquire associated images of the text object;
Aggregate the associated images to obtain an image set corresponding to each dictionary sense of the text object;
Determine a current sense of the text object according to the text object and the text;
Determine the image set corresponding to the current sense as the image set related to the text object.
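The sense-based selection of claim 7 can be sketched as below. The helpers `sense_of_image` and `current_sense_of` are hypothetical stand-ins for the aggregation and word-sense-disambiguation steps, which the claim leaves unspecified.

```python
def related_image_set(text_object, text, associated_images, sense_of_image, current_sense_of):
    # Aggregate the object's associated images into one set per dictionary sense.
    images_by_sense = {}
    for img in associated_images:
        images_by_sense.setdefault(sense_of_image(img), []).append(img)
    # Determine the sense the object takes in this particular text ...
    sense = current_sense_of(text_object, text)
    # ... and return that sense's image set as the one related to the object.
    return images_by_sense.get(sense, [])
```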
8. The apparatus according to claim 6, wherein the text is an input question text, and the text object set further comprises candidate answers corresponding to the question text;
The apparatus further comprises a first image recognition module and a fusion classification module;
The acquisition module is further configured to acquire an input image corresponding to the question text;
The first image recognition module is configured to perform image recognition on the input image to obtain an image vector corresponding to the input image;
The fusion classification module is configured to fuse and classify the image vector corresponding to the input image and the multi-modal vector corresponding to the question text, to determine an answer corresponding to the question text.
9. The apparatus according to claim 8, wherein the fusion classification module is specifically configured to:
Input the image vector corresponding to the input image and the multi-modal vector corresponding to the question text into a preset classification model, and obtain a probability of each candidate answer output by the classification model;
Determine a candidate answer whose probability meets a preset condition as the answer corresponding to the question text.
10. The apparatus according to claim 6, wherein the text is a description text composed of description words output by an image description model;
The apparatus further comprises a second image recognition module and an input module;
The acquisition module is further configured to acquire an input image to be described;
The second image recognition module is configured to perform image recognition on the image to be described, to obtain an image vector corresponding to the image to be described;
The input module is configured to input the image vector corresponding to the image to be described into the image description model, and obtain a first description word output by the image description model;
The determination module is further configured to determine the first description word as the description text;
The input module is further configured to input the multi-modal vector corresponding to the description text and the image vector corresponding to the image to be described into the image description model, obtain a second description word output by the image description model, and integrate the first description word and the second description word to obtain the description text, until the image description model has output all description words.
11. A text multi-modal representation device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the text multi-modal representation method according to any one of claims 1 to 5.
12. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the text multi-modal representation method according to any one of claims 1 to 5.
13. A computer program product, wherein when instructions in the computer program product are executed by a processor, the text multi-modal representation method according to any one of claims 1 to 5 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811230363.9A CN109359196B (en) | 2018-10-22 | 2018-10-22 | Text multi-modal representation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109359196A true CN109359196A (en) | 2019-02-19 |
CN109359196B CN109359196B (en) | 2020-11-17 |
Family
ID=65346011
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811230363.9A Active CN109359196B (en) | 2018-10-22 | 2018-10-22 | Text multi-modal representation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109359196B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362823A (en) * | 2019-06-21 | 2019-10-22 | 北京百度网讯科技有限公司 | The training method and device of text generation model are described |
CN111581335A (en) * | 2020-05-14 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Text representation method and device |
CN112819052A (en) * | 2021-01-25 | 2021-05-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
CN113139121A (en) * | 2020-01-20 | 2021-07-20 | 阿里巴巴集团控股有限公司 | Query method, model training method, device, equipment and storage medium |
CN113177115A (en) * | 2021-06-30 | 2021-07-27 | 中移(上海)信息通信科技有限公司 | Conversation content processing method and device and related equipment |
WO2022033208A1 (en) * | 2020-08-12 | 2022-02-17 | 腾讯科技(深圳)有限公司 | Visual dialogue method and apparatus, model training method and apparatus, electronic device, and computer readable storage medium |
CN116778011A (en) * | 2023-05-22 | 2023-09-19 | 阿里巴巴(中国)有限公司 | Image generating method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9720934B1 (en) * | 2014-03-13 | 2017-08-01 | A9.Com, Inc. | Object recognition of feature-sparse or texture-limited subject matter |
CN107076567A (en) * | 2015-05-21 | 2017-08-18 | 百度(美国)有限责任公司 | Multilingual image question and answer |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
Non-Patent Citations (1)
Title |
---|
LI DEZHI: "Visual Grammar Analysis of Multimodality in Advertising Hypertexts", Foreign Language Research (《外语学刊》) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362823A (en) * | 2019-06-21 | 2019-10-22 | 北京百度网讯科技有限公司 | The training method and device of text generation model are described |
CN110362823B (en) * | 2019-06-21 | 2023-07-28 | 北京百度网讯科技有限公司 | Training method and device for descriptive text generation model |
CN113139121A (en) * | 2020-01-20 | 2021-07-20 | 阿里巴巴集团控股有限公司 | Query method, model training method, device, equipment and storage medium |
CN111581335A (en) * | 2020-05-14 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Text representation method and device |
CN111581335B (en) * | 2020-05-14 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Text representation method and device |
WO2022033208A1 (en) * | 2020-08-12 | 2022-02-17 | 腾讯科技(深圳)有限公司 | Visual dialogue method and apparatus, model training method and apparatus, electronic device, and computer readable storage medium |
CN112819052A (en) * | 2021-01-25 | 2021-05-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
CN113177115A (en) * | 2021-06-30 | 2021-07-27 | 中移(上海)信息通信科技有限公司 | Conversation content processing method and device and related equipment |
CN113177115B (en) * | 2021-06-30 | 2021-10-26 | 中移(上海)信息通信科技有限公司 | Conversation content processing method and device and related equipment |
CN116778011A (en) * | 2023-05-22 | 2023-09-19 | 阿里巴巴(中国)有限公司 | Image generating method |
Also Published As
Publication number | Publication date |
---|---|
CN109359196B (en) | 2020-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109359196A (en) | Text Multimodal presentation method and device | |
Ilievski et al. | A focused dynamic attention model for visual question answering | |
Reed et al. | Learning deep representations of fine-grained visual descriptions | |
CN106649825B (en) | Voice interaction system and creation method and device thereof | |
CN109858555A (en) | Data processing method, device, equipment and readable storage medium storing program for executing based on image | |
CN108595410A (en) | The automatic of hand-written composition corrects method and device | |
CN109871828A (en) | Video frequency identifying method and identification device, storage medium | |
CN107679033A (en) | Text punctuate location recognition method and device | |
CN107832432A (en) | A kind of search result ordering method, device, server and storage medium | |
CN110444199A (en) | A kind of voice keyword recognition method, device, terminal and server | |
CN110210021A (en) | Read understanding method and device | |
CN107680589A (en) | Voice messaging exchange method, device and its equipment | |
CN109255126A (en) | Article recommended method and device | |
CN110188350A (en) | Text coherence calculation method and device | |
EP3937076A1 (en) | Activity detection device, activity detection system, and activity detection method | |
CN113780486B (en) | Visual question answering method, device and medium | |
CN108182246A (en) | Sensitive word detection filter method, device and computer equipment | |
CN108681541A (en) | Image searching method, device and computer equipment | |
CN105975557A (en) | Topic searching method and device applied to electric equipment | |
CN110598763A (en) | Image identification method and device and terminal equipment | |
CN112560506A (en) | Text semantic parsing method and device, terminal equipment and storage medium | |
CN106169065A (en) | A kind of information processing method and electronic equipment | |
CN115344805A (en) | Material auditing method, computing equipment and storage medium | |
CN110046340A (en) | The training method and device of textual classification model | |
CN108985289A (en) | Messy code detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||