WO2022033208A1 - Visual dialogue method, model training method, apparatus, electronic device and computer-readable storage medium - Google Patents

Visual dialogue method, model training method, apparatus, electronic device and computer-readable storage medium

Info

Publication number
WO2022033208A1
WO2022033208A1 PCT/CN2021/102815 CN2021102815W WO2022033208A1 WO 2022033208 A1 WO2022033208 A1 WO 2022033208A1 CN 2021102815 W CN2021102815 W CN 2021102815W WO 2022033208 A1 WO2022033208 A1 WO 2022033208A1
Authority
WO
WIPO (PCT)
Prior art keywords
answer
feature
vector
question
dialogue
Prior art date
Application number
PCT/CN2021/102815
Other languages
English (en)
French (fr)
Inventor
陈飞龙
孟凡东
李鹏
周杰
徐波
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2022033208A1
Priority to US17/989,613 (published as US20230082605A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the field of visual dialogue, and in particular to a visual dialogue method, a model training method, an apparatus, an electronic device, and a computer-readable storage medium.
  • Visual dialogue refers to holding a meaningful dialogue with humans, in conversational natural language, about visual content (such as images).
  • the output answer to the current input question is usually obtained based on the input image, the current input question, the previous round of historical question-and-answer dialogue, and the working state vector of the previous moment.
  • as a result, the accuracy of the output answer is relatively low.
  • the embodiments of the present application provide a visual dialogue method, a model training method, an apparatus, an electronic device, and a computer-readable storage medium.
  • the embodiment of the present application provides a visual dialogue method, the method includes:
  • Multi-modal decoding processing is performed on the state vector corresponding to the current round of questioning and the image feature to obtain an actual output answer corresponding to the current round of questioning.
  • An embodiment of the present application provides a training method for a visual dialogue model, the method comprising:
  • the visual dialogue model is trained to obtain a trained visual dialogue model.
  • the embodiment of the present application provides a visual dialogue device, and the device includes:
  • the first acquisition module is configured to acquire the image feature of the input image and the state vector corresponding to the previous n rounds of historical question-and-answer dialogues, where n is a positive integer;
  • the first obtaining module is configured to obtain the question characteristics of the current round of questions
  • a first feature encoding module configured to perform multimodal encoding processing on the image feature, the state vector corresponding to the previous n rounds of historical question-and-answer dialogues, and the question feature, to obtain a state vector corresponding to the current round of questioning;
  • the first feature decoding module is configured to perform multi-modal decoding processing on the state vector corresponding to the current round of questioning and the image feature to obtain an actual output answer corresponding to the current round of questioning.
  • the embodiment of the present application provides a training device for a visual dialogue model, and the device includes:
  • the second obtaining module is configured to obtain the image feature samples of the input image samples and the state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples, where s is a positive integer;
  • the second obtaining module is configured to obtain the question feature samples of the current round of questioning samples and the first answer features of the real answers corresponding to the current round of questioning samples;
  • the second feature encoding module is configured to call the visual dialogue model to perform multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples, and the question feature samples, to obtain the state vector sample corresponding to the current round of questioning samples;
  • the second feature decoding module is configured to call the visual dialogue model to perform multi-modal decoding processing on the state vector sample corresponding to the current round of questioning samples, the image feature sample, and the first answer feature, to obtain the second answer feature of the actual output answer sample corresponding to the current round of questioning samples;
  • the training module is configured to train the visual dialogue model according to the first answer feature and the second answer feature to obtain a trained visual dialogue model.
  • An embodiment of the present application provides an electronic device. The electronic device includes a processor and a memory, and the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the visual dialogue method and the visual dialogue model training method described above.
  • An embodiment of the present application provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the visual dialogue method and the visual dialogue model training method described above.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instructions from the computer-readable storage medium, and when the processor executes the computer instructions, the electronic device executes the above-mentioned visual dialogue method and visual dialogue model training method .
  • FIG. 1 is a frame diagram of a visual dialogue system provided by an exemplary embodiment of the present application
  • FIG. 2 is a flowchart of a visual dialogue method provided by an exemplary embodiment of the present application.
  • FIG. 3 is a structural framework diagram of a visual dialogue model provided by an exemplary embodiment of the present application.
  • FIG. 4 is a flowchart of a visual dialogue method provided by another exemplary embodiment of the present application.
  • FIG. 5 is a structural frame diagram of a visual dialogue model provided by another exemplary embodiment of the present application.
  • FIG. 6 is a structural frame diagram of a multi-modal incremental conversion encoder provided by an exemplary embodiment of the present application.
  • FIG. 7 is a structural frame diagram of a multimodal incremental transcoder provided by another exemplary embodiment of the present application.
  • FIG. 8 is a structural framework diagram of a multi-modal incremental conversion decoder provided by an exemplary embodiment of the present application.
  • FIG. 9 is a structural frame diagram of a multi-modal incremental conversion decoder provided by another exemplary embodiment of the present application.
  • FIG. 10 is a flowchart of a training method for a visual dialogue model provided by an exemplary embodiment of the present application.
  • FIG. 11 is a structural block diagram of a visual dialogue device provided by an exemplary embodiment of the present application.
  • FIG. 12 is a structural block diagram of an apparatus for training a visual dialogue model provided by an exemplary embodiment of the present application.
  • FIG. 13 is a schematic diagram of an apparatus structure of a server provided by an exemplary embodiment of the present application.
  • Computer vision is a science that studies how to make machines "see". It refers to using cameras and computers instead of human eyes to perform machine vision processing such as target recognition, tracking, and measurement, and to further perform graphics processing so that the result becomes an image more suitable for human observation or for transmission to an instrument for detection.
  • As a scientific discipline concerned with theories and technologies related to computer vision research, computer vision technology is used to establish artificial intelligence systems that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, three-dimensional (3D) technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
  • OCR Optical Character Recognition
  • the input image is processed by the computer vision technology, and the answer is output according to the input question, wherein the input question is a question related to the input image.
  • VQA Visual Question Answering
  • NLP Natural Language Processing
  • In VQA, an image and a free-form, open-ended natural-language question about the image are input into the electronic device, and the output is a generated natural-language answer.
  • the electronic device obtains the content of the image, the meaning and intention of the question, and related common sense information, and outputs a reasonable answer that conforms to natural language rules according to the input image and question.
  • Visual dialogue is an extension of VQA, and its main task is to hold a meaningful dialogue with humans, in conversational natural language, about visual content. That is, given an image, a dialogue history, and a question about the image, the electronic device grounds the question in the image, infers context from the dialogue history, and answers the question accurately. Unlike VQA, visual dialogue handles a multi-turn dialogue history through an encoder that can combine multiple information sources.
  • AI Artificial Intelligence
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, and drones. It is believed that, with the development of technology, artificial intelligence technology will be applied in more fields and deliver increasingly important value.
  • the visual dialogue will be implemented based on artificial intelligence technology.
  • the visual dialogue model trained by the method provided in the embodiments of the present application can be applied in shopping applications, group-purchase applications, travel management applications (such as ticket booking applications and hotel booking applications), and other applications.
  • the above applications are provided with an intelligent customer service, and the user can obtain answers to the questions that need to be solved by having a dialogue with the intelligent customer service.
  • Intelligent customer service is implemented through a visual dialogue model built in the application's backend server, which is pre-trained.
  • when the visual dialogue model receives a question input by the user, it outputs an answer to the question.
  • a smart customer service is a customer service of a shopping application.
  • the question asked by the user is about the item A in the input image.
  • the question is: Where is the store that sells the item A?
  • the intelligent customer service outputs the answer according to the user's question: the stores selling item A are store 1, store 3, and store 10.
  • the user can browse the corresponding store interface according to the output answer.
  • the visual dialogue model trained by the visual dialogue method provided in the embodiment of the present application can be applied to smart devices such as smart terminals or smart homes.
  • for the virtual assistant set in the smart terminal, the virtual assistant is realized by a trained visual dialogue model, and the visual dialogue model is pre-trained.
  • when the visual dialogue model receives a question input by the user, it outputs an answer to the question.
  • user A posts an update on a social platform (the update includes an image)
  • the image is a photo of user A on vacation at the seaside
  • the virtual assistant reminds user B (user B and user A are friends) that user A has posted a new photo
  • User B asks the virtual assistant a question: What's in the photo?
  • the virtual assistant outputs the answer: User A is playing at the beach.
  • user B can choose whether to enter user A's social platform interface to browse photos.
  • the above only takes two application scenarios as examples for description.
  • the visual dialogue method provided in the embodiment of the present application can also be applied to other scenarios that require visual dialogue (for example, a scene of explaining pictures for persons with impaired vision, etc.).
  • the embodiments do not limit specific application scenarios.
  • the visual dialogue method and the training method of the visual dialogue model provided by the embodiments of the present application can be applied to electronic devices with strong data processing capabilities.
  • the visual dialogue method and the training method of the visual dialogue model provided by the embodiments of the present application can be applied to a personal computer, a workstation, or a server; that is, the visual dialogue and the training of the visual dialogue model can be realized through the personal computer, workstation, or server.
  • as for the trained visual dialogue model, it can be implemented as a part of an application program and installed in the terminal, so that when the terminal receives a question related to the input image, it can output the answer corresponding to the question; or, the trained visual dialogue model is set in the background server of the application program, so that the terminal with the application program installed can realize the function of visual dialogue with the user by means of the background server.
  • FIG. 1 shows a schematic diagram of a visual dialogue system provided by an exemplary embodiment of the present application.
  • the visual dialogue system 100 includes an electronic device 110 and a server 120, where data communication between the electronic device 110 and the server 120 is performed through a communication network; optionally, the communication network may be a wired network or a wireless network, and it may be at least one of a local area network, a metropolitan area network, and a wide area network.
  • An application program supporting a visual dialogue function is installed in the electronic device 110, and the application program may be a virtual reality (VR) application, an augmented reality (AR) application, a game application, a photo album application, a social application, or the like, which is not limited in this embodiment of the present application.
  • the electronic device 110 may be a mobile terminal such as a smart phone, a smart watch, a tablet computer, a laptop, a smart robot, a vehicle-mounted device, etc., or a terminal such as a desktop computer, a projection computer, and a smart TV.
  • the embodiment of the present application does not limit the type of the electronic device.
  • the server 120 may be implemented as one server, or as a server cluster formed by a group of servers, and may be implemented as a physical server or as a cloud server. In a possible implementation, the server 120 is a background server of the application program in the electronic device 110 .
  • the electronic device 110 runs a chat application, and the user can obtain the information in the input image by chatting with the chat assistant of the chat application.
  • the input image 11 is an image pre-input into the server 120 by the electronic device 110 , or the input image 11 is an image pre-stored in the server 120 .
  • the user inputs a question related to the input image in the chat interface of the chat assistant, and the electronic device 110 sends the question to the server 120; the server 120 is provided with the trained visual dialogue model 10, and based on the input image, the trained visual dialogue model 10 outputs the answer to the question and sends the answer to the electronic device 110, where the chat assistant's answer to the question is displayed.
  • the user asks the question: Is the girl sitting in the car?
  • based on the previous rounds of historical question-and-answer dialogues (question: how many people are in the image? answer: 4 people), the trained visual dialogue model 10 determines that the question asked by the user concerns the gender of the person located in the car in the input image; since the gender of that person is male, the model outputs the answer: No.
  • the state vectors 12 corresponding to the previous n rounds of historical question-and-answer dialogues (n is a positive integer) are pre-stored in the server 120; the trained visual dialogue model 10 obtains the image feature 111 of the input image 11 and the question feature of the current round of questioning, and outputs the state vector 14 corresponding to the current round of questioning in combination with the state vectors 12 corresponding to the previous n rounds of historical question-and-answer dialogues.
  • the trained visual dialogue model 10 obtains the x+1th string in the output answer 16 according to the image feature 111 of the input image 11, the state vector 14 corresponding to the current round of questions, and the feature 15 of the first x strings that have been output , where x is a positive integer.
  • the server 120 may pre-store previous n rounds of historical question-and-answer dialogues, and the visual dialogue model extracts corresponding state vectors from the previous n rounds of historical question-and-answer dialogues.
  • the visual dialogue model needs to be trained by combining the image feature samples of the input image samples, the state vector samples corresponding to the current round of questioning samples, and the answer features of the real answers corresponding to the current round of questioning samples.
  • the real answer of the current round of questioning samples includes 5 words (strings)
  • when the visual dialogue model outputs the answer, it outputs the actual output answer sample of each round of questioning according to the rule of outputting one word at a time.
  • when the visual dialogue model outputs the third word, it combines the first word and the second word in the real answer with the state vector corresponding to the current round of questioning to output the third word, and the visual dialogue model is trained based on the difference between the real answer and the actual output answer sample.
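  • As a non-limiting illustration of this training rule, the following minimal PyTorch sketch feeds the decoder the preceding words of the real answer at every position (teacher forcing) and measures the difference between the real answer and the actual output with cross-entropy; the function and argument names (`teacher_forcing_step`, `model`, etc.) are assumptions, not the patent's API:

```python
import torch.nn.functional as F

def teacher_forcing_step(model, image_feat, state_vec, real_answer_ids):
    """One hypothetical training step for a visual dialogue model.

    At each position the decoder sees the words of the *real* answer that precede
    that position, together with the state vector of the current round and the
    image feature, and must predict the next word.
    """
    inputs = real_answer_ids[:, :-1]      # words already "output" (taken from the real answer)
    targets = real_answer_ids[:, 1:]      # the word to predict at each position
    logits = model(image_feat, state_vec, inputs)            # (batch, len - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```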
  • the following description takes, as an example, the case where the training method of the visual dialogue model and the visual dialogue method are executed by the server.
  • FIG. 2 shows a flowchart of a visual dialogue method provided by an exemplary embodiment of the present application.
  • the embodiment of the present application is described by taking the visual dialogue method used for the server 120 in the visual dialogue system 100 as shown in FIG. 1 as an example, and the visual dialogue method includes the following steps:
  • Step 201 Obtain image features of the input image and state vectors corresponding to the previous n rounds of historical question-and-answer dialogues, where n is a positive integer.
  • the server extracts features from the input image to obtain the image features of the input image; the state vector corresponding to the previous n rounds of historical question-and-answer dialogues is the output of the previous round, so the server can obtain the state vectors corresponding to the previous n rounds of historical question-and-answer dialogues from the output of the previous round.
  • a visual dialogue model is constructed in the server, and the visual dialogue model has been trained, that is, it is a trained visual dialogue model; an input image is obtained through the visual dialogue model, and the input image may be an image pre-stored by the server, an image uploaded by the user to the server through the terminal (including at least one of an image stored in the terminal and an image captured by the terminal), or an image in an existing image set.
  • the embodiment of this application does not limit the type of the image.
  • the visual dialogue model extracts image features from the input image.
  • the visual dialogue model includes a feature extraction model, and image features are extracted from the input image through the feature extraction model.
  • a round of historical question-and-answer dialogue means that the user asks a question, and the visual dialogue model outputs an answer to the question.
  • n rounds of historical question-and-answer dialogues are historical question-and-answer dialogues about the same input image.
  • the server establishes a corresponding relationship between n rounds of historical question and answer dialogues about the same input image and the input image.
  • when answering a question about the input image, the visual dialogue model obtains the first n rounds of historical question-and-answer dialogues related to the input image.
  • for example, when the user asks a question about image 1, the visual dialogue model obtains n1 rounds of historical question-and-answer dialogues corresponding to image 1; when the user asks a question about image 2, the visual dialogue model obtains n2 rounds of historical question-and-answer dialogues corresponding to image 2, where n1 and n2 are both positive integers.
  • the visual dialogue model includes an encoder 21, and the encoder 21 includes a plurality of Multimodal Incremental Transformer Encoders (MITE) 211; a corresponding MITE 211 is set for each round of historical question-and-answer dialogue.
  • MITE Multimodal Incremental Transformer Encoders
  • when the MITE 211 corresponding to each round of historical question-and-answer dialogue outputs the state vector corresponding to that round, it takes as input the image features of the input image 11, the historical question-and-answer features of that round of historical question-and-answer dialogue, and the state vector output by the MITE 211 corresponding to the previous round of historical question-and-answer dialogue; in this way, the state vector corresponding to each round of historical question-and-answer dialogue is obtained.
  • for the first round, the image features of the input image 11 and the question features of the first round are used as input, a state vector is output, and the output state vector is passed on to subsequent rounds until the current round is processed.
  • for the current round of questioning, the image features of the input image 11, the question features of the current round of questioning, and the state vector output by the MITE 211 corresponding to the nth round of historical question-and-answer dialogue are used as inputs, and the state vector corresponding to the current round of questioning is obtained.
  • the state vector corresponding to a round of historical question and answer dialogue includes the historical question and answer feature corresponding to the round of historical question and answer.
  • the server maps the text of the historical question and answer dialogue into a word vector through a word embedding operation (Word Embedding), thereby obtaining the historical question and answer feature.
  • Word Embedding word embedding operation
  • the state vector corresponding to a round of historical question-and-answer dialogue is obtained by formula (1), and formula (1) is:
  • c_n = MITE(v_n, u_n, c_{n-1})    (1)
  • where c_n represents the state vector corresponding to the n-th round of historical question-and-answer dialogue output by the MITE; v_n represents the image feature of the input image; u_n represents the historical question-and-answer feature of the n-th round of historical question-and-answer dialogue (extracted from the text of the historical question-and-answer dialogue); and c_{n-1} represents the state vector corresponding to the (n-1)-th round of historical question-and-answer dialogue.
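  • The recurrence of formula (1) can be read as a simple loop over the historical rounds; the following minimal sketch assumes `mite` is a callable implementing one encoder step, and the names and initial-state handling are illustrative only:

```python
def encode_history(mite, v, history_features, c_init=None):
    """Minimal sketch of the recurrence in formula (1): c_n = MITE(v, u_n, c_{n-1})."""
    c = c_init                                 # state before the first round
    states = []
    for u_n in history_features:               # u_1 .. u_n, one feature per historical round
        c = mite(v, u_n, c)                    # formula (1)
        states.append(c)
    return states                              # states[-1] feeds the MITE of the current round
```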
  • Step 202 acquiring question characteristics of the current round of questions.
  • the server extracts the features of the text corresponding to the current round of questions, and also obtains the question features of the current round of questions; here, the server can extract the question features from the text of the current round of questions through the visual dialogue model.
  • the server first maps each character string in the text of the current round of questioning through a word embedding operation to obtain a word vector of each character string, thereby obtaining a word vector corresponding to the text of the current round of questioning. Then, the server encodes each character string in the text of the current round of questions in a certain order through positional encoding to obtain the position of each word vector corresponding to the text of the current round of questions; wherein, the positional encoding includes Absolute position encoding and relative position encoding.
  • the question features obtained by the server through the visual dialogue model include word vectors and the position of each word vector in the sentence.
  • step 201 and step 202 may be implemented simultaneously, or step 201 may be implemented first and then step 202, or step 202 may be implemented first and then step 201; that is, steps 201 and 202 may be executed in no particular order.
  • Step 203 Perform multi-modal encoding processing on the image features, the state vectors corresponding to the previous n rounds of historical question-and-answer dialogues, and the question features, to obtain a state vector corresponding to the current round of question-answering.
  • the server performs multi-modal encoding processing by integrating image features, state vectors corresponding to the previous n rounds of historical question-and-answer dialogues, and question features, and the obtained result is the state vector corresponding to the current round of questioning.
  • the server may perform a multimodal encoding process through a visual dialogue model.
  • the visual dialogue model includes respective corresponding MITEs 211 set for each round of historical question-and-answer dialogues, and corresponding MITEs 211 also exist for the current round of questioning.
  • the server uses the image features and the historical question-and-answer features of the first round of historical question-and-answer dialogues as the input of MITE 211 corresponding to the first round of historical question-and-answer dialogues, and outputs the state vector corresponding to the first round of historical question-and-answer dialogues;
  • then, the state vector corresponding to the first round of historical question-and-answer dialogue, the historical question-and-answer features of the second round of historical question-and-answer dialogue, and the image features are input into the MITE 211 corresponding to the second round of historical question-and-answer dialogue, and the state vector corresponding to the second round of historical question-and-answer dialogue is output; and so on.
  • the current round of questioning is the (n+1)-th round; the state vector corresponding to the nth round of historical question-and-answer dialogue (the output of the MITE 211 corresponding to the nth round of historical question-and-answer dialogue), the image features, and the question features of the (n+1)-th round of questioning are input into the MITE 211 corresponding to the (n+1)-th round, and the state vector corresponding to the (n+1)-th round of questioning is output.
  • Step 204 Perform multi-modal decoding processing on the state vector and image features corresponding to the current round of questions, to obtain an actual output answer corresponding to the current round of questions.
  • the server performs decoding processing on the obtained state vector and image feature corresponding to the current round of questions, and the obtained decoding result is the actual output answer corresponding to the current round of questions.
  • the decoding process is a multi-modal decoding process.
  • the server can perform the multimodal decoding process through the visual dialogue model.
  • the visual dialogue model further includes a decoder 22, and the decoder 22 includes a Multimodal Incremental Transformer Decoder (MITD) 221.
  • MITD Multimodal Incremental Transformer Decoder
  • for example, the MITD combines the already-output words (strings) "I" and "am" with the state vector corresponding to the current round of questioning to output the word "fine".
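  • A minimal sketch of this word-by-word generation is shown below; `mitd`, the greedy argmax choice, and the token ids are assumptions made for illustration, not the patent's implementation:

```python
import torch

def generate_answer(mitd, v, state_vec, bos_id, eos_id, max_len=20):
    """At each step the decoder re-reads the strings it has already output,
    together with the state vector of the current round and the image feature,
    and predicts the next string (greedy choice here)."""
    output_ids = [bos_id]
    for _ in range(max_len):
        logits = mitd(state_vec, v, torch.tensor([output_ids]))   # (1, t, vocab)
        next_id = int(logits[0, -1].argmax())                      # next string, e.g. "fine"
        if next_id == eos_id:
            break
        output_ids.append(next_id)
    return output_ids[1:]                                          # the actual output answer
```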
  • the visual dialogue method provided by the embodiment of the present application enables the visual dialogue model to better understand the implicit information in the image in connection with the context by obtaining the state vectors corresponding to the first n rounds of historical question-and-answer dialogues about the input image.
  • in this way, the visual dialogue model can better output the actual output answer corresponding to the current round of questioning according to the various types of information, which improves the accuracy of the answer output by the visual dialogue model, ensures the consistency of the output answer with the question and the input image, and improves the effect of the visual dialogue.
  • FIG. 4 shows a flowchart of a visual dialogue method provided by another exemplary embodiment of the present application.
  • the embodiment of the present application is described by taking the visual dialogue method being used in the server 120 in the visual dialogue system 100 as shown in FIG. 1 as an example, and the visual dialogue method includes the following steps:
  • Step 401 Obtain image features of the input image and state vectors corresponding to the previous n rounds of historical question-and-answer dialogues, where n is a positive integer.
  • the input images are images in an existing image set.
  • the visual dialogue model includes a feature extraction model, which is a model constructed based on a convolutional neural network.
  • the image features of the input image are extracted by the fast region-based convolutional neural network (Fast R-CNN), as shown in the following formula (2):
  • v = FastR-CNN(I)    (2)
  • where v represents the image feature of the input image, I represents the input image, and FastR-CNN() represents the processing corresponding to the Fast R-CNN.
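  • The rest of the pipeline only assumes that the extractor returns a set of region feature vectors per image; the following PyTorch stand-in (not the patent's Fast R-CNN) merely illustrates that shape contract with a ResNet backbone and spatial pooling:

```python
import torch
import torch.nn as nn
import torchvision

class RegionFeatureExtractor(nn.Module):
    """Stand-in for the extractor of formula (2), v = FastR-CNN(I).

    A real system would take per-region features from a Fast R-CNN detector;
    this sketch only reproduces the shape contract (a set of region vectors
    per image) so the rest of the pipeline can be exercised."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18()                        # untrained backbone, for illustration
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, H/32, W/32)
        self.pool = nn.AdaptiveAvgPool2d(6)                             # 6 x 6 = 36 pseudo-regions

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        fmap = self.pool(self.features(images))                         # (B, 512, 6, 6)
        return fmap.flatten(2).transpose(1, 2)                          # v: (B, 36, 512)
```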
  • the encoder 21 of the visual dialogue model includes multiple MITEs 211, and each round of historical question-and-answer dialogue corresponds to one MITE 211; the state vector corresponding to the previous round of historical question-and-answer dialogue is used as an input to the MITE 211 corresponding to the next round of historical question-and-answer dialogue, and so on, until the state vector corresponding to the round of historical question-and-answer dialogue immediately preceding the current round of questioning is obtained.
  • the input to MITE 211 also includes an image description 17.
  • the input image also corresponds to an image description (caption)
  • the image description is used to describe the entities in the input image and the relationship between the entities
  • the image description is also used as an input to the MITE 211, which helps the visual dialogue model better extract the information implicit in the input image.
  • the input image 11 corresponds to an image description: a self-driving tour of four people.
  • Step 402 acquiring question characteristics of the current round of questions.
  • the feature extraction model is also used to extract question features from the current round of questions.
  • the question feature u_{n+1} is extracted by the following formulas (3) and (4):
  • u_{n+1} = [u_{n+1,1}, u_{n+1,2}, ..., u_{n+1,L}] ∈ R^{L×M}    (3)
  • u_{n+1,l} = PE(w_{n+1,l})    (4)
  • where PE() is the processing corresponding to the absolute position encoding function; w_{n+1,l} is the word vector obtained after the word embedding operation on the l-th string in the current round of questioning; u_{n+1,l} represents the string feature of the l-th string in the current round of questioning; L represents the maximum number of strings in the current round of questioning; M represents the dimension of the representation of each string; and R represents the real number domain.
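  • A minimal sketch of formulas (3) and (4) is shown below; the sinusoidal absolute position encoding and its additive combination with the word vector are common assumptions made here for illustration, since the text only names a word embedding operation and an absolute position encoding:

```python
import math
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Maps the L strings of a question to word vectors w_{n+1,l} and combines them
    with an absolute positional encoding, giving u_{n+1} in R^{L x M}."""
    def __init__(self, vocab_size: int, dim: int = 512, max_len: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        w = self.embed(token_ids)                 # (L, M) word vectors w_{n+1,l}
        return w + self.pe[: token_ids.size(0)]   # u_{n+1}: word vector plus position, (L, M)
```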
  • the historical question-and-answer feature corresponding to n rounds of historical question-and-answer dialogue can also be obtained by the above formulas (3) and (4).
  • Step 403 Obtain the state vector corresponding to the i-th historical question-and-answer dialogue, where i is a positive integer and the initial value of i is 1.
  • the server first obtains the state vector corresponding to the first round of historical question-and-answer dialogue through the first MITE 211; multi-modal encoding processing is then performed to obtain the state vector corresponding to the second round of historical question-and-answer dialogue; if the second round is not the current round, the state vector corresponding to the third round of historical question-and-answer dialogue continues to be obtained based on the above processing, and so on, until the current round is reached.
  • the multi-modal incremental conversion encoder is in one-to-one correspondence with the historical question-and-answer dialogue
  • in the case where i is 1, the state vector corresponding to the i-th round of historical question-and-answer dialogue is obtained by the first MITE 211 encoding the image features and the question-and-answer features; in the case where i is greater than 1, the state vector corresponding to the i-th round of historical question-and-answer dialogue is obtained by the i-th MITE 211 encoding the image features, the question-and-answer features, and the state vector corresponding to the (i-1)-th round of historical question-and-answer dialogue.
  • i is a variable, which can be any one of 1 to n.
  • the state vector corresponding to the n+1th round of historical question-and-answer dialogue is output by MITE 211 corresponding to the n+1th round of historical question-and-answer dialogue.
  • there is one MITE 211 for each round of historical question-and-answer dialogue, and one MITE 211 for the current round of questioning.
  • the embodiment of the present application is described by taking the existence of at least one round of historical question-and-answer dialogue as an example.
  • Step 404 Iterate i, and call the (i+1)-th multi-modal incremental conversion encoder in the visual dialogue model to perform multi-modal encoding processing on the image features, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the question-and-answer feature corresponding to the (i+1)-th round of historical question-and-answer dialogue, to obtain the state vector corresponding to the (i+1)-th round of historical question-and-answer dialogue.
  • in response to the (i+1)-th round being the current round of questioning, the server outputs the state vector corresponding to the current round of questioning through the MITE 211 corresponding to the (i+1)-th round; in response to the (i+1)-th round not being the current round of questioning, the server outputs the state vector corresponding to the (i+1)-th round of historical question-and-answer dialogue through the MITE 211 corresponding to the (i+1)-th round of historical question-and-answer dialogue, and this state vector is used as an input for the (i+2)-th round of historical question-and-answer dialogue.
  • the multimodal incremental conversion encoder includes K sub-transformation encoders, where K is a positive integer, and step 404 can be replaced with the following steps:
  • Step 4041 Obtain the j-th intermediate representation vector, where the j-th intermediate representation vector is obtained by performing j rounds of multi-modal encoding processing on the image feature, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the question-and-answer feature corresponding to the (i+1)-th round of historical question-and-answer dialogue; the j-th intermediate representation vector is a vector corresponding to the (i+1)-th round of historical question-and-answer dialogue, where j is a positive integer and the initial value of j is 1.
  • j is a variable and takes a value from 1 to K.
  • in the case where j is 1, the first sub-transform encoder is called to perform multi-modal encoding processing on the image features, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the question-and-answer feature corresponding to the (i+1)-th round of historical question-and-answer dialogue, to obtain the first intermediate encoding vector (the j-th intermediate representation vector); in the case where j is greater than 1, the j-th sub-transform encoder is called to perform multi-modal encoding processing on the image feature, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the (j-1)-th intermediate representation vector, to obtain the j-th intermediate encoding vector (the j-th intermediate representation vector).
  • each MITE 211 includes K sub-transform encoders 212, where K is a positive integer, and each sub-transform encoder 212 is used to perform one multi-modal encoding process, so that one round of historical question-and-answer dialogue undergoes K rounds of multi-modal encoding processing.
  • the K rounds of multi-modal encoding processing are performed to obtain the state vector c_{i+1} corresponding to the (i+1)-th round of historical question-and-answer dialogue.
  • each MITE 211 includes the same or different number of sub-transcoding encoders, that is, the number of times of multi-modal encoding processing performed in each round of historical question-and-answer dialogues is the same or different.
  • the image feature v, the state vector c_i corresponding to the i-th round of historical question-and-answer dialogue, and the historical question-and-answer feature u_{i+1} (obtained from the (i+1)-th round of historical question-and-answer dialogue through the embedding layer) are input into the first sub-transform encoder 212 in the (i+1)-th MITE 211, and an intermediate representation vector is output; this intermediate representation vector, the image feature v, and the question-and-answer feature u_{i+1} corresponding to the (i+1)-th round of historical question-and-answer dialogue are then input into the second sub-transform encoder 212, and so on.
  • the j-th sub-transform encoder 212 outputs the j-th intermediate representation vector, where the j-th intermediate representation vector is a vector corresponding to the (i+1)-th round of historical question-and-answer dialogue; the intermediate representation vector output by the K-th sub-transform encoder 212 is the state vector c_{i+1} corresponding to the (i+1)-th round of historical question-and-answer dialogue.
  • the same process applies to the MITE 211 corresponding to the current round of questioning: its j-th sub-transform encoder 212 outputs the j-th intermediate representation vector, which is a vector corresponding to the current round of questioning, and the K-th such vector is the state vector corresponding to the current round of questioning.
  • Step 4042 Iterate j, and call the (j+1)-th sub-transform encoder in the (i+1)-th multi-modal incremental conversion encoder to perform multi-modal encoding processing on the j-th intermediate representation vector, the image feature, and the state vector corresponding to the i-th round of historical question-and-answer dialogue, to obtain the (j+1)-th intermediate representation vector; the (j+1)-th intermediate representation vector is another vector corresponding to the (i+1)-th round of historical question-and-answer dialogue, and j+1 ≤ K.
  • the server inputs the image features, the historical question-and-answer feature of the (i+1)-th round of historical question-and-answer dialogue, and the j-th intermediate representation vector output by the j-th sub-transform encoder 212 into the (j+1)-th sub-transform encoder 212; the (j+1)-th sub-transform encoder outputs the (j+1)-th intermediate representation vector, which is also a vector corresponding to the (i+1)-th round of historical question-and-answer dialogue.
  • Step 4043 Determine the Kth intermediate representation vector obtained by iteration j as the state vector corresponding to the i+1th round of historical question-and-answer dialogue.
  • the server inputs the intermediate representation vector output by the previous sub-transform encoder into the next sub-transform encoder, until the K sub-transform encoders in the MITE have all performed multi-modal encoding processing and the state vector corresponding to one round of question-and-answer dialogue is output.
  • in summary, the server invokes the first sub-transform encoder in the (i+1)-th MITE to perform multi-modal encoding processing on the image features, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the question-and-answer feature of the (i+1)-th round of historical question-and-answer dialogue, to obtain the first intermediate representation vector; it then iterates j and calls the (j+1)-th sub-transform encoder to perform multi-modal encoding processing on the image features, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the j-th intermediate representation vector, to obtain the (j+1)-th intermediate representation vector, where j is a positive integer variable that increases from 1; the K-th intermediate representation vector obtained by iterating j is determined as the state vector corresponding to the (i+1)-th round of historical question-and-answer dialogue.
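  • The structure of one MITE with its K sub-transform encoders can be summarized by the following sketch; the layer call signature is an assumption, and one possible sub-transform encoder is sketched after the description of FIG. 7:

```python
import torch.nn as nn

class MITE(nn.Module):
    """Structural sketch of one multi-modal incremental transform encoder.

    `sub_encoders` is a list of K layers; each layer is assumed callable as
    layer(x, v, c_prev), where x is the previous intermediate representation
    (the question-and-answer feature u of this round for the first layer),
    v is the image feature and c_prev is the previous round's state vector."""
    def __init__(self, sub_encoders):
        super().__init__()
        self.layers = nn.ModuleList(sub_encoders)   # the K sub-transform encoders 212

    def forward(self, v, c_prev, u):
        x = u                                        # round feature enters the first sub-encoder
        for layer in self.layers:                    # j = 1 .. K
            x = layer(x, v, c_prev)                  # j-th intermediate representation vector
        return x                                     # state vector c_{i+1} of this round
```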
  • Step 405 Determine the state vector corresponding to the n+1th round of historical question-and-answer dialogue obtained by iteration i as the state vector corresponding to the current round of questioning.
  • each round of historical question and answer dialogue corresponds to a MITE 211
  • each MITE 211 outputs the state vector corresponding to each round of historical question and answer dialogue
  • the state vector output by the previous MITE 211 is used as the input of the next MITE 211, until it is input into the MITE 211 corresponding to the (n+1)-th round of questioning.
  • the server outputs the state vector corresponding to the current round of questions through the MITE 211 corresponding to the n+1 round of questions.
  • in summary, the server invokes the first MITE in the visual dialogue model to perform multi-modal encoding processing on the image feature and the question-and-answer feature of the first round of historical question-and-answer dialogue, to obtain the state vector corresponding to the first round of historical question-and-answer dialogue; it then iterates i and calls the (i+1)-th MITE to perform multi-modal encoding processing on the image features, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the question-and-answer feature corresponding to the (i+1)-th round of historical question-and-answer dialogue, to obtain the state vector corresponding to the (i+1)-th round of historical question-and-answer dialogue, where i is a positive integer variable that increases from 1; the state vector corresponding to the (n+1)-th round obtained by iterating i is determined as the state vector corresponding to the current round of questioning.
  • Step 406 Invoke the multi-modal incremental conversion decoder in the visual dialogue model, and obtain the character string features of the strings that have already been output in the actual output answer corresponding to the current round of questioning.
  • the visual dialogue model includes a multimodal incremental transformation decoder (MITD model) 221 for decoding the strings that make up the answer.
  • MITD model: multi-modal incremental conversion decoder
  • the current round of questions is: "How are you?", and the actual output answer is: “I am OK”.
  • when the character string that the multi-modal incremental conversion decoder 221 is to output is "OK", the already-output words "I" and "am" are input into the multi-modal incremental conversion decoder.
  • the character string feature may be extracted from the answer text corresponding to the outputted answer through a feature extraction model.
  • Step 407 Invoke the multi-modal incremental conversion decoder to perform multi-modal decoding processing on the state vector, image feature and character string feature corresponding to the current round of questioning to obtain a decoded feature vector.
  • Step 408 Determine the actual output answer corresponding to the current round of questions according to the decoded feature vector, where the actual output answer includes the outputted character string.
  • the server inputs the output string to the MITD 221, and outputs a string in the actual output answer corresponding to the current round of questions in combination with the state vector and image features corresponding to the current round of questions.
  • the multi-modal incremental conversion decoder includes T sub-conversion decoders, where T is a positive integer, and step 407 can be replaced with the following steps:
  • Step 4071 Obtain the m-th intermediate representation vector, where the m-th intermediate representation vector is obtained by performing m rounds of multi-modal decoding processing on the state vector corresponding to the current round of questioning, the image feature, and the character string feature, where m is a positive integer and the initial value of m is 1.
  • m is a variable and takes a value from 1 to T.
  • in the case where m is 1, the first sub-conversion decoder is used to perform multi-modal decoding processing on the state vector corresponding to the current round of questioning, the image feature, and the character string feature, to obtain the first intermediate decoding vector (the m-th intermediate representation vector); in the case where m is greater than 1, the m-th sub-conversion decoder is used to perform multi-modal decoding processing on the state vector corresponding to the current round of questioning, the image feature, and the (m-1)-th intermediate decoding vector (the (m-1)-th intermediate representation vector), to obtain the m-th intermediate decoding vector (the m-th intermediate representation vector).
  • the MITD 221 in FIG. 5 includes T sub-conversion decoders 222, each of which is used to perform one multi-modal decoding process, so that one MITD 221 performs T rounds of multi-modal decoding processing on the input vectors.
  • the visual dialogue model includes one or more MITDs 221, and the embodiments of the present application are described by taking the visual dialogue model including one MITD 221 as an example.
  • the image feature v, the character string feature, and the state vector c_{n+1} corresponding to the current round of questioning (output by the MITE 211 corresponding to the current round of questioning) are input into the first sub-conversion decoder 222 in the MITD, and an intermediate representation vector is output; the intermediate representation vector, the image feature v, and the character string feature are then input into the second sub-conversion decoder 222, and so on; the m-th sub-conversion decoder 222 outputs the m-th intermediate representation vector, where the m-th intermediate representation vector is a vector corresponding to the current round of questioning.
  • Step 4072 Iterate m, and call the (m+1)-th sub-conversion decoder in the multi-modal incremental conversion decoder to perform multi-modal decoding processing on the m-th intermediate representation vector, the image feature, and the state vector corresponding to the current round of questioning, to obtain the (m+1)-th intermediate representation vector, where m+1 ≤ T.
  • the server inputs the m-th intermediate representation vector output by the m-th sub-conversion decoder into the (m+1)-th sub-conversion decoder 222, and the (m+1)-th sub-conversion decoder outputs the (m+1)-th intermediate representation vector, which is also a vector corresponding to the current round of questioning.
  • Step 4073 Determine the T-th intermediate representation vector obtained by iteration m as the decoded feature vector.
  • the server inputs the intermediate representation vector output by the previous sub-conversion decoder into the next sub-conversion decoder, until the T sub-conversion decoders in the MITD have all performed multi-modal decoding processing and the decoded feature vector corresponding to the current round of questioning is output; the decoded feature vector is used to determine the actual output answer.
  • in summary, the server calls the first sub-conversion decoder in the MITD to perform multi-modal decoding processing on the image feature, the state vector corresponding to the current round of questioning, and the string feature, to obtain the first intermediate decoding vector; it then iterates m and calls the (m+1)-th sub-conversion decoder to perform multi-modal decoding processing on the image feature, the state vector corresponding to the current round of questioning, and the m-th intermediate decoding vector (the m-th intermediate representation vector), to obtain the (m+1)-th intermediate decoding vector (the (m+1)-th intermediate representation vector), where m is a positive integer variable increasing from 1; the T-th intermediate decoding vector (the T-th intermediate representation vector) obtained by iterating m is determined as the decoded feature vector.
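  • The corresponding decoder-side structure (one MITD with T sub-conversion decoders and an output projection) can be sketched as follows; the layer call signature and the projection to vocabulary logits are assumptions made for illustration:

```python
import torch.nn as nn

class MITD(nn.Module):
    """Structural sketch of the multi-modal incremental conversion decoder.

    `sub_decoders` is a list of T layers, each assumed callable as
    layer(x, v, c_cur); the first layer receives the features of the strings
    already output, later layers receive the previous intermediate decoding
    vector, and the T-th output (the decoded feature vector) is projected to
    vocabulary logits to choose the next string."""
    def __init__(self, sub_decoders, dim: int, vocab_size: int):
        super().__init__()
        self.layers = nn.ModuleList(sub_decoders)    # the T sub-conversion decoders 222
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, c_cur, v, answer_feat):
        x = answer_feat                              # features of the strings output so far
        for layer in self.layers:                    # m = 1 .. T
            x = layer(x, v, c_cur)                   # m-th intermediate representation vector
        return self.out(x)                           # logits used to pick the next string
```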
  • the visual dialogue method provided by the embodiment of the present application enables the visual dialogue model to better understand the implicit information in the input image in connection with the context by obtaining the state vectors corresponding to the first n rounds of historical question-and-answer dialogues about the input image.
  • in this way, the visual dialogue model can better output the actual output answer corresponding to the current round of questioning according to the various types of information, which improves the accuracy of the answer output by the visual dialogue model, ensures the consistency of the output answer with the question and the input image, and improves the effect of the visual dialogue.
  • in addition, the server performs multi-modal encoding processing on the state vector corresponding to each round of historical question-and-answer dialogue through the multi-modal incremental conversion encoders in the visual dialogue model, and so on, so as to obtain the state vector corresponding to the current round of questioning, which makes the output answer obtained after the subsequent multi-modal decoding processing more accurate.
  • the server sets K sub-transform encoders in each multi-modal incremental conversion encoder, and the K sub-transform encoders sequentially pass the intermediate representation vector output by the previous sub-transform encoder to the next one, so that the state vector corresponding to the current round of questioning is obtained and the output answer obtained by the subsequent decoding processing is more accurate.
  • the embodiments of the present application can provide accurate intermediate representation vectors for subsequent output answers through the layered structure.
  • the server decodes the state vector output by the multi-modal incremental conversion encoder through the multi-modal incremental conversion decoder in the visual dialogue model, so that the visual dialogue model can accurately output the actual output answer corresponding to the current round of questioning.
  • the server uses the T sub-conversion decoders set in the multi-modal incremental conversion decoder, and the T sub-conversion decoders sequentially pass the intermediate representation vector output by the previous sub-conversion decoder to the next sub-conversion decoder, so that the actual output answer corresponding to the current round of questioning is obtained.
  • the layered structure can ensure the accuracy of the answer output by the visual dialogue model.
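The overall encode-then-decode flow summarized above can be outlined as follows; this is an illustrative sketch under the assumption that each MITE and the MITD are callable modules, and all names are hypothetical rather than taken from the patent:

```python
# Hypothetical end-to-end sketch of the encode-then-decode flow described above.
def visual_dialog_forward(mites, mitd, image_feature, qa_features, question_feature, target_input):
    """mites: list of n+1 multi-modal incremental conversion encoders;
    mitd: the multi-modal incremental conversion decoder;
    qa_features: features of the n historical question-answer rounds;
    question_feature: feature of the current round of questioning."""
    state = None
    for i, qa_feature in enumerate(qa_features):
        # MITE i+1 turns the previous state vector, the image feature and the
        # (i+1)-th round's question-answer feature into the next state vector.
        state = mites[i](image_feature, state, qa_feature)
    # The last MITE consumes the current-round question feature and yields c_{n+1}.
    state = mites[len(qa_features)](image_feature, state, question_feature)
    # The MITD decodes the current-round state vector into the decoded feature vector.
    return mitd(target_input, image_feature, state)
```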
  • FIG. 7 shows a schematic structural diagram of a sub-transcoder provided by an exemplary embodiment of the present application.
  • a sub-transform encoder 212 includes a self-attention layer 213, a cross-modal attention layer 214, a history attention layer 215, and a feed-forward neural network (FNN) layer 216.
  • K means that a MITE 211 includes K sub-transform encoders 212 , namely, K self-attention layers 213 , K cross-modal attention layers 214 , K historical attention layers 215 , and K feed-forward neural network layers 216 .
  • Step 1 Invoke the j+1 th sub-transform encoder in the i+1 th multimodal incremental transform encoder to perform intermediate encoding processing on the j th intermediate representation vector to obtain a first sub-vector.
  • the server inputs the j-th intermediate representation vector output by the j-th sub-transcoder to the self-attention layer 213 of the j+1-th sub-transcoder, and outputs the first sub-vector.
  • A^(j+1) = MultiHead(C^j, C^j, C^j) (5)
  • A^(j+1) represents the first sub-vector
  • C^j represents the j-th intermediate representation vector output by the previous sub-transform encoder (the j-th sub-transform encoder)
  • MultiHead() represents the processing of the corresponding multi-head attention mechanism.
  • the j-th intermediate representation vector output by the j-th sub-transform encoder is the output of the feed-forward neural network layer of the j-th sub-transform encoder.
  • Step 2 Perform intermediate encoding processing on the first sub-vector and the image feature to obtain a second sub-vector.
  • the server inputs the first sub-vector into the cross-modal attention layer 214, simultaneously inputs the image feature v of the image, and outputs the second sub-vector.
  • B (j+1) represents the second sub-vector.
  • Step 3 Perform intermediate encoding processing on the second sub-vector and the state vector corresponding to the i-th historical question-and-answer dialogue to obtain a third sub-vector.
  • the server inputs the second sub-vector into the historical attention layer 215, and simultaneously inputs the state vector corresponding to the i-th historical question-and-answer dialogue (that is, the state vector of the MITE output corresponding to the i-th historical question-and-answer dialogue) , output the third sub-vector.
  • F (j+1) represents the third sub-vector
  • c i represents the state vector corresponding to the i-th historical question-and-answer dialogue.
  • Step 4 Perform intermediate encoding processing on the third sub-vector to obtain the j+1-th intermediate representation vector.
  • the server inputs the third sub-vector to the feedforward neural network layer 216, and outputs the j+1-th intermediate representation vector corresponding to the j+1-th sub-transcoder.
  • C^(j+1) represents the j+1-th intermediate representation vector
  • FFN() represents the processing corresponding to the feedforward neural network layer.
  • c i+1 represents the state vector corresponding to the i+1th round of historical question-and-answer dialogue.
  • after the j+1-th sub-transform encoder outputs the intermediate representation vector, that vector is used as the input of the j+2-th sub-transform encoder, and so on, until the last sub-transform encoder outputs the state vector corresponding to the i+1-th round of historical question-and-answer dialogue; a sketch of one such sub-transform encoder is given below.
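A minimal sketch of one sub-transform encoder following the order of formulas (5)-(8) is shown below (PyTorch is assumed; residual connections and layer normalisation are not spelled out in the text and are therefore omitted, and the hyper-parameters are illustrative):

```python
import torch.nn as nn

class SubTransformEncoder(nn.Module):
    """Illustrative sketch of one sub-transform encoder: self-attention,
    cross-modal attention over the image feature, history attention over the
    previous-round state vector, and a feed-forward layer."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_modal_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.history_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, c_j, image_feature, prev_state):
        a, _ = self.self_attn(c_j, c_j, c_j)                            # first sub-vector, formula (5)
        b, _ = self.cross_modal_attn(a, image_feature, image_feature)   # second sub-vector
        f, _ = self.history_attn(b, prev_state, prev_state)             # third sub-vector
        return self.ffn(f)                                              # (j+1)-th intermediate representation vector
```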
  • each MITE corresponds to a round of historical question and answer dialogue
  • the MITE corresponding to the current round of questioning takes as input the state vector corresponding to the previous round of historical question-and-answer dialogue, the question feature and the image feature, which are fed into the self-attention layer 213 of the first sub-transform encoder 212 of the MITE 211 corresponding to the current round of questioning; the above steps are repeated until the state vector corresponding to the current round of questioning is output.
  • each intermediate representation vector is calculated through the multi-layer structure set in the sub-transform encoder, so that each sub-transform encoder can accurately output its intermediate representation vector, which ensures that the state vector corresponding to the current round of questioning obtained subsequently is accurate.
  • FIG. 8 shows a schematic structural diagram of a sub-transform decoder provided by an exemplary embodiment of the present application.
  • a sub-transition decoder 222 includes a Self-attention layer (Self-attention) 223, a Gated Cross Attention (GCA) layer 224, and a Feedforward Neural Network (FNN) layer 225.
  • T denotes that an MITD 221 includes T sub-transformation decoders 222 , namely, T self-attention layers 223 , T gated cross-modal attention layers 224 and T feedforward neural network layers 225 .
  • the input of a sub-transform decoder 222 includes the image feature v of the input image, the state vector corresponding to the n+1-th round of question-and-answer dialogue (the current round of questioning), the question feature corresponding to the n+1-th round, and the target input.
  • the input and output process of the m+1th sub-transformation decoder is taken as an example for description, and the input and output process of the sub-transformation decoder is as follows:
  • Step 11 Invoke the m+1 th sub-transform decoder in the multi-modal incremental transform decoder to perform intermediate decoding processing on the m th intermediate representation vector to obtain a third sub-vector.
  • the server inputs the m-th intermediate representation vector output by the m-th sub-transform decoder into the self-attention layer 223 of the m+1-th sub-transform decoder, and outputs a third sub-vector.
  • J (m+1) represents the third sub-vector
  • R m represents the m-th intermediate representation vector output by the previous sub-transform decoder (m-th sub-transform decoder)
  • MultiHead() represents the multi-head attention mechanism.
  • before the first sub-transform decoder there is no preceding sub-transform decoder whose output can be used as input, so the target input R_0 (that is, the answer feature of the actual output answer) is input into the first sub-transform decoder.
  • in the use stage, the target input is the string features of the first x strings that have been output; in the training process of the visual dialogue model, the target input is the string features of the strings in the real answer that correspond to the first x strings that have been output.
  • the m-th intermediate representation vector output by the m-th sub-transform decoder is output by the feed-forward neural network layer of the m-th sub-transform decoder.
  • Step 12 Perform intermediate decoding processing on the third sub-vector, the image feature and the state vector corresponding to the current round of questioning to obtain the fourth sub-vector.
  • the server inputs the third sub-vector into the gated cross-modal attention layer 224 and simultaneously inputs the state vector corresponding to the current round of questioning and the image feature, and the layer outputs the fourth sub-vector.
  • the cross-modal attention layer 226-1 receives the state vector c_{n+1} corresponding to the current round of questioning (the n+1-th round), and outputs the vector E^(m+1) according to the third sub-vector J^(m+1) and the state vector c_{n+1} corresponding to the current round of questioning, as shown in formula (11):
  • the cross-modal attention layer 226-2 receives the image feature v and outputs a vector G (m+1) ; as shown in equation (12):
  • in some embodiments, the cross-modal attention layer 226-1 outputs the vector G^(m+1) and the cross-modal attention layer 226-2 outputs the vector E^(m+1); that is, the processing of the two layers can be exchanged.
  • the output vectors (E^(m+1) and G^(m+1)) are represented by unmarked rectangles in FIG. 9; the rectangles are only schematic and do not represent the size or number of the feature vectors actually output.
  • the vector E^(m+1) output by the cross-modal attention layer 226-1 is passed through the fully connected (FC) layer 227-1, which outputs a gate vector, as shown in formula (13):
  • E (m+1) represents the vector output by the cross-modal attention layer 226-1
  • σ represents the logistic regression (sigmoid) function
  • W E and b E represent the parameters of the cross-modal attention layer 226-1.
  • the vector G^(m+1) output by the cross-modal attention layer 226-2 is passed through the fully connected layer 227-2, whose output gate vector is given by formula (14):
  • σ(W_G [J^(m+1), G^(m+1)] + b_G) (14)
  • G^(m+1) represents the vector output by the cross-modal attention layer 226-2
  • σ represents the logistic regression function
  • W_G and b_G represent the parameters of the cross-modal attention layer 226-2.
  • ⊙ represents the Hadamard product.
  • since the fully connected layer 227-1 and the fully connected layer 227-2 are the same, the calculation processes on the two sides can be exchanged; that is, the fully connected layer 227-2 may output the gate vector of formula (13), and the fully connected layer 227-1 may output the gate vector of formula (14).
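The gated cross-modal attention layer described by formulas (11)-(14) can be sketched as follows. Since formula (15), which combines the two gated branches, is not reproduced above, the elementwise gated sum used here is an assumption, and the PyTorch module names and hyper-parameters are illustrative:

```python
import torch
import torch.nn as nn

class GatedCrossModalAttention(nn.Module):
    """Illustrative sketch of the gated cross-modal attention layer 224: one
    attention branch attends to the current-round state vector, the other to
    the image feature, and two fully connected gates (227-1 / 227-2) weight
    them. The final elementwise (Hadamard) gated sum stands in for the elided
    formula (15) and is therefore an assumption."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.attn_state = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.attn_image = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.fc_state = nn.Linear(2 * d_model, d_model)   # plays the role of W_E, b_E
        self.fc_image = nn.Linear(2 * d_model, d_model)   # plays the role of W_G, b_G

    def forward(self, j, image_feature, state_vector):
        e, _ = self.attn_state(j, state_vector, state_vector)    # formula (11)
        g, _ = self.attn_image(j, image_feature, image_feature)  # formula (12)
        gate_e = torch.sigmoid(self.fc_state(torch.cat([j, e], dim=-1)))  # formula (13)
        gate_g = torch.sigmoid(self.fc_image(torch.cat([j, g], dim=-1)))  # formula (14)
        return gate_e * e + gate_g * g   # assumed Hadamard-product combination
```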
  • Step 13 Perform intermediate decoding processing on the fourth sub-vector to obtain the m+1 th intermediate representation vector.
  • the server inputs the fourth sub-vector into the feed-forward neural network layer 225, which outputs the m+1-th intermediate representation vector corresponding to the m+1-th multi-modal decoding processing, as shown in formula (16):
  • R^(m+1) represents the m+1-th intermediate representation vector output by the m+1-th sub-transform decoder.
  • the intermediate representation vector will be output, and the intermediate representation vector will be used as the input of the m+2 th sub-transform decoder. And so on, until the last sub-transform decoder outputs the above decoded feature vector r n+1 .
  • the server obtains the string probability output in the actual output answer according to the decoded feature vector.
  • the feature vector output by MITD is input to the logistic regression layer to obtain the probability of the string currently being output, as shown in formula (18):
  • the server outputs the string in the actual output answer according to the string probability.
  • the server can use the output string probability to determine the currently output string (target output) through the visual dialogue model.
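A minimal sketch of this final step is shown below; the linear projection onto the vocabulary and the greedy choice of the most probable string are assumptions about the form of formula (18), not details stated in the text:

```python
import torch
import torch.nn.functional as F

def next_string(decoded_feature, output_projection, vocabulary):
    """Illustrative sketch: project the decoded feature vector onto the
    vocabulary and pick the most probable string (greedy decoding)."""
    logits = output_projection(decoded_feature)   # assumed linear projection, shape (vocab_size,)
    probs = F.softmax(logits, dim=-1)             # string probabilities of formula (18)
    return vocabulary[int(torch.argmax(probs))], probs
```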
  • each intermediate representation vector is calculated through the multi-layer structure set in the sub-transform decoder, so that each sub-transform decoder can accurately output its intermediate representation vector, which ensures that the decoded feature vector corresponding to the current round of questioning obtained subsequently is accurate, and thus guarantees the accuracy of the actual output answer output according to the decoded feature vector.
  • the attention models in the multimodal incremental conversion encoder and the multimodal incremental conversion decoder in the embodiments of the present application can be replaced with other attention models, such as traditional attention models, Local and global attention models, multi-head attention models, etc.
  • the training method of the visual dialogue model is described below.
  • FIG. 10 shows a flowchart of a training method for a visual dialogue model provided by an exemplary embodiment of the present application.
  • the embodiment of the present application is described by taking the training method being performed by the server 120 in the visual dialogue system 100 shown in FIG. 1 as an example; the training method includes the following steps:
  • Step 1001 Obtain the image feature samples of the input image samples and the state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples, where s is a positive integer.
  • the training samples for training the visual dialogue model include input image samples, and the input image samples are images in an existing image set.
  • the visual dialogue model includes a feature extraction model, which is a model constructed based on a convolutional neural network. Therefore, the server extracts the features in the input image samples through the fast region detection convolutional neural network, and the extracted features are the image feature samples; or, the server extracts the image feature samples in the input image samples through the convolutional neural network; or, The server extracts image feature samples in the input image samples through a Visual Geometry Group Network (VGG); or, extracts image feature samples in the input image samples through a Residual Neural Network (ResNET).
  • VGG Visual Geometry Group Network
  • ResNET Residual Neural Network
  • the process of training the visual dialogue model includes the process of training the feature extraction model, so the feature extraction model is a trained feature extraction model.
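As an illustration of the feature-extraction step, the following sketch uses a ResNet backbone from torchvision (a recent torchvision version is assumed); a fast region-detection network or a VGG network, as listed above, could be substituted, and the preprocessing values follow the common ImageNet convention rather than anything stated in the patent:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def extract_image_features(image_path):
    """Illustrative image feature extraction with a ResNet backbone."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
    backbone.eval()
    preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                            T.Normalize(mean=[0.485, 0.456, 0.406],
                                        std=[0.229, 0.224, 0.225])])
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(image).flatten(1)   # image feature sample v
```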
  • step 1001 is similar to the implementation description of step 401 .
  • Step 1002: Obtain the question feature samples of the current round of questioning samples and the first answer features of the real answer corresponding to the current round of questioning samples.
  • the server may use formulas (3) and (4) to obtain the question feature and the first answer feature.
  • the question feature and the first answer feature may be obtained by a visual dialogue model.
  • Step 1003 Invoke the visual dialogue model to perform multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the previous s rounds of historical question and answer dialogue samples, and the question feature samples, and obtain the state vector samples corresponding to the current round of questioning samples.
  • through the visual dialogue model, the server is provided with s multi-modal incremental conversion encoders (MITE) for the previous s rounds of historical question-and-answer dialogue samples, and a corresponding MITE is set for the current round of questioning samples; the state vector samples corresponding to the previous round of historical question-and-answer dialogue samples output by one MITE are used as the input of the next MITE.
  • the above process of outputting state vector samples is repeated until the state vector samples corresponding to the current round of questioning samples are output.
  • Step 1004 Invoke the visual dialogue model to perform multi-modal decoding processing on the state vector samples, image feature samples and first answer features corresponding to the current round of questioning samples to obtain the second answer features of the actual output answer samples corresponding to the current round of questioning samples.
  • the visual dialogue model further includes a Multimodal Incremental Transition Decoder (MITD), and the server inputs the state vector samples, image feature samples and first answer features corresponding to the current round of questioning samples into the MITD.
  • the MITD model consists of T sub-transform decoders, and the intermediate representation vector output by the previous sub-transform decoder is used as the input of the next sub-transform decoder. The above process of outputting the intermediate representation vector is repeated until the final decoded feature vector sample corresponding to the current round of questioning samples is output.
  • the decoded feature vector sample is the second answer feature of the actual output answer sample corresponding to the current round of questioning samples.
  • the server obtains, through the visual dialogue model, the string feature labels (first answer features) of the first q strings in the real answer; the first q strings in the real answer correspond one-to-one to the q strings that have been output in the actual output answer, and q is a positive integer. According to the state vector sample, the image feature sample and the string feature labels corresponding to the current round of questioning samples, the second answer feature corresponding to the q+1-th string in the actual output answer is obtained.
  • Step 1005 Train the visual dialogue model according to the first answer feature and the second answer feature to obtain a trained visual dialogue model.
  • the server trains the visual dialogue model according to the difference between the first answer feature and the second answer feature.
  • the trained visual dialogue model is the visual dialogue model in step 403 .
  • for example, if the actually output word is "OK" and the corresponding word in the real answer is "fine", the visual dialogue model is trained according to the difference between the two.
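A hedged sketch of one training step is given below; the text only states that the model is trained according to the difference between the first and second answer features, so the cross-entropy loss, teacher forcing and batch layout used here are assumptions:

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """Illustrative training step: the model predicts logits over strings for
    the actual output answer sample from the image feature samples, the
    historical rounds and the real-answer (first answer) features, and the
    difference to the real answer is measured with cross-entropy."""
    logits = model(batch["image_features"], batch["history_qa_features"],
                   batch["question_features"], batch["real_answer_features"])
    # logits: (batch, answer_length, vocab_size); targets: real-answer string ids
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch["real_answer_ids"].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```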
  • in the visual dialogue method of the embodiment of the present application, by obtaining the state vectors corresponding to the previous n rounds of historical question-and-answer dialogue about the input image, the trained visual dialogue model can better understand, in context, the information hidden in the image; using the multi-modal encoding processing and the multi-modal decoding processing, the trained visual dialogue model can better output the actual output answer corresponding to the current round of questioning according to multiple types of information, which improves the accuracy of the answer output by the trained visual dialogue model, ensures the consistency of the output answer with the question and the input image, and improves the effect of visual dialogue.
  • the visual dialogue model is obtained by training the state vector sample corresponding to the current round of questioning samples, the image feature sample and the first answer feature corresponding to the real answer, so that the accuracy of the answer output by the trained visual dialogue model is improved.
  • when the trained visual dialogue model is about to output the q+1-th string, the visual dialogue model determines what the q+1-th string should be based on all the strings before the q+1-th string in the real answer and the state vector sample corresponding to the current round of questioning samples, so that the string accuracy of the trained visual dialogue model is higher, thereby ensuring that the output answer is more accurate.
  • the training method and usage method of the visual dialogue model are similar.
  • during training, the state vector samples and image feature samples corresponding to the current round of questioning samples are subjected to multi-modal decoding processing to obtain the second answer feature of the actual output answer corresponding to the current round of questioning, and the visual dialogue model is trained by combining the first answer feature of the real answer with the second answer feature.
  • during use, the trained visual dialogue model outputs the string to be output according to the strings already output and the state vector corresponding to the current round of questioning.
  • the multi-modal incremental conversion encoder in the visual dialogue model is called to obtain the state vector sample corresponding to the a-th round of historical question-and-answer dialogue samples, where a is a positive integer variable whose value ranges from 1 to s and whose initial value is 1; iterating over a, the a+1-th multi-modal incremental conversion encoder in the visual dialogue model is called to perform multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the a-th round of historical question-and-answer dialogue samples, and the question-and-answer feature samples corresponding to the a+1-th round of historical question-and-answer dialogue samples; the state vector sample obtained by this iteration (corresponding to the s+1-th round) is determined as the state vector sample corresponding to the current round of questioning samples.
  • the multi-modal incremental conversion encoder includes K sub-transform encoders, where K is a positive integer. The j-th intermediate representation vector sample is obtained; it is obtained by performing j rounds of multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the a-th round of historical question-and-answer dialogue samples, and the question-and-answer feature samples corresponding to the a+1-th round of historical question-and-answer dialogue samples, and it is a vector corresponding to the a+1-th round of historical question-and-answer dialogue samples, j being a positive integer with an initial value of 1. Iterating over j, the j+1-th sub-transform encoder in the a+1-th multi-modal incremental conversion encoder in the visual dialogue model is called to perform multi-modal encoding processing on the j-th intermediate representation vector sample, the image feature sample and the state vector sample corresponding to the a-th round of historical question-and-answer dialogue samples, obtaining the j+1-th intermediate representation vector sample, which is another vector corresponding to the a+1-th round of historical question-and-answer dialogue samples, where j+1 ≤ K. The K-th intermediate representation vector sample obtained by this iteration is determined as the state vector sample corresponding to the a+1-th round of historical question-and-answer dialogue samples.
  • the j+1 th sub-transform encoder in the a+1 th multimodal incremental conversion encoder is called to perform intermediate encoding processing on the j th intermediate representation vector sample to obtain the first sub-vector sample; perform intermediate encoding processing on the first sub-vector sample and the image feature sample to obtain the second sub-vector sample; perform intermediate encoding processing on the second sub-vector sample and the state vector sample corresponding to the a-th round of historical question and answer dialogue samples to obtain the first sub-vector sample Three sub-vector samples; perform intermediate coding processing on the third sub-vector sample to obtain the j+1 th intermediate representation vector sample.
  • the multi-modal incremental conversion decoder in the visual dialogue model is called to obtain the string feature samples of the strings already output in the actual output answer sample corresponding to the current round of questioning samples; the multi-modal incremental conversion decoder is called to perform multi-modal decoding processing on the state vector samples, image feature samples and string feature samples corresponding to the current round of questioning samples to obtain decoded feature vector samples; and the actual output answer sample corresponding to the current round of questioning is determined according to the decoded feature vector samples.
  • the multi-modal incremental conversion decoder includes T sub-conversion decoders, where T is a positive integer. The m-th intermediate representation vector sample is obtained; it is obtained by performing m rounds of multi-modal decoding processing on the state vector samples, image feature samples and string feature samples corresponding to the current round of questioning samples, m being a positive integer with an initial value of 1. Iterating over m, the m+1-th sub-conversion decoder in the multi-modal incremental conversion decoder is called to perform multi-modal decoding processing on the m-th intermediate representation vector sample, the image feature sample and the state vector sample corresponding to the current round of questioning samples, obtaining the m+1-th intermediate representation vector sample, where m+1 ≤ T; the T-th intermediate representation vector sample obtained by this iteration is determined as the decoded feature vector sample.
  • the m+1-th sub-transform decoder in the multi-modal incremental conversion decoder is called to perform intermediate decoding processing on the m-th intermediate representation vector sample to obtain a third sub-vector sample; intermediate decoding processing is performed on the third sub-vector sample, the image feature sample and the state vector sample corresponding to the current round of questioning samples to obtain a fourth sub-vector sample; and intermediate decoding processing is performed on the fourth sub-vector sample to obtain the m+1-th intermediate representation vector sample.
  • the estimated probability of the string output in the actual output answer sample is obtained according to the decoded feature vector sample; the string sample in the actual output answer sample is output according to the estimated string probability.
  • Table 1 shows the training effect of the visual dialogue model compared with the benchmark model, and the visual dialogue model provided in the above method embodiment is comprehensively evaluated with different types of evaluation indicators.
  • the visual dialogue model obtains a list of candidate answers, and the three evaluation metrics in Table 1 are used to evaluate the performance of the visual dialogue model in retrieving answers.
  • MRR represents the mean reciprocal rank (Mean Reciprocal Rank), which sorts the list of candidate answers. If the correct answer is ranked in the ath position, the value of MRR is 1/a. The higher the value of MRR, the higher the accuracy of the answer output by the visual dialogue model, that is, the better the effect of the visual dialogue model.
  • R@K represents whether the human response exists in the top-K ranked responses.
  • the higher the value of R@K, the higher the accuracy of the answer output by the visual dialogue model, that is, the better the effect of the visual dialogue model.
  • Mean represents the average rank of the human response in the candidate answer list. The lower the value of Mean, the higher the accuracy of the answer output by the visual dialogue model, that is, the better the effect of the visual dialogue model.
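The three retrieval metrics can be computed from the rank of the human (correct) response in each candidate list, for example as follows (an illustrative helper, not part of the patent):

```python
def evaluate_rankings(ranks, ks=(1, 5, 10)):
    """ranks: 1-based rank of the correct (human) response for each dialogue turn."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n                      # mean reciprocal rank (higher is better)
    r_at_k = {k: sum(r <= k for r in ranks) / n for k in ks}   # R@K (higher is better)
    mean_rank = sum(ranks) / n                                 # Mean (lower is better)
    return mrr, r_at_k, mean_rank

# Example: the correct answer is ranked 1st, 3rd and 7th in three turns
print(evaluate_rankings([1, 3, 7]))
```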
  • FIG. 11 is a structural block diagram of a visual dialogue device provided by an exemplary embodiment of the present application.
  • the visual dialogue device 11-1 includes:
  • the first acquisition module 1110 is configured to acquire the image features of the input image and the state vectors corresponding to the previous n rounds of historical question-and-answer dialogues, where n is a positive integer;
  • the first obtaining module 1110 is configured to obtain the question characteristics of the current round of questions
  • the first feature encoding module 1120 is configured to perform multi-modal encoding processing on image features, state vectors corresponding to the previous n rounds of historical question-and-answer dialogues, and question features, to obtain a state vector corresponding to the current round of questioning;
  • the first feature decoding module 1130 is configured to perform multi-modal decoding processing on the state vector and image features corresponding to the current round of questions, to obtain the actual output answers corresponding to the current round of questions.
  • the first feature encoding module 1120 is further configured to: obtain the state vector corresponding to the i-th round of historical question-and-answer dialogue, where i is a positive integer with an initial value of 1; call the i+1-th multi-modal incremental conversion encoder to perform multi-modal encoding processing on the image feature, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the question-and-answer feature corresponding to the i+1-th round of historical question-and-answer dialogue, to obtain the state vector corresponding to the i+1-th round of historical question-and-answer dialogue, where different multi-modal incremental conversion encoders correspond one-to-one to different historical question-and-answer dialogues; and determine the state vector corresponding to the n+1-th round of historical question-and-answer dialogue obtained by this iteration as the state vector corresponding to the current round of questioning.
  • the multi-modal incremental conversion encoder includes K sub-transform encoders, where K is a positive integer. The first feature encoding module 1120 is further configured to: obtain the j-th intermediate representation vector, which is obtained by performing j rounds of multi-modal encoding processing on the image feature, the state vector corresponding to the i-th round of historical question-and-answer dialogue, and the question-and-answer feature corresponding to the i+1-th round of historical question-and-answer dialogue, and which is a vector corresponding to the i+1-th round of historical question-and-answer dialogue, j being a positive integer with an initial value of 1; iterate over j, calling the j+1-th sub-transform encoder in the i+1-th multi-modal incremental conversion encoder in the visual dialogue model to perform multi-modal encoding processing on the j-th intermediate representation vector, the image feature and the state vector corresponding to the i-th round of historical question-and-answer dialogue, to obtain the j+1-th intermediate representation vector, which is another vector corresponding to the i+1-th round of historical question-and-answer dialogue, where j+1 ≤ K; and determine the K-th intermediate representation vector obtained by iterating over j as the state vector corresponding to the i+1-th round of historical question-and-answer dialogue.
  • the first feature encoding module 1120 is further configured to call the j+1 th sub-transcoder in the i+1 th multi-modal incremental transcoder Perform intermediate encoding processing on the j-th intermediate representation vector to obtain a first sub-vector; perform intermediate encoding processing on the first sub-vector and the image feature to obtain a second sub-vector; Perform intermediate encoding processing on the state vector corresponding to the i-th historical question-and-answer dialogue to obtain a third sub-vector; perform intermediate encoding processing on the third sub-vector to obtain the j+1-th intermediate representation vector.
  • the first feature decoding module 1130 is further configured to call the multimodal incremental conversion decoder in the visual dialogue model to obtain the output string of the actual output answer corresponding to the current round of questioning. character string feature; call the multimodal incremental conversion decoder to perform multimodal decoding processing on the state vector, image feature and string feature corresponding to the current round of questioning to obtain a decoded feature vector; determine according to the decoded feature vector The actual output answer corresponding to the current round of questions, wherein the actual output answer includes the outputted character string.
  • the first feature decoding module 1130 is further configured to determine a string probability according to the decoded feature vector; and determine a string in the actual output answer according to the string probability.
  • the multimodal incremental conversion decoder includes T sub-transformation decoders, where T is a positive integer; the first feature decoding module 1130 is further configured to obtain the mth intermediate representation vector, the The mth intermediate representation vector is obtained by performing m times of multi-modal decoding processing on the state vector corresponding to the current round question, the image feature and the character string feature, where m is a positive integer and the initial value of m is 1; Iteration m, calling the m+1th sub-transformation decoder in the multimodal incremental conversion decoder to perform a state corresponding to the mth intermediate representation vector, the image feature, and the current round questioning The vector is subjected to multi-modal decoding processing to obtain the m+1 th intermediate representation vector, where m+1 ⁇ T; the T th intermediate representation vector obtained by iteration m is determined as the decoding feature vector.
  • the first feature decoding module 1130 is further configured to call the m+1 th sub-transform decoder in the multi-modal incremental conversion decoder to perform a Perform intermediate decoding processing on the representation vector to obtain a third sub-vector; perform intermediate decoding processing on the third sub-vector, the image feature and the state vector corresponding to the current round of questions to obtain a fourth sub-vector; Perform intermediate decoding processing on the four sub-vectors to obtain the m+1 th intermediate representation vector.
  • by acquiring the state vectors corresponding to the previous n rounds of historical question-and-answer dialogue about the input image, the visual dialogue device enables the visual dialogue model to better understand, in context, the information hidden in the image; using the multi-modal encoding processing and multi-modal decoding processing, the visual dialogue model can better output the actual output answer corresponding to the current round of questioning according to multiple types of information, which improves the accuracy of the answer output by the visual dialogue model, ensures the consistency of the output answer with the question and the input image, and improves the effect of visual dialogue.
  • the state vector corresponding to each round of historical question-and-answer dialogue is multi-modally encoded by the multi-modal incremental conversion encoders in the visual dialogue model, and so on, so as to obtain the state vector corresponding to the current round of questioning, which makes the output answer obtained after the subsequent multi-modal decoding processing more accurate.
  • the K sub-transform encoders sequentially pass the intermediate representation vector output by the previous sub-transform encoder to the next sub-transform encoder, so that the state vector corresponding to the current round of questioning is obtained and the output answer obtained by the subsequent decoding processing is more accurate.
  • the layered structure ensures that accurate intermediate representation vectors are provided for the subsequent output answer.
  • the state vector output by the multi-modal incremental conversion encoder is decoded by the multi-modal incremental conversion decoder in the visual dialogue model, so that the visual dialogue model can accurately output the actual output answer corresponding to the current round of questioning.
  • the T sub-transform decoders sequentially pass the intermediate representation vector output by the previous sub-transform decoder to the next sub-transform decoder, so that the actual output answer corresponding to the current round of questioning is obtained.
  • the accuracy of the answer output by the visual dialogue model is guaranteed by the layered structure.
  • each intermediate representation vector is calculated separately through the multi-layer structure set in the sub-transform encoder, so that each sub-transform encoder can accurately output the intermediate representation vector according to the previous sub-transform encoder, which ensures that the state vector corresponding to the current round of questioning obtained subsequently is accurate.
  • each intermediate representation vector is calculated separately through the multi-layer structure set in the sub-transform decoder, so that each sub-transform decoder can accurately output the intermediate representation vector according to the previous sub-transform decoder, which ensures that the decoded feature vector corresponding to the current round of questioning obtained subsequently is accurate, and thus that the actual output answer output according to the decoded feature vector is accurate.
  • the visual dialogue device provided by the embodiment of the present application is only illustrated by the above division of functional modules; in practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the visual dialogue device and the visual dialogue method provided in the above embodiments belong to the same concept, and the specific implementation process thereof can be found in the visual dialogue method provided by the embodiment of the present application, which will not be repeated here.
  • the training apparatus 12-1 includes:
  • the second acquisition module 1210 is configured to acquire the image feature samples of the input image samples and the state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples, where s is a positive integer;
  • the second obtaining module 1210 is configured to obtain the question feature samples of the current round of questioning samples and the first answer features of the real answers corresponding to the current round of questioning samples;
  • the second feature encoding module 1220 is configured to call the visual dialogue model to perform multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples, and the question feature samples, to obtain the The state vector sample corresponding to the current round of questioning samples;
  • the second feature decoding module 1230 is configured to call the visual dialogue model to perform multi-modal decoding processing on the state vector samples corresponding to the current round of questioning samples, the image feature samples and the first answer features, to obtain the The second answer feature of the actual output answer sample corresponding to the current round of questioning samples;
  • the training module 1240 is configured to train the visual dialogue model according to the first answer feature and the second answer feature to obtain a trained visual dialogue model.
  • the second feature decoding module 1230 is further configured to obtain the string feature labels of the first q strings in the real answer, where the first q strings in the real answer correspond one-to-one to the q strings that have been output in the actual output answer sample, q is a positive integer, and the first answer feature includes the string feature labels; and to obtain, according to the state vector sample, the image feature sample and the string feature labels corresponding to the current round of questioning samples, the second answer feature corresponding to the q+1-th string in the actual output answer sample corresponding to the current round of questioning samples.
  • the second feature encoding module 1220 is further configured to call the multimodal incremental conversion encoder in the visual dialogue model to obtain the first state vector sample corresponding to the a-th round of historical question-and-answer dialogue samples, a is a positive integer, a is a variable, and the corresponding value of a is any one of 1 to s; obtain the state vector sample corresponding to the historical question and answer dialogue sample of the a-th round, a is a positive integer and the initial value of a is 1; Iteration a, call the a+1 multi-modal incremental conversion encoder in the visual dialogue model to perform the image feature samples, the state vector samples corresponding to the a-th round of historical question-and-answer dialogue samples, and the a+1-th historical question-and-answer dialogue samples The corresponding question-and-answer feature samples are processed by multi-modal encoding, and the state vector samples corresponding to the a+1 round of historical question-and-answer dialogue samples are obtained.
  • Different multi-modal incremental conversion encoders correspond to different historical question-and-answer dialogue samples one-to-one;
  • the state vector sample corresponding to the s+1th round of historical question-and-answer dialogue samples obtained by iteration a is determined as the state vector sample corresponding to the current round of questioning samples.
  • the multi-modal incremental conversion encoder includes K sub-transform encoders, where K is a positive integer. The second feature encoding module 1220 is further configured to: obtain the j-th intermediate representation vector sample, which is obtained by performing j rounds of multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the a-th round of historical question-and-answer dialogue samples, and the question-and-answer feature samples corresponding to the a+1-th round of historical question-and-answer dialogue samples, and which is a vector corresponding to the a+1-th round of historical question-and-answer dialogue samples, j being a positive integer with an initial value of 1; iterate over j, calling the j+1-th sub-transform encoder in the a+1-th multi-modal incremental conversion encoder in the visual dialogue model to perform multi-modal encoding processing on the j-th intermediate representation vector sample, the image feature sample and the state vector sample corresponding to the a-th round of historical question-and-answer dialogue samples, to obtain the j+1-th intermediate representation vector sample, which is another vector corresponding to the a+1-th round of historical question-and-answer dialogue samples, where j+1 ≤ K; and determine the K-th intermediate representation vector sample obtained by iterating over j as the state vector sample corresponding to the a+1-th round of historical question-and-answer dialogue samples.
  • the second feature encoding module 1220 is further configured to call the j+1-th sub-transform encoder in the a+1-th multi-modal incremental conversion encoder to perform intermediate encoding processing on the j-th intermediate representation vector sample to obtain a first sub-vector sample; perform intermediate encoding processing on the first sub-vector sample and the image feature sample to obtain a second sub-vector sample; perform intermediate encoding processing on the second sub-vector sample and the state vector sample corresponding to the a-th round of historical question-and-answer dialogue samples to obtain a third sub-vector sample; and perform intermediate encoding processing on the third sub-vector sample to obtain the j+1-th intermediate representation vector sample.
  • the second feature decoding module 1230 is further configured to call the multi-modal incremental conversion decoder in the visual dialogue model to obtain the outputted characters in the actual output answer sample corresponding to the current round of questioning samples String feature samples of the string; call the multi-modal incremental conversion decoder to perform multi-modal decoding processing on the state vector samples, image feature samples and string feature samples corresponding to the current round of questioning samples to obtain decoded feature vector samples; The decoded feature vector samples determine the actual output answer samples corresponding to the current round of questions.
  • the multi-modal incremental conversion decoder includes T sub-transform decoders, where T is a positive integer. The second feature decoding module 1230 is further configured to: obtain the m-th intermediate representation vector sample, which is obtained by performing m rounds of multi-modal decoding processing on the state vector samples, image feature samples and string feature samples corresponding to the current round of questioning samples, m being a positive integer with an initial value of 1; iterate over m, calling the m+1-th sub-transform decoder in the multi-modal incremental conversion decoder to perform multi-modal decoding processing on the m-th intermediate representation vector sample, the image feature sample and the state vector sample corresponding to the current round of questioning samples to obtain the m+1-th intermediate representation vector sample, where m+1 ≤ T; and determine the T-th intermediate representation vector sample obtained by iterating over m as the decoded feature vector sample.
  • the second feature decoding module 1230 is further configured to call the m+1-th sub-transform decoder in the multi-modal incremental conversion decoder to perform intermediate decoding processing on the m-th intermediate representation vector sample to obtain a third sub-vector sample; perform intermediate decoding processing on the third sub-vector sample, the image feature sample and the state vector sample corresponding to the current round of questioning samples to obtain a fourth sub-vector sample; and perform intermediate decoding processing on the fourth sub-vector sample to obtain the m+1-th intermediate representation vector sample.
  • the second feature decoding module 1230 is further configured to obtain, according to the decoded feature vector samples, the estimated probability of the string to be output in the actual output answer sample, and to output the string sample in the actual output answer sample according to the estimated string probability.
  • FIG. 13 shows a schematic structural diagram of a server provided by an exemplary embodiment of the present application.
  • the server 1300 may be the server 120 in the visual dialogue system 100 as shown in FIG. 1 .
  • the server 1300 includes a central processing unit (CPU, Central Processing Unit) 1301 , a system memory 1304 including a random access memory (RAM, Random Access Memory) 1302 and a read only memory (ROM, Read Only Memory) 1303 , and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301 .
  • CPU Central Processing Unit
  • RAM random access memory
  • ROM Read Only Memory
  • the server 1300 also includes a basic input/output system (I/O system) 1306, which facilitates the transfer of information between the various devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314 and other program modules 1315.
  • I/O system Input Output System
  • the basic input/output system 1306 includes a display 1308 for displaying information and input devices 1309 such as a mouse, keyboard, etc., for user input of information. Both the display 1308 and the input device 1309 are connected to the central processing unit 1301 through the input and output controller 1310 connected to the system bus 1305.
  • the basic input/output system 1306 may also include an input output controller 1310 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus. Similarly, input output controller 1310 also provides output to a display screen, printer, or other type of output device.
  • Mass storage device 1307 is connected to central processing unit 1301 through a mass storage controller (not shown in FIG. 13 ) connected to system bus 1305 .
  • Mass storage device 1307 and its associated computer-readable media provide non-volatile storage for server 1300. That is, the mass storage device 1307 may include a computer-readable storage medium (not shown in FIG. 13 ) such as a hard disk or a Compact Disc Read Only Memory (CD-ROM, Compact Disc Read Only Memory) drive.
  • Computer-readable storage media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other solid-state storage technology, CD-ROM, Digital Versatile Disc (DVD) or Solid State Drives (SSD), other optical storage, tape cartridges, magnetic tape, disk storage or other magnetic storage devices.
  • EPROM Erasable Programmable Read Only Memory
  • EEPROM Electrically Erasable Programmable Read Only Memory
  • the random access memory may include a resistive random access memory (ReRAM, Resistance Random Access Memory) and a dynamic random access memory (DRAM, Dynamic Random Access Memory).
  • ReRAM resistive random access memory
  • DRAM Dynamic Random Access Memory
  • the computer storage medium is not limited to the above-mentioned types.
  • the system memory 1304 and the mass storage device 1307 described above may be collectively referred to as memory.
  • the server 1300 may also be connected to a remote computer on the network through a network such as the Internet to run; that is, the server 1300 can be connected to the network 1312 through the network interface unit 1311 connected to the system bus 1305, or the network interface unit 1311 can be used to connect to other types of networks or remote computer systems (not shown in FIG. 13).
  • the above-mentioned memory also includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
  • an electronic device in an embodiment of the present application includes a processor and a memory, and the memory stores at least one instruction, at least one piece of program, a code set or an instruction set, which is loaded and executed by the processor to implement the visual dialogue method and the training method of the visual dialogue model as described above.
  • a computer-readable storage medium stores at least one instruction, at least one piece of program, code set or instruction set, at least one instruction, at least one piece of program, code set Or the instruction set is loaded and executed by the processor to implement the visual dialogue method and the training method of the visual dialogue model as described above.
  • the computer-readable storage medium may include: Read Only Memory (ROM, Read Only Memory), Random Access Memory (RAM, Random Access Memory), Solid State Drive (SSD, Solid State Drives), or an optical disc.
  • the random access memory may include a resistive random access memory (ReRAM, Resistance Random Access Memory) and a dynamic random access memory (DRAM, Dynamic Random Access Memory).
  • Embodiments of the present application also provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the visual dialogue method and the training of the visual dialogue model as described in the above aspects method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A visual dialogue method, a model training method, an apparatus, an electronic device and a computer-readable storage medium, relating to the field of visual dialogue. The method includes: obtaining image features of an input image and state vectors corresponding to the previous n rounds of historical question-and-answer dialogue, n being a positive integer (201); obtaining a question feature of the current round of questioning (202); performing multi-modal encoding processing on the image features, the state vectors corresponding to the previous n rounds of historical question-and-answer dialogue and the question feature to obtain a state vector corresponding to the current round of questioning (203); and performing multi-modal decoding processing on the state vector corresponding to the current round of questioning and the image features to obtain an actual output answer corresponding to the current round of questioning (204).

Description

Visual dialogue method, model training method, apparatus, electronic device and computer-readable storage medium
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 202010805359.1 filed on August 12, 2020, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of visual dialogue, and in particular to a visual dialogue method, a model training method, an apparatus, an electronic device and a computer-readable storage medium.
Background
Visual dialogue refers to the process of holding a meaningful conversation with a human in natural conversational language about visual content (such as an image).
Generally, to implement visual dialogue, the output answer to the current input question is obtained on the basis of the input image, the current input question, the previous round of historical question-and-answer dialogue and the working state vector of the previous moment. However, in this technical solution, when the input question carries a large amount of information, the accuracy of the output answer is low.
Summary
Embodiments of this application provide a visual dialogue method, a model training method, an apparatus, an electronic device and a computer-readable storage medium. By acquiring the information contained in the input image in combination with the previous n rounds of historical question-and-answer dialogue, the accuracy of the answer output for the input question can be improved. The technical solutions of the embodiments of this application are as follows:
An embodiment of this application provides a visual dialogue method, the method including:
obtaining image features of an input image and state vectors corresponding to the previous n rounds of historical question-and-answer dialogue, n being a positive integer;
obtaining a question feature of the current round of questioning;
performing multi-modal encoding processing on the image features, the state vectors corresponding to the previous n rounds of historical question-and-answer dialogue and the question feature to obtain a state vector corresponding to the current round of questioning;
performing multi-modal decoding processing on the state vector corresponding to the current round of questioning and the image features to obtain an actual output answer corresponding to the current round of questioning.
An embodiment of this application provides a training method for a visual dialogue model, the method including:
obtaining image feature samples of input image samples and state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples, s being a positive integer;
obtaining question feature samples of a current round of questioning samples and first answer features of the real answer corresponding to the current round of questioning samples;
invoking a visual dialogue model to perform multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples and the question feature samples to obtain state vector samples corresponding to the current round of questioning samples;
invoking the visual dialogue model to perform multi-modal decoding processing on the state vector samples corresponding to the current round of questioning samples, the image feature samples and the first answer features to obtain second answer features of the actual output answer samples corresponding to the current round of questioning samples;
training the visual dialogue model according to the first answer features and the second answer features to obtain a trained visual dialogue model.
An embodiment of this application provides a visual dialogue apparatus, the apparatus including:
a first acquisition module, configured to obtain image features of an input image and state vectors corresponding to the previous n rounds of historical question-and-answer dialogue, n being a positive integer;
the first acquisition module being configured to obtain a question feature of the current round of questioning;
a first feature encoding module, configured to perform multi-modal encoding processing on the image features, the state vectors corresponding to the previous n rounds of historical question-and-answer dialogue and the question feature to obtain a state vector corresponding to the current round of questioning;
a first feature decoding module, configured to perform multi-modal decoding processing on the state vector corresponding to the current round of questioning and the image features to obtain an actual output answer corresponding to the current round of questioning.
An embodiment of this application provides a training apparatus for a visual dialogue model, the apparatus including:
a second acquisition module, configured to obtain image feature samples of input image samples and state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples, s being a positive integer;
the second acquisition module being configured to obtain question feature samples of a current round of questioning samples and first answer features of the real answer corresponding to the current round of questioning samples;
a second feature encoding module, configured to invoke a visual dialogue model to perform multi-modal encoding processing on the image feature samples, the state vector samples corresponding to the previous s rounds of historical question-and-answer dialogue samples and the question feature samples to obtain state vector samples corresponding to the current round of questioning samples;
a second feature decoding module, configured to invoke the visual dialogue model to perform multi-modal decoding processing on the state vector samples corresponding to the current round of questioning samples, the image feature samples and the first answer features to obtain second answer features of the actual output answer samples corresponding to the current round of questioning samples;
a training module, configured to train the visual dialogue model according to the first answer features and the second answer features to obtain a trained visual dialogue model.
An embodiment of this application provides an electronic device, the electronic device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the visual dialogue method and the training method for a visual dialogue model described above.
An embodiment of this application provides a computer-readable storage medium, the computer-readable storage medium storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the visual dialogue method and the training method for a visual dialogue model described above.
An embodiment of this application provides a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium, and when the processor executes the computer instructions, the electronic device performs the visual dialogue method and the training method for a visual dialogue model described above.
The beneficial effects brought by the technical solutions provided in the embodiments of this application include at least the following:
By processing the input image and the previous n rounds of historical question-and-answer dialogue about the input image, the information implied in the input image can be better understood in context; and by using the multi-modal encoding processing and multi-modal decoding processing, the actual output answer corresponding to the current round of questioning can be output more accurately according to multiple types of information, thereby improving the accuracy of the output answer.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of this application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a framework diagram of a visual dialogue system provided by an exemplary embodiment of this application;
FIG. 2 is a flowchart of a visual dialogue method provided by an exemplary embodiment of this application;
FIG. 3 is a structural framework diagram of a visual dialogue model provided by an exemplary embodiment of this application;
FIG. 4 is a flowchart of a visual dialogue method provided by another exemplary embodiment of this application;
FIG. 5 is a structural framework diagram of a visual dialogue model provided by another exemplary embodiment of this application;
FIG. 6 is a structural framework diagram of a multi-modal incremental conversion encoder provided by an exemplary embodiment of this application;
FIG. 7 is a structural framework diagram of a multi-modal incremental conversion encoder provided by another exemplary embodiment of this application;
FIG. 8 is a structural framework diagram of a multi-modal incremental conversion decoder provided by an exemplary embodiment of this application;
FIG. 9 is a structural framework diagram of a multi-modal incremental conversion decoder provided by another exemplary embodiment of this application;
FIG. 10 is a flowchart of a training method for a visual dialogue model provided by an exemplary embodiment of this application;
FIG. 11 is a structural block diagram of a visual dialogue apparatus provided by an exemplary embodiment of this application;
FIG. 12 is a structural block diagram of a training apparatus for a visual dialogue model provided by an exemplary embodiment of this application;
FIG. 13 is a schematic diagram of the apparatus structure of a server provided by an exemplary embodiment of this application.
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
首先,对本申请实施例涉及的名词进行介绍。
1)计算机视觉技术(Computer Vision,CV):是一门研究如何使机器“看”的科学,是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉处理,并做图形处理,使处理结果成为更适合人眼观察或传送给仪器检测的图像。计算机视觉技术作为一个计算机视觉研究相关的理论和技术科学学科,用于建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、光学字符识别(Optical Character Recognition,OCR)、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、三维技术(3-Dimension,3D)技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。本申请实施例中,通过计算机视觉技术对输入图像进行处理,并根据输入的问题输出答案,其中,输入的问题是与输入图像有关的问题。
2)视觉问答(Visual Question Answering,VQA):是一种涉及计算机视觉和自然语言处理(Natural Language Processing,NLP)两大领域的学习任务。向电子设备中输入一张图像和一个关于这张图像的形式自由(free-form)、开放式(opened)的自然语言的问题,输出为:产生的自然语言的回答。视觉问答过程中,电子设备通过获取图像的内容、问题的含义和意图以及相关的常识的信息,实现根据输入的图像和问题输出一个符合自然语言规则且合理的答案。
3)视觉对话（Visual Dialog）：是VQA的拓展领域，其主要任务为：对视觉内容，以自然语言的会话语言与人类进行有意义的对话。也就是说，给定图像、对话历史和关于图像的问题，电子设备将问题置于图像中，从对话历史中推断上下文，并准确地回答问题。与VQA不同的是，视觉对话通过一个可以组合多个信息源的编码器对多轮对话历史进行处理。
4)人工智能(Artificial Intelligence,AI):是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。也就是说,人工智能是计算机科学的一个综合技术,基于智能的实质,生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
随着人工智能技术研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。在本申请实施例中,将基于人工智能技术实现视觉对话。
本申请实施例提供的视觉对话方法可以应用于如下场景:
一、智能客服
在该应用场景下，采用本申请实施例提供的视觉对话方法所训练的视觉对话模型可应用于购物应用程序、团购应用程序、出行管理应用程序（如票务订购应用程序、酒店订购应用程序）等应用程序中。上述应用程序设置有智能客服，用户可通过与智能客服进行对话从而得到自己需要解决的问题的答案。智能客服是通过应用程序的后台服务器中构建的视觉对话模型实现的，视觉对话模型是预先经过训练的。当视觉对话模型接收到用户输入的问题时，视觉对话模型输出关于该问题的答案。比如，智能客服是购物应用程序的客服，用户提出的问题是关于输入图像中物品A的问题，该问题是：销售物品A的店铺有哪些？智能客服根据用户的提问输出答案：销售物品A的店铺为店铺1、店铺3以及店铺10。用户可根据输出的答案去浏览相应的店铺界面。
二、虚拟助理
在该应用场景下,采用本申请实施例提供的视觉对话方法所训练的视觉对话模型可应用于智能终端或智能家居等智能设备中。以智能终端中设置的虚拟助理为例,该虚拟助理是通过训练后的视觉对话模型实现的,该视觉对话模型是预先经过训练的。当视觉对话模型接收到用户输入的问题时,视觉对话模型输出关于该问题的答案。比如,用户A在社交平台上发布动态(该动态包括图像),该图像是用户A在海边度假的照片,虚拟助理提醒用户B(用户B与用户A是好友关系)用户A发布了新照片,用户B向虚拟助理提出问题:照片里面都有什么?虚拟助理输出答案:用户A在海边玩耍。则用户B可以自行选择是否进入用户A的社交平台界面浏览照片。
上述仅以两种应用场景为例进行说明，本申请实施例提供的视觉对话方法还可以应用于其他需要视觉对话的场景（比如，为视力有损伤的人士讲解图片的场景等等），本申请实施例并不对具体应用场景进行限定。
本申请实施例提供的视觉对话方法和视觉对话模型的训练方法可以应用于具有较强的数据处理能力的电子设备中。在一种可能的实施方式中,本申请实施例提供的视觉对话方法和视觉对话模型的训练方法可以应用于个人计算机、工作站或服务器中,即可以通过个人计算机、工作站或服务器实现视觉对话以及训练视觉对话模型。
而对于训练后的视觉对话模型，可以实施为应用程序的一部分，并被安装在终端中，如此，终端在接收到与输入图像有关的问题时，能够输出该问题对应的答案；或者，该训练后的视觉对话模型设置在应用程序的后台服务器中，以便安装有应用程序的终端借助后台服务器实现与用户进行视觉对话的功能。
请参考图1,图1示出了本申请一个示例性实施例提供的视觉对话系统的示意图。该视觉对话系统100包括电子设备110和服务器120,其中,电子设备110与服务器120之间通过通信网络进行数据通信,可选地,通信网络可以是有线网络也可以是无线网络,且该通信网络可以是局域网、城域网以及广域网中的至少一种。
电子设备110中安装有支持视觉对话功能的应用程序,该应用程序可以是虚拟现实应用程序(Virtual Reality,VR)、增强现实应用程序(Augmented Reality,AR)、游戏应用程序、图片相册应用程序、社交应用程序等,本申请实施例对此不作限定。
在本申请实施例中,电子设备110可以是智能手机、智能手表、平板电脑、膝上便携式笔记本电脑、智能机器人、车载设备等移动终端,也可以是台式电脑、投影式电脑、智能电视等终端,本申请实施例对电子设备的类型不作限定。
服务器120可以实施为一台服务器,也可以实施为一组服务器构成的服务器集群,以及可以是物理服务器,也可以实现为云服务器。在一种可能的实施方式中,服务器120是电子设备110中应用程序的后台服务器。
如图1所示，在本申请实施例中，电子设备110中运行有聊天应用程序，用户可通过与聊天应用程序的聊天助手聊天获取输入图像中的信息。示意性地，输入图像11是通过电子设备110预先输入至服务器120中的图像，或者，输入图像11是服务器120中预先存储的图像。用户在聊天助手的聊天界面中输入与该输入图像有关的问题，电子设备110将问题发送至服务器120中，服务器120设置有训练后的视觉对话模型10，训练后的视觉对话模型10根据输入的问题输出答案，并将答案发送至电子设备110中，在电子设备110上显示有聊天助手关于该问题的答案。比如，用户提出问题：坐在车里的是女生吗？训练后的视觉对话模型10根据前几轮的历史问答对话（问题：图像中有几个人呢？答案：4个人）确定用户提出的问题询问的是输入图像中位于车内的人的性别；由于车内的人的性别为男性，则输出答案：不是。
示意性地,服务器120中预先存储有前n轮历史问答对话对应的状态向量12(n为正整数),训练后的视觉对话模型10在获取到输入图像11的图像特征111和当前轮提问的问题特征13时,结合前n轮历史问答对话对应的状态向量12,输出当前轮提问对应的状态向量14。训练后的视觉对话模型10根据输入图像11的图像特征111、当前轮提问对应的状态向量14和已输出的前x个字符串的特征15,得到输出答案16中的第x+1个字符串,x为正整数。
在本申请的一些实施例中,服务器120中可预先存储有前n轮历史问答对话,视觉对话模型从前n轮历史问答对话中提取对应的状态向量。
视觉对话模型在训练时需要结合输入图像样本的图像特征样本、当前轮提问样本对应的状态向量样本和当前轮提问样本对应的真实答案的答案特征进行训练。比如,当前轮提问样本的真实答案包括5个词语(字符串),视觉对话模型输出答案时是按照每次输出一个词语的规则输出每轮提问的实际输出答案样本。当视觉对话模型输出第3个词语时,视觉对话模型结合真实答案中的第1个词语、第2个词语以及当前轮提问对应的状态向量输出第3个词语,并基于真实答案与实际输出答案样本的差异训练出视觉对话模型。
为了方便表述,下述以视觉对话模型的训练方法和视觉对话方法由服务器执行为例进行说明。
图2示出了本申请一个示例性实施例提供的视觉对话方法的流程图。本申请实施例以视觉对话方法用于如图1所示的视觉对话系统100中的服务器120为例进行说明，该视觉对话方法包括如下步骤：
步骤201,获取输入图像的图像特征和前n轮历史问答对话对应的状态向量,n为正整数。
在本申请实施例中,服务器对输入图像的特征进行提取,也就获得了输入图像的图像特征;而前n轮历史问题对话对应的状态向量是上一轮的输出,从而,服务器能够从上一轮的输出中,获得前n轮历史问答对话对应的状态向量。
示意性地,服务器中构建有视觉对话模型,该视觉对话模型是经过训练获得的,即是训练后的视觉对话模型;通过视觉对话模型获取输入图像,该输入图像可以是服务器预先存储的图像,还可以是用户通过终端上传至服务器的图像(包括终端存储的图像和终端拍摄的图像中的至少一种),又可以是现有的图像集中的图像,本申请实施例对图像的类型不作限定。
视觉对话模型从输入图像中提取图像特征,在本申请的一些实施例中,视觉对话模型包括特征提取模型,通过特征提取模型从输入图像中提取图像特征。
一轮历史问答对话是指以用户提出一个问题开始,视觉对话模型输出关于该问题的答案结束,一问一答形成一轮历史问答对话。
示意性地，n轮历史问答对话是关于同一输入图像的历史问答对话。服务器将关于同一输入图像的n轮历史问答对话与该输入图像建立对应关系，当用户提出的问题是关于该输入图像的问题时，视觉对话模型将获取与该输入图像有关的前n轮历史问答对话。在一个示例中，用户提出的问题是关于图像1的，视觉对话模型获取与图像1对应的$n_1$轮历史问答对话，然后用户又提出关于图像2的问题，视觉对话模型获取与图像2对应的$n_2$轮历史问答对话，$n_1$和$n_2$均为正整数。
示意性地，如图3所示，视觉对话模型包括编码器21，编码器21包括多个多模态增量式转换编码器（Multimodal Incremental Transformer Encoder，MITE）211，针对每轮历史问答对话设置有对应的MITE 211，在每轮历史问答对话对应的MITE 211输出本轮历史问答对话对应的状态向量时，还需要结合输入图像11的图像特征、本轮历史问答对话的历史问答特征以及上一轮历史问答对话对应的MITE 211输出的状态向量作为输入，得到每轮历史问答对话对应的状态向量。这里，针对第1轮对应的MITE 211，将输入图像11的图像特征、第1轮提问的问题特征作为输入，输出一个状态向量，并将输出的状态向量向后续轮传递，直至处理到当前轮；针对当前轮对应的MITE 211，将输入图像11的图像特征、当前轮提问的问题特征以及第n轮历史问答对话对应的MITE 211输出的状态向量作为输入，得到当前轮提问的状态向量。在本申请的一些实施例中，一轮历史问答对话对应的状态向量包括该轮历史问答对应的历史问答特征。
示意性地,服务器通过词嵌入操作(Word Embedding)将历史问答对话的文本映射为词向量,从而得到历史问答特征。
在本申请的一些实施例中,通过公式(1)获得一轮历史问答对话对应的状态向量,公式(1)为:
$c_n=\mathrm{MITE}(v_n,u_n,c_{n-1})$    (1)
其中，$c_n$表示MITE输出的第n轮历史问答对话对应的状态向量，$v_n$表示输入图像的图像特征，$u_n$表示第n轮历史问答对话的历史问答特征（从历史问答对话的文本中提取），$c_{n-1}$表示第n-1轮历史问答对话对应的状态向量。
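为便于理解公式(1)的逐轮递推过程，下面给出一个最小化的示意性代码草图（其中MITEPlaceholder仅为占位模块，类名、维度等均为示例假设，并非本申请的正式实现）。

```python
import torch
import torch.nn as nn

class MITEPlaceholder(nn.Module):
    """占位的多模态增量式转换编码器：仅示意输入输出接口，内部结构见后文。"""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim * 3, dim)

    def forward(self, v, u, c_prev):
        # v: 图像特征 (R, dim)，u: 本轮问答/问题特征 (L, dim)，c_prev: 上一轮状态向量 (1, dim)
        fused = torch.cat([v.mean(0), u.mean(0), c_prev.mean(0)], dim=-1)
        return self.proj(fused).unsqueeze(0)          # 本轮对应的状态向量

def encode_rounds(v, round_features, dim: int = 512):
    """按公式(1)逐轮递推：c_n = MITE(v, u_n, c_{n-1})，为简洁起见各轮复用同一组参数（示例假设）。"""
    mite = MITEPlaceholder(dim)
    c = torch.zeros(1, dim)                           # 初始状态向量（示例假设）
    for u in round_features:                          # 前n轮问答特征与当前轮问题特征依次输入
        c = mite(v, u, c)
    return c                                          # 最后一轮的输出即当前轮提问对应的状态向量

# 用法示例：c_cur = encode_rounds(torch.randn(36, 512), [torch.randn(10, 512) for _ in range(4)])
```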
步骤202,获取当前轮提问的问题特征。
在本申请实施例中，服务器提取当前轮提问对应的文本的特征，也就获得了当前轮提问的问题特征；这里，服务器可以通过视觉对话模型从当前轮提问的文本中提取问题特征。
本申请实施例以当前轮提问的问题特征包括问题中涉及的词向量和词向量的位置为例进行说明。
示意性地,服务器先通过词嵌入操作对当前轮提问的文本中的每个字符串进行映射,得到每个字符串的词向量,从而得到当前轮提问的文本对应的词向量。接着,服务器通过位置编码(Positional Encoding)使得当前轮提问的文本中的每个字符串按照一定的顺序进行编码,来获得当前轮提问的文本对应的每个词向量的位置;其中,位置编码包括绝对位置编码和相对位置编码。从而,服务器通过视觉对话模型获取到的问题特征包括词向量和每个词向量在句子中的位置。
可以理解的是,步骤201和步骤202可以同步实施,或者,步骤201先实施,步骤202后实施,或者,步骤202先实施,步骤201后实施;也就是说,步骤201和步骤202在执行顺序上不分先后。
步骤203,对图像特征、前n轮历史问答对话对应的状态向量和问题特征进行多模态编码处理,得到当前轮提问对应的状态向量。
在本申请实施例中,服务器综合图像特征、前n轮历史问答对话对应的状态向量和问题特征进行多模态编码处理,所获得的结果即与当前轮提问对应的状态向量。这里,服务器可以通过视觉对话模型执行多模态编码处理。
示意性地,视觉对话模型包括针对每轮历史问答对话设置的各自对应的MITE 211,针对当前轮提问,也存在对应的MITE 211。
示意性地,服务器将图像特征、第一轮历史问答对话的历史问答特征作为第一轮历史问答对话对应的MITE 211的输入,输出第一轮历史问答对话对应的状态向量;服务器将第一轮历史问答对话对应的状态向量、第二轮历史问答对话的历史问答特征和图像特征输入第二轮历史问答对话对应的MITE 211,输出第二轮历史问答对话对应的状态向量;以此类推。当前轮提问为第n+1轮,则将第n轮历史问答对话对应的状态向量(第n轮历史问答对话对应的MITE 211的输出)、图像特征和第n+1轮提问的问题特征输入至第n+1轮提问对应的MITE 211中,输出第n+1轮提问对应的状态向量。
步骤204,对当前轮提问对应的状态向量和图像特征进行多模态解码处理,得到当前轮提问对应的实际输出答案。
在本申请实施例中,服务器对获得的当前轮提问对应的状态向量和图像特征进行解码处理,所获得的解码结果即与当前轮提问对应的实际输出答案。其中,解码处理是一种多模态解码处理。这里,服务器可以通过视觉对话模型执行多模态解码处理。
示意性地，继续参见图3，如图3所示，视觉对话模型还包括解码器22，解码器22包括多模态增量式转换解码器（Multimodal Incremental Transformer Decoder，MITD）221，通过将MITE 211输出的当前轮提问对应的状态向量、图像特征和已输出的字符串（目标输入）对应的嵌入层的输出输入至MITD 221中，MITD 221的输出经过逻辑回归层得到当前轮提问对应的实际输出答案中的某个字符串（目标输出）。
比如,当前轮提问的问题为:“How are you?”时,MITD结合已输出的单词(字符串)“I”、“am”以及当前轮提问对应的状态向量输出“fine”这个单词。
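结合上述示例，按"每次输出一个字符串"的方式，解码过程可概括为如下自回归循环草图（decoder、vocab等名称与接口均为示例假设，并非本申请的正式实现）。

```python
import torch

def generate_answer(decoder, v, c_cur, vocab, bos_id=1, eos_id=2, max_len=20):
    """逐词生成当前轮提问的实际输出答案（示意实现）。
    假设 decoder(v, c_cur, prefix_ids) 返回下一个字符串在词表上的概率分布。"""
    prefix = [bos_id]
    for _ in range(max_len):
        probs = decoder(v, c_cur, torch.tensor(prefix))  # 结合图像特征、状态向量与已输出字符串
        next_id = int(probs.argmax())                    # 取概率最大的字符串作为目标输出
        if next_id == eos_id:
            break
        prefix.append(next_id)
    return [vocab[i] for i in prefix[1:]]                # 例如依次得到 "I"、"am"、"fine"
```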
可以理解的是,本申请实施例提供的视觉对话方法,通过获取关于输入图像的前n轮历史问答对话对应的状态向量,使得视觉对话模型能够联系上下文更好地理解图像中隐含的信息,利用多模态编码处理方式和多模态解码处理方式,使得视觉对话模型能够更好地根据多种类型的信息,输出当前轮提问对应的实际输出答案,提高视觉对话模型输出的答案的准确率,且保证输出的答案与问题和输入图像的一致性,提升视觉对话的效果。
图4示出了本申请另一个示例性实施例提供的视觉对话方法的流程图。本申请实施例以该视觉对话方法用于如图1所示的视觉对话系统100中的服务器120为例进行说明,该视觉对话方法包括如下步骤:
步骤401,获取输入图像的图像特征和前n轮历史问答对话对应的状态向量,n为正整数。
示意性地,输入图像是现有的图像集中的图像。视觉对话模型包括特征提取模型,特征提取模型是基于卷积神经网络构建的模型。比如,通过快速区域检测卷积神经网络(Fast Region-CNN,Fast R-CNN)提取输入图像中的图像特征,如下公式(2)所示:
$v=\text{FastR-CNN}(I)$    (2)
其中,v表示输入图像的图像特征,I表示输入图像,FastR-CNN()表示Fast R-CNN对应的处理。
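作为参考，下面给出一个提取图像特征的示意写法：原文使用Fast R-CNN提取区域特征，此处以预训练ResNet的网格特征近似代替（模型选择、输入尺寸等均为示例假设，并非本申请的正式实现）。

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def extract_image_features(image_path: str) -> torch.Tensor:
    """用预训练ResNet提取网格图像特征的示意实现（示例假设）。
    原文使用Fast R-CNN提取区域特征v，此处以每个网格位置近似代替一个“区域”。"""
    backbone = models.resnet50(weights="IMAGENET1K_V1")          # 较旧版本的torchvision可改用pretrained=True
    backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # 去掉全局池化与分类层
    backbone.eval()

    preprocess = T.Compose([
        T.Resize((448, 448)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)  # (1, 3, 448, 448)

    with torch.no_grad():
        fmap = backbone(x)                    # (1, 2048, 14, 14)
    v = fmap.flatten(2).transpose(1, 2)       # (1, 196, 2048)，每个网格位置视为一个区域特征
    return v
```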
如图5所示,视觉对话模型的编码器21包括多个MITE 211,每轮历史问答对话对应一个MITE 211,上一轮历史问答对话对应的状态向量将作为输入,输入至下一轮历史问答对话对应的MITE 211中。以此类推,直至获取到当前轮提问的上一轮历史问答对话对应的状态向量。与图3不同的是,MITE 211的输入还包括图像描述17。
在本申请的一些实施例中，输入图像还对应有图像描述（caption），图像描述用于描述输入图像中的实体以及实体之间的关系，将图像描述也作为MITE 211的输入，有利于视觉对话模型更好地提取输入图像隐含的信息。比如，输入图像11对应有图像描述：四个人的自驾游旅行。
步骤402,获取当前轮提问的问题特征。
示意性地，特征提取模型还用于从当前轮提问中提取问题特征。问题特征$u_{n+1}$通过如下公式(3)和(4)提取：
$u_{n+1}=[u_{n+1,1},u_{n+1,2},\ldots,u_{n+1,L}]\in\mathbb{R}^{L\times M}$   (3)
$u_{n+1,l}=w_{n+1,l}+\mathrm{PE}(l)$     (4)
其中，PE()为绝对位置编码函数对应的处理，$w_{n+1,l}$为当前轮提问中第l个字符串进行词嵌入操作后的词向量，$u_{n+1,l}$表示当前轮提问中第l个字符串的字符串特征，L表示当前轮提问中字符串的最大数量，M表示每个字符串特征的维度，$\mathbb{R}$表示实数域。
可以理解的是,上述绝对位置编码函数也可替换为相对位置编码函数。
在本申请的一些实施例中,n轮历史问答对话对应的历史问答特征也可通过上述公式(3)和(4)得到。
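下面给出公式(3)(4)所描述的"词嵌入+绝对位置编码"的一个示意实现（位置编码采用常见的正弦/余弦形式，词表大小与维度均为示例假设，原文并未限定PE函数的具体形式）。

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(length: int, dim: int) -> torch.Tensor:
    """绝对位置编码PE(l)的一种常见实现（正弦/余弦形式，示例假设，dim须为偶数）。"""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def question_features(token_ids: torch.Tensor, vocab_size: int = 30000, dim: int = 512):
    """对应公式(3)(4)：u_{n+1,l} = w_{n+1,l} + PE(l)。"""
    embedding = nn.Embedding(vocab_size, dim)
    w = embedding(token_ids)                           # (L, M) 词向量
    u = w + sinusoidal_pe(token_ids.size(0), dim)      # 加上每个词向量在句子中的位置信息
    return u                                           # (L, M) 问题特征
```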
步骤403,获取第i轮历史问答对话对应的状态向量,i为正整数且i的起始值为1。
在本申请实施例中，服务器先通过第1个MITE 211编码出第1轮历史问答对应的状态向量；再通过第2个MITE 211对图像特征、第1轮历史问答对应的状态向量和第2轮历史问答对话的问答特征进行多模态编码处理，得到第2轮历史问答对应的状态向量；如果第2轮不是当前轮，则继续基于上述处理获得第3轮历史问答对应的状态向量，如此迭代直至当前轮，获得第n+1轮历史问答对应的状态向量。这里，由于多模态增量式转换编码器与历史问答对话一一对应，当i为1时，第i轮历史问答对话对应的状态向量是第1个MITE 211对图像特征和第1轮历史问答对话的问答特征进行编码获得的；当i大于1时，第i轮历史问答对话对应的状态向量是第i个MITE 211对图像特征、第i轮历史问答对话的问答特征以及第i-1轮历史问答对话对应的状态向量进行编码获得的。其中，i为变量，取值为1至n中的任一个。
以当前轮提问所属的轮次为第n+1轮,如图5所示,第n+1轮历史问答对话对应的状态向量是由第n+1轮历史问答对话对应的MITE 211输出的。每一轮历史问答对话对应有一个MITE 211,当前轮提问也对应一个MITE 211。本申请实施例以至少存在一轮 历史问答对话为例进行说明。
步骤404,迭代i,调用视觉对话模型中的第i+1个多模态增量式转换编码器对图像特征、第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行多模态编码处理,得到第i+1轮历史问答对话对应的状态向量。
示意性地,服务器响应于第i+1轮为当前轮提问,通过第i+1轮历史问答对话对应的MITE 211输出当前轮提问对应的状态向量;服务器响应于第i+1轮为非当前轮提问,通过第i+1轮历史问答对话对应的MITE 211输出第i+1轮历史问答对话对应的状态向量。该第i+1轮历史问答对话对应的状态向量作为第i+2轮历史问答的输入。
在本申请实施例中,多模态增量式转换编码器包括K个子转换编码器,K为正整数,步骤404可替换为如下步骤:
步骤4041,获取第j个中间表示向量,第j个中间表示向量是对图像特征、第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行j次多模态编码处理得到的,第j个中间表示向量是第i+1轮历史问答对话对应的向量,j为正整数且j的起始值为1。
需要说明的是,j为变量,取值为1至K中的任一个。当j为1时,调用第1个子转换编码器对图像特征、第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行多模态编码处理,得到第1个中间编码向量(第j个中间表示向量);当j大于1时,调用第j个子转换编码器对图像特征、第i轮历史问答对话对应的状态向量和第j-1个中间表示向量进行多模态编码处理,得到第j个中间编码向量(第j个中间表示向量)。
如图6所示，每个MITE 211包括K个子转换编码器212，K为正整数，每个子转换编码器212用于执行一次多模态编码处理，从而，一轮历史问答对话共执行K次多模态编码处理。比如，针对第i+1轮历史问答对话对应的问答特征$u_{i+1}$（即第i+1轮历史问答对话通过嵌入层获得的输出结果），经过K次多模态编码处理，获得第i+1轮历史问答对话对应的状态向量$c_{i+1}$。
在本申请实施例中,每个MITE 211包括的子转换编码器的数量相同或不同,即各轮历史问答对话执行的多模态编码处理的次数相同或不同。
响应于第i+1轮为非当前轮提问，如图6所示，将图像特征$v$、第i轮历史问答对话对应的状态向量$c_i$和历史问答特征$u_{i+1}$（由第i+1轮历史问答对话经过嵌入层获得）输入至第i+1个MITE 211中的第1个子转换编码器212中，输出中间表示向量，将该中间表示向量、图像特征$v$和第i轮历史问答对话对应的状态向量$c_i$输入至第2个子转换编码器212中。以此类推，第j个子转换编码器212输出第j个中间表示向量，该第j个中间表示向量是第i+1轮历史问答对话对应的向量。继续利用子转换编码器212执行处理，直至获得第K个子转换编码器212输出的中间表示向量；这里，第K个子转换编码器212输出的中间表示向量为第i+1轮历史问答对话对应的状态向量$c_{i+1}$。
响应于第i+1轮为当前轮提问，将图像特征$v$、第i轮历史问答对话对应的状态向量$c_i$和当前轮提问的问题特征$u_{i+1}$输入至第i+1个MITE 211中的第1个子转换编码器212中，输出中间表示向量，将该中间表示向量、图像特征$v$和第i轮历史问答对话对应的状态向量$c_i$输入至第2个子转换编码器212中，以此类推，第j个子转换编码器212输出第j个中间表示向量，该第j个中间表示向量是当前轮提问对应的向量（尚非当前轮提问对应的状态向量）。
步骤4042，迭代j，调用第i+1个多模态增量式转换编码器中的第j+1个子转换编码器对第j个中间表示向量、图像特征和第i轮历史问答对话对应的状态向量进行多模态编码处理，得到第j+1个中间表示向量，第j+1个中间表示向量是第i+1轮历史问答对话对应的另一向量，j+1≤K。
在本申请实施例中，服务器将图像特征、第i轮历史问答对话对应的状态向量和第j个子转换编码器212输出的第j个中间表示向量输入至第j+1个子转换编码器212中，第j+1个子转换编码器输出第j+1个中间表示向量，该第j+1个中间表示向量也是第i+1轮历史问答对话对应的向量。
需要说明的是，若j+1<K，则第j+1个子转换编码器输出的第j+1个中间表示向量作为第j+2个子转换编码器的输入；若j+1=K，则第j+1个子转换编码器输出的第j+1个中间表示向量为第i+1轮历史问答对话对应的状态向量。
步骤4043,将迭代j得到的第K个中间表示向量确定为第i+1轮历史问答对话对应的状态向量。
在本申请实施例中,服务器将前一个子转换编码器输出的中间表示向量输入至下一个子转换编码器中。直至一轮问答对话(包括一轮历史问答对话和当前轮提问)对应的MITE中的K个子转换编码器均进行了多模态编码处理,输出一轮问答对话对应的状态向量。
需要说明的是，服务器调用第i+1个MITE中的第1个子转换编码器对图像特征、第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话的问答特征进行多模态编码处理，得到第1个中间表示向量；迭代j，调用第j+1个子转换编码器对图像特征、第i轮历史问答对话对应的状态向量和第j个中间表示向量进行多模态编码处理，得到第j+1个中间表示向量；其中，j为从1开始递增的正整数变量；将迭代j得到的第K个中间表示向量确定为第i+1轮历史问答对话对应的状态向量。
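结合步骤4041至步骤4043，一个MITE内K个子转换编码器的级联过程可用如下草图概括（SubEncoderPlaceholder为占位模块，其真实内部结构见下文针对图7的说明，类名与参数均为示例假设，并非本申请的正式实现）。

```python
import torch
import torch.nn as nn

class SubEncoderPlaceholder(nn.Module):
    """占位的子转换编码器：真实结构包含自注意力、跨模态注意力、历史注意力与前馈网络。"""
    def __init__(self, dim: int):
        super().__init__()
        self.ffn = nn.Linear(dim, dim)

    def forward(self, h, v, c_prev):
        return torch.relu(self.ffn(h))

class MITE(nn.Module):
    """由K个子转换编码器级联而成的多模态增量式转换编码器（接口示意，示例假设）。"""
    def __init__(self, dim: int = 512, K: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(SubEncoderPlaceholder(dim) for _ in range(K))

    def forward(self, v, c_prev, u):
        # v: 图像特征，c_prev: 第i轮历史问答对话对应的状态向量，u: 第i+1轮问答特征
        h = u                              # 第1个子转换编码器直接以问答特征为输入
        for layer in self.layers:          # 第j+1个子转换编码器以第j个中间表示向量为输入
            h = layer(h, v, c_prev)
        return h                           # 第K个中间表示向量即该轮对应的状态向量
```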
步骤405,将迭代i得到的第n+1轮历史问答对话对应的状态向量确定为当前轮提问对应的状态向量。
需要说明的是,每轮历史问答对话对应一个MITE 211,每个MITE 211输出的是每轮历史问答对话对应的状态向量,前一个MITE 211输出的状态向量作为下一个MITE 211的输入,直到输入至第n+1轮提问对应的MITE 211中,服务器通过第n+1轮提问对应的MITE 211输出当前轮提问对应的状态向量。
在本申请实施例中，服务器调用视觉对话模型中的第1个MITE对图像特征和第1轮历史问答对话的问答特征进行多模态编码处理，得到第1轮历史问答对话对应的状态向量；迭代i，调用第i+1个MITE对图像特征、第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行多模态编码处理，得到第i+1轮历史问答对话对应的状态向量；其中，i为从1开始递增的正整数变量；将迭代i得到的第n+1轮历史问答对话对应的状态向量确定为当前轮提问对应的状态向量。
步骤406,调用视觉对话模型中的多模态增量式转换解码器,获取当前轮提问对应的实际输出答案中已输出的字符串的字符串特征。
如图5所示，视觉对话模型包括多模态增量式转换解码器（MITD模型）221，用于解码出组成答案的字符串。示意性地，当前轮提问为：“How are you?”，实际输出答案为：“I am OK”。多模态增量式转换解码器221正在输出的字符串是“OK”，则向多模态增量式转换解码器中输入单词“I”和“am”。
在本申请实施例中,字符串特征可通过特征提取模型从已输出的答案对应的答案文本中提取。
步骤407,调用多模态增量式转换解码器对当前轮提问对应的状态向量、图像特征和字符串特征进行多模态解码处理,得到解码特征向量。
步骤408,根据解码特征向量确定当前轮提问对应的实际输出答案,其中,实际输出答案包括已输出的字符串。
在本申请实施例中,服务器通过向MITD 221输入已输出的字符串,并结合当前轮提问对应的状态向量和图像特征输出当前轮提问对应的实际输出答案中的一个字符串。
在本申请实施例中，多模态增量式转换解码器包括T个子转换解码器，T为正整数，上述步骤407可替换为如下步骤：
步骤4071,获取第m个中间表示向量,第m个中间表示向量是对当前轮提问对应的状态向量、图像特征和字符串特征进行m次多模态解码处理得到的,m为正整数且m的起始值为1。
需要说明的是,m为变量,取值为1至T中的任一个。当m为1时,利用第1个子转换解码器对当前轮提问对应的状态向量、图像特征和字符串特征进行多模态解码处理,得到第1个中间解码向量(第m个中间表示向量);当m大于1时,利用第m个子转换解码器对当前轮提问对应的状态向量、图像特征和第m-1个中间解码向量(第m-1个中间表示向量)进行多模态解码处理,得到第m个中间解码向量(第m个中间表示向量)。
如图5所示,图5中的MITD 221包括T个子转换解码器222,每个子转换解码器222用于执行一次多模态解码处理,从而,一个MITD 221对输入的向量执行T次多模态解码处理。
在本申请的一些实施例中,视觉对话模型包括一个或多个MITD 221,本申请实施例以视觉对话模型包括一个MITD 221为例进行说明。
将图像特征$v$、字符串特征和当前轮提问对应的MITE 211输出的当前轮提问对应的状态向量$c_{n+1}$输入至MITD中的第1个子转换解码器222中，输出中间表示向量，将该中间表示向量、图像特征$v$和字符串特征输入至第2个子转换解码器222中。以此类推，第m个子转换解码器222输出第m个中间表示向量，该第m个中间表示向量是当前轮提问对应的向量。
步骤4072,迭代m,调用多模态增量式转换解码器中的第m+1个子转换解码器对第m个中间表示向量、图像特征和当前轮提问对应的状态向量进行多模态解码处理,得到第m+1个中间表示向量,m+1≤T。
在本申请实施例中,服务器将第m个子转换解码器输出的第m个中间表示向量输入至第m+1个子转换解码器222中,第m+1个子转换解码器输出第m+1个中间表示向量,该第m+1个中间表示向量也是当前轮提问对应的向量。
需要说明的是，若m+1<T，则第m+1个子转换解码器输出的第m+1个中间表示向量作为第m+2个子转换解码器的输入；若m+1=T，则第m+1个子转换解码器输出的第m+1个中间表示向量为当前轮提问对应的解码特征向量，根据解码特征向量可确定输出的字符串。
步骤4073,将迭代m得到的第T个中间表示向量确定为解码特征向量。
在本申请实施例中,服务器将前一个子转换解码器输出的中间表示向量输入至下一个子转换解码器中,直到MITD中的T个子转换解码器均进行了多模态解码处理,输出当前轮提问对应的解码特征向量,该解码特征向量用于确定实际输出答案。
需要说明的是，服务器调用MITD中的第1个子转换解码器对图像特征、当前轮提问对应的状态向量和字符串特征进行多模态解码处理，得到第1个中间解码向量；迭代m，调用第m+1个子转换解码器对图像特征、当前轮提问对应的状态向量和第m个中间解码向量（第m个中间表示向量）进行多模态解码处理，得到第m+1个中间解码向量（第m+1个中间表示向量）；其中，m为从1开始递增的正整数变量；将迭代m得到的第T个中间解码向量（第T个中间表示向量）确定为解码特征向量。
可以理解的是,本申请实施例提供的视觉对话方法,通过获取关于输入图像的前n 轮历史问答对话对应的状态向量,使得视觉对话模型能够联系上下文更好地理解输入图像中隐含的信息,利用多模态编码处理方式和多模态解码处理方式,使得视觉对话模型能够更好地根据多种类型的信息,输出当前轮提问对应的实际输出答案,提高视觉对话模型输出的答案的准确率,且保证输出的答案与问题和输入图像的一致性,提升视觉对话的效果。
还可以理解的是,服务器通过视觉对话模型中的多模态增量式转换编码器对每一轮历史问答对话对应的状态向量进行多模态编码处理,以此类推,从而得到当前轮提问对应的状态向量,使得后续经过多模态解码处理后得到的输出答案更加准确。
还可以理解的是,服务器通过在每个多模态增量式转换编码器中设置K个子转换编码器,该K个子转换编码器之间依次将前一个子转换编码器输出的中间表示向量传递至下一个子转换编码器中,从而得到当前轮提问对应的状态向量,使得后续进行解码处理得到的输出答案更加准确。本申请实施例通过层状结构能够为后续输出答案提供准确的中间表示向量。
还可以理解的是,服务器通过视觉对话模型中的多模态增量式转换解码器对多模态增量式转换编码器输出的状态向量进行解码处理,从而使得视觉对话模型能够准确输出当前轮提问对应的实际输出答案。
还可以理解的是,服务器通过多模态增量式转换解码器中设置的T个子转换解码器,该T个子转换解码器之间依次将前一个子转换解码器输出的中间表示向量传递至下一个子转换解码器中,从而得到当前轮提问对应的实际输出答案。本申请实施例通过层状结构能够保证视觉对话模型输出的答案的准确率。
下面分别对子转换编码器和子转换解码器的内部结构进行说明。
图7示出了本申请一个示例性实施例提供的子转换编码器的结构示意图。一个子转换编码器212包括自注意力层(Self-Attention)213、跨模态注意力层(Cross-Modal Attention)214、历史注意力层(History Attention)215和前馈神经网络层(Feedforward Neural Network,FNN)216。K表示一个MITE 211包括K个子转换编码器212,即包括K个自注意力层213、K个跨模态注意力层214、K个历史注意力层215和K个前馈神经网络层216。
示意性地,以第j+1个子转换编码器的输入输出过程为例进行说明,该子转换编码器的输入输出过程如下:
步骤1、调用第i+1个多模态增量式转换编码器中的第j+1个子转换编码器对第j个中间表示向量进行中间编码处理,得到第一子向量。
在本申请实施例中,服务器将第j个子转换编码器输出的第j个中间表示向量输入至第j+1个子转换编码器的自注意力层213中,输出第一子向量。
示例性地，获取第一子向量的过程可通过公式(5)实现，公式(5)如下：
$A^{(j+1)}=\mathrm{MultiHead}(C_j,C_j,C_j)$      (5)
其中，$A^{(j+1)}$表示第一子向量，$C_j$表示前一个子转换编码器（第j个子转换编码器）输出的第j个中间表示向量，MultiHead()表示多头注意力机制对应的处理。
可以理解的是,第j个子转换编码器输出的第j个中间表示向量是第j个子转换编码器的前馈神经网络层输出的。
步骤2、对第一子向量和图像特征进行中间编码处理,得到第二子向量。
在本申请实施例中,服务器将第一子向量输入至跨模态注意力层214中,同时输入图像的图像特征v,输出第二子向量。
示例性地,获取第二子向量的过程可通过公式(6)实现,公式(6)如下:
$B^{(j+1)}=\mathrm{MultiHead}(A^{(j+1)},v,v)$    (6)
其中，$B^{(j+1)}$表示第二子向量。
步骤3、对第二子向量和第i轮历史问答对话对应的状态向量进行中间编码处理,得到第三子向量。
在本申请实施例中,服务器将第二子向量输入至历史注意力层215中,同时输入第i轮历史问答对话对应的状态向量(即第i轮历史问答对话对应的MITE输出的状态向量),输出第三子向量。
示例性地,获取第三子向量的过程可通过公式(7)实现,公式(7)如下:
$F^{(j+1)}=\mathrm{MultiHead}(B^{(j+1)},c_i,c_i)$     (7)
其中，$F^{(j+1)}$表示第三子向量，$c_i$表示第i轮历史问答对话对应的状态向量。
步骤4、对第三子向量进行中间编码处理,得到第j+1个中间表示向量。
在本申请实施例中,服务器将第三子向量输入至前馈神经网络层216,输出与第j+1个子转换编码器对应的第j+1个中间表示向量。
示例性地，获取第j+1个中间表示向量的过程可通过公式(8)实现，公式(8)如下：
$C^{(j+1)}=\mathrm{FFN}(F^{(j+1)})$     (8)
其中，$C^{(j+1)}$表示第j+1个中间表示向量，FFN()表示前馈神经网络层对应的处理。
需要说明的是，若第j+1个子转换编码器是第i+1轮历史问答对话的MITE中的最后一个子转换编码器（即j+1=K），则输出第i+1轮历史问答对话对应的状态向量。
示例性地,获取第i+1轮历史问答对话对应的状态向量的过程可通过公式(9)实现,公式(9)如下:
$c_{i+1}=C^{(j+1)}$    (9)
其中，$c_{i+1}$表示第i+1轮历史问答对话对应的状态向量。
需要说明的是,若第j+1个子转换编码器不是MITE模型中的最后一个子转换编码器(即j+1<K),则输出中间表示向量,该中间表示向量将作为第j+2个子转换编码器的输入,以此类推,直到最后一个子转换编码器输出第i+1轮历史问答对话对应的状态向量。
需要说明的是,每一个MITE对应一轮历史问答对话,当前轮提问对应的MITE将上一轮历史问答对话对应的状态向量、问题特征和图像特征作为输入,输入至当前轮提问对应的MITE 211中的第一个子转换编码器212中的自注意力层213中,重复上述步骤,直到输出当前轮提问对应的状态向量。
可以理解的是,本申请实施例的视觉对话方法,通过子转换编码器中设置的多层结构,分别计算各个中间表示向量,使得每一个子转换编码器均能根据前一个子转换编码器准确输出中间表示向量,从而保证后续得到当前轮提问对应的状态向量是准确的。
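结合公式(5)至公式(9)，一个子转换编码器的内部计算可用如下PyTorch草图表示（为突出主干，省略了残差连接与层归一化等细节，多头注意力的超参数均为示例假设，并非本申请的正式实现）。

```python
import torch
import torch.nn as nn

class SubTransformerEncoder(nn.Module):
    """子转换编码器：自注意力 → 跨模态注意力 → 历史注意力 → 前馈神经网络（示意实现）。
    所有输入张量形状均为 (batch, seq, dim)。"""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.hist_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, c_j, v, c_i):
        # c_j: 第j个中间表示向量，v: 图像特征，c_i: 第i轮历史问答对话对应的状态向量
        a, _ = self.self_attn(c_j, c_j, c_j)   # 公式(5)：A = MultiHead(C_j, C_j, C_j)
        b, _ = self.cross_attn(a, v, v)        # 公式(6)：B = MultiHead(A, v, v)
        f, _ = self.hist_attn(b, c_i, c_i)     # 公式(7)：F = MultiHead(B, c_i, c_i)
        return self.ffn(f)                     # 公式(8)：C_{j+1} = FFN(F)
```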
图8示出了本申请一个示例性实施例提供的子转换解码器的结构示意图。一个子转换解码器222包括自注意力层（Self-attention）223、门控跨模态注意力层（Gated Cross Attention，GCA）224和前馈神经网络层（Feedforward Neural Network，FNN）225。T表示一个MITD 221包括T个子转换解码器222，即包括T个自注意力层223、T个门控跨模态注意力层224和T个前馈神经网络层225。一个子转换解码器222的输入包括输入图像的图像特征v、第n+1轮提问（即当前轮提问）对应的状态向量、第n+1轮提问的问题特征和目标输入。
示意性地,以第m+1个子转换解码器的输入输出过程为例进行说明,该子转换解码器的输入输出过程如下:
步骤11、调用所述多模态增量式转换解码器中的第m+1个子转换解码器对第m个中间表示向量进行中间解码处理,得到第三子向量。
在本申请实施例中，服务器将第m个子转换解码器输出的第m个中间表示向量输入至第m+1个子转换解码器的自注意力层223中，输出第三子向量。
示例性地,获取第三子向量的过程可通过公式(10)实现,公式(10)如下:
$J^{(m+1)}=\mathrm{MultiHead}(R_m,R_m,R_m)$     (10)
其中，$J^{(m+1)}$表示第三子向量，$R_m$表示前一个子转换解码器（第m个子转换解码器）输出的第m个中间表示向量，MultiHead()表示多头注意力机制。
需要说明的是，第1个子转换解码器之前无子转换解码器的输出作为输入，将目标输入$R_0$（即实际输出答案的答案特征）输入至第1个子转换解码器中。在视觉对话模型的实际使用过程中，目标输入是已输出的前x个字符串的字符串特征；在视觉对话模型的训练过程中，目标输入是真实答案中与已输出的前x个字符串对应的字符串的字符串特征。
可以理解的是,第m个子转换解码器输出的第m个中间表示向量是第m个子转换解码器的前馈神经网络层输出的。
步骤12、对第三子向量、图像特征和当前轮提问对应的状态向量进行中间解码处理,得到第四子向量。
在本申请实施例中，服务器将第三子向量输入至门控跨模态注意力层224中，同时输入当前轮提问对应的状态向量和图像特征，输出第四子向量。
如图9所示，门控跨模态注意力层224中，跨模态注意力层226-1（Cross-modal Attention）接收当前轮提问（第n+1轮）对应的状态向量$c_{n+1}$，并根据第三子向量$J^{(m+1)}$和当前轮提问对应的状态向量$c_{n+1}$输出向量$E^{(m+1)}$；如公式(11)所示：
$E^{(m+1)}=\mathrm{MultiHead}(J^{(m+1)},c_{n+1},c_{n+1})$     (11)
继续参见图9，跨模态注意力层226-2接收图像特征$v$，输出向量$G^{(m+1)}$；如公式(12)所示：
$G^{(m+1)}=\mathrm{MultiHead}(J^{(m+1)},v,v)$    (12)
需要说明的是，由于图9中的跨模态注意力层226-1和跨模态注意力层226-2是相同的，所以两侧的计算过程可以调换，即左侧的跨模态注意力层226-1输出向量$G^{(m+1)}$，跨模态注意力层226-2输出向量$E^{(m+1)}$。以图9中无标注的矩形表示输出的向量（$E^{(m+1)}$和$G^{(m+1)}$），矩形仅为示意，不代表实际输出的特征向量的大小和个数。
继续参见图9，跨模态注意力层226-1输出的向量$E^{(m+1)}$，通过全连接层（Fully Connected Layers，FC）227-1输出向量$\alpha^{(m+1)}$；如公式(13)所示：
$\alpha^{(m+1)}=\sigma(W_E[J^{(m+1)},E^{(m+1)}]+b_E)$    (13)
其中，$E^{(m+1)}$表示跨模态注意力层226-1输出的向量，$\sigma$表示逻辑回归函数（Sigmoid），$W_E$、$b_E$表示全连接层227-1的参数。
继续参见图9，跨模态注意力层226-2输出的向量$G^{(m+1)}$，通过全连接层227-2输出向量$\beta^{(m+1)}$，如公式(14)所示：
$\beta^{(m+1)}=\sigma(W_G[J^{(m+1)},G^{(m+1)}]+b_G)$    (14)
其中，$G^{(m+1)}$表示跨模态注意力层226-2输出的向量，$\sigma$表示逻辑回归函数，$W_G$、$b_G$表示全连接层227-2的参数。
最后，结合上述计算结果，利用哈达玛积（Hadamard Product）计算第四子向量$P^{(m+1)}$并输出，如公式(15)所示：
$P^{(m+1)}=\alpha^{(m+1)}\circ E^{(m+1)}+\beta^{(m+1)}\circ G^{(m+1)}$     (15)
其中，$\circ$表示哈达玛积。
需要说明的是，由于全连接层227-1和全连接层227-2是相同的，所以两侧的计算过程可以调换，即全连接层227-2输出向量$\alpha^{(m+1)}$，全连接层227-1输出向量$\beta^{(m+1)}$。
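结合公式(11)至公式(15)，门控跨模态注意力层（GCA）的计算可用如下草图表示（同样省略了残差连接等细节，超参数均为示例假设，并非本申请的正式实现）。

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """门控跨模态注意力层（GCA）的示意实现，输入张量形状为 (batch, seq, dim)。"""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn_state = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_e = nn.Linear(2 * dim, dim)
        self.gate_g = nn.Linear(2 * dim, dim)

    def forward(self, j, v, c_cur):
        # j: 自注意力层输出的第三子向量，v: 图像特征，c_cur: 当前轮提问对应的状态向量
        e, _ = self.attn_state(j, c_cur, c_cur)                     # 公式(11)
        g, _ = self.attn_image(j, v, v)                             # 公式(12)
        alpha = torch.sigmoid(self.gate_e(torch.cat([j, e], -1)))   # 公式(13)
        beta = torch.sigmoid(self.gate_g(torch.cat([j, g], -1)))    # 公式(14)
        return alpha * e + beta * g                                 # 公式(15)：哈达玛积加权融合
```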
步骤13、对第四子向量进行中间解码处理,得到第m+1个中间表示向量。
在本申请实施例中，服务器将第四子向量输入至前馈神经网络层225中，输出第m+1次多模态解码处理对应的第m+1个中间表示向量，如公式(16)所示：
$R^{(m+1)}=\mathrm{FFN}(P^{(m+1)})$    (16)
其中，$R^{(m+1)}$表示第m+1个子转换解码器输出的第m+1个中间表示向量。
需要说明的是，若第m+1个子转换解码器是MITD中的最后一个子转换解码器，则完成对当前轮提问对应的状态向量、图像特征和字符串特征的多模态解码处理，得到解码特征向量$r_{n+1}$，如公式(17)所示：
$r_{n+1}=R^{(m+1)}$    (17)
需要说明的是，若第m+1个子转换解码器不是MITD模型中的最后一个子转换解码器，则输出中间表示向量，该中间表示向量将作为第m+2个子转换解码器的输入，以此类推，直到最后一个子转换解码器输出上述解码特征向量$r_{n+1}$。
在本申请实施例中,服务器根据解码特征向量得到实际输出答案中输出的字符串概率。
如图5所示，将MITD输出的解码特征向量输入至逻辑回归层，按公式(18)得到当前正在输出的字符串的概率（字符串概率）。
在本申请实施例中，服务器根据字符串概率输出实际输出答案中的字符串。这里，服务器可以通过视觉对话模型利用输出的字符串概率确定当前正在输出的字符串（目标输出）。
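解码特征向量经逻辑回归层映射到词表维度并归一化后，即得到当前正在输出的字符串的概率，一个示意写法如下（词表大小与线性层均为示例假设，并非本申请的正式实现）。

```python
import torch
import torch.nn as nn

def token_probability(r, vocab_size: int = 30000, dim: int = 512):
    """r: MITD输出的解码特征向量 r_{n+1}，形状 (1, dim)。"""
    output_layer = nn.Linear(dim, vocab_size)       # 逻辑回归层（示例假设）
    logits = output_layer(r)
    probs = torch.softmax(logits, dim=-1)           # 当前正在输出的字符串在词表上的概率分布
    return probs

# 用法示例：probs = token_probability(torch.randn(1, 512))，取 probs.argmax() 即目标输出
```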
可以理解的是,本申请实施例的视觉对话方法,通过子转换解码器中设置的多层结构,分别计算各个中间表示向量,使得每一个子转换解码器均能根据前一个子转换解码器准确输出中间表示向量,从而保证后续得到的当前轮提问对应的解码特征向量是准确的。从而保证根据解码特征向量输出的实际输出答案的准确性。
可以理解的是,本申请实施例中多模态增量式转换编码器和多模态增量式转换解码器中的注意力模型可替换为其他的注意力模型,比如传统的注意力模型、局部和全局注意力模型、多头注意力模型等。
下面对视觉对话模型的训练方法进行说明。
图10示出了本申请一个示例性实施例提供的视觉对话模型的训练方法的流程图。本申请实施例以该训练方法用于如图1所示的视觉对话系统100中的服务器120为例进行说明，该训练方法包括如下步骤：
步骤1001,获取输入图像样本的图像特征样本和前s轮历史问答对话样本对应的状态向量样本,s为正整数。
需要说明的是,训练视觉对话模型的训练样本包括输入图像样本,输入图像样本是现有的图像集中的图像。视觉对话模型包括特征提取模型,特征提取模型是基于卷积神经网络构建的模型。从而,服务器通过快速区域检测卷积神经网络提取输入图像样本中的特征,所提取到的特征即图像特征样本;或者,服务器通过卷积神经网络提取输入图像样本中的图像特征样本;再或者,服务器通过视觉几何组网络(Visual Geometry Group Network,VGG)提取输入图像样本中的图像特征样本;又或者,通过残差神经网络(ResNET)提取输入图像样本中的图像特征样本。其中,训练视觉对话模型的过程中,包括训练特征提取模型的过程,从而,特征提取模型是训练好的特征提取模型。
需要说明的是,步骤1001对应的实现描述与步骤401的实现描述类似。
步骤1002，获取当前轮提问样本的问题特征样本和当前轮提问样本对应的真实答案的第一答案特征。
在本申请实施例中,服务器可采用公式(3)和(4)来获取问题特征和第一答案特征。这里,问题特征和第一答案特征可以是通过视觉对话模型获得的。
步骤1003,调用视觉对话模型对图像特征样本、前s轮历史问答对话样本对应的状态向量样本和问题特征样本进行多模态编码处理,得到当前轮提问样本对应的状态向量样本。
在本申请实施例中,服务器通过视觉对话模型针对前s轮历史问答对话样本设置有s个多模态增量式转换编码器(MITE),针对当前轮提问样本设置有对应的MITE,前一个MITE输出的一轮历史问答对话样本对应的状态向量样本作为下一个MITE的输入。重复上述输出状态向量样本的过程,直到输出当前轮提问样本对应的状态向量样本。
步骤1004,调用视觉对话模型对当前轮提问样本对应的状态向量样本、图像特征样本和第一答案特征进行多模态解码处理,得到当前轮提问样本对应的实际输出答案样本的第二答案特征。
在本申请实施例中,视觉对话模型还包括多模态增量式转换解码器(MITD),服务器将当前轮提问样本对应的状态向量样本、图像特征样本和第一答案特征输入至MITD中。MITD模型包括T个子转换解码器,前一个子转换解码器输出的中间表示向量作为下一个子转换解码器的输入。重复上述输出中间表示向量的过程,直到输出当前轮提问样本对应的最终解码特征向量样本。该解码特征向量样本为当前轮提问样本对应的实际输出答案样本的第二答案特征。
需要说明的是,在视觉对话模型的训练过程中,服务器通过视觉对话模型获取真实答案中前q个字符串的字符串特征标签(第一答案特征),真实答案中前q个字符串与实际输出答案中已输出的q个字符串一一对应,q为正整数;根据当前轮提问样本对应的状态向量样本、图像特征样本和字符串特征标签得到实际输出答案中第q+1个字符串对应的第二答案特征。
比如，当前轮提问样本为：“How are you?”，且该当前轮提问样本的真实答案为：“I am fine。”，视觉对话模型实际输出答案样本为：“I am OK。”时，在视觉对话模型的训练过程中，当视觉对话模型准备输出实际输出答案样本中的第三个单词（字符串）时，通过向MITD中输入真实答案中的单词“I”、“am”以及当前轮提问样本对应的状态向量样本，从而视觉对话模型输出答案中的第三个单词：OK（或者good）。
步骤1005,根据第一答案特征和第二答案特征,对视觉对话模型进行训练,得到训练后的视觉对话模型。
在本申请实施例中,服务器根据第一答案特征和第二答案特征之间的差异对视觉对话模型进行训练。这里,训练后的视觉对话模型即步骤403中的视觉对话模型。
示例性地,结合实际输出单词“OK”和真实答案中的单词“fine”对视觉对话模型进行训练。
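训练时按上述方式以真实答案作为目标输入（即教师强制），并以真实答案与实际输出答案样本之间的差异构造损失；下面给出一个最小化的单步训练草图（model、optimizer的接口均为示例假设，并非本申请的正式实现）。

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, v, c_cur, answer_ids):
    """单步训练示意：answer_ids 为真实答案的字符串索引序列（示例假设）。
    假设 model(v, c_cur, prefix_ids) 返回下一个字符串在词表上的 logits。"""
    criterion = nn.CrossEntropyLoss()
    loss = 0.0
    for q in range(1, answer_ids.size(0)):
        prefix = answer_ids[:q]                          # 真实答案中前q个字符串（字符串特征标签）
        logits = model(v, c_cur, prefix)                 # 预测第q+1个字符串对应的第二答案特征
        loss = loss + criterion(logits.unsqueeze(0), answer_ids[q].unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()                                      # 基于真实答案与实际输出的差异更新模型
    optimizer.step()
    return float(loss)
```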
可以理解的是,本申请实施例的视觉对话方法,通过获取关于输入图像的前n轮历史问答对话对应的状态向量,使得训练后的视觉对话模型能够联系上下文更好地理解图像中隐含的信息,利用多模态编码处理方式和多模态解码处理方式,使得训练后的视觉对话模型能够更好地根据多种类型的信息,输出当前轮提问对应的实际输出答案,提高训练后的视觉对话模型输出的答案的准确率,且保证输出的答案与问题和输入图像的一致性,提升视觉对话的效果。
还可以理解的是,通过当前轮提问样本对应的状态向量样本、图像特征样本和真实答案对应的第一答案特征来训练得到视觉对话模型,使得训练后的视觉对话模型输出的 答案的准确率提高。
在本申请实施例中,当训练后的视觉对话模型准备输出第q+1个字符串时,视觉对话模型是根据真实答案中第q+1个字符串之前的所有的字符串和当前轮提问样本对应的状态向量样本来确定输出的第q+1个字符串是什么字符串,从而使得训练后的视觉对话模型输出的字符串准确率更高,从而保证输出的答案的正确率更高。
可以理解的是,视觉对话模型的训练方法和使用方法相似,在训练视觉对话模型时,通过对当前轮提问样本对应的状态向量样本和图像特征样本进行多模态解码处理,得到当前轮提问对应的实际输出答案的第二答案特征,结合真实答案的第一答案特征和第二答案特征对视觉对话模型进行训练。在实际使用视觉对话模型时,向视觉对话模型输入问题后,训练后的视觉对话模型根据已输出的字符串和当前轮提问对应的状态向量输出准备输出的字符串。
在本申请实施例中，调用视觉对话模型中的多模态增量式转换编码器获取第a轮历史问答对话样本对应的状态向量样本，a为正整数且a的起始值为1；迭代a，调用视觉对话模型中的第a+1个多模态增量式转换编码器对图像特征样本、第a轮历史问答对话样本对应的状态向量样本和第a+1轮历史问答对话样本对应的问答特征样本进行多模态编码处理，得到第a+1轮历史问答对话样本对应的状态向量样本，多模态增量式转换编码器与历史问答对话样本一一对应；将迭代a得到的第s+1轮历史问答对话样本对应的状态向量样本确定为当前轮提问样本对应的状态向量样本。
在本申请实施例中，多模态增量式转换编码器包括K个子转换编码器，K为正整数；获取第j个中间表示向量样本，第j个中间表示向量样本是对图像特征样本、第a轮历史问答对话样本对应的状态向量样本和第a+1轮历史问答对话样本对应的问答特征样本进行j次多模态编码处理得到的，第j个中间表示向量样本是第a+1轮历史问答对话样本对应的向量，j为正整数且j的起始值为1；迭代j，调用视觉对话模型中的第a+1个多模态增量式转换编码器中的第j+1个子转换编码器对第j个中间表示向量样本、图像特征样本和第a轮历史问答对话样本对应的状态向量样本进行多模态编码处理，得到第j+1个中间表示向量样本，第j+1个中间表示向量样本是第a+1轮历史问答对话样本对应的另一向量，j+1≤K；将迭代j得到的第K个中间表示向量样本确定为第a+1轮历史问答对话样本对应的状态向量样本。
在本申请实施例中,调用第a+1个多模态增量式转换编码器中的第j+1个子转换编码器对第j个中间表示向量样本进行中间编码处理,得到第一子向量样本;对第一子向量样本和图像特征样本进行中间编码处理,得到第二子向量样本;对第二子向量样本和第a轮历史问答对话样本对应的状态向量样本进行中间编码处理,得到第三子向量样本;对第三子向量样本进行中间编码处理,得到第j+1个中间表示向量样本。
在本申请实施例中,调用视觉对话模型中的多模态增量式转换解码器获取当前轮提问样本对应的实际输出答案样本中已输出的字符串的字符串特征样本;调用多模态增量式转换解码器对当前轮提问样本对应的状态向量样本、图像特征样本和字符串特征样本进行多模态解码处理,得到解码特征向量样本;根据解码特征向量样本确定当前轮提问对应的实际输出答案样本。
在本申请实施例中,多模态增量式转换解码器包括T个子转换解码器,T为正整数;获取第m个中间表示向量样本,第m个中间表示向量样本是对当前轮提问样本对应的状态向量样本、图像特征样本和字符串特征样本进行m次多模态解码处理得到的,m为正整数且m的起始值为1;迭代m,调用多模态增量式转换解码器中的第m+1个子转换解码器对第m个中间表示向量样本、图像特征样本和当前轮提问样本对应的状态向量 样本进行多模态解码处理,得到第m+1个中间表示向量样本,m+1≤T;将迭代m得到的第T个中间表示向量样本确定为解码特征向量样本。
在本申请实施例中,调用多模态增量式转换解码器中的第m+1个子转换解码器对第m个中间表示向量样本进行中间解码处理,得到第三子向量样本;对第三子向量样本、图像特征样本和当前轮提问样本对应的状态向量样本进行中间解码处理,得到第四子向量样本;对第四子向量样本进行中间解码处理,得到第m+1个中间表示向量样本。
在本申请实施例中,根据解码特征向量样本得到实际输出答案样本中输出的字符串预估概率;根据字符串预估概率输出实际输出答案样本中的字符串样本。
表1示出了视觉对话模型与基准模型对比下的训练效果,以不同类型的评价指标综合评价上述方法实施例中提供的视觉对话模型。
表1
（表1中给出了本申请实施例提供的视觉对话模型与基准模型在MRR、R@K和Mean等评价指标上的对比数据）
对于每个问题,视觉对话模型均会获取候选答案的列表,表1中的三种评估指标用于评价视觉对话模型检索答案的性能。
其中,MRR表示平均排序倒数(Mean Reciprocal Rank),将候选答案的列表进行排序,若正确答案排在第a位,则MRR的值为1/a。MRR的值越高代表视觉对话模型输出的答案准确率越高,即视觉对话模型的效果越好。
R@K表示排名前K的答案中存在的人类反应等级(Existence of the Human Response in Top-K Ranked Responses),R@K的值越高代表视觉对话模型输出的答案准确率越高,即视觉对话模型的效果越好。
Mean表示人类反应的平均等级，Mean的值越低代表视觉对话模型输出的答案的准确率越高，即视觉对话模型的效果越好。
由表1可知，本申请实施例提供的视觉对话模型在各项评价指标上均优于基准视觉对话模型（上述指标通常提高或降低1个点即为显著改进）。
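作为参考，上述三类检索式评价指标可按如下方式计算（输入为每个问题的正确答案在候选答案排序中的名次，函数接口为示例假设）。

```python
def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """ranks: 每个问题的正确答案在候选答案列表中的名次（从1开始）。"""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n                      # MRR：名次倒数的平均，越高越好
    r_at_k = {k: sum(r <= k for r in ranks) / n for k in ks}   # R@K：正确答案进入前K名的比例，越高越好
    mean_rank = sum(ranks) / n                                 # Mean：正确答案的平均名次，越低越好
    return mrr, r_at_k, mean_rank

# 示例：print(retrieval_metrics([1, 3, 2, 10]))
```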
图11是本申请一个示例性实施例提供的视觉对话装置的结构框图,该视觉对话装置11-1包括:
第一获取模块1110,配置为获取输入图像的图像特征和前n轮历史问答对话对应的状态向量,n为正整数;
所述第一获取模块1110,配置为获取当前轮提问的问题特征;
第一特征编码模块1120,配置为对图像特征、前n轮历史问答对话对应的状态向量和问题特征进行多模态编码处理,得到当前轮提问对应的状态向量;
第一特征解码模块1130,配置为对当前轮提问对应的状态向量和图像特征进行多模态解码处理,得到当前轮提问对应的实际输出答案。
在本申请实施例中，所述第一特征编码模块1120，还配置为获取第i轮历史问答对话对应的状态向量，i为正整数且i的起始值为1；迭代i，调用视觉对话模型中的第i+1个多模态增量式转换编码器对所述图像特征、所述第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行多模态编码处理，得到第i+1轮历史问答对话对应的状态向量，不同的所述多模态增量式转换编码器与不同的所述历史问答对话一一对应；将迭代i得到的第n+1轮历史问答对话对应的状态向量确定为所述当前轮提问对应的状态向量。
在本申请实施例中，多模态增量式转换编码器包括K个子转换编码器，K为正整数；所述第一特征编码模块1120，还配置为获取第j个中间表示向量，所述第j个中间表示向量是对所述图像特征、所述第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行j次多模态编码处理得到的，所述第j个中间表示向量是所述第i+1轮历史问答对话对应的向量，j为正整数且j的起始值为1；迭代j，调用所述视觉对话模型中的所述第i+1个多模态增量式转换编码器中的第j+1个子转换编码器对所述第j个中间表示向量、所述图像特征和所述第i轮历史问答对话对应的状态向量进行多模态编码处理，得到第j+1个中间表示向量，所述第j+1个中间表示向量是所述第i+1轮历史问答对话对应的另一向量，j+1≤K；将迭代j得到的第K个中间表示向量确定为所述第i+1轮历史问答对话对应的状态向量。
在本申请实施例中,所述第一特征编码模块1120,还配置为调用所述第i+1个多模态增量式转换编码器中的所述第j+1个子转换编码器对所述第j个中间表示向量进行中间编码处理,得到第一子向量;对所述第一子向量和所述图像特征进行中间编码处理,得到第二子向量;对所述第二子向量和所述第i轮历史问答对话对应的状态向量进行中间编码处理,得到第三子向量;对所述第三子向量进行中间编码处理,得到所述第j+1个中间表示向量。
在本申请实施例中,所述第一特征解码模块1130,还配置为调用视觉对话模型中的多模态增量式转换解码器获取当前轮提问对应的实际输出答案中已输出的字符串的字符串特征;调用所述多模态增量式转换解码器对当前轮提问对应的状态向量、图像特征和字符串特征进行多模态解码处理,得到解码特征向量;根据所述解码特征向量确定所述当前轮提问对应的所述实际输出答案,其中,所述实际输出答案包括所述已输出的字符串。
在本申请实施例中,所述第一特征解码模块1130,还配置为根据所述解码特征向量确定字符串概率;根据所述字符串概率确定所述实际输出答案中的字符串。
在本申请实施例中,多模态增量式转换解码器包括T个子转换解码器,T为正整数;所述第一特征解码模块1130,还配置为获取第m个中间表示向量,所述第m个中间表示向量是对所述当前轮提问对应的状态向量、所述图像特征和所述字符串特征进行m次多模态解码处理得到的,m为正整数且m的起始值为1;迭代m,调用所述多模态增量式转换解码器中的第m+1个子转换解码器对所述第m个中间表示向量、所述图像特征和所述当前轮提问对应的状态向量进行多模态解码处理,得到第m+1个中间表示向量,m+1≤T;将迭代m得到的第T个中间表示向量确定为所述解码特征向量。
在本申请实施例中,所述第一特征解码模块1130,还配置为调用所述多模态增量式转换解码器中的所述第m+1个子转换解码器对所述第m个中间表示向量进行中间解码处理,得到第三子向量;对所述第三子向量、所述图像特征和所述当前轮提问对应的状态向量进行中间解码处理,得到第四子向量;对所述第四子向量进行中间解码处理,得到所述第m+1个中间表示向量。
可以理解的是,本申请实施例提供的视觉对话装置,通过获取关于输入图像的前n轮历史问答对话对应的状态向量,使得视觉对话模型能够联系上下文更好地理解图像中隐含的信息,利用多模态编码处理方式和多模态解码处理方式,使得视觉对话模型能够更好地根据多种类型的信息,输出当前轮提问对应的实际输出答案,提高视觉对话模型输出的答案的准确率,且保证输出的答案与问题和输入图像的一致性,提升视觉对话的 效果。
还可以理解的是,通过视觉对话模型中的多模态增量式转换编码器对每一轮历史问答对话对应的状态向量进行多模态编码处理,以此类推,从而得到当前轮提问对应的状态向量,使得后续经过多模态解码处理后得到的输出答案更加准确。
还可以理解的是,通过在每个多模态增量式转换编码器中设置K个子转换编码器,该K个子转换编码器之间依次将前一个子转换编码器输出的中间表示向量传递至下一个子转换编码器中,从而得到当前轮提问对应的状态向量,使得后续进行解码处理得到的输出答案更加准确。通过层状结构保证为后续输出答案提供准确的中间表示向量。
还可以理解的是,通过视觉对话模型中的多模态增量式转换解码器对多模态增量式转换编码器输出的状态向量进行解码处理,从而使得视觉对话模型能够准确输出当前轮提问对应的实际输出答案。
还可以理解的是,通过多模态增量式转换解码器中设置的T个子转换解码器,该T个子转换解码器之间依次将前一个子转换解码器输出的中间表示向量传递至下一个子转换解码器中,从而得到当前轮提问对应的实际输出答案。通过层状结构保证视觉对话模型输出的答案的准确率。
还可以理解的是,通过子转换编码器中设置的多层结构,分别计算各个中间表示向量,使得每一个子转换编码器均能根据前一个子转换编码器准确输出中间表示向量,从而保证后续得到当前轮提问对应的状态向量是准确的。
还可以理解的是，通过子转换解码器中设置的多层结构，分别计算各个中间表示向量，使得每一个子转换解码器均能根据前一个子转换解码器准确输出中间表示向量，从而保证后续得到的当前轮提问对应的解码特征向量是准确的，进而保证根据解码特征向量输出的实际输出答案的准确性。
需要说明的是,本申请实施例提供的视觉对话装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的视觉对话装置与视觉对话方法实施例属于同一构思,其具体实现过程详见本申请实施例提供的视觉对话方法,这里不再赘述。
图12是本申请一个示例性实施例提供的视觉对话模型的训练装置的结构框图,该训练装置12-1包括:
第二获取模块1210,配置为获取输入图像样本的图像特征样本和前s轮历史问答对话样本对应的状态向量样本,s为正整数;
所述第二获取模块1210,配置为获取当前轮提问样本的问题特征样本和所述当前轮提问样本对应真实答案的第一答案特征;
第二特征编码模块1220,配置为调用视觉对话模型对所述图像特征样本、所述前s轮历史问答对话样本对应的状态向量样本和所述问题特征样本进行多模态编码处理,得到所述当前轮提问样本对应的状态向量样本;
第二特征解码模块1230,配置为调用所述视觉对话模型对所述当前轮提问样本对应的状态向量样本、所述图像特征样本和所述第一答案特征进行多模态解码处理,得到所述当前轮提问样本对应的实际输出答案样本的第二答案特征;
训练模块1240,配置为根据所述第一答案特征和所述第二答案特征,对所述视觉对话模型进行训练,得到训练后的视觉对话模型。
在本申请实施例中，所述第二特征解码模块1230，还配置为获取所述真实答案中前q个字符串的字符串特征标签，所述真实答案中前q个字符串与所述实际输出答案样本中已输出的q个字符串一一对应，q为正整数，第一答案特征包括所述字符串特征标签；根据所述当前轮提问样本对应的状态向量样本、所述图像特征样本和所述字符串特征标签，得到所述当前轮提问样本对应的所述实际输出答案样本中第q+1个字符串对应的所述第二答案特征。
在本申请实施例中，所述第二特征编码模块1220，还配置为调用视觉对话模型中的多模态增量式转换编码器获取第a轮历史问答对话样本对应的状态向量样本，a为正整数且a的起始值为1；迭代a，调用视觉对话模型中的第a+1个多模态增量式转换编码器对图像特征样本、第a轮历史问答对话样本对应的状态向量样本和第a+1轮历史问答对话样本对应的问答特征样本进行多模态编码处理，得到第a+1轮历史问答对话样本对应的状态向量样本，不同的多模态增量式转换编码器与不同的历史问答对话样本一一对应；将迭代a得到的第s+1轮历史问答对话样本对应的状态向量样本确定为当前轮提问样本对应的状态向量样本。
在本申请实施例中，多模态增量式转换编码器包括K个子转换编码器，K为正整数；所述第二特征编码模块1220，还配置为获取第j个中间表示向量样本，第j个中间表示向量样本是对图像特征样本、第a轮历史问答对话样本对应的状态向量样本和第a+1轮历史问答对话样本对应的问答特征样本进行j次多模态编码处理得到的，第j个中间表示向量样本是第a+1轮历史问答对话样本对应的向量，j为正整数且j的起始值为1；迭代j，调用视觉对话模型中的第a+1个多模态增量式转换编码器中的第j+1个子转换编码器对第j个中间表示向量样本、图像特征样本和第a轮历史问答对话样本对应的状态向量样本进行多模态编码处理，得到第j+1个中间表示向量样本，第j+1个中间表示向量样本是第a+1轮历史问答对话样本对应的另一向量，j+1≤K；将迭代j得到的第K个中间表示向量样本确定为第a+1轮历史问答对话样本对应的状态向量样本。
在本申请实施例中,所述第二特征编码模块1220,还配置为调用第a+1个多模态增量式转换编码器中的第j+1个子转换编码器对第j个中间表示向量样本进行中间编码处理,得到第一子向量样本;对第一子向量样本和图像特征样本进行中间编码处理,得到第二子向量样本;对第二子向量样本和第a轮历史问答对话样本对应的状态向量样本进行中间编码处理,得到第三子向量样本;对第三子向量样本进行中间编码处理,得到第j+1个中间表示向量样本。
在本申请实施例中,所述第二特征解码模块1230,还配置为调用视觉对话模型中的多模态增量式转换解码器获取当前轮提问样本对应的实际输出答案样本中已输出的字符串的字符串特征样本;调用多模态增量式转换解码器对当前轮提问样本对应的状态向量样本、图像特征样本和字符串特征样本进行多模态解码处理,得到解码特征向量样本;根据解码特征向量样本确定当前轮提问对应的实际输出答案样本。
在本申请实施例中,多模态增量式转换解码器包括T个子转换解码器,T为正整数;所述第二特征解码模块1230,还配置为获取第m个中间表示向量样本,第m个中间表示向量样本是对当前轮提问样本对应的状态向量样本、图像特征样本和字符串特征样本进行m次多模态解码处理得到的,m为正整数且m的起始值为1;迭代m,调用多模态增量式转换解码器中的第m+1个子转换解码器对第m个中间表示向量样本、图像特征样本和当前轮提问样本对应的状态向量样本进行多模态解码处理,得到第m+1个中间表示向量样本,m+1≤T;将迭代m得到的第T个中间表示向量样本确定为解码特征向量样本。
在本申请实施例中,所述第二特征解码模块1230,还配置为调用多模态增量式转换解码器中的第m+1个子转换解码器对第m个中间表示向量样本进行中间解码处理,得 到第三子向量样本;对第三子向量样本、图像特征样本和当前轮提问样本对应的状态向量样本进行中间解码处理,得到第四子向量样本;对第四子向量样本进行中间解码处理,得到第m+1个中间表示向量样本。
在本申请实施例中，所述第二特征解码模块1230，还配置为根据解码特征向量样本得到实际输出答案样本中输出的字符串预估概率；根据字符串预估概率输出实际输出答案样本中的字符串样本。
图13示出了本申请一个示例性实施例提供的服务器的结构示意图。该服务器1300可以是如图1所示的视觉对话系统100中的服务器120。如图13所示，服务器1300包括中央处理单元（CPU，Central Processing Unit）1301、包括随机存取存储器（RAM，Random Access Memory）1302和只读存储器（ROM，Read Only Memory）1303的系统存储器1304，以及连接系统存储器1304和中央处理单元1301的系统总线1305。服务器1300还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统（I/O系统，Input Output System）1306，和用于存储操作系统1313、应用程序1314和其他程序模块1315的大容量存储设备1307。
基本输入/输出系统1306包括有用于显示信息的显示器1308和用于用户输入信息的诸如鼠标、键盘之类的输入设备1309。其中显示器1308和输入设备1309都通过连接到系统总线1305的输入输出控制器1310连接到中央处理单元1301。基本输入/输出系统1306还可以包括输入输出控制器1310以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1310还提供输出到显示屏、打印机或其他类型的输出设备。
大容量存储设备1307通过连接到系统总线1305的大容量存储控制器(图13中未示出)连接到中央处理单元1301。大容量存储设备1307及其相关联的计算机可读介质为服务器1300提供非易失性存储。也就是说,大容量存储设备1307可以包括诸如硬盘或者紧凑型光盘只读存储器(CD-ROM,Compact Disc Read Only Memory)驱动器之类的计算机可读存储介质(图13中未示出)。
计算机可读存储介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、可擦除可编程只读存储器（EPROM，Erasable Programmable Read Only Memory）、带电可擦可编程只读存储器（EEPROM，Electrically Erasable Programmable Read Only Memory）、闪存或其他固态存储技术，CD-ROM、数字通用光盘（DVD，Digital Versatile Disc）或固态硬盘（SSD，Solid State Drives）、其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。其中，随机存取记忆体可以包括电阻式随机存取记忆体（ReRAM，Resistance Random Access Memory）和动态随机存取存储器（DRAM，Dynamic Random Access Memory）。当然，本领域技术人员可知计算机存储介质不局限于上述几种。上述的系统存储器1304和大容量存储设备1307可以统称为存储器。
在本申请实施例中,服务器1300还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即服务器1300可以通过连接在系统总线1305上的网络接口单元1311连接到网络1312,或者说,也可以使用网络接口单元1311来连接到其他类型的网络或远程计算机系统(图13中未示出)。
上述存储器还包括一个或者一个以上的程序,一个或者一个以上程序存储于存储器中,被配置由CPU执行。
在本申请实施例中,提供了一种电子设备,该电子设备包括处理器和存储器,存储器中存储有至少一条指令、至少一段程序、代码集或指令集,至少一条指令、至少一段 程序、代码集或指令集由处理器加载并执行以实现如上所述的视觉对话方法和视觉对话模型的训练方法。
在本申请实施例中,提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现如上所述的视觉对话方法和视觉对话模型的训练方法。
可选地,该计算机可读存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、固态硬盘(SSD,Solid State Drives)或光盘等。其中,随机存取记忆体可以包括电阻式随机存取记忆体(ReRAM,Resistance Random Access Memory)和动态随机存取存储器(DRAM,Dynamic Random Access Memory)。
本申请实施例还提供了一种计算机程序产品或计算机程序,所述计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在计算机可读存储介质中。电子设备的处理器从所述计算机可读存储介质读取所述计算机指令,所述处理器执行所述计算机指令,使得所述电子设备执行如上方面所述的视觉对话方法和视觉对话模型的训练方法。
本领域普通技术人员可以理解实现上述本申请实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的计算机可读存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的可选的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (14)

  1. 一种视觉对话方法,所述方法由电子设备执行,所述方法包括:
    获取输入图像的图像特征和前n轮历史问答对话对应的状态向量,n为正整数;
    获取当前轮提问的问题特征;
    对所述图像特征、所述前n轮历史问答对话对应的状态向量和所述问题特征进行多模态编码处理,得到所述当前轮提问对应的状态向量;
    对所述当前轮提问对应的状态向量和所述图像特征进行多模态解码处理,得到所述当前轮提问对应的实际输出答案。
  2. 根据权利要求1所述的方法,其中,所述对所述图像特征、所述前n轮历史问答对话对应的状态向量和所述问题特征进行多模态编码处理,得到所述当前轮提问对应的状态向量,包括:
    获取第i轮历史问答对话对应的状态向量,i为正整数且i的起始值为1;
    迭代i,调用视觉对话模型中的第i+1个多模态增量式转换编码器对所述图像特征、所述第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行多模态编码处理,得到第i+1轮历史问答对话对应的状态向量,不同的所述多模态增量式转换编码器与不同的所述历史问答对话一一对应;
    将迭代i得到的第n+1轮历史问答对话对应的状态向量确定为所述当前轮提问对应的状态向量。
  3. 根据权利要求2所述的方法,其中,所述第i+1个多模态增量式转换编码器包括K个子转换编码器,K为正整数;
    所述调用视觉对话模型中的第i+1个多模态增量式转换编码器对所述图像特征、所述第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行多模态编码处理,得到第i+1轮历史问答对话对应的状态向量,包括:
    获取第j个中间表示向量,所述第j个中间表示向量是对所述图像特征、所述第i轮历史问答对话对应的状态向量和第i+1轮历史问答对话对应的问答特征进行j次多模态编码处理得到的,所述第j个中间表示向量是所述第i+1轮历史问答对话对应的向量,j为正整数且j的起始值为1;
    迭代j,调用所述第i+1个多模态增量式转换编码器中的第j+1个子转换编码器对所述第j个中间表示向量、所述图像特征和所述第i轮历史问答对话对应的状态向量进行多模态编码处理,得到第j+1个中间表示向量,所述第j+1个中间表示向量是所述第i+1轮历史问答对话对应的另一向量,j+1≤K;
    将迭代j得到的第K个中间表示向量确定为所述第i+1轮历史问答对话对应的状态向量。
  4. 根据权利要求3所述的方法,其中,所述调用所述第i+1个多模态增量式转换编码器中的第j+1个子转换编码器对所述第j个中间表示向量、所述图像特征和所述第i轮历史问答对话对应的状态向量进行多模态编码处理,得到第j+1个中间表示向量,包括:
    调用所述第i+1个多模态增量式转换编码器中的所述第j+1个子转换编码器对所述第j个中间表示向量进行中间编码处理,得到第一子向量;
    对所述第一子向量和所述图像特征进行中间编码处理,得到第二子向量;
    对所述第二子向量和所述第i轮历史问答对话对应的状态向量进行中间编码处理,得到第三子向量;
    对所述第三子向量进行中间编码处理,得到所述第j+1个中间表示向量。
  5. 根据权利要求1至4任一所述的方法,其中,所述对所述当前轮提问对应的状态向量和所述图像特征进行多模态解码处理,得到所述当前轮提问对应的实际输出答案,包括:
    调用视觉对话模型中的多模态增量式转换解码器获取所述当前轮提问对应的已输出的字符串的字符串特征;
    调用所述多模态增量式转换解码器对所述当前轮提问对应的状态向量、所述图像特征和所述字符串特征进行多模态解码处理,得到解码特征向量;
    根据所述解码特征向量确定所述当前轮提问对应的所述实际输出答案,其中,所述实际输出答案包括所述已输出的字符串。
  6. 根据权利要求5所述的方法,其中,所述根据所述解码特征向量确定所述当前轮提问对应的所述实际输出答案,包括:
    根据所述解码特征向量确定字符串概率;
    根据所述字符串概率确定所述实际输出答案中的字符串。
  7. 根据权利要求5所述的方法,其中,所述多模态增量式转换解码器包括T个子转换解码器,T为正整数;
    所述调用所述多模态增量式转换解码器对所述当前轮提问对应的状态向量、所述图像特征和所述字符串特征进行多模态解码处理,得到解码特征向量,包括:
    获取第m个中间表示向量,所述第m个中间表示向量是对所述当前轮提问对应的状态向量、所述图像特征和所述字符串特征进行m次多模态解码处理得到的,m为正整数且m的起始值为1;
    迭代m,调用所述多模态增量式转换解码器中的第m+1个子转换解码器对所述第m个中间表示向量、所述图像特征和所述当前轮提问对应的状态向量进行多模态解码处理,得到第m+1个中间表示向量,m+1≤T;
    将迭代m得到的第T个中间表示向量确定为所述解码特征向量。
  8. 根据权利要求7所述的方法,其中,所述调用所述多模态增量式转换解码器中的第m+1个子转换解码器对所述第m个中间表示向量、所述图像特征和所述当前轮提问对应的状态向量进行多模态解码处理,得到第m+1个中间表示向量,包括:
    调用所述多模态增量式转换解码器中的所述第m+1个子转换解码器对所述第m个中间表示向量进行中间解码处理,得到第三子向量;
    对所述第三子向量、所述图像特征和所述当前轮提问对应的状态向量进行中间解码处理,得到第四子向量;
    对所述第四子向量进行中间解码处理,得到所述第m+1个中间表示向量。
  9. 一种视觉对话模型的训练方法,所述方法由电子设备执行,所述方法包括:
    获取输入图像样本的图像特征样本和前s轮历史问答对话样本对应的状态向量样本,s为正整数;
    获取当前轮提问样本的问题特征样本和所述当前轮提问样本对应的真实答案的第一答案特征;
    调用视觉对话模型对所述图像特征样本、所述前s轮历史问答对话样本对应的状态向量样本和所述问题特征样本进行多模态编码处理,得到所述当前轮提问样本对应的状态向量样本;
    调用所述视觉对话模型对所述当前轮提问样本对应的状态向量样本、所述图像特征样本和所述第一答案特征进行多模态解码处理,得到所述当前轮提问样本对应的实际输出答案样本的第二答案特征;
    根据所述第一答案特征和所述第二答案特征,对所述视觉对话模型进行训练,得到训练后的视觉对话模型。
  10. 根据权利要求9所述的方法,其中,所述对所述当前轮提问样本对应的状态向量样本、所述图像特征样本和所述第一答案特征进行多模态解码处理,得到所述当前轮提问样本对应的实际输出答案样本的第二答案特征,包括:
    获取所述真实答案中前q个字符串的字符串特征标签,所述真实答案中前q个字符串与所述实际输出答案样本中已输出的q个字符串一一对应,q为正整数,第一答案特征包括所述字符串特征标签;
    根据所述当前轮提问样本对应的状态向量样本、所述图像特征样本和所述字符串特征标签,得到所述当前轮提问样本对应的所述实际输出答案样本中第q+1个字符串对应的所述第二答案特征。
  11. 一种视觉对话装置,所述装置包括:
    第一获取模块,配置为获取输入图像的图像特征和前n轮历史问答对话对应的状态向量,n为正整数;
    所述第一获取模块,配置为获取当前轮提问的问题特征;
    第一特征编码模块,配置为对所述图像特征、所述前n轮历史问答对话对应的状态向量和所述问题特征进行多模态编码处理,得到所述当前轮提问对应的状态向量;
    第一特征解码模块,配置为对所述当前轮提问对应的状态向量和所述图像特征进行多模态解码处理,得到所述当前轮提问对应的实际输出答案。
  12. 一种视觉对话模型的训练装置,所述装置包括:
    第二获取模块,配置为获取输入图像样本的图像特征样本和前s轮历史问答对话样本对应的状态向量样本,s为正整数;
    所述第二获取模块,配置为获取当前轮提问样本的问题特征样本和所述当前轮提问样本对应真实答案的第一答案特征;
    第二特征编码模块,配置为调用视觉对话模型对所述图像特征样本、所述前s轮历史问答对话样本对应的状态向量样本和所述问题特征样本进行多模态编码处理,得到所述当前轮提问样本对应的状态向量样本;
    第二特征解码模块,配置为调用所述视觉对话模型对所述当前轮提问样本对应的状态向量样本、所述图像特征样本和所述第一答案特征进行多模态解码处理,得到所述当前轮提问样本对应的实际输出答案样本的第二答案特征;
    训练模块,配置为根据所述第一答案特征和所述第二答案特征,对所述视觉对话模型进行训练,得到训练后的视觉对话模型。
  13. 一种电子设备,所述电子设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至8任一所述的视觉对话方法,或者,以实现权利要求9或10所述的视觉对话模型的训练方法。
  14. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求1至8任一所述的视觉对话方法,或者,以实现权利要求9或10所述的视觉对话模型的训练方法。
PCT/CN2021/102815 2020-08-12 2021-06-28 视觉对话方法、模型训练方法、装置、电子设备及计算机可读存储介质 WO2022033208A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/989,613 US20230082605A1 (en) 2020-08-12 2022-11-17 Visual dialog method and apparatus, method and apparatus for training visual dialog model, electronic device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010805359.1A CN111897940B (zh) 2020-08-12 2020-08-12 视觉对话方法、视觉对话模型的训练方法、装置及设备
CN202010805359.1 2020-08-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/989,613 Continuation US20230082605A1 (en) 2020-08-12 2022-11-17 Visual dialog method and apparatus, method and apparatus for training visual dialog model, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2022033208A1 true WO2022033208A1 (zh) 2022-02-17

Family

ID=73229694

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/102815 WO2022033208A1 (zh) 2020-08-12 2021-06-28 视觉对话方法、模型训练方法、装置、电子设备及计算机可读存储介质

Country Status (3)

Country Link
US (1) US20230082605A1 (zh)
CN (1) CN111897940B (zh)
WO (1) WO2022033208A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115617036A (zh) * 2022-09-13 2023-01-17 中国电子科技集团公司电子科学研究院 一种多模态信息融合的机器人运动规划方法及设备

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897940B (zh) * 2020-08-12 2024-05-17 腾讯科技(深圳)有限公司 视觉对话方法、视觉对话模型的训练方法、装置及设备
CN112579759B (zh) * 2020-12-28 2022-10-25 北京邮电大学 模型训练方法及任务型视觉对话问题的生成方法和装置
CN113177112B (zh) * 2021-04-25 2022-07-01 天津大学 基于kr积融合多模态信息的神经网络视觉对话装置及方法
CN113435399B (zh) * 2021-07-14 2022-04-15 电子科技大学 一种基于多层次排序学习的多轮视觉对话方法
CN116071835B (zh) * 2023-04-07 2023-06-20 平安银行股份有限公司 人脸识别攻击事后筛查的方法、装置和电子设备
CN117235670A (zh) * 2023-11-10 2023-12-15 南京信息工程大学 基于细粒度交叉注意力的医学影像问题视觉解答方法
CN117271818B (zh) * 2023-11-22 2024-03-01 鹏城实验室 视觉问答方法、系统、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006062620A2 (en) * 2004-12-03 2006-06-15 Motorola, Inc. Method and system for generating input grammars for multi-modal dialog systems
CN105574133A (zh) * 2015-12-15 2016-05-11 苏州贝多环保技术有限公司 一种多模态的智能问答系统及方法
CN109359196A (zh) * 2018-10-22 2019-02-19 北京百度网讯科技有限公司 文本多模态表示方法及装置
CN111309883A (zh) * 2020-02-13 2020-06-19 腾讯科技(深圳)有限公司 基于人工智能的人机对话方法、模型训练方法及装置
CN111460121A (zh) * 2020-03-31 2020-07-28 苏州思必驰信息科技有限公司 视觉语义对话方法及系统
CN111897940A (zh) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 视觉对话方法、视觉对话模型的训练方法、装置及设备

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110609891B (zh) * 2019-09-18 2021-06-08 合肥工业大学 一种基于上下文感知图神经网络的视觉对话生成方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006062620A2 (en) * 2004-12-03 2006-06-15 Motorola, Inc. Method and system for generating input grammars for multi-modal dialog systems
CN105574133A (zh) * 2015-12-15 2016-05-11 苏州贝多环保技术有限公司 一种多模态的智能问答系统及方法
CN109359196A (zh) * 2018-10-22 2019-02-19 北京百度网讯科技有限公司 文本多模态表示方法及装置
CN111309883A (zh) * 2020-02-13 2020-06-19 腾讯科技(深圳)有限公司 基于人工智能的人机对话方法、模型训练方法及装置
CN111460121A (zh) * 2020-03-31 2020-07-28 苏州思必驰信息科技有限公司 视觉语义对话方法及系统
CN111897940A (zh) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 视觉对话方法、视觉对话模型的训练方法、装置及设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115617036A (zh) * 2022-09-13 2023-01-17 中国电子科技集团公司电子科学研究院 一种多模态信息融合的机器人运动规划方法及设备
CN115617036B (zh) * 2022-09-13 2024-05-28 中国电子科技集团公司电子科学研究院 一种多模态信息融合的机器人运动规划方法及设备

Also Published As

Publication number Publication date
CN111897940A (zh) 2020-11-06
US20230082605A1 (en) 2023-03-16
CN111897940B (zh) 2024-05-17

Similar Documents

Publication Publication Date Title
WO2022033208A1 (zh) 视觉对话方法、模型训练方法、装置、电子设备及计算机可读存储介质
CN109874029B (zh) 视频描述生成方法、装置、设备及存储介质
CN111897939B (zh) 视觉对话方法、视觉对话模型的训练方法、装置及设备
US20220028031A1 (en) Image processing method and apparatus, device, and storage medium
CN109508375A (zh) 一种基于多模态融合的社交情感分类方法
CN112860888B (zh) 一种基于注意力机制的双模态情感分析方法
CN113297370B (zh) 基于多交互注意力的端到端多模态问答方法及系统
AU2019101138A4 (en) Voice interaction system for race games
CN114549850B (zh) 一种解决模态缺失问题的多模态图像美学质量评价方法
CN112699215B (zh) 基于胶囊网络与交互注意力机制的评级预测方法及系统
CN113792177A (zh) 基于知识引导深度注意力网络的场景文字视觉问答方法
CN113870395A (zh) 动画视频生成方法、装置、设备及存储介质
CN114663915A (zh) 基于Transformer模型的图像人-物交互定位方法及系统
CN114282055A (zh) 视频特征提取方法、装置、设备及计算机存储介质
CN115630145A (zh) 一种基于多粒度情感的对话推荐方法及系统
CN113705191A (zh) 样本语句的生成方法、装置、设备及存储介质
CN113239159A (zh) 基于关系推理网络的视频和文本的跨模态检索方法
CN117093687A (zh) 问题应答方法和装置、电子设备、存储介质
CN117033609B (zh) 文本视觉问答方法、装置、计算机设备和存储介质
CN114186978A (zh) 简历与岗位匹配度预测方法及相关设备
CN115659987A (zh) 基于双通道的多模态命名实体识别方法、装置以及设备
CN115759262A (zh) 基于知识感知注意力网络的视觉常识推理方法及系统
CN115659242A (zh) 一种基于模态增强卷积图的多模态情感分类方法
CN113836354A (zh) 一种跨模态视觉与文本信息匹配方法和装置
CN116661940B (zh) 组件识别方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21855261

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.06.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21855261

Country of ref document: EP

Kind code of ref document: A1