CN117828142A - Question and answer method and device based on multi-mode information and application thereof - Google Patents


Info

Publication number
CN117828142A
CN117828142A
Authority
CN
China
Prior art keywords
information
text
data
question
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311453276.0A
Other languages
Chinese (zh)
Inventor
李圣权
黎维
张香伟
王理程
毛云青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202311453276.0A priority Critical patent/CN117828142A/en
Publication of CN117828142A publication Critical patent/CN117828142A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9038Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a question-answering method and device based on multimodal information, and an application thereof. The method comprises the steps of: collecting multimodal heterogeneous information sources; encoding the table contents of table data into fixed-length vector representations through a table information generation model to generate linear text; obtaining semantic representation information through an image information generation model; obtaining semantic representations of text information data and question text data through a text information encoding model; building a unified language representation space, in which clues related to the question text data are retrieved as the basis for generating subsequent answers; ranking the clues based on sequence-question pairs to obtain a plurality of candidate clues; and inputting the semantic representations of the plurality of candidate clues and the question text data into an adaptive answer extractor to output an answer. The method and device can realize cross-modal reasoning and improve training efficiency and generalization capability.

Description

Question and answer method and device based on multi-mode information and application thereof
Technical Field
The application relates to the technical field of multimodal question answering, and in particular to a question-answering method and device based on multimodal information and an application thereof.
Background
With the advent of the information-explosion age, people pay increasing attention to question-answering systems that can acquire information efficiently; such systems provide short, accurate answers to users and thus alleviate the problem of information overload. In real life, when people try to answer complex questions, they often rely on multiple sources of information, such as visual, textual, and tabular data. However, a conventional question-answering system usually uses only a single knowledge source, such as a text paragraph or a knowledge-graph triple, to answer the question posed by the user, and assumes that this single knowledge source provides enough evidence to answer it. This approach ignores important visual clues from pictures, precise contents from tables, and other knowledge in different modalities.
It is therefore desirable to develop a question-answering system that can take full advantage of the large amount of knowledge in different modalities available on the internet, including non-text formats such as images and tables. At present, researchers have proposed different multimodal question-answering methods, which mainly fall into classifier-based methods and end-to-end methods.
However, classifier-based methods require designing input features or model structures in a multimodal space while introducing a classifier to decide which modality's information is most appropriate for generating the answer; this is inflexible for cross-modal reasoning and data-efficient training, so the model's ability to reason over multimodal information is limited. End-to-end methods rely on integrating three traditional question-answering models — a document question-answering model, a table question-answering model, and a picture question-answering model — and lack cross-modal information association and reasoning capability, so the selection of modality information also affects the final generation accuracy.
Therefore, a question-answering method and device based on multimodal information that can fully utilize related information among multiple modalities and train efficiently, and an application thereof, are needed to solve the problems in the prior art.
Disclosure of Invention
The embodiments of the application provide a question-answering method and device based on multimodal information, and an application thereof, aiming to solve the problem that the prior art lacks cross-modal information association and reasoning capability, which affects accuracy.
The core technique of the invention converts images and tables into natural language at the model-input stage, simplifying the multimodal question-answering task into a simpler text question-answering problem. Cross-modal reasoning is achieved by combining textual representations of interrelated clues from multiple modalities.
In a first aspect, the present application provides a question-answering method based on multimodal information, the method comprising the following steps:
S00, collecting multimodal heterogeneous information sources;
the heterogeneous information sources comprise text data, table data, and image data;
S10, encoding the table contents of the table data into fixed-length vector representations through a table information generation model, and generating linear text based on the output context vectors;
converting the image data into natural language through an image information generation model to obtain semantic representation information;
mapping discrete text symbols in the text data and the user's question text data to a continuous vector space through a text information encoding model, and initializing to obtain semantic representations of the text information data and the question text data;
S20, building a unified language representation space from the vector representations, the semantic representation information, and the semantic representations of the text information data, so that vector representations of different modalities are mapped to a shared language representation space;
S30, retrieving clues related to the question text data in the language representation space, based on the semantic representation of the question text data, as the basis for generating subsequent answers;
S40, ranking the clues based on sequence-question pairs to obtain a plurality of candidate clues;
S50, inputting the semantic representations of the plurality of candidate clues and the question text data into an adaptive answer extractor to output an answer.
Further, in step S10, the table information generation model adopts a sequence-to-sequence network architecture. Adopting the Seq2Seq model (sequence-to-sequence network architecture) allows table data to be converted into semantic representation information, with benefits including context understanding, natural-language generation, generality, and long-range dependency capture.
Further, in step S10, the text information encoding model adopts the pre-trained language model BERT and uses a sliding-window mechanism to filter irrelevant information. Such a text information encoding model can improve the efficiency and accuracy of the question-answering system, with benefits such as context awareness and wide applicability.
Further, in step S10, the image information generation model combines a global strategy and a local strategy. The global strategy generates a macroscopic text description of the image through an image description model; the local strategy extracts each object and its attributes in the image through an object-attribute matching model and generates corresponding text descriptions to supplement semantic details missing from the macroscopic description; the final semantic representation information of the image is obtained by splicing the information obtained from the two strategies. An image information generation model combining global and local strategies can provide more complete, multi-level image semantic representation information while improving the robustness, flexibility, and efficiency of the model. This helps improve the performance and user experience of an image-based question-answering system.
Further, in step S30, a dense information retriever is used as the retriever, projecting the question text data and pointing to clues in the language representation space. Using a dense information retriever enables efficient, accurate matching and context-aware understanding of the question text data, while providing interpretable retrieval results and flexible expansion capability, thereby improving the performance and user experience of the question-answering system.
Further, in step S40, the sequence comprises semantic representations from one or more modalities among the table, picture, and text information sources; all semantic representations are linked to form an inference-chain sequence; sequence-question pairs are input into a relevance decision network, and a cross-attention mechanism is introduced to calculate a ranking score for each candidate clue, from which the highest-ranked candidate clues are selected. The techniques adopted — multimodal information fusion, inference-chain generation, the relevance decision network, and the cross-attention mechanism — can effectively process information sources from different modalities, improve the accuracy and robustness of the question-answering system, and provide more comprehensive and accurate answers.
Further, in step S50, the adaptive answer extractor adopts an encoder-decoder network structure. This enables contextual understanding and conditional constraint of the input sequence, generates high-quality answers, is robust and flexible, and improves the performance and reliability of the question-answering system.
In a second aspect, the present application provides a question-answering apparatus based on multimodal information, comprising:
an input module, used for inputting question text data;
an acquisition module, which collects multimodal heterogeneous information sources, the heterogeneous information sources comprising text data, table data, and image data;
a table information generation module, used for encoding the table contents of the table data into fixed-length vector representations and generating linear text based on the output context vectors;
an image information generation module, used for converting the image data into natural language to obtain semantic representation information;
a text information encoding module, which maps discrete text symbols in the text data and the user's question text data to a continuous vector space and initializes them to obtain semantic representations of the text information data and the question text data;
a unified representation module, which builds a unified language representation space from the vector representations, the semantic representation information, and the semantic representations of the text information data, so as to map the vector representations of different modalities to a shared language representation space;
a relevance decision module, used for retrieving clues related to the question text data in the language representation space, based on the semantic representation of the question text data, as the basis for generating subsequent answers, and for ranking the clues based on sequence-question pairs to obtain a plurality of candidate clues;
an adaptive answer extractor, into which the semantic representations of the plurality of candidate clues and the question text data are input to obtain an answer;
and an output module, which outputs the answer.
In a third aspect, the present application provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is arranged to run the computer program to perform the above question-answering method based on multimodal information.
In a fourth aspect, the present application provides a readable storage medium in which a computer program is stored, the computer program comprising program code for controlling a process to execute the above question-answering method based on multimodal information.
The main contributions and innovations of the invention are as follows: 1. Compared with the prior art, the application converts images and tables into natural language at the model-input stage, simplifying the multimodal question-answering task into a simpler text question-answering problem. This helps handle diverse modality combinations; cross-modal reasoning is realized by combining textual representations formed from interrelated clues of multiple modalities. By integrating different types of information into a unified language representation space, a user's questions can be better understood and answered. The approach is suitable for various fields, such as intelligent customer service and question-answering systems.
2. Compared with the prior art, a text-conversion strategy combining global and local information is introduced for images, so that information loss in converting an image into text is minimized and as much image-characterization information as possible is retained. Each clue related to the question is retrieved in a unified language space; meanwhile, negative samples and a cross-attention mechanism are introduced when ranking related clues, which improves the ability to distinguish positive from negative samples, yields the most effective answer clues, and improves training efficiency and generalization capability.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below; other features, objects, and advantages of the application will become apparent from them.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flowchart of a question-answering method based on multimodal information according to an embodiment of the present application;
FIG. 2 is a simplified flow chart of a method according to an embodiment of the present application;
FIG. 3 is a block diagram of an image information generation model;
FIG. 4 is a block diagram of a correlation decision network;
fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
Example 1
The application aims to provide a question-answering method based on multimodal information. With particular reference to FIGS. 1 and 2, the method comprises the following steps:
s00, collecting multi-mode heterogeneous information sources;
the heterogeneous information sources comprise text data, table data, and image data; the question data input by the user is text data; the table data here refers to structured tables composed of rows and columns.
S10, encoding the table contents of the table data into fixed-length vector representations through a table information generation model, and generating linear text based on the output context vectors;
in this embodiment, the table information generation model converts the table contents of the table data into natural sentences conforming to human language habits, arranged linearly according to a predefined natural-language template. The process neither destroys the original structure of the table nor loses information, ensuring that the table information is completely preserved in the generated sentences. The table information generation model adopts a sequence-to-sequence network architecture to encode the table contents into fixed-length vector representations, and then generates linear text based on the output context vectors.
Preferably, the table information generation model includes two encoding modules for encoding the table title and the cell information respectively. Each encoding module includes two embedding layers, a convolution layer, and a pooling layer. After encoding, the title features and the cell-content features are concatenated (concat) to obtain a fused feature representation, and finally a natural-language description closely related to the content is generated from the fused feature representation through an RNN (recurrent neural network)-based generation model.
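The template-based linearization step can be illustrated with a minimal sketch. The function name, template wording, and example table below are illustrative assumptions, not taken from the patent; the point is only that every cell is rendered into a natural-language clause so that no table content is lost.

```python
# Hypothetical sketch of template-based table linearization: each cell becomes
# a "<header> is <value>" clause, joined into one sentence per row, so the
# generated text preserves the complete table content.

def linearize_table(title, headers, rows):
    """Render a structured table (title, headers, rows) as natural sentences."""
    sentences = []
    for row in rows:
        clauses = [f"the {h} is {v}" for h, v in zip(headers, row)]
        sentences.append(f"In the table '{title}', " + ", ".join(clauses) + ".")
    return " ".join(sentences)

text = linearize_table(
    "city statistics",
    ["city", "population"],
    [["Hangzhou", "12.2 million"], ["Suzhou", "12.7 million"]],
)
```

In a real system these sentences would then be fed to the Seq2Seq encoder rather than read by a user.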
S11, converting image data into natural language through an image information generation model to obtain semantic representation information;
in this embodiment, the image information generation model converts image data into natural language and, to avoid information loss, combines a global strategy and a local strategy. The global strategy adopts an image description model to generate a macroscopic text description of the image; the local strategy adopts an object-attribute matching model to extract each object and its attributes in the image and generates corresponding text descriptions to supplement semantic details missing from the macroscopic description; the global and local text description sequences are spliced to obtain the final semantic representation information of the image.
Preferably, the image information generation model employs two conversion strategies, as shown in FIG. 3: a global strategy and a local strategy.
The global strategy comprises: a ResNet network for extracting global image features superimposes image features from 16 region-of-interest areas provided by a Faster R-CNN network; the obtained image features are input into a multi-layer Transformer architecture fused with multimodal intermediate representations to output a macroscopic text description of the image; end-to-end training is carried out with random initialization and a cross-entropy loss, wherein each Transformer layer comprises a self-attention network and a feed-forward network, and the training parameters include an initial learning rate of 0.0025, a linear warm-up of 5K steps, a dropout rate of 0, and a batch size of 256;
the local strategy comprises three layers of information conversion, firstly, an image subtitle model is utilized to convert image level information into a text title, and the generated text title information is C n Then using object-attribute model to convert the object-level information into object and attribute label, and making the generated object text information and attribute text information be O n Finally, detecting all possibly existing text contents in the image by utilizing an OCR model, wherein the corresponding generated text contents are T n Comprehensively obtaining local text description of the imageWherein phi is n =W n F n ([C n ,O n ,T n ]),W n To learn parameters, F n Is a nonlinear layer.
S12, mapping discrete text symbols in the text data and the user's question text data to a continuous vector space through a text information encoding model, and initializing to obtain semantic representations of the text information data and the question text data;
in this embodiment, the text information encoding model is used to map discrete text symbols to a continuous vector space, obtaining semantic representations of the text information data and the question text data after initialization; the pre-trained language model BERT and a sliding-window mechanism are used to filter irrelevant information.
Preferably, the text information encoding model is trained as follows: a joint learning method is adopted to train the text information encoding model to extract text semantic features. All input text paragraphs form a set P; with the history window size defined as w, the text paragraphs within the history window form a subset P_w of P. A joint sequence of P_w and the question q_k is then constructed and passed into the BERT network for initialization to obtain an initial representation, which is input into the text information encoding model to obtain the final text information representation, where w_q is a text adjacency matrix.
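The sliding-window pairing of paragraphs with the question can be sketched in a few lines. This is a hedged illustration only: tokenization and the actual BERT forward pass are omitted, and the `[CLS]`/`[SEP]` string layout stands in for the token-id sequences a real BERT encoder consumes.

```python
# Sketch of the history-window mechanism: keep only the w most recent
# paragraphs, and pair each with the question in the two-segment layout a
# BERT-style cross-encoder expects.

def build_window_inputs(paragraphs, question, w):
    """Apply a sliding window of size w, then pair each paragraph with the question."""
    window = paragraphs[-w:]  # discard paragraphs outside the history window
    return [f"[CLS] {question} [SEP] {p} [SEP]" for p in window]

inputs = build_window_inputs(["p1", "p2", "p3", "p4"], "who won?", 2)
```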
S20, building a unified language representation space from the vector representations, the semantic representation information, and the semantic representations of the text information data, so that vector representations of different modalities are mapped to a shared language representation space;
in this embodiment, a unified language representation space is constructed from the obtained representations covering all modalities (the vector representations, the semantic representation information, and the semantic representations of the text information data). It maps the vector representations of the different modalities to a shared representation space, facilitating subsequent cross-modal reasoning and understanding of question-related information.
S30, retrieving clues related to the question text data in the language representation space, based on the semantic representation of the question text data, as the basis for generating subsequent answers;
in this embodiment, according to the semantic representation of the question text information, clues related to the question text are retrieved in the unified language representation space and used as the basis for generating subsequent answers. The dense passage retriever DPR is used as the retriever to project the question and point to text clues from the multimodal data sources in the unified language representation space. Dense Passage Retrieval (DPR) is a deep-learning model for text retrieval proposed by Facebook AI Research in 2020. DPR is a dual-encoder retrieval model that encodes queries and documents into dense vectors and then uses these vectors to calculate the similarity between the queries and the documents. Its advantage is that it can efficiently handle long text and complex queries.
S40, ranking the clues based on sequence-question pairs to obtain a plurality of candidate clues;
in this embodiment, the plurality of candidate clues are classified and ranked based on sequence-question pairs. The required sequence is the semantic representation of information from one or more modalities among the table, picture, and text information sources; these semantic representations are linked to form an inference-chain sequence. The sequence-question pairs are input together into the relevance decision network, and a cross-attention mechanism is introduced for them, so that the relevance between the question and the candidate clue sequences is modeled more accurately and cross-modal reasoning arises naturally; a ranking score is calculated for each candidate clue, and the top-N highest-ranked candidate clues are selected.
Here the inference chain combines entities in the question with candidate answer information. For example, if the question asks what animal competes in an international race held at the Star Race venue, the corresponding inference chain may be Star Race venue → international race → horse.
Preferably, as shown in FIG. 4, the relevance decision network comprises: a plurality of sequence-question pairs x, x⁺, x⁻ are input into a weight-sharing convolutional neural network to obtain vector representations R_w(x), R_w(x⁺), R_w(x⁻); these features are then input into a cross-attention layer for interactive fusion and mapped into a new space, giving the representations of the sequence-question pairs in the new space, R'_w(x), R'_w(x⁺), R'_w(x⁻); the similarity calculation result is obtained by computing the L1 distances between the vectors, i.e., |R'_w(x) − R'_w(x⁺)| and |R'_w(x) − R'_w(x⁻)|.
The plurality of sequence-question pairs used as input comprises an anchor sample x together with a positive sample x⁺ and a negative sample x⁻. Through training, the distance between strongly correlated samples is made as small as possible and the distance between weakly correlated samples as large as possible, realizing accurate ranking of the related texts. Introducing negative samples and a cross-attention mechanism improves the ability to distinguish positive from negative samples, yields the most effective answer clues, and improves training efficiency and generalization capability.
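The training signal just described can be sketched as a hinge (margin) loss over L1 distances in the fused representation space. The margin value and function names are illustrative assumptions; in the full model the inputs would be the cross-attention outputs R'_w(·), not raw vectors.

```python
# Triplet-style ranking loss with L1 distance: zero when the positive pair is
# already closer to the anchor than the negative pair by at least `margin`.

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pulling the positive closer and pushing the negative away."""
    return max(0.0, l1_distance(anchor, positive)
                    - l1_distance(anchor, negative) + margin)

well_separated = triplet_loss([0.0, 0.0], [0.1, 0.1], [3.0, 3.0])  # satisfied
violated = triplet_loss([0.0, 0.0], [3.0, 3.0], [0.1, 0.1])        # penalized
```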
S50, inputting semantic representations of the candidate clues and the question text data into the adaptive answer extractor to output answers.
In this embodiment, the top-N candidate clues obtained in step S40 are input together with the question text representation into the adaptive answer extractor, where the question text representation is the semantic feature obtained in step S20 using BERT, and the adaptive answer extractor adopts an encoder-decoder network structure.
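Purely as an illustration of handing the top-N clues and the question to the extractor together, the sketch below concatenates them into a single input sequence. The [SEP] delimiter and the join format are assumptions for illustration; the encoder-decoder network itself is not reproduced here.

```python
def build_extractor_input(question, top_n_clues, sep=" [SEP] "):
    """Concatenate the question text with its ranked candidate clues."""
    return sep.join([question] + list(top_n_clues))

example = build_extractor_input(
    "What animal races in the international race?",
    ["the star race hosts an international race", "the entrants are horses"],
)
```

The resulting string is what a BERT-style encoder would tokenize as one sequence before the decoder generates the answer.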
Example Two
Based on the same conception, the application also provides a question-answering device based on multi-modal information, which comprises:
the input module is used for inputting the text data of the questions;
the acquisition module is used for acquiring multi-modal heterogeneous information sources, the heterogeneous information sources comprising text data, table data and image data;
the table information generation module is used for encoding table contents of the table data into vector representations with fixed lengths and generating linear texts based on the output context vectors;
the image information generation module is used for converting the image data into natural language to obtain semantic representation information;
the text information coding module is used for mapping discrete text symbols in the text data and in the user's question text data to a continuous vector space and initializing them, so as to obtain semantic representations of the text information data and the question text data;
the unified representation module is used for constructing a unified language representation space from the vector representation, the semantic representation information and the semantic representation of the text information data, so as to map the vector representations of the different modalities to a shared language representation space;
the relevance decision module is used for searching, in the language representation space, for clues related to the question text data based on the semantic representation of the question text data, as the generation basis of the subsequent answer, and for sorting the clues according to sequence-question pairs to obtain a plurality of candidate clues;
the adaptive answer extractor module is used for taking the semantic representations of the candidate clues and the question text data as input and outputting an answer;
and the output module outputs the answer.
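The module wiring of the device can be sketched end to end as follows. Every function body here is a placeholder assumption (random vectors and a fixed answer string stand in for the trained models); only the order in which the modules feed one another follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16
P = rng.normal(size=(DIM, DIM))      # projection into the shared space (placeholder)

def table_module(table):             # table -> fixed-length vector (placeholder)
    return rng.normal(size=DIM)

def image_module(image):             # image -> semantic representation (placeholder)
    return rng.normal(size=DIM)

def text_module(text):               # discrete symbols -> continuous vector (placeholder)
    return rng.normal(size=DIM)

def unify(vectors):                  # map modality vectors to the shared space
    return [v @ P for v in vectors]

def relevance_module(question_vec, clues, top_n=2):
    # rank clue vectors by L1 distance to the question representation
    return sorted(clues, key=lambda c: float(np.abs(c - question_vec).sum()))[:top_n]

def answer_module(question_vec, top_clues):
    return "answer"                  # stands in for the adaptive answer extractor

question_vec = text_module("question")
clues = unify([table_module(None), image_module(None), text_module("document")])
result = answer_module(question_vec, relevance_module(question_vec, clues))
```

The sorted-by-distance step is only one simple way to realize the relevance decision module; the embodiment above uses a trained network with cross-attention instead.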
Example Three
This embodiment also provides an electronic device, referring to fig. 5, comprising a memory 404 and a processor 402, the memory 404 having stored therein a computer program, the processor 402 being arranged to run the computer program to perform the steps of any of the method embodiments described above.
In particular, the processor 402 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
The memory 404 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 404 may comprise a hard disk drive (HDD), floppy disk drive, solid-state drive (SSD), flash memory, optical disk, magneto-optical disk, tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 404 may include removable or non-removable (or fixed) media, where appropriate. Memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 404 is a non-volatile memory. In particular embodiments, memory 404 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically rewritable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these. The RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), where appropriate, and the DRAM may be fast page mode dynamic random access memory (FPMDRAM), extended data output dynamic random access memory (EDODRAM), synchronous dynamic random access memory (SDRAM), or the like.
Memory 404 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions for execution by processor 402.
Processor 402 implements any of the multimodal information-based question-answering methods of the above embodiments by reading and executing computer program instructions stored in memory 404.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402 and the input/output device 408 is connected to the processor 402.
The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission device 406 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
The input-output device 408 is used to input or output information. In this embodiment, the input information may be a question or the like, and the output information may be an answer or the like.
Example Four
The present embodiment also provides a readable storage medium having stored therein a computer program including program code for controlling a process to execute the process including the multi-modal information-based question-answering method according to the first embodiment.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets, and/or macros can be stored in any apparatus-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. In addition, in this regard, it should be noted that any blocks of the logic flows as illustrated may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVDs and data variants thereof, CDs, etc. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The foregoing examples merely represent several embodiments of the present application, the description of which is more specific and detailed and which should not be construed as limiting the scope of the present application in any way. It should be noted that variations and modifications can be made by those skilled in the art without departing from the spirit of the present application, which falls within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A question-answering method based on multi-modal information, characterized by comprising the following steps:
s00, collecting multi-mode heterogeneous information sources;
the heterogeneous information source comprises text data, table data and image data;
s10, encoding table contents of the table data into vector representations with fixed lengths through a table information generation model, and generating linear texts based on the output context vectors;
converting the image data into natural language through an image information generation model to obtain semantic representation information;
mapping discrete text symbols in the text data and the user's question text data to a continuous vector space through a text information coding model, and initializing them to obtain semantic representations of the text information data and the question text data;
s20, a unified language representation space is built by the vector representation, semantic representation information and semantic representation of text information data, so that vector representations of different modes are mapped to a shared language representation space;
s30, searching clues related to the question text data in the language representation space based on the semantic representation of the question text data as a generation basis of subsequent answers;
s40, sorting the clues according to the sequence-problem to obtain a plurality of candidate clues;
s50, inputting the semantic representations of the candidate clues and the question text data into an adaptive answer extractor to output an answer.
2. The multi-modal information-based question-answering method according to claim 1, wherein in step S10, the table information generation model adopts a sequence-to-sequence network architecture.
3. The multi-modal information-based question-answering method according to claim 1, wherein in step S10, the text information encoding model employs a pre-trained language model BERT and employs a sliding window mechanism to filter irrelevant information.
4. The multi-modal information-based question-answering method according to claim 1, wherein in step S10, the image information generation model combines a global policy and a local policy: the global policy generates a macroscopic text description of the image through an image description model, while the local policy extracts each object and its attributes in the image through an object-attribute matching model and generates corresponding text descriptions to supplement the semantic details missing from the macroscopic description; the final semantic representation information of the image is obtained by splicing the information obtained by the global and local policies.
5. The multi-modal information-based question-answering method according to claim 1, wherein in step S30, a dense information retriever is used as the retriever to project the question text data into the language representation space and locate the related clues.
6. The multi-modal information-based question-answering method according to claim 1, wherein in step S40, each sequence comprises semantic representations from one or more modalities among the table, image and text information sources; all semantic representations are linked to form an inference-chain sequence, the sequence-question pairs are input into a relevance decision network in which a cross-attention mechanism is introduced, a ranking score is calculated for each candidate clue, and the plurality of highest-ranked candidate clues are selected.
7. The multi-modal information-based question-answering method according to any one of claims 1-6, wherein in step S50, the adaptive answer extractor employs an encoder-decoder network structure.
8. A multi-modal information based question answering apparatus, comprising:
the input module is used for inputting the text data of the questions;
the acquisition module is used for acquiring multi-modal heterogeneous information sources, the heterogeneous information sources comprising text data, table data and image data;
the table information generation module is used for encoding table contents of the table data into vector representations with fixed lengths and generating linear texts based on the output context vectors;
the image information generation module is used for converting the image data into natural language to obtain semantic representation information;
the text information coding module is used for mapping discrete text symbols in the text data and in the user's question text data to a continuous vector space and initializing them, so as to obtain semantic representations of the text information data and the question text data;
the unified representation module is used for constructing a unified language representation space from the vector representation, the semantic representation information and the semantic representation of the text information data, so as to map the vector representations of the different modalities to a shared language representation space;
the relevance decision module is used for searching, in the language representation space, for clues related to the question text data based on the semantic representation of the question text data, as the generation basis of the subsequent answer, and for sorting the clues according to sequence-question pairs to obtain a plurality of candidate clues;
the adaptive answer extractor module is used for taking the semantic representations of the candidate clues and the question text data as input and outputting an answer;
and the output module outputs the answer.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the multimodal information based question-answering method according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium has stored therein a computer program comprising program code for controlling a process to execute a process comprising the multimodal information based question-answering method according to any one of claims 1 to 7.
CN202311453276.0A 2023-11-02 2023-11-02 Question and answer method and device based on multi-mode information and application thereof Pending CN117828142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311453276.0A CN117828142A (en) 2023-11-02 2023-11-02 Question and answer method and device based on multi-mode information and application thereof

Publications (1)

Publication Number Publication Date
CN117828142A true CN117828142A (en) 2024-04-05

Family

ID=90510317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311453276.0A Pending CN117828142A (en) 2023-11-02 2023-11-02 Question and answer method and device based on multi-mode information and application thereof

Country Status (1)

Country Link
CN (1) CN117828142A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118246537A (en) * 2024-05-24 2024-06-25 腾讯科技(深圳)有限公司 Question and answer method, device, equipment and storage medium based on large model
CN118246537B (en) * 2024-05-24 2024-08-27 腾讯科技(深圳)有限公司 Question and answer method, device, equipment and storage medium based on large model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination