CN114238587A - Reading understanding method and device, storage medium and computer equipment


Info

Publication number
CN114238587A
Authority
CN
China
Prior art keywords
vector representation
text
information
picture
option
Prior art date
Legal status
Pending
Application number
CN202111655536.3A
Other languages
Chinese (zh)
Inventor
陈致鹏
崔一鸣
陈志刚
Current Assignee
Zhongke Xunfei Internet Beijing Information Technology Co ltd
iFlytek Co Ltd
Original Assignee
Zhongke Xunfei Internet Beijing Information Technology Co ltd
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhongke Xunfei Internet Beijing Information Technology Co ltd, iFlytek Co Ltd filed Critical Zhongke Xunfei Internet Beijing Information Technology Co ltd
Priority to CN202111655536.3A
Publication of CN114238587A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G06F 16/3347 - Query execution using vector based model
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/56 - Information retrieval of still image data having vectorial format
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/251 - Fusion techniques of input or preprocessed data
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 - Computing arrangements based on specific mathematical models
    • G06N 7/01 - Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a reading understanding method, a reading understanding device, a storage medium and computer equipment. The method comprises the following steps: acquiring text data and image data to be processed, wherein the text data comprises a question and options corresponding to the question, and the image data comprises a scene picture; extracting a text vector representation of the text data, wherein the text vector representation comprises text information of the question and text information of the options; extracting a picture vector representation of the image data; calculating a multi-modal vector representation containing text information and image information according to the text vector representation and the picture vector representation; and calculating, according to the multi-modal vector representation, a probability value of each option being the correct answer, so that the correct answer matching the question and the scene picture is determined from the options according to the probability values. In this way, multi-modal reading understanding with picture and text question input can be achieved and reading understanding accuracy is improved.

Description

Reading understanding method and device, storage medium and computer equipment
Technical Field
The application relates to the technical field of computers, in particular to a reading understanding method, a reading understanding device, a storage medium and computer equipment.
Background
At present, there are few reading understanding methods and systems for multiple modalities. Existing approaches are basically single-modality models that process the data and extract the relevant features, and are then applied to problems that actually require several modalities to be handled at once. For multi-modal reading understanding, the common practice is to extract useful information from the picture with image recognition (e.g. OCR, face recognition, character action recognition) as features, model the question with a text-processing model (e.g. GRU, LSTM, CNN, BERT) using those features, and then combine the image features and text features and compute feature-vector similarity to obtain the answer to the question. Such existing methods need several cooperating systems to fully solve the multi-modal reading understanding problem: an image recognition system and a natural language processing system must process the two signals synchronously, and a third system must then be built to compute the final answer from the results of the two systems. When such an existing multi-modal reading understanding system processes multi-modal signal input and finally completes the reading understanding problem, the overall system is complex, mutually reinforcing signals are easily lost, the error rate is high, and the reading understanding accuracy is low.
Disclosure of Invention
The embodiment of the application provides a reading understanding method, a reading understanding device, a storage medium and computer equipment, which can realize multi-modal reading understanding with picture and text question input and improve the accuracy of reading understanding.
In one aspect, a reading comprehension method is provided, which includes:
acquiring text data and image data to be processed, wherein the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures;
extracting a text vector representation of the text data, the text vector representation including text information of the question and text information of the option;
extracting a picture vector representation of the image data;
calculating a multi-modal vector representation comprising text information and image information according to the text vector representation and the picture vector representation;
and calculating a probability value of each option serving as a correct answer according to the multi-modal vector representation, and determining the correct answer matched with the question and the scene picture from the options according to the probability value.
Optionally, the extracting a text vector representation of the text data includes:
and converting each word in the text data into a sequence number corresponding to each word in the word list through the word list, and searching the text vector representation of the text data according to the sequence number.
Optionally, the extracting a picture vector representation of the image data includes:
and performing target detection and feature extraction on the scene picture according to a target detection model to obtain the picture vector representation, wherein the picture vector representation comprises an image information vector representation of each visual target in the scene picture and an image information vector representation of the whole picture.
Optionally, the calculating a multi-modal vector representation including text information and image information according to the text vector representation and the picture vector representation includes:
processing the text vector representation and the picture vector representation based on a self-attention model to obtain global interaction information among the text information of the question, the text information of the option, and the image information;
carrying out normalization processing on the global interaction information to obtain first normalization information;
and determining a multi-modal vector representation containing text information and image information according to the global interaction information and the first normalization information.
Optionally, the processing the text vector representation and the picture vector representation based on a self-attention model to obtain global interaction information among the text information of the question, the text information of the option, and the image information includes:
inputting an embedded vector representation determined from the text vector representation and the picture vector representation into the self-attention model, and calculating a matching matrix from the product of the embedded vector representation and the transpose of the embedded vector representation;
and determining global interaction information between the text information of the question, the text information of the option and the image information according to the product of the matching matrix and the embedded vector representation.
Optionally, the determining a multi-modal vector representation including text information and image information according to the global interaction information and the first normalization information includes:
adding the global interaction information and the first normalization information to obtain first summation information;
inputting the first summation information into a full-connection layer for processing, and then carrying out normalization processing on an output result of the full-connection layer to obtain second normalization information;
and adding the first summation information and the second normalization information to obtain the multi-modal vector representation containing the text information and the image information.
Optionally, the method further includes:
acquiring position vector representation and type vector representation, wherein the position vector representation is used for marking the position of each word in the text data, and the type vector representation is used for distinguishing a text type from an image type;
the computing a multimodal vector representation including text information and image information from the text vector representation and the picture vector representation comprises:
and calculating a multi-modal vector representation containing text information and image information according to the text vector representation, the picture vector representation, the position vector representation and the type vector representation.
Optionally, the calculating, according to the multi-modal vector representation, a probability value of each option as a correct answer to determine, according to the probability value, a correct answer matched with the question and the scene picture from the options includes:
splitting the multi-modal vector representation to obtain a question option representation and an image option representation;
processing the question option representation and the image option representation based on a cross attention model to obtain a point-of-interest vector representation;
and calculating a probability value of each option being the correct answer according to the point-of-interest vector representation, so as to determine the correct answer matching the question and the scene picture from the options according to the probability value.
In another aspect, there is provided a reading and understanding apparatus, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring text data and image data to be processed, the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures;
a first extraction unit configured to extract a text vector representation of the text data, the text vector representation including text information of the question and text information of the option;
a second extraction unit for extracting a picture vector representation of the image data;
the computing unit is used for computing multi-modal vector representation containing text information and image information according to the text vector representation and the picture vector representation;
and the determining unit is used for calculating a probability value of each option as a correct answer according to the multi-modal vector representation so as to determine the correct answer matched with the question and the scene picture from the options according to the probability value.
In another aspect, a computer-readable storage medium is provided, which stores a computer program adapted to be loaded by a processor to perform the steps of the reading and understanding method according to any of the above embodiments.
In another aspect, a computer device is provided, the computer device includes a processor and a memory, the memory stores a computer program, and the processor is used for executing the steps in the reading understanding method according to any one of the above embodiments by calling the computer program stored in the memory.
In another aspect, a computer program product is provided, which comprises computer instructions that, when executed by a processor, implement the steps in the reading and understanding method according to any of the above embodiments.
The method comprises the steps of obtaining text data and image data to be processed, wherein the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures; extracting a text vector representation of the text data, wherein the text vector representation comprises text information of the question and text information of the option; extracting a picture vector representation of the image data; according to the text vector representation and the picture vector representation, calculating a multi-modal vector representation containing text information and image information; and calculating the probability value of each option as a correct answer according to the multi-modal vector representation, and determining the correct answer matched with the question and the scene picture from the options according to the probability value. According to the method and the device, multi-modal reading understanding of picture and text question input is achieved through a Transformer model, image data and text data containing questions and options are input through the model at the same time, multi-modal vector representation containing text information and image information is calculated through an attention (attention) mechanism in the Transformer model, useful picture information and text information are filtered out, then correct answer options are selected according to the multi-modal vector representation, and reading understanding accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a structural framework diagram of a reading understanding system provided in an embodiment of the present application.
Fig. 2 is a first flowchart of a reading understanding method according to an embodiment of the present application.
Fig. 3 is a first application scenario diagram of a reading understanding method according to an embodiment of the present application.
Fig. 4 is a second flowchart of a reading understanding method provided in the embodiment of the present application.
Fig. 5 is a schematic view of a second application scenario of the reading understanding method according to the embodiment of the present application.
Fig. 6 is a third flow chart of a reading understanding method provided in the embodiment of the present application.
Fig. 7 is a schematic view of a third application scenario of the reading understanding method according to the embodiment of the present application.
Fig. 8 is a schematic structural diagram of a reading and understanding apparatus provided in an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a reading understanding method and device, computer equipment and a storage medium. Specifically, the reading understanding method of the embodiment of the present application may be executed by a computer device, where the computer device may be a terminal or a server, and the like. The terminal can be a smart phone, a tablet Computer, a notebook Computer, a smart television, a smart speaker, a wearable smart device, a Personal Computer (PC), and the like, and the terminal can further include a client, which can be a video client, a browser client, an instant messaging client, and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
The embodiment of the application can be applied to various scenes such as artificial intelligence, voice recognition and intelligent traffic.
First, some terms or expressions appearing in the course of describing the embodiments of the present application are explained as follows:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
Deep Learning (DL) is a branch of machine Learning, an algorithm that attempts to perform high-level abstraction of data using multiple processing layers that contain complex structures or consist of multiple nonlinear transformations. Deep learning is to learn the intrinsic rules and the expression levels of training sample data, and the information obtained in the learning process is very helpful to the interpretation of data such as characters, images and sounds. The final goal of deep learning is to make a machine capable of human-like analytical learning, and to recognize data such as characters, images, and sounds. Deep learning is a complex machine learning algorithm, and achieves the effect in speech and image recognition far exceeding the prior related art.
Neural Networks (NN) are a deep learning model that mimics the structure and function of biological Neural networks in the field of machine learning and cognitive science.
The Transformer model is a classical NLP (natural language processing) model. The Transformer model encodes inputs and computes outputs based entirely on attention; it does not rely on sequence-aligned recurrent or convolutional neural networks, and because it uses the self-attention mechanism instead of the sequential structure of RNNs, the model can be trained in parallel and has access to global information.
The prior multi-modal reading understanding system mainly has the following defects when processing multi-modal signal input and finally completing the reading understanding problem:
1. For multi-modal tasks, processing with multiple systems leads to error accumulation. For example, if the image system makes an error and extracts useless feature signals, it becomes harder for the subsequent systems to make correct judgments. Similarly, when the text-processing system makes errors or its precision drops, the subsequent answering of reading comprehension questions based on the picture information is affected and wrong judgments are made more easily.
2. When multi-modal signals are jointly processed using multiple single-modal systems, some of the mutually enhanced signals are lost. When extracting features, the image system does not know what features the text system needs, and the text system cannot obtain signal input of images when extracting features, so that information relevant to the features during final fusion cannot be normally acquired.
3. When multiple systems are used to separately process the information of different modalities, the overall system becomes more complex, so the probability of system errors rises significantly. Because the image and the text are processed separately, ambiguities that the text and the image could otherwise resolve through their co-occurrence relationship are lost, which significantly reduces the ability to answer reading comprehension questions over multi-modal signals. Reading comprehension is already a difficult problem in itself, and the difficulty increases further if the information of the two modalities cannot be modeled simultaneously.
With the development of artificial intelligence technology, a single-mode model cannot meet the requirement of daily human-computer interaction. In order to comply with the requirement of technical development, the embodiment of the present application provides a multimodal reading and understanding system, and when solving the reading and understanding problem, the reading and understanding object of the embodiment of the present application is changed from input text to a picture of a complex scene, and through understanding the picture, a correct answer is selected. With the development of artificial neural networks, the processing capability of a single model of multi-modal information becomes more and more a key technology of human-computer interaction, so that the embodiment of the application provides a multi-modal-based reading understanding technology for helping a machine to better understand environmental signals and answer related questions through the signals. The technology has great application potential in future robots.
The embodiment of the application can be used to solve multi-modal reading understanding problems with picture and text question input, or image question-answering problems. The method can be widely applied to human-machine dialogue in real scenes and to the understanding and searching of images, and has great application value.
The multi-modal reading understanding problem-solving system of the embodiment of the application, based on a neural-network deep learning model, can overcome the above problems well. First, the embodiment of the present application vectorizes the text data, mapping each word, each question, each option, and each sentence to a specific space. The representation of the options is then subjected to a point-of-interest calculation (attention calculation) based on spatial relationships with the vector representations of the question and of the image in the other modality; that is, through this calculation the question and the options capture the parts of the image information most relevant to them. Finally, a fully connected deep neural network maps the information relating the question, the image, and the options to a solution space, so that the correct option is selected as the answer. The main advantages of the system of the embodiment of the application are: 1. Good migration performance: the system can be migrated to reading understanding problems in other modalities simply by replacing the training data. 2. No need to manually compile large amounts of expert knowledge: the model automatically acquires the knowledge needed to solve the problem from the data. 3. The images, questions, and options used are represented by space vectors, giving strong generalization capability. 4. A single system can handle both image and text modal input, which greatly simplifies later application deployment.
Referring to fig. 1, fig. 1 is a structural framework diagram of a reading understanding system according to an embodiment of the present application. The reading understanding system comprises a multi-modal Transformer model. The first input is text data of the text modality, including the question and the options. The second input is image data of the image modality, including the scene picture. Both are fed into the multi-modal Transformer model, which computes the option matching the question and the scene picture as the correct answer to the question. Specifically, the question, the options, and the scene picture are input into the Transformer model simultaneously; through the attention mechanism inside the model, a multi-modal vector representation containing the text information and the image information is calculated so as to filter out the useful picture information and text information, and the correct answer option is then selected from the filtered information through a fully connected layer. Because the feature representations of the two modalities are learned while the image and the text are input together, the model can accurately obtain a useful multi-modal vector representation and finally judge accurately whether the scene picture matches the answer to the current question. The whole reading understanding system completes the reading understanding problem directly with one model, so its structure is simple and its performance is high.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
The embodiments of the present application provide a reading understanding method, which may be executed by a terminal or a server, or may be executed by both the terminal and the server; the embodiment of the application is described by taking an example that a reading understanding method is executed by a server.
Referring to fig. 2 to 7, fig. 2, fig. 4, and fig. 6 are schematic flow diagrams of a reading understanding method according to an embodiment of the present application, and fig. 3, fig. 5, and fig. 7 are schematic application scenarios of the reading understanding method according to the embodiment of the present application. The method comprises the following steps:
step 110, acquiring text data and image data to be processed, wherein the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures.
For example, the text data includes a question and one or more options corresponding to the question. For example, the question to be handled is "who is wearing glasses in the figure?"; a single option might be "man", while multiple options might be "man, woman, child", and so on.
For example, the image data includes a scene picture, which is a picture provided corresponding to a scene described by the question, such as a user can answer the question by observing the scene picture during the answering process.
Step 120, extracting a text vector representation of the text data, wherein the text vector representation comprises text information of the question and text information of the option.
Optionally, the extracting a text vector representation of the text data includes:
and converting each word in the text data into a sequence number corresponding to each word in the word list through the word list, and searching the text vector representation of the text data according to the sequence number.
First, the text data may be vectorized to map each word, each question, each option, each sentence to a particular space to obtain a text vector representation of the text data.
For example, as described in conjunction with fig. 3, the reading understanding system may include a multi-modal data input processing module, a multi-modal feature extraction module, and an option scoring module, where the feature extraction module may employ a Transformer model. The text data in TXT format input into the data input processing module (namely the original text of the questions and the options in TXT format) is converted, via a word list, into the sequence number (ID) of each word in that word list, and the embedded (embedding) vector representation of each word in the original text is then looked up through its ID. For example, the question "who wears glasses in the figure?" with the option "man" is converted through the word list into the IDs [1,4,3,6,7,0,12,87,98,10]: because the ID of "figure" in the word list is 1, it is converted into 1; the vector representations (w1, w2, w3, w4, w5, w6, w7, w8, w9, w10) corresponding to the words are then found through the IDs to obtain a word vector sequence. This word vector sequence can subsequently be used as an input parameter of the Transformer model and can be defined as the text vector representation, represented as a matrix of seq_len by hid_size, where seq_len is the text length and hid_size is the size of a word vector.
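To make the word-list lookup concrete, the following is a minimal sketch of the text-side processing described above, written in PyTorch; the vocabulary size, hid_size, and the embedding table are illustrative assumptions, not values given in the patent.

    import torch
    import torch.nn as nn

    vocab_size, hid_size = 30000, 768               # illustrative sizes, not taken from the patent
    embedding = nn.Embedding(vocab_size, hid_size)  # one embedded vector per word-list ID

    # Question "who wears glasses in the figure?" plus the option "man", already converted
    # word by word into word-list IDs (mirroring the example IDs in the text above).
    ids = torch.tensor([[1, 4, 3, 6, 7, 0, 12, 87, 98, 10]])  # shape: (1, seq_len)

    text_vectors = embedding(ids)   # text vector representation, (1, seq_len, hid_size)
    print(text_vectors.shape)       # torch.Size([1, 10, 768])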
Step 130, a picture vector representation of the image data is extracted.
Optionally, the extracting a picture vector representation of the image data includes:
and performing target detection and feature extraction on the scene picture according to a target detection model to obtain the picture vector representation, wherein the picture vector representation comprises an image information vector representation of each visual target in the scene picture and an image information vector representation of the whole picture.
For example, entity information is first extracted from the image data input into the data input processing module by a target detection model (Fast R-CNN), where a Fast R-CNN model capable of correctly extracting entity information is obtained through model learning. Fast R-CNN (Fast Region-based Convolutional Network) is a fast, region-based convolutional network method for target detection.
For example, as described with reference to fig. 3, image data containing a scene picture is input, and target detection and feature extraction are performed on the scene picture through Fast R-CNN to obtain an image information vector representation of each visual target in the picture and an image information vector representation of the entire picture. A visual target corresponds to an object that the text data requires attention to; for example, if the text data mentions "who in the figure", then the people in the scene picture, such as the man and the woman in the figure, need to be attended to. The image information vector representation of the entire picture is applied to each text word separately, and the image information vector representation of a single visual target corresponds to the text vector that specifically represents that image information, i.e. the img text vector shown in fig. 3. The final image data used as an input parameter of the Transformer model is also a matrix of seq_len by hid_size, which can be defined as the picture vector representation.
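As a rough illustration of the picture-side input, the sketch below uses a hypothetical detect_regions helper standing in for the Fast R-CNN detector and an assumed per-region feature size of 2048; the only point it shows is that each visual target (plus the whole picture) contributes one vector projected to hid_size.

    import torch
    import torch.nn as nn

    hid_size = 768
    region_feat_dim = 2048   # assumed size of the detector's per-region features

    def detect_regions(image: torch.Tensor, max_regions: int = 6) -> torch.Tensor:
        """Hypothetical stand-in for the Fast R-CNN detector: returns one feature row per
        detected visual target, plus an extra row for the whole picture."""
        return torch.randn(max_regions + 1, region_feat_dim)

    project = nn.Linear(region_feat_dim, hid_size)   # map detector features to hid_size

    image = torch.randn(3, 480, 640)                 # dummy scene picture
    region_feats = detect_regions(image)             # (num_regions + 1, region_feat_dim)
    picture_vectors = project(region_feats)          # picture vector representation, (num_regions + 1, hid_size)
    print(picture_vectors.shape)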
Step 140, calculating a multi-modal vector representation comprising text information and image information according to the text vector representation and the picture vector representation.
For example, the embedding vector representation containing the image, the question, and the options is input into the Transformer model, and a multi-modal vector representation between each option and the picture and the question is calculated. A multi-modal vector representation is a vector that contains both text information and image information.
Optionally, the method further includes: acquiring position vector representation and type vector representation, wherein the position vector representation is used for marking the position of each word in the text data, and the type vector representation is used for distinguishing a text type from an image type;
the computing a multimodal vector representation including text information and image information from the text vector representation and the picture vector representation comprises:
and calculating a multi-modal vector representation containing text information and image information according to the text vector representation, the picture vector representation, the position vector representation and the type vector representation.
The position vector representation is used to mark the position of each word in the text data, and its size is a matrix of seq_len by hid_size. The type vector representation is also a matrix of size seq_len by hid_size; for example, the text type is represented as 0 and the image type as 1.
For example, as shown in fig. 3, the input parameter finally fed into the Transformer model may be an embedded vector representation Embedding, denoted as E, composed of the text vector representation plus the picture vector representation plus the position vector representation and the type vector representation. The embedding vector representation containing the image, the question, and the options is input into the Transformer model, and a multi-modal vector representation between each option and the picture and the question is calculated. The multi-modal vector representation is a vector containing both text information and image information: inside the Transformer model, the vector representations of the options and the question and the representation of the image are computed against each other, the optimal feature vectors are extracted through their co-occurrence, and a fully connected layer is finally attached to the multi-modal vector representation to compute the correct answer to the current question. Finally, the resulting values are normalized and used as the probability of each option being the correct answer.
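A minimal sketch of how the embedded vector representation E might be assembled, assuming the text vectors and picture vectors are concatenated along the sequence dimension (which the type embedding then distinguishes as 0 = text, 1 = image); the sequence lengths and the maximum position of 512 are illustrative assumptions.

    import torch
    import torch.nn as nn

    hid_size, text_len, img_len = 768, 10, 6
    seq_len = text_len + img_len

    text_vectors = torch.randn(1, text_len, hid_size)     # from the text-side processing
    picture_vectors = torch.randn(1, img_len, hid_size)   # from the picture-side processing

    position_emb = nn.Embedding(512, hid_size)            # position vector representation
    type_emb = nn.Embedding(2, hid_size)                  # type vector representation (text vs image)

    tokens = torch.cat([text_vectors, picture_vectors], dim=1)   # (1, seq_len, hid_size)
    positions = torch.arange(seq_len).unsqueeze(0)               # positions 0 .. seq_len-1
    types = torch.tensor([[0] * text_len + [1] * img_len])       # 0 = text type, 1 = image type

    E = tokens + position_emb(positions) + type_emb(types)       # embedded vector representation E
    print(E.shape)                                               # (1, seq_len, hid_size)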
For example, as described with reference to fig. 3, the multi-modal vector representation containing text information and image information may be calculated by processing the text vector representation and the picture vector representation with the multi-modal feature extraction module, which may employ a Transformer model. The main function of the feature extraction module is to calculate the multi-modal vector representation fusing the text information and the image information. As shown in fig. 3, after processing by the Transformer model, a multi-modal vector representation fused with text information and image information is finally obtained, and this multi-modal vector representation can be used to calculate the probability value of whether an answer to the question matches the current scene picture; for instance, the probability value of "man" being the correct answer is calculated. The feature extraction module mainly calculates the association relationships among the image, the question, and the options and outputs a matched feature matrix, which serves as the multi-modal vector representation containing text information and image information.
Optionally, as shown in fig. 4, step 140 may be implemented through steps 141 to 143, specifically:
and step 141, processing the text vector representation and the picture vector based on a self-attention model, and obtaining the text information of the question, the text information of the option, and the global interaction information between the image information.
Optionally, the processing the text vector representation and the picture vector representation based on a self-attention model to obtain global interaction information among the text information of the question, the text information of the option, and the image information includes: inputting an embedded vector representation determined from the text vector representation and the picture vector representation into the self-attention model, and calculating a matching matrix from the product of the embedded vector representation and the transpose of the embedded vector representation; and determining the global interaction information among the text information of the question, the text information of the option, and the image information according to the product of the matching matrix and the embedded vector representation.
Wherein the length dimension of the global interaction information is the same as the length dimension of the embedded vector representation.
For example, referring to fig. 5, the input to the feature extraction module is the vector representation of the image, the question, and the options, which is a matrix of seq_len by hid_size. This may be the Embedding vector representation E composed of the text vector representation and the picture vector representation, or the Embedding vector representation E constructed from the text vector representation, the picture vector representation, the position vector representation, and the type vector representation.
The output of the feature extraction module is a multi-modal vector representation fusing the image information and all the text information, with the size of a seq_len by hid_size matrix.
For example, referring to fig. 5, a matching matrix is calculated by a self-attention (self_attention) model inside the feature extraction module. The input is the seq_len by hid_size embedding vector representation E, i.e. the vector representation of the image, the question, and the options; the output is the global interaction information H_s, whose size is seq_len by hid_size. In the specific calculation, self_attention means computing attention of the input against itself: the matrix E is multiplied by E^T to obtain the matching matrix M, whose size is seq_len by seq_len, and the matrix M is then multiplied by E to obtain H_s, whose size is seq_len by hid_size. Here E^T is the transpose of the matrix E.
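The two matrix products just described can be written out directly; the sketch below follows the patent's wording only (no scaling or softmax, which a standard Transformer would add), with illustrative sizes.

    import torch

    seq_len, hid_size = 16, 768
    E = torch.randn(seq_len, hid_size)   # embedded vector representation E

    M = E @ E.T     # matching matrix M, (seq_len, seq_len)
    H_s = M @ E     # global interaction information H_s, (seq_len, hid_size)
    print(M.shape, H_s.shape)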
And 142, performing normalization processing on the global interaction information to obtain first normalization information.
For example, referring to fig. 5, the global interaction information H_s output by the self_attention model is normalized (norm) to obtain the first normalization information H_n, whose size is seq_len by hid_size; normalization does not affect the matrix size. The length dimension of the global interaction information is the same as the length dimension of the embedded vector representation, and the word-vector size of the global interaction information is the same as the word-vector size of the embedded vector representation.
Step 143, determining a multi-modal vector representation including text information and image information according to the global interaction information and the first normalization information.
Optionally, the determining a multi-modal vector representation including text information and image information according to the global interaction information and the first normalization information includes: adding the global interaction information and the first normalization information to obtain first summation information; inputting the first summation information into a full-connection layer for processing, and then carrying out normalization processing on an output result of the full-connection layer to obtain second normalization information; and adding the first summation information and the second normalization information to obtain the multi-modal vector representation containing the text information and the image information.
For example, referring to fig. 5, the input to the fully connected layer is H_s + H_n, i.e. the global interaction information H_s and the first normalization information H_n are added to obtain the first summation information, and the first summation information is input into the fully connected layer for processing; the size of the output of the fully connected layer is seq_len by hid_size. The output of the fully connected layer is then normalized (norm) to obtain the second normalization information, and the second normalization information is added to the first summation information again to obtain the multi-modal vector representation H_nn. Because H_nn has the same size as the input matrix E, the self_attention model can be stacked in multiple layers here, typically 12 or 24 layers.
For example, the multi-modal vector representation H_nn is finally output directly as the output, whose size is a matrix of seq_len by hid_size; this multi-modal vector representation fuses the image information and all the text information. The length dimension of the multi-modal vector representation is the same as the length dimension of the embedded vector representation.
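Putting these steps together, one layer of the feature extraction module as described above (matching matrix, normalization, fully connected layer, two additions) can be sketched as follows; the scaling and softmax inside the matching matrix are an added assumption for numerical stability when stacking 12 layers, not something the patent text specifies.

    import torch
    import torch.nn as nn

    class MultiModalLayer(nn.Module):
        """One layer of the feature extraction module, following the description above."""
        def __init__(self, hid_size: int = 768):
            super().__init__()
            self.norm1 = nn.LayerNorm(hid_size)
            self.norm2 = nn.LayerNorm(hid_size)
            self.fc = nn.Linear(hid_size, hid_size)

        def forward(self, E: torch.Tensor) -> torch.Tensor:
            # Matching matrix M; the scaling and softmax are assumptions for stability.
            M = torch.softmax(E @ E.transpose(-2, -1) / E.size(-1) ** 0.5, dim=-1)
            H_s = M @ E                           # global interaction information
            H_n = self.norm1(H_s)                 # first normalization information
            summed = H_s + H_n                    # first summation information
            H_2n = self.norm2(self.fc(summed))    # second normalization information
            return summed + H_2n                  # multi-modal vector representation H_nn

    layers = nn.ModuleList([MultiModalLayer() for _ in range(12)])  # stacked, e.g. 12 layers
    E = torch.randn(1, 16, 768)
    out = E
    for layer in layers:
        out = layer(out)
    print(out.shape)   # same seq_len by hid_size as the input E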
The multi-modal vector representation which integrates options, images and questions is obtained through calculation through the feature extraction module, the multi-modal vector representation comprises a highly abstract semantic matching relation between texts and images, and rich information is provided for a follow-up module to select correct answers according to matching information. And meanwhile, the text matching which is only performed at a simple character level in the past is converted into the matching between vector spaces. The feature extraction module raises text matching to the semantic space level.
And 150, calculating a probability value of each option serving as a correct answer according to the multi-modal vector representation, and determining the correct answer matched with the question and the scene picture from the options according to the probability value.
For example, the output result (multi-modal vector representation combining the option, the image and the question) output by the feature extraction module may be processed by a single option scoring module to calculate the probability value of the current option as the correct answer.
Optionally, as shown in fig. 6, step 150 may be implemented through steps 151 to 153, specifically:
and step 151, splitting the multi-modal vector representation to obtain a question option representation and an image option representation.
For example, referring to fig. 7, the input multi-modal vector representation fused with the image and the text is split by matrix to obtain the question option representation H_qc and the image option representation H_pc. The size of the output is a matrix of seq_len by hid_size, the question option representation is a matrix of seq1_len by hid_size, and the image option representation is a matrix of seq2_len by hid_size, where seq1_len + seq2_len = seq_len.
Step 152, processing the question option representation and the image option representation based on a cross attention model to obtain a point of interest vector representation.
For example, referring to fig. 7, based on a cross attention model, an attention vector is calculated from the question option representation H_qc and the image option representation H_pc. H_qc and H_pc are each passed through a fully connected layer to obtain H'_qc and H'_pc, and H'_qc is then matrix-multiplied by the transpose of H'_pc to obtain the attention representation, whose matrix size is seq1 by seq2. The attention representation is then matrix-multiplied by H_pc to obtain the attention vector representation H_att, whose matrix size is seq1 by hid_size. The attention representation indicates the content or points of interest; for example, for the question and scene picture illustrated in fig. 3, the obtained point-of-interest vector representation may contain the content "the man in the picture wears glasses" and "the woman in the picture does not wear glasses".
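A sketch of this cross-attention step with illustrative shapes; the fully connected layers fc_q and fc_p stand in for the connections that produce H'_qc and H'_pc, and the transpose in the product is assumed so that the attention matrix comes out as seq1 by seq2.

    import torch
    import torch.nn as nn

    hid_size, seq1, seq2 = 768, 10, 6
    H_qc = torch.randn(seq1, hid_size)   # question option representation
    H_pc = torch.randn(seq2, hid_size)   # image option representation

    fc_q = nn.Linear(hid_size, hid_size) # produces H'_qc
    fc_p = nn.Linear(hid_size, hid_size) # produces H'_pc

    attention = fc_q(H_qc) @ fc_p(H_pc).T   # attention representation, (seq1, seq2)
    H_att = attention @ H_pc                # point-of-interest vector representation H_att, (seq1, hid_size)
    print(attention.shape, H_att.shape)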
Step 153, according to the attention point vector representation, calculating a probability value of each option as a correct answer, so as to determine a correct answer matched with the question and the scene picture from the options according to the probability value.
For example, referring to fig. 7, the probability value of the option being the answer is finally calculated. The attention vector representation H_att is summed over the seq1 dimension to obtain a vector V, which is a 1 by hid_size vector. One more fully connected layer then maps the vector V to a score, which may represent the probability value of the option being the correct answer.
When the score is calculated by the option scoring module, the options, the question, and the picture are distinguished again, and the probability value of the current option being the correct answer is obtained once more through the attention mechanism. If a question has four options, the correct answer can be determined by comparing the scores of the four options; for example, the option with the highest probability value (i.e. the highest score) is determined as the correct answer matching the question and the scene picture.
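The final scoring step can be sketched as below: H_att is summed over the seq1 dimension into a vector V, a fully connected layer maps V to a scalar score per option, and the option with the highest score is chosen; the four options and all shapes are illustrative assumptions.

    import torch
    import torch.nn as nn

    hid_size, seq1, num_options = 768, 10, 4
    score_fc = nn.Linear(hid_size, 1)   # maps the vector V to a scalar score

    scores = []
    for _ in range(num_options):                 # one H_att per candidate option
        H_att = torch.randn(seq1, hid_size)      # point-of-interest representation for this option
        V = H_att.sum(dim=0)                     # sum over the seq1 dimension, (hid_size,)
        scores.append(score_fc(V))

    scores = torch.stack(scores).squeeze(-1)     # (num_options,)
    probs = torch.softmax(scores, dim=0)         # probability of each option being the correct answer
    answer = int(torch.argmax(probs))            # index of the option chosen as the correct answer
    print(probs, answer)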
When solving a problem, the whole multi-modal reading understanding system processes it in turn through the multi-modal data input processing module, the multi-modal feature extraction module, and the option scoring module to obtain the correct answer to the question. Before the reading understanding system is used, sufficient multi-modal reading understanding data can be provided for model learning of the reading understanding system. After learning, the whole reading understanding system can automatically solve reading understanding problems in a specific scenario (such as entrance examination English).
The embodiment of the application can be based on a Transformer-based multi-modal reading understanding technology, has multi-modal information extraction capability, and can calculate the correct answer to the question according to the multi-modal vector representation.
Compared with the traditional system based on rules and expert knowledge, the reading understanding system greatly simplifies the whole problem solving process and removes the complicated rules and the process of knowledge characteristic extraction. Only the data of system model training is needed to be provided, and the system can automatically learn knowledge and rules related to problem solving. The reading understanding system can be rapidly migrated to other similar tasks and fields. The reading and understanding system adopts a design scheme of modularization, different modules have specific functions, and the reading and understanding system can be quickly transferred to related systems. For example, the option scoring module can be directly applied to other tasks needing to calculate the vector similarity. And the whole system can be directly applied to the multi-modal reading understanding task by replacing the training data.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
The method comprises the steps of obtaining text data and image data to be processed, wherein the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures; extracting a text vector representation of the text data, wherein the text vector representation comprises text information of the question and text information of the option; extracting a picture vector representation of the image data; according to the text vector representation and the picture vector representation, calculating a multi-modal vector representation containing text information and image information; and calculating the probability value of each option as a correct answer according to the multi-modal vector representation, and determining the correct answer matched with the question and the scene picture from the options according to the probability value. According to the method and the device, multi-modal reading understanding of picture and text question input is achieved through a Transformer model, image data and text data containing questions and options are input through the model at the same time, multi-modal vector representation containing text information and image information is calculated through an attention (attention) mechanism in the Transformer model, useful picture information and text information are filtered out, then correct answer options are selected according to the multi-modal vector representation, and reading understanding accuracy is improved.
In order to better implement the reading and understanding method of the embodiment of the application, the embodiment of the application also provides a reading and understanding device. Referring to fig. 8, fig. 8 is a schematic structural diagram of a reading and understanding apparatus according to an embodiment of the present application. The reading and understanding device 200 may comprise:
an obtaining unit 201, configured to obtain text data and image data to be processed, where the text data includes a question and an option corresponding to the question, and the image data includes a scene picture;
a first extracting unit 202, configured to extract a text vector representation of the text data, where the text vector representation includes text information of the question and text information of the option;
a second extraction unit 203 for extracting a picture vector representation of the image data;
a calculating unit 204, configured to calculate a multi-modal vector representation including text information and image information according to the text vector representation and the picture vector representation;
the determining unit 205 is configured to calculate a probability value of each option as a correct answer according to the multi-modal vector representation, so as to determine, from the options, a correct answer matched with the question and the scene picture according to the probability value.
Optionally, the first extracting unit 202 may be configured to convert each word in the text data into a sequence number corresponding to each word in a word list through the word list, and search for text vector representation of the text data according to the sequence number.
Optionally, the second extracting unit 203 may be configured to perform target detection and feature extraction on the scene picture according to a target detection model to obtain the picture vector representation, where the picture vector representation includes an image information vector representation of each visual target in the scene picture and an image information vector representation of the entire picture.
Optionally, the computing unit 204 may be specifically configured to: process the text vector representation and the picture vector representation based on a self-attention model to obtain global interaction information among the text information of the question, the text information of the option, and the image information; carry out normalization processing on the global interaction information to obtain first normalization information; and determine a multi-modal vector representation containing text information and image information according to the global interaction information and the first normalization information.
Optionally, when the text vector representation and the picture vector representation are processed based on the self-attention model to obtain the global interaction information between the text information of the question, the text information of the option, and the image information, the calculating unit 204 may be configured to: input an embedded vector representation, determined from the text vector representation and the picture vector representation, into the self-attention model, and calculate a matching matrix from the product of the embedded vector representation and the transposed matrix of the embedded vector representation; and determine the global interaction information between the text information of the question, the text information of the option, and the image information according to the product of the matching matrix and the embedded vector representation.
Optionally, when determining the multi-modal vector representation containing text information and image information according to the global interaction information and the first normalization information, the calculating unit 204 may be configured to: add the global interaction information and the first normalization information to obtain first summation information; input the first summation information into a fully connected layer for processing, and then perform normalization processing on the output of the fully connected layer to obtain second normalization information; and add the first summation information and the second normalization information to obtain the multi-modal vector representation containing the text information and the image information.
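Taken together, these steps admit a sketch of one fusion layer as shown below. The residual and normalization arrangement follows the description literally; the scaling and row-wise softmax applied to the matching matrix, the GELU activation, and the 768-dimensional hidden size are standard-practice assumptions that the embodiment does not spell out:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalFusionLayer(nn.Module):
    def __init__(self, hidden: int = 768, ffn: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.fully_connected = nn.Sequential(
            nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, embedded: torch.Tensor) -> torch.Tensor:
        # embedded: (batch, seq_len, hidden) — concatenated text and picture vectors.
        # Matching matrix from the product of the embedding and its transpose.
        matching = torch.matmul(embedded, embedded.transpose(-1, -2)) / embedded.size(-1) ** 0.5
        matching = F.softmax(matching, dim=-1)          # assumed row-wise normalisation
        # Global interaction information between question, option, and image information.
        interaction = torch.matmul(matching, embedded)
        # First normalization information, then first summation information.
        summed = interaction + self.norm1(interaction)
        # Fully connected layer, second normalization information, final residual sum.
        return summed + self.norm2(self.fully_connected(summed))

multi_modal = MultiModalFusionLayer()(torch.randn(2, 60, 768))  # e.g. 50 text + 10 image vectors
print(multi_modal.shape)  # torch.Size([2, 60, 768])
```

In a full model such a layer would typically be stacked several times, although the number of layers is not specified here.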
Optionally, the obtaining unit 201 may be further configured to obtain a position vector representation and a type vector representation, where the position vector representation is used to mark a position of each word in the text data, and the type vector representation is used to distinguish a text type from an image type;
the calculation unit 204 may be configured to calculate a multi-modal vector representation comprising text information and image information based on the text vector representation, the picture vector representation, the location vector representation, and the type vector representation.
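A sketch of how these four representations could be combined before the fusion layers follows; the conventions that text takes type 0 and image takes type 1, that image vectors also receive position indices, and the 768-dimensional size are illustrative assumptions only:

```python
import torch
import torch.nn as nn

hidden, max_len = 768, 512
position_table = nn.Embedding(max_len, hidden)  # position vector representation
type_table = nn.Embedding(2, hidden)            # type vector representation: 0 = text, 1 = image

def combined_embedding(text_vec: torch.Tensor, picture_vec: torch.Tensor) -> torch.Tensor:
    # text_vec: (num_words, hidden); picture_vec: (num_image_vectors, hidden)
    tokens = torch.cat([text_vec, picture_vec], dim=0)
    positions = position_table(torch.arange(tokens.size(0)))  # marks the position of each item
    types = type_table(torch.cat([
        torch.zeros(text_vec.size(0), dtype=torch.long),      # text type
        torch.ones(picture_vec.size(0), dtype=torch.long)]))  # image type
    return tokens + positions + types  # input to the multi-modal fusion layers

print(combined_embedding(torch.randn(50, hidden), torch.randn(10, hidden)).shape)  # (60, 768)
```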
Optionally, the determining unit 205 may be specifically configured to: split the multi-modal vector representation to obtain a question option representation and an image option representation; process the question option representation and the image option representation based on a cross-attention model to obtain an attention point vector representation; and calculate a probability value of each option being the correct answer according to the attention point vector representation, and determine, from the options and according to the probability values, the correct answer matched with the question and the scene picture.
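A sketch of this determining unit is given below; it assumes the fused sequence is split at a known text length, that the question option representation attends to the image option representation (the direction of the cross attention is an assumption), and that mean pooling plus a linear scorer yield one logit per option:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerSelector(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden, 1)  # one logit per option

    def forward(self, multi_modal: torch.Tensor, num_text: int) -> torch.Tensor:
        # multi_modal: (num_options, seq_len, hidden) — one fused sequence per option.
        question_option = multi_modal[:, :num_text, :]   # question option representation
        image_option = multi_modal[:, num_text:, :]      # image option representation
        # Cross attention: the question option part attends to the image option part.
        weights = F.softmax(
            torch.matmul(question_option, image_option.transpose(-1, -2))
            / question_option.size(-1) ** 0.5, dim=-1)
        focus = torch.matmul(weights, image_option)       # attention point vector representation
        logits = self.scorer(focus.mean(dim=1)).squeeze(-1)
        return F.softmax(logits, dim=0)                   # probability of each option being correct

probabilities = AnswerSelector()(torch.randn(4, 60, 768), num_text=50)
print(probabilities, int(probabilities.argmax()))  # per-option probabilities and predicted index
```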
It should be noted that, for the functions of each module in the reading and understanding apparatus 200 in this embodiment, reference may be made to the specific implementation of any one of the foregoing method embodiments, and details are not described here again.
The above reading and understanding apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The units may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the units.
The reading and understanding apparatus 200 may be integrated in a terminal or a server having a memory, a processor, and computing capability, or the reading and understanding apparatus 200 may itself be the terminal or the server. The terminal can be a smart phone, a tablet computer, a notebook computer, a smart television, a smart speaker, a wearable smart device, a Personal Computer (PC), and the like, and the terminal can further include a client, which can be a video client, a browser client, an instant messaging client, and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Fig. 9 is a schematic structural diagram of a computer device provided in an embodiment of the present application. As shown in fig. 9, the computer device 300 may include: a communication interface 301, a memory 302, a processor 303, and a communication bus 304. The communication interface 301, the memory 302, and the processor 303 communicate with one another through the communication bus 304. The communication interface 301 is used by the computer device 300 to exchange data with external devices. The memory 302 may be used to store software programs and modules, and the processor 303 may run the software programs and modules stored in the memory 302, for example, the software programs corresponding to the operations in the foregoing method embodiments.
Alternatively, the processor 303 may call the software programs and modules stored in the memory 302 to perform the following operations: acquiring text data and image data to be processed, wherein the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures; extracting a text vector representation of the text data, the text vector representation including text information of the question and text information of the option; extracting a picture vector representation of the image data; calculating a multi-modal vector representation comprising text information and image information according to the text vector representation and the picture vector representation; and calculating a probability value of each option serving as a correct answer according to the multi-modal vector representation, and determining the correct answer matched with the question and the scene picture from the options according to the probability value.
Optionally, the computer device 300 is the terminal or the server. The terminal can be a smart phone, a tablet computer, a notebook computer, a smart television, a smart sound box, a wearable smart device, a personal computer and the like. The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and artificial intelligence platform and the like.
Optionally, the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps in the foregoing method embodiments when executing the computer program.
The present application also provides a computer-readable storage medium for storing a computer program. The computer-readable storage medium can be applied to a computer device, and the computer program enables the computer device to execute the corresponding process in the reading and understanding method in the embodiment of the present application, which is not described herein again for brevity.
The present application also provides a computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes the corresponding process in the reading and understanding method in the embodiment of the present application, which is not described herein again for brevity.
The present application also provides a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes the corresponding process in the reading and understanding method in the embodiment of the present application, which is not described herein again for brevity.
It should be understood that the processor of the embodiments of the present application may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
It will be appreciated that the memory in the embodiments of the present application can be a volatile memory or a non-volatile memory, or can include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
It should be understood that the above memories are exemplary rather than limiting; for example, the memory in the embodiments of the present application may also be a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synchronous Link DRAM (SLDRAM), a Direct Rambus RAM (DR RAM), and the like. That is, the memory in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that contributes to the prior art in essence, or a portion of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer or a server) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method of reading comprehension, the method comprising:
acquiring text data and image data to be processed, wherein the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures;
extracting a text vector representation of the text data, the text vector representation including text information of the question and text information of the option;
extracting a picture vector representation of the image data;
calculating a multi-modal vector representation comprising text information and image information according to the text vector representation and the picture vector representation;
and calculating a probability value of each option serving as a correct answer according to the multi-modal vector representation, and determining the correct answer matched with the question and the scene picture from the options according to the probability value.
2. The reading comprehension method of claim 1 wherein said extracting a text vector representation of said text data comprises:
and converting each word in the text data into a sequence number corresponding to each word in the word list through the word list, and searching the text vector representation of the text data according to the sequence number.
3. The reading understanding method of claim 1, wherein the extracting the picture vector representation of the image data comprises:
and performing target detection and feature extraction on the scene picture according to a target detection model to obtain the picture vector representation, wherein the picture vector representation comprises an image information vector representation of each visual target in the scene picture and an image information vector representation of the whole picture.
4. The reading comprehension method of claim 1 wherein said computing a multi-modal vector representation comprising text information and image information from said text vector representation and said picture vector representation comprises:
processing the text vector representation and the picture vector representation based on a self-attention model to obtain global interaction information between the text information of the question, the text information of the option, and the image information;
carrying out normalization processing on the global interaction information to obtain first normalization information;
and determining a multi-modal vector representation containing text information and image information according to the global interaction information and the first normalization information.
5. The reading understanding method of claim 4, wherein the processing the text vector representation and the picture vector representation based on a self-attention model to obtain global interaction information between the text information of the question, the text information of the option, and the image information comprises:
inputting an embedded vector representation, determined from the text vector representation and the picture vector representation, into the self-attention model, and calculating a matching matrix from a product between the embedded vector representation and a transposed matrix of the embedded vector representation;
and determining global interaction information between the text information of the question, the text information of the option and the image information according to the product of the matching matrix and the embedded vector representation.
6. The reading understanding method of claim 4, wherein the determining a multi-modal vector representation containing text information and image information based on the global interaction information and the first normalization information comprises:
adding the global interaction information and the first normalization information to obtain first summation information;
inputting the first summation information into a fully connected layer for processing, and then performing normalization processing on an output result of the fully connected layer to obtain second normalization information;
and adding the first summation information and the second normalization information to obtain the multi-modal vector representation containing the text information and the image information.
7. The method of claim 1, further comprising:
acquiring position vector representation and type vector representation, wherein the position vector representation is used for marking the position of each word in the text data, and the type vector representation is used for distinguishing a text type from an image type;
the computing a multimodal vector representation including text information and image information from the text vector representation and the picture vector representation comprises:
and calculating a multi-modal vector representation containing text information and image information according to the text vector representation, the picture vector representation, the position vector representation and the type vector representation.
8. The reading comprehension method of one of claims 1 to 7 wherein the calculating a probability value for each of the choices as a correct answer based on the multi-modal vector representation to determine a correct answer from the choices that matches the question and the scene picture based on the probability values comprises:
splitting the multi-modal vector representation to obtain a question option representation and an image option representation;
processing the question option representation and the image option representation based on a cross-attention model to obtain an attention point vector representation;
and calculating a probability value of each option serving as a correct answer according to the attention point vector representation, and determining the correct answer matched with the question and the scene picture from the options according to the probability value.
9. A reading and understanding apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring text data and image data to be processed, the text data comprises questions and options corresponding to the questions, and the image data comprises scene pictures;
a first extraction unit configured to extract a text vector representation of the text data, the text vector representation including text information of the question and text information of the option;
a second extraction unit for extracting a picture vector representation of the image data;
the computing unit is used for computing multi-modal vector representation containing text information and image information according to the text vector representation and the picture vector representation;
and the determining unit is used for calculating a probability value of each option as a correct answer according to the multi-modal vector representation so as to determine the correct answer matched with the question and the scene picture from the options according to the probability value.
10. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor to perform the steps of the reading and understanding method according to any one of claims 1-8.
11. A computer device, characterized in that the computer device comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the steps in the reading understanding method according to any one of claims 1-8 by calling the computer program stored in the memory.
CN202111655536.3A 2021-12-30 2021-12-30 Reading understanding method and device, storage medium and computer equipment Pending CN114238587A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111655536.3A CN114238587A (en) 2021-12-30 2021-12-30 Reading understanding method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111655536.3A CN114238587A (en) 2021-12-30 2021-12-30 Reading understanding method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN114238587A true CN114238587A (en) 2022-03-25

Family

ID=80744790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111655536.3A Pending CN114238587A (en) 2021-12-30 2021-12-30 Reading understanding method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN114238587A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100193 room 311-2, third floor, building 5, East District, yard 10, northwest Wangdong Road, Haidian District, Beijing

Applicant after: iFLYTEK (Beijing) Co.,Ltd.

Applicant after: IFLYTEK Co.,Ltd.

Address before: 100193 room 311-2, third floor, building 5, East District, yard 10, northwest Wangdong Road, Haidian District, Beijing

Applicant before: Zhongke Xunfei Internet (Beijing) Information Technology Co.,Ltd.

Applicant before: IFLYTEK Co.,Ltd.