CN112148836A - Multi-modal information processing method, device, equipment and storage medium

Multi-modal information processing method, device, equipment and storage medium

Info

Publication number
CN112148836A
Authority
CN
China
Prior art keywords
information
modality
modal
training
type
Prior art date
Legal status
Pending
Application number
CN202010928220.6A
Other languages
Chinese (zh)
Inventor
柴琛林 (Chai Chenlin)
李航 (Li Hang)
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010928220.6A
Publication of CN112148836A

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3344 Query execution using natural language analysis
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N20/00 Machine learning
    • G06N3/02 Neural networks

Abstract

The embodiments of the present application provide a multi-modal information processing method, apparatus, device, and storage medium. The method includes the following steps: acquiring at least one piece of first modality information; determining, according to the at least one piece of first modality information, multi-modal information corresponding to it; and outputting the multi-modal information. In this way, users' requirements can be met, and the applicability of the multi-modal information processing method can be improved.

Description

Multi-modal information processing method, device, equipment and storage medium
Technical Field
The embodiments of the present application relate to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for multimodal information processing.
Background
At present, many information query technologies comprehensively use natural language processing, information retrieval, artificial intelligence, and other technologies. For example, intelligent question answering is a novel information service technology that comprehensively utilizes natural language processing, information retrieval, artificial intelligence, and the like. Unlike a conventional search engine, an intelligent question-answering system lets users ask questions in natural language sentences; the system analyzes and understands a user's question and returns the answer the user wants.
Disclosure of Invention
The embodiments of the present application provide a multi-modal information processing method, apparatus, device, and storage medium.
In a first aspect, an embodiment of the present application provides a multimodal information processing method, including: acquiring at least one first modality information; determining multi-modal information corresponding to the at least one first modal information according to the at least one first modal information; and outputting the multi-modal information.
In a second aspect, an embodiment of the present application provides a multimodal information processing method, including: obtaining first training data, the first training data comprising: at least one second modality information; training a pre-training language model through first training data; the pre-training language model is used for determining multi-modal information corresponding to the at least one first modal information according to the at least one first modal information.
In a third aspect, an embodiment of the present application provides a multimodal information processing apparatus, including: the device comprises a first acquisition module, a determination module and an output module, wherein the first acquisition module is used for acquiring at least one type of first modality information; the determining module is used for determining multi-modal information corresponding to the at least one first modal information according to the at least one first modal information; the output module is used for outputting multi-modal information.
In a fourth aspect, an embodiment of the present application provides a multimodal information processing apparatus, including: the device comprises a first acquisition module and a first training module, wherein the first acquisition module is used for acquiring first training data, and the first training data comprises: at least one second modality information; the first training module is used for training a pre-training language model through first training data; the pre-training language model is used for determining multi-modal information corresponding to the at least one first modal information according to the at least one first modal information.
In a fifth aspect, an electronic device is provided, comprising:
a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any embodiment of the present application.
In a sixth aspect, there is provided a computer readable storage medium for storing a computer program for causing a computer to perform the method of any embodiment of the present application.
In the embodiments of the present application, multi-modal information corresponding to at least one piece of first modality information may be determined according to the at least one piece of first modality information. That is, the embodiments of the present application implement single-modal or multi-modal information input together with multi-modal information output. Compared with single-modal input and output, this can meet users' requirements for diverse information presentation and can also improve the applicability of the multi-modal information processing method.
Further, the embodiments of the present application implement end-to-end input and output through the pre-trained language model. That is, the pre-trained language model is a neural-network-based model, and multi-modal information can be output simply by inputting at least one piece of modality information into the model. This end-to-end input and output approach can improve information processing efficiency.
Furthermore, compared with information processing methods based on conventional machine learning models or hand-crafted rules, the neural-network-based information processing method of the embodiments of the present application can improve information processing efficiency.
Drawings
Fig. 1A is a schematic diagram of an intelligent question and answer scenario provided in an embodiment of the present application;
fig. 1B is a schematic diagram of an intelligent question-answering scenario provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a multimodal information processing process provided in an embodiment of the present application;
FIG. 3 is a diagram illustrating a question in an image modality according to an embodiment of the present application;
FIG. 4 is a flowchart of a multimodal information processing method according to an embodiment of the present application;
FIG. 5 is a flow diagram of a method for determining multimodal information provided by an embodiment of the present application;
FIG. 6 is a diagram illustrating a pre-trained language model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a pre-trained language model provided in accordance with another embodiment of the present application;
FIG. 8 is a flow diagram of a method for determining multimodal information provided by another embodiment of the present application;
FIG. 9 is a schematic diagram of a pre-trained language model according to yet another embodiment of the present application;
FIG. 10 is a schematic diagram of a pre-trained language model provided in accordance with yet another embodiment of the present application;
FIG. 11 is a flow chart of a method for multimodal information processing according to another embodiment of the present application;
FIG. 12 is a flow chart of a method for multimodal information processing according to yet another embodiment of the present application;
FIG. 13 is a flowchart of a method for processing at least one third reference information according to an embodiment of the present application;
FIG. 14 is a flowchart of a method for processing at least one third reference information according to another embodiment of the present application;
fig. 15 is a schematic diagram of a multi-modal information processing apparatus 1500 according to an embodiment of the present application;
fig. 16 is a schematic diagram of a multi-modal information processing apparatus 1600 according to an embodiment of the present application;
fig. 17 is a schematic block diagram of an electronic device 1700 provided in an embodiment of the present application.
Detailed Description
The input of a current intelligent question-answering system is a single text question, and the output is a single text answer. Obviously, current information query technologies, such as the single-modal information processing mode of current intelligent question-answering systems, cannot meet users' requirements and suffer from poor applicability.
In order to solve this technical problem, the inventive concept of the present application is: perform vector representation and information fusion on input single-modal or multi-modal information so as to output multi-modal information.
The technical solution of the embodiment of the present application is applicable to the following scenarios, but is not limited thereto:
Scene one: intelligent question-answering scenarios. For example, a user may enter the intelligent question-answering interface shown in fig. 1A, and the user's interaction on this interface is an intelligent question-answering scenario. Fig. 1A currently shows a question in the image modality; the user may also input questions in modalities such as voice, video, and text on the interface. Alternatively, the user may click a plug-in, icon, or virtual button on the terminal to enter the intelligent question-answering interface. Intelligent question-answering interfaces can be presented in various ways: fig. 1A shows one such interface, and fig. 1B shows another, entered by the user clicking a customer service icon in an Application (APP).
Scene two: other predictive scenarios, such as: predicting the next sentence of the current sentence, or predicting inter-sentence consistency.
It should be understood that, in the embodiments of the present application, the terminal device may be a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capability, a computing device or another processing device connected to a wireless modem, an in-vehicle device, a wearable device, or the like, and the embodiments of the present application are not limited thereto.
The technical solution of the embodiment of the present application will be described in detail below:
in the embodiments of the present application: vector representation and information fusion are carried out on the input single-mode information or multi-mode information through a pre-training language model, so that multi-mode information is output. Therefore, the following description focuses on how to perform vector representation and information fusion on input single-mode information or multi-mode information through a pre-training language model to output a training process of the multi-mode information and the pre-training language model. The training of the pre-training language model can be divided into unsupervised training and supervised training.
For example, taking application to an intelligent question-answering scenario as an example, fig. 2 is a schematic diagram of a multi-modal information processing process provided in an embodiment of the present application. As shown in fig. 2, data needs to be labeled for supervised training. That is, the data input to the model comprises a two-tuple <at least one modality question, multi-modal answer>, or a triplet <at least one modality question, at least one piece of reference information, multi-modal answer>, where the reference information is the reference information of a modality question and is also described as reference data. Further, the pre-trained language model may be trained on the labeled data.
Furthermore, at least one modality question, or a two-tuple <at least one modality question, at least one piece of reference information>, may be input into the trained pre-trained language model; the input data is processed by the pre-trained language model, e.g., through vector representation and information fusion, and a multi-modal answer is finally output.
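For illustration only, the labeled samples described above could be represented as simple data structures such as the following (a minimal sketch; the field names and types are assumptions, not part of the original disclosure):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModalityInfo:
    modality: str   # assumed tags: "text", "speech", "image", "video"
    content: bytes  # raw payload of the question/reference/answer in that modality

@dataclass
class LabeledSample:
    # Two-tuple <at least one modality question, multi-modal answer>, optionally
    # extended to the triplet with at least one piece of reference information.
    questions: List[ModalityInfo]                    # at least one modality question
    answers: List[ModalityInfo]                      # multi-modal answer
    references: Optional[List[ModalityInfo]] = None  # reference information, if any
```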
It should be noted that, in the embodiments of the present application, one modality may be a text modality, a voice modality, an image modality, or a video modality. That is, the at least one modality question may include: a question in a text modality, a question in a voice modality, a question in an image modality, or a question in a video modality. The multi-modal answer includes at least two of: an answer in a text modality, an answer in a voice modality, an answer in an image modality, and an answer in a video modality. The at least one piece of reference information, also referred to as reference information of at least one modality, may include at least one of: reference information in a text modality, reference information in a voice modality, reference information in an image modality, and reference information in a video modality.
An exemplary illustration is given for the at least one modality question. Suppose the question is "In which dynasty, and by which poet, was 'Bring in the Wine' (Jiang Jin Jiu) written?". The question in the text modality is this sentence as text; the question in the voice modality is the same sentence as speech; the question in the image modality is shown in fig. 3; and the question in the video modality may be a piece of video asking in which dynasty, and by which poet, "Bring in the Wine" was written.
An exemplary illustration is given for the multi-modal answer. For the question "In which dynasty, and by which poet, was 'Bring in the Wine' written?", the answer in the text modality may be the text "'Bring in the Wine' was written by Li Bai, a poet of the Tang Dynasty"; the answer in the voice modality is the same sentence as speech; the answer in the image modality may, for example, show in an image that "Bring in the Wine" was written by the Tang Dynasty poet Li Bai; and the answer in the video modality may be a piece of video about "Bring in the Wine" having been written by Li Bai in the Tang Dynasty.
An exemplary illustration is given for the reference information of at least one modality. For the question "In which dynasty, and by which poet, was 'Bring in the Wine' written?", the reference information in the text modality may be the text of "Bring in the Wine" itself, or a brief introduction to Li Bai; the reference information in the voice modality may be a recitation of "Bring in the Wine"; the reference information in the image modality may be an image containing the content of "Bring in the Wine"; and the reference information in the video modality may be a verse-analysis video about "Bring in the Wine".
It should be understood that, in the embodiments of the present application, the modality of the reference information corresponding to a question in one modality may be the same as, or different from, the modality of that question; the embodiments of the present application do not limit this. For example, a question in the text modality may correspond to reference information in the text modality, and may also correspond to reference information in modalities such as voice, image, or video.
It should be understood that the above-mentioned "one modality" may be: a modality of the at least one piece of first modality information in the embodiments of the present application; a modality of the multi-modal information corresponding to the at least one piece of first modality information; a modality of the first reference information corresponding to the at least one piece of first modality information; a modality of the at least one piece of second modality information; a modality of the second reference information corresponding to the at least one piece of second modality information; a modality of the at least one piece of third modality information; a modality of the at least one piece of modality information corresponding to the at least one piece of third modality information; or a modality of the third reference information corresponding to the at least one piece of third modality information.
It should be understood that the reference information of at least one piece of modality information is reference information related to that modality information, and it may be used to assist in obtaining the modality information finally output by the pre-trained language model. For example: the at least one piece of first reference information of the at least one piece of first modality information is used to assist in acquiring the multi-modal information corresponding to the at least one piece of first modality information; the at least one piece of second reference information of the at least one piece of second modality information is used to assist in acquiring the multi-modal information corresponding to the at least one piece of second modality information. At least one piece of reference information may also be used to assist in obtaining modality information during supervised training; for example, the at least one piece of third reference information corresponding to the at least one piece of third modality information may be used to assist in acquiring, during supervised training, the at least one piece of modality information corresponding to the at least one piece of third modality information.
Alternatively, the reference information may be reference information about modality information acquired by a search engine.
Fig. 4 is a flowchart of a multimodal information processing method provided in an embodiment of the present application, where an execution subject of the method may be part or all of a terminal device, where part of the terminal device may be a processor of the terminal device, and the execution subject of the method may also be the terminal device and a server, that is, a part of steps in fig. 4 are executed by the terminal device, and another part of steps are executed by the server, which is not limited in this application. As shown in fig. 4, the method includes:
step S410: at least one first modality information is acquired.
Step S420: and determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information.
Step S430: and outputting the multi-modal information.
The respective modalities of the at least one first modality information are as described above, and details thereof are not repeated in the embodiments of the present application.
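Before detailing step S420, the overall flow of steps S410 to S430 can be sketched as a black-box wrapper around the pre-trained language model (the function and parameter names below are hypothetical, not from the original disclosure):

```python
def process_multimodal_query(model, first_modality_infos, first_references=None):
    """Hypothetical wrapper for steps S410-S430."""
    # S410: acquire at least one piece of first modality information.
    assert len(first_modality_infos) >= 1
    # S420: determine the corresponding multi-modal information, optionally
    # assisted by first reference information (see the two alternatives below).
    multimodal_info = model(first_modality_infos, first_references)
    # S430: output the multi-modal information.
    return multimodal_info
```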
The following is a detailed description of step S420:
the first alternative is as follows: and determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information.
The second alternative: determining, according to the at least one piece of first modality information alone, the multi-modal information corresponding to the at least one piece of first modality information.
Description of the first alternative:
optionally, fig. 5 is a flowchart of a method for determining multimodal information according to an embodiment of the present application, and as shown in fig. 5, the method includes:
step S510: for each of the at least one type of first modality information, mapping the first modality information into a plurality of first characterization vectors.
Step S520: for each of the at least one type of first reference information, mapping the first reference information into a plurality of second characterization vectors.
Step S530: and fusing a plurality of first characterization vectors corresponding to the at least one type of first modality information and a plurality of second characterization vectors corresponding to the at least one type of first reference information to obtain fused vectors.
Step S540: and determining multi-modal information corresponding to at least one first modal information according to the fused vector.
Optionally, any of the first characterization vectors is used to characterize spatiotemporal information, content or type of any of the elements in the first modality information.
Illustratively, assume the first modality information is a question in the text modality: "In which dynasty, and by which poet, was 'Bring in the Wine' written?". Each word in the text is an element of the first modality information, and each element corresponds to a plurality of first characterization vectors; for example, the element "Wine" may correspond to spatio-temporal information, content, and type. The spatio-temporal information includes temporal and/or spatial information. For example, the temporal information of "Wine" may be its input time, from which the chronological order between sentences, and between words within each sentence, can be determined. The spatial information of "Wine" may be its spatial coordinates in the text. The content of "Wine" can also be represented by coordinates, e.g., (1,1) denotes the word "Wine". The type of "Wine" is the type of the first modality information it belongs to, e.g., the question type.
It should be noted that, for modality information of other modalities such as image, voice, and video, the corresponding first characterization vectors are similar to those of text modality information; details are not repeated in the embodiments of the present application.
Optionally, any one of the second characterization vectors is used to characterize spatio-temporal information, content or type of any one of the elements in the first reference information.
Illustratively, assume the reference information is reference information in the text modality whose content is a brief introduction to Li Bai. Each word in the text is an element of the reference information, and each element corresponds to a plurality of second characterization vectors; for example, the element "Li" may correspond to spatio-temporal information, content, and type. The spatio-temporal information includes temporal and/or spatial information. For example, the temporal information of "Li" may be its acquisition time, from which the chronological order between sentences, and between words within each sentence, can be determined. The spatial information of "Li" may be its spatial coordinates in the text. The content of "Li" can also be represented by coordinates, e.g., (1,2) denotes the word "Li". The type of "Li" is the type of the reference information it belongs to, e.g., the reference-information type.
It should be noted that, for reference information of other modalities such as image, voice, and video, the corresponding second characterization vectors are similar to those of text modality reference information; details are not repeated in this embodiment.
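To make this mapping concrete, the sublayer that produces these characterization vectors might be sketched as below, with one embedding table per characterization aspect (BERT-style lookups are assumed here; the vocabulary sizes and dimensions are illustrative, and none of this is fixed by the original disclosure):

```python
import torch
import torch.nn as nn

class RepresentationSublayer(nn.Module):
    """Maps each element to its characterization vectors:
    content, spatio-temporal (here simplified to position), and type."""

    def __init__(self, vocab_size=30000, max_positions=512, num_types=2, dim=768):
        super().__init__()
        self.content_emb = nn.Embedding(vocab_size, dim)      # what the element is
        self.position_emb = nn.Embedding(max_positions, dim)  # where/when it occurs
        self.type_emb = nn.Embedding(num_types, dim)          # e.g. question vs. reference

    def forward(self, content_ids, position_ids, type_ids):
        # One characterization vector per aspect, for every element.
        return (self.content_emb(content_ids),
                self.position_emb(position_ids),
                self.type_emb(type_ids))
```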
Optionally, after acquiring the plurality of first characterization vectors corresponding to the at least one piece of first modality information and the plurality of second characterization vectors corresponding to the at least one piece of first reference information, the terminal device may sum the first characterization vectors and the second characterization vectors, or compute inner products, and so on, to obtain the fused vector. As shown in fig. 6, the pre-trained language model includes: an input layer, a processing layer, and an output layer, where the processing layer includes a representation sublayer and a fusion sublayer. The input layer is used to acquire the at least one piece of first modality information and, optionally, the at least one piece of first reference information. The representation sublayer may be used to map each piece of first modality information into a plurality of first characterization vectors and each piece of first reference information into a plurality of second characterization vectors. The fusion sublayer is used to obtain the fused vector by summing the first characterization vectors and the second characterization vectors, by computing inner products, or the like. The output layer is used to obtain the multi-modal information according to the pre-trained language model and the fused vector.
Optionally, the first or second characterization vector corresponding to each element is summed with the first or second characterization vectors of the elements after and/or before that element, or an inner product is computed, and so on, to obtain an intermediate vector corresponding to the element; the intermediate vector of the element is then summed with the intermediate vectors of the elements after and/or before it, or an inner product is computed, and so on, until the fused vector of each element is obtained. As shown in fig. 7, the pre-trained language model includes: an input layer, a processing layer, and an output layer, where the processing layer includes a representation sublayer and a fusion sublayer. The input layer is used to acquire the at least one piece of first modality information and, optionally, the at least one piece of first reference information. The representation sublayer may be used to map each piece of first modality information into a plurality of first characterization vectors and each piece of first reference information into a plurality of second characterization vectors. The fusion sublayer performs the neighbor-wise fusion just described until the fused vector of each element is obtained. The output layer is used to obtain the multi-modal information according to the pre-trained language model and the fused vector.
Optionally, the feature information of the first or second characterization vector corresponding to each element is summed with the feature information of the first or second characterization vectors of the elements after and/or before that element, or an inner product is computed, to obtain an intermediate vector corresponding to the element; the intermediate vector of the element is then summed with the intermediate vectors of the elements after and/or before it, or an inner product is computed, and so on, until the fused vector of each element is obtained.
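A minimal sketch of the simplest of these fusion variants, summation over each element's characterization vectors, is given below; the neighbor-wise iterative variants would instead resemble a stack of self-attention layers (that reading is an assumption, not stated in the original):

```python
import torch.nn as nn

class FusionSublayer(nn.Module):
    """Fuses the characterization vectors of each element by element-wise
    summation; inner products over neighboring elements are the alternative
    mentioned in the text."""

    def forward(self, content_vecs, position_vecs, type_vecs):
        return content_vecs + position_vecs + type_vecs
```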
Description of the second alternative:
optionally, fig. 8 is a flowchart of a method for determining multimodal information according to another embodiment of the present application, as shown in fig. 8, the method includes:
step S810: for each of the at least one type of first modality information, mapping the first modality information into a plurality of first characterization vectors.
Step S820: and fusing a plurality of first characterization vectors corresponding to at least one type of first modality information to obtain fused vectors.
Step S830: and determining multi-modal information corresponding to at least one first modal information according to the fused vector.
Optionally, any of the first characterization vectors is used to characterize spatiotemporal information, content or type of any of the elements in the first modality information.
It should be noted that, for the explanation of the first characterization vector here, reference may be made to the explanation of the first characterization vector in the first alternative; details are not repeated in this embodiment.
Optionally, after acquiring the plurality of first characterization vectors corresponding to the at least one piece of first modality information, the terminal device may sum the first characterization vectors, or compute inner products, so as to obtain the fused vector. As shown in fig. 9, the pre-trained language model includes: an input layer, a processing layer, and an output layer, where the processing layer includes a representation sublayer and a fusion sublayer. The input layer is used to acquire the at least one piece of first modality information. The representation sublayer may be used to map each piece of first modality information into a plurality of first characterization vectors. The fusion sublayer is used to obtain the fused vector by summing the first characterization vectors, by computing inner products, or the like. The output layer is used to obtain the multi-modal information according to the pre-trained language model and the fused vector.
Optionally, the first characterization vector corresponding to each element is summed with the first characterization vectors of the elements after and/or before that element, or an inner product is computed, and so on, to obtain an intermediate vector corresponding to the element; the intermediate vector of the element is then summed with the intermediate vectors of the elements after and/or before it, or an inner product is computed, and so on, until the fused vector of each element is obtained. As shown in fig. 10, the pre-trained language model includes: an input layer, a processing layer, and an output layer, where the processing layer includes a representation sublayer and a fusion sublayer. The input layer is used to acquire the at least one piece of first modality information. The representation sublayer may be used to map each piece of first modality information into a plurality of first characterization vectors. The fusion sublayer performs the neighbor-wise fusion just described until the fused vector of each element is obtained. The output layer is used to obtain the multi-modal information according to the pre-trained language model and the fused vector.
Optionally, the feature information of the first characterization vector corresponding to each element is summed with the feature information of the first characterization vectors of the elements after and/or before that element, or an inner product is computed, to obtain an intermediate vector corresponding to the element; the intermediate vector of the element is then summed with the intermediate vectors of the elements after and/or before it, or an inner product is computed, and so on, until the fused vector of each element is obtained.
In summary, in the embodiments of the present application, multi-modal information corresponding to at least one piece of first modality information may be determined according to the at least one piece of first modality information. That is, the embodiments of the present application implement single-modal or multi-modal information input together with multi-modal information output. Compared with single-modal input and output, this can satisfy users' requirements for diverse information presentation and can also improve the applicability of the multi-modal information processing method. Further, the embodiments of the present application implement end-to-end input and output through the pre-trained language model: the pre-trained language model is a neural-network-based model, and multi-modal information can be output simply by inputting at least one piece of modality information into the model. This end-to-end approach can improve information processing efficiency. In addition, compared with information processing methods based on conventional machine learning models or hand-crafted rules, the neural-network-based information processing method can improve information processing efficiency.
The training process for the pre-trained language model will be explained as follows:
fig. 11 is a flowchart of a multimodal information processing method according to another embodiment of the present application, where an execution subject of the method may be part or all of a terminal device, where part of the terminal device may be a processor of the terminal device, and the execution subject of the method may also be the terminal device and a server, that is, a part of steps in fig. 11 are executed by the terminal device, and another part of steps are executed by the server, which is not limited in this application. As shown in fig. 11, the method includes:
step S1110: obtaining first training data, the first training data comprising: at least one second modality information.
Step S1120: the pre-trained language model is trained by the first training data.
Optionally, the first training data further comprises: at least one piece of second reference information corresponding to the at least one piece of second modality information.
It should be appreciated that the present embodiment is an unsupervised training process for a pre-trained language model.
Optionally, when the pre-trained language model is trained through the first training data, the first training data needs to be input into the pre-trained language model. The pre-trained language model performs vector representation and information fusion on the first training data to obtain a fused vector; multi-modal information corresponding to the first training data is obtained according to the fused vector; and the parameters of the pre-trained language model are adjusted according to that multi-modal information.
It should be understood that the vector representation and information fusion performed on the first training data by the pre-trained language model to obtain the fused vector may follow the processing procedure of the at least one piece of first modality information by the pre-trained language model, or the processing procedure of the at least one piece of first modality information and the at least one piece of first reference information; details are not repeated in this embodiment of the present application.
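The unsupervised objective is not specified in the text; purely for illustration, and assuming a self-supervised reconstruction loss, such a training loop might look as follows:

```python
def unsupervised_pretrain(model, first_training_data, optimizer, loss_fn, epochs=1):
    """Hypothetical unsupervised loop over samples of second modality information
    (optionally with second reference information)."""
    for _ in range(epochs):
        for sample in first_training_data:
            output = model(sample)          # vector representation + fusion inside
            loss = loss_fn(output, sample)  # assumed self-reconstruction target
            optimizer.zero_grad()
            loss.backward()                 # adjust the model's parameters
            optimizer.step()
```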
In summary, in the embodiments of the present application, the terminal device may train the pre-trained language model in an unsupervised manner, which improves the precision of the pre-trained language model and thereby yields more accurate multi-modal information.
Fig. 12 is a flowchart of a multimodal information processing method according to still another embodiment of the present application, where an execution subject of the method may be part or all of a terminal device, where part of the terminal device may be a processor of the terminal device, and the execution subject of the method may also be the terminal device and a server, that is, a part of steps in fig. 12 are executed by the terminal device, and another part of steps are executed by the server, which is not limited in this application. As shown in fig. 12, the method includes:
step 1210: obtaining second training data, the second training data comprising: the at least one third modality information and the at least one modality information corresponding to the at least one third modality information.
Step S1220: the pre-trained language model is trained by the second training data.
Taking the application to the intelligent question-answering scenario as an example, the third modality information may be a third modality question, and the corresponding at least one modality information may be at least one modality answer.
Optionally, the second training data further comprises: at least one piece of third reference information corresponding to the at least one piece of third modality information.
It should be appreciated that the present embodiment is a supervised training process for pre-trained language models.
Optionally, when the pre-trained language model is trained through the second training data, the second training data needs to be input into the pre-trained language model. The pre-trained language model performs vector representation and information fusion on the second training data to obtain a fused vector; multi-modal information corresponding to the second training data is obtained according to the fused vector; and the parameters of the pre-trained language model are adjusted according to that multi-modal information.
It should be understood that the vector representation and information fusion performed on the second training data by the pre-trained language model to obtain a fused vector may follow the processing procedure of the at least one piece of first modality information by the pre-trained language model, or the processing procedure of the at least one piece of first modality information and the at least one piece of reference information; details are not repeated in this embodiment of the present application.
It should be noted that the supervised training process corresponding to fig. 12 and the unsupervised training process corresponding to fig. 11 may be executed in combination or independently, and the embodiment of the present application does not limit this.
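A matching sketch for this supervised stage, assuming a standard loss against the labeled multi-modal answer (the choice of loss is an assumption) and reusing the hypothetical LabeledSample fields sketched earlier, could be:

```python
def supervised_train(model, second_training_data, optimizer, loss_fn, epochs=1):
    """Hypothetical supervised loop over <question(s), [references], answer(s)> samples."""
    for _ in range(epochs):
        for sample in second_training_data:
            output = model(sample.questions, sample.references)  # triplet case
            loss = loss_fn(output, sample.answers)  # labeled multi-modal answer
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```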
In summary, in the embodiments of the present application, the terminal device may train the pre-trained language model in a supervised manner, which improves the precision of the pre-trained language model and thereby yields more accurate multi-modal information.
Optionally, the terminal device may further process the at least one third reference information according to the at least one third modality information, so as to obtain at least one modality information corresponding to the at least one third modality information.
Optionally, the processing comprises: at least one of extraction, rewriting, and combining.
It should be noted that the order and number of the extraction, rewriting, and combining operations are not limited in the embodiments of the present application. For example, content may first be extracted from the at least one piece of third reference information, and the result then rewritten and combined; or the third reference information may first be rewritten, and the result then combined and extracted. This is illustrated by the following specific examples:
example one: fig. 13 is a flowchart of a method for processing at least one third reference information according to an embodiment of the present application, and as shown in fig. 13, the method includes the following steps:
step 1310: and extracting the related content of the at least one third modality information from the at least one third reference information.
Step S1320: and obtaining at least one type of modal information corresponding to the at least one type of third modal information according to the related content.
Illustratively, assuming the third modality information is "In which dynasty, and by which poet, was 'Bring in the Wine' written?", the related content extracted from the at least one piece of third reference information may be the whole text or a fragment of "Bring in the Wine", or a brief introduction to Li Bai, and so on.
Optionally, the related content may be determined as the at least one piece of modality information corresponding to the at least one piece of third modality information. Alternatively, the related content is rewritten according to the at least one piece of third modality information to obtain the at least one piece of modality information corresponding to the at least one piece of third modality information. Alternatively, the related content is rewritten according to the at least one piece of third modality information, and the rewritten content is combined to obtain the at least one piece of modality information corresponding to the at least one piece of third modality information.
Illustratively, assuming the third modality information is "In which dynasty, and by which poet, was 'Bring in the Wine' written?", the whole text or a fragment of "Bring in the Wine" is taken as the at least one piece of modality information corresponding to the at least one piece of third modality information.
Illustratively, assuming the third modality information is "In which dynasty, and by which poet, was 'Bring in the Wine' written?", the terminal device rewrites the introduction to Li Bai, such as: "Li Bai (701-762), courtesy name Taibai, an outstanding poet of the Tang Dynasty and the greatest romantic poet in the history of Chinese literature after Qu Yuan, praised as the 'Immortal of Poetry'." After partial rewriting, the following content is obtained: "Li Bai, courtesy name Taibai, also known as the Householder of the Green Lotus and the 'Banished Immortal', was the most prominent romantic poet of the Tang Dynasty and is praised by later generations as the 'Immortal of Poetry'", and this content serves as the at least one piece of modality information corresponding to the at least one piece of third modality information.
Illustratively, assuming the third modality information is "In which dynasty, and by which poet, was 'Bring in the Wine' written?", the whole text or a fragment of "Bring in the Wine" and the rewritten content "Li Bai, courtesy name Taibai, also known as the Householder of the Green Lotus and the 'Banished Immortal', was the most prominent romantic poet of the Tang Dynasty and is praised by later generations as the 'Immortal of Poetry'" are combined to serve as the at least one piece of modality information corresponding to the at least one piece of third modality information.
It should be noted that combination in the embodiments of the present application may be a concatenation of pieces of information; for example, combining information A and information B yields <information A, information B>.
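Stringing the operations of this example together (extract, then optionally rewrite, then optionally combine), a hypothetical pipeline could look like this; the three operations are passed in as callables because the embodiment does not fix how they are implemented:

```python
def build_answer(third_modality_info, third_references, extract, rewrite, combine):
    """Hypothetical extraction -> rewriting -> combination pipeline."""
    # Extract content related to the third modality information, e.g. the whole
    # text or a fragment of "Bring in the Wine", or an introduction to Li Bai.
    related = [extract(ref, third_modality_info) for ref in third_references]
    # Optionally rewrite each extracted piece, conditioned on the question.
    rewritten = [rewrite(piece, third_modality_info) for piece in related]
    # Combine the pieces: combining A and B yields <A, B>.
    return combine(rewritten)
```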
Example two: fig. 14 is a flowchart of a method for processing at least one third reference information according to another embodiment of the present application, and as shown in fig. 14, the method includes the following steps:
step S1401: and rewriting the at least one third reference information to obtain rewritten content.
Step S1402: and obtaining at least one type of modal information corresponding to the at least one type of third modal information according to the rewritten content.
Alternatively, the rewritten content may be determined as at least one modality information corresponding to the at least one third modality information. Or combining the rewritten contents according to the at least one third modality information to obtain at least one modality information corresponding to the at least one third modality information.
For example, assuming the third modality information is "In which dynasty, and by which poet, was 'Bring in the Wine' written?", the content obtained after rewriting the at least one piece of third reference information may be: "Li Bai, courtesy name Taibai, also known as the Householder of the Green Lotus and the 'Banished Immortal', was the most prominent romantic poet of the Tang Dynasty and is praised by later generations as the 'Immortal of Poetry'." The rewritten content may serve as the at least one piece of modality information corresponding to the at least one piece of third modality information.
For example, suppose one piece of rewritten content is "Li Bai, courtesy name Taibai, also known as the Householder of the Green Lotus and the 'Banished Immortal', was the most prominent romantic poet of the Tang Dynasty and is praised by later generations as the 'Immortal of Poetry'", and another piece of rewritten content is "Li Bai and Du Fu are jointly known as 'Li-Du'". The two pieces of rewritten content can be combined to obtain the at least one piece of modality information corresponding to the at least one piece of third modality information.
In summary, in the supervised training process of the embodiments of the present application, the at least one piece of third reference information may be extracted from, rewritten, combined, and so on, to obtain the at least one piece of modality information corresponding to the at least one piece of third modality information. That is, the at least one piece of modality information is obtained through extraction, rewriting, combination, and the like, rather than through manual work or a machine learning model, so information acquisition efficiency and information precision can be improved.
Fig. 15 is a schematic diagram of a multi-modal information processing apparatus 1500 according to an embodiment of the present application, as shown in fig. 15, the apparatus includes: a first acquisition module 1510, a determination module 1520, and an output module 1530. The first obtaining module 1510 is configured to obtain at least one first modality information. The determining module 1520 is configured to determine multi-modal information corresponding to the at least one first modality information according to the at least one first modality information. The output module 1530 is used to output multimodal information.
Optionally, the apparatus further comprises: a second obtaining module 1540, configured to obtain at least one first reference information corresponding to the at least one first modality information. Accordingly, the determining module 1520 is specifically configured to: and determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information and the at least one first reference information.
Optionally, the determining module 1520 is specifically configured to: for each type of the at least one type of first modality information, mapping the first modality information into a plurality of first characterization vectors, wherein any one of the first characterization vectors is used for characterizing spatio-temporal information, content or type of any one of elements in the first modality information. For each kind of first reference information in the at least one kind of first reference information, mapping the first reference information into a plurality of second characterization vectors, any one of the second characterization vectors being used for characterizing spatio-temporal information, content or type of any one of elements in the first reference information. And fusing a plurality of first characterization vectors corresponding to the at least one type of first modality information and a plurality of second characterization vectors corresponding to the at least one type of first reference information to obtain fused vectors. And determining multi-modal information corresponding to at least one first modal information according to the fused vector.
Optionally, the determining module 1520 is specifically configured to: for each type of the at least one type of first modality information, mapping the first modality information into a plurality of first characterization vectors, wherein any one of the first characterization vectors is used for characterizing spatio-temporal information, content or type of any one of elements in the first modality information. And fusing a plurality of first characterization vectors corresponding to at least one type of first modality information to obtain fused vectors. And determining multi-modal information corresponding to at least one first modal information according to the fused vector.
Optionally, one modality is a text modality, a speech modality, an image modality, or a video modality.
Optionally, the first-modality information is a first-modality question, and the multi-modality information is a multi-modality answer.
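Purely as an illustration of how the three modules of apparatus 1500 could be composed (the wiring below is an assumption, not part of the original disclosure):

```python
class MultimodalInfoProcessingApparatus:
    """Hypothetical composition of the modules in fig. 15."""

    def __init__(self, first_acquisition_module, determination_module, output_module):
        self.first_acquisition_module = first_acquisition_module  # 1510
        self.determination_module = determination_module          # 1520
        self.output_module = output_module                        # 1530

    def run(self):
        infos = self.first_acquisition_module()        # acquire first modality info
        multimodal = self.determination_module(infos)  # determine multi-modal info
        self.output_module(multimodal)                 # output it
```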
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 1500 shown in fig. 15 may perform the method embodiments corresponding to fig. 4, fig. 5, and fig. 8, and the foregoing and other operations and/or functions of the respective modules in the apparatus 1500 are respectively for implementing the corresponding flows in the respective methods in fig. 4, fig. 5, and fig. 8, and are not repeated herein for brevity.
The apparatus 1500 of an embodiment of the present application is described above in connection with the figures from the perspective of a functional block. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 16 is a schematic diagram of a multi-modal information processing apparatus 1600 according to an embodiment of the present application, as shown in fig. 16, the apparatus includes: a first obtaining module 1610 and a first training module 1620, wherein the first obtaining module 1610 is configured to obtain first training data, and the first training data includes: at least one second modality information. The first training module 1620 is configured to train the pre-trained language model through the first training data. The pre-training language model is used for determining multi-modal information corresponding to the at least one first modal information according to the at least one first modal information.
Optionally, the first training data further comprises: and the at least one second modality information corresponds to the at least one second reference information.
Optionally, the apparatus further comprises: a second obtaining module 1630 and a second training module 1640, where the second obtaining module 1630 is configured to obtain second training data, and the second training data includes: the at least one third modality information and the at least one modality information corresponding to the at least one third modality information. The second training module 1640 is for training the pre-trained language model with second training data.
Optionally, the second training data further comprises: and the at least one third mode information corresponds to the at least one third reference information.
Optionally, the apparatus further comprises: a processing module 1650, configured to process the at least one third reference information according to the at least one third modality information, so as to obtain at least one modality information corresponding to the at least one third modality information.
Optionally, the processing module 1650 is specifically configured to: and extracting the related content of the at least one third modality information from the at least one third reference information. And obtaining at least one type of modal information corresponding to the at least one type of third modal information according to the related content.
Optionally, the processing module 1650 is specifically configured to: and determining the related content as at least one type of modality information corresponding to the at least one type of third modality information. Or, according to the at least one third modality information, rewriting the related content to obtain at least one modality information corresponding to the at least one third modality information. Or, according to the at least one third modality information, rewriting the related content, and combining the rewritten content to obtain at least one modality information corresponding to the at least one third modality information.
Optionally, the processing module 1650 is specifically configured to: and rewriting the at least one third reference information to obtain rewritten content. And obtaining at least one type of modal information corresponding to the at least one type of third modal information according to the rewritten content.
Optionally, the processing module 1650 is specifically configured to: and determining the rewritten content as at least one type of modality information corresponding to the at least one type of third modality information. Or combining the rewritten contents according to the at least one third modality information to obtain at least one modality information corresponding to the at least one third modality information.
Optionally, one modality of the at least one piece of second modality information is a text modality, a voice modality, an image modality, or a video modality.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 1600 shown in fig. 16 may perform the method embodiments corresponding to fig. 11 to fig. 14, and the foregoing and other operations and/or functions of each module in the apparatus 1600 are respectively for implementing corresponding flows in each method in fig. 11 to fig. 14, and are not described herein again for brevity.
The apparatus 1600 of the embodiments of the present application is described above, in connection with the figures, from the perspective of functional modules. It should be understood that the functional modules may be implemented in hardware, in software instructions, or in a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be completed by integrated logic circuits of hardware in a processor and/or by instructions in the form of software; the steps of the methods disclosed in the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. Alternatively, the software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware.
Fig. 17 is a schematic block diagram of an electronic device 1700 provided in an embodiment of the present application.
As shown in fig. 17, the electronic device 1700 may include:
a memory 1710 and a processor 1720, the memory 1710 being configured to store a computer program and to transfer the program code to the processor 1720. In other words, the processor 1720 may invoke and execute a computer program from the memory 1710 to implement the method in the embodiments of the present application.
For example, the processor 1720 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the present application, the processor 1720 may include, but is not limited to:
general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 1710 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program can be partitioned into one or more modules that are stored in the memory 1710 and executed by the processor 1720 to perform the methods provided by the embodiments of the present application. Each module may be a series of computer program instruction segments capable of performing particular functions, the instruction segments describing how the computer program executes in the electronic device 1700.
As shown in fig. 17, the electronic device 1700 may further include:
a transceiver 1730, the transceiver 1730 being connectable to the processor 1720 or the memory 1710.
The processor 1720 may control the transceiver 1730 to communicate with other devices; specifically, it may transmit information or data to other devices, or receive information or data transmitted by other devices. The transceiver 1730 may include a transmitter and a receiver, and may further include one or more antennas.
It should be understood that the various components within the electronic device 1700 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the methods of the above method embodiments. The present application further provides a computer program product containing instructions which, when executed by a computer, cause the computer to perform the methods of the above method embodiments.
When the above embodiments are implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., by infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is merely a logical division, and other divisions are possible in practice; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and all the changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A multimodal information processing method, comprising:
acquiring at least one first modality information;
determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information;
and outputting the multi-modal information.
2. The method according to claim 1, wherein before determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information, the method further comprises:
acquiring at least one first reference information corresponding to the at least one first modality information;
correspondingly, the determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information includes:
determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information and the at least one first reference information.
3. The method according to claim 2, wherein determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information and the at least one first reference information comprises:
for each of the at least one kind of first modality information, mapping the first modality information into a plurality of first characterization vectors, any of the first characterization vectors being used for characterizing spatio-temporal information, content, or type of any element in the first modality information;
for each kind of first reference information in the at least one kind of first reference information, mapping the first reference information into a plurality of second characterization vectors, any one of the second characterization vectors being used for characterizing spatio-temporal information, content or type of any one element in the first reference information;
fusing a plurality of first characterization vectors corresponding to the at least one first modality information and a plurality of second characterization vectors corresponding to the at least one first reference information to obtain fused vectors;
and determining multi-modal information corresponding to the at least one first modal information according to the fused vector.
4. The method according to claim 1, wherein the determining multi-modal information corresponding to the at least one first modality information according to the at least one first modality information comprises:
for each of the at least one kind of first modality information, mapping the first modality information into a plurality of first characterization vectors, any of the first characterization vectors being used for characterizing spatio-temporal information, content, or type of any element in the first modality information;
fusing a plurality of first characterization vectors corresponding to the at least one type of first modality information to obtain fused vectors;
and determining multi-modal information corresponding to the at least one first modal information according to the fused vector.
5. The method of any of claims 1-4, wherein one modality of the at least one first modality information is a text modality, a voice modality, an image modality, or a video modality.
6. The method according to any one of claims 1-4, wherein the first modality information is a first-modality question and the multi-modal information is a multi-modal answer.
7. A multimodal information processing method, comprising:
obtaining first training data, the first training data comprising: at least one second modality information;
training a pre-training language model through the first training data;
the pre-training language model is used for determining multi-modal information corresponding to at least one first modal information according to the at least one first modal information.
8. The method of claim 7, wherein the first training data further comprises: at least one second reference information corresponding to the at least one second modality information.
9. The method of claim 7, further comprising:
obtaining second training data, the second training data comprising: at least one third modality information and at least one modality information corresponding to the at least one third modality information;
training the pre-trained language model with the second training data.
10. The method of claim 9, wherein the second training data further comprises: at least one third reference information corresponding to the at least one third modality information.
11. The method of claim 10, further comprising:
processing the at least one third reference information according to the at least one third modality information to obtain at least one modality information corresponding to the at least one third modality information.
12. The method according to claim 11, wherein the processing the at least one third reference information to obtain at least one modality information corresponding to the at least one third modality information comprises:
extracting the related content of the at least one third modality information from the at least one third reference information;
and obtaining at least one type of modal information corresponding to the at least one type of third modal information according to the related content.
13. The method according to claim 12, wherein obtaining at least one modality information corresponding to the at least one third modality information according to the related content comprises:
determining the related content as at least one type of modality information corresponding to the at least one type of third modality information; or,
rewriting the related content according to the at least one third modality information to obtain at least one modality information corresponding to the at least one third modality information; or,
rewriting the related content according to the at least one third modality information and combining the rewritten content to obtain at least one modality information corresponding to the at least one third modality information.
14. The method according to claim 11, wherein the processing the at least one third reference information to obtain at least one modality information corresponding to the at least one third modality information comprises:
rewriting the at least one third reference information to obtain rewritten content;
and obtaining at least one type of modal information corresponding to the at least one type of third modal information according to the rewritten content.
15. The method according to claim 14, wherein obtaining at least one modality information corresponding to the at least one third modality information according to the rewritten content comprises:
determining the rewritten content as at least one type of modality information corresponding to the at least one type of third modality information; or,
combining the rewritten content according to the at least one third modality information to obtain at least one modality information corresponding to the at least one third modality information.
16. The method according to any one of claims 7-15, wherein one modality of the at least one second modality information is a text modality, a speech modality, an image modality, or a video modality.
17. A multimodal information processing apparatus, comprising:
the first acquisition module is used for acquiring at least one type of first modality information;
the determining module is used for determining multi-modal information corresponding to the at least one first modal information according to the at least one first modal information;
and the output module is used for outputting the multi-mode information.
18. A multimodal information processing apparatus, comprising:
a first obtaining module, configured to obtain first training data, where the first training data includes: at least one second modality information;
a first training module for training a pre-training language model by the first training data;
the pre-training language model is used for determining multi-modal information corresponding to at least one first modal information according to the at least one first modal information.
19. An electronic device, comprising:
a processor and a memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any one of claims 1 to 16.
20. A computer-readable storage medium for storing a computer program which causes a computer to perform the method of any one of claims 1 to 16.
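Purely as an editorial illustration outside the claim language, the mapping and fusion recited in claim 3 — characterization vectors that encode the content, the modality type, and the (spatio-)temporal position of each element, fused across the first modality information and the first reference information — might be sketched as follows. The embedding and fusion modules here are hypothetical choices; any cross-modal fusion network could stand in for the single attention layer, and the multi-modal output of the final step would be decoded from the fused vectors.

import torch
import torch.nn as nn

class CharacterizationEmbedder(nn.Module):
    # Maps each element to a characterization vector combining its
    # content, its modality type, and its position in the sequence.
    def __init__(self, vocab=1000, n_types=4, max_len=128, dim=64):
        super().__init__()
        self.content = nn.Embedding(vocab, dim)
        self.type_emb = nn.Embedding(n_types, dim)  # text/voice/image/video
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, ids, type_id):
        pos = torch.arange(ids.size(1))
        return (self.content(ids)
                + self.type_emb(torch.full_like(ids, type_id))
                + self.pos(pos))

embedder = CharacterizationEmbedder()
first_modality_ids = torch.randint(0, 1000, (1, 10))  # e.g. a tokenized question
reference_ids = torch.randint(0, 1000, (1, 20))       # e.g. tokenized reference text

first_vecs = embedder(first_modality_ids, type_id=0)  # first characterization vectors
second_vecs = embedder(reference_ids, type_id=0)      # second characterization vectors

# Fusion: concatenate along the sequence axis and apply self-attention.
fusion = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
fused = fusion(torch.cat([first_vecs, second_vecs], dim=1))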
CN202010928220.6A 2020-09-07 2020-09-07 Multi-modal information processing method, device, equipment and storage medium Pending CN112148836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010928220.6A CN112148836A (en) 2020-09-07 2020-09-07 Multi-modal information processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112148836A true CN112148836A (en) 2020-12-29

Family

ID=73890751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010928220.6A Pending CN112148836A (en) 2020-09-07 2020-09-07 Multi-modal information processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112148836A (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120290526A1 (en) * 2011-05-11 2012-11-15 Tata Consultancy Services Limited Method and System for Association and Decision Fusion of Multimodal Inputs
CN106682050A (en) * 2015-11-24 2017-05-17 北京中科汇联科技股份有限公司 System and method capable of achieving intelligent questioning and answering
CN105574133A (en) * 2015-12-15 2016-05-11 苏州贝多环保技术有限公司 Multi-mode intelligent question answering system and method
CN105894873A (en) * 2016-06-01 2016-08-24 北京光年无限科技有限公司 Child teaching method and device orienting to intelligent robot
CN108985358A (en) * 2018-06-29 2018-12-11 北京百度网讯科技有限公司 Emotion identification method, apparatus, equipment and storage medium
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110196930A (en) * 2019-05-22 2019-09-03 山东大学 A kind of multi-modal customer service automatic reply method and system
CN110465947A (en) * 2019-08-20 2019-11-19 苏州博众机器人有限公司 Multi-modal fusion man-machine interaction method, device, storage medium, terminal and system
CN111539292A (en) * 2020-04-17 2020-08-14 中山大学 Action decision model and method for presenting scene question-answering task
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127708A (en) * 2021-04-20 2021-07-16 科大讯飞股份有限公司 Information interaction method, device, equipment and storage medium
WO2022222286A1 (en) * 2021-04-20 2022-10-27 科大讯飞股份有限公司 Information interaction method, apparatus and device and storage medium
CN117332860A (en) * 2023-12-01 2024-01-02 北京红棉小冰科技有限公司 Text instruction data generation method and device, electronic equipment and storage medium
CN117332860B (en) * 2023-12-01 2024-03-19 北京红棉小冰科技有限公司 Text instruction data generation method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination