CN113177115B - Conversation content processing method and device and related equipment - Google Patents


Info

Publication number: CN113177115B
Application number: CN202110731986.XA
Authority: CN (China)
Prior art keywords: answer, text, vectors, picture, target
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113177115A
Inventors: 王一秋, 曾志贤
Assignees: China Mobile Communications Group Co Ltd, China Mobile Shanghai ICT Co Ltd, CM Intelligent Mobility Network Co Ltd
Application filed by China Mobile Communications Group Co Ltd, China Mobile Shanghai ICT Co Ltd and CM Intelligent Mobility Network Co Ltd
Priority to CN202110731986.XA
Publication of CN113177115A (application publication)
Application granted; publication of CN113177115B (grant publication)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information

Abstract

The application provides a dialogue content processing method, a device and related equipment, and relates to the field of neural network models. The method comprises the following steps: retrieving first dialogue content to obtain a first answer obtained by combining M target text vectors and N target picture vectors, wherein the M target text vectors correspond to texts in the first dialogue content, the N target picture vectors correspond to pictures in the first dialogue content, and the first dialogue content comprises question-type content; coding the first dialogue content to obtain a second answer obtained by combining L text coding vectors and Y picture coding vectors; and inputting the first dialogue content, the first answer and the second answer into a target network model and outputting a target answer matched with the first dialogue content, wherein the target answer is one of the first answer and the second answer. Through the method and the device, the problem of low answer accuracy in the prior art based on the manual template mode is solved.

Description

Conversation content processing method and device and related equipment
Technical Field
The present application relates to the field of neural network models, and in particular, to a method and an apparatus for processing dialog contents, and a related device.
Background
In the existing man-machine dialogue technical scheme, a matching question template is found in a template library according to the sentence input by the user, and an answer is then generated according to the corresponding answer template. However, the question templates are generated manually, and the limitations of manual construction result in low answer accuracy.
Disclosure of Invention
The embodiments of the present application provide a method and a device for processing dialogue content and related equipment, so as to solve the problem of low answer accuracy in the existing manual-template-based approach.
In order to solve the above technical problem, the present application is implemented as follows:
in a first aspect, the present application provides a method for processing dialogue content, including: retrieving first dialogue content to obtain a first answer obtained by combining M target text vectors and N target picture vectors, wherein the M target text vectors correspond to texts in the first dialogue content, and the N target picture vectors correspond to pictures in the first dialogue content; the first dialogue content comprises question-type content; coding the first dialogue content to obtain a second answer obtained by combining L text encoding vectors and Y picture encoding vectors, wherein the L text encoding vectors are encoded from text in the first dialogue content, and the Y picture encoding vectors are encoded from pictures in the first dialogue content; and inputting the first dialogue content, the first answer and the second answer into a target network model, and outputting a target answer matched with the first dialogue content, wherein the target answer is one of the first answer and the second answer, and M, N, L, and Y are positive integers.
In a second aspect, the present application provides a device for processing dialogue content, including: a first processing module, configured to retrieve first dialogue content to obtain a first answer obtained by combining M target text vectors and N target picture vectors, where the M target text vectors correspond to texts in the first dialogue content, and the N target picture vectors correspond to pictures in the first dialogue content; the first dialogue content comprises question-type content; a second processing module, configured to encode the first dialogue content to obtain a second answer obtained by combining L text encoding vectors and Y picture encoding vectors, wherein the L text encoding vectors are encoded from text in the first dialogue content, and the Y picture encoding vectors are encoded from pictures in the first dialogue content; and a third processing module, configured to input the first dialogue content, the first answer and the second answer into a target network model, and output a target answer matched with the first dialogue content, where the target answer is one of the first answer and the second answer, and M, N, L, and Y are positive integers.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory and a program stored on the memory and executable on the processor, which program, when executed by the processor, carries out the method steps of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method steps of the first aspect.
In the application, M target text vectors and N target picture vectors corresponding to the text and pictures in the first dialogue content can be obtained by retrieval, and a first answer is formed by combining them; the first dialogue content can also be encoded to obtain a second answer formed by combining L text coding vectors and Y picture coding vectors. In other words, the text and the pictures in the first dialogue content are fused in two ways, and multiple combinations of the text and the pictures are taken into account, which improves the accuracy of answer generation; the target answer obtained from the first answer and the second answer is therefore more accurate, which solves the prior-art problem of low answer accuracy with the manual-template-based approach.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without inventive effort.
Fig. 1 is a flowchart of a dialog content processing method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of an image-text encoding model provided in an embodiment of the present application for encoding dialog content;
Fig. 3 is a schematic diagram illustrating a process of creating knowledge bases of pictures and texts according to an embodiment of the present application;
Fig. 4 is a schematic diagram illustrating a process of creating a text vector library according to an embodiment of the present application;
Fig. 5 is a schematic diagram illustrating a process of creating a picture vector library according to an embodiment of the present application;
Fig. 6 is a flowchart illustrating a multi-round dialog answer acquisition process according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the image-text matching model provided in an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a dialog content processing apparatus provided in an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a method for processing dialog content, which is applied to a terminal device. Referring to fig. 1, which is a flowchart of the dialog content processing method provided in the embodiment of the present application, the method includes the following steps:
Step 102, retrieving the first dialogue content to obtain a first answer obtained by combining M target text vectors and N target picture vectors, wherein the M target text vectors correspond to texts in the first dialogue content, and the N target picture vectors correspond to pictures in the first dialogue content; the first dialogue content comprises question-type content;
It should be noted that the question-type content in the embodiments of the present application includes at least one of the following: question content and answer content.
Step 104, coding the first dialogue content to obtain a second answer obtained by combining L text coding vectors and Y picture coding vectors; the L text coding vectors are obtained by coding texts in the first dialogue content, and the Y picture coding vectors are obtained by coding pictures in the first dialogue content;
Step 106, inputting the first dialogue content, the first answer and the second answer into the target network model and outputting a target answer matched with the first dialogue content, wherein the target answer is one of the first answer and the second answer, and M, N, L, and Y are positive integers.
Through the above steps 102 to 106, M target text vectors and N target picture vectors corresponding to the text and pictures in the first dialogue content are obtained by retrieval, and a first answer is formed by combining them; the first dialogue content is also encoded to obtain a second answer formed by combining L text coding vectors and Y picture coding vectors. In other words, the text and the pictures in the first dialogue content are fused in two ways, and multiple combinations of the text and the pictures are taken into account, which improves the accuracy of answer generation; the target answer selected from the first answer and the second answer is therefore more accurate, which solves the prior-art problem of low answer accuracy with the manual-template-based approach.
It should be noted that the values of M, N, L and Y in this application are determined according to the first dialogue content. For example, if the first dialogue content includes 8 phrases and 2 pictures, the values of M and L are less than or equal to 8, and the values of N and Y are less than or equal to 2; for instance, M may take the value 8 and N the value 2. Since the text encoding does not have to encode every phrase individually, the value of L may be 6 and the value of Y may be 2. The above is merely an example, and the specific values of M, N, L and Y need to be determined according to the actual situation.
It should be noted that the execution subject in the embodiment of the present application may be a terminal or other device, and the terminal or other device has a target neural network model built therein.
In an optional implementation manner of the embodiment of the present application, the manner of retrieving the first dialogue content to obtain the first answer obtained by combining the M target text vectors and the N target picture vectors, referred to in step 102 of the present application, further includes the following steps:
step 102-11, identifying M text vectors and N picture vectors from the first dialogue content;
step 102-12, determining first service types corresponding to M text vectors and determining second service types corresponding to N picture vectors;
step 102-13, determining a first index according to a first service type and a first mapping relation, and determining a second index according to a second service type and a second mapping relation, wherein the first mapping relation is used for indicating the relation between the service type of a text vector and the index, and the second mapping relation is used for indicating the relation between the service type of a picture vector and the index;
step 102-14, determining M target text vectors from a text vector library according to the first index, and determining N target picture vectors from a picture vector library according to the second index, wherein the similarity between the text vector and each target text vector is not lower than a first threshold, and the similarity between the picture vector and each target picture vector is not lower than a second threshold;
step 102-15, a first answer is obtained based on a combination of the M target text vectors and the N target picture vectors.
For the text vectors involved in the above step 102-11, in a specific application scenario, a p-means sentence vector representation may be adopted, where the p-means is defined as:

$$s_p = \left( \frac{1}{n} \sum_{i=1}^{n} w_i^{\,p} \right)^{1/p}$$ (formula 1)

where $w_i$ is the word vector of the i-th word of a sentence assumed to contain n words. When p equals 1, this is the averaging operation; when p equals positive infinity, it is the maximum (max) operation; and when p equals negative infinity, it is the minimum (min) operation. The text vector of a question can therefore be expressed as:

$$v = s_{1} \oplus s_{+\infty} \oplus s_{-\infty}$$ (formula 2)

where $\oplus$ is a connector representing vector concatenation, and the p values are taken as 1, positive infinity and negative infinity, respectively. The text vector index may be created based on an HNSW vector index library.
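By way of illustration only, the following Python sketch shows how formulas 1 and 2 may be realised; the function names and the 300-dimensional word2vec vectors are assumptions, not part of the claimed method.

```python
# Illustrative sketch of formulas 1 and 2 (p-means sentence vector).
import numpy as np

def p_means(word_vectors: np.ndarray, p: float) -> np.ndarray:
    """Power-mean pooling over the word vectors of one sentence (shape: n_words x dim)."""
    if p == 1:
        return word_vectors.mean(axis=0)            # arithmetic mean
    if p == float("inf"):
        return word_vectors.max(axis=0)             # limit p -> +infinity
    if p == float("-inf"):
        return word_vectors.min(axis=0)             # limit p -> -infinity
    return np.power(np.power(word_vectors, p).mean(axis=0), 1.0 / p)  # general p (non-negative inputs)

def question_text_vector(word_vectors: np.ndarray) -> np.ndarray:
    """Formula 2: concatenate the p-means for p = 1, +infinity, -infinity."""
    return np.concatenate([p_means(word_vectors, p)
                           for p in (1, float("inf"), float("-inf"))])

# Example: 8 words with 300-dimensional word vectors give a 900-dimensional text vector.
words = np.random.rand(8, 300).astype(np.float32)
print(question_text_vector(words).shape)            # (900,)
```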
In addition, in the embodiment of the present application, features of the question picture may be extracted through a pre-trained residual network (ResNet18) model to generate the picture vector, and the picture vector index may likewise be created based on an HNSW vector index library.
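A minimal sketch of this step is given below, assuming the torchvision ResNet18 backbone and the hnswlib package as the HNSW vector index library; the file path and the index parameters are illustrative only.

```python
# Hedged sketch: a pre-trained ResNet18 produces a 512-dimensional picture vector,
# which is then indexed with HNSW via hnswlib.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
import hnswlib

resnet = models.resnet18(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True
resnet.fc = torch.nn.Identity()                     # keep the pooled 512-d feature as the picture vector
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def picture_vector(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(img).squeeze(0)                # shape (512,)

# Build an HNSW index over picture vectors (one index per service scene).
index = hnswlib.Index(space="cosine", dim=512)
index.init_index(max_elements=100_000, ef_construction=200, M=16)
vec = picture_vector("question_picture.jpg").numpy()
index.add_items(vec, ids=[0])
labels, distances = index.knn_query(vec, k=1)
```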
In addition, after the text vector and the picture vector of the first dialogue content are determined, the service types of the text vector and the picture vector need to be determined, and since the same question can have different answers in different service scenes, the service type is determined first and then the answer is determined, so that the determined answer is more accurate. In the application, a mapping relation exists between the service type and the index, namely after the service type is determined, the corresponding picture vector library or text vector library is determined in a targeted manner through the index.
It should be noted that the similarity threshold in the embodiment of the present application may be set according to the specific situation: if more candidate answers are desired, the threshold may be set lower; if fewer candidate answers are desired, it may be set higher.
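The following is an illustrative sketch of steps 102-11 to 102-15, under the assumption that each service type owns one hnswlib index (the first and second mapping relations then reduce to plain dictionaries) and that cosine similarity is computed as 1 - distance; all names are assumptions, not the patent's API.

```python
# Service type -> index -> nearest neighbours above a similarity threshold.
import hnswlib
import numpy as np

def retrieve_target_ids(vector: np.ndarray, service_type: str,
                        index_by_service: dict[str, hnswlib.Index],
                        top_k: int, min_similarity: float) -> list[int]:
    index = index_by_service[service_type]                  # mapping relation: service type -> index
    labels, distances = index.knn_query(vector, k=top_k)    # nearest neighbours in the vector library
    return [int(label) for label, dist in zip(labels[0], distances[0])
            if 1.0 - dist >= min_similarity]                # keep only sufficiently similar vectors
```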
In another optional implementation manner of the embodiment of the present application, the manner of encoding the first dialogue content to obtain the second answer obtained by combining the L text encoding vectors and the Y picture encoding vectors, referred to in step 104 of the present application, may further include:
step 104-11, encoding the text in the first dialogue content to obtain L text encoding vectors;
step 104-12, coding the pictures in the first dialogue content to obtain Y picture coding vectors;
and step 104-13, fusing the L text coding vectors and the Y picture coding vectors to obtain a second answer.
For the above steps 104-11 to 104-13, in a specific application scenario, as shown in fig. 2, the first dialogue content is encoded based on an image-text encoding model. Specifically, the text is encoded by a pre-trained word2vec word vector model, and the picture in each sentence is encoded by a picture encoder (ResNet) with an additional fully connected layer; each word feature encoding in the text is then combined with the picture feature encoding, fully considering the correlation between the picture and the text. The calculation formula is:

$$e_i = w_i + v$$ (formula 3)

where $w_i$ represents a word encoding vector and $v$ represents the picture encoding vector. Finally, each word encoding vector in the text encoding is superimposed and fused with the picture encoding vector, and the multi-modal (text + picture) encoding obtained after fusion is passed to the encoding layer. The encoding layer uses a convolutional neural network (CNN) for encoding, the resulting semantic encoding is passed to the decoding layer, and the decoding layer uses a bidirectional long short-term memory network (BiLSTM) for decoding. This improved image-text encoder model differs from the existing approach of directly splicing the text encoding and the picture encoding: the correlation between pictures and text is fully considered, and the picture vector encoding is fused into each text vector encoding, so that the generated answer is more accurate.
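A hedged PyTorch sketch of such an image-text encoder is given below; the layer sizes, the additive fusion of formula 3, and the single convolutional encoding layer are simplifying assumptions for illustration, not the exact architecture of the embodiment.

```python
# Picture feature (ResNet18 + extra fully connected layer) fused into every word vector,
# followed by a CNN encoding layer and a BiLSTM decoding layer.
import torch
import torch.nn as nn
import torchvision.models as models

class ImageTextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=300, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)     # initialised from word2vec in practice
        backbone = models.resnet18(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True
        backbone.fc = nn.Identity()
        self.pic_encoder = backbone
        self.pic_fc = nn.Linear(512, emb_dim)                 # the additional fully connected layer
        self.encoding_layer = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)   # CNN encoder
        self.decoding_layer = nn.LSTM(hidden, hidden, batch_first=True,
                                      bidirectional=True)     # BiLSTM decoder

    def forward(self, token_ids, picture):
        words = self.word_emb(token_ids)                      # (batch, seq_len, emb_dim)
        pic = self.pic_fc(self.pic_encoder(picture))          # (batch, emb_dim)
        fused = words + pic.unsqueeze(1)                      # formula 3: fuse the picture into every word
        enc = self.encoding_layer(fused.transpose(1, 2)).transpose(1, 2)
        out, _ = self.decoding_layer(enc)                     # (batch, seq_len, 2 * hidden)
        return out

# e.g. ImageTextEncoder()(torch.randint(0, 30000, (2, 12)), torch.rand(2, 3, 224, 224))
```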
In an optional implementation manner of the embodiment of the present application, before the retrieving of the first dialogue content in step 102 to obtain the first answer obtained by combining the M target text vectors and the N target picture vectors, the method of the embodiment of the present application further includes:
step 11, obtaining historical question and answer contents, wherein the historical question and answer contents comprise question type contents;
step 12, dividing the historical question and answer content into text content and pictures;
step 13, determining the service types of the text content and the picture;
and step 14, storing the text content in the corresponding text knowledge base according to the service type, and storing the picture in the corresponding picture knowledge base.
Through the above steps 11 to 14, the text and the pictures in the question content and the response content of the question-type content are separated out from the historical text content, i.e., the historical data, and the corresponding service types are determined, so that the initial neural network can subsequently be trained.
In a specific application scenario, as shown in fig. 3, historical manual question-answering records are used as the original dialogue corpus, the user's multimodal message (including text and picture messages) is taken as the question, and the reply message is taken as the response to the current question. The user's multimodal dialogue corpus is divided into a text corpus and an image-text corpus. For the text corpus, the text service model judges which service scene the text question and its preceding questions belong to, and the question and its response are stored in the text knowledge base of the corresponding service scene; for the image-text corpus, the picture service model judges which service scene the picture question belongs to, and the picture information and its response are stored in the image-text knowledge base of the corresponding service scene.
The text service classification model is a text classification model implemented based on a convolutional neural network (TextCNN); through this model, the service scene to which the current user's text question belongs can be output. The picture service model is a picture classification model implemented based on a residual network (ResNet); through this model, the service scene to which the current user's picture question belongs can be output.
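For illustration, a minimal TextCNN service classifier might look as follows; the vocabulary size, kernel sizes and number of service scenes are assumptions. The picture service model can follow the same pattern with a ResNet backbone, e.g. torchvision.models.resnet18(num_classes=num_services).

```python
# Illustrative TextCNN classifier that routes a text question to a business scene.
import torch
import torch.nn as nn

class TextServiceClassifier(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=300, num_services=5,
                 kernel_sizes=(2, 3, 4), channels=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList([nn.Conv1d(emb_dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), num_services)

    def forward(self, token_ids):
        x = self.emb(token_ids).transpose(1, 2)                         # (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))                        # logits over service scenes
```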
In an optional implementation manner of the embodiment of the present application, the method of the embodiment of the present application may further include:
step 21, generating a corresponding text vector for the text content in the text knowledge base, and generating a corresponding picture vector for the picture in the picture knowledge base;
step 22, creating an index of the text vector and an index of the picture vector;
and step 23, storing the text vector after the index is created into a text vector library, and storing the picture vector after the index is created into a picture vector library.
In a specific application scenario, a text vector may be generated from the text content based on formulas 1 and 2 above, and a picture vector may be generated from the picture, for example through the pre-trained ResNet18 model; furthermore, indexes of the text vectors and the picture vectors need to be created, where the index corresponds to the service type. The specific process is shown in fig. 4 and fig. 5: the questions in the text knowledge base and the image-text knowledge base are expressed as text vectors and picture vectors respectively, indexes are then created, and a text vector library and a picture vector library are generated.
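A short sketch of steps 21 to 23 for one service scene is given below, assuming hnswlib as the vector index library and that the question vectors have already been produced by the sentence-vector and picture-vector functions sketched earlier; the integer ids of the index are assumed to point back to the stored answers in the knowledge base.

```python
# Build one vector library (index) per service scene from pre-computed question vectors.
import numpy as np
import hnswlib

def build_vector_library(question_vectors: list[np.ndarray]) -> hnswlib.Index:
    dim = question_vectors[0].shape[0]
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=len(question_vectors), ef_construction=200, M=16)
    index.add_items(np.stack(question_vectors), ids=np.arange(len(question_vectors)))
    return index

# text_library = build_vector_library([question_text_vector(w) for w in tokenised_questions])
# picture_library = build_vector_library([picture_vector(p).numpy() for p in picture_paths])
```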
Further, in this embodiment of the present application, before inputting the first dialog content, the first answer, and the second answer into the target network model and outputting the target answer matching the first dialog content, the method of this embodiment of the present application may further include:
step 31, encoding the answers corresponding to the answer contents in the text vector library and the picture vector library to obtain identifiers, wherein the identifiers are used for indicating the answer type of each answer, and the answer types include answers obtained based on encoding vectors (i.e., generated answers) and answers obtained based on target vectors (i.e., retrieved answers);
and 32, training the initial network model through a target training set to obtain a target network model, wherein the target training set comprises texts in a text vector library, pictures in a picture vector library and identifiers.
Through the above steps 31 and 32, different types of answers are coded in the embodiment of the present application, that is, the types of answers to be output can be distinguished through coding.
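As a simple illustration of steps 31 and 32, the answer-type identifier can be attached to each stored question-answer pair before training; the 1/0 convention follows the Answers Encoding described later, and the helper name and tuple layout are hypothetical.

```python
# Attach a one-bit answer-type identifier so the target network model can later
# distinguish retrieved answers (1) from generated answers (0).
def build_training_samples(retrieved_pairs, generated_pairs):
    # each element of the input lists is a (question, answer) pair
    return ([(q, a, 1) for q, a in retrieved_pairs] +    # retrieval-type answers
            [(q, a, 0) for q, a in generated_pairs])     # generation-type answers
```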
That is to say, the method of the embodiment of the present application may further include:
and step 108, inputting the first dialogue content, the first answer and the second answer into the target network model and outputting an identifier corresponding to the target answer.
The following describes the present application by way of example with reference to a specific implementation of an embodiment of the present application. Based on the schematic diagram of obtaining answers through multiple rounds of dialogue shown in fig. 6, the multi-round dialogue flow in this specific implementation includes:
step a, recalling answers through a retrieval method;
the method comprises the steps of dividing the space into text indexes and picture indexes, classifying the text indexes and the picture indexes through a text service model and a picture service model respectively, obtaining text vectors by adopting a p-means method, and obtaining picture vectors by adopting a pre-training ResNet18 model. And respectively obtaining M text similar vectors and N picture similar vectors from the text vector and picture vector index libraries through the similarity between the vectors, and obtaining M text candidate answers and N picture candidate answers before indexing. And finally, arranging and combining the M text answers and the N picture answers to obtain M × N answers after arrangement and combination. As the answers of the same question and picture in different service scenes are possibly different, the service classification model of the text and the picture is introduced, and the accuracy of the index answer is improved.
Step b, generating an answer;
T candidate answers are recalled based on the image-text encoder model in fig. 2. Unlike the prior-art approach of directly splicing the text encoding and the picture encoding, the picture vector encoding is fused into each word vector encoding, so the features of the text and the picture are effectively fused, the correlation between the text and the picture is fully considered, and the quality of the answers recalled by the generation model is improved.
Step c, obtaining a final optimal answer;
As shown in fig. 7, the candidate answers are ranked by the image-text matching model: the user question, the M × N candidate answers recalled by the retrieval method, and the T candidate answers recalled by the improved image-text encoder model are input into the trained image-text matching model, and the optimal answer is obtained as the answer to the user question.
The image-text matching model vectorizes the question and the answer through separate Embedding layers, splices the vectors, and produces a matching output through a CNN layer. The key feature of the network model is that the answer to be matched is preprocessed in the answer Embedding layer: an Answers Encoding optimization strategy is introduced that encodes the type of the answer sequence (retrieved or generated), for example encoding a retrieved answer as 1 and a generated answer as 0, and splices this code behind the Answers Embedding layer as an identifier of the answer type, so that the type of the answer to be matched can be distinguished. The optimized embedding vector is then input into the CNN for question-answer matching. Since answers recalled by the retrieval method are generally more accurate than answers recalled by the generation method, Answers Encoding enables the model to distinguish whether an answer to be matched is of the retrieval type or the generation type; the trained model can therefore capture the answer-type information, and the result obtained by the model matching is better than the answer recalled by the retrieval method alone.
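A hedged PyTorch sketch of such an image-text matching model is given below; the embedding sizes, the CNN head, and the exact place where the answer-type bit is spliced onto the answer embedding are assumptions for illustration, not the embodiment's exact architecture.

```python
# Question and answer are embedded separately; the one-bit answer-type code
# (retrieved = 1, generated = 0) is spliced onto the answer embedding; CNN features
# of both sides are then scored as a question-answer match.
import torch
import torch.nn as nn

class ImageTextMatchingModel(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, channels=64):
        super().__init__()
        self.q_emb = nn.Embedding(vocab_size, emb_dim)
        self.a_emb = nn.Embedding(vocab_size, emb_dim)
        self.q_cnn = nn.Sequential(nn.Conv1d(emb_dim, channels, 3, padding=1),
                                   nn.ReLU(), nn.AdaptiveMaxPool1d(1))
        self.a_cnn = nn.Sequential(nn.Conv1d(emb_dim + 1, channels, 3, padding=1),
                                   nn.ReLU(), nn.AdaptiveMaxPool1d(1))
        self.score = nn.Linear(2 * channels, 1)

    def forward(self, question_ids, answer_ids, answer_type):
        q = self.q_cnn(self.q_emb(question_ids).transpose(1, 2)).squeeze(-1)
        a_seq = self.a_emb(answer_ids)                                   # (batch, seq_len, emb_dim)
        type_bit = answer_type.float().view(-1, 1, 1).expand(-1, a_seq.size(1), 1)
        a_seq = torch.cat([a_seq, type_bit], dim=2)                      # Answers Encoding: splice the type code
        a = self.a_cnn(a_seq.transpose(1, 2)).squeeze(-1)
        return self.score(torch.cat([q, a], dim=1)).squeeze(-1)          # matching score per pair

# The candidate with the highest score among the M*N retrieved and T generated answers is chosen.
```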
As shown in fig. 8, an embodiment of the present application further provides an apparatus for processing dialog content, where the apparatus includes:
a first processing module 82, configured to retrieve the first dialog content to obtain a first answer obtained by combining M target text vectors and N target picture vectors, where the M target text vectors correspond to texts in the first dialog content, and the N target picture vectors correspond to pictures in the first dialog content; the first dialog content comprises question-type content;
a second processing module 84, configured to encode the first dialog content to obtain a second answer that is obtained by combining L text encoding vectors and Y picture encoding vectors; the L text coding vectors are obtained by coding texts in the first dialogue content, and the Y picture coding vectors are obtained by coding pictures in the first dialogue content;
and a third processing module 86, configured to input the first dialog content, the first answer, and the second answer into the target network model, and output a target answer matched with the first dialog content, where the target answer is one of the first answer and the second answer, and M, N, L, and Y are positive integers.
By the device, M target text vectors and N target picture vectors corresponding to the text and pictures in the first dialogue content can be obtained by retrieval, and a first answer is formed by combining them; the first dialogue content can also be encoded to obtain a second answer formed by combining L text coding vectors and Y picture coding vectors. In other words, the text and the pictures in the first dialogue content are fused in two ways, and multiple combinations of the text and the pictures are taken into account, which improves the accuracy of answer generation; the target answer obtained from the first answer and the second answer is therefore more accurate, which solves the prior-art problem of low answer accuracy with the manual-template-based approach.
Optionally, the first processing module 82 in this embodiment of the present application further includes: an identifying unit, configured to identify M text vectors and N picture vectors from the first dialog content; the first determining unit is used for determining first service types corresponding to the M text vectors and determining second service types corresponding to the N picture vectors; the second determining unit is used for determining a first index according to the first service type and the first mapping relation and determining a second index according to the second service type and the second mapping relation, wherein the first mapping relation is used for indicating the relation between the service type of the text vector and the index, and the second mapping relation is used for indicating the relation between the service type of the picture vector and the index; a third determining unit, configured to determine M target text vectors from the text vector library according to the first index, and determine N target picture vectors from the picture vector library according to the second index, where a similarity between a text vector and each target text vector is lower than a first threshold, and a similarity between a picture vector and each target picture vector is lower than a second threshold; the first processing unit is used for obtaining a first answer based on the combination of the M target text vectors and the N target picture vectors.
Optionally, the second processing module 84 in this embodiment of the present application further includes: the first coding unit is used for coding the text in the first dialogue content to obtain L text coding vectors; the second coding unit is used for coding the pictures in the first conversation content to obtain Y picture coding vectors; and the second processing unit is used for fusing the L text coding vectors and the Y picture coding vectors to obtain a second answer.
Optionally, the apparatus in this embodiment of the present application may further include: the obtaining module is used for obtaining historical question and answer contents before retrieving the first dialogue content to obtain the first answer obtained by combining the M target text vectors and the N target picture vectors, wherein the historical question and answer contents comprise question contents and answer contents; the dividing module is used for dividing the historical question and answer content into text content and pictures; the determining module is used for determining the service types of the text content and the picture; and the first storage module is used for storing the text content in the corresponding text knowledge base according to the service type and storing the picture in the corresponding picture knowledge base.
Optionally, the apparatus in this embodiment of the present application may further include: the recognition module is used for generating corresponding text vectors from the text contents in the text knowledge base and generating corresponding picture vectors from the pictures in the picture knowledge base; the creating module is used for creating an index of a text vector and an index of a picture vector; and the second storage module is used for storing the text vector after the index is created into a text vector library and storing the picture vector after the index is created into a picture vector library.
Optionally, the apparatus in this embodiment of the present application may further include: the encoding module is used for encoding answers corresponding to the answer contents in the text vector library and the picture vector library to obtain identifiers before inputting the first dialogue contents, the first answers and the second answers into the target network model and outputting the target answers matched with the first dialogue contents, wherein the identifiers are used for indicating answer types of the answers, and the answer types comprise answers obtained based on encoding vectors and answers obtained based on vectors; and the training module is used for training the initial network model through a target training set to obtain a target network model, wherein the target training set comprises texts in a text vector library, pictures in a picture vector library and identifiers.
Optionally, the apparatus in this embodiment of the present application may further include: and the fourth processing module is used for inputting the first dialogue content, the first answer and the second answer into the target network model and outputting the identifier corresponding to the target answer.
An embodiment of the present application further provides an electronic device. As shown in fig. 9, the electronic device includes: a processor 901, a memory 902, and a program 9021 stored in the memory 902 and executable on the processor. When executed by the processor, the program implements each process of the above embodiment of the dialog content processing method and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the processes of the above method embodiment and can achieve the same technical effects; to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for processing dialog content, comprising:
retrieving first dialogue content to obtain a first answer obtained by combining M target text vectors and N target picture vectors, wherein the M target text vectors correspond to texts in the first dialogue content, and the N target picture vectors correspond to pictures in the first dialogue content; the first dialogue content comprises question-type content;
coding the first dialogue content based on an image-text encoding model to obtain a second answer obtained by combining L text encoding vectors and Y picture encoding vectors; wherein the L text encoding vectors are encoded from text in the first dialogue content, and the Y picture encoding vectors are encoded from pictures in the first dialogue content;
inputting the first dialogue content, the first answer and the second answer into a target network model, and outputting a target answer matched with the first dialogue content, wherein the target answer is one of the first answer and the second answer, and M, N, L, and Y are positive integers.
2. The method of claim 1, wherein the retrieving the first dialog content results in a first answer that is a combination of M target text vectors and N target picture vectors, comprising:
identifying M text vectors and N picture vectors from the first dialog content;
determining first service types corresponding to the M text vectors, and determining second service types corresponding to the N picture vectors;
determining a first index according to the first service type and a first mapping relation, and determining a second index according to the second service type and a second mapping relation, wherein the first mapping relation is used for indicating the relation between the service type and the index of a text vector, and the second mapping relation is used for indicating the relation between the service type and the index of a picture vector;
determining M target text vectors from a text vector library according to the first index, and determining N target picture vectors from a picture vector library according to the second index, wherein the similarity between the text vector and each target text vector is not lower than a first threshold value, and the similarity between the picture vector and each target picture vector is not lower than a second threshold value;
obtaining the first answer based on a combination of the M target text vectors and the N target picture vectors.
3. The method of claim 1, wherein said encoding the first dialog content results in a second answer that is a combination of L text encoding vectors and Y picture encoding vectors, comprising:
encoding the text in the first dialogue content to obtain the L text encoding vectors;
coding pictures in the first conversation content to obtain the Y picture coding vectors;
and fusing the L text coding vectors and the Y picture coding vectors to obtain the second answer.
4. The method of claim 1, wherein prior to retrieving the first dialog content resulting in the first answer resulting from the combination of the M target text vectors and the N target picture vectors, the method further comprises:
obtaining historical question and answer contents, wherein the historical question and answer contents comprise question type contents;
dividing the historical question and answer content into text content and pictures;
determining the service types of the text content and the picture;
and storing the text content in a corresponding text knowledge base according to the service type, and storing the picture in a corresponding picture knowledge base.
5. The method of claim 4, further comprising:
generating text contents in the text knowledge base into corresponding text vectors, and generating corresponding picture vectors from pictures in the picture knowledge base;
creating an index of the text vector and creating an index of the picture vector;
and storing the text vector after the index is created into a text vector library, and storing the picture vector after the index is created into a picture vector library.
6. The method of claim 5, wherein before said inputting said first dialog content, said first answer, and said second answer into a target network model and outputting a target answer matching said first dialog content, said method comprises:
coding answers corresponding to answer contents in the text vector library and the picture vector library to obtain identifiers, wherein the identifiers are used for indicating answer types of the answers, and the answer types comprise answers obtained based on coding vectors and answers obtained based on vectors;
and training an initial network model through a target training set to obtain the target network model, wherein the target training set comprises texts in the text vector library, pictures in the picture vector library and the identifiers.
7. The method of claim 6, further comprising:
inputting the first dialogue content, the first answer and the second answer into a target network model, and outputting an identifier corresponding to the target answer.
8. A device for processing dialog content, comprising:
a first processing module, configured to retrieve a first dialog content to obtain a first answer that is obtained by combining M target text vectors and N target picture vectors, where the M target text vectors correspond to texts in the first dialog content, and the N target picture vectors correspond to pictures in the first dialog content; the first dialog content comprises question-type content;
the second processing module is used for coding the first dialogue content based on the image-text coding model to obtain a second answer obtained by combining L text coding vectors and Y picture coding vectors; wherein the L text encoding vectors are encoded from text in the first dialog content, and the Y picture encoding vectors are encoded from pictures in the first dialog content;
and a third processing module, configured to input the first dialog content, the first answer, and the second answer into a target network model, and output a target answer matched with the first dialog content, where the target answer is one of the first answer and the second answer, and M, N, L, and Y are positive integers.
9. An electronic device, comprising: processor, memory and program stored on the memory and executable on the processor, which when executed by the processor implements the method steps of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202110731986.XA 2021-06-30 2021-06-30 Conversation content processing method and device and related equipment Active CN113177115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110731986.XA CN113177115B (en) 2021-06-30 2021-06-30 Conversation content processing method and device and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110731986.XA CN113177115B (en) 2021-06-30 2021-06-30 Conversation content processing method and device and related equipment

Publications (2)

Publication Number Publication Date
CN113177115A CN113177115A (en) 2021-07-27
CN113177115B true CN113177115B (en) 2021-10-26

Family

ID=76927990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110731986.XA Active CN113177115B (en) 2021-06-30 2021-06-30 Conversation content processing method and device and related equipment

Country Status (1)

Country Link
CN (1) CN113177115B (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112236766A (en) * 2018-04-20 2021-01-15 脸谱公司 Assisting users with personalized and contextual communication content
CN109670065A (en) * 2018-09-25 2019-04-23 平安科技(深圳)有限公司 Question and answer processing method, device, equipment and storage medium based on image recognition
CN109359196A (en) * 2018-10-22 2019-02-19 北京百度网讯科技有限公司 Text Multimodal presentation method and device
CN109800294A (en) * 2019-01-08 2019-05-24 中国科学院自动化研究所 Autonomous evolution Intelligent dialogue method, system, device based on physical environment game
WO2020143130A1 (en) * 2019-01-08 2020-07-16 中国科学院自动化研究所 Autonomous evolution intelligent dialogue method, system and device based on physical environment game
CN110209789A (en) * 2019-05-29 2019-09-06 山东大学 A kind of multi-modal dialog system and method for user's attention guidance
CN110895561A (en) * 2019-11-13 2020-03-20 中国科学院自动化研究所 Medical question and answer retrieval method, system and device based on multi-mode knowledge perception
CN111858869A (en) * 2020-01-03 2020-10-30 北京嘀嘀无限科技发展有限公司 Data matching method and device, electronic equipment and storage medium
CN111814843A (en) * 2020-03-23 2020-10-23 同济大学 End-to-end training method and application of image feature module in visual question-answering system
CN111460121A (en) * 2020-03-31 2020-07-28 苏州思必驰信息科技有限公司 Visual semantic conversation method and system
CN111967272A (en) * 2020-06-23 2020-11-20 合肥工业大学 Visual dialog generation system based on semantic alignment
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN112579759A (en) * 2020-12-28 2021-03-30 北京邮电大学 Model training method and task type visual dialogue problem generation method and device
CN113033711A (en) * 2021-05-21 2021-06-25 北京世纪好未来教育科技有限公司 Title correction method and device, electronic equipment and computer storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Two-Step Neural Dialog State Tracker for Task-Oriented Dialog Processing";A-Yeong Kim et al.;《Computational Intelligence and Neuroscience》;20181018;第1-11页 *
"检索与生成相结合的短文本对话研究";胡艺钟;《中国优秀硕士学位论文全文数据库 (信息科技辑)》;20190115;第1-56页 *

Also Published As

Publication number Publication date
CN113177115A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN109697282B (en) Sentence user intention recognition method and device
CN111523306A (en) Text error correction method, device and system
CN111046132A (en) Customer service question and answer processing method and system for retrieving multiple rounds of conversations
CN111177359A (en) Multi-turn dialogue method and device
CN111460115B (en) Intelligent man-machine conversation model training method, model training device and electronic equipment
CN108228576B (en) Text translation method and device
CN114896373B (en) Image-text mutual inspection model training method and device, image-text mutual inspection method and equipment
CN111858854B (en) Question-answer matching method and relevant device based on historical dialogue information
CN108959388B (en) Information generation method and device
CN110472029B (en) Data processing method, device and computer readable storage medium
CN111291172A (en) Method and device for processing text
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
WO2023108994A1 (en) Sentence generation method, electronic device and storage medium
CN112328758A (en) Session intention identification method, device, equipment and storage medium
CN111552787B (en) Question-answering processing method, device, equipment and storage medium
CN110597968A (en) Reply selection method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN114328817A (en) Text processing method and device
CN117370512A (en) Method, device, equipment and storage medium for replying to dialogue
CN116522905B (en) Text error correction method, apparatus, device, readable storage medium, and program product
CN113177115B (en) Conversation content processing method and device and related equipment
CN116975288A (en) Text processing method and text processing model training method
CN115588429A (en) Error correction method and device for voice recognition
CN111626059B (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant