CN110209784B - Message interaction method, computer device and storage medium - Google Patents


Info

Publication number
CN110209784B
CN110209784B
Authority
CN
China
Prior art keywords
image
message
response
text
text message
Prior art date
Legal status
Active
Application number
CN201910346251.8A
Other languages
Chinese (zh)
Other versions
CN110209784A (en)
Inventor
缪畅宇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910346251.8A priority Critical patent/CN110209784B/en
Publication of CN110209784A publication Critical patent/CN110209784A/en
Application granted granted Critical
Publication of CN110209784B publication Critical patent/CN110209784B/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 - Retrieval characterised by using metadata, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a message interaction method, a computer device, and a storage medium, belonging to the field of network technologies. The technical solution provided by the embodiments of the invention can obtain an image message based on a text message sent by a terminal and reply with a response image, which increases the interest of the human-machine dialogue process, greatly improves its intelligence, and enriches the ways in which human-machine dialogue can be implemented.

Description

Message interaction method, computer device and storage medium
Technical Field
The present invention relates to the field of network technologies, and in particular, to a message interaction method, a computer device, and a storage medium.
Background
With the popularization of intelligent question-answering products such as customer-service assistants and dialogue robots, users can send query messages to these products, which automatically reply with response messages, thereby realizing a human-machine dialogue between user and machine.
On the other hand, with the popularity of instant messaging clients, users chatting through such clients need to express ideas not only with text but also with images; collections of these images are commonly and vividly called "expression packs" (sticker packs).
In current intelligent question-answering products, regardless of whether the query message sent by the user is text or an image, the product can only reply with a text response message after analyzing the query. As a result, the human-machine dialogue process lacks interest and shows poor intelligence.
Disclosure of Invention
The embodiments of the present invention provide a message interaction method, a computer device, and a storage medium, which can solve the problem of poor intelligence in the human-machine dialogue process. The technical solution is as follows:
in one aspect, a message interaction method is provided, the method including:
receiving a first text message of a terminal;
acquiring a second text message based on the text characteristics of the first text message, wherein the second text message is a response text message of the first text message;
acquiring a first response image according to the second text message, wherein the semantic similarity between the first response image and the second text message accords with a first target condition;
and sending the first response image to the terminal.
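The four steps above can be sketched as a single server-side handler. The names `text_responder` and `similarity_model` below are illustrative stand-ins for the patent's text response model and similarity determination model, not actual components:

```python
# Hedged sketch of the server-side flow: first text message in,
# response text and response image out.

def handle_text_message(first_text, text_responder, similarity_model, image_db):
    """Reply to a terminal's first text message with a response image."""
    # Acquire the second text message (the response text) from the first.
    second_text = text_responder(first_text)
    # Acquire the first response image: here the first target condition is
    # taken to be "highest semantic similarity to the second text message".
    first_response_image = max(
        image_db, key=lambda img: similarity_model(second_text, img)
    )
    # The image would then be sent back to the terminal.
    return second_text, first_response_image
```

Note that the response image is matched against the *second* (response) text message, not the user's original message; this is the distinguishing feature of this first method.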
In one possible implementation manner, the acquiring the first response image according to the second text message includes:
inputting the second text message into a similarity determination model, and acquiring the first response image from at least one image in an image database corresponding to the similarity determination model through the similarity determination model.
In one possible implementation manner, the obtaining, by the similarity determination model, the first response image from at least one image in an image database corresponding to the similarity determination model includes:
acquiring semantic similarity between the second text message and at least one image in the image database through the similarity determination model;
and sorting the at least one image in descending order of semantic similarity, and determining the image that meets the first target condition as the first response image.
In one possible implementation manner, the obtaining, by the similarity determination model, the semantic similarity between the second text message and at least one image in the image database includes:
extracting text features of the second text message through a natural language processing sub-model in the similarity determination model;
extracting image features of the at least one image in the image database through an image processing sub-model in the similarity determination model;
and acquiring the semantic similarity between the second text message and the at least one image according to the text features of the second text message and the image features of the at least one image.
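A common way to realize this two-branch computation, shown here purely as an assumed sketch, is to treat the text features and image features as vectors and take their cosine as the semantic similarity; the sub-models that produce the vectors are outside this sketch:

```python
import math

# Assumed sketch: the natural language processing sub-model and the image
# processing sub-model each yield a feature vector; semantic similarity
# is taken as the cosine of the angle between the two vectors.

def cosine_similarity(text_features, image_features):
    """Semantic similarity between a text feature vector and an image
    feature vector, in [-1, 1]."""
    dot = sum(t * i for t, i in zip(text_features, image_features))
    norm_t = math.sqrt(sum(t * t for t in text_features))
    norm_i = math.sqrt(sum(i * i for i in image_features))
    if norm_t == 0.0 or norm_i == 0.0:
        return 0.0
    return dot / (norm_t * norm_i)
```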
In one possible implementation manner, the acquiring the second text message based on the text feature of the first text message includes:
inputting the first text message into a text response model, acquiring text characteristics of the first text message through the text response model, and outputting the second text message based on the text characteristics of the first text message.
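As a minimal assumed stand-in for such a text response model, a lookup table already illustrates the interface (first text message in, second text message out); the canned replies are invented, and a real implementation would be a learned dialogue model:

```python
# Hypothetical stand-in for the text response model described above.

CANNED_REPLIES = {
    "hello": "Hello! Nice to see you.",
    "how are you": "I'm doing great, thanks for asking!",
}

def text_response_model(first_text):
    """Return the second text message for a given first text message."""
    # Normalize: lower-case and strip trailing punctuation/whitespace.
    key = first_text.strip().lower().rstrip("?!. ")
    return CANNED_REPLIES.get(key, "Sorry, I didn't catch that.")
```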
In one possible implementation, the first response image is: among the images whose semantic similarity with the second text message is greater than a target threshold, the image with the highest semantic similarity; or,
the first response image is: among the images whose semantic similarity with the second text message is greater than the target threshold, an image whose semantic similarity ranks within the top first target number of images.
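Both variants of the first target condition (highest similarity above a threshold, or top-N above a threshold) reduce to one selection rule, sketched here with invented names:

```python
# Keep images whose similarity exceeds the target threshold, sort in
# descending order, and take the top `target_number` of them; a target
# number of 1 yields the single highest-similarity image.

def select_response_images(similarities, target_threshold, target_number=1):
    candidates = [
        (image, score)
        for image, score in similarities.items()
        if score > target_threshold
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [image for image, _ in candidates[:target_number]]
```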
In one possible implementation, the method further includes:
determining an intention label corresponding to the first text message based on application scene information of the first text message;
and when the intention label is a target label, executing the operation of acquiring a second text message based on the text features of the first text message.
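The intent-gating step might look like the following sketch; the label values (`"chitchat"`, `"task"`) and the scene-information field are invented for illustration:

```python
# Assumed sketch: the image-reply pipeline only runs when the intent
# label derived from the application scene matches a target label.

def should_run_image_reply(scene_info, target_labels=("chitchat",)):
    intent_label = "chitchat" if scene_info.get("scenario") == "casual" else "task"
    return intent_label in target_labels
```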
In one aspect, a message interaction method is provided, the method including:
receiving a first text message of a terminal;
determining a type of a response message based on text characteristics of the first text message;
when the type of the response message is an image, acquiring a second response image according to the first text message, wherein the matching degree between the second response image and the first text message accords with a second target condition;
and sending the second response image to the terminal.
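In contrast with the first method, this variant decides the reply type first and consults the image database only when an image reply is wanted. A hedged sketch, with `classify_reply_type` and `match_degree` as illustrative stand-ins for the patent's first classification model and second classification model:

```python
# Hedged sketch of the type-decision method: classify first, then pick
# an image only when the reply type is "image".

def handle_with_type_decision(first_text, classify_reply_type, match_degree,
                              image_db, text_responder):
    reply_type = classify_reply_type(first_text)
    if reply_type == "image":
        # Second target condition, taken here as "highest matching degree".
        return max(image_db, key=lambda img: match_degree(first_text, img))
    return text_responder(first_text)
```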
In one possible implementation manner, the obtaining a second response image according to the first text message includes:
and inputting the first text message into a second classification model, and acquiring the second response image from at least one image in an image database corresponding to the second classification model through the second classification model.
In one possible implementation manner, the obtaining, by the second classification model, the second response image from at least one image in an image database corresponding to the second classification model includes:
obtaining the matching degree between the first text message and at least one image in the image database through the second classification model;
and sorting the at least one image in descending order of matching degree, and determining the image that meets the second target condition as the second response image.
In one possible implementation manner, the obtaining, by the second classification model, the matching degree between the first text message and at least one image in the image database includes:
extracting text features of the first text message through a natural language processing sub-model in the second classification model;
extracting image features of the at least one image in the image database through an image processing sub-model in the second classification model;
and acquiring the matching degree between the first text message and the at least one image according to the text characteristics of the first text message and the image characteristics of the at least one image.
In one possible implementation manner, the determining the type of the response message based on the text feature of the first text message includes:
and inputting the first text message into a first classification model, classifying the response message of the first text message through the first classification model, and outputting the type of the response message.
In one possible implementation manner, the second response image is: the image with the highest matching degree with the first text message; or,
the second response image is: an image whose matching degree with the first text message ranks within the top second target number of images.
In one possible implementation, the method further includes:
when an image message of a terminal is received, a third response image is acquired according to the image message, the matching degree between the third response image and the image message meets a third target condition, and the third response image is sent to the terminal.
In one possible implementation manner, the acquiring a third response image according to the image message includes:
and inputting the image message into the second classification model, and acquiring the third response image from at least one image in an image database corresponding to the second classification model through the second classification model.
In one possible implementation manner, the obtaining, by the second classification model, the third response image from at least one image in the image database corresponding to the second classification model includes:
obtaining the matching degree between the image message and at least one image in the image database through the second classification model;
and sorting the at least one image in descending order of matching degree, and determining the image that meets the third target condition as the third response image.
In one possible implementation manner, the obtaining, by the second classification model, the matching degree between the image message and at least one image in the image database includes:
extracting image features of the image message through an image processing sub-model in the second classification model;
extracting image features of the at least one image in the image database through an image processing sub-model in the second classification model;
and acquiring the matching degree between the image message and the at least one image according to the image characteristics of the image message and the image characteristics of the at least one image.
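For this image-to-image branch, once the incoming image message and each database image have been mapped to feature vectors, matching reduces to a nearest-neighbour lookup. The 3-dimensional vectors below are placeholders for real sub-model outputs:

```python
# Sketch of image-to-image matching via nearest neighbour in feature space.

def best_matching_image(message_features, db_features):
    """Return the database image whose feature vector is closest to the
    image message's features (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(db_features, key=lambda name: sq_dist(message_features, db_features[name]))
```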
In one possible implementation manner, the third response image is: the image with the highest matching degree with the image message; or,
the third response image is: an image whose matching degree with the image message ranks within the top third target number of images.
In one possible implementation, the method further includes:
determining an intention label corresponding to the first text message based on application scene information of the first text message;
and when the intention label is a target label, executing the operation of determining the type of the response message based on the text characteristics of the first text message.
In one aspect, a message interaction method is provided, the method including:
sending a first text message to a server in a session interface;
receiving a first response image from the server, wherein the semantic similarity between the first response image and a second text message accords with a first target condition, and the second text message is a response text message of the first text message;
and displaying the first response image on the session interface.
In one aspect, a message interaction method is provided, the method including:
sending a first text message to a server in a session interface;
receiving a second response image from the server, wherein the matching degree between the second response image and the first text message meets a second target condition;
and displaying the second response image on the session interface.
In one aspect, a message interaction method is provided, the method including:
sending an image message to a server in a session interface;
receiving a third response image from the server, wherein the matching degree between the third response image and the image message accords with a third target condition;
and displaying the third response image on the session interface.
In one aspect, there is provided a message interaction apparatus, the apparatus comprising:
the receiving module is used for receiving the first text message of the terminal;
a text message obtaining module, configured to obtain a second text message based on a text feature of the first text message, where the second text message is a response text message of the first text message;
the image acquisition module is used for acquiring a first response image according to the second text message, and the semantic similarity between the first response image and the second text message accords with a first target condition;
and the sending module is used for sending the first response image to the terminal.
In one possible implementation, the image acquisition module is configured to:
and inputting the second text message into a similarity determination model, and acquiring the first response image from at least one image in an image database corresponding to the similarity determination model through the similarity determination model.
In one possible implementation, the image acquisition module includes:
a similarity obtaining sub-module, configured to obtain, through the similarity determining model, a semantic similarity between the second text message and at least one image in the image database;
and the determining submodule is used for sorting the at least one image in descending order of semantic similarity, and determining the image that meets the first target condition as the first response image.
In one possible implementation, the similarity obtaining submodule is configured to:
extracting text features of the second text message through a natural language processing sub-model in the similarity determination model;
extracting image features of the at least one image in the image database through an image processing sub-model in the similarity determination model;
and acquiring the semantic similarity between the second text message and the at least one image according to the text characteristics of the second text message and the image characteristics of the at least one image.
In one possible implementation manner, the text message acquiring module is configured to:
inputting the first text message into a text response model, acquiring text characteristics of the first text message through the text response model, and outputting the second text message based on the text characteristics of the first text message.
In one possible implementation, the first response image is: among the images whose semantic similarity with the second text message is greater than a target threshold, the image with the highest semantic similarity; or,
the first response image is: among the images whose semantic similarity with the second text message is greater than the target threshold, an image whose semantic similarity ranks within the top first target number of images.
In one possible implementation, the apparatus further includes:
an intention determining module, configured to determine an intention tag corresponding to the first text message based on application scene information of the first text message;
and when the intention label is a target label, triggering the text message acquisition module to execute the operation of acquiring a second text message based on the text characteristics of the first text message.
In one aspect, there is provided a message interaction apparatus, the apparatus comprising:
the receiving module is used for receiving the first text message of the terminal;
a type determining module, configured to determine a type of a response message based on a text feature of the first text message;
the image acquisition module is used for acquiring a second response image according to the first text message when the type of the response message is an image, and the matching degree between the second response image and the first text message accords with a second target condition;
and the sending module is used for sending the second response image to the terminal.
In one possible implementation, the image acquisition module is configured to: and inputting the first text message into a second classification model, and acquiring the second response image from at least one image in an image database corresponding to the second classification model through the second classification model.
In one possible implementation, the image acquisition module includes:
a first matching degree determining submodule, configured to obtain, through the second classification model, a matching degree between the first text message and at least one image in the image database;
and the first image determining sub-module is used for sorting the at least one image in descending order of matching degree, and determining the image that meets the second target condition as the second response image.
In one possible implementation manner, the matching degree determining submodule is configured to:
extracting text features of the first text message through a natural language processing sub-model in the second classification model;
extracting image features of the at least one image in the image database through an image processing sub-model in the second classification model;
and acquiring the matching degree between the first text message and the at least one image according to the text features of the first text message and the image features of the at least one image.
In one possible implementation manner, the type determining module is configured to input the first text message into a first classification model, classify a response message of the first text message by using the first classification model, and output a type of the response message.
In one possible implementation manner, the second response image is: the image with the highest matching degree with the first text message; or,
the second response image is: an image whose matching degree with the first text message ranks within the top second target number of images.
In one possible implementation manner, the image acquisition module is further configured to, when receiving an image message of a terminal, acquire a third response image according to the image message, where a matching degree between the third response image and the image message meets a third target condition, and trigger the sending module to send the third response image to the terminal.
In one possible implementation, the image acquisition module is further configured to:
and inputting the image message into the second classification model, and acquiring the third response image from at least one image in an image database corresponding to the second classification model through the second classification model.
In one possible implementation, the image acquisition module further includes:
the second matching degree acquisition sub-module is used for acquiring the matching degree between the image message and at least one image in the image database through the second classification model;
and the second image determining sub-module is used for sorting the at least one image in descending order of matching degree, and determining the image that meets the third target condition as the third response image.
In one possible implementation manner, the second matching degree obtaining submodule is configured to:
extracting image features of the image message through an image processing sub-model in the second classification model;
extracting image features of the at least one image in the image database through an image processing sub-model in the second classification model;
and acquiring the matching degree between the image message and the at least one image according to the image characteristics of the image message and the image characteristics of the at least one image.
In one possible implementation manner, the third response image is: the image with the highest matching degree with the image message; or,
the third response image is: an image whose matching degree with the image message ranks within the top third target number of images.
In one possible implementation, the apparatus further includes:
an intention determining module, configured to determine an intention tag corresponding to the first text message based on application scene information of the first text message;
and triggering the type determining module to execute the operation of determining the type of the response message based on the text characteristics of the first text message when the intention label is a target label.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having stored therein at least one instruction that is loaded and executed by the one or more processors to implement operations performed by a message interaction method as described above.
In one aspect, there is provided a message interaction apparatus, the apparatus comprising:
the sending module is used for sending the first text message to the server in the session interface;
the receiving module is used for receiving a first response image from the server, the semantic similarity between the first response image and a second text message accords with a first target condition, and the second text message is a response text message of the first text message;
and the display module is used for displaying the first response image on the session interface.
In one aspect, there is provided a message interaction apparatus, the apparatus comprising:
the sending module is used for sending the first text message to the server in the session interface;
the receiving module is used for receiving a second response image from the server, and the matching degree between the second response image and the first text message accords with a second target condition;
and the display module is used for displaying the second response image on the session interface.
In one aspect, there is provided a message interaction apparatus, the apparatus comprising:
the sending module is used for sending the image message to the server in the session interface;
the receiving module is used for receiving a third response image from the server, and the matching degree between the third response image and the image message accords with a third target condition;
and the display module is used for displaying the third response image on the session interface.
In one aspect, a terminal is provided that includes one or more processors and one or more memories having at least one instruction stored therein that is loaded and executed by the one or more processors to implement the operations performed by the message interaction method described above.
In one aspect, a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the message interaction method described above is provided.
The technical solution provided by the embodiments of the invention can obtain an image message based on the text message sent by the terminal and reply with a response image, which increases the interest of the human-machine dialogue process, greatly improves its intelligence, and enriches the ways in which human-machine dialogue can be implemented.
Drawings
In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of an implementation environment of a message interaction method according to an embodiment of the present invention;
FIG. 2 is an interaction flow chart of a message interaction method provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a message interaction method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a similarity determining model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart diagram of a message interaction method provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a dialog interface provided by an embodiment of the present invention;
FIG. 7 is an interaction flow chart of a message interaction method provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of a second classification model according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart diagram of a message interaction method provided by an embodiment of the present invention;
FIG. 10 is an interaction flow chart of a message interaction method provided by an embodiment of the present invention;
FIG. 11 is a schematic flow chart diagram of a message interaction method provided by an embodiment of the present invention;
FIG. 12 is a schematic illustration of a dialog interface provided by an embodiment of the present invention;
fig. 13 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention;
Fig. 14 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention;
fig. 15 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention;
fig. 16 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention;
fig. 17 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention;
fig. 18 shows a block diagram of a terminal 1800 provided by an exemplary embodiment of the present invention;
fig. 19 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a message interaction method according to an embodiment of the present invention. Referring to fig. 1, at least one terminal 101 and a server 102 may be included in the implementation environment, as described in detail below:
The at least one terminal 101 may be any terminal capable of sending a text message or an image message. After a user logs in to any terminal, the user may send a text message or an image message to the server 102. The server 102 generates a response message based on the message sent by the user (either a response text message or a response image message) and sends the response message to the terminal, so that man-machine interaction is realized by repeating the above steps. In this embodiment, the server can obtain an image message based on the text message sent by the terminal and reply with a response image, which increases the interest of the man-machine conversation process, greatly improves its intelligence, and enriches the ways in which man-machine conversation can be implemented.
In the above process, the user may send a text message or an image message based on dialogue processing logic built into the terminal. For example, an intelligent assistant may be built into the terminal, so that the terminal sends the text message or image message directly through the intelligent assistant. Of course, the user may also send the message through an application client on the terminal, where the application client may be any client supporting man-machine dialogue, for example an instant messaging client configured with an intelligent assistant, a shopping client, a taxi-hailing client, or the like; the application client may also be a chat robot, a dialogue robot, or the like.
The server 102 may be any computer device capable of providing a machine response service. When the server 102 receives a text message from any one of the at least one terminal 101, it may send either a response text message or a response image message to that terminal, according to the situation.
When the server 102 receives an image message from any terminal, it will generally send a response image message to that terminal. The image message sent by the terminal is usually an expression image, which carries little information, so it is difficult to accurately extract text features from it, and the accuracy of a response text message generated from those text features would be greatly reduced. If a response image message is sent to the terminal instead, no text features need to be extracted from the expression image, which improves the accuracy of the response image message and additionally increases the interest of the man-machine conversation process.
In the following, the description takes the computer device being a server as an example. Fig. 2 is an interaction flow chart of a message interaction method according to an embodiment of the present invention. Referring to fig. 2, this embodiment applies to the interaction process between a terminal and a server, and includes:
201. the terminal sends a first text message to the server.
The terminal may be any electronic device capable of sending a message. The first text message may be a chat text message, an inquiry text message, or the like; the embodiment of the present invention does not specifically limit the content of the first text message.
In the above process, the terminal may send the first text message to the server based on the application client, and of course, the terminal may also send the first text message by its own message sending thread.
In some embodiments, the terminal may provide an entry to the dialog interface on the application client in the form of a function option. When the terminal detects the user's click on the function option, it displays the dialog interface; the user inputs the first text message in an input box of the dialog interface, and when the terminal detects the user's click on the send button of the input box, it performs the operation of transmitting the first text message to the server.
202. When the server receives a first text message of the terminal, an intention label corresponding to the first text message is determined based on application scene information of the first text message.
The intention label is used to represent the purpose indicated by a text message or an image message; for example, the intention label may be knowledge question-and-answer, chit-chat, weather inquiry, ticket booking, commodity inquiry, or the like.
The application scenario information may be interface information browsed by the terminal before the dialogue interface is displayed, for example, when the terminal jumps from the introduction interface of the product details to the dialogue interface, the server may determine that the intention label of the text message or the image message is "product inquiry".
In the above receiving process, the server may receive any message sent by the terminal, and when detecting that the message only includes a text string, determine the message as a first text message, where the first text message may carry application scenario information.
In some embodiments, fig. 3 is a schematic diagram of a message interaction method provided by the embodiment of the present invention. Referring to fig. 3, after the server receives the first text message, it obtains the application scenario information of the first text message and performs intent classification on the first text message based on that information, obtaining the intention label corresponding to the first text message. In this way, intent classification can be performed in advance before the message interaction method of the embodiment of the present invention is carried out, and different models can be adopted for different intention labels: for example, when the intention label is "chit-chat", the following step 203 is performed, and when the intention label is "knowledge question and answer", message interaction is performed based on a dedicated knowledge question-and-answer model.
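For illustration, the scenario-based intent classification and dispatch described above can be sketched as follows. This is a minimal sketch under stated assumptions: the tag names, the scenario mapping, and the keyword fallback rules are illustrative stand-ins for a trained classifier, not the patent's actual implementation.

```python
# Hypothetical intent-dispatch sketch; names and rules are assumptions.
SCENARIO_TO_INTENT = {
    "product_detail_page": "commodity inquiry",
    "weather_widget": "weather inquiry",
}

def classify_intent(text, scenario_info):
    # Scenario information takes priority, e.g. the user jumped to the
    # dialog interface from a product-details page.
    if scenario_info in SCENARIO_TO_INTENT:
        return SCENARIO_TO_INTENT[scenario_info]
    # Fallback: simple keyword rules standing in for a trained classifier.
    if "?" in text or "what" in text.lower():
        return "knowledge question and answer"
    return "chat"

def dispatch(text, scenario_info, target_tag="chat"):
    intent = classify_intent(text, scenario_info)
    # Only the target tag is routed to the text response model (step 203);
    # other intents go to dedicated models.
    return "text_response_model" if intent == target_tag else f"model:{intent}"
```

A message arriving from a product-details page is thus routed to the commodity-inquiry model even if its wording alone looks like chat.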
203. When the intention tag is a target tag, the server inputs the first text message into a text response model.
When judging whether the intention label is the target label, the server may match the intention label against the target label according to preset rules, and when the matching succeeds, determine that the intention label is the target label. The target label may be any intention label, and the number of target labels may be one or more; for example, the target label is "chit-chat".
In the above process, the text response model is used to provide a response text. In some embodiments, the text response model may be a retrieval-based response model implemented on keyword retrieval technology. At least one corpus may be configured on the server, and at least one question-answer pair (QA pair) may be stored in each corpus, so that the server can retrieve, based on the first text message, the response text message corresponding to the first text message from the at least one corpus, thereby quickly obtaining the response text message of the first text message.
Alternatively, the at least one question-answer pair may be stored in the form of key-value pairs. In step 203 above, the server inputs the first text message into the retrieval-based response model; how the retrieval is performed is described in detail in step 204 below.
In some embodiments, the text response model may also be a generative response model. The generative response model may be implemented based on neural networks (NNs), so that the server can generate a customized response text message based on the first text message. This remedies the retrieval-based response model's shortage of corpus data, improves flexibility in the man-machine conversation process, and increases the portability of the text response model.
Alternatively, the generative response model may be a SEQ2SEQ (sequence-to-sequence) encoder-decoder network, which generally includes an encoding part and a decoding part. Optionally, the encoding part may be an RNN (recurrent neural network) for semantic recognition, and the decoding part may be an RNN for response generation. Of course, the encoding part or the decoding part may also be an LSTM (long short-term memory) network, a BLSTM (bidirectional long short-term memory) network, or the like; the embodiment of the present invention does not specifically limit the type of network adopted by the encoding part or the decoding part. How the response text message is generated will be described in step 204 below.
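The encoder-decoder data flow just described can be sketched in a toy form: the encoder RNN folds the input tokens into a context vector (the text feature), and the decoder RNN emits output tokens conditioned on it. The weights here are random and the greedy readout assumes a vocabulary the size of the hidden state, so this only illustrates the structure, not a trained model.

```python
import math
import random

random.seed(0)
H = 4  # hidden size (an assumed toy value)

def rnn_step(h, x_vec, W):
    # One recurrent step: h' = tanh(W @ [h; x])
    return [math.tanh(sum(W[i][j] * v for j, v in enumerate(h + x_vec)))
            for i in range(H)]

def encode(tokens, embed, W_enc):
    # Fold the token embeddings into a single context vector,
    # the "text feature" of the first text message.
    h = [0.0] * H
    for tok in tokens:
        h = rnn_step(h, embed[tok], W_enc)
    return h

def decode(h, W_dec, vocab, steps=3):
    # Greedy readout; assumes len(vocab) == H for simplicity.
    out, x = [], [0.0] * H  # all-zero start-token embedding (assumed)
    for _ in range(steps):
        h = rnn_step(h, x, W_dec)
        out.append(vocab[max(range(H), key=h.__getitem__)])
        x = h
    return out
```

A trained model would replace the random weights with learned ones and the greedy readout with a softmax over a full vocabulary.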
204. The server obtains the text characteristics of the first text message through the text response model, and outputs the second text message based on the text characteristics of the first text message.
Wherein the second text message is a response text message to the first text message.
In the above process, when the text response model is a retrieval-based response model, the server may extract a keyword expression of the first text message through the text response model and determine the keyword expression as the text feature of the first text message; for example, the keyword expression may be a boolean expression, a vector expression, or the like. Using the keyword expression as an index, the server then retrieves from the at least one corpus the response text corresponding to the keyword expression, thereby quickly acquiring the second text message, shortening the time needed to acquire it, and guaranteeing its grammatical correctness.
In some embodiments, the question-answer pairs are stored in the at least one corpus in the form of key-value pairs. For each question-answer pair, since a question-answer pair includes a query text and an answer text, the server may store the query text as the key name and the answer text as the key value. When retrieving in each corpus, the server may determine at least one query text similar to the keyword expression and obtain the at least one answer text corresponding to those query texts. When there is a single answer text, it is directly determined as the second text message, shortening the time needed to obtain the second text message; when there are several answer texts, the answer text with the highest confidence among them is determined as the second text message, thereby guaranteeing the accuracy of the second text message as a response to the first text message.
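The key-value retrieval described above can be sketched as follows. This is a minimal sketch under assumptions: the corpus content, whitespace tokenizer, and Jaccard overlap used as the stand-in confidence score are all illustrative, not the patent's actual similarity rule.

```python
# Hypothetical key-value QA corpus; query texts are keys, answers values.
CORPUS = {
    "how is the weather": "It is sunny today.",
    "what is the weather like": "It looks like rain.",
}

def keyword_expression(text):
    # Stand-in keyword expression: the set of lowercase tokens.
    return set(text.lower().split())

def retrieve_answer(first_text, threshold=0.3):
    query_kw = keyword_expression(first_text)
    candidates = []
    for key, answer in CORPUS.items():
        key_kw = keyword_expression(key)
        # Jaccard overlap as an assumed confidence measure.
        overlap = len(query_kw & key_kw) / len(query_kw | key_kw)
        if overlap >= threshold:
            candidates.append((overlap, answer))
    # With several matches, keep the answer with the highest confidence.
    return max(candidates)[1] if candidates else None
```

When no stored query text is similar enough, the sketch returns None, which is where a generative model could take over.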
In the above process, when the keyword expression is a boolean expression, the similarity to the keyword expression means that any query text can satisfy the boolean expression, that is, the query text makes the boolean expression take a value of 1, and in some embodiments, when the keyword expression is a vector expression, the similarity to the keyword expression means that the cosine distance (also referred to as cosine similarity) between the vector of any query text and the vector expression is less than a distance threshold.
In some embodiments, the confidence is used to indicate the accuracy of any answer text as a response to the first text message. For example, when the first text message is ambiguous, answer texts corresponding to each of its senses may be retrieved from the corpus, and the answer text closest to the semantics of the first text message may be selected by confidence: the server may obtain the confidence between the first text message and each answer text in the at least one answer text based on the BM25 algorithm and determine the answer text with the highest confidence as the second text message.
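Using BM25 as the confidence score can be sketched as below. The k1 and b values are the commonly used defaults and the whitespace tokenizer is an assumption; this is an illustrative ranking sketch, not the patent's exact scoring.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return the candidate text with the highest BM25 score for the query."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency of each term across the candidates.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return docs[max(range(N), key=scores.__getitem__)]
```

Candidates sharing more (and rarer) query terms score higher, so the semantically closest answer text wins.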
In some embodiments, when the text response model is a generative response model, the server may input the first text message into the encoding part of the text response model, extract the text features of the first text message through the encoding part, input those text features into the decoding part of the text response model, and predict the response text message of the first text message through the decoding part, obtaining the output of the decoding part as the second text message. In this way the response text message can be customized to the first text message of the terminal, which remedies the retrieval-based response model's shortage of corpus data, improves flexibility in the man-machine conversation process, and increases the portability of the text response model.
In some embodiments, when the encoding part and the decoding part are both RNNs, an embedding vector of at least one word in the first text message is obtained through the embedding layer of the RNN in the encoding part. The embedding vector of the at least one word is input into a plurality of hidden layers of the RNN that have contextual connection relationships, subjected to weighted transformation by those hidden layers, and the text feature of the first text message is output, so that the contribution of the contextual relationships among the words of a text message to its text feature can be taken into account.
After the text feature of the first text message is obtained, it is input into the decoding part and subjected to weighted transformation by each hidden layer of the decoding part that has a contextual connection relationship, generating a word sequence composed of a plurality of words; the word sequence is acquired as the second text message, which makes it easier to obtain a second text message that better matches the semantics of the first text message.
In the above steps 203-204, the server can obtain the second text message based on the text feature of the first text message, specifically, the text feature of the first text message can be obtained through the text response model, and then the second text message is obtained, so that the second text message which is more matched with the text feature of the first text message can be obtained.
205. The server inputs the second text message into a similarity determination model, and extracts text features of the second text message through a natural language processing sub-model in the similarity determination model.
It should be noted that the similarity determination model may employ a regression model. The similarity determination model differs from a classification model in that, for the same input x, the y output by the similarity determination model is a continuous value representing the degree of fit to the input x, while the y output by a classification model is a discrete value representing the predicted probabilities that x matches the different class labels of the classification model, where x may represent a text feature or an image feature (one or more of them) and y may represent the output of the model (either a vector or a scalar). This embodiment of the present invention describes in detail how to perform message interaction using the similarity determination model; the next embodiment describes in detail how to perform message interaction using a classification model.
The similarity determination model may include an NLP (natural language processing) sub-model and an image processing sub-model; the image processing sub-model is described in detail in step 206 below.
The NLP sub-model is used to extract the text features of a text message. For example, the NLP sub-model may be BERT (Bidirectional Encoder Representations from Transformers), ELMo (Embeddings from Language Models), GPT (Generative Pre-trained Transformer), or the like.
In the following, BERT is taken as an example of the NLP sub-model. BERT adopts a masked language model (Masked LM) method during pre-training: on the basis of bidirectionally encoding the text message, the word to be predicted is randomly replaced with a mask, where the mask may be a preset special token, for example [MASK]. Compared with a conventional BLSTM model, this solves the problem in BLSTM that the word to be predicted can "see itself" during bidirectional encoding, which on the one hand increases the model's ability to express and generalize the semantics of text messages, and on the other hand also improves the efficiency of the pre-training process.
Further, since BERT is essentially a Transformer with an optimized pre-training process, its structure is similar to that of a Transformer and includes an encoding part and a decoding part, both of which may take the form of RNNs. Optionally, a first self-attention layer is introduced in the encoding part to acquire, based on an attention mechanism, the focal features of the encoding part itself; in the decoding part, a second self-attention layer is introduced to acquire the focal features of the decoding part itself, together with a multi-head attention layer that acquires, based on an attention mechanism, the focal features among the output features of the encoding part, the output features of the decoding part, and the position features of the second text message.
Based on the above, step 205 is: the server inputs the second text message into the BERT model in the similarity determination model and encodes at least one word of the second text message through the encoding part of the BERT model; during encoding, based on the first self-attention layer, each word is co-encoded with the other words according to the contextual information among them, yielding a feature vector for each word. The feature vectors of the at least one word are then input into the decoding part of the BERT model and decoded there; during decoding, based on the second self-attention layer, the feature vector of each word is co-decoded with the feature vectors of the other words according to the contextual information among them. The target focal features among the output features of the encoding part, the output features of the decoding part, and the position features of the second text message are then obtained through the multi-head attention layer and subjected to exponential normalization (softmax) processing, yielding the text features of the second text message. Text features extracted based on BERT in this way can represent the semantics of the second text message accurately.
In some embodiments, co-encoding any two words is the process of weighted transformation performed by the same RNN neuron into which the two words are jointly input, and co-decoding the feature vectors of any two words is the process of weighted transformation performed by the same RNN neuron into which the two feature vectors are jointly input.
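The "co-encoding" of each word with the others can be illustrated with minimal scaled dot-product self-attention: every word vector is re-expressed as a weighted sum of all word vectors, with weights derived from pairwise similarity. This sketch assumes a single head and no learned projection matrices, so it illustrates the mechanism rather than BERT itself.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(vectors):
    """Each output vector is an attention-weighted mix of all inputs."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Scaled dot-product scores of this word against every word.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vectors))
                    for j in range(d)])
    return out
```

With orthogonal inputs each word attends most strongly to itself, but every output still carries information from the whole sequence, which is the contextual "contact information" described above.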
206. The server extracts image features of the at least one image in the image database through the image processing sub-model in the similarity determination model.
The image database corresponds to the similarity determination model and may include at least one image. The image database may be stored locally or downloaded from the cloud. The at least one image may include expression images, non-expression images, and the like; an expression image is an image used to express an idea during a man-machine conversation and may carry text, and expression images may further be divided into portrait expressions, animal expressions, cartoon expressions, and the like.
Alternatively, the number of image databases may be one or more. When there are several image databases, each image database may store at least one image of a given type; for example, one image database stores portrait expressions and another stores animal expressions. The storage content of each image database is not specifically limited in the embodiment of the present invention.
The image processing sub-model is used to extract the image features of an image. For example, the image processing sub-model may be a VGG (Visual Geometry Group) network, a TCN (temporal convolutional network), a CNN (convolutional neural network), or the like.
In some embodiments, when the image processing sub-model is a CNN, the CNN may include an input layer for decoding the input image, at least one convolution layer for convolving the decoded image, and an output layer for applying non-linearity and normalization to the convolved image. In some embodiments, at least one pooling layer may further be introduced between the convolution layers; the pooling layer compresses the feature map output by the previous convolution layer, thereby reducing the size of the feature map.
In some embodiments, residual connections may be employed among the at least one convolution layer, that is: for each convolution layer, a feature map output by an earlier convolution layer may be superimposed on the corresponding feature map output by the current convolution layer to obtain a residual block, which is used as one of the feature maps input into the next convolution layer. This alleviates the degradation problem of deep networks; for example, a residual connection may be made once every two convolution layers.
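The residual superposition itself is just an element-wise addition of two feature maps, as the small sketch below shows; it assumes the two maps have matching shapes, as they do in same-padded convolution blocks.

```python
def residual_block(earlier_feature_map, current_output):
    """Element-wise sum of an earlier feature map and the current output.

    The result is the residual block fed to the next convolution layer.
    Feature maps are represented as 2D lists of equal shape (an assumption
    for illustration; real maps also have a channel dimension).
    """
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(earlier_feature_map, current_output)]
```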
In the following, the image processing sub-model is taken to be a VGG network for illustration. The VGG network includes a plurality of convolution layers and a plurality of pooling layers; each convolution layer uses small 3×3 convolution kernels, each pooling layer uses 2×2 max-pooling kernels, and residual connections are adopted among the convolution layers. As the VGG network deepens, the spatial size of the image is halved and its depth doubled after each pooling, which simplifies the structure of the network and facilitates the extraction of high-level image features. For example, the VGG network may be VGG-16 or the like; the depth of the VGG network is not specifically limited in the embodiments of the present invention.
Based on the above example, step 206 is: the server inputs any image in the image database into a VGG network in the similarity determination model, convolves the image through each convolution layer of the VGG network to obtain the image characteristics of the image, and repeatedly executes the process until the image characteristics of at least one image in the image database are obtained.
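A toy single-channel version of the VGG-style pipeline above can be sketched as a 3×3 convolution followed by 2×2 max pooling, which halves the spatial size. Real VGG stacks many such layers with learned kernels; the averaging kernel here is an arbitrary assumption for illustration.

```python
def conv3x3(img, kernel):
    """Valid 3x3 convolution over a 2D single-channel image."""
    h, w = len(img), len(img[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for i in range(h - 2):
        for j in range(w - 2):
            out[i][j] = sum(img[i + di][j + dj] * kernel[di][dj]
                            for di in range(3) for dj in range(3))
    return out

def maxpool2x2(img):
    """2x2 max pooling with stride 2, halving each spatial dimension."""
    return [[max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
             for j in range(0, len(img[0]) - 1, 2)]
            for i in range(0, len(img) - 1, 2)]
```

Stacking several conv-pool stages and flattening the final map yields the image feature vector compared against text features in step 207.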
207. And the server acquires the semantic similarity between the second text message and the at least one image according to the text characteristics of the second text message and the image characteristics of the at least one image.
Wherein one semantic similarity is used to represent the degree of semantic similarity between the second text message and one image.
Since the similarity determination model is adopted in the embodiment of the present invention, the NLP sub-model and the image processing sub-model both output continuous feature vectors. When acquiring the semantic similarity, the cosine distance between the text feature of the second text message and the image feature of any image can be obtained and used as the semantic similarity between the second text message and that image; for each image in the at least one image, the above step is repeated until at least one semantic similarity corresponding to the at least one image is obtained. The cosine distance measures the cosine of the angle between the text feature and the image feature and thus expresses the correlation between them, which better expresses the semantic similarity between the second text message and any image: the higher the cosine distance, the higher the semantic similarity.
In some embodiments, the server may also acquire the reciprocal of the Euclidean distance between the text feature of the second text message and the image feature of any image as the semantic similarity. Since the Euclidean distance measures the absolute distance between the text feature and the image feature in Euclidean space, the smaller the Euclidean distance, the smaller the absolute difference between the two features and the larger the reciprocal of the Euclidean distance, hence the higher the semantic similarity; this likewise expresses well the semantic similarity between the second text message and any image.
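The two similarity measures described above can be sketched directly; the small epsilon in the reciprocal-Euclidean variant is an assumption added to avoid division by zero for identical vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def inverse_euclidean(a, b, eps=1e-9):
    """Reciprocal of the Euclidean distance (eps is an assumed guard)."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (dist + eps)
```

Either function maps a (text feature, image feature) pair to a scalar semantic similarity, with larger values meaning closer semantics.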
Fig. 4 is a schematic structural diagram of a similarity determination model provided in an embodiment of the present invention. Referring to fig. 4, in steps 205-207 above, the server inputs the second text message into the similarity determination model and obtains, through the similarity determination model, the semantic similarity between the second text message and at least one image in the image database. This may also be regarded as the following process: each time, the server inputs the second text message S2 together with one image P1 from the image database into the similarity determination model, which outputs the semantic similarity between S2 and P1; the next time, the second text message S2 is input again, but the other input is replaced by another image P2 from the image database, and the model outputs the semantic similarity between S2 and P2, and so on. The process is repeated until the semantic similarities between the second text message S2 and the at least one image {P1, P2 … Pn} are obtained, and the following step 208 is performed, where n is any value greater than or equal to 1.
208. The server ranks the at least one image in descending order of semantic similarity and determines an image meeting the first target condition as the first response image.
In some embodiments, the first response image may be: among the images whose semantic similarity to the second text message is greater than the target threshold, the image with the highest semantic similarity. Optionally, the first response image may also be: among the images whose semantic similarity to the second text message is greater than the target threshold, any image whose semantic similarity ranks within the first target number. The target threshold may be any value greater than or equal to 0, and the first target number may be any value greater than or equal to 1.
In step 208, after the server ranks the at least one image in the image database in descending order of semantic similarity, when there is an image whose semantic similarity is greater than the target threshold, then because the second text message is the response text of the first text message, the server may directly determine the image with the highest semantic similarity to the second text message as the first response image. The first response image can thus replace the second text message and be sent to the terminal as the response image of the first text message, which improves the accuracy of the machine response.
In some embodiments, after the ranking, when there are images whose semantic similarity is greater than the target threshold, the server may determine any one of the first target number of top-ranked images as the first response image; that is, the server considers that any image whose semantic similarity is greater than the target threshold may serve as the first response image for the first text message. This avoids the server always replying with the same response image for the same chat text message, thereby increasing the diversity of the man-machine conversation process.
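The selection rule in step 208, covering both variants above, can be sketched as: sort by semantic similarity, keep the images above the target threshold, then pick the best one or a random one of the top-k. Function and parameter names are illustrative assumptions.

```python
import random

def pick_response_image(similarities, threshold=0.5, top_k=1):
    """similarities: list of (image_id, semantic_similarity) pairs.

    Returns the chosen image id, or None when no image exceeds the
    threshold (the server then falls back to sending the second text
    message itself, as described below).
    """
    ranked = sorted(similarities, key=lambda p: p[1], reverse=True)
    above = [p for p in ranked if p[1] > threshold]
    if not above:
        return None
    # top_k == 1 reproduces the "highest similarity" variant; top_k > 1
    # samples among the top candidates for more varied replies.
    return random.choice(above[:top_k])[0]
```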
In the above steps 205-208, the server inputs the second text message into the similarity determination model, and obtains the first answer image from at least one image in the image database corresponding to the similarity determination model through the similarity determination model, that is, the server may obtain the first answer image according to the second text message, and since the semantic similarity between the first answer image and the second text message meets the first target condition, the server may replace the second text message with the first answer image as the answer message of the first text message, and perform step 209 described below.
209. The server transmits the first response image to the terminal.
In the above process, the server may reply a first response image to the terminal for the first text message sent by any terminal, so as to increase the interestingness of the man-machine conversation process.
Fig. 5 is a schematic flow chart of a message interaction method provided by the embodiment of the present invention. Referring to fig. 5, by inputting a text message, a user may trigger the server to obtain a response text message through a text reply system (e.g., the text response model provided by the embodiment of the present invention) in retrieval mode or generation mode, and then determine, through the similarity determination model, an image whose similarity meets a certain condition from the image database to reply with as the image response message, thereby implementing a text-to-image message interaction mode. It should be noted that, in some embodiments, after the server ranks the at least one image in descending order of semantic similarity in step 208 above, if there is no image whose semantic similarity is greater than the target threshold, the server may skip steps 208-209 and directly send the second text message to the terminal, which avoids replies that are irrelevant to the question because of low correlation between the first response image and the first text message.
210. And when the terminal receives the first response image, displaying the first response image in the dialogue interface.
The dialog interface may be a user interface (UI) used to present the dialog messages.
When the terminal implements the man-machine conversation based on an application client, an entry to the dialog interface may be provided on the application client in the form of a function option; the dialog interface is displayed when the user's click on the function option is detected, and the contextual dialog messages between the user and the machine are displayed in the dialog interface.
Alternatively, the user may have exited the dialogue interface after sending the first text message. In that case, when the terminal receives the first response image, a new-message prompt float may be displayed; when a click on the float is detected, the terminal re-enters the dialogue interface and displays the first response image there. This is typically how the scenario plays out when the user converses with the customer-assistance bot of a shopping client.
Alternatively, when the terminal implements the man-machine conversation through internal processing logic, the dialogue interface may be displayed directly on the terminal, and the first response image is displayed in it, presenting the man-machine conversation in real time. This is typically the case when the user talks to the terminal's intelligent assistant.
Fig. 6 is a schematic diagram of a dialogue interface provided by an embodiment of the invention. Referring to fig. 6, after the user inputs a first text message 610 in the dialogue interface 600 and taps send, the server obtains a first response image 620 through the process shown in fig. 5 and replies with it; the terminal then displays the first response image 620 in the dialogue interface.
Steps 201-210 above take a single text message sent by the terminal as an example, to which the server replies with a response image. When the terminal sends another text message, the method may jump back to step 202 and repeat the message interaction method of the embodiment of the present invention, which is not described again here.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
According to the method provided by the embodiment of the invention, upon receiving a first text message from the terminal, a second text message is acquired based on the text features of the first text message, the second text message being the response text message of the first text message. A first response image whose semantic similarity with the second text message meets the first target condition can then be screened out from the at least one image in the image database according to the second text message, and the first response image, rather than the second text message, is sent to the terminal as the response message. The server can thus reply with a first response image to the first text message sent by the terminal, which makes the man-machine conversation more engaging, greatly improves its intelligence, and enriches the ways in which man-machine conversation can be implemented.
Further, acquiring the second text message through the retrieval-based response model allows the second text message to be obtained quickly, shortening the acquisition time, and also guarantees its grammatical correctness.
Further, acquiring the second text message through the generative response model allows a response text message to be generated specifically for the first text message of the terminal, compensating for the retrieval-based response model's dependence on corpus data, improving the flexibility of the man-machine conversation, and improving the portability of the text response model.
Further, extracting the text features of the second text message through the BERT model in the similarity determination model increases the model's capacity to represent and generalize over the semantics of the text message, yielding more accurate text features. Further, extracting the image features of any image in the image database through the VGG network in the similarity determination model simplifies the network structure and makes it convenient to extract high-level image features.
Further, the cosine distance may be used as the semantic similarity: it measures the angle between the text feature and the image feature, thereby characterizing their correlation. Alternatively, the reciprocal of the Euclidean distance may be used as the semantic similarity: the Euclidean distance measures the absolute distance between the text feature and the image feature in Euclidean space, thereby characterizing their absolute difference, and its reciprocal better expresses the semantic similarity between the second text message and the image.
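A minimal NumPy illustration of the two similarity measures named above; the function names and the small epsilon guarding against division by zero are this sketch's own additions.

```python
import numpy as np

def cosine_similarity(text_feat, image_feat):
    # Cosine of the angle between the two feature vectors: close to 1
    # when the features point in the same direction.
    return float(np.dot(text_feat, image_feat) /
                 (np.linalg.norm(text_feat) * np.linalg.norm(image_feat)))

def inverse_euclidean(text_feat, image_feat, eps=1e-8):
    # Reciprocal of the absolute distance in Euclidean space: larger
    # when the two features are closer together.
    return float(1.0 / (np.linalg.norm(text_feat - image_feat) + eps))
```

Both functions assume the text feature and the image feature have already been mapped into vectors of the same dimension, as the similarity determination model does.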
Further, when an image with semantic similarity greater than the target threshold exists, the image with the highest semantic similarity is directly determined as the first response image, so that the correlation between the first response image and the first text message is highest, improving the accuracy of the machine response.
Further, when images with semantic similarity greater than the target threshold exist, one or more of the images whose semantic similarity ranks within the top first-target-number may be determined as the first response image, avoiding the situation where the server always replies with the same response image to the same chat text message, and increasing the diversity of the man-machine conversation.
Further, if no image with semantic similarity greater than the target threshold exists, the server may directly send the second text message to the terminal, avoiding an irrelevant or off-topic reply caused by too low a correlation between the first response image and the first text message.
In the above embodiment, the server obtains, from the first text message sent by the terminal, a second text message for answering it, and by inputting the second text message into the similarity determination model, selects from the image database a first response image whose semantic similarity meets the first target condition. The server can thus reply with a first response image to the first text message sent by the terminal, making the man-machine conversation more engaging, greatly improving its intelligence, and enriching the ways in which man-machine conversation can be implemented.
In the above embodiments, a response message is first acquired based on the text message, and the image actually used for the reply is then determined based on that response message, so that man-machine interaction is performed with an image as the answer. In some embodiments, the server may instead acquire the response image directly from the text message; in this process a classification model may be applied to implement the message interaction, which is described below based on the embodiment shown in fig. 7. Fig. 7 is an interaction flow chart of a message interaction method according to an embodiment of the present invention. Referring to fig. 7, the computer device is taken to be a server as an example, and this embodiment includes:
701. the terminal sends a first text message to the server.
Step 701 is similar to step 201, and will not be described again.
702. When the server receives a first text message of the terminal, an intention label corresponding to the first text message is determined based on application scene information of the first text message.
Step 702 is similar to step 202, and will not be described again.
703. When the intention label is a target label, the server inputs the first text message into a first classification model, classifies the response message of the first text message through the first classification model, and outputs the type of the response message.
The type of the response message may be text or image, and the embodiment of the present invention does not limit specific types of the response message.
In some embodiments, the first classification model is used to determine whether the type of the response message is text or image. The first classification model may be any classification model, for example a CNN, a VGG, or a logistic regression model; the embodiment of the present invention does not specifically limit the structure of the first classification model.
In the above process, after the server inputs the first text message into the first classification model, the first classification model obtains a first prediction probability that the response message of the first text message is an image, and a second prediction probability that it is text. When the first prediction probability is greater than or equal to the second, the type of the response message is output as image; when it is less, the type is output as text. In this way, the type of the response message can be determined in step 703 based on the text features of the first text message; that is, the server can predict, through the first classification model, whether the response expected for the first text message is text or an image.
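The decision rule of step 703 can be sketched as below; `classify` is a hypothetical stand-in for the first classification model, assumed to return the two prediction probabilities.

```python
def decide_reply_type(first_text, classify):
    # classify() returns (first prediction probability: reply is an image,
    #                     second prediction probability: reply is text).
    p_image, p_text = classify(first_text)
    # Ties go to "image", per the "greater than or equal" rule above.
    return "image" if p_image >= p_text else "text"
```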
704. When the type of the response message is an image, the server inputs the first text message into a second classification model, and extracts text features of the first text message through a natural language processing sub-model in the second classification model.
In the previous embodiment, the similarity determination model is used to perform the message interaction; in this embodiment, however, the type of the response message is determined by the first classification model, and when the type is an image, a second classification model is used to determine the second response image. The second classification model may include an NLP sub-model and an image processing sub-model.
Optionally, the NLP submodel is used to extract text features of the text message, for example, the NLP submodel may be BERT, ELMo, GPT, etc., and the embodiment of the present invention does not specifically limit the structure of the NLP submodel.
The process of extracting the text features of the first text message based on the NLP sub-model in step 704 is similar to the process of extracting the text features of the second text message based on the NLP sub-model in step 205, except that the model parameters of the NLP sub-model may differ and the input of the NLP sub-model (the second text message in step 205) is replaced with the first text message; details are not repeated here.
705. The server extracts image features of at least one image in the image database through the image processing sub-model in the second classification model.
Optionally, the image processing sub-model is used to extract image features of an image, for example, the image processing sub-model may be VGG, TCN, CNN, or the like, and the embodiment of the present invention does not specifically limit the structure of the image processing sub-model.
The process of extracting the image features of the at least one image based on the image processing sub-model in step 705 is similar to the process of extracting the image features of the at least one image based on the image processing sub-model in step 206, but the parameters of the image processing sub-model may be different, and will not be described here.
706. The server obtains the matching degree between the first text message and at least one image according to the text characteristics of the first text message and the image characteristics of the at least one image.
Wherein one degree of matching is used to represent the degree of contextual relevance between the first text message and one image.
In the above process, for any image in the image database, the server may obtain a context correlation coefficient between the text features of the first text message and the image features of the image, and determine this coefficient as the matching degree between the first text message and the image. A matching degree thus reflects the degree of contextual relevance between an image and the first text message: the larger the matching degree, the more likely the corresponding image is the context of the first text message, and hence the more likely it is a response image for it.
Fig. 8 is a schematic structural diagram of the second classification model provided in an embodiment of the present invention. Referring to fig. 8, in steps 704-706 above, when the type of the response message is an image, the process in which the server obtains, through the second classification model, the matching degree between the first text message and at least one image in the image database may also be regarded as follows: each time, the server inputs the first text message S1 and one image P1 from the image database into the second classification model, which outputs the matching degree between S1 and P1; the next time, S1 is input again but the other input is replaced with another image P2 from the image database, and the model outputs the matching degree between S1 and P2; and so on, the process is repeated until the matching degrees between S1 and the at least one image {P1, P2 ... Pn} are all obtained, after which step 707 below is executed, where n is any value greater than or equal to 1. It should be noted that the input of the second classification model may also be an image message sent by the terminal, in which case a matching degree between the image message and at least one image in the image database can be obtained; this is described in detail in the next embodiment.
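The pairwise scoring loop of fig. 8 can be sketched as follows, with `match_model` a hypothetical stand-in for one forward pass of the second classification model:

```python
def score_all_images(first_text, image_db, match_model):
    # One forward pass per (text, image) pair: the same first text
    # message S1 is paired with each image P1..Pn in turn, yielding
    # one (matching degree, image) pair per candidate.
    return [(match_model(first_text, img), img) for img in image_db]
```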
707. The server ranks the at least one image in descending order of matching degree, and determines an image meeting the second target condition as the second response image.
In some embodiments, the second response image may be the image with the highest matching degree with the first text message. Optionally, the second response image may also be any image whose matching degree with the first text message ranks within the top second-target-number of images, where the second target number may be any number greater than or equal to 1.
In step 707, after the server ranks the at least one image in the image database in descending order of matching degree, since the matching degree represents the degree of contextual association between the first text message and an image, the server may take the image with the highest matching degree, which is the most closely associated with the context of the first text message, and determine it as the second response image, thereby improving the accuracy of the machine response.
In some embodiments, after the ranking, the server may instead determine one or more of the images whose matching degree ranks within the top second-target-number as the second response image, avoiding the situation where the server always replies with the same response image to the same chat text message, and increasing the diversity of the man-machine conversation.
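The ranking of step 707 together with the top-ranked random pick described above can be sketched as follows. The input is assumed to be the list of (matching degree, image) pairs produced by the pairwise scoring step, and the function name is illustrative.

```python
import random

def pick_response_image(scores, target_number=3, rng=random):
    # Rank (matching_degree, image) pairs in descending order of
    # matching degree.
    ranked = sorted(scores, key=lambda pair: pair[0], reverse=True)
    # Pick one of the top target_number at random, so repeated
    # identical messages do not always get the same response image.
    return rng.choice(ranked[:target_number])[1]
```

Setting `target_number=1` recovers the deterministic highest-match variant.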
In the above process, the server ranks the at least one image in descending order of matching degree and determines an image meeting the second target condition as the second response image. The server can thus reply with a second response image to the first text message sent by the terminal, making the man-machine conversation more engaging, greatly improving its intelligence, and enriching the ways in which man-machine conversation can be implemented.
In steps 704-707 above, when the type of the response message output by the first classification model is an image, the server inputs the first text message into the second classification model, through which a second response image is acquired from at least one image in the image database corresponding to the second classification model; such a second classification model may be intuitively referred to as a QA (question-answer) system.
In the above process, the server acquires a second response image from the first text message, where the matching degree between the second response image and the first text message meets the second target condition. That is, the server can predict, inside the second classification model, the probability that each image in the image database matches the first text message, obtain the second response image, and send it to the terminal in step 708, making the man-machine conversation more engaging, greatly improving its intelligence, and enriching the ways in which man-machine conversation can be implemented.
708. The server transmits the second response image to the terminal.
Step 708 is similar to step 209 and will not be described here.
Fig. 9 is a schematic flow chart of a message interaction method provided by an embodiment of the present invention. Referring to fig. 9, by inputting a text message, the user may trigger the server to determine the current response type through a response type classifier (e.g., the first classification model provided by the embodiment of the present invention); if the response type is image, an image whose matching degree meets a certain condition may be determined from the image database based on a question-answer system (e.g., the second classification model provided by the embodiment of the present invention) and replied as the image response message, implementing a text-image message interaction mode. It should be noted that, in some embodiments, when the type of the response message output by the first classification model is text, a process similar to steps 203-204 of the previous embodiment may be executed: the first text message is input into the text response model, its text features are obtained through the model, and based on those features a second text message, i.e., the response text of the first text message, is output and sent to the terminal. This perfects the server's processing logic in the message interaction process and forms a closed logic loop that replies to the terminal in every case; the detailed implementation of the process is described in steps 203-204.
709. When the terminal receives the second response image, the second response image is displayed in the dialogue interface.
Step 709 is similar to step 210 and is not described again here; for a schematic diagram of the dialogue interface, refer to fig. 6.
Steps 701-709 above take a single text message sent by the terminal as an example, to which the server replies with a response image. When the terminal sends another text message, the method may jump back to step 702 and repeat the message interaction method of the embodiment of the present invention, which is not described again here.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
According to the method provided by the embodiment of the invention, upon receiving a first text message from the terminal, the type of the response message is determined based on the text features of the first text message, so that whether the response will be text or an image can be decided before the response message is generated. When the type is image, a second response image whose matching degree with the first text message meets the second target condition is acquired from the first text message and sent to the terminal. The second response image can thus be acquired directly from the first text message and replied to the terminal, making the man-machine conversation more engaging, greatly improving its intelligence, and enriching the ways in which man-machine conversation can be implemented.
Further, extracting the text features of the first text message through the BERT model in the second classification model increases the model's capacity to represent and generalize over the semantics of the text message, yielding more accurate text features. Further, extracting the image features of any image in the image database through the VGG network in the second classification model simplifies the network structure and makes it convenient to extract high-level image features.
Further, the context correlation coefficient is used as the matching degree; it measures the degree of contextual association between the text features and the image features, thereby characterizing the probability that each image is a response image for the first text message.
Further, directly determining the image with the highest matching degree as the second response image improves the accuracy of the machine response. Optionally, one or more of the images whose matching degree ranks within the top second-target-number may be determined as the second response image, avoiding the situation where the server always replies with the same response image to the same chat text message, and increasing the diversity of the man-machine conversation.
Further, if the type of the response message output by the first classification model is text, the server can instead input the first text message into the text response model, acquire the second text message through it, and send the second text message to the terminal. This perfects the server's processing logic in the message interaction process and forms a closed logic loop that replies to the terminal in every case.
The two embodiments above provide two different approaches to acquiring a response image when the terminal sends a first text message. One acquires a second text message from the first text message and then screens from the image database a response image whose semantic similarity meets the first target condition; the other first determines the type of the response message and, if it is an image, inputs the first text message into the second classification model and screens from the image database a response image whose matching degree meets the second target condition. Both make the man-machine conversation more engaging.
The above embodiments are built on interaction between text and images. In one implementation, interaction between images may also be performed, which is described with the embodiment shown in fig. 10 below. Fig. 10 is an interaction flow chart of a message interaction method according to an embodiment of the present invention. Referring to fig. 10, the computer device is taken to be a server as an example, and this embodiment includes:
1001. The terminal transmits an image message to the server.
The image message may be a chat-type image message or a query-type image message. A chat-type image message may further include expression images and non-expression images, and an expression image may be a character expression, an animal expression, or a cartoon expression.
Step 1001 described above is similar to step 701 described above, except that the first text message is replaced with an image message.
1002. When the server receives an image message of the terminal, an intention tag corresponding to the image message is determined based on application scene information of the image message.
Step 1002 is similar to step 702, and will not be described again.
1003. When the intention label is a target label, the server inputs the image message into a second classification model, and extracts the image characteristics of the image message through a natural language processing sub-model in the second classification model.
The second classification model may be the same classification model as in the previous embodiment, and its specific structure is not repeated here. During training of the second classification model, the NLP sub-model's ability to extract features from any message sent by the terminal (whether a text message or an image message), and the image processing sub-model's ability to extract features from any image in the image database, may both be trained. Therefore, during prediction, an image message sent by the terminal can be input directly into the second classification model, its image features extracted by the NLP sub-model, and step 1004 below executed.
Step 1003 is similar to step 704, and will not be described here.
1004. The server extracts image features of at least one image in the image database through the image processing sub-model in the second classification model.
Step 1004 is similar to step 705 and is not described here.
1005. And the server acquires the matching degree between the image message and the at least one image according to the image characteristics of the image message and the image characteristics of the at least one image.
In the above steps 1003-1005, the server inputs the image message into a second classification model, and obtains a matching degree between the image message and at least one image in the image database through the second classification model, wherein one matching degree is used to represent a context association degree between the image message and any image in the image database.
Step 1005 is similar to step 706, and will not be described here.
1006. The server ranks the at least one image in descending order of matching degree, and determines an image meeting the third target condition as the third response image.
In some embodiments, the third response image may be the image with the highest matching degree with the image message. Optionally, the third response image may also be any image whose matching degree with the image message ranks within the top third-target-number of images, where the third target number may be any value greater than or equal to 1.
In some embodiments, after the ranking, the server may instead determine one or more of the images whose matching degree ranks within the top third-target-number as the third response image, avoiding the situation where the server always replies with the same response image to the same chat-type image message, and increasing the diversity of the man-machine conversation.
In steps 1003-1006 above, the server inputs the image message into the second classification model and obtains, through it, a third response image from at least one image in the image database corresponding to the second classification model, so that a third response image whose matching degree meets the third target condition can be selected, improving the accuracy of the machine response.
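Steps 1003-1006 can be sketched as follows, again with `match_model` a hypothetical stand-in for the second classification model scoring an image message against a candidate image:

```python
def reply_to_image(image_msg, image_db, match_model, target_number=1):
    # Score the incoming image message against every candidate image
    # in the database, rank in descending order of matching degree,
    # and return the top target_number matches as response candidates.
    ranked = sorted(image_db, key=lambda img: match_model(image_msg, img),
                    reverse=True)
    return ranked[:target_number]
```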
1007. The server transmits the third response image to the terminal.
Step 1007 is similar to step 708 and will not be described here.
Fig. 11 is a schematic flow chart of a message interaction method provided in an embodiment of the present invention. Referring to fig. 11, the server inputs the image message into the question-answer system (e.g., the second classification model in the embodiment of the present invention) and obtains a third response image; because the matching degree between the third response image and the image message meets the third target condition, the third response image is sent to the terminal. The server can thus reply with a third response image to any image message sent by the terminal, making the man-machine conversation more engaging.
1008. When the terminal receives the third response image, the third response image is displayed in the dialogue interface.
Step 1008 is similar to step 709, and is not described here.
Steps 1001-1008 above take a single image message sent by the terminal as an example, to which the server replies with a response image. When the terminal sends another image message, the method may jump back to step 1002 and repeat the message interaction method of the embodiment of the present invention, which is not described again here.
Fig. 12 is a schematic diagram of a dialogue interface provided by an embodiment of the invention. Referring to fig. 12, after the user inputs an image message 1210 in the dialogue interface 1200 and taps send, the server obtains a third response image 1220 through the flow shown in fig. 10 and replies with it; the terminal then displays the third response image 1220 in the dialogue interface.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
According to the method provided by the embodiment of the invention, when an image message is received from the terminal, the type of the response message need not be considered: a third response image is acquired directly from the image message and, because the matching degree between the third response image and the image message meets the third target condition, sent to the terminal. The server thus replies with a third response image to any image message, making the man-machine interaction more engaging.
Further, extracting the image features of the image message through the BERT model in the second classification model increases the model's capacity to represent and generalize over the semantics of the image message, yielding more accurate image features. Further, extracting the image features of any image in the image database through the VGG network in the second classification model simplifies the network structure and makes it convenient to extract high-level image features.
Further, the context correlation coefficient is used as the matching degree; it measures the degree of contextual association between the image features of different images, thereby characterizing the probability that each image in the image database is a response image for the image message.
Further, directly determining the image with the highest matching degree as the third response image improves the accuracy of the machine response. Optionally, one of the top third-target-number of images ranked by matching degree is determined as the third response image, which avoids the situation where the same image message always receives the same response image from the server, increasing the diversity of the man-machine conversation process.
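The selection strategy above can be sketched as follows; `top_k` stands in for the "third target number" and the random choice is one hedged way to diversify replies (names are assumptions, not from the patent):

```python
import random

def pick_response_image(ranked, top_k=3, seed=None):
    """Choose the response image from (name, score) pairs sorted in
    descending order of matching degree: with top_k=1 this returns the
    strict best match; with top_k>1 it picks randomly among the top
    candidates so repeated identical messages can get varied replies."""
    rng = random.Random(seed)
    candidates = ranked[:top_k]
    return rng.choice(candidates)[0] if candidates else None
```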
Based on the message interaction method provided in the above embodiment, when the user inputs an expression image in the dialogue interface, the terminal sends an image message carrying the expression image to the server. The server can be regarded as directly using the image message to search an image database (for example, an emoji/sticker library), screening out the most suitable expression image (i.e., one whose matching degree meets the third target condition) as the third response image, and sending the third response image to the terminal, thereby implementing the reply process of the man-machine dialogue.
The server is capable of replying with either a response text message or a response image message, but for an image message sent by the terminal, the server directly inputs it into the second classification model to acquire the third response image. The reason is that generating a response text message for an image message would require first understanding the semantic features conveyed by the image message and abstracting them into a text sequence, and then generating the response text message based on that text sequence. In a chit-chat scenario, however, the semantics of an expression image are generally simple and crude, i.e., the expression image usually carries little information, so it cannot be abstracted into a complete text sequence, which is unfavorable for generating a response text message.
On the other hand, once the terminal sends an image message and the server returns a response image message, if the user replies with another image message through the terminal, the man-machine conversation may enter a playful "emoji battle" mode. This enriches the personality (or "persona") of the conversation robot, greatly improves the interestingness of the man-machine conversation process, and increases user stickiness.
In some embodiments, the terminal may further provide an "emoji battle" function button on the dialogue interface, so that when the terminal detects a click operation on the function button, the message interaction method in the above embodiment is executed, thereby enriching the functions of the dialogue interface.
In some application scenarios, if the existing corpus of the server contains more text corpora and fewer image corpora, man-machine conversation can be implemented based on the message interaction method of the first embodiment. This improves the utilization of the text corpora: they are fully used to train a more accurate text response model, and the accurate response text compensates for the shortage of image corpora.
Conversely, if the existing corpus of the server contains fewer text corpora and more image corpora, man-machine conversation can be implemented based on the message interaction methods of the second and third embodiments. This improves the utilization of the image corpora: they are fully used to train a more accurate second classification model, and the accurate response images compensate for the shortage of text corpora.
In some embodiments, comparing the similarity determination model and the second classification model of the above embodiments side by side, although each includes an NLP sub-model and an image processing sub-model, the training data used during training differ, as described in detail below:
For the similarity determination model, the training data input at each step of the training process is a <text, image> corpus pair; the text in the pair may be denoted S_out and the image P_out. When the training data is collected, S_out and P_out are in fact answers of different types to the same question. During training, the server inputs S_out into the NLP sub-model and P_out into the image processing sub-model, outputs the semantic similarity between S_out and P_out, and repeats this iterative training until the similarity determination model is obtained, which can then be put into the message interaction method of the above embodiment for prediction.
For the second classification model, the training data input at each step of the training process may be a <text S_in, image P_out> corpus pair or an <image P_in, image P_out> corpus pair, where S_in/P_in is the preceding turn in a session (the last text, or the last image) and P_out is the image returned in reply to S_in/P_in when the training data is collected. The server iteratively trains on these corpus pairs: it inputs S_in/P_in into the NLP sub-model, inputs P_out into the image processing sub-model, and outputs the matching degree between S_in/P_in and P_out. The above process is repeated until the second classification model is obtained, which can then be put into the message interaction method of the above embodiment for prediction.
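Under the assumption that training pairs are mined from ordered dialogue logs, the construction of the <S_in/P_in, P_out> corpus pairs described above might look like this sketch (the pairing helper and turn representation are hypothetical; the patent does not fix a pairing procedure):

```python
def build_corpus_pairs(dialogue):
    """Assemble <context, response image> training pairs from an ordered
    dialogue log. Each turn is a (kind, content) tuple with kind in
    {'text', 'image'}. Whenever an image response follows some turn, the
    preceding turn (S_in or P_in) is paired with that image (P_out)."""
    pairs = []
    for prev, cur in zip(dialogue, dialogue[1:]):
        if cur[0] == "image":
            pairs.append((prev, cur[1]))
    return pairs
```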
Fig. 13 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention, referring to fig. 13, the device includes:
a receiving module 1301, configured to receive a first text message of a terminal;
a text message obtaining module 1302, configured to obtain a second text message based on the text feature of the first text message, where the second text message is a response text message of the first text message;
the image obtaining module 1303 is configured to obtain a first answer image according to the second text message, where a semantic similarity between the first answer image and the second text message meets a first target condition;
a transmitting module 1304, configured to transmit the first response image to the terminal.
The device provided by the embodiment of the invention can acquire an image message based on the text message sent by the terminal and reply with a response image, which increases the interestingness of the man-machine conversation process, greatly improves its intelligence, and enriches the implementation modes of man-machine conversation.
In one possible implementation, the image acquisition module 1303 is configured to:
and inputting the second text message into a similarity determination model, and acquiring the first response image from at least one image in an image database corresponding to the similarity determination model through the similarity determination model.
In one possible implementation, the image acquisition module 1303 includes:
a similarity obtaining sub-module, configured to obtain, through the similarity determining model, a semantic similarity between the second text message and at least one image in the image database;
and the determining sub-module is used to sort the at least one image in descending order of semantic similarity and determine the image meeting the first target condition as the first response image.
In one possible implementation, the similarity acquisition submodule is configured to:
extracting text features of the second text message through a natural language processing sub-model in the similarity determination model;
extracting image features of the at least one image in the image database through an image processing sub-model in the similarity determination model;
and acquiring the semantic similarity between the second text message and the at least one image according to the text characteristics of the second text message and the image characteristics of the at least one image.
In one possible implementation, the text message acquisition module 1302 is configured to:
inputting the first text message into a text response model, acquiring text characteristics of the first text message through the text response model, and outputting the second text message based on the text characteristics of the first text message.
In one possible implementation, the first response image is: the image with the highest semantic similarity among the images whose semantic similarity with the second text message is greater than the target threshold; or alternatively,
the first response image is: one of the top first-target-number of images, ranked by semantic similarity with the second text message, whose semantic similarity is greater than the target threshold.
In one possible implementation, the apparatus further includes:
an intention determining module for determining an intention label corresponding to the first text message based on application scene information of the first text message;
when the intention tag is a target tag, the text message acquiring module 1302 is triggered to execute the operation of acquiring the second text message based on the text feature of the first text message.
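A minimal sketch of the intention-label gating described above, assuming a hypothetical mapping from application scene information to intent labels (the actual label set and scene features are not specified in the patent):

```python
def intent_label(scene_info):
    """Map application scene information (here, an identifier of the
    interface displayed before the dialogue interface) to an intent
    label. The scene names and labels are purely illustrative."""
    casual_scenes = {"chat_home", "sticker_panel", "friend_moments"}
    return "casual_chat" if scene_info in casual_scenes else "task"

def handle_first_text(msg, scene_info, target_label="casual_chat"):
    # Only trigger acquisition of the second text message when the
    # intent label matches the target label; otherwise skip.
    if intent_label(scene_info) == target_label:
        return "acquire_second_text_message"
    return "skip"
```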
Fig. 14 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention, referring to fig. 14, the device includes:
a receiving module 1401 for receiving a first text message of a terminal;
a type determining module 1402 for determining a type of the reply message based on text characteristics of the first text message;
an image obtaining module 1403, configured to obtain, when the type of the response message is an image, a second response image according to the first text message, where a matching degree between the second response image and the first text message meets a second target condition;
A sending module 1404, configured to send the second response image to the terminal.
The device provided by the embodiment of the invention can acquire an image message based on the text message sent by the terminal and reply with a response image, which increases the interestingness of the man-machine conversation process, greatly improves its intelligence, and enriches the implementation modes of man-machine conversation.
In one possible implementation, the image acquisition module 1403 is configured to: and inputting the first text message into a second classification model, and acquiring the second response image from at least one image in an image database corresponding to the second classification model through the second classification model.
In one possible implementation, the image acquisition module 1403 includes:
a first matching degree determining submodule, configured to obtain a matching degree between the first text message and at least one image in the image database through the second classification model;
the first image determining sub-module is used to sort the at least one image in descending order of matching degree and determine the image meeting the second target condition as the second response image.
In one possible implementation, the first matching degree determining sub-module is configured to:
Extracting text features of the first text message through a natural language processing sub-model in the second classification model;
extracting image features of the at least one image in the image database through an image processing sub-model in the second classification model;
and acquiring the matching degree between the first text message and the at least one image according to the text characteristics of the first text message and the image characteristics of the at least one image.
In one possible implementation, the type determining module 1402 is configured to input the first text message into a first classification model, classify a response message of the first text message by the first classification model, and output a type of the response message.
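As a stand-in for the learned first classification model, the type decision can be illustrated with a trivial keyword heuristic; the real model would be trained on corpus pairs rather than using fixed markers (everything here is an assumption for illustration only):

```python
def classify_response_type(first_text):
    """Decide whether the reply to first_text should be an image or a
    text message. A hypothetical keyword heuristic stands in for the
    binary classifier described in the patent."""
    playful_markers = ("haha", "lol", "emoji")
    if any(m in first_text.lower() for m in playful_markers):
        return "image"
    return "text"
```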
In one possible implementation, the second response image is: the image with the highest matching degree with the first text message; or alternatively,
the second response image is: one of the top second-target-number of images ranked by matching degree with the first text message.
In one possible implementation, the image obtaining module 1403 is further configured to, when an image message is received from a terminal, obtain a third response image according to the image message, where the matching degree between the third response image and the image message meets a third target condition, and trigger the sending module 1404 to send the third response image to the terminal.
In one possible implementation, the image acquisition module 1403 is further configured to:
and inputting the image message into the second classification model, and acquiring the third response image from at least one image in an image database corresponding to the second classification model through the second classification model.
In one possible implementation, the image acquisition module 1403 further includes:
the second matching degree acquisition sub-module is used for acquiring the matching degree between the image message and at least one image in the image database through the second classification model;
and the second image determining sub-module is used to sort the at least one image in descending order of matching degree and determine the image meeting the third target condition as the third response image.
In one possible implementation, the second matching degree obtaining submodule is configured to:
extracting image features of the image message through a natural language processing sub-model in the second classification model;
extracting image features of the at least one image in the image database through an image processing sub-model in the second classification model;
and acquiring the matching degree between the image message and the at least one image according to the image characteristics of the image message and the image characteristics of the at least one image.
In one possible implementation, the third response image is: the image with the highest matching degree with the image message; or alternatively,
the third response image is: one of the top third-target-number of images ranked by matching degree with the image message.
In one possible implementation, the apparatus further includes:
an intention determining module for determining an intention label corresponding to the first text message based on application scene information of the first text message;
when the intention tag is a target tag, the type determination module 1402 is triggered to perform the operation of determining the type of the reply message based on the text characteristics of the first text message.
Fig. 15 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention, referring to fig. 15, the device includes:
a sending module 1501 for sending a first text message to a server in a session interface;
a receiving module 1502, configured to receive a first answer image from the server, where a semantic similarity between the first answer image and a second text message meets a first target condition, and the second text message is an answer text message of the first text message;
a display module 1503, configured to display the first response image on the session interface.
According to the device provided by the embodiment of the invention, after the terminal sends a text message to the server, it can receive an image message from the server and display it, which increases the interestingness of the man-machine conversation process, greatly improves its intelligence, and enriches the implementation modes of man-machine conversation.
Fig. 16 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention, referring to fig. 16, the device includes:
a sending module 1601, configured to send a first text message to a server in a session interface;
a receiving module 1602, configured to receive a second response image from the server, where a degree of matching between the second response image and the first text message meets a second target condition;
a display module 1603 for displaying the second response image on the session interface.
According to the device provided by the embodiment of the invention, after the terminal sends a text message to the server, it can receive an image message from the server and display it, which increases the interestingness of the man-machine conversation process, greatly improves its intelligence, and enriches the implementation modes of man-machine conversation.
Fig. 17 is a schematic structural diagram of a message interaction device according to an embodiment of the present invention, referring to fig. 17, the device includes:
a transmitting module 1701, configured to transmit an image message to a server in a session interface;
a receiving module 1702 configured to receive a third response image from the server, where a matching degree between the third response image and the image message meets a third target condition;
a display module 1703, configured to display the third response image on the session interface.
According to the device provided by the embodiment of the invention, after the terminal sends an image message to the server, it can receive a response image message from the server and display it, which increases the interestingness of the man-machine conversation process, greatly improves its intelligence, and enriches the implementation modes of man-machine conversation.
It should be noted that: in the message interaction device provided in the above embodiment, only the division of the above functional modules is used for illustration during message interaction, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the message interaction device and the message interaction method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the message interaction device and the message interaction method are detailed in the method embodiments, which are not repeated herein.
Fig. 18 shows a block diagram of a terminal 1800 according to an exemplary embodiment of the present invention. The terminal 1800 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1800 may also be referred to as a user device, a portable terminal, a laptop terminal, a desktop terminal, or the like.
In general, the terminal 1800 includes: a processor 1801 and a memory 1802.
Processor 1801 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1801 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 1801 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1801 may integrate a GPU (Graphics Processing Unit) responsible for rendering the content to be displayed on the display screen. In some embodiments, the processor 1801 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 1802 may include one or more computer-readable storage media, which may be non-transitory. The memory 1802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1802 is used to store at least one instruction for execution by processor 1801 to implement a terminal-side message interaction method provided by a method embodiment in the present application.
In some embodiments, the terminal 1800 may also optionally include: a peripheral interface 1803 and at least one peripheral. The processor 1801, memory 1802, and peripheral interface 1803 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 1803 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1804, a touch display screen 1805, a camera 1806, audio circuitry 1807, and a power supply 1809.
The peripheral interface 1803 may be used to connect I/O (Input/Output) related at least one peripheral device to the processor 1801 and memory 1802. In some embodiments, processor 1801, memory 1802, and peripheral interface 1803 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 1801, memory 1802, and peripheral interface 1803 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1804 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1804 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1804 converts electrical signals to electromagnetic signals for transmission, or converts received electromagnetic signals to electrical signals. Optionally, the radio frequency circuit 1804 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuitry 1804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 1804 may also include NFC (Near Field Communication ) related circuitry, which is not limited in this application.
The display 1805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1805 is a touch display, the display 1805 also has the ability to collect touch signals at or above the surface of the display 1805. The touch signal may be input as a control signal to the processor 1801 for processing. At this point, the display 1805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 1805 may be one, providing a front panel of the terminal 1800; in other embodiments, the display 1805 may be at least two, disposed on different surfaces of the terminal 1800 or in a folded configuration; in still other embodiments, the display 1805 may be a flexible display disposed on a curved surface or a folded surface of the terminal 1800. Even more, the display screen 1805 may be arranged in an irregular pattern other than rectangular, i.e., a shaped screen. The display 1805 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 1806 is used to capture images or video. Optionally, the camera assembly 1806 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused for a background-blurring function, and the main camera and the wide-angle camera can be fused for panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash; the latter is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuitry 1807 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1801 for processing, or inputting the electric signals to the radio frequency circuit 1804 for realizing voice communication. For stereo acquisition or noise reduction purposes, the microphone may be multiple, and disposed at different locations of the terminal 1800. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is then used to convert electrical signals from the processor 1801 or the radio frequency circuit 1804 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuitry 1807 may also include a headphone jack.
A power supply 1809 is used to power the various components in the terminal 1800. The power supply 1809 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1809 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 1800 also includes one or more sensors 1810. The one or more sensors 1810 include, but are not limited to: acceleration sensor 1811, gyro sensor 1812, pressure sensor 1813, optical sensor 1815, and proximity sensor 1816.
The acceleration sensor 1811 may detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with the terminal 1800. For example, the acceleration sensor 1811 may be used to detect components of gravitational acceleration on three coordinate axes. The processor 1801 may control the touch display screen 1805 to display a user interface in either a landscape view or a portrait view based on gravitational acceleration signals acquired by the acceleration sensor 1811. The acceleration sensor 1811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1812 may detect a body direction and a rotation angle of the terminal 1800, and the gyro sensor 1812 may collect a 3D motion of the user to the terminal 1800 in cooperation with the acceleration sensor 1811. The processor 1801 may implement the following functions based on the data collected by the gyro sensor 1812: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
Pressure sensor 1813 may be disposed on a side frame of terminal 1800 and/or below touch display 1805. When the pressure sensor 1813 is disposed at a side frame of the terminal 1800, a grip signal of the terminal 1800 by a user may be detected, and the processor 1801 performs a left-right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 1813. When the pressure sensor 1813 is disposed at the lower layer of the touch screen 1805, the processor 1801 controls the operability control on the UI interface according to the pressure operation of the user on the touch screen 1805. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1815 is used to collect the ambient light intensity. In one embodiment, the processor 1801 may control the display brightness of the touch display screen 1805 based on the intensity of ambient light collected by the optical sensor 1815. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1805 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1805 is turned down. In another embodiment, the processor 1801 may also dynamically adjust the shooting parameters of the camera assembly 1806 based on the intensity of ambient light collected by the optical sensor 1815.
A proximity sensor 1816, also known as a distance sensor, is typically provided on the front panel of the terminal 1800. Proximity sensor 1816 is used to collect the distance between the user and the front face of terminal 1800. In one embodiment, when the proximity sensor 1816 detects that the distance between the user and the front face of the terminal 1800 gradually decreases, the processor 1801 controls the touch display 1805 to switch from the bright screen state to the off-screen state; when the proximity sensor 1816 detects that the distance between the user and the front of the terminal 1800 gradually increases, the touch display 1805 is controlled by the processor 1801 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 18 is not limiting and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 19 is a schematic structural diagram of a computer device according to an embodiment of the present invention, where the computer device 1900 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 1901 and one or more memories 1902, where the memories 1902 store at least one instruction, and the at least one instruction is loaded and executed by the processor 1901 to implement the message interaction method provided in the foregoing method embodiments. Of course, the computer device may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, comprising at least one instruction executable by a processor in a terminal to perform the message interaction method of the above embodiments is also provided. For example, the computer-readable storage medium may be a ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk.
The foregoing description covers only preferred embodiments of the invention and is not intended to limit the invention; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (5)

1. A message interaction method, applied to a human-machine conversation scene, the method comprising:
generating a response message based on a text message or an image message transmitted by a terminal, and transmitting the response message to the terminal, the response message being either a response text message or a response image message, wherein,
when a first text message input on a dialogue interface displayed by the terminal is received, determining an intention label corresponding to the first text message based on application scene information of the first text message, wherein the application scene information is interface information displayed by the terminal before the dialogue interface is displayed; when the intention label is a target label, acquiring text features of the first text message through a retrieval-based response model, and searching at least one corpus for at least one response text corresponding to the text features based on the text features of the first text message to acquire a second text message, wherein a confidence coefficient between the first text message and each response text in the at least one response text is acquired, the response text with the highest confidence coefficient is determined as the second text message, the confidence coefficient is used for representing the response accuracy between the response text and the first text message, and the second text message is the response text message of the first text message;
inputting the second text message into a similarity determination model, and extracting text features of the second text message through a natural language processing sub-model in the similarity determination model; extracting image features of at least one image in an image database corresponding to the similarity determination model through an image processing sub-model in the similarity determination model; acquiring a semantic similarity between the second text message and the at least one image according to the text features of the second text message and the image features of the at least one image; sorting the at least one image in descending order of semantic similarity, and determining the image meeting a first target condition as a first response image, wherein the semantic similarity between the first response image and the second text message meets the first target condition; transmitting the first response image to the terminal, the first response image being an answer to the first text message; or, if no such first response image exists, sending the second text message to the terminal;
when an image message from the terminal is received, inputting the image message into a second classification model, and acquiring a third response image from at least one image in an image database corresponding to the second classification model through the second classification model, wherein the matching degree between the third response image and the image message meets a third target condition, and the second classification model is used for acquiring the matching degree between the image message and the at least one image in the image database corresponding to the second classification model; and sending the third response image to the terminal.
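The cross-modal ranking step of claim 1 — score each database image against the text features, sort in descending order of similarity, and return the top image only if it satisfies the target condition, otherwise fall back to the text reply — can be sketched as follows. The cosine-similarity scoring, the 0.5 threshold, and the precomputed feature vectors are illustrative assumptions; the patent's actual sub-models are trained NLP and image networks whose internals it does not fix.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pick_response_image(text_features, image_db, threshold=0.5):
    """Rank database images by semantic similarity to the text features.

    Returns the best-matching image id if its score clears the threshold
    (the "first target condition" stand-in), else None, signalling the
    caller to send the text message itself. image_db maps an image id to
    its precomputed feature vector; names and threshold are hypothetical.
    """
    ranked = sorted(
        ((cosine_similarity(text_features, feats), img_id)
         for img_id, feats in image_db.items()),
        reverse=True,
    )
    best_score, best_img = ranked[0] if ranked else (0.0, None)
    return best_img if best_score >= threshold else None
```

Returning `None` when nothing clears the threshold mirrors the claim's fallback of sending the second text message when no first response image exists.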
2. The method of claim 1, wherein the first response image is: the image with the highest semantic similarity among the images whose semantic similarity with the second text message is greater than a target threshold; or,
the first response image is: one of the first target number of images, ranked by semantic similarity, among the images whose semantic similarity with the second text message is greater than the target threshold.
3. A message interaction method, applied to a human-machine conversation scene, the method comprising:
generating a response message based on a text message or an image message transmitted by a terminal, and transmitting the response message to the terminal, the response message being either a response text message or a response image message, wherein,
when a first text message input on a dialogue interface displayed by the terminal is received, determining an intention label corresponding to the first text message based on application scene information of the first text message, wherein the application scene information is interface information displayed by the terminal before the dialogue interface is displayed; when the intention label is a target label, inputting the first text message into a first classification model, classifying the response message of the first text message through the first classification model, and outputting the type of the response message, wherein the first classification model is a binary classification model used for determining whether the type of the response message is text or image;
when the type of the response message is an image, inputting the first text message into a second classification model, and extracting text features of the first text message through a natural language processing sub-model in the second classification model; extracting image features of at least one image in an image database through an image processing sub-model in the second classification model; acquiring a matching degree between the first text message and the at least one image according to the text features of the first text message and the image features of the at least one image; sorting the at least one image in descending order of matching degree, and determining the image meeting a second target condition as a second response image, wherein the matching degree between the second response image and the first text message meets the second target condition; and transmitting the second response image to the terminal, the second response image being an answer to the first text message;
when the type of the response message is text, obtaining text features of the first text message through a retrieval-based response model, and searching at least one corpus for at least one response text corresponding to the text features based on the text features of the first text message to obtain a second text message, wherein a confidence coefficient between the first text message and each response text in the at least one response text is obtained, the response text with the highest confidence coefficient is determined to be the second text message, the confidence coefficient is used for representing the response accuracy between the response text and the first text message, and the second text message is the response text message of the first text message; and sending the second text message to the terminal;
when an image message from the terminal is received, inputting the image message into the second classification model, acquiring a third response image from at least one image in the image database corresponding to the second classification model through the second classification model, wherein the matching degree between the third response image and the image message meets a third target condition; and sending the third response image to the terminal.
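The routing described in claim 3 — image messages go straight to the second classification model, while text messages first pass through the binary classifier that decides whether the reply should be text or an image — can be sketched as follows. The three models are injected as callables because the patent fixes only their roles, not their implementations; all names here are hypothetical.

```python
def reply(message, is_image, classify_type, retrieve_text, match_image):
    """Route a message to a text or image response, following claim 3.

    classify_type, retrieve_text, and match_image stand in for the first
    classification model, the retrieval-based response model, and the
    second classification model, respectively.
    """
    if is_image:
        # Image messages always receive an image reply via the second model.
        return match_image(message)
    if classify_type(message) == "image":
        # The binary classifier decided an image answers this text best.
        return match_image(message)
    # Otherwise retrieve the highest-confidence text response.
    return retrieve_text(message)
```

A usage sketch: `reply("hello", False, clf, retr, match)` would call `clf` first and then either `match` or `retr`, whereas `reply(jpeg_bytes, True, clf, retr, match)` bypasses the classifier entirely.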
4. A computer device comprising one or more processors and one or more memories having stored therein at least one instruction loaded and executed by the one or more processors to implement the operations performed by the message interaction method of any of claims 1 to 3.
5. A computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the message interaction method of any of claims 1 to 3.
CN201910346251.8A 2019-04-26 2019-04-26 Message interaction method, computer device and storage medium Active CN110209784B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910346251.8A CN110209784B (en) 2019-04-26 2019-04-26 Message interaction method, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN110209784A (en) 2019-09-06
CN110209784B (en) 2024-03-12

Family

ID=67786369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910346251.8A Active CN110209784B (en) 2019-04-26 2019-04-26 Message interaction method, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN110209784B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704586A (en) * 2019-09-30 2020-01-17 支付宝(杭州)信息技术有限公司 Information processing method and system
CN110852331B (en) * 2019-10-25 2023-09-08 中电科大数据研究院有限公司 Image description generation method combined with BERT model
CN111125350B (en) * 2019-12-17 2023-05-12 传神联合(北京)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111274764B (en) * 2020-01-23 2021-02-23 北京百度网讯科技有限公司 Language generation method and device, computer equipment and storage medium
CN111324713B (en) * 2020-02-18 2022-03-04 腾讯科技(深圳)有限公司 Automatic replying method and device for conversation, storage medium and computer equipment
CN115994201A (en) * 2021-10-15 2023-04-21 华为技术有限公司 Method and device for determining reply statement
CN116628152A (en) * 2023-05-10 2023-08-22 上海任意门科技有限公司 Artificial intelligent device dialogue reply method and device, artificial intelligent device and medium
CN117290468B (en) * 2023-08-25 2024-06-04 北京红棉小冰科技有限公司 Intelligent dialogue method, intelligent dialogue device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090090613A (en) * 2008-02-21 2009-08-26 주식회사 케이티 System and method for multimodal conversational mode image management
CN103425640A (en) * 2012-05-14 2013-12-04 华为技术有限公司 Multimedia questioning-answering system and method
CN107133349A (en) * 2017-05-24 2017-09-05 北京无忧创新科技有限公司 One kind dialogue robot system
CN107807734A (en) * 2017-09-27 2018-03-16 北京光年无限科技有限公司 A kind of interaction output intent and system for intelligent robot
CN109213853A (en) * 2018-08-16 2019-01-15 昆明理工大学 A kind of Chinese community's question and answer cross-module state search method based on CCA algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8447767B2 (en) * 2010-12-15 2013-05-21 Xerox Corporation System and method for multimedia information retrieval
US9628416B2 (en) * 2014-05-30 2017-04-18 Cisco Technology, Inc. Photo avatars


Also Published As

Publication number Publication date
CN110209784A (en) 2019-09-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant