CN111930918B

CN111930918B - Cross-modal bilateral personalized man-machine social interaction dialog generation method and system

Info

Publication number: CN111930918B
Application number: CN202011046353.7A
Authority: CN
Inventors: 李树涛; 李宾; 孙斌
Original assignee: Hunan University
Current assignee: Hunan Xinxin Xiangrong Intelligent Technology Co.,Ltd.
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2020-12-18
Anticipated expiration: 2040-09-29
Also published as: CN111930918A

Abstract

The invention discloses a cross-modal bilateral personalized man-machine social interaction generation method and a cross-modal bilateral personalized man-machine social interaction generation system. According to the invention, the personalized information is fused in a cross-modal manner, the personalized information of the characters of both interaction parties is considered, the personalized characteristics of both interaction parties are fully utilized on the premise of ensuring reasonable reply content, smooth grammar and continuous logic, and the reply with individuality and individuality according to different people can be generated.

Description

Cross-modal bilateral personalized man-machine social interaction dialog generation method and system

Technical Field

The invention relates to the technical field of human-computer interaction, in particular to a cross-modal bilateral personalized human-computer social interaction generation method and system.

Background

With the progress of science and technology, human-computer interaction gradually develops towards intellectualization and individuation, and the interaction between people and robots is increasingly close to the interaction between people in the real world. Traditional man-machine social interaction generation belongs to the field of natural language processing, and mainly researches that a robot can make a natural reply according to text input of a user. Different from interpersonal social contact, vision is a main sensory source for receiving external information by people, and people can make natural personalized expression according to the external information. Therefore, in order to make the robot more "humanoid", in man-machine socialization, the robot needs to have a certain perception capability, and also needs to understand the language of the user and make a natural and personalized reply. How to fuse computer vision and natural language processing, and research cross-modal fusion coding of different modal information such as texts, images and the like, so that the robot generates 'human-like' personalized expression according to the acquired user images and natural language, and becomes one of challenges of man-machine natural interaction.

Personalized man-machine interaction mainly researches how a robot generates a reply consistent with a preset personality or characteristic, namely a reply with single-side personalization consistency. In recent years, the research of human-computer interaction based on deep learning has been greatly developed and is increasingly widely used in daily life, such as widely known microsoft ice, apple Siri, etc., which all have preset personality and characteristic attributes. In the field of personalized interaction, the latest progress based on pre-training methods has advanced the latest results of a series of natural language processing tasks, and recent research attempts to solve the problem of personalized dialog generation in a data-driven manner, i.e., learning features related to characters directly from large-scale dialog data sets, and generating corresponding responses conforming to the personality of the speaker by encoding sparse personalized interaction information of the characters and then performing fine tuning on a pre-training model.

Nowadays, the social chat robot can best embody personalized interaction and can generate personalized replies rich in character characteristics. At present, a chat robot is mainly based on character interaction, and in the process of man-machine interaction, the chat robot cannot see a user and cannot acquire personalized information of the user, but personalized attribute information of the robot can be predefined, so that personalized expression related to the chat robot can be generated by embedding and representing the personalized information of the robot.

However, in the social interaction between real people, the two interacting parties can acquire the personalized information of each other, and when replying, the replying person not only concentrates on the personalized expression of the replying person, but also should reply by considering the personalized characteristics of the other party and the problems of the other party, so that the man-machine interaction neglecting the personalized information of the user can generate feeling and disgust emotion, and the user experience is reduced. Therefore, the cross-mode bilateral personalized man-machine interaction technology needs to be developed in the natural interaction of the robot.

Disclosure of Invention

The technical problems to be solved by the invention are as follows: aiming at the problems in the prior art, the invention provides a cross-modal bilateral personalized man-machine social interaction session generation method and system.

In order to solve the technical problems, the invention adopts the technical scheme that:

a cross-modal bilateral personalized man-machine social conversation generation method comprises the following steps:

1) encoding a dialog contextE _CRobot personalized information codingE _TUser-customized information codingE _SEncoding of output results at the previous timeE _prevPerforming weighted fusion to obtain weighted fusion codeO _enc；

2) Encoding the weighted fusionO _encEncoding of output results at the previous timeE _prevInputting the two-sided personalized generation model decoder together to generate an optimal N candidate reply lists; the bilateral personalized generative model is trained to pre-establish the weighted fusion code of the input and the code of the output result at the last momentE _prevAnd the mapping relation between the output optimal N candidate reply lists;

3) calculating the conditional mutual information abundance value of each candidate reply in the N candidate reply lists;

4) and selecting the candidate reply with the maximum conditional mutual information abundance value as a final output result.

Optionally, step 1) is preceded by generating a dialog context codeE _CThe steps of (1): combining personalized attribute information and conversation context history of both parties to generate personalized interaction history information, respectively carrying out word embedding coding on the personalized interaction history information and current input of a user to obtain an embedded vector, then carrying out position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain a code described by the conversation context; inputting the encoding of the dialog context description into the encoder of the bilateral personalized generative model to obtain the dialog context encodingE _C。

Optionally, step 1) is preceded by generating a robot personalized information codeE _TThe steps of (1): carrying out word embedding coding on the preset robot personalized description to obtain an embedded vector, then carrying out position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain the code of the robot personalized description; inputting the code of the robot personalized description into the encoder of the bilateral personalized generative model to obtain the robot personalized information codeE _T。

Optionally, step 1) is preceded by generating a user personalized information codeE _SThe steps of (1): carrying out face recognition on a picture of a user to extract personalized description information of an image modality, carrying out face recognition on the picture of the user to extract personalized description information of the image modality, carrying out personalized social interaction on a personalized social interaction data set of the user, wherein the personalized social interaction data set comprises labeling information of sex, age and interest labels of two interacting parties, labeling the labels of personalized information of the two interacting parties related to a speaker for each sentence, and filtering a text with inappropriate words in the personalized social interaction data set to obtain personalized description information of a text modality; combining the personalized description information of the image modality and the personalized description information of the text modality to obtain a user personalized description; performing word embedding coding on the obtained user personalized description to obtain an embedded vector, then performing position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain a code of the user personalized description; inputting the code of the user personalized description into the encoder of the bilateral personalized generative model to obtain the user personalized information codeE _S。

Optionally, the functional expression for position coding is as follows:

in the above formula, the first and second carbon atoms are,E _pos,2i()indicating that the character is at 2iThe position-embedded code that maps positionally into a sinusoidal function,E _pos,2i+1()indicating that the character is at 2i+1 position-embedded code mapped into the cosine function,poswhich indicates the position where the character is located,ithe dimension of the word embedding is represented,d _modelthe dimensions of the code are represented and,d _model=512，pos⊆[1,n]，ncontent length encoded for word embedding.

Optionally, performing weighted fusion in step 1) to obtain a weighted fusion codeO _encComprises the following steps:

s1) encoding the dialog contextE _CEncoding of output results at the previous timeE _prevCoding through a multi-head attention mechanism network to obtain multi-head self-attention coded dialog context codingO _CCoding the personalized information of the robotE _TEncoding of output results at the previous timeE _prevCoding is carried out through a multi-head attention mechanism network to obtain the robot personalized information code after multi-head self-attention codingO _TEncoding user personalized informationE _SEncoding of output results at the previous timeE _prevCoding is carried out through a multi-head attention mechanism network to obtain the user personalized information code after multi-head self-attention codingO _SEncoding the output result of the previous timeE _prevCoding is carried out through a multi-head attention mechanism network with a mask to obtain a code of an output result at the last moment after multi-head self-attention codingO _prev；

S2) obtaining the weighted fusion code according to the following formulaO _enc；

In the above formula, the first and second carbon atoms are,αindicating the probability of the user-customized information appearing in the reply,βindicating the probability of the robot-personalised information appearing in the reply,γindicating the probability that no personalized information is present in the reply,O _Sto pay attention to the channel by multiple headsThe user personalized information code after the force code,O _Tfor coding the personalized information of the robot after multi-head self-attention coding,O _Cfor multi-headed self-attention coded dialog context coding,O _prevthe method is used for encoding the output result of the last moment after multi-head self-attention encoding.

Optionally, the bilateral personalized generation model in step 2) is a GPT network model, and step 2) further includes a step of training the GPT network model, and a joint loss function adopted when the GPT network model is trained is as follows:

in the above formula, the first and second carbon atoms are,L(φ,θ) A joint loss function is represented as a function of,L _D(φ) A loss function of the decoding model is represented,L _LM(φ) A loss function of the language model is represented,L _p(θ) Representing a loss function of the personalized predictive model,λ ₁andλ ₂is a weight coefficient; wherein the decoding model loss functionL _D(φ) The expression of (a) is as follows:

in the above formula, the first and second carbon atoms are,P _φ(y _i|x ₀ ,…,x _i-1 ，O _enc) Represents the probability of inputting the weighted fusion code with the generated character into the decoder for prediction of the next character,y _iindicating generation of the second in the string for the decoderiThe number of the characters is one,x ₀~x _i-1generating prefixes in character strings for decodersiThe number of the characters is one,O _encit is shown that the weighted fusion coding,iindicating the length of the generated character string;

loss function of language modelL _LM(φ) The function of (a) is expressed as follows:

in the above formula, the first and second carbon atoms are,P _φ(y _i|x _i-k ,…,x _i-1) Represents passing through a fixed window under the fixed windowkMiddle frontk-1 character predictionx _i-k～x _i-1First, thekCharactery _iThe probability of (a) of (b) being,y _iindicating the first in a fixed windowkThe number of the characters is one,x _i-k~x _i-1representing front in a fixed windowk-1 number of characters of the character,kwhich represents the size of the contextual window and,ito represent the second in the windowiA character of each position;

personalized predictive loss functionL _p(θ) The function of (a) is expressed as follows:

in the above formula, the first and second carbon atoms are,y _ja tag representing the personalized information is provided,P _θ(y=j|O _C) Indicating the probability of predicting the presence of personalized information in reply based on contextual information,O _Crepresenting a multi-headed self-attention coded dialog context code.

Optionally, the step of step 3) comprises: marking a robot reply text, conversation history information and interactive two-party personalized information in each section of conversation in collected human-computer social conversation data in advance, and using the marked information as a training positive sample of a deep neural network model; meanwhile, in order to balance the training data, a training negative sample is constructed by adopting a negative sampling method, the cross entropy of the two classes is used as a loss function of the deep neural network model, and the deep neural network model is trained; when the conditional mutual information abundance value replied by each candidate in the N candidate reply lists needs to be calculated, the conditional mutual information abundance value replied by each candidate in the N candidate reply lists is calculated according to the following formula;

in the above formula, the first and second carbon atoms are,I(X;Y|C,S,T) The information abundance value of the condition mutual information is shown,P(Y|X,C,S,T) The probability of obtaining the reply text by the input text, the historical information, the personalized information of the opposite party and the personalized information of the own party is shown,P(Y|C,S,T) The probability of obtaining the reply text by the historical information, the personalized information of the other party and the personalized information of the own party is shown,Xwhich represents the input of the text to be entered,Ya representation of the reply text is made,Cthe history information is represented by a history information,Spersonalized information representing the other party is provided,Trepresenting own personalized information.

In addition, the present invention also provides a cross-modal bilateral personalized human-computer social conversation generating system, including a computer device, the computer device at least includes a microprocessor and a memory connected with each other, the microprocessor is programmed or configured to execute the steps of the cross-modal bilateral personalized human-computer social conversation generating method, or the memory stores a computer program programmed or configured to execute the cross-modal bilateral personalized human-computer social conversation generating method.

Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to perform the cross-modal bilateral personalized human-machine social conversation generation method.

Compared with the prior art, the invention has the following advantages: the invention includes encoding a dialog contextE _CRobot personalized information codingE _TUser-customized information codingE _SLast one timeEncoding of output results of scalesE _prevPerforming weighted fusion to obtain weighted fusion codeO _enc(ii) a Encoding the weighted fusionO _encEncoding of output results at the previous timeE _prevInputting the two-sided personalized generation model decoder together to generate an optimal N candidate reply lists; calculating the conditional mutual information abundance value of each candidate reply in the N candidate reply lists; and selecting the candidate reply with the maximum conditional mutual information abundance value as a final output result. According to the invention, the personalized information is fused in a cross-modal manner, the personalized information of the characters of both interaction parties is considered, the personalized characteristics of both interaction parties are fully utilized on the premise of ensuring reasonable reply content, smooth grammar and continuous logic, and the reply with individuality and individuality according to different people can be generated. The invention constructs the cross-modal personalized interaction model by taking the generation of the robot natural language reply which is rich in personality and varies from person to person as a target in the process of human-computer interaction, adopts the pre-training technology, and can obviously improve the chat quality and the social experience of human-computer interaction compared with the traditional method.

Drawings

FIG. 1 is a core flow diagram of a method according to an embodiment of the present invention.

FIG. 2 is a schematic view of a complete flow of the method according to the embodiment of the present invention.

Fig. 3 is a schematic diagram of a framework structure of a bilateral personalized generative model according to an embodiment of the present invention.

FIG. 4 is a diagram of a frame structure of an encoder according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of a framework structure of a dynamic weighting and fusing module for encoding information according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of a frame structure of a decoder according to an embodiment of the present invention.

FIG. 7 is a schematic diagram of a variable across modal information coding fusion in an embodiment of the present invention;

FIG. 8 is a diagram illustrating comparison between output results of different methods according to an embodiment of the present invention.

Detailed Description

As shown in fig. 1 and fig. 2, the method for generating a cross-modal bilateral personalized human-computer social conversation according to the present embodiment includes:

2) Encoding the weighted fusionO _encEncoding of output results at the previous timeE _prevInputting the two-sided personalized generation model decoder together to generate an optimal N candidate reply lists; the bilateral personalized generation model is trained in advance to establish a weighted fusion code of input and a code of an output result at the previous momentE _prevAnd the mapping relation between the output optimal N candidate reply lists;

Referring to fig. 2, fig. 3 and fig. 4, the present embodiment further includes generating a dialog context code before step 1)E _CThe steps of (1): combining personalized attribute information and conversation context history of both parties to generate personalized interaction history information, respectively carrying out word embedding coding on the personalized interaction history information and current input of a user to obtain an embedded vector, then carrying out position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain a code described by the conversation context; inputting the encoding of the dialog context description into the encoder of the bilateral personalized generative model to obtain the dialog context encodingE _C。

Referring to fig. 2, fig. 3 and fig. 4, the method further includes generating a robot personalized information code before step 1) of the present embodimentE _TThe steps of (1): carrying out words on preset robot personalized descriptionEmbedding the codes to obtain embedded vectors, then carrying out position coding on the embedded vectors to obtain position coding vectors, and adding the embedded vectors and the position coding vectors to obtain codes of the robot personalized description; inputting the code of the robot personalized description into the encoder of the bilateral personalized generative model to obtain the robot personalized information codeE _T。

Referring to fig. 2, fig. 3 and fig. 4, the embodiment further includes generating a user personalized information code before step 1)E _SThe steps of (1): carrying out face recognition on a picture of a user to extract personalized description information of an image modality, carrying out face recognition on the picture of the user to extract personalized description information of the image modality, carrying out personalized social interaction on a personalized social interaction data set of the user, wherein the personalized social interaction data set comprises labeling information of sex, age and interest labels of two interacting parties, labeling the labels of personalized information of the two interacting parties related to a speaker for each sentence, and filtering a text with inappropriate words in the personalized social interaction data set to obtain personalized description information of a text modality; combining the personalized description information of the image modality and the personalized description information of the text modality to obtain a user personalized description; performing word embedding coding on the obtained user personalized description to obtain an embedded vector, then performing position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain a code of the user personalized description; inputting the code of the user personalized description into the encoder of the bilateral personalized generative model to obtain the user personalized information codeE _S。

In this embodiment, the step of extracting the personalized description information of the image modality includes: adopting a face enhancement tool kit (such as a Facelib open source tool kit) to enhance and preprocess face data, and extracting a face position and a feature point position in a picture by using a face position and feature point position extraction model (such as a MobileNet V1 pre-training model), wherein the extracted face position is the upper left corner coordinate and the lower right corner coordinate of an extraction frame; the characteristic points comprise coordinates of left eye pupils, coordinates of right eye pupils, coordinates of nose tips, coordinates of left mouth corners and coordinates of right mouth corners, and the extracted face images are corrected and cut by adopting designed parameters to obtain standard face images with the same size. And processing the obtained standard face picture by using a standardized method, and inputting the standard face picture into an age and gender face recognition classification model to extract an age and gender classification result after face recognition. For example, inputting a standard face image into a ShuffleNet pre-training model for fine adjustment operation, training and optimizing a loss function shown in the following formula, and obtaining an age and gender classification result after face recognition after forward propagation of the standard face image;

in the above formula, the results of age and sex classification are used respectively

And

it is shown that,P(a,g|I,B) Representing the probability of the predicted age and gender of the face image selected from the face bounding box,Iis an image of a face, and is,Bis a face bounding box. In addition, when extracting the personalized description information of the text modality, the personalized social interaction data set is filtered by adopting a regular method when filtering the text with the inappropriate words so as to improve the efficiency.

When carrying out word embedding coding on the obtained user personalized description, the word embedding coding step of the personalized description information aiming at the image modality comprises the following steps: respectively constructing user personalized description information in a key value pair form and personalized attribute feature information in a character string form according to the obtained age and gender classification results, and respectively obtaining codes of the description information and the attribute feature information by carrying out corresponding word embedding coding operation, wherein the results are expressed as the following formula;

in the above formula, the first and second carbon atoms are,E ^v _Ucoding the personalized description information for the user, general formulae ^v _iUser-personalized descriptive coding information characterized in the form of key-value pairs extracted for an image modality,i=1,2 code vectors of key-value pairs for age, gender respectively,sis a coded vector of the key(s),vis a coded vector of values. Extracting personalized attribute information of a user, wherein the personalized attribute information consists of age and gender character strings, and word embedding coding is respectively carried out by adopting the following formula;

wherein,G ^v _U、A ^v _Urespectively encoding the gender and age personalized attributes of the user extracted from the image modality,g ^v _i、a ^v _irespectively representing gender and age attributes extracted from the image modalityiThe code of the position is used to determine,mpersonalized description information coding representing a userE ^v _UThe total length of the first and second support members,i⊆[1,m]。

when carrying out word embedding coding on the obtained user personalized description, the word embedding coding step of the personalized description information aiming at the text mode comprises the following steps: performing word embedding operation on the user and robot personalized description information contained in the text modality, wherein the user and robot personalized description information is expressed in a key value pair form, and the following formula is shown as follows:

in the above formula, the first and second carbon atoms are,E ^t _Uencoded information extracted for the user's personalized description for the text modality,E ^t _Rcoded information of personalized description of robot extracted for text modality, general formulae ^t _iFor personalized description coding in the form of key-value pairs extracted from a text modality,i=1,2,3 represent the encoding vectors of key-value pairs of age, gender, interest tags, respectively. Considering the current text information word embedding of the user, the robot text information processing mode is the same (subscript is represented by R), all the embedded information prescribes the same embedding length, and if the length of the corresponding personalized information does not reach the total length of the sentence, the personalized information is used "<PAD>"as placeholder, complement the personalized information, when the total length of the corresponding sentence is exceeded, take truncation operation, and only take the first half of the input information, here, only take user input as an example. Then the encoded information of the complete sentence dialog input by the user is:

in the above formula, the first and second carbon atoms are,X _Uthe encoded information of the complete sentence dialog entered for the user,x _Uifor the first in the sentence input by the user

The words of the word are embedded into the encoded vector.

The method comprises the following steps of taking values of gender, age and interest tags, respectively carrying out word embedding coding operation, and when a user has a plurality of interest tags, carrying out average operation on codes, wherein the average operation is shown as the following formula:

in the above formula, the first and second carbon atoms are,G ^t _Uencoded information representing a text modality extracted user gender tag,A ^t _Uencoded information representing a user age tag extracted by the text modality,T ^t _Uencoded information representing the user interest tags extracted by the text modality,g ^t _Uiuser gender information representing text extraction corresponds to input

The word at the position of the word is embedded in the encoded vector,a ^t _Uia word representing the user age information of the text extraction is embedded in the encoding vector,t ^t _Uiembedding a word representing an interest tag of a text extraction user into a coding vector, and when a plurality of interest tags exist, averaging the plurality of embedded coding vectors,i⊆[1,n]，nindicating the overall length of the sentence. Because the personalized information of the text mode is absent, the personalized attribute information extracted from the image supplements the personalized attribute information of the text, such as the missing gender, age and the like, and meanwhile, the personalized attribute information error of the text is corrected in a mode of cross-mode information correction. In this embodiment, when performing word embedding coding on the obtained user personalized description, the method further includes supplementing age and gender information in the user personalized description information by using different modality information, as shown in the following formula:

in the above formula, the first and second carbon atoms are,E _Uextracting the personalized description coding information of the supplementary user from the image and text information,E ^v _Uencoded information extracted for the user's personalized description for the text modality,E ^t _Uencoded information of the personalized description of the robot extracted for the text modality,

and adding the coded information of different modes, and supplementing the personalized description information lacking in the text mode of the user by the personalized description information extracted by the image mode.

Indicates that the age-key-value pair encoding information of the user is supplemented from the text and image information,

indicates that the user's gender-key-value pair encoding information is supplemented from the text and image information,e ₃ ^tthe key value pair of interest tag representing the user from the text information encodes the information. The personalized description of the robot is predefined and does not need to be obtained from the image modality, soE _R=E ^t _RWhereinE _RRepresenting the robot personalization descriptive information,E ^t _Rand robot personalized description information representing text extraction.

Age and gender information in the personalized attribute information of the user is supplemented and fused through different modal information, and the information is shown as the following formula:

in the above formula, the first and second carbon atoms are,G _Uthe gender code information of the user after the image and the text information are fused,G ^v _Ugender and age coding information representing the image extraction,G ^t _Uuser gender code information representing text extraction,

~

the representations respectively correspond to the total length of the text input by the usernEach position in the image information is user gender code information fused by text and image information,A _Urepresenting the user age code information fused by the image and the text information,A ^v _Uage-coding information representing the image extraction,A ^t _Uuser age-coding information representing text extraction,

~

the representations respectively correspond to the total length of the text input by the usernEach position in the image, the post-user age code information fused by the text and the image modality information,

and adding the coded information of different modes, and fusing the personalized information lacking in the text mode of the user by the personalized information extracted by the image mode.

Defining the personalized information embedded representation of the current user asD _UThe robot personalized information embedding expression system consists of user dialogue sentences, user age information, user gender information and user interest labels, and the robot personalized information embedding expression of the same reason isD _RThe method is the same as above.

In the above formula, the first and second carbon atoms are,D _Uindicating the current userIs embedded in the representation of the personalized information,D _Rthe representation robot personalization information is embedded in the representation,G _Uthe gender code information of the user after the image and the text information are fused,A _Urepresenting the user's age-coded information fused by the image and the text information,T ^t _Uencoding information representing user interests extracted from a text modality,X _Ua dialog sentence representing the user is presented,X _Ra dialog sentence representing the robot,G ^t _Rgender code information representing the robot extracted by the text modality,A ^t _Rage-coded information representing the robot extracted by the text modality,T ^t _Rand the interest coding information of the robot extracted by the text mode is represented.

In the above formula, the first and second carbon atoms are,E _Cthe information is coded for the historical dialogue context and is formed by adding the robot personalized description information, the user personalized information, the robot personalized information and the dialogue historical information after being coded. General formula (VII)C _jRepresents the firstjThe history information of the wheel is recorded,lrepresenting the total number of rounds of the historical conversation,j⊆[1,l]，D _Ua personalized information embedded representation representing the current user,D _Rthe robot personalization information is represented embedded in the representation.

In this embodiment, the embedded vector is subjected to position coding to obtain a position coding vector, and a functional expression for performing position coding is shown as the following formula:

in the above formula, the first and second carbon atoms are,E _pos,2i()indicating characterIs marked with 2iThe position-embedded code that maps positionally into a sinusoidal function,E _pos,2i+1()indicating that the character is at 2i+1 position-embedded code mapped into the cosine function,poswhich indicates the position where the character is located,ithe dimension of the word embedding is represented,d _modelthe dimensions of the code are represented and,d _model=512，pos⊆[1,n]，ncontent length encoded for word embedding. And correspondingly adding each embedded vector and the position coding vector to obtain final trans-modal vector representation, and inputting the final trans-modal vector representation together into an encoder of the bilateral personalized generation model for encoding operation.

Referring to fig. 5, in this embodiment, the weighted fusion is performed in step 1) to obtain a weighted fusion codeO _encComprises the following steps:

s1) encoding the dialog contextE _CEncoding of output results at the previous timeE _prevCoding through a multi-head attention mechanism network to obtain multi-head self-attention coded dialog context codingO _CCoding the personalized information of the robotE _TEncoding of output results at the previous timeE _prevCoding is carried out through a multi-head attention mechanism network to obtain the robot personalized information code after multi-head self-attention codingO _TEncoding user personalized informationE _SEncoding of output results at the previous timeE _prevCoding is carried out through a multi-head attention mechanism network to obtain the user personalized information code after multi-head self-attention codingO _SEncoding the output result of the previous timeE _prevCoding is carried out through a multi-head attention mechanism network with a mask to obtain a code of an output result at the last moment after multi-head self-attention codingO _prev(ii) a Can be respectively expressed as:

In the above formula, the first and second carbon atoms are,αindicating the probability of the user-customized information appearing in the reply,βindicating the probability of the robot-personalised information appearing in the reply,γindicating the probability that no personalized information is present in the reply,O _Scoding the user personalized information after multi-head self-attention coding,O _Tfor coding the personalized information of the robot after multi-head self-attention coding,O _Cfor multi-headed self-attention coded dialog context coding,O _prevthe method is used for encoding the output result of the last moment after multi-head self-attention encoding. In the embodiment, each category is passed throughsoftmaxThe step of operation is carried out, and the prediction result is defined as a weight value of weighted fusion, whereinα、β、γThe function of (a) is expressed as follows:

in the above formula, the first and second carbon atoms are,O _Cfor multi-headed self-attention coded dialog context coding, y =0 represents that the returned personalized information is related to the other party,αrepresenting the probability of the presence of the user personalization in the reply, y =1 representing that the personalization of the reply is related to itself,βrepresenting the probability of robot personalization in the reply, y =2 representing the reply is context-dependent, without personalization,γrepresenting the probability that no personalized information is present in the reply.

Referring to FIG. 6, a decoder for bilateral personalized generative models is used to encode the weighted fusionO _encEncoding of output results at the previous timeE _prevThe best N candidate reply lists are generated (output results). As an optional implementation manner, the bilateral personalized generation model in step 2) in this embodiment is a GPT network model. Since the GPT network model is an existing network, it is not described again. The encoder and the decoder of the bilateral personalized generative model are common parameters, the parameters of the encoder and the decoder need to be trained before use, and the processing steps of the training data set adopted when the parameters of the encoder and the decoder are trained are the same as the step 1) and the preprocessing steps thereof, and are used for obtaining the weighted fusion coding of the data samplesO _encEncoding of output results at the previous timeE _prev. On the basis, weighted fusion coding is adopted when the parameters of the encoder and the decoder are trainedO _encEncoding of output results at the previous timeE _prevAnd as input, a multi-task learning mechanism is utilized, loss functions of a language model, an individualized prediction model and a decoding model in the GPT network model are optimized simultaneously, and an encoder and a decoder of the bilateral individualized generation model are trained to obtain optimal parameters, so that the bilateral individualized generation model is trained. Defining a personalized profile of the current speakerIs described asTPersonal description code of own party isE _TThe personalized description of the other party isSThe personalized description of the other party is coded intoE _SWhen modeling a bilateral personalized interaction model, the method is divided into two cases: case a: when the speaker is a robot, at this timeTFor the personalized description of the robot, the personalized description of the own party is coded intoE _T =E _RAnd coding is carried out through a multi-head attention mechanism network. Case b: in model training, it is necessary to make full use of social interaction data, when the speaker is a user, at this timeTFor the personalized description of the user, the personalized description of the own party is coded intoE _T = E _UThe personalized description of the other party is coded intoE _S = E _RAt this time, only the following steps are neededO _SAndO _Tand (4) interchanging, and calculating by adopting an attention mechanism.

And step 2) further includes a step of training a GPT network model before, and as a preferred embodiment, a joint loss function adopted when the GPT network model is trained in this embodiment is as follows:

in the above formula, the first and second carbon atoms are,L(φ,θ) A joint loss function is represented as a function of,L _D(φ) A loss function of the decoding model is represented,L _LM(φ) A loss function of the language model is represented,L _p(θ) Representing a loss function of the personalized predictive model,λ ₁andλ ₂is a weight coefficient (can be obtained by experience, for example, in this embodiment

₁=0.2，

₂= 0.5); wherein the decoding model loss functionL _D(φ) The expression of (a) is as follows:

in the above formula, the first and second carbon atoms are,P _φ(y _i|x ₀ ,…,x _i-1 ，O _enc) Represents the probability of inputting the weighted fusion code with the generated character into the decoder for prediction of the next character,y _iindicating generation of the second in the string for the decoderiThe number of the characters is one,x ₀~x _i-1generating prefixes in character strings for decodersiThe number of the characters is one,O _encit is shown that the weighted fusion coding,iindicating the length of the generated character string;x ₀ ,…,x _i-1after encoding for a given input text by the encoder, the decoder generates the top of the stringiAnd (4) each character is used as the input of the bilateral personalized Transformer decoder.

in the above formula, the first and second carbon atoms are,P _φ(y _i|x _i-k ,…,x _i-1) Represents passing through a fixed window under the fixed windowkMiddle frontk-1 character predictionx _i-k～x _i-1First, thekCharactery _iThe probability of (a) of (b) being,y _iindicating the first in a fixed windowkThe number of the characters is one,x _i-k~x _i-1representing front in a fixed windowk-1 number of characters of the character,kwhich represents the size of the contextual window and,ito represent the second in the windowiA character of each position; the present embodiment utilizes the existing pre-trained model parameter setφTo initialize the encoder and decoder and train the language model by optimizing the standard maximum log likelihood loss.kIs the size of the contextual window and,x _i-k ,…,x _i-1is a series of coded sequence samples sampled from a training corpus.

Referring to fig. 3, the present embodiment designs a coding information dynamic weighting fusion module, which inputs the dialog context information coding into the personalized prediction model to predict the probability of the personalized information appearing in the reply sentence, and dynamically fuses the input coding information into a three-classification personalized information prediction task, and dynamically weights and fuses the context, the personalized information of the listener at the opposite side and the personalized information of the caller at the other side. Personalized predictive loss functionL _p(θ) Representing the probability of occurrence of individualized information in response to a prediction of contextual information, an individualized predictive loss functionL _p(θ) The function of (a) is expressed as follows:

in the above formula, the first and second carbon atoms are,y _ja tag representing the personalized information is provided,P _θ(y=j|O _C) Indicating the probability of predicting the presence of personalized information in reply based on contextual information,O _Crepresents a multi-headed self-attention coded dialog context code and has:

in the above formula, the first and second carbon atoms are,O _C ^(j)the representation corresponds tojThe context of the personalized tag encodes the information classification result,O _C ⁽ⁱ⁾the representation corresponds toiAnd the context coding information classification result of the personalized label.

As can be seen from the foregoing, in order to achieve a better effect of the model, the training of this embodiment adopts a multi-task learning manner to perform fine tuning on the constructed personalized interactive text data set, so that not only is the loss function of the language model optimized, but also the personalized prediction loss function is performedL _p(θ) Also, the decoding model loss function is optimizedL _D(φ)。

Decoding model loss functionL _D(φ) It can also be expressed as:

in the above formula, the first and second carbon atoms are,x ₀ ,…,x _i-1after encoding for a given input text by the encoder, the decoder generates the top of the stringiA character which is used as the input of a decoder of the bilateral personalized generative model,y _igenerating the first of the character strings for a decoderiAnd (4) characters. After the output results of the encoder of the bilateral personalized generative model are subjected to weighted fusion, the above formula can be expressed as a decoding model loss function in the contextL _D(φ) In the form of (1).

After the personalized prediction model is trained, the decoder of the bilateral personalized generation model can be utilized to perform weighted fusion coding aiming at the inputO _encEncoding of output results at the previous timeE _prevAn optimal list of N candidate replies is generated. In this embodiment, a decoder of the bilateral personalized generation model generates an optimal top-5 candidate reply list through the bilateral personalized model in a cluster search manner, sorts the sentences in the candidate reply list by the size of a conditional mutual information abundance value (history included in replies and personalized information of both parties) by using a maximum conditional mutual information principle (CMIM) principle, selects the sentence with the largest conditional mutual information abundance value, performs normalization by using a text post-processing method, and outputs a suitable reply satisfying the personalized features of both parties of interaction.

In this embodiment, the step 3) includes: marking the reply text, the conversation history information and the individual information of both interaction parties of the robot in each section of conversation in the collected human-computer social conversation data in advance as a deep neural network

Training positive samples of (1); meanwhile, in order to balance the training data, a negative sampling method is adopted to construct a training negative sample, and the cross entropy of the two classifications is used as a deep neural network

Training a deep neural network model

(ii) a When the conditional mutual information abundance value replied by each candidate in the N candidate reply lists needs to be calculated, the conditional mutual information abundance value replied by each candidate in the N candidate reply lists is calculated according to the following formula;

in the above formula, the first and second carbon atoms are,I(X;Y|C,S,T) The information abundance value of the condition mutual information is shown,

the probability of generating reply text by the input text through a coder and a decoder under the condition of giving personalized information and historical information of both parties is shown,

representing the co-occurrence probability of reply texts, historical information, the personalized information of the other party and the personalized information of the own party in the dialogue corpus,Xwhich represents the input of the text to be entered,Ya representation of the reply text is made,Cthe history information is represented by a history information,Spersonalized information representing the other party is provided,Tthe personalized information of the own party is shown,

representing that the bilateral personalized dialog generation model generates a Top-5 best candidate reply list,E _Crepresenting a dialog context code,E _TCode for expressing the personalized information of the robot,E _SRepresenting the coding of the personalized information of the user,

parameters representing the bilateral personalized generative model,

representing parameters of a deep neural network model.

Inputting the conditional mutual information abundance values of the text X and the reply text Y under the condition of giving the personalized information S and T and the historical information C of both parties

Can be expressed as:

in the above formula, the first and second carbon atoms are,P(Y|X,C,S,T) The probability of obtaining the reply text by the input text, the historical information, the personalized information of the opposite party and the personalized information of the own party is shown,P(Y|C,S,T) And the probability of obtaining the reply text by the historical information, the personalized information of the other party and the personalized information of the own party is shown. Note that in the dialog generation model, generating the reply text Y requires the input text X as a premise, while the denominator of the joint loss function does not contain the input text X. Therefore, in this embodiment, a way of jointly optimizing the mutual information abundance value of the condition is not adopted, but a way of calculating step by step is adopted, the molecular part of the mutual information abundance value is calculated first, personalized information S and T and historical information C of both parties are encoded, and a bilateral personalized dialogue generating model is used to generate a Top-5 optimal candidate reply list

As shown in the following formula:

in the above formula, the first and second carbon atoms are,

the probability of the Top-5 optimal candidate reply list obtained by the input information code, the historical information code, the other party personalized information code and the own party personalized information code is shown,

top-5 best candidate reply list generated for the decoder of the bilateral personalized generative model. Further, the conditional probability formula is used for determining the abundance value of the conditional mutual informationI(X;Y|C,S,T) The denominator part of (a) is transformed as follows:

in the above formula, the first and second carbon atoms are,P(Y|C,S,T) The probability of obtaining the reply text by the historical information, the personalized information of the other party and the personalized information of the own party is shown,P(Y,C,S,T) Representing the co-occurrence probability of the reply text, the history information, the personalized information of the opposite party and the personalized information of the own party,P(C,S,T) Representing the co-occurrence probability of the historical information, the personalized information of the other party and the personalized information of the own party,Yrepresenting each sentence of reply text in the best candidate reply list. Personalizing information given the same partiesSAndThistory informationCUnder the condition of (1), each pair of input texts is calculatedXAnd reply textYThe condition of (2) is a mutual information abundance value,P(C,S,T) Since the probabilities of (a) and (b) are the same, if this part can be ignored in calculating the amount of condition information, the calculation of the following equation is equivalent to the calculation of the following equation.

In the above formula, the first and second carbon atoms are,

representing the co-occurrence probability of the reply text, the historical information, the personalized information of the other party and the personalized information of the own party calculated by the deep neural network model,

and representing a deep neural network model obtained by training on the collected human-computer social interaction data, and representing the co-occurrence probability of each reply text, the historical information and the individual information of both interaction parties. And in the collected human-computer social conversation data, marking the reply text of the robot, the conversation historical information and the individual information of both interactive parties in each section of conversation as a training positive sample of the deep neural network model. Meanwhile, in order to balance the training data, a negative sampling method is adopted to construct a training negative sample, cross entropy of the two classes is used as a loss function of the deep neural network model, the deep neural network model is trained, and the calculation mode is as follows:

in the above formula, the first and second carbon atoms are,

representing a deep neural network

Is used to determine the loss function of (c),Nrepresents the number of training samples in a batch,N=32，y _ia value of a personalized tag is represented,y _i=0,1，

representing a model of a deep neural networkCalculating the co-occurrence probability of the reply text, the historical information, the personalized information of the opposite side and the personalized information of the own side;y _i=1 represents that the reply is present in the current dialog, and 0 represents that the reply is not present in the current dialog. In order to generalize the CMIM target, a hyper-parameter is introducedλ ₃And a penalty coefficient for irrelevant replies in the dialog, the goal of optimization based on the loss function in the training process is shown as the following formula:

in the above formula, the first and second carbon atoms are,Y' represents the reply with the largest conditional mutual information abundance value in the top-5 candidate reply list, and the hyperparameterλ ₃The value can be taken according to experience, and the value taking super-parameter in the embodimentλ ₃And = 0.2. After the output statement with the maximum conditional mutual information abundance value is obtained, post-processing operation is carried out on the output statement, the reply statement is regulated by a template and keyword matching method, the proportion of the context information and the personalized information in the reply is increased by adopting a deleting and rewriting mode, and the reply which is more natural and smooth, different from person to person and rich in personality is generated.

As shown in fig. 7, embedding user gender information obtained from images and texts, user age information obtained from images and texts, and user interest information obtained from texts indicates that the same position information embedding codes are added at each position in the user input text to form context coding information of the user terminal; similarly, the robot does not need to acquire additional information through images, and only needs to acquire corresponding information such as conversation text, age, gender interest and the like from text information and then perform embedded coding representation, so that the gender coding information, the age coding information and the interest coding information of the robot are embedded and coded with the same position information, and are added at each position in the conversation text of the robot to form context coding information at the robot end, and the user and the robot coding information are connected in series to form cross-modal conversation context coding information.

In order to verify the cross-modal bilateral personalized human-computer social conversation generation method of the embodiment, the following evaluation indexes are adopted in the embodiment: (1) bilateral persona accuracy (Acc): inputting the generated response and the role information of the two people into a role classifier to obtain classification precision; (2) BLEU: calculating the n-gram (n =1, 2) overlapping rate of the generated statement and the original reply statement; (3) perlexity (ppl): and measuring the fitting degree of the model and the test data, wherein the lower the numerical value is, the more smooth the sentence grammar generated by the model is. Social interaction data meeting bilateral personalization is manually selected from an existing social interaction test set, a comparison experiment is carried out with different methods, and evaluation results are shown in table 1.

Table 1: the evaluation results of the method of the present example were compared with those of the conventional methods.

See Table 1 for comparative methods including Transfo, Lconv, P-TDW, Transfo + Bi-P + CMIM and Lconv + Bi-P + CMIM. Transfo (TransferTransfo) and Lconv (lost in conversation) are the most popular methods in dialog generation, and the methods do not consider personalized information; P-TDW (Pre-Training with Dynamic Weight) is a more advanced method, but the method only emphasizes the expression of self personalization; transfo + Bi-P + CMIM is an enhanced version of the Transfo + P method, and Lconv + Bi-P + CMIM is an enhanced version of the Lconv + P method. The cross-modal Bilateral personalized man-machine social interaction generation method comprises personalized information (binary Persona) of both parties and constraint of Conditional Mutual Information (CMIM), improvement is conducted on the basis of the method, and compared with test results of other methods, the method has superiority in indexes such as Acc, BLEU and PPL.

An experimental result of the cross-modal bilateral personalized man-machine social interaction generation method is shown in fig. 8, and as can be seen from an example result (left graph), when a user mentions' lovely sister, i find you fatIn the results generated by different methods, the Transfo and Lconv methods do not consider personalized information and are only related to the context, while the P-TDWP method only considers the personalized expression of the P-TDWP method, but the method provided by the invention can generate the following results according to the difference of weights: (1) "sister is fat", a reply related to the contra-party personalised information (setting = 1); (2) "i do not have fat", reply related to self personalized information (setting = 1); (3) do you (do you get fat)

", a reply that is contextually related to the conversation (setting = 1); (4) 'I do not get fat, miss get fat', reply combining the individuation of the opposite side, the individuation of the own side and the context-related information in a self-adaptive way. Because the context information is included, the personal information of the opposite side is added, and the reply has more information amount and is more natural and smooth. As can be seen from the example results (right diagram), when the user mentions "old brother coming back" and the user mentions "brother" as the personality information of the other party, the method proposed by the present invention can generate: (1) "back for old sister", reply related to the other party's personalized information (setting = 1); (2) "old brother wants you up", reply related to own personalized information (setting = 1); (3) "i come back soon," a reply that is contextually related to the conversation (setting = 1); (4) the 'old brother wants you, old sister' replies by self-adaptive fusion of the individuation of the opposite side, the individuation of the own side and the context related information. As can be seen from the different examples: the reply generated by the cross-modal bilateral personalized man-machine social interaction generation method can control the personalized information amount in the reply according to different weight settings, and the method can generate the reply which is personalized and varies from person to person for different users.

In addition, the present embodiment also provides a cross-modal bilateral personalized human-computer social conversation generating system, which includes a computer device, where the computer device at least includes a microprocessor and a memory, which are connected to each other, and the microprocessor is programmed or configured to execute the steps of the aforementioned cross-modal bilateral personalized human-computer social conversation generating method, or a computer program programmed or configured to execute the aforementioned cross-modal bilateral personalized human-computer social conversation generating method is stored in the memory.

In addition, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, the computer program being programmed or configured to execute the aforementioned cross-modal bilateral personalized human-computer social conversation generation method.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is directed to methods, apparatus (systems), and computer program products according to embodiments of the application wherein instructions, which execute via a flowchart and/or a processor of the computer program product, create means for implementing functions specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. A cross-modal bilateral personalized man-machine social conversation generation method is characterized by comprising the following steps:

2) Encoding the weighted fusionO _encEncoding of output results at the previous timeE _prevInputting the two-sided personalized generation model decoder together to generate an optimal N candidate reply lists; the bilateral personalized generative model is trained in advance, and input weighted fusion codes and codes of output results at the last moment are establishedE _prevAnd the mapping relation between the output optimal N candidate reply lists;

3) calculating the conditional mutual information abundance value of each candidate reply in the N candidate reply lists; when the conditional mutual information abundance value replied by each candidate in the N candidate reply lists needs to be calculated, the conditional mutual information abundance value replied by each candidate in the N candidate reply lists is calculated according to the following formula;

in the above formula, the first and second carbon atoms are,I(X;Y|C,S,T) Indicating conditionsThe abundance value of the mutual information is obtained,P(Y|X,C,S,T) The probability of obtaining the reply text by the input text, the historical information, the personalized information of the opposite party and the personalized information of the own party is shown,P(Y|C,S,T) The probability of obtaining the reply text by the historical information, the personalized information of the other party and the personalized information of the own party is shown,Xwhich represents the input of the text to be entered,Ya representation of the reply text is made,Cthe history information is represented by a history information,Spersonalized information representing the other party is provided,Trepresenting own personalized information;

2. The method according to claim 1, wherein step 1) is preceded by generating a dialog context codeE _CThe steps of (1): combining personalized attribute information and conversation context history of both parties to generate personalized interaction history information, respectively carrying out word embedding coding on the personalized interaction history information and current input of a user to obtain an embedded vector, then carrying out position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain a code described by the conversation context; inputting the encoding of the dialog context description into the encoder of the bilateral personalized generative model to obtain the dialog context encodingE _C。

3. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, further comprising generating a robot personalized information code before step 1)E _TThe steps of (1): carrying out word embedding coding on the preset robot personalized description to obtain an embedded vector, then carrying out position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain the code of the robot personalized description; inputting the code of the robot personalized description into the encoder of the bilateral personalized generative model to obtain the robot personalized information codeE _T。

4. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, further comprising generating a user personalized information code before step 1)E _SThe steps of (1): carrying out face recognition on a picture of a user to extract personalized description information of an image modality, carrying out face recognition on the picture of the user to extract personalized description information of the image modality, carrying out personalized social interaction on a personalized social interaction data set of the user, wherein the personalized social interaction data set comprises labeling information of sex, age and interest labels of two interacting parties, labeling the labels of personalized information of the two interacting parties related to a speaker for each sentence, and filtering a text with inappropriate words in the personalized social interaction data set to obtain personalized description information of a text modality; combining the personalized description information of the image modality and the personalized description information of the text modality to obtain a user personalized description; performing word embedding coding on the obtained user personalized description to obtain an embedded vector, then performing position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain a code of the user personalized description; inputting the code of the user personalized description into the encoder of the bilateral personalized generative model to obtain the user personalized information codeE _S。

5. The method of generating a cross-modal bilateral personalized human-machine social conversation according to claim 2,3 or 4, wherein the position-coded function expression is as follows:

6. The method for generating a cross-modal bilateral personalized human-computer social conversation according to claim 1, wherein the weighted fusion coding is obtained by performing weighted fusion in step 1)O _encComprises the following steps:

In the above formula, the first and second carbon atoms are,αindicating the probability of the user-customized information appearing in the reply,βindicating the probability of the robot-personalised information appearing in the reply,γpresentation of no personalization in repliesThe probability of the information is determined by the probability of the information,O _Scoding the user personalized information after multi-head self-attention coding,O _Tfor coding the personalized information of the robot after multi-head self-attention coding,O _Cfor multi-headed self-attention coded dialog context coding,O _prevthe method is used for encoding the output result of the last moment after multi-head self-attention encoding.

7. The cross-modal bilateral personalized human-computer social conversation generating method according to claim 1, wherein the bilateral personalized generating model in step 2) is a GPT network model, and the step 2) further includes a step of training the GPT network model, and a joint loss function adopted in training the GPT network model is as follows:

in the above formula, the first and second carbon atoms are,P _φ(y _i|x ₀ ,…,x _i-1 ，O _enc) Represents the probability of inputting the weighted fusion code with the generated character into the decoder for prediction of the next character,y _irepresenting the generation of character strings for a decoderTo (1)iThe number of the characters is one,x ₀~x _i-1generating prefixes in character strings for decodersiThe number of the characters is one,O _encit is shown that the weighted fusion coding,iindicating the length of the generated character string;

in the above formula, the first and second carbon atoms are,P _φ(y _i|x _i-k ,…,x _i-1) Represents passing through a fixed window under the fixed windowkMiddle frontk-1 characterx _i-k～x _i-1Predict the firstkCharactery _iThe probability of (a) of (b) being,y _iindicating the first in a fixed windowkThe number of the characters is one,x _i-k~x _i-1representing front in a fixed windowk-1 number of characters of the character,kwhich represents the size of the contextual window and,ito represent the second in the windowiA character of each position;

8. The cross-modal bilateral personalized human-machine social conversation generation method according to claim 1, wherein the step of step 3) comprises: marking a robot reply text, conversation history information and interactive two-party personalized information in each section of conversation in collected human-computer social conversation data in advance, and using the marked information as a training positive sample of a deep neural network model; meanwhile, in order to balance the training data, a training negative sample is constructed by adopting a negative sampling method, the cross entropy of the two classes is used as a loss function of the deep neural network model, and the deep neural network model is trained.

9. A cross-modal bilateral personalized human-computer social conversation generating system comprising a computer device including at least a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the cross-modal bilateral personalized human-computer social conversation generating method according to any one of claims 1 to 8, or the memory stores therein a computer program programmed or configured to perform the cross-modal bilateral personalized human-computer social conversation generating method according to any one of claims 1 to 8.

10. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform a cross-modal bilateral personalized social interaction generating method according to any one of claims 1 to 8.