CN111930918B - Cross-modal bilateral personalized man-machine social interaction dialog generation method and system


Info

Publication number
CN111930918B
CN111930918B
Authority
CN
China
Prior art keywords
personalized
information
coding
code
bilateral
Prior art date
Legal status
Active
Application number
CN202011046353.7A
Other languages
Chinese (zh)
Other versions
CN111930918A
Inventor
李树涛
李宾
孙斌
Current Assignee
Hunan Xinxin Xiangrong Intelligent Technology Co.,Ltd.
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN202011046353.7A
Publication of CN111930918A
Application granted
Publication of CN111930918B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods


Abstract

The invention discloses a cross-modal bilateral personalized man-machine social interaction dialog generation method and system. The invention fuses personalized information across modalities and takes into account the personalized information of both interacting parties. On the premise of ensuring that reply content is reasonable, grammar is fluent and logic is coherent, the personalized characteristics of both interacting parties are fully exploited, so that replies that are personalized and tailored to different interlocutors can be generated.

Description

Cross-modal bilateral personalized man-machine social interaction dialog generation method and system
Technical Field
The invention relates to the technical field of human-computer interaction, and in particular to a cross-modal bilateral personalized man-machine social interaction dialog generation method and system.
Background
With advances in science and technology, human-computer interaction is becoming increasingly intelligent and personalized, and interaction between people and robots increasingly resembles interaction between people in the real world. Traditional man-machine social dialog generation belongs to the field of natural language processing and mainly studies how a robot can produce a natural reply to a user's text input. In interpersonal communication, by contrast, vision is the main sensory source through which people receive external information, and people form natural, personalized expressions based on that information. Therefore, to make a robot more "human-like", in man-machine social interaction the robot needs a certain perception capability in addition to understanding the user's language and producing natural, personalized replies. How to combine computer vision with natural language processing and study cross-modal fusion coding of information in different modalities such as text and images, so that the robot generates "human-like" personalized expressions from acquired user images and natural language, has become one of the challenges of natural human-machine interaction.
Personalized human-computer interaction mainly studies how a robot generates replies consistent with a preset personality or set of characteristics, i.e., replies with single-side personalization consistency. In recent years, research on deep-learning-based human-computer interaction has developed rapidly and is increasingly widely used in daily life; widely known examples such as Microsoft XiaoIce and Apple Siri all have preset personality and characteristic attributes. In the field of personalized interaction, pre-training methods have recently advanced the state of the art on a series of natural language processing tasks, and recent research attempts to solve personalized dialog generation in a data-driven manner, i.e., learning person-related features directly from large-scale dialog data sets and generating replies that conform to the speaker's personality by encoding the characters' sparse personalized information and then fine-tuning a pre-trained model.
Nowadays, social chat robots best embody personalized interaction and can generate personalized replies rich in character characteristics. At present, chat robots are mainly text-based: during man-machine interaction the robot cannot see the user and cannot acquire the user's personalized information, but the robot's own personalized attribute information can be predefined, so personalized expressions related to the robot can be generated by embedding and representing the robot's personalized information.
However, in social interaction between real people, both parties can acquire each other's personalized information; when replying, a person not only attends to his or her own personalized expression but also considers the other party's personalized characteristics and questions. Man-machine interaction that neglects the user's personalized information can therefore create a sense of distance and aversion and reduce the user experience. Hence, cross-modal bilateral personalized man-machine interaction technology needs to be developed for natural robot interaction.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems in the prior art, the invention provides a cross-modal bilateral personalized man-machine social interaction dialog generation method and system.
In order to solve the above technical problems, the invention adopts the following technical solution:
A cross-modal bilateral personalized man-machine social dialog generation method comprises the following steps:
1) performing weighted fusion of the dialog context encoding E_C, the robot personalized information encoding E_T, the user personalized information encoding E_S and the encoding E_prev of the output result at the previous moment to obtain a weighted fusion encoding O_enc;
2) inputting the weighted fusion encoding O_enc and the encoding E_prev of the output result at the previous moment together into the decoder of a bilateral personalized generation model to generate a list of the N best candidate replies, the bilateral personalized generation model having been trained in advance to establish the mapping relation between the input (the weighted fusion encoding and the encoding E_prev of the output result at the previous moment) and the output list of the N best candidate replies;
3) calculating the conditional mutual information abundance value of each candidate reply in the list of N candidate replies;
4) selecting the candidate reply with the largest conditional mutual information abundance value as the final output result.
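For illustration, the four steps above can be sketched as the following Python skeleton. The helper callables (fuse, beam_decode, cmim_score) are hypothetical stand-ins for the fusion module, the bilateral personalized decoder and the conditional mutual information scorer described below; they are not components defined by the patent.

```python
from typing import Callable, List

def generate_reply(e_c, e_t, e_s, e_prev,
                   fuse: Callable,
                   beam_decode: Callable[..., List[str]],
                   cmim_score: Callable[[str], float],
                   n: int = 5) -> str:
    # Step 1: weighted fusion of E_C, E_T, E_S and E_prev into O_enc.
    o_enc = fuse(e_c, e_t, e_s, e_prev)
    # Step 2: decode the N best candidate replies.
    candidates = beam_decode(o_enc, e_prev, beam_size=n)
    # Step 3: score every candidate by its conditional mutual
    # information abundance value.
    scores = [cmim_score(y) for y in candidates]
    # Step 4: return the candidate with the largest abundance value.
    return candidates[scores.index(max(scores))]
```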
Optionally, step 1) is preceded by the step of generating the dialog context encoding E_C: combining the personalized attribute information of both parties with the dialog context history to generate personalized interaction history information; performing word embedding encoding on the personalized interaction history information and the user's current input respectively to obtain an embedded vector; performing position encoding on the embedded vector to obtain a position encoding vector; and adding the embedded vector and the position encoding vector to obtain the encoding of the dialog context description; the encoding of the dialog context description is input into the encoder of the bilateral personalized generation model to obtain the dialog context encoding E_C.
Optionally, step 1) is preceded by the step of generating the robot personalized information encoding E_T: performing word embedding encoding on the preset robot personalized description to obtain an embedded vector, performing position encoding on the embedded vector to obtain a position encoding vector, and adding the embedded vector and the position encoding vector to obtain the encoding of the robot personalized description; the encoding of the robot personalized description is input into the encoder of the bilateral personalized generation model to obtain the robot personalized information encoding E_T.
Optionally, step 1) is preceded by the step of generating the user personalized information encoding E_S: performing face recognition on a picture of the user to extract the personalized description information of the image modality; constructing a personalized social interaction data set that contains the annotation information of the gender, age and interest tags of both interacting parties, labeling each sentence with the tags of the speaker-related personalized information of both parties, and filtering texts containing inappropriate words from the personalized social interaction data set to obtain the personalized description information of the text modality; combining the personalized description information of the image modality with the personalized description information of the text modality to obtain the user personalized description; performing word embedding encoding on the obtained user personalized description to obtain an embedded vector, performing position encoding on the embedded vector to obtain a position encoding vector, and adding the embedded vector and the position encoding vector to obtain the encoding of the user personalized description; inputting the encoding of the user personalized description into the encoder of the bilateral personalized generation model to obtain the user personalized information encoding E_S.
Optionally, the functional expression for position encoding is as follows:

$$E_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad E_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

In the above formula, E_(pos,2i) denotes the position embedding code that maps the character at dimension 2i by a sine function, E_(pos,2i+1) denotes the position embedding code that maps the character at dimension 2i+1 by a cosine function, pos denotes the position of the character, i denotes the word embedding dimension, d_model denotes the encoding dimension, d_model = 512, pos ∈ [1, n], and n is the content length of the word embedding encoding.
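A minimal NumPy sketch of this position encoding follows; the function name and the use of NumPy are illustrative choices, not part of the patent.

```python
import numpy as np

def positional_encoding(n: int, d_model: int = 512) -> np.ndarray:
    """Sinusoidal position codes as in the formula above: even
    dimensions use sin, odd dimensions use cos (positions 1..n)."""
    pe = np.zeros((n, d_model))
    pos = np.arange(1, n + 1)[:, None]          # pos in [1, n]
    two_i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 2i
    angle = pos / np.power(10000.0, two_i / d_model)
    pe[:, 0::2] = np.sin(angle)                 # E_(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                 # E_(pos, 2i+1)
    return pe

# The final code of a description is the element-wise sum of its word
# embeddings and these position codes: encoding = embedding + positional_encoding(n)
```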
Optionally, the step of performing weighted fusion in step 1) to obtain the weighted fusion encoding O_enc comprises:

S1) encoding the dialog context encoding E_C with the encoding E_prev of the output result at the previous moment through a multi-head attention network to obtain the multi-head self-attention encoded dialog context encoding O_C; encoding the robot personalized information encoding E_T with E_prev through a multi-head attention network to obtain the multi-head self-attention encoded robot personalized information encoding O_T; encoding the user personalized information encoding E_S with E_prev through a multi-head attention network to obtain the multi-head self-attention encoded user personalized information encoding O_S; and encoding E_prev through a masked multi-head attention network to obtain the multi-head self-attention encoded encoding O_prev of the output result at the previous moment.

S2) obtaining the weighted fusion encoding O_enc according to the following formula:

$$O_{enc} = \alpha \cdot O_S + \beta \cdot O_T + \gamma \cdot O_C + O_{prev}$$

In the above formula, α denotes the probability that the user personalized information appears in the reply, β denotes the probability that the robot personalized information appears in the reply, γ denotes the probability that no personalized information appears in the reply, O_S is the multi-head self-attention encoded user personalized information encoding, O_T is the multi-head self-attention encoded robot personalized information encoding, O_C is the multi-head self-attention encoded dialog context encoding, and O_prev is the multi-head self-attention encoded encoding of the output result at the previous moment.
Optionally, the bilateral personalized generation model in step 2) is a GPT network model, and step 2) is preceded by a step of training the GPT network model; the joint loss function adopted when training the GPT network model is as follows:

$$L(\phi,\theta) = L_D(\phi) + \lambda_1 L_{LM}(\phi) + \lambda_2 L_p(\theta)$$

In the above formula, L(φ,θ) denotes the joint loss function, L_D(φ) denotes the loss function of the decoding model, L_LM(φ) denotes the loss function of the language model, L_p(θ) denotes the loss function of the personalized prediction model, and λ_1 and λ_2 are weight coefficients. The decoding model loss function L_D(φ) is expressed as follows:

$$L_D(\phi) = -\sum_{i} \log P_\phi\left(y_i \mid x_0, \ldots, x_{i-1}, O_{enc}\right)$$

In the above formula, P_φ(y_i | x_0, …, x_{i-1}, O_enc) denotes the probability of inputting the weighted fusion encoding together with the already generated characters into the decoder to predict the next character, y_i denotes the i-th character in the string generated by the decoder, x_0 ~ x_{i-1} are the first i characters of the string generated by the decoder, O_enc denotes the weighted fusion encoding, and i denotes the length of the generated string.

The language model loss function L_LM(φ) is expressed as follows:

$$L_{LM}(\phi) = -\sum_{i} \log P_\phi\left(y_i \mid x_{i-k}, \ldots, x_{i-1}\right)$$

In the above formula, P_φ(y_i | x_{i-k}, …, x_{i-1}) denotes the probability of predicting the k-th character y_i from the preceding k-1 characters x_{i-k} ~ x_{i-1} within a fixed window, y_i denotes the k-th character in the fixed window, x_{i-k} ~ x_{i-1} denote the preceding k-1 characters in the fixed window, k denotes the size of the context window, and i denotes the character at the i-th position in the window.

The personalized prediction loss function L_p(θ) is expressed as follows:

$$L_p(\theta) = -\sum_{j} y_j \log P_\theta\left(y = j \mid O_C\right)$$

In the above formula, y_j denotes the personalized information tag, P_θ(y=j|O_C) denotes the probability of predicting, from the context information, that personalized information appears in the reply, and O_C denotes the multi-head self-attention encoded dialog context encoding.
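As a sketch, the joint loss can be assembled as follows in PyTorch. The assignment of λ_1 to the language model term and λ_2 to the personalization term is an assumption of this sketch; the default values follow the embodiment described later (λ_1 = 0.2, λ_2 = 0.5).

```python
import torch
import torch.nn.functional as F

def joint_loss(decoder_logits, target_ids,      # decoding-model task
               lm_logits, lm_target_ids,        # language-model task
               persona_logits, persona_label,   # 3-way personalization task
               lambda1: float = 0.2, lambda2: float = 0.5) -> torch.Tensor:
    """L(phi, theta) = L_D + lambda1 * L_LM + lambda2 * L_p (a sketch)."""
    l_d = F.cross_entropy(decoder_logits.view(-1, decoder_logits.size(-1)),
                          target_ids.view(-1))
    l_lm = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                           lm_target_ids.view(-1))
    l_p = F.cross_entropy(persona_logits, persona_label)  # classes 0/1/2
    return l_d + lambda1 * l_lm + lambda2 * l_p
```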
Optionally, step 3) comprises: labeling in advance, in the collected human-computer social dialog data, the robot reply text, the dialog history information and the personalized information of both interacting parties in each dialog, and using them as training positive samples of a deep neural network model; meanwhile, in order to balance the training data, constructing training negative samples by a negative sampling method, using the binary cross entropy as the loss function of the deep neural network model, and training the deep neural network model; when the conditional mutual information abundance value of each candidate reply in the list of N candidate replies needs to be calculated, it is calculated according to the following formula:

$$I(X;Y \mid C,S,T) = \log \frac{P(Y \mid X,C,S,T)}{P(Y \mid C,S,T)}$$

In the above formula, I(X;Y|C,S,T) denotes the conditional mutual information abundance value, P(Y|X,C,S,T) denotes the probability of obtaining the reply text from the input text, the history information, the other party's personalized information and one's own personalized information, P(Y|C,S,T) denotes the probability of obtaining the reply text from the history information, the other party's personalized information and one's own personalized information, X denotes the input text, Y denotes the reply text, C denotes the history information, S denotes the other party's personalized information, and T denotes one's own personalized information.
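Numerically, the abundance value is the pointwise conditional mutual information; a small sketch with hypothetical probabilities:

```python
import math

def cmi_abundance(p_y_given_xcst: float, p_y_given_cst: float) -> float:
    # I(X;Y|C,S,T) = log P(Y|X,C,S,T) - log P(Y|C,S,T)
    return math.log(p_y_given_xcst) - math.log(p_y_given_cst)

# A reply the full model rates 0.12 but the context-only model rates 0.03
# scores log(0.12 / 0.03) = log 4 ≈ 1.386 (values are hypothetical).
print(cmi_abundance(0.12, 0.03))
```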
In addition, the present invention also provides a cross-modal bilateral personalized man-machine social dialog generation system, comprising a computer device that includes at least a microprocessor and a memory connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the cross-modal bilateral personalized man-machine social dialog generation method, or the memory stores a computer program programmed or configured to execute the cross-modal bilateral personalized man-machine social dialog generation method.
Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to perform the cross-modal bilateral personalized human-machine social conversation generation method.
Compared with the prior art, the invention has the following advantages: the invention performs weighted fusion of the dialog context encoding E_C, the robot personalized information encoding E_T, the user personalized information encoding E_S and the encoding E_prev of the output result at the previous moment to obtain the weighted fusion encoding O_enc; inputs the weighted fusion encoding O_enc and E_prev together into the decoder of the bilateral personalized generation model to generate a list of the N best candidate replies; calculates the conditional mutual information abundance value of each candidate reply in the list; and selects the candidate reply with the largest conditional mutual information abundance value as the final output result. The invention fuses personalized information across modalities and considers the personalized information of both interacting parties; on the premise of ensuring reasonable reply content, fluent grammar and coherent logic, the personalized characteristics of both parties are fully exploited, and replies that are personalized and tailored to different interlocutors can be generated. Targeting the generation of robot natural language replies that are rich in personality and vary from person to person during human-computer interaction, the invention constructs a cross-modal personalized interaction model and adopts pre-training technology; compared with traditional methods, it can significantly improve the chat quality and social experience of human-computer interaction.
Drawings
FIG. 1 is a core flow diagram of a method according to an embodiment of the present invention.
FIG. 2 is a schematic view of a complete flow of the method according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a framework structure of a bilateral personalized generative model according to an embodiment of the present invention.
FIG. 4 is a diagram of a frame structure of an encoder according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a framework structure of a dynamic weighting and fusing module for encoding information according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a frame structure of a decoder according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of the variables in cross-modal information coding fusion in an embodiment of the present invention.
FIG. 8 is a diagram illustrating comparison between output results of different methods according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the cross-modal bilateral personalized man-machine social dialog generation method according to the present embodiment comprises:
1) performing weighted fusion of the dialog context encoding E_C, the robot personalized information encoding E_T, the user personalized information encoding E_S and the encoding E_prev of the output result at the previous moment to obtain a weighted fusion encoding O_enc;
2) inputting the weighted fusion encoding O_enc and the encoding E_prev of the output result at the previous moment together into the decoder of the bilateral personalized generation model to generate a list of the N best candidate replies, the bilateral personalized generation model having been trained in advance to establish the mapping relation between the input (the weighted fusion encoding and the encoding E_prev of the output result at the previous moment) and the output list of the N best candidate replies;
3) calculating the conditional mutual information abundance value of each candidate reply in the list of N candidate replies;
4) selecting the candidate reply with the largest conditional mutual information abundance value as the final output result.
Referring to fig. 2, fig. 3 and fig. 4, this embodiment further comprises, before step 1), the step of generating the dialog context encoding E_C: combining the personalized attribute information of both parties with the dialog context history to generate personalized interaction history information; performing word embedding encoding on the personalized interaction history information and the user's current input respectively to obtain an embedded vector; performing position encoding on the embedded vector to obtain a position encoding vector; and adding the embedded vector and the position encoding vector to obtain the encoding of the dialog context description; the encoding of the dialog context description is input into the encoder of the bilateral personalized generation model to obtain the dialog context encoding E_C.
Referring to fig. 2, fig. 3 and fig. 4, this embodiment further comprises, before step 1), the step of generating the robot personalized information encoding E_T: performing word embedding encoding on the preset robot personalized description to obtain an embedded vector, performing position encoding on the embedded vector to obtain a position encoding vector, and adding the embedded vector and the position encoding vector to obtain the encoding of the robot personalized description; the encoding of the robot personalized description is input into the encoder of the bilateral personalized generation model to obtain the robot personalized information encoding E_T.
Referring to fig. 2, fig. 3 and fig. 4, this embodiment further comprises, before step 1), the step of generating the user personalized information encoding E_S: performing face recognition on a picture of the user to extract the personalized description information of the image modality; constructing a personalized social interaction data set that contains the annotation information of the gender, age and interest tags of both interacting parties, labeling each sentence with the tags of the speaker-related personalized information of both parties, and filtering texts containing inappropriate words from the personalized social interaction data set to obtain the personalized description information of the text modality; combining the personalized description information of the image modality with the personalized description information of the text modality to obtain the user personalized description; performing word embedding encoding on the obtained user personalized description to obtain an embedded vector, performing position encoding on the embedded vector to obtain a position encoding vector, and adding the embedded vector and the position encoding vector to obtain the encoding of the user personalized description; inputting the encoding of the user personalized description into the encoder of the bilateral personalized generation model to obtain the user personalized information encoding E_S.
In this embodiment, the step of extracting the personalized description information of the image modality comprises: enhancing and preprocessing the face data with a face enhancement toolkit (such as the Facelib open source toolkit), and extracting the face position and feature point positions in the picture with a face position and feature point extraction model (such as a MobileNetV1 pre-trained model), wherein the extracted face position consists of the upper-left and lower-right coordinates of the extraction box; the feature points comprise the coordinates of the left pupil, the right pupil, the nose tip, the left mouth corner and the right mouth corner; the extracted face image is rectified and cropped with designed parameters to obtain standard face images of the same size. The obtained standard face picture is processed with a standardization method and input into an age and gender face recognition classification model to extract the age and gender classification results after face recognition. For example, the standard face image is input into a ShuffleNet pre-trained model for fine-tuning, the model is trained by optimizing the classification loss, and the age and gender classification results after face recognition are obtained by forward propagation of the standard face image according to the following formula:

$$(a^*, g^*) = \arg\max_{a,g} P(a, g \mid I, B)$$

In the above formula, the age and gender classification results are denoted a* and g* respectively, P(a,g|I,B) denotes the probability of the predicted age and gender of the face image selected by the face bounding box, I is the face image, and B is the face bounding box. In addition, when extracting the personalized description information of the text modality, a regular-expression method is adopted to filter texts with inappropriate words from the personalized social interaction data set so as to improve efficiency.
When performing word embedding encoding on the obtained user personalized description, the word embedding encoding step for the personalized description information of the image modality comprises: constructing, from the obtained age and gender classification results, the user personalized description information in key-value pair form and the personalized attribute feature information in character string form, and obtaining the encodings of the description information and the attribute feature information by the corresponding word embedding encoding operations, with the result expressed as the following formula:

$$E_v^U = \{e_v^i\}_{i=1,2}, \qquad e_v^i = (s^i, v^i)$$

In the above formula, E_v^U is the encoding of the user personalized description information, e_v^i is the user personalized description encoding information in key-value pair form extracted from the image modality, i = 1, 2 index the encoding vectors of the key-value pairs for age and gender respectively, s is the encoding vector of the key, and v is the encoding vector of the value. The personalized attribute information of the user, consisting of the age and gender character strings, is extracted and word embedding encoded respectively according to the following formulas:

$$G_v^U = (g_v^1, g_v^2, \ldots, g_v^m), \qquad A_v^U = (a_v^1, a_v^2, \ldots, a_v^m)$$

In the above formula, G_v^U and A_v^U are respectively the encodings of the user's gender and age personalized attributes extracted from the image modality, g_v^i and a_v^i respectively denote the encoding at the i-th position of the gender and age attributes extracted from the image modality, m denotes the total length of the user personalized description information encoding E_v^U, and i ∈ [1, m].
when carrying out word embedding coding on the obtained user personalized description, the word embedding coding step of the personalized description information aiming at the text mode comprises the following steps: performing word embedding operation on the user and robot personalized description information contained in the text modality, wherein the user and robot personalized description information is expressed in a key value pair form, and the following formula is shown as follows:
Figure 996701DEST_PATH_IMAGE014
Figure 621587DEST_PATH_IMAGE015
in the above formula, the first and second carbon atoms are,E t U encoded information extracted for the user's personalized description for the text modality,E t R coded information of personalized description of robot extracted for text modality, general formulae t i For personalized description coding in the form of key-value pairs extracted from a text modality,i=1,2,3 represent the encoding vectors of key-value pairs of age, gender, interest tags, respectively. Considering the current text information word embedding of the user, the robot text information processing mode is the same (subscript is represented by R), all the embedded information prescribes the same embedding length, and if the length of the corresponding personalized information does not reach the total length of the sentence, the personalized information is used "<PAD>"as placeholder, complement the personalized information, when the total length of the corresponding sentence is exceeded, take truncation operation, and only take the first half of the input information, here, only take user input as an example. Then the encoded information of the complete sentence dialog input by the user is:
Figure 42204DEST_PATH_IMAGE016
in the above formula, the first and second carbon atoms are,X U the encoded information of the complete sentence dialog entered for the user,x Ui for the first in the sentence input by the user
Figure 514773DEST_PATH_IMAGE017
The words of the word are embedded into the encoded vector.
The values of the gender, age and interest tags are taken and word embedding encoded respectively; when a user has several interest tags, the encodings are averaged, as shown in the following formula:

$$G_t^U = (g_t^{U,1}, \ldots, g_t^{U,n}), \quad A_t^U = (a_t^{U,1}, \ldots, a_t^{U,n}), \quad T_t^U = (t_t^{U,1}, \ldots, t_t^{U,n})$$

In the above formula, G_t^U denotes the encoded information of the user gender tag extracted from the text modality, A_t^U denotes the encoded information of the user age tag extracted from the text modality, T_t^U denotes the encoded information of the user interest tags extracted from the text modality, g_t^{U,i} denotes the word embedding encoding vector of the text-extracted user gender information at the position corresponding to the i-th input word, a_t^{U,i} denotes the word embedding encoding vector of the text-extracted user age information, t_t^{U,i} denotes the word embedding encoding vector of the text-extracted user interest tag (when there are several interest tags, the several embedding encoding vectors are averaged), i ∈ [1, n], and n denotes the total length of the sentence. Because personalized information may be absent from the text modality, the personalized attribute information extracted from the image supplements the text's missing personalized attribute information, such as gender and age; meanwhile, errors in the text's personalized attribute information are corrected by means of cross-modal information correction. In this embodiment, when performing word embedding encoding on the obtained user personalized description, the method further comprises supplementing the age and gender information in the user personalized description information with the information of the different modalities, as shown in the following formula:
$$E^U = E_t^U \oplus E_v^U = \{e_1^t \oplus e_1^v,\; e_2^t \oplus e_2^v,\; e_3^t\}$$

In the above formula, E^U is the user personalized description encoding information supplemented from the image and text information, E_v^U is the encoded information of the user's personalized description extracted from the image modality, E_t^U is the encoded information of the user's personalized description extracted from the text modality, and ⊕ denotes the addition of the encoded information of the different modalities, so that the personalized description information lacking in the user's text modality is supplemented by the personalized description information extracted from the image modality. e_1^t ⊕ e_1^v indicates that the user's age key-value pair encoding information is supplemented from the text and image information, e_2^t ⊕ e_2^v indicates that the user's gender key-value pair encoding information is supplemented from the text and image information, and e_3^t is the interest tag key-value pair encoding information of the user from the text information. The personalized description of the robot is predefined and does not need to be obtained from the image modality, so E^R = E_t^R, where E^R denotes the robot personalized description information and E_t^R denotes the robot personalized description information extracted from text.
The age and gender information in the user's personalized attribute information is supplemented and fused through the information of the different modalities, as shown in the following formulas:

$$G^U = G_t^U \oplus G_v^U = (g^{U,1}, \ldots, g^{U,n}), \qquad A^U = A_t^U \oplus A_v^U = (a^{U,1}, \ldots, a^{U,n})$$

In the above formula, G^U is the user gender encoding information after fusing the image and text information, G_v^U denotes the gender encoding information extracted from the image, G_t^U denotes the user gender encoding information extracted from the text, g^{U,1} ~ g^{U,n} denote the user gender encoding information fused from the text and image information at each of the n positions corresponding to the total length of the text input by the user, A^U denotes the user age encoding information fused from the image and text information, A_v^U denotes the age encoding information extracted from the image, A_t^U denotes the user age encoding information extracted from the text, a^{U,1} ~ a^{U,n} denote the fused user age encoding information at each of the n positions, and ⊕ denotes the addition of the encoded information of the different modalities, so that the personalized information lacking in the user's text modality is fused with the personalized information extracted from the image modality.
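A sketch of the ⊕ supplementation above, assuming the attribute embeddings are NumPy arrays and a missing text-modality attribute is represented by None:

```python
from typing import Optional
import numpy as np

def supplement(text_emb: Optional[np.ndarray],
               image_emb: Optional[np.ndarray]) -> Optional[np.ndarray]:
    """Cross-modal supplementation: fall back to the image-modality
    embedding when the text modality lacks the attribute, and add the
    two when both are present so the image modality can correct the
    text modality."""
    if text_emb is None:
        return image_emb
    if image_emb is None:
        return text_emb
    return text_emb + image_emb

# e.g. fused gender code: G_U = supplement(G_t_U, G_v_U)
```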
The personalized information embedded representation of the current user is defined as D^U, composed of the user dialog sentence, the user age information, the user gender information and the user interest tags; similarly, the robot personalized information embedded representation is defined as D^R and obtained in the same way:

$$D^U = X^U \oplus G^U \oplus A^U \oplus T_t^U, \qquad D^R = X^R \oplus G_t^R \oplus A_t^R \oplus T_t^R$$

In the above formula, D^U denotes the personalized information embedded representation of the current user, D^R denotes the robot personalized information embedded representation, G^U is the user gender encoding information after fusing the image and text information, A^U denotes the user age encoding information fused from the image and text information, T_t^U denotes the user interest encoding information extracted from the text modality, X^U denotes the user's dialog sentence, X^R denotes the robot's dialog sentence, G_t^R denotes the robot gender encoding information extracted from the text modality, A_t^R denotes the robot age encoding information extracted from the text modality, and T_t^R denotes the robot interest encoding information extracted from the text modality.

$$E_C = E^R \oplus C_1 \oplus C_2 \oplus \cdots \oplus C_l, \qquad C_j = D_j^U \oplus D_j^R$$

In the above formula, E_C is the historical dialog context encoding information, formed by adding the encoded robot personalized description information, user personalized information, robot personalized information and dialog history information; C_j denotes the j-th round of history information, l denotes the total number of rounds of the historical dialog, j ∈ [1, l], D^U denotes the personalized information embedded representation of the current user, and D^R denotes the robot personalized information embedded representation.
In this embodiment, the embedded vector is position encoded to obtain a position encoding vector, and the functional expression for position encoding is as shown in the following formula:

$$E_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad E_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

In the above formula, E_(pos,2i) denotes the position embedding code that maps the character at dimension 2i by a sine function, E_(pos,2i+1) denotes the position embedding code that maps the character at dimension 2i+1 by a cosine function, pos denotes the position of the character, i denotes the word embedding dimension, d_model denotes the encoding dimension, d_model = 512, pos ∈ [1, n], and n is the content length of the word embedding encoding. Each embedded vector and its position encoding vector are added correspondingly to obtain the final cross-modal vector representation, which is input together into the encoder of the bilateral personalized generation model for the encoding operation.
Referring to fig. 5, in this embodiment the weighted fusion in step 1) to obtain the weighted fusion encoding O_enc comprises:

S1) encoding the dialog context encoding E_C with the encoding E_prev of the output result at the previous moment through a multi-head attention network to obtain the multi-head self-attention encoded dialog context encoding O_C; encoding the robot personalized information encoding E_T with E_prev through a multi-head attention network to obtain the multi-head self-attention encoded robot personalized information encoding O_T; encoding the user personalized information encoding E_S with E_prev through a multi-head attention network to obtain the multi-head self-attention encoded user personalized information encoding O_S; and encoding E_prev through a masked multi-head attention network to obtain the multi-head self-attention encoded encoding O_prev of the output result at the previous moment. These can be respectively expressed as:

$$O_C = \mathrm{MultiHead}(E_{prev}, E_C, E_C)$$
$$O_T = \mathrm{MultiHead}(E_{prev}, E_T, E_T)$$
$$O_S = \mathrm{MultiHead}(E_{prev}, E_S, E_S)$$
$$O_{prev} = \mathrm{MaskedMultiHead}(E_{prev}, E_{prev}, E_{prev})$$

S2) obtaining the weighted fusion encoding O_enc according to the following formula:

$$O_{enc} = \alpha \cdot O_S + \beta \cdot O_T + \gamma \cdot O_C + O_{prev}$$

In the above formula, α denotes the probability that the user personalized information appears in the reply, β denotes the probability that the robot personalized information appears in the reply, γ denotes the probability that no personalized information appears in the reply, O_S is the multi-head self-attention encoded user personalized information encoding, O_T is the multi-head self-attention encoded robot personalized information encoding, O_C is the multi-head self-attention encoded dialog context encoding, and O_prev is the multi-head self-attention encoded encoding of the output result at the previous moment. In this embodiment, each category is passed through a softmax operation, and the prediction results are defined as the weight values of the weighted fusion, where α, β and γ are expressed as follows:

$$\alpha = P_\theta(y = 0 \mid O_C), \qquad \beta = P_\theta(y = 1 \mid O_C), \qquad \gamma = P_\theta(y = 2 \mid O_C)$$

In the above formula, O_C is the multi-head self-attention encoded dialog context encoding; y = 0 indicates that the personalized information in the reply is related to the other party, with α denoting the probability that the user personalized information appears in the reply; y = 1 indicates that the personalized information in the reply is related to the robot itself, with β denoting the probability that the robot personalized information appears in the reply; y = 2 indicates that the reply is context-dependent and contains no personalized information, with γ denoting the probability that no personalized information appears in the reply.
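The dynamic weighted fusion can be sketched as the following PyTorch module. Mean-pooling O_C before the 3-way classifier and treating O_prev as a residual term are assumptions of this sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """A 3-way personalization predictor over O_C produces
    (alpha, beta, gamma), which weight the user / robot / context
    encodings."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.cls = nn.Linear(d_model, 3)  # personalized prediction head

    def forward(self, o_s, o_t, o_c, o_prev):
        # Pool the context encoding and softmax over the three classes.
        w = torch.softmax(self.cls(o_c.mean(dim=1)), dim=-1)   # (B, 3)
        a, b, g = (w[:, i].view(-1, 1, 1) for i in range(3))
        # Weighted fusion; adding O_prev as a residual is an assumption.
        return a * o_s + b * o_t + g * o_c + o_prev
```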
Referring to FIG. 6, the decoder of the bilateral personalized generation model is used to generate the list of the N best candidate replies (the output results) from the weighted fusion encoding O_enc and the encoding E_prev of the output result at the previous moment. As an optional implementation, the bilateral personalized generation model in step 2) of this embodiment is a GPT network model. Since the GPT network model is an existing network, it is not described again. The encoder and the decoder of the bilateral personalized generation model share parameters, and these parameters need to be trained before use; the processing steps for the training data set used when training the encoder and decoder parameters are the same as step 1) and its preprocessing steps, and are used to obtain the weighted fusion encoding O_enc of the data samples and the encoding E_prev of the output result at the previous moment. On this basis, when training the encoder and decoder parameters, the weighted fusion encoding O_enc and the encoding E_prev of the output result at the previous moment are used as input; with a multi-task learning mechanism, the loss functions of the language model, the personalized prediction model and the decoding model in the GPT network model are optimized simultaneously, and the encoder and decoder of the bilateral personalized generation model are trained to obtain the optimal parameters, thereby completing the training of the bilateral personalized generation model. Define the personalized description of the current speaker as T, with own personalized description encoding E_T, and the personalized description of the other party as S, with encoding E_S. When modeling the bilateral personalized interaction model, two cases are distinguished. Case a: when the speaker is the robot, T is the personalized description of the robot, the own personalized description encoding is E_T = E^R, and encoding is carried out through the multi-head attention network. Case b: in model training, the social interaction data must be fully utilized; when the speaker is the user, T is the personalized description of the user, the own personalized description encoding is E_T = E^U, and the other party's personalized description encoding is E_S = E^R; in this case it is only necessary to interchange O_S and O_T and calculate with the attention mechanism.
Step 2) is preceded by the step of training the GPT network model. As a preferred embodiment, the joint loss function adopted when training the GPT network model in this embodiment is as follows:

$$L(\phi,\theta) = L_D(\phi) + \lambda_1 L_{LM}(\phi) + \lambda_2 L_p(\theta)$$

In the above formula, L(φ,θ) denotes the joint loss function, L_D(φ) denotes the loss function of the decoding model, L_LM(φ) denotes the loss function of the language model, L_p(θ) denotes the loss function of the personalized prediction model, and λ_1 and λ_2 are weight coefficients (which can be set empirically; in this embodiment λ_1 = 0.2 and λ_2 = 0.5). The decoding model loss function L_D(φ) is expressed as follows:

$$L_D(\phi) = -\sum_{i} \log P_\phi\left(y_i \mid x_0, \ldots, x_{i-1}, O_{enc}\right)$$

In the above formula, P_φ(y_i | x_0, …, x_{i-1}, O_enc) denotes the probability of inputting the weighted fusion encoding together with the already generated characters into the decoder to predict the next character, y_i denotes the i-th character in the string generated by the decoder, x_0 ~ x_{i-1} are the first i characters of the string generated by the decoder, O_enc denotes the weighted fusion encoding, and i denotes the length of the generated string; x_0, …, x_{i-1} are the first i characters of the string generated by the decoder after the given input text is encoded by the encoder, and serve as the input of the bilateral personalized Transformer decoder.
The language model loss function L_LM(φ) is expressed as follows:

$$L_{LM}(\phi) = -\sum_{i} \log P_\phi\left(y_i \mid x_{i-k}, \ldots, x_{i-1}\right)$$

In the above formula, P_φ(y_i | x_{i-k}, …, x_{i-1}) denotes the probability of predicting the k-th character y_i from the preceding k-1 characters x_{i-k} ~ x_{i-1} within a fixed window, y_i denotes the k-th character in the fixed window, x_{i-k} ~ x_{i-1} denote the preceding k-1 characters in the fixed window, k denotes the size of the context window, and i denotes the character at the i-th position in the window. This embodiment uses an existing pre-trained model parameter set φ to initialize the encoder and decoder, and trains the language model by optimizing the standard maximum log-likelihood loss; k is the size of the context window, and x_{i-k}, …, x_{i-1} is a series of encoded sequence samples sampled from the training corpus.
Referring to fig. 3, this embodiment designs a coding information dynamic weighting fusion module, which inputs the dialog context information encoding into the personalized prediction model to predict the probability of personalized information appearing in the reply sentence, casts the fusion of the input encoded information as a three-class personalized information prediction task, and dynamically weights and fuses the context, the personalized information of the listener on the other side and the personalized information of the speaker on this side. The personalized prediction loss function L_p(θ) corresponds to predicting, from the context information, the probability that personalized information appears in the reply, and is expressed as follows:

$$L_p(\theta) = -\sum_{j} y_j \log P_\theta\left(y = j \mid O_C\right)$$

In the above formula, y_j denotes the personalized information tag, P_θ(y=j|O_C) denotes the probability of predicting from the context information that personalized information appears in the reply, and O_C denotes the multi-head self-attention encoded dialog context encoding, with:

$$P_\theta(y = j \mid O_C) = \frac{\exp\left(O_C(j)\right)}{\sum_{i} \exp\left(O_C(i)\right)}$$

In the above formula, O_C(j) denotes the context encoding information classification result corresponding to the j-th personalized tag, and O_C(i) denotes the context encoding information classification result corresponding to the i-th personalized tag.
As can be seen from the foregoing, in order to achieve a better model effect, the training of this embodiment adopts multi-task learning to fine-tune on the constructed personalized interaction text data set: not only is the loss function of the language model optimized, but the personalized prediction loss function L_p(θ) and the decoding model loss function L_D(φ) are optimized as well.
The decoding model loss function L_D(φ) can also be expressed as:

$$L_D(\phi) = -\sum_{i} \log P_\phi\left(y_i \mid x_0, \ldots, x_{i-1}\right)$$

In the above formula, x_0, …, x_{i-1} are the first i characters of the string generated by the decoder after the given input text is encoded by the encoder, which serve as the input of the decoder of the bilateral personalized generation model, and y_i is the i-th character of the string generated by the decoder. After the output results of the encoder of the bilateral personalized generation model are weighted and fused, the above formula can be expressed in the form of the decoding model loss function L_D(φ) given above.
After the personalized prediction model is trained, the decoder of the bilateral personalized generation model can be used to generate the list of the N best candidate replies for the input weighted fusion encoding O_enc and the encoding E_prev of the output result at the previous moment. In this embodiment, the decoder of the bilateral personalized generation model generates the Top-5 best candidate reply list through the bilateral personalized model by means of beam search, sorts the sentences in the candidate reply list by the size of the conditional mutual information abundance value (how much of the history and of both parties' personalized information a reply contains) using the conditional maximum mutual information (CMIM) principle, selects the sentence with the largest conditional mutual information abundance value, normalizes it with a text post-processing method, and outputs a suitable reply satisfying the personalized characteristics of both interacting parties.
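The chaining of these steps can be sketched as follows; model.beam_search, model.log_prob and cooc_net.log_prob are assumed interfaces for the generative model and the co-occurrence network described below, not APIs defined by the patent.

```python
def rerank_by_cmim(model, cooc_net, o_enc, e_prev, beam_size: int = 5):
    # Step 2: Top-5 candidate replies via beam search.
    candidates = model.beam_search(o_enc, e_prev, beam_size=beam_size)
    # Step 3: abundance = generator log-prob minus co-occurrence log-prob.
    scored = [(model.log_prob(y, o_enc) - cooc_net.log_prob(y), y)
              for y in candidates]
    # Step 4: keep the reply with the largest abundance value
    # (text post-processing would follow).
    return max(scored)[1]
```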
In this embodiment, step 3) comprises: labeling in advance, in the collected human-computer social dialog data, the robot reply text, the dialog history information and the personalized information of both interacting parties in each dialog as training positive samples of a deep neural network model; meanwhile, in order to balance the training data, constructing training negative samples by a negative sampling method, using the binary cross entropy as the loss function of the deep neural network model, and training the deep neural network model. When the conditional mutual information abundance value of each candidate reply in the list of N candidate replies needs to be calculated, it is calculated according to the following formula:

$$I(X;Y \mid C,S,T) = \log \frac{P_\phi(Y \mid X,C,S,T)}{P_\psi(Y,C,S,T)}$$

In the above formula, I(X;Y|C,S,T) denotes the conditional mutual information abundance value; P_φ(Y|X,C,S,T) denotes the probability that the reply text is generated from the input text by the encoder and decoder given the personalized information of both parties and the history information; P_ψ(Y,C,S,T) denotes the co-occurrence probability, in the dialog corpus, of the reply text, the history information, the other party's personalized information and one's own personalized information; X denotes the input text, Y denotes the reply text (taken from the Top-5 best candidate reply list generated by the bilateral personalized dialog generation model), C denotes the history information, S denotes the other party's personalized information, T denotes one's own personalized information, E_C denotes the dialog context encoding, E_T denotes the robot personalized information encoding, E_S denotes the user personalized information encoding, φ denotes the parameters of the bilateral personalized generation model, and ψ denotes the parameters of the deep neural network model.
The conditional mutual information abundance value of the input text X and the reply text Y, given the personalized information S and T of both parties and the history information C, can be expressed as:

$$I(X;Y \mid C,S,T) = \log \frac{P(Y \mid X,C,S,T)}{P(Y \mid C,S,T)}$$
In the above formula, P(Y|X,C,S,T) denotes the probability of obtaining the reply text from the input text, the history information, the other party's personalized information and one's own personalized information, and P(Y|C,S,T) denotes the probability of obtaining the reply text from the history information, the other party's personalized information and one's own personalized information. Note that in the dialog generation model, generating the reply text Y requires the input text X as a premise, while the denominator does not contain the input text X. Therefore, this embodiment does not jointly optimize the conditional mutual information abundance value but calculates it step by step: first the numerator part of the abundance value is calculated, the personalized information S and T of both parties and the history information C are encoded, and the bilateral personalized dialog generation model is used to generate the Top-5 best candidate reply list Y_top5, as shown in the following formula:

$$Y_{top5} = \mathop{\arg\mathrm{top5}}_{Y} \; P_\phi\left(Y \mid E_X, E_C, E_S, E_T\right)$$

In the above formula, P_φ(Y|E_X,E_C,E_S,E_T) denotes the probability of the Top-5 best candidate reply list obtained from the input information encoding E_X, the history information encoding E_C, the other party's personalized information encoding E_S and one's own personalized information encoding E_T, and Y_top5 is the Top-5 best candidate reply list generated by the decoder of the bilateral personalized generation model.
$$P(Y\mid C,S,T)=\frac{P(Y,C,S,T)}{P(C,S,T)}$$
In the above formula, P(Y|C,S,T) denotes the probability of obtaining the reply text from the historical information, the other party's personalized information and the own party's personalized information, P(Y,C,S,T) denotes the co-occurrence probability of the reply text, the historical information, the other party's personalized information and the own party's personalized information, P(C,S,T) denotes the co-occurrence probability of the historical information, the other party's personalized information and the own party's personalized information, and Y denotes each reply text in the best candidate reply list. When the conditional mutual information abundance value is computed for each pair of input text X and reply text Y under the same personalized information S and T and the same historical information C, P(C,S,T) is identical for every candidate, so this part can be ignored, and the computation is equivalent to the following formula:
$$I(X;Y\mid C,S,T)\;\propto\;\log\frac{P_{\phi}(Y\mid X,E_{C},E_{S},E_{T})}{P_{\theta}(Y,C,S,T)}$$
In the above formula, P_θ(Y,C,S,T) denotes the co-occurrence probability of the reply text, the historical information, the other party's personalized information and the own party's personalized information, computed by a deep neural network model with parameters θ that is trained on the collected human-computer social conversation data. In the collected data, the robot reply text, the dialog history information and the personalized information of both interacting parties in each dialog segment are marked as training positive samples of the deep neural network model. Meanwhile, in order to balance the training data, training negative samples are constructed by a negative sampling method, and the binary cross entropy is used as the loss function of the deep neural network model for training, computed as follows:
$$L(\theta)=-\frac{1}{N}\sum_{i=1}^{N}\Big[y_{i}\log P_{\theta}(Y_{i},C_{i},S_{i},T_{i})+(1-y_{i})\log\big(1-P_{\theta}(Y_{i},C_{i},S_{i},T_{i})\big)\Big]$$
In the above formula, L(θ) denotes the loss function of the deep neural network model, N denotes the number of training samples in a batch (N = 32), y_i denotes the personalized label value (y_i = 0, 1), and P_θ(Y_i,C_i,S_i,T_i) denotes the co-occurrence probability of the reply text, the historical information, the other party's personalized information and the own party's personalized information computed by the deep neural network model; y_i = 1 indicates that the reply occurs in the current dialog, and y_i = 0 indicates that it does not. In order to generalize the CMIM objective, a hyperparameter λ_3 is introduced as a penalty coefficient for irrelevant replies in the dialog; the optimization goal based on the loss function during training is shown in the following formula:
$$Y'=\operatorname*{arg\,max}_{Y\in\{Y_{1},\dots,Y_{5}\}}\Big[\log P_{\phi}(Y\mid X,E_{C},E_{S},E_{T})-\lambda_{3}\log P_{\theta}(Y,C,S,T)\Big]$$
In the above formula, Y' denotes the reply with the largest conditional mutual information abundance value in the Top-5 candidate reply list, and the hyperparameter λ_3 can be set empirically; in this embodiment λ_3 = 0.2. After the output sentence with the largest conditional mutual information abundance value is obtained, a post-processing operation is performed on it: the reply sentence is regularized by template and keyword matching, and deletion and rewriting are used to increase the proportion of context information and personalized information in the reply, so as to generate replies that are more natural and fluent, differ from person to person, and are rich in personality.
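To make the two-stage computation above concrete, the following minimal Python/PyTorch sketch re-ranks the Top-5 candidates by the penalized log-ratio and shows the binary cross entropy used to train the co-occurrence scorer. The helper names (rerank_by_cmim, bce_loss, log_p_gen, p_dnn) and the smoothing constants are illustrative assumptions, not the patent's actual implementation.

```python
import math
import torch

def rerank_by_cmim(candidates, log_p_gen, p_dnn, lambda3=0.2):
    """Pick the Top-5 candidate with the largest conditional mutual
    information abundance value, i.e. the argmax over the list of
    log P_phi(Y|X,E_C,E_S,E_T) - lambda3 * log P_theta(Y,C,S,T).
    P(C,S,T) is shared by all candidates and therefore ignored."""
    best, best_score = None, -math.inf
    for y in candidates:
        score = log_p_gen(y) - lambda3 * math.log(p_dnn(y) + 1e-12)
        if score > best_score:
            best, best_score = y, score
    return best

def bce_loss(p_theta, labels):
    """Binary cross entropy over a batch (N = 32 in the embodiment):
    labels are 1 for (reply, history, both personas) tuples observed in
    the corpus and 0 for tuples built by negative sampling."""
    p = torch.clamp(p_theta, 1e-12, 1.0 - 1e-12)
    return -(labels * torch.log(p) + (1 - labels) * torch.log(1 - p)).mean()
```

Here log_p_gen stands for the generation probability of the bilateral personalized generative model and p_dnn for the co-occurrence probability of the deep neural network model described above.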
As shown in fig. 7, the user gender information obtained from images and texts, the user age information obtained from images and texts, and the user interest information obtained from texts are represented by embedding codes, and the same position information embedding codes are added at each position of the user input text to form the context coding information of the user side. Similarly, the robot does not need to acquire additional information from images; it only needs to obtain the dialog text and the corresponding gender, age and interest information from text and represent them by embedding codes, so that the robot gender coding information, age coding information and interest coding information, together with the same position information embedding codes, are added at each position of the robot dialog text to form the context coding information of the robot side. The user coding information and the robot coding information are then concatenated to form the cross-modal dialog context coding information.
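As an illustration of this additive embedding scheme, the sketch below sums token, attribute and position embeddings for each side and concatenates the two sides. The class name, vocabulary size and attribute cardinalities are hypothetical placeholders, assuming d_model = 512 as elsewhere in the document.

```python
import torch
import torch.nn as nn

class CrossModalContextEncoder(nn.Module):
    """Sketch of the additive scheme of fig. 7: gender, age and interest
    codes plus the shared position code are added at every position of one
    speaker's dialog text, and the two sides are concatenated."""

    def __init__(self, vocab=30000, d_model=512,
                 n_gender=3, n_age=8, n_interest=50, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.gender = nn.Embedding(n_gender, d_model)
        self.age = nn.Embedding(n_age, d_model)
        self.interest = nn.Embedding(n_interest, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def encode_side(self, tokens, gender_id, age_id, interest_id):
        # tokens: (seq_len,) LongTensor of one speaker's dialog text;
        # the attribute embeddings broadcast over every token position.
        positions = torch.arange(tokens.size(0))
        return (self.tok(tokens) + self.pos(positions)
                + self.gender(gender_id) + self.age(age_id)
                + self.interest(interest_id))

    def forward(self, user, robot):
        # user / robot: dicts with keys tokens, gender_id, age_id, interest_id.
        user_code = self.encode_side(**user)
        robot_code = self.encode_side(**robot)
        # Concatenating the two sides along the sequence axis yields the
        # cross-modal dialog context coding information.
        return torch.cat([user_code, robot_code], dim=0)
```

In use, the user-side attribute ids would come from face recognition on the user's picture and from text, while the robot-side ids come from its preset persona text, matching the pipeline above.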
In order to verify the cross-modal bilateral personalized human-computer social conversation generation method of this embodiment, the following evaluation indexes are adopted: (1) bilateral persona accuracy (Acc): the generated response and the role information of the two speakers are input into a role classifier to obtain the classification accuracy; (2) BLEU: the n-gram (n = 1, 2) overlap rate between the generated sentence and the original reply sentence; (3) Perplexity (PPL): the degree of fit between the model and the test data, where a lower value indicates that the sentences generated by the model are grammatically more fluent. Social conversation data meeting the bilateral personalization condition are manually selected from an existing social conversation test set, comparison experiments with different methods are carried out, and the evaluation results are shown in Table 1.
Table 1: Comparison of the evaluation results of the method of this embodiment with existing methods.
See Table 1 for the comparative methods, including Transfo, Lconv, P-TDW, Transfo + Bi-P + CMIM and Lconv + Bi-P + CMIM. Transfo (TransferTransfo) and Lconv (Lost in Conversation) are the most popular methods in dialog generation, but neither considers personalized information; P-TDW (Pre-Training with Dynamic Weight) is a more advanced method, but it only emphasizes the expression of its own personalization. Transfo + Bi-P + CMIM and Lconv + Bi-P + CMIM are versions of Transfo and Lconv enhanced with the bilateral personalized information (Bilateral Persona, Bi-P) and the conditional mutual information (CMIM) constraint of the proposed cross-modal bilateral personalized human-machine social conversation generation method. Compared with the test results of the other methods, the proposed method is superior on the Acc, BLEU and PPL indexes.
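For reference, the BLEU overlap and perplexity indexes reported in Table 1 can be approximated by the short sketch below; it is a simplified illustration (token lists in, clipped n-gram precision, no brevity penalty), not the evaluation script used in the experiments.

```python
import math
from collections import Counter

def ngram_overlap(candidate, reference, n):
    """n-gram precision (n = 1, 2 in the experiments) of a generated
    sentence against the original reply, both given as token lists."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    if not cand:
        return 0.0
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    hits = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return hits / len(cand)

def perplexity(token_log_probs):
    """PPL from per-token log-probabilities assigned by the model; lower
    values indicate a more fluent fit to the test data."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```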
An experimental result of the cross-modal bilateral personalized human-machine social conversation generation method is shown in fig. 8. As the example result (left graph) shows, when the user says "lovely sister, I find you have gotten fat", the Transfo and Lconv methods, which do not consider personalized information, generate replies related only to the context, and the P-TDW method only considers its own personalized expression, whereas the method proposed by the invention can generate, according to different weight settings: (1) "Sister is fat", a reply related to the other party's personalized information (setting α = 1); (2) "I did not get fat", a reply related to its own personalized information (setting β = 1); (3) "Did you get fat?", a reply related to the dialog context (setting γ = 1); (4) "I did not get fat, Miss got fat", a reply that adaptively fuses the other party's personalization, its own personalization and the context-related information. Because the context information is included and the other party's personal information is added, the reply carries more information and is more natural and fluent. As the example result (right graph) shows, when the user says "old brother is back" and "brother" appears as the other party's personalized information, the proposed method can generate: (1) "Back for old sister", a reply related to the other party's personalized information (setting α = 1); (2) "Old brother missed you", a reply related to its own personalized information (setting β = 1); (3) "I came back soon", a reply related to the dialog context (setting γ = 1); (4) "Old brother missed you, old sister", a reply that adaptively fuses the other party's personalization, its own personalization and the context-related information. These examples show that the replies generated by the cross-modal bilateral personalized human-machine social conversation generation method can control the amount of personalized information in the reply according to different weight settings, and that the method generates personalized replies that vary from person to person for different users.
In addition, this embodiment also provides a cross-modal bilateral personalized human-computer social conversation generation system comprising a computer device that includes at least a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the steps of the aforementioned cross-modal bilateral personalized human-computer social conversation generation method, or the memory stores a computer program programmed or configured to execute the aforementioned cross-modal bilateral personalized human-computer social conversation generation method.
In addition, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, the computer program being programmed or configured to execute the aforementioned cross-modal bilateral personalized human-computer social conversation generation method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application; computer program instructions executed by a processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A cross-modal bilateral personalized man-machine social conversation generation method is characterized by comprising the following steps:
1) performing weighted fusion of the dialog context code E_C, the robot personalized information code E_T, the user personalized information code E_S and the code E_prev of the output result at the previous time to obtain the weighted fusion code O_enc;
2) inputting the weighted fusion code O_enc and the code E_prev of the output result at the previous time together into the decoder of the bilateral personalized generative model to generate the best N candidate reply list; the bilateral personalized generative model is trained in advance to establish the mapping relation between the input weighted fusion code together with the code E_prev of the output result at the previous time and the output best N candidate reply list;
3) calculating the conditional mutual information abundance value of each candidate reply in the N candidate reply list; when the conditional mutual information abundance value of each candidate reply in the N candidate reply list needs to be calculated, it is calculated according to the following formula:
$$I(X;Y\mid C,S,T)=\log\frac{P(Y\mid X,C,S,T)}{P(Y\mid C,S,T)}$$
in the above formula, I(X;Y|C,S,T) denotes the conditional mutual information abundance value, P(Y|X,C,S,T) denotes the probability of obtaining the reply text from the input text, the historical information, the other party's personalized information and the own party's personalized information, P(Y|C,S,T) denotes the probability of obtaining the reply text from the historical information, the other party's personalized information and the own party's personalized information, X denotes the input text, Y denotes the reply text, C denotes the historical information, S denotes the other party's personalized information, and T denotes the own party's personalized information;
4) selecting the candidate reply with the largest conditional mutual information abundance value as the final output result.
2. The method according to claim 1, wherein step 1) is preceded by the step of generating the dialog context code E_C: combining the personalized attribute information of both parties and the dialog context history to generate personalized interaction history information; performing word embedding coding on the personalized interaction history information and the current input of the user to obtain an embedded vector; performing position coding on the embedded vector to obtain a position coding vector; adding the embedded vector and the position coding vector to obtain the code of the dialog context description; and inputting the code of the dialog context description into the encoder of the bilateral personalized generative model to obtain the dialog context code E_C.
3. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, further comprising, before step 1), the step of generating the robot personalized information code E_T: performing word embedding coding on the preset robot personalized description to obtain an embedded vector; performing position coding on the embedded vector to obtain a position coding vector; adding the embedded vector and the position coding vector to obtain the code of the robot personalized description; and inputting the code of the robot personalized description into the encoder of the bilateral personalized generative model to obtain the robot personalized information code E_T.
4. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, further comprising, before step 1), the step of generating the user personalized information code E_S: performing face recognition on a picture of the user to extract personalized description information of the image modality; on a personalized social conversation data set of the user, which comprises labeling information of the gender, age and interest labels of both interacting parties, labeling each sentence with the labels of the speaker-related personalized information of both interacting parties, and filtering texts with inappropriate words from the personalized social conversation data set to obtain personalized description information of the text modality; combining the personalized description information of the image modality and the personalized description information of the text modality to obtain the user personalized description; performing word embedding coding on the obtained user personalized description to obtain an embedded vector, performing position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain the code of the user personalized description; and inputting the code of the user personalized description into the encoder of the bilateral personalized generative model to obtain the user personalized information code E_S.
5. The method of generating a cross-modal bilateral personalized human-machine social conversation according to claim 2, 3 or 4, wherein the position coding function is expressed as follows:

$$E_{(pos,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),\qquad E_{(pos,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

in the above formula, E_(pos,2i) denotes the position embedding code that maps a character at even dimension 2i into the sine function, E_(pos,2i+1) denotes the position embedding code that maps a character at odd dimension 2i+1 into the cosine function, pos denotes the position of the character, i denotes the word embedding dimension, d_model denotes the coding dimension, d_model = 512, pos ⊆ [1, n], and n is the content length of the word embedding coding.
6. The method for generating a cross-modal bilateral personalized human-computer social conversation according to claim 1, wherein the step of performing weighted fusion in step 1) to obtain the weighted fusion code O_enc comprises the following steps:
S1) coding the dialog context code E_C and the code E_prev of the output result at the previous time through a multi-head attention mechanism network to obtain the multi-head self-attention coded dialog context code O_C; coding the robot personalized information code E_T and the code E_prev of the output result at the previous time through a multi-head attention mechanism network to obtain the multi-head self-attention coded robot personalized information code O_T; coding the user personalized information code E_S and the code E_prev of the output result at the previous time through a multi-head attention mechanism network to obtain the multi-head self-attention coded user personalized information code O_S; and coding the code E_prev of the output result at the previous time through a masked multi-head attention mechanism network to obtain the multi-head self-attention coded code O_prev of the output result at the previous time;
S2) obtaining the weighted fusion code O_enc according to the following formula:

$$O_{enc}=\alpha\,O_{S}+\beta\,O_{T}+\gamma\,O_{C}+O_{prev}$$
In the above formula, α denotes the probability of the user personalized information appearing in the reply, β denotes the probability of the robot personalized information appearing in the reply, γ denotes the probability of no personalized information appearing in the reply, O_S denotes the multi-head self-attention coded user personalized information code, O_T denotes the multi-head self-attention coded robot personalized information code, O_C denotes the multi-head self-attention coded dialog context code, and O_prev denotes the multi-head self-attention coded code of the output result at the previous time.
7. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, wherein the bilateral personalized generative model in step 2) is a GPT network model, step 2) further comprises a step of training the GPT network model, and the joint loss function adopted in training the GPT network model is as follows:

$$L(\phi,\theta)=L_{D}(\phi)+\lambda_{1}L_{LM}(\phi)+\lambda_{2}L_{p}(\theta)$$

in the above formula, L(φ,θ) denotes the joint loss function, L_D(φ) denotes the loss function of the decoding model, L_LM(φ) denotes the loss function of the language model, L_p(θ) denotes the loss function of the personalized prediction model, and λ_1 and λ_2 are weight coefficients; the decoding model loss function L_D(φ) is expressed as follows:
$$L_{D}(\phi)=-\sum_{i}\log P_{\phi}(y_{i}\mid x_{0},\dots,x_{i-1},O_{enc})$$

in the above formula, P_φ(y_i|x_0,…,x_{i-1},O_enc) denotes the probability of predicting the next character when the weighted fusion code and the already generated characters are input into the decoder, y_i denotes the i-th character of the character string generated by the decoder, x_0,…,x_{i-1} denote the first i characters of the generated character string, O_enc denotes the weighted fusion code, and i denotes the length of the generated character string;
the language model loss function L_LM(φ) is expressed as follows:

$$L_{LM}(\phi)=-\sum_{i}\log P_{\phi}(y_{i}\mid x_{i-k},\dots,x_{i-1})$$

in the above formula, P_φ(y_i|x_{i-k},…,x_{i-1}) denotes the probability of predicting the k-th character y_i from the preceding k−1 characters x_{i−k},…,x_{i−1} within a fixed window, y_i denotes the k-th character in the fixed window, x_{i−k},…,x_{i−1} denote the preceding k−1 characters in the fixed window, k denotes the size of the context window, and i denotes the i-th position in the window;
the personalized prediction loss function L_p(θ) is expressed as follows:

$$L_{p}(\theta)=-\sum_{j}y_{j}\log P_{\theta}(y=j\mid O_{C})$$

in the above formula, y_j denotes the label of the personalized information, P_θ(y=j|O_C) denotes the probability of predicting, from the context information, that personalized information appears in the reply, and O_C denotes the multi-head self-attention coded dialog context code.
8. The cross-modal bilateral personalized human-machine social conversation generation method according to claim 1, wherein step 3) comprises: marking in advance, in the collected human-computer social conversation data, the robot reply text, the dialog history information and the personalized information of both interacting parties in each dialog segment as training positive samples of a deep neural network model; meanwhile, in order to balance the training data, constructing training negative samples by a negative sampling method, and training the deep neural network model with the binary cross entropy as its loss function.
9. A cross-modal bilateral personalized human-computer social conversation generating system comprising a computer device including at least a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the cross-modal bilateral personalized human-computer social conversation generating method according to any one of claims 1 to 8, or the memory stores therein a computer program programmed or configured to perform the cross-modal bilateral personalized human-computer social conversation generating method according to any one of claims 1 to 8.
10. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the cross-modal bilateral personalized human-computer social conversation generation method according to any one of claims 1 to 8.
CN202011046353.7A 2020-09-29 2020-09-29 Cross-modal bilateral personalized man-machine social interaction dialog generation method and system Active CN111930918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011046353.7A CN111930918B (en) 2020-09-29 2020-09-29 Cross-modal bilateral personalized man-machine social interaction dialog generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011046353.7A CN111930918B (en) 2020-09-29 2020-09-29 Cross-modal bilateral personalized man-machine social interaction dialog generation method and system

Publications (2)

Publication Number Publication Date
CN111930918A CN111930918A (en) 2020-11-13
CN111930918B true CN111930918B (en) 2020-12-18

Family

ID=73333707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011046353.7A Active CN111930918B (en) 2020-09-29 2020-09-29 Cross-modal bilateral personalized man-machine social interaction dialog generation method and system

Country Status (1)

Country Link
CN (1) CN111930918B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7329585B2 (en) 2021-05-24 2023-08-18 ネイバー コーポレーション Persona chatbot control method and system

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342947B (en) * 2021-05-26 2022-03-15 华南师范大学 Multi-round dialog text generation method capable of sensing dialog context relative position information
CN113781853B (en) * 2021-08-23 2023-04-25 安徽教育出版社 Teacher-student remote interactive education platform based on terminal
CN114996431B (en) * 2022-08-01 2022-11-04 湖南大学 Man-machine conversation generation method, system and medium based on mixed attention
CN116580445B (en) * 2023-07-14 2024-01-09 江西脑控科技有限公司 Large language model face feature analysis method, system and electronic equipment
CN117131426B (en) * 2023-10-26 2024-01-19 一网互通(北京)科技有限公司 Brand identification method and device based on pre-training and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488662B (en) * 2016-01-07 2021-09-03 北京华品博睿网络技术有限公司 Online recruitment system based on bidirectional recommendation
WO2018097091A1 (en) * 2016-11-25 2018-05-31 日本電信電話株式会社 Model creation device, text search device, model creation method, text search method, data structure, and program
CN108320218B (en) * 2018-02-05 2020-12-11 湖南大学 Personalized commodity recommendation method based on trust-score time evolution two-way effect
CN108920497B (en) * 2018-05-23 2021-10-15 北京奇艺世纪科技有限公司 Man-machine interaction method and device
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium
CN111708874B (en) * 2020-08-24 2020-11-13 湖南大学 Man-machine interaction question-answering method and system based on intelligent complex intention recognition


Also Published As

Publication number Publication date
CN111930918A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN111930918B (en) Cross-modal bilateral personalized man-machine social interaction dialog generation method and system
CN111897933B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN109447242B (en) Image description regeneration system and method based on iterative learning
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN107818084B (en) Emotion analysis method fused with comment matching diagram
CN110688832B (en) Comment generation method, comment generation device, comment generation equipment and storage medium
CN111783455B (en) Training method and device of text generation model, and text generation method and device
CN114995657B (en) Multimode fusion natural interaction method, system and medium for intelligent robot
CN112182161B (en) Personalized dialogue generation method and system based on user dialogue history
CN111581970B (en) Text recognition method, device and storage medium for network context
CN114021524B (en) Emotion recognition method, device, equipment and readable storage medium
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN113254625A (en) Emotion dialogue generation method and system based on interactive fusion
CN113065344A (en) Cross-corpus emotion recognition method based on transfer learning and attention mechanism
WO2021135457A1 (en) Recurrent neural network-based emotion recognition method, apparatus, and storage medium
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
CN112214585A (en) Reply message generation method, system, computer equipment and storage medium
WO2023231576A1 (en) Generation method and apparatus for mixed language speech recognition model
CN111382257A (en) Method and system for generating dialog context
CN112183106A (en) Semantic understanding method and device based on phoneme association and deep learning
CN114386426B (en) Gold medal speaking skill recommendation method and device based on multivariate semantic fusion
CN114444510A (en) Emotion interaction method and device and emotion interaction model training method and device
US20230290371A1 (en) System and method for automatically generating a sign language video with an input speech using a machine learning model
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230629

Address after: 410001 No. 002, Floor 5, Building B, No. 10, Zone 2, CSCEC Smart Industrial Park, No. 50, Jinjiang Road, Yuelu Street, Yuelu District, Changsha, Hunan Province

Patentee after: Hunan Xinxin Xiangrong Intelligent Technology Co.,Ltd.

Address before: Yuelu District City, Hunan province 410082 Changsha Lushan Road No. 1

Patentee before: HUNAN University

TR01 Transfer of patent right