CN111930918B - Cross-modal bilateral personalized man-machine social interaction dialog generation method and system - Google Patents
- Publication number: CN111930918B (application CN202011046353.7A)
- Authority
- CN
- China
- Prior art keywords
- personalized
- information
- coding
- code
- bilateral
- Legal status: Active
Classifications
- G06F16/3329 — Natural language query formulation or dialogue systems
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
Abstract
The invention discloses a cross-modal bilateral personalized man-machine social interaction dialog generation method and system. The invention fuses personalized information in a cross-modal manner and takes into account the personalized information of the characters on both sides of the interaction; on the premise of ensuring reasonable reply content, fluent grammar, and coherent logic, it makes full use of the personalized features of both interacting parties and can generate personalized replies that vary from person to person.
Description
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to a cross-modal bilateral personalized human-computer social interaction generation method and system.
Background
With the progress of science and technology, human-computer interaction is developing toward intelligence and personalization, and interaction between people and robots increasingly resembles interaction between people in the real world. Traditional man-machine social dialog generation belongs to the field of natural language processing and mainly studies how a robot can make a natural reply based on a user's text input. In interpersonal social contact, by contrast, vision is a primary sensory source through which people receive external information, and people make natural personalized expressions based on that information. Therefore, to make a robot more "human-like", in man-machine socialization the robot needs a certain perception capability in addition to understanding the user's language and producing natural, personalized replies. How to fuse computer vision with natural language processing and study cross-modal fusion encoding of information from different modalities such as text and images, so that the robot generates "human-like" personalized expressions from acquired user images and natural language, has become one of the challenges of natural man-machine interaction.
Personalized man-machine interaction mainly studies how a robot generates replies consistent with a preset personality or set of characteristics, i.e., replies with single-side personalization consistency. In recent years, research on human-computer interaction based on deep learning has developed rapidly and is increasingly used in daily life; well-known examples such as Microsoft XiaoIce and Apple Siri all have preset personality and characteristic attributes. In the field of personalized interaction, progress based on pre-training methods has advanced the state of the art on a series of natural language processing tasks, and recent research attempts to solve personalized dialog generation in a data-driven manner: learning character-related features directly from large-scale dialog data sets, encoding the sparse personalized interaction information of the characters, and then fine-tuning a pre-trained model to generate responses that conform to the speaker's personality.
Nowadays, social chat robots best embody personalized interaction and can generate personalized replies rich in character traits. At present, chat robots are mainly text-based: during man-machine interaction, the chat robot cannot see the user and cannot acquire the user's personalized information, but the robot's own personalized attribute information can be predefined, so that personalized expressions related to the chat robot can be generated by embedding and representing the robot's personalized information.
However, in social interaction between real people, both parties can acquire each other's personalized information, and when replying, the respondent not only focuses on his or her own personalized expression but also considers the other party's personalized characteristics and questions. Man-machine interaction that ignores the user's personalized information can therefore feel impersonal and provoke aversion, reducing the user experience. Hence, cross-modal bilateral personalized man-machine interaction technology needs to be developed for natural robot interaction.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: aiming at the problems in the prior art, the invention provides a cross-modal bilateral personalized man-machine social conversation generation method and system.
In order to solve the technical problems, the invention adopts the technical scheme that:
a cross-modal bilateral personalized man-machine social conversation generation method comprises the following steps:
1) performing weighted fusion of the dialog context encoding E_C, the robot personalized information encoding E_T, the user personalized information encoding E_S, and the encoding E_prev of the output result at the previous time step, to obtain the weighted fusion encoding O_enc;
2) inputting the weighted fusion encoding O_enc together with the previous-output encoding E_prev into the decoder of the bilateral personalized generation model to generate a list of the best N candidate replies; the bilateral personalized generation model is trained in advance to establish the mapping between the input (the weighted fusion encoding and the previous-output encoding E_prev) and the output list of the best N candidate replies;
3) calculating the conditional mutual information abundance value of each candidate reply in the list of N candidate replies;
4) selecting the candidate reply with the largest conditional mutual information abundance value as the final output result.
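The four steps above can be illustrated with a minimal Python sketch; the helper callables (fuse, decode_topn, cmim_score) are hypothetical stand-ins for the fusion module, decoder, and conditional-mutual-information scorer described below, not the patented implementation itself:

```python
# Sketch of the four-step bilateral personalized reply generation loop.
# fuse / decode_topn / cmim_score are hypothetical stand-ins for the
# components described in the text.

def generate_reply(E_C, E_T, E_S, E_prev, fuse, decode_topn, cmim_score, n=5):
    """E_C: dialog context encoding, E_T: robot personalized encoding,
    E_S: user personalized encoding, E_prev: previous-output encoding."""
    O_enc = fuse(E_C, E_T, E_S, E_prev)                # 1) weighted fusion
    candidates = decode_topn(O_enc, E_prev, n=n)       # 2) best-N candidates
    scored = [(cmim_score(y), y) for y in candidates]  # 3) CMIM abundance
    return max(scored)[1]                              # 4) largest value wins
```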
Optionally, step 1) is preceded by a step of generating the dialog context encoding E_C: combining the personalized attribute information of both parties with the dialog context history to generate personalized interaction history information; performing word-embedding encoding on the personalized interaction history information and the user's current input to obtain embedded vectors; position-encoding the embedded vectors to obtain position encoding vectors; and adding the embedded vectors and the position encoding vectors to obtain the encoding of the dialog context description. The encoding of the dialog context description is input into the encoder of the bilateral personalized generation model to obtain the dialog context encoding E_C.
Optionally, step 1) is preceded by a step of generating the robot personalized information encoding E_T: performing word-embedding encoding on the preset robot personalized description to obtain an embedded vector; position-encoding the embedded vector to obtain a position encoding vector; and adding the embedded vector and the position encoding vector to obtain the encoding of the robot personalized description. The encoding of the robot personalized description is input into the encoder of the bilateral personalized generation model to obtain the robot personalized information encoding E_T.
Optionally, step 1) is preceded by a step of generating the user personalized information encoding E_S: performing face recognition on a picture of the user to extract the personalized description information of the image modality; constructing a personalized social interaction data set that comprises annotation information for the gender, age, and interest tags of both interacting parties, in which each sentence is labeled with the tags of the two parties' personalized information relevant to the speaker, and filtering texts with inappropriate words out of the personalized social interaction data set to obtain the personalized description information of the text modality; combining the personalized description information of the image modality with that of the text modality to obtain the user personalized description; performing word-embedding encoding on the obtained user personalized description to obtain an embedded vector, position-encoding the embedded vector to obtain a position encoding vector, and adding the two to obtain the encoding of the user personalized description; and inputting the encoding of the user personalized description into the encoder of the bilateral personalized generation model to obtain the user personalized information encoding E_S.
Optionally, the functional expressions for position coding are as follows:

E_(pos,2i) = sin(pos / 10000^(2i/d_model))
E_(pos,2i+1) = cos(pos / 10000^(2i/d_model))

In the above formulas, E_(pos,2i) denotes the position-embedding code at dimension 2i, mapped by a sine function, and E_(pos,2i+1) the position-embedding code at dimension 2i+1, mapped by a cosine function; pos denotes the position of the character, i the word-embedding dimension index, and d_model the encoding dimension, with d_model = 512, pos ∈ [1, n], and n the content length of the word-embedding encoding.
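A minimal NumPy sketch of this sinusoidal position encoding (the standard Transformer formulation, which the description above matches); the function name and array layout are illustrative:

```python
import numpy as np

def positional_encoding(n: int, d_model: int = 512) -> np.ndarray:
    """Return an (n, d_model) matrix: sine on even dims, cosine on odd dims."""
    pos = np.arange(1, n + 1)[:, None]       # character positions 1..n
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angle)              # E_(pos,2i)
    pe[:, 1::2] = np.cos(angle)              # E_(pos,2i+1)
    return pe

# The position encoding vector is added to the word-embedding vectors, e.g.:
# encoded = word_embeddings + positional_encoding(len(tokens))
```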
Optionally, the weighted fusion in step 1) to obtain the weighted fusion encoding O_enc comprises:

S1) passing the dialog context encoding E_C and the previous-output encoding E_prev through a multi-head attention network to obtain the multi-head self-attention-encoded dialog context encoding O_C; passing the robot personalized information encoding E_T and E_prev through a multi-head attention network to obtain the multi-head self-attention-encoded robot personalized information encoding O_T; passing the user personalized information encoding E_S and E_prev through a multi-head attention network to obtain the multi-head self-attention-encoded user personalized information encoding O_S; and passing the previous-output encoding E_prev through a masked multi-head attention network to obtain the multi-head self-attention-encoded previous-output encoding O_prev;

S2) obtaining the weighted fusion encoding O_enc according to the following formula:

O_enc = α·O_S + β·O_T + γ·O_C

In the above formula, α denotes the probability that the user personalized information appears in the reply, β the probability that the robot personalized information appears in the reply, and γ the probability that no personalized information appears in the reply; O_S is the multi-head self-attention-encoded user personalized information encoding, O_T the multi-head self-attention-encoded robot personalized information encoding, O_C the multi-head self-attention-encoded dialog context encoding, and O_prev the multi-head self-attention-encoded previous-output encoding.
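A PyTorch sketch of this fusion step, assuming O_S, O_T, O_C share one shape and that the three weights come from a softmax over the personalized prediction head's logits (as described later); all names are illustrative:

```python
import torch

def weighted_fusion(O_S: torch.Tensor, O_T: torch.Tensor,
                    O_C: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """logits: 3-way scores from the personalized prediction head.
    Softmax yields alpha (user info), beta (robot info), gamma (none)."""
    alpha, beta, gamma = torch.softmax(logits, dim=-1)
    # O_enc = alpha*O_S + beta*O_T + gamma*O_C (reconstructed fusion form)
    return alpha * O_S + beta * O_T + gamma * O_C
```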
Optionally, the bilateral personalized generation model in step 2) is a GPT network model, and step 2) is further preceded by a step of training the GPT network model; the joint loss function adopted when training the GPT network model is as follows:

L(φ, θ) = L_D(φ) + λ1·L_LM(φ) + λ2·L_p(θ)

In the above formula, L(φ, θ) denotes the joint loss function, L_D(φ) the loss function of the decoding model, L_LM(φ) the loss function of the language model, and L_p(θ) the loss function of the personalized prediction model; λ1 and λ2 are weight coefficients. The decoding-model loss function L_D(φ) is expressed as:

L_D(φ) = −Σ_i log P_φ(y_i | x_0, …, x_(i−1), O_enc)

In the above formula, P_φ(y_i | x_0, …, x_(i−1), O_enc) denotes the probability of predicting the next character when the weighted fusion encoding and the already-generated characters are input into the decoder; y_i denotes the i-th character generated by the decoder, x_0 ~ x_(i−1) the first i characters of the string generated by the decoder, O_enc the weighted fusion encoding, and i the length of the generated string.

The language-model loss function L_LM(φ) is expressed as:

L_LM(φ) = −Σ_i log P_φ(y_i | x_(i−k), …, x_(i−1))

In the above formula, P_φ(y_i | x_(i−k), …, x_(i−1)) denotes the probability of predicting the k-th character y_i within a fixed window from the preceding k−1 characters x_(i−k) ~ x_(i−1); y_i denotes the k-th character in the fixed window, x_(i−k) ~ x_(i−1) the preceding k−1 characters, k the context window size, and i the position within the window.

The personalized prediction loss function L_p(θ) is expressed as:

L_p(θ) = −Σ_j y_j·log P_θ(y = j | O_C)

In the above formula, y_j denotes the personalized information label and P_θ(y = j | O_C) the probability of predicting that personalized information appears in the reply based on the context information; O_C denotes the multi-head self-attention-encoded dialog context encoding.
Optionally, step 3) comprises: labeling in advance, for each dialog in the collected human-machine social conversation data, the robot reply text, the dialog history information, and the personalized information of both interacting parties, and using them as positive training samples of a deep neural network model; meanwhile, in order to balance the training data, constructing negative training samples by a negative sampling method, using the binary cross entropy as the loss function of the deep neural network model, and training the deep neural network model. When the conditional mutual information abundance value of each candidate reply in the list of N candidate replies needs to be calculated, it is calculated according to the following formula:

I(X; Y | C, S, T) = log [ P(Y | X, C, S, T) / P(Y | C, S, T) ]

In the above formula, I(X; Y | C, S, T) denotes the conditional mutual information abundance value; P(Y | X, C, S, T) denotes the probability of obtaining the reply text from the input text, the history information, the other party's personalized information, and one's own personalized information; P(Y | C, S, T) denotes the probability of obtaining the reply text from the history information, the other party's personalized information, and one's own personalized information; X denotes the input text, Y the reply text, C the history information, S the other party's personalized information, and T one's own personalized information.
In addition, the present invention also provides a cross-modal bilateral personalized human-computer social conversation generating system, including a computer device, the computer device at least includes a microprocessor and a memory connected with each other, the microprocessor is programmed or configured to execute the steps of the cross-modal bilateral personalized human-computer social conversation generating method, or the memory stores a computer program programmed or configured to execute the cross-modal bilateral personalized human-computer social conversation generating method.
Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to perform the cross-modal bilateral personalized human-machine social conversation generation method.
Compared with the prior art, the invention has the following advantages. The invention performs weighted fusion of the dialog context encoding E_C, the robot personalized information encoding E_T, the user personalized information encoding E_S, and the encoding E_prev of the output result at the previous time step to obtain the weighted fusion encoding O_enc; inputs the weighted fusion encoding O_enc together with the previous-output encoding E_prev into the decoder of the bilateral personalized generation model to generate a list of the best N candidate replies; calculates the conditional mutual information abundance value of each candidate reply in the list; and selects the candidate reply with the largest conditional mutual information abundance value as the final output result. By fusing personalized information in a cross-modal manner and taking into account the personalized information of the characters on both sides of the interaction, the invention makes full use of the personalized features of both interacting parties on the premise of ensuring reasonable reply content, fluent grammar, and coherent logic, and can generate personalized replies that vary from person to person. Targeting the generation of robot natural-language replies that are rich in personality and differ from person to person during man-machine interaction, the invention constructs a cross-modal personalized interaction model and adopts pre-training technology; compared with traditional methods, it can significantly improve the chat quality and social experience of man-machine interaction.
Drawings
FIG. 1 is a core flow diagram of a method according to an embodiment of the present invention.
FIG. 2 is a schematic view of a complete flow of the method according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a framework structure of a bilateral personalized generative model according to an embodiment of the present invention.
FIG. 4 is a diagram of a frame structure of an encoder according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a framework structure of a dynamic weighting and fusing module for encoding information according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a frame structure of a decoder according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of variable cross-modal information coding fusion in an embodiment of the present invention.
FIG. 8 is a diagram illustrating comparison between output results of different methods according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the method for generating a cross-modal bilateral personalized human-computer social conversation according to the present embodiment includes:
1) performing weighted fusion of the dialog context encoding E_C, the robot personalized information encoding E_T, the user personalized information encoding E_S, and the encoding E_prev of the output result at the previous time step, to obtain the weighted fusion encoding O_enc;
2) inputting the weighted fusion encoding O_enc together with the previous-output encoding E_prev into the decoder of the bilateral personalized generation model to generate a list of the best N candidate replies; the bilateral personalized generation model is trained in advance to establish the mapping between the input (the weighted fusion encoding and the previous-output encoding E_prev) and the output list of the best N candidate replies;
3) calculating the conditional mutual information abundance value of each candidate reply in the list of N candidate replies;
4) selecting the candidate reply with the largest conditional mutual information abundance value as the final output result.
Referring to fig. 2, fig. 3 and fig. 4, this embodiment further includes, before step 1), a step of generating the dialog context encoding E_C: combining the personalized attribute information of both parties with the dialog context history to generate personalized interaction history information; performing word-embedding encoding on the personalized interaction history information and the user's current input to obtain embedded vectors; position-encoding the embedded vectors to obtain position encoding vectors; and adding the embedded vectors and the position encoding vectors to obtain the encoding of the dialog context description. The encoding of the dialog context description is input into the encoder of the bilateral personalized generation model to obtain the dialog context encoding E_C.
Referring to fig. 2, fig. 3 and fig. 4, this embodiment further includes, before step 1), a step of generating the robot personalized information encoding E_T: performing word-embedding encoding on the preset robot personalized description to obtain an embedded vector; position-encoding the embedded vector to obtain a position encoding vector; and adding the embedded vector and the position encoding vector to obtain the encoding of the robot personalized description. The encoding of the robot personalized description is input into the encoder of the bilateral personalized generation model to obtain the robot personalized information encoding E_T.
Referring to fig. 2, fig. 3 and fig. 4, this embodiment further includes, before step 1), a step of generating the user personalized information encoding E_S: performing face recognition on a picture of the user to extract the personalized description information of the image modality; constructing a personalized social interaction data set that comprises annotation information for the gender, age, and interest tags of both interacting parties, in which each sentence is labeled with the tags of the two parties' personalized information relevant to the speaker, and filtering texts with inappropriate words out of the personalized social interaction data set to obtain the personalized description information of the text modality; combining the personalized description information of the image modality with that of the text modality to obtain the user personalized description; performing word-embedding encoding on the obtained user personalized description to obtain an embedded vector, position-encoding the embedded vector to obtain a position encoding vector, and adding the two to obtain the encoding of the user personalized description; and inputting the encoding of the user personalized description into the encoder of the bilateral personalized generation model to obtain the user personalized information encoding E_S.
In this embodiment, the step of extracting the personalized description information of the image modality comprises: using a face enhancement toolkit (such as the FaceLib open-source toolkit) to enhance and preprocess the face data, and using a face-position and feature-point extraction model (such as a MobileNetV1 pre-trained model) to extract the face position and feature-point positions in the picture, where the extracted face position consists of the upper-left and lower-right corner coordinates of the extraction box, and the feature points comprise the coordinates of the left pupil, right pupil, nose tip, left mouth corner, and right mouth corner; the extracted face image is then rectified and cropped with designed parameters to obtain standard face images of uniform size. The obtained standard face picture is processed with a standardization method and input into an age-and-gender face recognition classification model to extract the age and gender classification results. For example, the standard face image is input into a ShuffleNet pre-trained model for fine-tuning, training is performed by optimizing the loss function shown in the following formula, and the age and gender classification results are obtained after forward propagation of the standard face image:
in the above formula, the results of age and sex classification are used respectivelyAndit is shown that,P(a,g|I,B) Representing the probability of the predicted age and gender of the face image selected from the face bounding box,Iis an image of a face, and is,Bis a face bounding box. In addition, when extracting the personalized description information of the text modality, the personalized social interaction data set is filtered by adopting a regular method when filtering the text with the inappropriate words so as to improve the efficiency.
When performing word-embedding encoding on the obtained user personalized description, the word-embedding encoding step for the personalized description information of the image modality comprises: constructing, from the obtained age and gender classification results, the user personalized description information in key-value-pair form and the personalized attribute feature information in character-string form, and obtaining the encodings of the description information and of the attribute feature information by the corresponding word-embedding encoding operations; the result is expressed as the following formula.

In the above formula, E_v^U is the encoding of the user personalized description information; each e_v^i is the user personalized description encoding information in key-value-pair form extracted from the image modality, where i = 1, 2 index the encoding vectors of the age and gender key-value pairs respectively, s is the encoding vector of the key, and v is the encoding vector of the value. The user's personalized attribute information, consisting of the age and gender character strings, is extracted and word-embedding encoded according to the following formulas.

In the above formulas, G_v^U and A_v^U are respectively the gender and age personalized attribute encodings of the user extracted from the image modality; g_v^i and a_v^i respectively denote the encoding at position i of the gender and age attributes extracted from the image modality; m denotes the total length of the user personalized description information encoding E_v^U, and i ∈ [1, m].
When performing word-embedding encoding on the obtained user personalized description, the word-embedding encoding step for the personalized description information of the text modality comprises: performing the word-embedding operation on the user and robot personalized description information contained in the text modality, expressed in key-value-pair form as shown in the following formula.

In the above formula, E_t^U is the encoding information of the user personalized description extracted from the text modality, and E_t^R is the encoding information of the robot personalized description extracted from the text modality; each e_t^i is a personalized description encoding in key-value-pair form extracted from the text modality, where i = 1, 2, 3 index the encoding vectors of the age, gender, and interest-tag key-value pairs respectively. The word embedding of the user's current text information is treated in the same way, and the robot text information is processed identically (denoted with subscript R). All embedded information is prescribed the same embedding length: if the corresponding personalized information does not reach the total sentence length, it is padded with the "<PAD>" placeholder; if it exceeds the total sentence length, a truncation operation is taken and only the first part of the input information is kept. Taking user input as an example, the encoding information of a complete sentence of dialog input by the user is:

In the above formula, X^U is the encoding information of the complete sentence of dialog input by the user, and x_i^U is the word-embedding encoding vector of the i-th word in the sentence input by the user.
The values of the gender, age, and interest tags are each word-embedding encoded; when a user has multiple interest tags, the encodings are averaged, as shown in the following formula.

In the above formula, G_t^U denotes the encoding information of the user gender tag extracted from the text modality, A_t^U the encoding information of the user age tag, and T_t^U the encoding information of the user interest tags; g_t^Ui is the word-embedding encoding vector, at position i of the input, of the text-extracted user gender information, a_t^Ui the word-embedding encoding vector of the text-extracted user age information, and t_t^Ui the word-embedding encoding vector of the text-extracted user interest tag (when there are multiple interest tags, the multiple embedding encoding vectors are averaged); i ∈ [1, n], where n denotes the total length of the sentence. Because personalized information may be missing from the text modality, the personalized attribute information extracted from the image supplements the text's missing attributes such as gender and age, and errors in the text's personalized attribute information are corrected by cross-modal information correction. In this embodiment, when performing word-embedding encoding on the obtained user personalized description, the age and gender information in the user personalized description information is further supplemented with information from the other modality, as shown in the following formula.
in the above formula, the first and second carbon atoms are,E U extracting the personalized description coding information of the supplementary user from the image and text information,E v U encoded information extracted for the user's personalized description for the text modality,E t U encoded information of the personalized description of the robot extracted for the text modality,and adding the coded information of different modes, and supplementing the personalized description information lacking in the text mode of the user by the personalized description information extracted by the image mode.Indicates that the age-key-value pair encoding information of the user is supplemented from the text and image information,indicates that the user's gender-key-value pair encoding information is supplemented from the text and image information,e 3 t the key value pair of interest tag representing the user from the text information encodes the information. The personalized description of the robot is predefined and does not need to be obtained from the image modality, soE R =E t R WhereinE R Representing the robot personalization descriptive information,E t R and robot personalized description information representing text extraction.
The age and gender information in the user's personalized attribute information is supplemented and fused from the different modality information, as shown in the following formula.

In the above formula, G^U is the user gender encoding information after fusing the image and text information, G_v^U the gender encoding information extracted from the image, and G_t^U the user gender encoding information extracted from the text; the fused user gender encoding information covers each of the n positions corresponding to the total length of the user's text input. A^U denotes the user age encoding information fused from the image and text information, A_v^U the age encoding information extracted from the image, and A_t^U the user age encoding information extracted from the text, likewise fused at each of the n positions. The encoding information of the different modalities is added, so that the personalized information missing from the user's text modality is fused with the personalized information extracted from the image modality.
The personalized information embedded representation of the current user is defined as D^U, composed of the user dialog sentence, user age information, user gender information, and user interest tags; the robot personalized information embedded representation is likewise defined as D^R, constructed in the same way.

In the above formula, D^U denotes the embedded representation of the current user's personalized information and D^R the embedded representation of the robot's personalized information; G^U is the user gender encoding information fused from the image and text information, A^U the user age encoding information fused from the image and text information, and T_t^U the user interest encoding information extracted from the text modality; X^U denotes the user's dialog sentence and X^R the robot's dialog sentence; G_t^R, A_t^R, and T_t^R denote the gender, age, and interest encoding information of the robot extracted from the text modality.

In the above formula, E_C is the historical dialog context encoding information, formed by adding the encoded user personalized information, robot personalized information, and dialog history information. Each C_j denotes the history information of the j-th round, l denotes the total number of rounds of the historical dialog, and j ∈ [1, l]; D^U denotes the embedded representation of the current user's personalized information and D^R that of the robot.
In this embodiment, the embedded vector is position-encoded to obtain the position encoding vector, and the functional expressions for position encoding are as follows:

E_(pos,2i) = sin(pos / 10000^(2i/d_model))
E_(pos,2i+1) = cos(pos / 10000^(2i/d_model))

In the above formulas, E_(pos,2i) denotes the position-embedding code at dimension 2i, mapped by a sine function, and E_(pos,2i+1) the position-embedding code at dimension 2i+1, mapped by a cosine function; pos denotes the position of the character, i the word-embedding dimension index, and d_model the encoding dimension, with d_model = 512, pos ∈ [1, n], and n the content length of the word-embedding encoding. Each embedded vector is added to its corresponding position encoding vector to obtain the final cross-modal vector representation, which is input into the encoder of the bilateral personalized generation model for the encoding operation.
Referring to fig. 5, in this embodiment the weighted fusion in step 1) to obtain the weighted fusion encoding O_enc comprises the following steps:

S1) passing the dialog context encoding E_C and the previous-output encoding E_prev through a multi-head attention network to obtain the multi-head self-attention-encoded dialog context encoding O_C; passing the robot personalized information encoding E_T and E_prev through a multi-head attention network to obtain the multi-head self-attention-encoded robot personalized information encoding O_T; passing the user personalized information encoding E_S and E_prev through a multi-head attention network to obtain the multi-head self-attention-encoded user personalized information encoding O_S; and passing the previous-output encoding E_prev through a masked multi-head attention network to obtain the multi-head self-attention-encoded previous-output encoding O_prev. These can be respectively expressed as:

O_C = MultiHead(E_C, E_prev), O_T = MultiHead(E_T, E_prev),
O_S = MultiHead(E_S, E_prev), O_prev = MaskedMultiHead(E_prev)

S2) obtaining the weighted fusion encoding O_enc according to the following formula:

O_enc = α·O_S + β·O_T + γ·O_C

In the above formula, α denotes the probability that the user personalized information appears in the reply, β the probability that the robot personalized information appears in the reply, and γ the probability that no personalized information appears in the reply; O_S, O_T, O_C, and O_prev are the multi-head self-attention-encoded user personalized information encoding, robot personalized information encoding, dialog context encoding, and previous-output encoding, respectively. In this embodiment, the prediction for each category is passed through a softmax operation, and the prediction results are taken as the weight values of the weighted fusion, where α, β, and γ are expressed as follows:

α = P_θ(y = 0 | O_C), β = P_θ(y = 1 | O_C), γ = P_θ(y = 2 | O_C)

In the above formula, O_C is the multi-head self-attention-encoded dialog context encoding; y = 0 indicates that the personalized information in the reply relates to the other party, α being the probability that the user personalized information appears in the reply; y = 1 indicates that the reply's personalized information relates to the robot itself, β being the probability that the robot personalized information appears in the reply; and y = 2 indicates that the reply depends only on the context and contains no personalized information, γ being the probability that no personalized information appears in the reply.
Referring to FIG. 6, the decoder of the bilateral personalized generation model is used to generate the list of the best N candidate replies (the output results) from the weighted fusion encoding O_enc and the previous-output encoding E_prev. As an optional implementation, the bilateral personalized generation model in step 2) of this embodiment is a GPT network model; since the GPT network model is an existing network, it is not described further. The encoder and decoder of the bilateral personalized generation model share parameters, which need to be trained before use; the processing steps for the training data set used when training the encoder and decoder parameters are the same as step 1) and its preprocessing steps, and yield the weighted fusion encoding O_enc and previous-output encoding E_prev of the data samples. On this basis, the weighted fusion encoding O_enc and the previous-output encoding E_prev are taken as input when training the encoder and decoder parameters; using a multi-task learning mechanism, the loss functions of the language model, the personalized prediction model, and the decoding model in the GPT network model are optimized simultaneously, and the encoder and decoder of the bilateral personalized generation model are trained to optimal parameters, completing the training of the bilateral personalized generation model. The personalized description of the current speaker is defined as T, with own-side personalized description encoding E_T; the personalized description of the other party is S, with encoding E_S. When modeling the bilateral personalized interaction model, two cases are distinguished. Case a: when the speaker is the robot, T is the robot's personalized description, the own-side personalized description encoding is E_T = E_R, and encoding is performed through the multi-head attention network. Case b: in model training, the social interaction data must be fully utilized; when the speaker is the user, T is the user's personalized description, the own-side personalized description encoding is E_T = E_U, and the other party's personalized description encoding is E_S = E_R; in this case it suffices to interchange O_S and O_T and compute with the attention mechanism.
Step 2) is further preceded by a step of training the GPT network model. As a preferred implementation, the joint loss function adopted when training the GPT network model in this embodiment is as follows:

L(φ, θ) = L_D(φ) + λ1·L_LM(φ) + λ2·L_p(θ)

In the above formula, L(φ, θ) denotes the joint loss function, L_D(φ) the loss function of the decoding model, L_LM(φ) the loss function of the language model, and L_p(θ) the loss function of the personalized prediction model; λ1 and λ2 are weight coefficients (set empirically; in this embodiment λ1 = 0.2 and λ2 = 0.5). The decoding-model loss function L_D(φ) is expressed as:

L_D(φ) = −Σ_i log P_φ(y_i | x_0, …, x_(i−1), O_enc)

In the above formula, P_φ(y_i | x_0, …, x_(i−1), O_enc) denotes the probability of predicting the next character when the weighted fusion encoding and the already-generated characters are input into the decoder; y_i denotes the i-th character generated by the decoder, x_0 ~ x_(i−1) the first i characters of the string generated by the decoder after the encoder has encoded the given input text (they serve as the input of the bilateral personalized Transformer decoder), O_enc the weighted fusion encoding, and i the length of the generated string.

The language-model loss function L_LM(φ) is expressed as:

L_LM(φ) = −Σ_i log P_φ(y_i | x_(i−k), …, x_(i−1))

In the above formula, P_φ(y_i | x_(i−k), …, x_(i−1)) denotes the probability of predicting the k-th character y_i within a fixed window from the preceding k−1 characters x_(i−k) ~ x_(i−1); k denotes the context window size and i the position within the window. This embodiment initializes the encoder and decoder with an existing pre-trained model parameter set φ and trains the language model by optimizing the standard maximum log-likelihood loss; x_(i−k), …, x_(i−1) is a coded sequence sampled from the training corpus.
Referring to fig. 3, this embodiment designs a dynamic weighted fusion module for the encoded information: the dialog context information encoding is input into the personalized prediction model to predict the probability of personalized information appearing in the reply sentence, casting the fusion of the input encoding information as a three-class personalized-information prediction task that dynamically weights and fuses the context, the personalized information of the listener on the other side, and the personalized information of the speaker on the own side. The personalized prediction loss function L_p(θ) represents the loss of predicting, from the context information, the probability that personalized information appears in the reply; it is expressed as:

L_p(θ) = −Σ_j y_j·log P_θ(y = j | O_C)

In the above formula, y_j denotes the personalized information label and P_θ(y = j | O_C) the probability of predicting that personalized information appears in the reply based on the context information; O_C denotes the multi-head self-attention-encoded dialog context encoding, and:

P_θ(y = j | O_C) = exp(O_C(j)) / Σ_i exp(O_C(i))

In the above formula, O_C(j) denotes the context encoding information classification result corresponding to the j-th personalized label, and O_C(i) the context encoding information classification result corresponding to the i-th personalized label.
As can be seen from the foregoing, in order to achieve a better model effect, training in this embodiment adopts a multi-task learning approach to fine-tune on the constructed personalized interactive text data set: not only is the loss function of the language model optimized, but the personalized prediction loss function L_p(θ) and the decoding-model loss function L_D(φ) are optimized as well.

The decoding-model loss function L_D(φ) can also be expressed in the following form.

In the above formula, x_0, …, x_(i−1) are the first i characters of the string generated by the decoder after the encoder has encoded the given input text, and serve as the input of the decoder of the bilateral personalized generation model; y_i is the i-th character of the string generated by the decoder. After the output results of the encoder of the bilateral personalized generation model are weighted and fused, the above formula can be expressed in the form of the decoding-model loss function L_D(φ) given earlier.
After the personalized prediction model is trained, the decoder of the bilateral personalized generation model can generate the list of the best N candidate replies for the input weighted fusion encoding O_enc and previous-output encoding E_prev. In this embodiment, the decoder generates the best top-5 candidate reply list through the bilateral personalized model by means of beam search; using the conditional maximum mutual information (CMIM) principle, the sentences in the candidate reply list are sorted by the size of their conditional mutual information abundance value (measuring the history and both parties' personalized information contained in the replies), the sentence with the largest conditional mutual information abundance value is selected, normalized with a text post-processing method, and output as a suitable reply satisfying the personalized features of both interacting parties.
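A sketch tying beam search to the CMIM re-ranking just described; decoder.beam_search, the co-occurrence model, and the post-processing step are hypothetical interfaces:

```python
import math

def respond(decoder, cooc_model, postprocess, O_enc, E_prev, beam_size=5):
    """decoder.beam_search: returns (text, logprob) pairs for the top-N
    candidates, where logprob = log P(Y|X,C,S,T).
    cooc_model(text): co-occurrence probability P(Y,C,S,T) from the
    trained deep neural network (P(C,S,T) cancels across candidates).
    postprocess: text normalization applied to the selected reply."""
    candidates = decoder.beam_search(O_enc, E_prev, beam_size=beam_size)
    best_text, _ = max(
        ((t, lp - math.log(cooc_model(t))) for t, lp in candidates),
        key=lambda pair: pair[1])
    return postprocess(best_text)
```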
In this embodiment, step 3) comprises: labeling in advance, for each dialog in the collected human-machine social conversation data, the robot reply text, the dialog history information, and the personalized information of both interacting parties as positive training samples of the deep neural network model; meanwhile, in order to balance the training data, constructing negative training samples by a negative sampling method and training the deep neural network model with the binary cross entropy as its loss function. When the conditional mutual information abundance value of each candidate reply in the list of N candidate replies needs to be calculated, it is calculated according to the following formula:

I(X; Y | C, S, T) = log [ P(Y | X, C, S, T) / P(Y | C, S, T) ]

In the above formula, I(X; Y | C, S, T) denotes the conditional mutual information abundance value; P(Y | X, C, S, T) denotes the probability that, given the personalized information of both parties and the history information, the input text generates the reply text through the encoder and decoder; the denominator is evaluated through the co-occurrence probability, in the dialog corpus, of the reply text, the history information, the other party's personalized information, and the own-side personalized information; X denotes the input text, Y the reply text, C the history information, S the other party's personalized information, and T the own-side personalized information. The bilateral personalized dialog generation model generates a top-5 best candidate reply list; E_C denotes the dialog context encoding, E_T the robot personalized information encoding, and E_S the user personalized information encoding; the parameters of the bilateral personalized generation model and the parameters of the deep neural network model are denoted separately.
The conditional mutual information abundance value of the input text X and the reply text Y, given the personalized information S and T of both parties and the history information C, can be expressed as:

I(X; Y | C, S, T) = log [ P(Y | X, C, S, T) / P(Y | C, S, T) ]

In the above formula, P(Y | X, C, S, T) denotes the probability of obtaining the reply text from the input text, the history information, the other party's personalized information, and the own-side personalized information, and P(Y | C, S, T) the probability of obtaining the reply text from the history information, the other party's personalized information, and the own-side personalized information. Note that in the dialog generation model, generating the reply text Y requires the input text X as a premise, while the denominator does not contain the input text X. Therefore, this embodiment does not jointly optimize the conditional mutual information abundance value but computes it step by step: the numerator part of the mutual information abundance value is computed first, by encoding the personalized information S and T of both parties and the history information C and using the bilateral personalized dialog generation model to generate the top-5 best candidate reply list, as shown in the following formula.
\mathcal{Y}_N = \operatorname*{arg\,top\text{-}5}_{Y} \; P_{\varphi}(Y \mid E_X, E_C, E_S, E_T)

In the above formula, P_{\varphi}(Y|E_X,E_C,E_S,E_T) denotes the probability of a candidate reply in the Top-5 optimal candidate reply list given the input information coding E_X (the coding of the input text X), the historical information coding E_C, the coding of the other party's personalized information E_S and the coding of the own party's personalized information E_T, and \mathcal{Y}_N is the Top-5 best candidate reply list generated by the decoder of the bilateral personalized generation model. Further, the denominator part of the conditional mutual information abundance value I(X;Y|C,S,T) is transformed by the conditional probability formula as follows:
P(Y \mid C,S,T) = \frac{P(Y,C,S,T)}{P(C,S,T)}

In the above formula, P(Y|C,S,T) denotes the probability of obtaining the reply text from the historical information, the personalized information of the other party and the personalized information of the own party; P(Y,C,S,T) denotes the co-occurrence probability of the reply text, the historical information, the personalized information of the other party and the personalized information of the own party; P(C,S,T) denotes the co-occurrence probability of the historical information, the personalized information of the other party and the personalized information of the own party; and Y denotes each reply text in the best candidate reply list. Given the same bilateral personalized information S and T and the same historical information C, when the conditional mutual information abundance value is calculated for each pair of input text X and reply text Y, the probability P(C,S,T) is identical for all candidates; since this part can therefore be ignored when calculating the conditional information amount, the calculation of P(Y|C,S,T) is equivalent, up to a constant factor, to the calculation of P(Y,C,S,T).
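A compact restatement of this equivalence (a reconstruction from the definitions above, not a verbatim formula of the patent):

```latex
% P(C,S,T) is constant across candidate replies Y, so it does not affect the argmax.
\arg\max_{Y\in\mathcal{Y}_N}\log P(Y\mid C,S,T)
  = \arg\max_{Y\in\mathcal{Y}_N}\bigl[\log P(Y,C,S,T)-\log P(C,S,T)\bigr]
  = \arg\max_{Y\in\mathcal{Y}_N}\log P(Y,C,S,T)
```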
The co-occurrence probability is calculated by the deep neural network model as follows:

P_{\theta}(Y,C,S,T) = \mathrm{DNN}_{\theta}(Y,C,S,T)

In the above formula, P_{\theta}(Y,C,S,T) denotes the co-occurrence probability of the reply text, the historical information, the personalized information of the other party and the personalized information of the own party calculated by the deep neural network model, and \mathrm{DNN}_{\theta} denotes the deep neural network model obtained by training on the collected human-computer social conversation data, which gives the co-occurrence probability of each reply text, the historical information and the personalized information of both interaction parties. In the collected human-computer social conversation data, the reply text of the robot, the conversation historical information and the personalized information of both interacting parties in each section of conversation are labeled as training positive samples of the deep neural network model. Meanwhile, in order to balance the training data, a negative sampling method is adopted to construct training negative samples, the binary cross entropy is used as the loss function of the deep neural network model, and the deep neural network model is trained; the loss is calculated as follows:

L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr], \quad p_i = \mathrm{DNN}_{\theta}(Y_i,C_i,S_i,T_i)
In the above formula, L(\theta) denotes the loss function of the deep neural network \mathrm{DNN}_{\theta}; N denotes the number of training samples in a batch, N = 32; y_i denotes the personalized label, y_i \in \{0,1\}; p_i = \mathrm{DNN}_{\theta}(Y_i,C_i,S_i,T_i) denotes the co-occurrence probability of the reply text, the historical information, the personalized information of the other party and the personalized information of the own party calculated by the deep neural network model; y_i = 1 indicates that the reply occurs in the current dialog, and y_i = 0 indicates that it does not. In order to generalize the CMIM target, a hyper-parameter \lambda_3 is introduced as a penalty coefficient for replies irrelevant to the dialog; the optimization goal based on this loss during training is shown in the following formula:

Y' = \operatorname*{arg\,max}_{Y\in\mathcal{Y}_N}\bigl[\log P_{\varphi}(Y\mid X,C,S,T) - \lambda_3\log P_{\theta}(Y,C,S,T)\bigr]
In the above formula, Y' denotes the reply with the largest conditional mutual information abundance value in the Top-5 candidate reply list, and the hyper-parameter \lambda_3 can be set empirically; in this embodiment \lambda_3 = 0.2. After the output sentence with the maximum conditional mutual information abundance value is obtained, a post-processing operation is performed on it: the reply sentence is normalized by a template and keyword matching method, and deletion and rewriting are adopted to increase the proportion of the context information and the personalized information in the reply, so as to generate a reply that is more natural and fluent, varies from person to person, and is rich in personality.
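A minimal sketch of the CMIM-based selection over the Top-5 list follows; it assumes hypothetical wrappers gen_logprob(y) = log P_phi(y|X,C,S,T) around the generator and cooc_prob(y) = P_theta(y,C,S,T) around the trained scorer (both names are assumptions):

```python
# Minimal sketch of CMIM reranking: pick the candidate maximizing
# log P_phi(Y|X,C,S,T) - lambda_3 * log P_theta(Y,C,S,T).
import math

def select_by_cmim(candidates, gen_logprob, cooc_prob, lambda3: float = 0.2):
    def cmim_score(y):
        # clamp the co-occurrence probability to avoid log(0)
        return gen_logprob(y) - lambda3 * math.log(max(cooc_prob(y), 1e-12))
    return max(candidates, key=cmim_score)
```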
As shown in fig. 7, the user gender information obtained from images and texts, the user age information obtained from images and texts, and the user interest information obtained from texts are represented by embedding codes, the same position information embedding codes are added at each position in the user input text, and together these form the context coding information of the user side. Similarly, the robot does not need to acquire additional information from images: it only needs to obtain the corresponding conversation text and age, gender and interest information from text and represent them by embedding codes, so that the gender coding information, age coding information and interest coding information of the robot, together with the same position information embedding codes, are added at each position in the conversation text of the robot to form the context coding information at the robot side. The user coding information and the robot coding information are then concatenated to form the cross-modal conversation context coding information.
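Below is a minimal PyTorch sketch of this cross-modal context coding (token, persona-attribute and position embeddings summed per position, then user-side and robot-side codes concatenated); vocabulary size, attribute ids and all module names are illustrative assumptions:

```python
# Minimal sketch of cross-modal dialog context coding.
import torch
import torch.nn as nn

class ContextCoder(nn.Module):
    def __init__(self, vocab: int = 30000, n_attr: int = 64,
                 d_model: int = 512, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.attr = nn.Embedding(n_attr, d_model)   # gender / age / interest ids
        self.pos = nn.Embedding(max_len, d_model)

    def side_code(self, token_ids, attr_ids):
        pos_ids = torch.arange(token_ids.size(1))
        x = self.tok(token_ids) + self.pos(pos_ids)            # token + position
        x = x + self.attr(attr_ids).sum(dim=1, keepdim=True)   # add persona attrs
        return x

    def forward(self, user_tokens, user_attrs, bot_tokens, bot_attrs):
        user = self.side_code(user_tokens, user_attrs)   # user-side context code
        bot = self.side_code(bot_tokens, bot_attrs)      # robot-side context code
        return torch.cat([user, bot], dim=1)             # cross-modal context code
```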
In order to verify the cross-modal bilateral personalized human-computer social conversation generation method of this embodiment, the following evaluation indexes are adopted: (1) bilateral persona accuracy (Acc): the generated response and the role information of both speakers are input into a role classifier to obtain the classification accuracy; (2) BLEU: the n-gram (n = 1, 2) overlap rate between the generated sentence and the original reply sentence is calculated; (3) perplexity (PPL): the degree of fit between the model and the test data is measured, where a lower value indicates that the sentences generated by the model are more grammatically fluent. Social conversation data satisfying bilateral personalization is manually selected from an existing social conversation test set, a comparison experiment is carried out with different methods, and the evaluation results are shown in Table 1.
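For reference, a minimal sketch of how the BLEU and PPL indexes above can be computed, assuming NLTK for BLEU and a model that reports total negative log-likelihood (all function names here are illustrative):

```python
# Minimal sketch of BLEU-n and perplexity.
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_n(reference: list, hypothesis: list, n: int = 2) -> float:
    weights = tuple([1.0 / n] * n)              # uniform n-gram weights
    return sentence_bleu([reference], hypothesis, weights=weights,
                         smoothing_function=SmoothingFunction().method1)

def perplexity(total_nll: float, n_tokens: int) -> float:
    return math.exp(total_nll / n_tokens)       # lower means more fluent

# Example: bleu_n("she is back".split(), "sister is back".split(), n=1)
```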
Table 1: comparison of the evaluation results of the method of this embodiment with those of existing methods.
As shown in Table 1, the compared methods include Transfo, Lconv, P-TDW, Transfo + Bi-P + CMIM and Lconv + Bi-P + CMIM. Transfo (TransferTransfo) and Lconv (Lost in Conversation) are the most popular methods in dialog generation; neither considers personalized information. P-TDW (Pre-Training with Dynamic Weight) is a more advanced method, but it only emphasizes the expression of its own personalization. Transfo + Bi-P + CMIM is an enhanced version of the Transfo + P method, and Lconv + Bi-P + CMIM is an enhanced version of the Lconv + P method. The cross-modal bilateral personalized human-machine social dialog generation method of this embodiment incorporates the personalized information of both parties (Bilateral Persona, Bi-P) and the constraint of conditional mutual information (CMIM) as improvements on these bases; compared with the test results of the other methods, it shows superiority in indexes such as Acc, BLEU and PPL.
An experimental result of the cross-modal bilateral personalized human-machine social dialog generation method is shown in fig. 8. As can be seen from the example result (left diagram), when the user says "Lovely sister, I find you have gotten fat", the Transfo and Lconv methods generate results that do not consider personalized information and are only related to the context, while the P-TDW method only considers its own personalized expression. In contrast, the method proposed by the invention can generate the following results according to different weight settings: (1) "Sister is fat", a reply related to the other party's personalized information (setting α = 1); (2) "I am not fat", a reply related to its own personalized information (setting β = 1); (3) "Have you gotten fat?", a reply related to the conversation context (setting γ = 1); (4) "I am not fat, sister is fat", a reply that adaptively fuses the other party's personalization, its own personalization and the context-related information. Because the reply contains the context information and adds the other party's personalized information, it carries more information and is more natural and fluent. As can be seen from the example result (right diagram), when the user says "Old brother is back" and the other party's personalized information mentions "brother", the method proposed by the invention can generate: (1) "Back for old sister", a reply related to the other party's personalized information (setting α = 1); (2) "Old brother missed you", a reply related to its own personalized information (setting β = 1); (3) "I come back soon", a reply related to the conversation context (setting γ = 1); (4) "Old brother missed you, old sister", a reply that adaptively fuses the other party's personalization, its own personalization and the context-related information. These examples show that the replies generated by the cross-modal bilateral personalized human-machine social dialog generation method can control the amount of personalized information in the reply according to different weight settings, and that the method can generate personalized replies that vary from person to person for different users.
In addition, the present embodiment also provides a cross-modal bilateral personalized human-computer social conversation generating system, which includes a computer device, where the computer device at least includes a microprocessor and a memory, which are connected to each other, and the microprocessor is programmed or configured to execute the steps of the aforementioned cross-modal bilateral personalized human-computer social conversation generating method, or a computer program programmed or configured to execute the aforementioned cross-modal bilateral personalized human-computer social conversation generating method is stored in the memory.
In addition, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, the computer program being programmed or configured to execute the aforementioned cross-modal bilateral personalized human-computer social conversation generation method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowchart and/or block diagram, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
Claims (10)
1. A cross-modal bilateral personalized man-machine social conversation generation method is characterized by comprising the following steps:
1) performing weighted fusion on the dialog context coding E_C, the robot personalized information coding E_T, the user personalized information coding E_S and the coding of the output result at the previous time E_prev to obtain the weighted fusion coding O_enc;
2) inputting the weighted fusion coding O_enc and the coding of the output result at the previous time E_prev together into the decoder of the bilateral personalized generation model to generate an optimal N candidate reply list; the bilateral personalized generation model is trained in advance and establishes the mapping relationship between the input weighted fusion coding together with the coding of the output result at the previous time E_prev and the output optimal N candidate reply list;
3) calculating the conditional mutual information abundance value of each candidate reply in the N candidate reply list; when the conditional mutual information abundance value of each candidate reply in the N candidate reply list needs to be calculated, it is calculated according to the following formula;
I(X;Y|C,S,T) = \log \frac{P(Y \mid X,C,S,T)}{P(Y \mid C,S,T)}

In the above formula, I(X;Y|C,S,T) denotes the conditional mutual information abundance value; P(Y|X,C,S,T) denotes the probability of obtaining the reply text from the input text, the historical information, the personalized information of the other party and the personalized information of the own party; P(Y|C,S,T) denotes the probability of obtaining the reply text from the historical information, the personalized information of the other party and the personalized information of the own party; X denotes the input text; Y denotes the reply text; C denotes the historical information; S denotes the personalized information of the other party; and T denotes the personalized information of the own party;
4) selecting the candidate reply with the maximum conditional mutual information abundance value as the final output result.
2. The method according to claim 1, wherein step 1) is preceded by the step of generating the dialog context coding E_C: combining the personalized attribute information of both parties with the conversation context history to generate personalized interaction history information; performing word embedding coding on the personalized interaction history information and the current input of the user respectively to obtain an embedded vector; then performing position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain the coding of the dialog context description; and inputting the coding of the dialog context description into the encoder of the bilateral personalized generation model to obtain the dialog context coding E_C.
3. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, further comprising, before step 1), the step of generating the robot personalized information coding E_T: performing word embedding coding on the preset robot personalized description to obtain an embedded vector, then performing position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain the coding of the robot personalized description; and inputting the coding of the robot personalized description into the encoder of the bilateral personalized generation model to obtain the robot personalized information coding E_T.
4. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, further comprising, before step 1), the step of generating the user personalized information coding E_S: performing face recognition on a picture of the user to extract personalized description information of the image modality; for a personalized social conversation data set of the user, which includes labeling information of the gender, age and interest labels of both interacting parties, labeling each sentence with the personalized-information labels of both interacting parties related to the speaker, and filtering texts with inappropriate words in the personalized social conversation data set to obtain personalized description information of the text modality; combining the personalized description information of the image modality and the personalized description information of the text modality to obtain the user personalized description; performing word embedding coding on the obtained user personalized description to obtain an embedded vector, then performing position coding on the embedded vector to obtain a position coding vector, and adding the embedded vector and the position coding vector to obtain the coding of the user personalized description; and inputting the coding of the user personalized description into the encoder of the bilateral personalized generation model to obtain the user personalized information coding E_S.
5. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 2, 3 or 4, wherein the function expression of the position coding is as follows:
E_{(pos,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)
E_{(pos,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)

In the above formula, E_{(pos,2i)} denotes the position embedding code of a character at position pos mapped into the sine function in dimension 2i; E_{(pos,2i+1)} denotes the position embedding code mapped into the cosine function in dimension 2i+1; pos denotes the position of the character; i denotes the word embedding dimension index; d_{model} denotes the coding dimension, d_{model} = 512; and pos ∈ [1, n], where n is the content length of the word embedding coding.
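For illustration, a minimal sketch of the sinusoidal position coding defined above, with d_model = 512 as claimed (the implementation details are assumptions):

```python
# Minimal sketch: rows correspond to positions 1..n, even dims use sine,
# odd dims use cosine, as in the formula above.
import numpy as np

def position_encoding(n: int, d_model: int = 512) -> np.ndarray:
    pe = np.zeros((n, d_model))
    pos = np.arange(1, n + 1)[:, None]            # pos in [1, n]
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                   # dimension 2i   -> sine
    pe[:, 1::2] = np.cos(angle)                   # dimension 2i+1 -> cosine
    return pe
```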
6. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, wherein the step of performing weighted fusion in step 1) to obtain the weighted fusion coding O_enc comprises:
S1) coding the dialog context coding E_C and the coding of the output result at the previous time E_prev through a multi-head attention mechanism network to obtain the dialog context coding O_C after multi-head self-attention coding; coding the robot personalized information coding E_T and the coding of the output result at the previous time E_prev through a multi-head attention mechanism network to obtain the robot personalized information coding O_T after multi-head self-attention coding; coding the user personalized information coding E_S and the coding of the output result at the previous time E_prev through a multi-head attention mechanism network to obtain the user personalized information coding O_S after multi-head self-attention coding; and coding the coding of the output result at the previous time E_prev through a masked multi-head attention mechanism network to obtain the coding O_prev of the output result at the previous time after multi-head self-attention coding;
S2) obtaining the weighted fusion coding O_enc according to the following formula;
O_{enc} = \alpha\,O_S + \beta\,O_T + \gamma\,O_C

In the above formula, \alpha denotes the probability of the user personalized information appearing in the reply; \beta denotes the probability of the robot personalized information appearing in the reply; \gamma denotes the probability of no personalized information appearing in the reply; O_S is the user personalized information coding after multi-head self-attention coding; O_T is the robot personalized information coding after multi-head self-attention coding; O_C is the dialog context coding after multi-head self-attention coding; and O_prev, obtained in step S1), is the coding of the output result at the previous time after multi-head self-attention coding.
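A minimal PyTorch sketch of steps S1)-S2) follows, under the reconstruction O_enc = α·O_S + β·O_T + γ·O_C used above; the module names, dimensions and batch-first layout are illustrative assumptions:

```python
# Minimal sketch of the weighted fusion: three cross-attention codes with
# E_prev as query, combined by the weights alpha / beta / gamma.
import torch.nn as nn

d_model, n_heads = 512, 8
attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
attn_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
attn_c = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

def weighted_fusion(e_prev, e_s, e_t, e_c, alpha, beta, gamma):
    # (the masked self-attention producing O_prev is omitted for brevity)
    o_s, _ = attn_s(e_prev, e_s, e_s)   # user personalized code, query = E_prev
    o_t, _ = attn_t(e_prev, e_t, e_t)   # robot personalized code
    o_c, _ = attn_c(e_prev, e_c, e_c)   # dialog context code
    return alpha * o_s + beta * o_t + gamma * o_c   # weighted fusion O_enc
```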
7. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, wherein the bilateral personalized generation model in step 2) is a GPT network model, and step 2) further comprises the step of training the GPT network model; the joint loss function adopted in training the GPT network model is as follows:
L(\varphi,\theta) = L_D(\varphi) + \lambda_1 L_{LM}(\varphi) + \lambda_2 L_p(\theta)

In the above formula, L(\varphi,\theta) denotes the joint loss function; L_D(\varphi) denotes the loss function of the decoding model; L_{LM}(\varphi) denotes the loss function of the language model; L_p(\theta) denotes the loss function of the personalized prediction model; and \lambda_1 and \lambda_2 are weight coefficients; wherein the expression of the decoding model loss function L_D(\varphi) is as follows:
L_D(\varphi) = -\sum_{i}\log P_{\varphi}(y_i \mid x_0,\ldots,x_{i-1}, O_{enc})

In the above formula, P_{\varphi}(y_i|x_0,\ldots,x_{i-1},O_{enc}) denotes the probability of predicting the next character when the weighted fusion coding and the already generated characters are input into the decoder; y_i denotes the i-th character in the character string generated by the decoder; x_0,\ldots,x_{i-1} denote the first i characters of the prefix in the generated character string; O_{enc} denotes the weighted fusion coding; and i denotes the length of the generated character string;
the function expression of the language model loss function L_{LM}(\varphi) is as follows:
L_{LM}(\varphi) = -\sum_{i}\log P_{\varphi}(y_i \mid x_{i-k},\ldots,x_{i-1})

In the above formula, P_{\varphi}(y_i|x_{i-k},\ldots,x_{i-1}) denotes the probability of predicting the character y_i from the preceding characters x_{i-k},\ldots,x_{i-1} within a fixed window; y_i denotes the character at the i-th position in the window; x_{i-k},\ldots,x_{i-1} denote the preceding characters in the window; k denotes the size of the context window; and i denotes the position index within the window;
the function expression of the personalized prediction loss function L_p(\theta) is as follows:
L_p(\theta) = -\sum_{j} y_j \log P_{\theta}(y = j \mid O_C)

In the above formula, y_j denotes the label of the personalized information; P_{\theta}(y=j|O_C) denotes the probability of predicting, based on the context information, that the personalized information appears in the reply; and O_C denotes the dialog context coding after multi-head self-attention coding.
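For illustration, a minimal PyTorch sketch of the joint loss in this claim; tensor shapes and the weight values are assumptions, and the three component losses are ordinary cross entropies:

```python
# Minimal sketch of L(phi, theta) = L_D + lambda1 * L_LM + lambda2 * L_p.
import torch.nn.functional as F

def joint_loss(dec_logits, dec_targets,       # (B, L, V), (B, L): decoding loss
               lm_logits, lm_targets,         # (B, L, V), (B, L): LM loss
               persona_logits, persona_tags,  # (B, n_labels), (B,): persona loss
               lambda1: float = 0.5, lambda2: float = 0.5):
    l_d = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets)   # L_D(phi)
    l_lm = F.cross_entropy(lm_logits.transpose(1, 2), lm_targets)    # L_LM(phi)
    l_p = F.cross_entropy(persona_logits, persona_tags)              # L_p(theta)
    return l_d + lambda1 * l_lm + lambda2 * l_p
```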
8. The cross-modal bilateral personalized human-computer social conversation generation method according to claim 1, wherein step 3) comprises: labeling in advance the reply text of the robot, the conversation history information and the personalized information of both interacting parties in each section of conversation in the collected human-computer social conversation data, and using them as training positive samples of the deep neural network model; meanwhile, in order to balance the training data, constructing training negative samples by a negative sampling method, using the binary cross entropy as the loss function of the deep neural network model, and training the deep neural network model.
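For illustration only, a minimal PyTorch sketch of a co-occurrence scorer trained with binary cross entropy as described in this claim; the network shape, pre-encoded inputs and batch layout are assumptions:

```python
# Minimal sketch: DNN_theta scores how plausibly (reply, history, persona_S,
# persona_T) co-occur, trained with binary cross entropy on balanced samples.
import torch
import torch.nn as nn

class CooccurrenceScorer(nn.Module):
    def __init__(self, enc_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * enc_dim, enc_dim), nn.ReLU(),
            nn.Linear(enc_dim, 1))                    # logit of co-occurrence

    def forward(self, reply, history, persona_s, persona_t):
        x = torch.cat([reply, history, persona_s, persona_t], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # P_theta(Y,C,S,T)

model = CooccurrenceScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCELoss()                                     # binary cross entropy

def train_step(batch):                                 # e.g. batch size N = 32
    reply, history, s, t, labels = batch               # labels y_i in {0, 1}
    p = model(reply, history, s, t)
    loss = bce(p, labels.float())
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```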
9. A cross-modal bilateral personalized human-computer social conversation generating system comprising a computer device including at least a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the cross-modal bilateral personalized human-computer social conversation generating method according to any one of claims 1 to 8, or the memory stores therein a computer program programmed or configured to perform the cross-modal bilateral personalized human-computer social conversation generating method according to any one of claims 1 to 8.
10. A computer-readable storage medium having stored therein a computer program programmed or configured to perform the cross-modal bilateral personalized human-computer social conversation generation method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011046353.7A CN111930918B (en) | 2020-09-29 | 2020-09-29 | Cross-modal bilateral personalized man-machine social interaction dialog generation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111930918A CN111930918A (en) | 2020-11-13 |
CN111930918B true CN111930918B (en) | 2020-12-18 |
Family
ID=73333707
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011046353.7A Active CN111930918B (en) | 2020-09-29 | 2020-09-29 | Cross-modal bilateral personalized man-machine social interaction dialog generation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111930918B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7329585B2 (en) | 2021-05-24 | 2023-08-18 | ネイバー コーポレーション | Persona chatbot control method and system |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113342947B (en) * | 2021-05-26 | 2022-03-15 | 华南师范大学 | Multi-round dialog text generation method capable of sensing dialog context relative position information |
CN113781853B (en) * | 2021-08-23 | 2023-04-25 | 安徽教育出版社 | Teacher-student remote interactive education platform based on terminal |
CN114996431B (en) * | 2022-08-01 | 2022-11-04 | 湖南大学 | Man-machine conversation generation method, system and medium based on mixed attention |
CN116580445B (en) * | 2023-07-14 | 2024-01-09 | 江西脑控科技有限公司 | Large language model face feature analysis method, system and electronic equipment |
CN117131426B (en) * | 2023-10-26 | 2024-01-19 | 一网互通(北京)科技有限公司 | Brand identification method and device based on pre-training and electronic equipment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105488662B (en) * | 2016-01-07 | 2021-09-03 | 北京华品博睿网络技术有限公司 | Online recruitment system based on bidirectional recommendation |
WO2018097091A1 (en) * | 2016-11-25 | 2018-05-31 | 日本電信電話株式会社 | Model creation device, text search device, model creation method, text search method, data structure, and program |
CN108320218B (en) * | 2018-02-05 | 2020-12-11 | 湖南大学 | Personalized commodity recommendation method based on trust-score time evolution two-way effect |
CN108920497B (en) * | 2018-05-23 | 2021-10-15 | 北京奇艺世纪科技有限公司 | Man-machine interaction method and device |
CN111625660A (en) * | 2020-05-27 | 2020-09-04 | 腾讯科技(深圳)有限公司 | Dialog generation method, video comment method, device, equipment and storage medium |
CN111708874B (en) * | 2020-08-24 | 2020-11-13 | 湖南大学 | Man-machine interaction question-answering method and system based on intelligent complex intention recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right |
Effective date of registration: 20230629 Address after: 410001 No. 002, Floor 5, Building B, No. 10, Zone 2, CSCEC Smart Industrial Park, No. 50, Jinjiang Road, Yuelu Street, Yuelu District, Changsha, Hunan Province Patentee after: Hunan Xinxin Xiangrong Intelligent Technology Co.,Ltd. Address before: Yuelu District City, Hunan province 410082 Changsha Lushan Road No. 1 Patentee before: HUNAN University |