CN118279704B - Digital human interaction evaluation method, device, storage medium and equipment
- Publication number: CN118279704B (application CN202410712819.4A)
- Authority: CN (China)
- Legal status: Active
Abstract
The application discloses a digital human interaction evaluation method, device, storage medium and equipment. An evaluation system is constructed from four aspects: the response quality of the digital human's dialogue text, the degree of emotion-consistency deviation, the degree of sound-lip synchronization deviation, and the richness of facial expressions and limb actions. Multiple evaluation indexes at different levels are used for comprehensive evaluation, improving the accuracy of evaluating the interaction performance of a digital human model across multiple modalities. In addition, because different users emphasize different evaluation indexes, the weight assigned to each index can be adjusted adaptively, so that the evaluation meets the differing requirements users place on the model to be evaluated under different application scenarios and reflects the multimodal interaction performance of the digital human model in a correspondingly personalized way.
Description
Technical Field
The present application relates to the field of artificial intelligence interaction technologies, and in particular, to a method, an apparatus, a storage medium, and a device for evaluating digital human interaction.
Background
With the rapid development of the digital human industry, digital humans are being applied in an increasingly wide range of fields: they not only shine in entertainment, gaming and similar areas, but also show great commercial value in industries such as education, healthcare, finance and government affairs. In order to continuously optimize digital humans to meet users' needs at different stages, all-round performance evaluation of digital humans is particularly critical. Such evaluation not only provides a clear optimization direction for technical teams, but also ensures that digital humans meet user needs in practical applications, thereby improving overall user satisfaction and loyalty.
A digital human integrates leading-edge technologies such as artificial intelligence, computer graphics, natural language processing and motion capture, so constructing a comprehensive evaluation system that assesses a multimodal digital human from multiple angles is essential. However, most current evaluations of digital humans focus on intelligent question-and-answer performance while neglecting multimodal interaction performance, such as the expression of emotion in the visual and auditory channels and the richness of content expression. With only a single level of evaluation index, the evaluation results are not objective or accurate enough.
Disclosure of Invention
The main purpose of the application is to provide a digital human interaction evaluation method, device, storage medium and equipment, aiming to solve the problem that the multimodal interaction performance of a digital human cannot be evaluated objectively and accurately from its intelligent question-and-answer performance alone.
In order to achieve the above object, the application provides a digital human interaction evaluation method, comprising: acquiring multiple groups of multi-source interaction data of a digital human model, wherein each group of multi-source interaction data at least comprises dialogue text data, audio data and video data; obtaining a plurality of emotion sequences of the digital human model from each group of multi-source interaction data, and determining a first evaluation value based on the degree of emotion-consistency deviation among the plurality of emotion sequences; determining the response quality of the digital human model based on each piece of dialogue text data, and determining a second evaluation value based on the response quality; determining the sound-lip synchronization deviation degree of the digital human model based on each piece of audio data and each piece of video data, and determining a third evaluation value based on the sound-lip synchronization deviation degree; determining the facial expression and limb action richness of the digital human model based on each piece of video data, and determining a fourth evaluation value based on the facial expression and limb action richness; and evaluating the human-machine interaction performance of the digital human model based on the first, second, third and fourth evaluation values.
Optionally, determining the first evaluation value based on the degree of emotion-consistency deviation among the plurality of emotion sequences includes: acquiring the dialogue text data, the audio data and the video data respectively on a time axis; recognizing the dialogue text data, the audio data and the video data respectively with a preset emotion recognition model to obtain, correspondingly, a first emotion sequence, a second emotion sequence and a third emotion sequence, wherein the first, second and third emotion sequences are each composed of the corresponding emotion labels arranged along the time axis; and evaluating, on the time axis, the degree of consistency deviation among the emotion labels of the first, second and third emotion sequences within the same time period, and determining the first evaluation value.
Optionally, evaluating, on the time axis, the degree of consistency deviation among the emotion labels of the first, second and third emotion sequences within the same time period and determining the first evaluation value includes: arranging the first, second and third emotion sequences side by side on the time axis to determine a deviation time value and a total duration, wherein the deviation time value is the maximum time elapsed while the emotion labels of a reference party and a comparison party are inconsistent, the reference party being any one of the first, second and third emotion sequences taken as the evaluation reference sequence and the comparison party being the combination of the two remaining emotion sequences, and the total duration is the maximum of the durations of the first, second and third emotion sequences; calculating the emotion deviation degree from the deviation time value and the total duration; and determining the first evaluation value based on the emotion deviation degree and a preset emotion-deviation-degree comparison table.
Optionally, determining the response quality of the digital human model based on each piece of dialogue text data and determining the second evaluation value based on the response quality include: obtaining a response accuracy evaluation value, a context consistency evaluation value, a response naturalness evaluation value and a response fluency evaluation value respectively from the dialogue text data; and determining the second evaluation value from the response accuracy evaluation value, the context consistency evaluation value, the response naturalness evaluation value and the response fluency evaluation value.
Optionally, the dialogue text data comprises a subjective text data set and an objective text data set, and the response accuracy evaluation value is calculated from a first parameter corresponding to the objective text data set and a second parameter corresponding to the subjective text data set, wherein the first parameter is a value determined from the correct-answer rates of a plurality of objective question-and-answer dialogue texts, and the second parameter is a value determined from the scores given to a plurality of subjective question-and-answer dialogue texts under the corresponding quality evaluation criteria.
Optionally, determining the sound-lip synchronization deviation degree of the digital human model based on each piece of audio data and each piece of video data, and determining the third evaluation value based on the sound-lip synchronization deviation degree, includes: processing each piece of audio data and each piece of video data with a preset sound-lip synchronization test model to obtain an error value of lip-and-sound synchronization for each video, and obtaining the sound-lip synchronization deviation degree of the digital human model from the plurality of error values; and determining the third evaluation value based on the sound-lip synchronization deviation degree.
Optionally, determining the facial expression and limb action richness based on each piece of video data, and determining the fourth evaluation value based on the facial expression and limb action richness, includes: acquiring the total interaction duration of each group of multi-source interaction data; obtaining the total number of facial expressions and limb actions of the digital human model corresponding to each third emotion label; determining the facial expression and limb action richness of the digital human model from the ratio of the total number of facial expressions and limb actions corresponding to each third emotion label to the total interaction duration; and determining the fourth evaluation value based on the facial expression and limb action richness.
In addition, to achieve the above object, the present application also provides a digital human interaction evaluation device, which is characterized by comprising: the data acquisition module is used for acquiring a plurality of groups of multi-source interaction data of the digital human model, wherein each group of multi-source interaction data at least comprises dialogue text data, audio data and video data; the first evaluation module is used for obtaining a plurality of emotion sequences of the digital human model according to each group of the multi-source interaction data, and determining a first evaluation value based on the deviation degree of emotion consistency among the plurality of emotion sequences; a second evaluation module for determining a response quality of the digital person model based on each of the dialog text data, and determining the second evaluation value based on the response quality; a third evaluation module for determining a sound lip-sync deviation degree of the digital person model based on each of the audio data and each of the video data, and determining a third evaluation value based on the sound lip-sync deviation degree; a fourth evaluation module for determining facial expression and limb action richness of the digital human model based on each of the video data, and determining a fourth evaluation value based on the facial expression and limb action richness; and a fifth evaluation module for evaluating the man-machine interaction performance of the digital human model based on the first evaluation value, the second evaluation value, the third evaluation value and the fourth evaluation value.
In addition, to achieve the above object, the present application also provides a computer-readable storage medium including instructions which, when run on a computer, cause the computer to perform the digital human interaction assessment method of any one of the above.
In addition, to achieve the above object, the present application also provides an apparatus, characterized in that the apparatus includes: at least one processor, memory, and input output unit; wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program stored in the memory to perform the method for interactive assessment of a digital person according to any one of the above.
According to the digital human interaction evaluation method, device, storage medium and equipment provided by the embodiments of the application, multiple groups of multi-source interaction data of a digital human model are acquired, each group comprising at least dialogue text data, audio data and video data; a plurality of emotion sequences of the digital human model are obtained from each group of multi-source interaction data, and a first evaluation value is determined based on the degree of emotion-consistency deviation among the plurality of emotion sequences; the response quality of the digital human model is determined based on each piece of dialogue text data, and a second evaluation value is determined based on the response quality; the sound-lip synchronization deviation degree of the digital human model is determined based on each piece of audio data and each piece of video data, and a third evaluation value is determined based on the sound-lip synchronization deviation degree; the facial expression and limb action richness of the digital human model are determined based on each piece of video data, and a fourth evaluation value is determined based on the facial expression and limb action richness; and the human-machine interaction performance of the digital human model is evaluated based on the first, second, third and fourth evaluation values. An evaluation system is thus constructed from four aspects: the response quality of the digital human's dialogue text, the consistency of dialogue emotion, the sound-lip synchronization rate, and the richness of expressions and limb actions; multiple evaluation indexes at different levels are used for comprehensive evaluation, improving the accuracy of evaluating the interaction performance of the digital human model across multiple modalities. In addition, because users emphasize different evaluation indexes, the weight of each index can be adjusted adaptively, so that the user's evaluation requirements are met while the multimodal interaction performance of the digital human model is reflected.
Drawings
FIG. 1 is a flow chart of a method for interactive assessment of digital people according to an embodiment of the application;
FIG. 2 is a schematic diagram of a digital human interaction assessment device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computer-readable storage medium according to an embodiment of the application;
FIG. 4 is a schematic diagram of an apparatus according to an embodiment of the present application;
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
In the figure: 50. an optical disc; 60. an apparatus; 100. a digital human interaction evaluation device; 101. a data acquisition module; 102. a first evaluation module; 103. a second evaluation module; 104. a third evaluation module; 105. a fourth evaluation module; 106. a fifth evaluation module; 601. A processing unit; 602. a system memory; 603. a bus; 604. an external device; 605. an I/O interface; 606. a network adapter; 6021. a RAM; 6022. a cache memory; 6023. a ROM; 6024. an algorithm module; 6025. algorithm/utility.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Fig. 1 is a flow chart of a digital human interaction evaluation method according to an embodiment of the present application. The method aims to solve the problem that the multimodal interaction performance of a digital human cannot be evaluated objectively and accurately from its intelligent question-and-answer performance alone, and may be executed by a computer or a server. Referring to Fig. 1, the digital human interaction evaluation method includes:
S10, acquiring a plurality of groups of multi-source interaction data of a digital human model, wherein each group of multi-source interaction data at least comprises dialogue text data, audio data and video data;
To obtain multiple groups of multi-source interaction data of the digital human model, the computer first selects appropriate data sets, which should contain rich dialogue text data and high-quality audio and video data. For example, reference may be made to the digital human multimodal dataset BEAT proposed by the University of Tokyo, which contains 76 hours of motion-capture data with semantic and emotional annotations and can be used to generate more lively conversational actions. After the data are acquired, they may be preprocessed to suit the requirements of the digital human model: for video data, cropping and scaling may be performed to focus on the speaker's face and body movements; for audio data, noise reduction and speech recognition may be performed to extract text information.
S20, obtaining a plurality of emotion sequences of the digital human model according to each group of multi-source interaction data, and determining a first evaluation value based on the deviation degree of emotion consistency among the plurality of emotion sequences;
In an embodiment of the present application, step S20 may specifically include the following steps:
S21, acquiring dialogue text data, audio data and video data respectively on a time axis;
Wherein the time axis is the basis for synchronizing different types of data. The time axis may be an actual time stamp (e.g., hours, minutes, seconds) or a relative time (e.g., percent progress of the video). After determining the time axis, the computer may extract all the dialog text and the audio and video clips matching the dialog text in a specific time period on the time axis, thereby obtaining dialog text data, audio data, and video data.
S22, respectively identifying dialogue text data, audio data and video data by using a preset emotion identification model, and correspondingly obtaining a first emotion sequence, a second emotion sequence and a third emotion sequence; the first emotion sequence, the second emotion sequence and the third emotion sequence are sequences formed by corresponding emotion label arrangement based on a time axis respectively;
Specifically, the computer may input the dialogue text data into a preset emotion recognition model to obtain at least one first emotion label. The preset emotion recognition model may be constructed on an emotion-enhanced graph-network model for dialogue-text emotion recognition, or on an LSTM model combined with an attention mechanism and a binary learning loss to capture the relationships between labels; this is not specifically limited. Before recognizing emotion in the audio data, the computer needs to preprocess it, including denoising, segmentation, framing and window-function processing. The computer may then recognize the preprocessed audio data with a preset emotion recognition model to obtain at least one second emotion label, where training of this model relies on audio data sets annotated with emotion labels. The computer may segment the audio data with an endpoint-detection algorithm, that is, split it at pauses or silent intervals, and then use the preset emotion recognition model to combine the contextual information of the audio with cues such as speech rate, pitch, speaking rhythm and emotional fullness, obtaining at least one second emotion label for each audio segment. These emotion labels typically represent basic emotion categories such as happiness, sadness, anger and surprise. Before recognizing emotion in the video data, the computer first preprocesses the acquired frames to facilitate subsequent feature extraction and emotion recognition. The computer may then use a machine learning or deep learning algorithm to extract emotion-related features from the preprocessed image frames. These features may include, for example, facial features, limb features and contextual information: facial features such as the position and shape of the eyebrows, eyes, nose and mouth and the movement of facial muscles; limb features such as the position of the arms and legs, gestures and postures; and contextual information such as the character's interaction with the environment, which can provide additional emotional cues. The computer may then input the facial features, limb features and contextual information into a preset emotion recognition model and output at least one third emotion label corresponding to the facial and limb actions, where the third emotion label may likewise cover basic emotion categories such as happiness, sadness, anger and surprise.
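As an illustration of how the three emotion sequences might be organized, the following sketch arranges per-segment labels into time-indexed sequences. The data structure and the recognizer interfaces are assumptions for illustration only; the patent does not prescribe a concrete representation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EmotionSpan:
    start: float   # segment start time on the shared time axis, in seconds
    end: float     # segment end time, in seconds
    label: str     # basic emotion category, e.g. "happy", "sad", "angry", "surprise"

# An emotion sequence is simply the spans of one modality ordered along the time axis.
EmotionSequence = List[EmotionSpan]

def build_sequence(segments: List[dict],
                   recognize: Callable[[object], str]) -> EmotionSequence:
    """Run a (hypothetical) per-modality emotion recognizer over time-stamped
    segments and arrange the resulting labels along the time axis."""
    spans = [EmotionSpan(s["start"], s["end"], recognize(s["data"])) for s in segments]
    return sorted(spans, key=lambda sp: sp.start)

# Usage with assumed recognizers (stand-ins for the preset emotion recognition models):
#   first_seq  = build_sequence(text_segments,  text_emotion_model)    # dialogue text
#   second_seq = build_sequence(audio_segments, audio_emotion_model)   # audio
#   third_seq  = build_sequence(video_segments, video_emotion_model)   # video (face + limbs)
```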
S23, evaluating consistency deviation degree of emotion labels corresponding to the first emotion sequence, the second emotion sequence and the third emotion sequence in the same time period based on a time axis, and determining a first evaluation value.
In this step, whether the emotions expressed by the text, the voice, and the digital human's facial and limb actions are highly consistent or well matched is judged on the time axis, so as to avoid the situation in which inconsistency among the three greatly weakens the anthropomorphic effect of the digital human's emotional presentation and thereby degrades the user experience.
In an embodiment of the present application, the step S23 may specifically include the following execution steps:
S231, arranging the first emotion sequence, the second emotion sequence and the third emotion sequence side by side on the time axis, and determining a deviation time value and a total duration, where the deviation time value is the maximum time elapsed while the emotion labels of the reference party and the comparison party are inconsistent; the reference party is any one of the first, second and third emotion sequences taken as the evaluation reference sequence, the comparison party is the combination of the remaining two emotion sequences, and the total duration is the maximum of the durations of the first, second and third emotion sequences;
S232, calculating the emotion deviation degree from the deviation time value and the total duration.
In the application, the grade value corresponding to the deviation time value is first determined from its magnitude according to the five-grade scoring standard in ITU-R BT.1359-1 (Relative Timing of Sound and Vision for Broadcasting). The ratio of the deviation time value to the total duration is then calculated, and the product of this ratio and the grade value is taken as the emotion deviation degree, so that emotion deviation is assessed comprehensively from both the deviation duration and the number of deviations.
S233, determining a first evaluation value based on the emotion deviation degree and a preset emotion deviation degree comparison evaluation table.
The preset emotion-deviation-degree comparison table is an evaluation table obtained by inviting at least 50 people to take video-watching tests covering different numbers and durations of emotion deviations, having the viewers give satisfaction scores, and compiling the statistics. A corresponding evaluation value can be looked up in the table from the emotion deviation degree, and this value is taken as the first evaluation value.
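A minimal sketch of steps S231-S233 follows, reusing the EmotionSpan structure from the earlier sketch. The fixed sampling grid, the placeholder grade-value function and the illustrative lookup thresholds are assumptions; the patent only specifies the ratio-times-grade formulation and the table lookup.

```python
def label_at(seq, t):
    """Label of an emotion sequence at time t, or None outside any span."""
    for sp in seq:
        if sp.start <= t < sp.end:
            return sp.label
    return None

def emotion_deviation_degree(first_seq, second_seq, third_seq,
                             grade_value_fn, step=0.1):
    """Deviation time value = longest stretch during which the reference party
    (first_seq here) disagrees with either sequence of the comparison party;
    total duration = maximum sequence duration."""
    total = max(seq[-1].end for seq in (first_seq, second_seq, third_seq))
    longest = current = 0.0
    t = 0.0
    while t < total:
        ref = label_at(first_seq, t)
        inconsistent = any(label_at(seq, t) != ref for seq in (second_seq, third_seq))
        current = current + step if inconsistent else 0.0
        longest = max(longest, current)
        t += step
    deviation_time = longest
    # grade_value_fn maps the deviation time to a grade value in the spirit of the
    # five-grade scale of ITU-R BT.1359-1 (the exact mapping is not given here).
    return (deviation_time / total) * grade_value_fn(deviation_time)

def first_evaluation_value(deviation_degree, lookup_table):
    """lookup_table: list of (upper_bound, score) rows standing in for the preset
    emotion-deviation-degree comparison table (values are illustrative)."""
    for upper, score in lookup_table:
        if deviation_degree <= upper:
            return score
    return lookup_table[-1][1]
```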
S30, determining the response quality of the digital human model based on the dialogue text data, and determining a second evaluation value based on the response quality;
In the embodiment of the present application, step S30 specifically includes the following steps:
S31, respectively obtaining a response accuracy evaluation value, a context consistency evaluation value, a response naturalness evaluation value and a response fluency evaluation value based on each dialogue text data;
Response accuracy refers to whether the information provided by the digital human model is accurate, including the correctness of factual information, the soundness of logical reasoning, and a correct understanding of the user's intent. When evaluating response accuracy, the computer typically has a standard answer or a set of possible correct answers as a reference, and obtains a response accuracy evaluation value by comparing the model's output with the standard answers. Context consistency refers to whether the digital human model maintains topic consistency and logical coherence during the dialogue; the model should not only understand the current input but also remember and refer back to previous dialogue content. When evaluating context consistency, the computer checks whether the model's response matches the earlier dialogue and whether it advances the conversation reasonably, yielding a context consistency evaluation value. Response naturalness refers to whether the digital human model's replies are close to the way a human speaks. When evaluating response naturalness, the computer typically considers whether a reply sounds as if it came from a real person and whether an appropriate language style and expression are used, yielding a response naturalness evaluation value. Response fluency refers to whether the digital human model's replies are coherent and smooth, without language defects or unnatural pauses. When evaluating response fluency, the computer checks whether a reply is completed in one pass and whether there are obvious grammatical errors or unnatural expressions, and gives a response fluency evaluation value accordingly.
Specifically, one embodiment of step S31 may include the following implementation steps:
S311, the dialogue text data comprises a subjective text data set and an objective text data set, and the response accuracy evaluation value is calculated by a first parameter corresponding to the objective text data set and a second parameter corresponding to the subjective text data set, wherein the first parameter is a numerical value determined by the accuracy of a plurality of objective question-answer dialogue texts; the second parameter is a value determined by evaluation values given by the plurality of subjective question-answer dialog texts under the corresponding quality evaluation criteria.
First, a subjective text data set and an objective text data set are screened from the digital human's historical dialogue texts, with equal numbers of question-answer items in the two sets.
Then, for each objective question in the objective text data set, the digital human's answer is compared with the corresponding reference answer to judge whether it is correct, and the first parameter is determined from the accuracy over all objective questions. Meanwhile, each subjective question-answer item in the subjective text data set is evaluated against its quality evaluation criteria, and the average of all scores is taken as the second parameter. For example, for the subjective question "What do you think true happiness is?", the digital human replies "True happiness …"; a prompt is then constructed with quality criteria such as "depth and insight of the content, ability to evoke emotional resonance, logical soundness and quality of language expression", and sent to a large model to obtain the score for this subjective question.
Finally, a response accuracy evaluation value is obtained by computing an average, a weighted average or a similar combination of the first parameter for the objective texts and the second parameter for the subjective texts. In this way the application evaluates response accuracy from both the objective and the subjective question-and-answer perspectives, which improves the assessment of intelligent response accuracy and allows the digital human's response performance to be evaluated comprehensively.
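A sketch of this accuracy calculation, assuming the objective correctness judgments and the subjective large-model scores are already available; the 0-1 normalization of subjective scores and the equal weighting are assumptions, since the description allows any average or weighted average.

```python
def response_accuracy(objective_correct: list[bool],
                      subjective_scores: list[float],
                      max_subjective_score: float = 10.0,
                      w_objective: float = 0.5) -> float:
    """Combine the objective and subjective question-answer results.

    first parameter  = correct-answer rate over the objective questions
    second parameter = mean subjective score (normalized to 0..1 here)
    The returned value is a weighted average of the two parameters.
    """
    first_param = sum(objective_correct) / len(objective_correct)
    second_param = (sum(subjective_scores) / len(subjective_scores)) / max_subjective_score
    return w_objective * first_param + (1.0 - w_objective) * second_param

# Usage example with made-up results:
# response_accuracy([True, True, False, True], [7.5, 8.0, 6.0])
```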
Specifically, another embodiment of step S31 may include the following implementation steps:
S312, processing each piece of dialogue text data with a preset quality-screening scoring language model to obtain a confusion degree (perplexity) evaluation value;
The computer scores the dialogue text data with the quality-screening scoring language model to obtain a confusion degree evaluation value. The confusion degree measures the quality of the text: in general, the lower the value, the better the text quality. The application uses a FastText model to score text quality; other quality-screening scoring language models may be used in other embodiments, and the application is not specifically limited in this respect.
S313, comparing the confusion degree evaluation value with a preset threshold value; if the confusion degree evaluation value is smaller than a preset threshold value, judging whether the current dialogue text data is template response data or not; if the current dialogue text data is the template response data, judging whether the template response data is meaningless response or not; if the response is meaningless, determining the response accuracy of the digital human model as a first parameter; if the answer is not meaningless, determining the answer accuracy of the digital human model as a second parameter; if the current dialogue text data is not the template response data, judging whether the dialogue text data is correct or not; if the answer is correct, determining the answer accuracy of the digital human model as a third parameter; if the answer is wrong, determining that the answer accuracy of the digital human model is a fourth parameter; if the confusion degree evaluation value is greater than or equal to a preset threshold value, determining that the response accuracy of the digital human model is a fifth parameter;
The computer compares the obtained confusion degree evaluation value with a preset threshold. If the value is below the threshold, the generated text is fluent, logically correct and free of semantic ambiguity, i.e. of high quality. If the value is greater than or equal to the threshold, the generated text is of low quality and may suffer from confused sentences or semantic errors; in that case the response accuracy is set to the fifth parameter, which can indicate poor performance of the digital human model or a need for further optimization. When the confusion degree evaluation value is below the threshold, the computer further judges whether the dialogue text data is template response data. A template response may be a preset reply, or a fallback reply used when the digital human cannot handle a business question, and carries no personalization. A fallback reply is, for example, when a digital human in the financial field is asked a medical question and answers "Your question is beyond what I can answer"; a preset reply is, for example, when the user says "Hello" and the digital human answers "Hello, how can Xiao X help you?". If the data is template response data, the computer judges whether it is a meaningless response, i.e. one that does not help the user's actual need: the fallback example above is meaningless, whereas the greeting example is meaningful; in other scenarios, a template response that asks the user a follow-up question to clarify their actual intent can likewise be judged meaningful. If the template response is meaningless and weakly related to the context, the response accuracy of the digital human model is set to the first parameter; if the template response is meaningful and strongly related to the context, the response accuracy is set to the second parameter. If the data is not template response data, the computer further judges whether the response in the dialogue text is correct: if correct, the response accuracy is set to the third parameter; otherwise it is set to the fourth parameter.
S314, determining response accuracy evaluation values of the digital human model based on the first parameter, the second parameter, the third parameter, the fourth parameter and the fifth parameter, wherein the values of the fifth parameter, the first parameter, the fourth parameter, the second parameter and the third parameter are sequentially increased.
The computer may calculate an average or a weighted value over the first to fifth parameters assigned across the dialogue data to obtain the corresponding response accuracy. Alternatively, it may take the parameter value that accounts for the largest share of the responses as the response accuracy. Any reasonable operation over these values may be used; the application does not limit it.
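The branching in steps S312-S314 can be summarized as the sketch below. The perplexity scorer, the template and meaninglessness detectors, the threshold and the concrete parameter values are all placeholders, since the application only requires that the fifth, first, fourth, second and third parameters increase in that order.

```python
# Placeholder parameter values obeying: fifth < first < fourth < second < third
FIFTH, FIRST, FOURTH, SECOND, THIRD = 0.0, 0.25, 0.5, 0.75, 1.0

def accuracy_parameter(dialogue_text: str,
                       perplexity_of,        # quality-screening scoring model (e.g. FastText-based)
                       is_template_response, # detects preset / fallback replies
                       is_meaningless,       # detects replies that do not help the user
                       is_correct,           # checks a non-template reply against the reference
                       threshold: float = 100.0) -> float:
    ppl = perplexity_of(dialogue_text)
    if ppl >= threshold:                      # low-quality, possibly garbled text
        return FIFTH
    if is_template_response(dialogue_text):
        return FIRST if is_meaningless(dialogue_text) else SECOND
    return THIRD if is_correct(dialogue_text) else FOURTH

def accuracy_from_parameters(params: list[float]) -> float:
    """Aggregate per-response parameters, e.g. by a simple average (step S314)."""
    return sum(params) / len(params)
```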
S32, determining a second evaluation value of the digital human model according to the response accuracy evaluation value, the context consistency evaluation value, the response naturalness evaluation value and the response fluency evaluation value;
Specifically, the computer may calculate the second evaluation value of the digital human model as a weighted combination, an average, or a sum of the response accuracy evaluation value, the context consistency evaluation value, the response naturalness evaluation value and the response fluency evaluation value.
S40, determining the sound lip synchronization deviation degree of the digital human model based on each audio data and each video data, and determining a third evaluation value based on the sound lip synchronization deviation degree;
In an embodiment of the present application, the step S40 may specifically include the following steps:
S41, processing each piece of audio data and each piece of video data based on a preset sound-lip synchronization test model to obtain an error value of lip-and-sound synchronization for each video, and obtaining the sound-lip synchronization deviation degree of the digital human model from the plurality of error values;
The purpose of the lip-synchronization test model is to evaluate the synchronization between the digital human model's lip movements and its voice by analyzing the audio and video data. The process may first extract speech features from the audio, such as phonemes, pitch, volume and speaking rate, which help determine the onset time and duration of each articulation; it may then analyze the video, tracking and recognizing the lip movements (opening and closing, shape changes and other articulation-related facial movements) with image processing techniques; finally, by comparing the audio features with the lip movements in the video, an error value of lip-and-sound synchronization is calculated for each video. The error value may be a time offset (e.g., the time difference between a lip movement and the corresponding sound) or a motion-matching score (e.g., the consistency between the lip shape and the mouth shape of the pronunciation). From these error values the computer obtains the sound-lip synchronization deviation degree of the digital human model. This index reflects the model's overall lip-synchronization performance: the lower the deviation, the better the synchronization and the more natural the digital human.
S42, determining a third evaluation value based on the sound lip synchronization deviation.
Specifically, the computer may obtain sound-lip synchronization deviations under different conditions, for example after refining the lip-synchronization test model, adjusting lip-animation parameters or optimizing the audio processing pipeline, and may combine the different deviations with a weighting algorithm to obtain the third evaluation value. The third evaluation value not only provides a quantitative index of the digital human model's lip-synchronization performance but also guides further improvement and optimization of the model, which is important for increasing the realism of the digital human and the quality of the user experience.
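A sketch of steps S41-S42 under the assumption that some pretrained audio-visual synchronization model (the application does not name one) returns a per-video offset in milliseconds; the linear score mapping and the window values are illustrative assumptions, loosely inspired by broadcast audio-video sync tolerances rather than taken from the application.

```python
def lip_sync_deviation(per_video_offsets_ms: list[float]) -> float:
    """Aggregate per-video lip/sound synchronization errors (absolute offsets,
    as produced by an assumed sound-lip synchronization test model) into a
    single deviation degree for the digital human model."""
    return sum(abs(o) for o in per_video_offsets_ms) / len(per_video_offsets_ms)

def third_evaluation_value(deviation_ms: float,
                           acceptable_ms: float = 45.0,
                           worst_ms: float = 300.0) -> float:
    """Map the deviation to a 0..1 score: offsets within the acceptable window
    score 1, offsets beyond the worst-case window score 0, linear in between."""
    if deviation_ms <= acceptable_ms:
        return 1.0
    if deviation_ms >= worst_ms:
        return 0.0
    return 1.0 - (deviation_ms - acceptable_ms) / (worst_ms - acceptable_ms)
```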
S50, determining facial expression and limb action richness of the digital human model based on each video data, and determining a fourth evaluation value based on the facial expression and limb action richness;
In an embodiment of the present application, step S50 may specifically include the following steps:
S51, acquiring the total interaction time length of each group of multi-source interaction data;
Wherein, the total interaction time length of each group of multi-source interaction data refers to the total time length from the beginning to the end in one interaction session. This index helps to understand the duration of the interaction, providing a basis for subsequent analysis.
S52, obtaining the total number of facial expressions and limb actions of the digital human model corresponding to each third emotion label respectively;
Specifically, the computer may analyze each third emotion label in the interaction process, and the computer may count the total number of facial expressions and limb actions of the digital human model corresponding to each label.
S53, determining the abundance of the facial expression and the limb actions of the digital human model according to the ratio of the total number of the facial expression and the limb actions corresponding to each third emotion label to the total interaction time of the digital human model;
Specifically, the computer may determine the facial expression and the limb action richness of the digital human model by calculating a ratio of the total number of facial expressions and limb actions corresponding to each third emotion label to the total interaction time. This ratio reflects the diversity and degree of variation in facial expression and limb movements exhibited by the digital human model over a unit of time. The higher the ratio, the more abundant the facial expression and limb movements of the digital human model are, and the better the interaction performance is.
And S54, determining a fourth evaluation value based on the facial expression and the limb action richness.
Specifically, the computer may map the facial expression and limb action richness ratio onto a normalized scoring range, e.g., 0 to 10 points, for comparison and evaluation, and then combine the scores of the individual third emotion labels with a weighting algorithm to obtain the overall fourth evaluation value. This value not only provides a performance metric for the digital human model's facial expressions and limb actions but also helps guide further optimization and improvement of the model, which is important for enhancing the naturalness of the digital human and the user's interaction experience.
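A sketch of steps S51-S54 follows. The per-label counts are assumed to be available from the video analysis, and the saturation rate used to map richness onto the 0-10 scale, as well as the uniform default weighting, are placeholder choices not specified by the application.

```python
def fourth_evaluation_value(counts_per_label: dict,
                            total_interaction_seconds: float,
                            saturation_rate: float = 0.5,
                            weights=None) -> float:
    """counts_per_label: total number of facial expressions + limb actions
    observed for each third emotion label (e.g. "happy", "sad", ...).
    Richness of a label = count / total interaction duration; it is mapped to
    0..10 by clipping at an assumed saturation rate (events per second)."""
    scores = {}
    for label, count in counts_per_label.items():
        richness = count / total_interaction_seconds          # events per second
        scores[label] = min(richness / saturation_rate, 1.0) * 10.0
    if weights is None:                                       # uniform weighting by default
        weights = {label: 1.0 / len(scores) for label in scores}
    return sum(weights[label] * score for label, score in scores.items())

# Usage example with made-up counts:
# fourth_evaluation_value({"happy": 40, "surprise": 12}, total_interaction_seconds=600)
```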
S60, evaluating the man-machine interaction performance of the digital human model based on the first evaluation value, the second evaluation value, the third evaluation value and the fourth evaluation value.
Specifically, the computer may first determine the weight of each evaluation value. Weights are typically assigned according to how important each item is to overall human-machine interaction performance; for example, lip synchronization may be critical to the naturalness of the interaction and may therefore receive a higher weight. Next, the computer may normalize the first, second, third and fourth evaluation values to the same scoring range, such as 0 to 1 or 0 to 100. Finally, the computer may sum the weighted evaluation values to obtain a composite score. In other embodiments the average of the four values may also be used; the specific calculation is not limited and may be designed according to user requirements. The composite score serves as the basis for evaluating the human-machine interaction performance of the digital human model.
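The final aggregation in step S60 might look like the sketch below; the weights, the normalization bounds and the choice of a weighted sum are user-configurable assumptions, as the description itself notes.

```python
def interaction_performance(evals: dict, weights: dict, ranges: dict) -> float:
    """Normalize each evaluation value to 0..1 within its own range, then
    combine the values with user-chosen weights (assumed to sum to 1)."""
    score = 0.0
    for name, value in evals.items():
        lo, hi = ranges[name]
        score += weights[name] * (value - lo) / (hi - lo)
    return score

# Example with assumed weights emphasizing lip synchronization and response quality:
# interaction_performance(
#     {"emotion": e1, "response": e2, "lip_sync": e3, "richness": e4},
#     weights={"emotion": 0.2, "response": 0.3, "lip_sync": 0.3, "richness": 0.2},
#     ranges={"emotion": (0, 5), "response": (0, 1), "lip_sync": (0, 1), "richness": (0, 10)},
# )
```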
On the basis of the foregoing embodiments, the embodiment of the present application further provides a digital person interaction evaluation device, fig. 2 is a schematic structural diagram of the digital person interaction evaluation device according to an embodiment of the present application, and referring to fig. 2, the digital person interaction evaluation device 100 may include: the data acquisition module 101, the first evaluation module 102, the second evaluation module 103, the third evaluation module 104, the fourth evaluation module 105 and the fifth evaluation module 106, wherein the data acquisition module 101 is configured to acquire multiple groups of multi-source interaction data of the digital human model, and each group of multi-source interaction data at least includes dialogue text data, audio data and video data; the first evaluation module 102 may be configured to obtain a plurality of emotion sequences of the digital human model according to each set of multi-source interaction data, and determine a first evaluation value based on a degree of deviation of emotion consistency between the plurality of emotion sequences; the second evaluation module 103 may be configured to determine a response quality of the digital person model based on each of the dialog text data, and determine a second evaluation value based on the response quality; the third evaluation module 104 may be configured to determine a sound lip sync deviation of the digital person model based on each of the audio data and each of the video data, and determine a third evaluation value based on the sound lip sync deviation; the fourth evaluation module 105 may be configured to determine facial expressions and limb motion richness of the digital human model based on each video data, and determine a fourth evaluation value based on the facial expressions and limb motion richness; the fifth evaluation module 106 may be configured to evaluate the human-machine interaction performance of the digital human model based on the first, second, third, and fourth evaluation values.
In an embodiment of the present application, the first evaluation module 102 may be specifically configured to acquire the dialog text data, the audio data, and the video data, respectively, based on a time axis; respectively identifying the dialogue text data, the audio data and the video data by using a preset emotion identification model to correspondingly obtain a first emotion sequence, a second emotion sequence and a third emotion sequence; wherein the first emotional sequence, the second emotional sequence, and the third emotional sequence are sequences respectively composed of corresponding emotional tag arrangements based on a time axis; and based on a time axis, respectively evaluating the consistency deviation degree of the emotion labels corresponding to the first emotion sequence, the second emotion sequence and the third emotion sequence in the same time period, and determining a first evaluation value.
In an embodiment of the present application, the first evaluation module 102 may be further specifically configured to determine a departure time value and a total duration by arranging the first emotion sequence, the second emotion sequence, and the third emotion sequence side by side based on a time axis; the deviation time value is based on a maximum time value which is experienced when the emotion labels are inconsistent corresponding to the emotion labels between a reference party and a comparison party, wherein the reference party is any one of the first emotion sequence, the second emotion sequence and the third emotion sequence as an evaluation reference sequence; the comparison party is a combination of two emotion sequences which are left in the first emotion sequence, the second emotion sequence and the third emotion sequence except the reference party; the total duration is the maximum value of the duration of the first emotion sequence, the second emotion sequence and the third emotion sequence; calculating based on the deviation time value and the total duration to obtain emotion deviation degree; and determining a first evaluation value based on the emotion deviation degree and a preset emotion deviation degree comparison evaluation table.
In an embodiment of the present application, the second evaluation module 103 may be specifically configured to obtain, based on each of the dialog text data, a response accuracy evaluation value, a context consistency evaluation value, a response naturalness evaluation value, and a response fluency evaluation value, respectively; and determining a second evaluation value according to the response accuracy evaluation value, the context consistency evaluation value, the response naturalness evaluation value and the response fluency evaluation value.
In the embodiment of the present application, the second evaluation module 103 may be further specifically configured to process each dialog text data by using a preset quality screening scoring language model to obtain a confusion degree evaluation value; comparing the confusion degree evaluation value with a preset threshold value; if the confusion degree evaluation value is smaller than a preset threshold value, judging whether the current dialogue text data is template response data or not; if the current dialogue text data is the template response data, judging whether the template response data is meaningless response or not; if the response is meaningless, determining the response accuracy of the digital human model as a first parameter; if the answer is not meaningless, determining that the answer accuracy of the digital human model is a second parameter; if the current dialogue text data is not the template response data, judging whether the dialogue text data is correct or not; if the response accuracy of the digital human model is correct, determining the response accuracy of the digital human model as a third parameter; if the answer is wrong, determining that the answer accuracy of the digital human model is a fourth parameter; if the confusion degree evaluation value is larger than or equal to a preset threshold value, determining that the response accuracy of the digital human model is a fifth parameter; and determining a response accuracy evaluation value of the digital human model based on the first parameter, the second parameter, the third parameter, the fourth parameter and the fifth parameter, wherein the values of the fifth parameter, the first parameter, the fourth parameter, the second parameter and the third parameter are sequentially increased.
In the embodiment of the present application, the third evaluation module 104 may be specifically configured to process each piece of audio data and each piece of video data based on a preset lip synchronization test model, obtain an error value of lip and sound synchronization in each video, and obtain a sound lip synchronization deviation degree of the digital person model based on a plurality of the error values; the third evaluation value is determined based on the sound lip-sync deviation.
In an embodiment of the present application, the fourth evaluation module 105 may be specifically configured to obtain a total interaction duration of each set of the multi-source interaction data; obtaining the total number of facial expressions and limb actions of the digital human model corresponding to each third emotion label respectively; determining the facial expression and limb action richness of the digital human model according to the ratio of the total number of facial expressions and limb actions corresponding to each third emotion label to the total interaction time length of the digital human model; the fourth evaluation value is determined based on the facial expression and limb-motion richness.
On the basis of the foregoing embodiment, the embodiment of the present application further provides a computer-readable storage medium, referring to fig. 3, where the computer-readable storage medium is shown as an optical disc 50, and a computer algorithm (i.e. an algorithm product) is stored on the optical disc, where the computer algorithm, when executed by a processor, implements the steps described in the foregoing method implementation manner, for example, obtaining multiple sets of multi-source interaction data of a digital human model, where each set of multi-source interaction data includes at least dialogue text data, audio data, and video data; according to each group of multisource interaction data, a plurality of emotion sequences of the digital human model are obtained, and a first evaluation value is determined based on the deviation degree of emotion consistency among the plurality of emotion sequences; determining a response quality of the digital person model based on each dialogue text data, and determining a second evaluation value based on the response quality; determining a sound lip synchronization deviation degree of the digital person model based on each of the audio data and each of the video data, and determining a third evaluation value based on the sound lip synchronization deviation degree; determining facial expression and limb action richness of the digital human model based on each video data, and determining a fourth evaluation value based on the facial expression and limb action richness; based on the first evaluation value, the second evaluation value, the third evaluation value and the fourth evaluation value, the man-machine interaction performance of the digital human model is evaluated, and the specific implementation of each step is not repeated here. It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
Furthermore, on the basis of the above embodiments, the embodiment of the present application also provides a device. Fig. 4 shows a block diagram of an exemplary device 60 suitable for implementing the embodiments of the present application, where the device 60 may be a computer system or a server. The device 60 shown in fig. 4 is only an example and should not impose any limitation on the functionality or scope of use of the embodiments of the application.
Referring to fig. 4, components of device 60 may include, but are not limited to: one or more processors or processing units 601, a system memory 602, and a bus 603 that connects the different system components (including the system memory 602 and the processing units 601).
Device 60 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by device 60 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 602 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 6021 and/or a cache memory 6022. The device 60 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 6023 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in fig. 4 and commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disc drive for reading from or writing to a removable non-volatile optical disc (e.g., a CD-ROM, DVD-ROM or other optical media), may be provided. In these cases, each drive may be connected to the bus 603 through one or more data medium interfaces. The system memory 602 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the various embodiments of the application.
A program/utility 6025 having a set (at least one) of program modules 6024 may be stored, for example, in the system memory 602. Such program modules 6024 include, but are not limited to: an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment. The program modules 6024 generally perform the functions and/or methods of the described embodiments of the application.
The device 60 may also communicate with one or more external devices 604 (e.g., keyboard, pointing device, display, etc.). Such communication may occur through an input/output I/O interface 605. Also, the device 60 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 606. As shown in fig. 4, the network adapter 606 communicates with other modules of the device 60, such as the processing unit 601, etc., via a bus 603 that connects the different system components. It should be appreciated that although not shown in fig. 4, other hardware and/or software program modules may be used in connection with device 60.
The processing unit 601 executes various functional applications and data processing by running the programs stored in the system memory 602, for example: acquiring multiple sets of multi-source interaction data of a digital human model, each set of multi-source interaction data including at least dialogue text data, audio data and video data; obtaining a plurality of emotion sequences of the digital human model from each set of multi-source interaction data, and determining a first evaluation value based on the deviation degree of emotion consistency among the plurality of emotion sequences; determining a response quality of the digital human model based on each piece of dialogue text data, and determining a second evaluation value based on the response quality; determining a sound lip synchronization deviation degree of the digital human model based on each piece of audio data and each piece of video data, and determining a third evaluation value based on the sound lip synchronization deviation degree; determining the facial expression and limb action richness of the digital human model based on each piece of video data, and determining a fourth evaluation value based on the facial expression and limb action richness; and evaluating the man-machine interaction performance of the digital human model based on the first evaluation value, the second evaluation value, the third evaluation value and the fourth evaluation value. The specific implementation of each step is not repeated here. It should be noted that although the above detailed description mentions several units/modules or sub-units/sub-modules of the digital human interaction evaluation device, such a division is merely exemplary and not mandatory. Indeed, according to the embodiments of the present application, the features and functions of two or more of the units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided into and embodied by a plurality of units/modules.
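For the emotion-consistency step of the evaluation flow described above, a simplified sketch is given below: the three emotion sequences are assumed to share one discrete time grid, the text-based sequence is arbitrarily taken as the reference party, and the deviation degree is the longest run of inconsistent labels divided by the total duration; mapping the result to the first evaluation value via the preset comparison evaluation table is omitted.

```python
# Simplified sketch: common discrete time grid assumed; text sequence used
# as the reference party; lookup against the preset comparison table omitted.

def emotion_deviation_degree(text_seq, audio_seq, video_seq):
    """Each argument is a list of emotion labels sampled at the same rate."""
    total = max(len(text_seq), len(audio_seq), len(video_seq))
    if total == 0:
        return 0.0
    reference, others = text_seq, (audio_seq, video_seq)
    longest = current = 0
    for t in range(total):
        ref = reference[t] if t < len(reference) else None
        mismatch = any(t >= len(o) or o[t] != ref for o in others)
        current = current + 1 if mismatch else 0
        longest = max(longest, current)
    return longest / total
```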
In the description of the present application, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a logical functional division, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed herein may be indirect couplings or communication connections implemented through some communication interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are only specific implementations of the present application, intended to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still modify or readily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the technical scope disclosed by the present application; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
Claims (10)
1. A method for interactive assessment of digital persons, comprising:
Acquiring multiple groups of multi-source interaction data of a digital human model, wherein each group of multi-source interaction data at least comprises dialogue text data, audio data and video data;
Respectively acquiring the dialogue text data, the audio data and the video data in each group of the multi-source interaction data based on a time axis, acquiring a plurality of emotion sequences of the digital human model based on the dialogue text data, the audio data and the video data, and determining a first evaluation value based on the deviation degree of emotion consistency among the plurality of emotion sequences;
determining response quality of the digital human model based on each dialogue text data, and determining a second evaluation value based on the response quality, wherein the response quality comprises machine language personification evaluation data and response fluency evaluation data;
determining a sound lip-sync deviation degree of the digital person model based on each of the audio data and each of the video data, and determining a third evaluation value based on the sound lip-sync deviation degree;
Determining facial expression and limb action richness of the digital human model based on each video data, and determining a fourth evaluation value based on the facial expression and limb action richness;
and evaluating the man-machine interaction performance of the digital human model based on the first evaluation value, the second evaluation value, the third evaluation value and the fourth evaluation value.
2. The method of claim 1, wherein the determining a first evaluation value based on the deviation degree of emotion consistency among the plurality of emotion sequences comprises:
respectively acquiring the dialogue text data, the audio data and the video data based on a time axis;
Respectively identifying the dialogue text data, the audio data and the video data by using a preset emotion identification model to correspondingly obtain a first emotion sequence, a second emotion sequence and a third emotion sequence; wherein the first emotional sequence, the second emotional sequence, and the third emotional sequence are sequences respectively composed of corresponding emotional tag arrangements based on a time axis;
And based on a time axis, respectively evaluating the consistency deviation degree of the emotion labels corresponding to the first emotion sequence, the second emotion sequence and the third emotion sequence in the same time period, and determining a first evaluation value.
3. The digital human interaction assessment method according to claim 2, wherein the determining a first evaluation value by respectively evaluating, based on a time axis, the consistency deviation degree of the emotion labels in the first emotion sequence, the second emotion sequence and the third emotion sequence in the same time period comprises:
Determining a deviation time value and a total duration by arranging the first emotion sequence, the second emotion sequence and the third emotion sequence side by side on a time axis; wherein the deviation time value is the maximum duration over which the emotion labels of a reference party and a comparison party are inconsistent, the reference party being any one of the first emotion sequence, the second emotion sequence and the third emotion sequence taken as the evaluation reference sequence, and the comparison party being the combination of the two remaining emotion sequences other than the reference party among the first emotion sequence, the second emotion sequence and the third emotion sequence; and the total duration is the maximum of the durations of the first emotion sequence, the second emotion sequence and the third emotion sequence;
Calculating based on the deviation time value and the total duration to obtain emotion deviation degree;
And determining a first evaluation value based on the emotion deviation degree and a preset emotion deviation degree comparison evaluation table.
4. The digital person interaction assessment method according to claim 1, wherein said determining a response quality of said digital person model based on each of said dialogue text data and determining a second assessment value based on said response quality comprises:
Respectively obtaining a response accuracy evaluation value, a context consistency evaluation value, a response naturalness evaluation value and a response fluency evaluation value based on the dialogue text data;
and determining a second evaluation value according to the response accuracy evaluation value, the context consistency evaluation value, the response naturalness evaluation value and the response fluency evaluation value.
5. The method of claim 4, wherein the dialogue text data comprises a subjective text data set and an objective text data set, and the response accuracy evaluation value is calculated from a first parameter corresponding to the objective text data set and a second parameter corresponding to the subjective text data set, wherein the first parameter is a value determined by the accuracy rates corresponding to a plurality of objective question-and-answer dialogue texts; and the second parameter is a value determined by the evaluation values given to a plurality of subjective question-and-answer dialogue texts under corresponding quality evaluation criteria.
6. The digital human interaction assessment method according to claim 1, wherein the determining of the sound lip sync deviation degree of the digital human model based on each of the audio data and each of the video data, and the determining of the third assessment value based on the sound lip sync deviation degree, comprises:
processing each piece of audio data and each piece of video data based on a preset audio lip synchronization test model to obtain error values of lip and sound synchronization in each video, and obtaining the sound lip synchronization deviation degree of the digital human model based on a plurality of error values;
the third evaluation value is determined based on the sound lip-sync deviation.
7. The digital human interaction assessment method according to claim 2, wherein said determining facial expression and limb-motion richness based on each of said video data, and determining a fourth assessment value based on said facial expression and limb-motion richness, comprises:
acquiring the total interaction time length of each group of multi-source interaction data;
Based on the emotion labels corresponding to each third emotion sequence, respectively acquiring the total number of facial expressions and limb actions of the digital human model;
determining the facial expression and limb action richness of the digital human model according to the ratio of the total number of the facial expressions and limb actions corresponding to each third emotion sequence to the total interaction duration of the digital human model;
the fourth evaluation value is determined based on the facial expression and limb-motion richness.
8. A digital human interaction assessment device, comprising:
the data acquisition module is used for acquiring a plurality of groups of multi-source interaction data of the digital human model, wherein each group of multi-source interaction data at least comprises dialogue text data, audio data and video data;
A first evaluation module for respectively acquiring the dialogue text data, the audio data and the video data in each group of the multi-source interactive data based on a time axis, acquiring a plurality of emotion sequences of the digital human model based on the dialogue text data, the audio data and the video data, and determining a first evaluation value based on the deviation degree of emotion consistency among the plurality of emotion sequences;
the second evaluation module is used for determining the response quality of the digital human model based on each piece of dialogue text data and determining a second evaluation value based on the response quality, wherein the response quality comprises machine language personification evaluation data and response fluency evaluation data;
A third evaluation module for determining a sound lip-sync deviation degree of the digital person model based on each of the audio data and each of the video data, and determining a third evaluation value based on the sound lip-sync deviation degree;
a fourth evaluation module for determining facial expression and limb action richness of the digital human model based on each of the video data, and determining a fourth evaluation value based on the facial expression and limb action richness;
And a fifth evaluation module for evaluating the man-machine interaction performance of the digital human model based on the first evaluation value, the second evaluation value, the third evaluation value and the fourth evaluation value.
9. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the digital human interaction assessment method of any of claims 1 to 7.
10. A digital human interaction assessment device, the device comprising:
At least one processor, memory, and input output unit;
wherein the memory is for storing a computer program and the processor is for invoking the computer program stored in the memory to perform the method of interaction assessment of digital persons according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410712819.4A CN118279704B (en) | 2024-06-04 | 2024-06-04 | Digital human interaction evaluation method, device, storage medium and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410712819.4A CN118279704B (en) | 2024-06-04 | 2024-06-04 | Digital human interaction evaluation method, device, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118279704A CN118279704A (en) | 2024-07-02 |
CN118279704B (en) | 2024-08-13 |
Family ID: 91640595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410712819.4A (CN118279704B, Active) | Digital human interaction evaluation method, device, storage medium and equipment | 2024-06-04 | 2024-06-04 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118279704B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117462130A (en) * | 2023-11-13 | 2024-01-30 | 天翼视讯传媒有限公司 | Mental health assessment method and system based on digital person |
CN117709355A (en) * | 2024-02-05 | 2024-03-15 | 四川蜀天信息技术有限公司 | Method, device and medium for improving training effect of large language model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230082830A1 (en) * | 2020-05-18 | 2023-03-16 | Beijing Sogou Technology Development Co., Ltd. | Method and apparatus for driving digital human, and electronic device |
CN117893070A (en) * | 2023-12-28 | 2024-04-16 | 天翼物联科技有限公司 | Method, system and medium for evaluating service quality of pension based on Internet of things perception |
CN118098587A (en) * | 2024-02-27 | 2024-05-28 | 北京航空航天大学 | AI suicide risk analysis method and system based on digital doctor |
CN117828320B (en) * | 2024-03-05 | 2024-05-07 | 元创者(厦门)数字科技有限公司 | Virtual digital person construction method and system |
2024-06-04: application CN202410712819.4A filed in China (CN); granted and published as CN118279704B; legal status: active.
Also Published As
Publication number | Publication date |
---|---|
CN118279704A (en) | 2024-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103366618B (en) | Scene device for Chinese learning training based on artificial intelligence and virtual reality | |
Pfister et al. | Real-time recognition of affective states from nonverbal features of speech and its application for public speaking skill analysis | |
Michelsanti et al. | Vocoder-based speech synthesis from silent videos | |
CN115713875A (en) | Virtual reality simulation teaching method based on psychological analysis | |
Wang et al. | Comic-guided speech synthesis | |
KR20220128897A (en) | Speaking Test System and Method with AI Avatar | |
Chai et al. | Speech-driven facial animation with spectral gathering and temporal attention | |
KR20210123545A (en) | Method and apparatus for conversation service based on user feedback | |
Saito et al. | STUDIES: Corpus of Japanese empathetic dialogue speech towards friendly voice agent | |
Hoque et al. | Robust recognition of emotion from speech | |
CN117152308B (en) | Virtual person action expression optimization method and system | |
CN118279704B (en) | Digital human interaction evaluation method, device, storage medium and equipment | |
CN110956859A (en) | VR intelligent voice interaction English method based on deep learning | |
US12112740B2 (en) | Creative work systems and methods thereof | |
Begum et al. | Survey on Artificial Intelligence-based Depression Detection using Clinical Interview Data | |
CN115831153A (en) | Pronunciation quality testing method | |
Schrank et al. | Automatic detection of uncertainty in spontaneous German dialogue. | |
Busso et al. | Learning expressive human-like head motion sequences from speech | |
Kacorri et al. | Evaluating a dynamic time warping based scoring algorithm for facial expressions in ASL animations | |
Hjalmarsson | Evaluating AdApt, a multi-modal conversational dialogue system using PARADISE | |
Lovely et al. | Rule-based lip-syncing algorithm for virtual character in voice chatbot | |
Fujita et al. | Virtual cognitive model for Miyazawa Kenji based on speech and facial images recognition. | |
Pfister | Emotion detection from speech | |
KR102689260B1 (en) | Server and method for operating a lecture translation platform based on real-time speech recognition | |
Avci | A Pattern Mining Approach for Improving Speech Emotion Recognition |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |