CN115439582A - Method for driving avatar model, avatar driving apparatus, and storage medium - Google Patents

Method for driving avatar model, avatar driving apparatus, and storage medium

Info

Publication number
CN115439582A
Authority
CN
China
Prior art keywords
action
frame
expression
model
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210972793.8A
Other languages
Chinese (zh)
Inventor
钟静华
孙立发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dadan Shusheng Technology Co ltd
Original Assignee
Shenzhen Dadan Shusheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dadan Shusheng Technology Co ltd filed Critical Shenzhen Dadan Shusheng Technology Co ltd
Priority to CN202210972793.8A priority Critical patent/CN115439582A/en
Publication of CN115439582A publication Critical patent/CN115439582A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a method for driving an avatar model, avatar driving equipment, and a storage medium. The method comprises the following steps: when input information is acquired, determining response information corresponding to the input information; determining an emotion label and an action label associated with the response information; and driving the avatar model to output the expression matched with the emotion label and the action matched with the action label. By generating a virtual person with actions and expressions from the emotion label and action label fed back for the input information, the avatar model is driven in a way that more naturally approaches a real person, thereby addressing the problem of how to improve the interactivity between the avatar model and a real person.

Description

Method for driving avatar model, avatar driving apparatus, and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular, to a method for driving an avatar model, an avatar driving apparatus, and a storage medium.
Background
AI (Artificial Intelligence)-driven avatar generation is a technique that uses audio to generate the corresponding actions, expressions, speech, and the like of a specific person based on artificial intelligence. Avatars may be used in various fields such as virtual assistants, virtual anchors, and virtual teachers. The mapping from audio to speaking expression (mainly the speaking mouth shape) is learned through a sequential neural network, such as an RNN (Recurrent Neural Network) or a GRU (Gated Recurrent Unit), and the resulting expression parameters are then used to control the face synthesis process to synthesize the final natural speaking video.
In the related art, the generated AI avatar usually relies on fixed answers preset for various questions when interacting with a real person. During the interaction, the answer relevant to the question raised by the real person is retrieved, and the expression and action corresponding to that answer are output.
However, in this interaction mode, when a question outside the preset fixed answers arises, the AI avatar cannot interact with the real person, and the driving mode of the avatar model is limited to a single pattern.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The main object of the present invention is to provide a method for driving an avatar model, aiming to solve the problem of how to improve the interactivity between an avatar and a real person.
In order to achieve the above object, the present invention provides a method for driving an avatar model, the method comprising:
when input information is acquired, determining response information corresponding to the input information;
determining an emotion label and an action label associated with the response information;
and driving the virtual image model to output the expression matched with the emotion label, and driving the virtual image model to output the action matched with the action label.
Optionally, the step of determining an emotion tag and an action tag associated with the response information includes:
determining a text vector fusing language prior knowledge in the response information based on a pre-training bidirectional coding representation model;
inputting the text vector into a target linear classifier, and determining an emotion category and an action category in the text vector;
determining the probability of the text vector in each corresponding emotion category and the probability of the text vector in each corresponding action category;
and selecting the emotion category with the maximum probability in the emotion categories as the emotion label, and selecting the action category with the maximum probability in the action categories as the action label.
Optionally, the target linear classifier includes an emotion recognition linear classifier and an action intention recognition linear classifier, and before the step of inputting the text vector into the target linear classifier and determining the emotion category and the action category in the text vector, the method includes:
acquiring preset training samples, wherein the preset training samples comprise emotion recognition training samples and action intention recognition training samples;
updating corresponding parameters in the pre-training bidirectional coding representation model and the initial linear classifier based on the emotion recognition training sample to obtain the emotion recognition linear classifier and the fine-tuned pre-training bidirectional coding representation model, wherein the parameters are the optimal solution of the loss function corresponding to the emotion recognition training sample;
and updating corresponding parameters in the pre-training bidirectional coding representation model and the initial linear classifier based on the action intention recognition training sample to obtain the action intention recognition linear classifier and the fine-tuned pre-training bidirectional coding representation model, wherein the parameters are the optimal solution of the loss function corresponding to the action intention recognition training sample.
Optionally, the determining the emotion category and the action category in the text vector comprises:
determining a first marker position in the text vector;
determining expression characteristics of the text vector according to the sub-text vector corresponding to the first mark position;
inputting the expression features into the emotion recognition linear classifier, and determining the emotion category corresponding to the text vector;
and determining a second marker position and a text length in the text vector;
splicing the text length to the sub-text vector corresponding to the second mark position;
determining the action characteristics of the text vector according to the spliced sub-text vector;
inputting the action features into the action intention recognition linear classifier, and determining the action category of the text vector.
Optionally, before the step of determining the text vector fused with the language prior knowledge in the text information based on the pre-trained bidirectional coding representation model, the method includes:
acquiring a deep self-attention network;
inputting large-scale unsupervised data into the deep self-attention network, and training the deep self-attention network through a masking language model and next sentence prediction to generate the pre-training bidirectional coding representation model.
Optionally, the step of driving the avatar model to output the expression matched with the emotion tag, and driving the avatar model to output the action matched with the action tag includes:
matching the expression corresponding to the virtual image model according to the emotion label, and matching the action corresponding to the model according to the action label;
performing frame interpolation processing on the expression and the action;
and inputting the expression after the frame interpolation and the action after the frame interpolation into the model so as to drive the virtual image model.
Optionally, the step of performing frame interpolation processing on the expression and the action includes:
extracting a first expression frame and a second expression frame of the expression, wherein the first expression frame is positioned in front of the second expression frame;
determining a corresponding time difference between the first expression frame and the second expression frame;
determining the position offset of each pixel point between the first expression frame and the second expression frame in the time difference;
generating an expression transition frame between the first expression frame and the second expression frame according to the position offset;
inserting the expression transition frame into a corresponding position between the first expression frame and the second expression frame;
and extracting a first action frame and a second action frame of the action, the first action frame preceding the second action frame;
determining a corresponding time difference between the first motion frame and the second motion frame;
determining the position offset of each pixel point between the first action frame and the second action frame in the time difference;
generating an action transition frame between the first action frame and the second action frame according to the position offset;
inserting the motion transition frame to a corresponding location between the first motion frame and the second motion frame.
Optionally, before the step of extracting the first motion frame and the second motion frame of the motion, the method includes:
determining a dwell time, duration, and naturalness of the action;
if the pause time is within a first preset range, the duration time is greater than a preset time threshold, the naturalness is greater than a preset naturalness threshold, the extraction mode is determined to be sequential frame extraction, and the steps of extracting a first action frame and a second action frame of the action are executed according to the sequential frame extraction;
or, if the pause time is within a second preset range, the duration time is less than or equal to the preset time threshold, the naturalness is less than or equal to the naturalness threshold, the extraction mode is determined to be reverse order frame extraction, and the step of extracting the first action frame and the second action frame of the action is executed according to the reverse order frame extraction.
Further, to achieve the above object, the present invention also provides an avatar driving apparatus including a memory, a processor, and a driver of an avatar model stored on the memory and executable on the processor, the driver of the avatar model, when executed by the processor, implementing the steps of the method for driving an avatar model as described above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a driver of an avatar model, the driver of the avatar model implementing the steps of the driving method of the avatar model as described above when being executed by a processor.
The embodiment of the invention provides a driving method of an avatar model, avatar driving equipment and a storage medium, wherein the method comprises the following steps: when input information is acquired, determining response information corresponding to the input information; determining an emotion label and an action label associated with the response information; and driving the avatar model to output the expression matched with the emotion label, and driving the avatar model to output the action matched with the action label. When various kinds of input information are acquired, the response information corresponding to the input information is determined, the associated emotion label and action label are identified from the response information, and a virtual person with actions and expressions can be generated based on the emotion label and the action label, so that the avatar is driven in a way that more naturally approaches a real person.
Drawings
Fig. 1 is a schematic diagram of a hardware architecture of an avatar driving apparatus according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a driving method of an avatar model according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a training process of a BERT model for an emotion classification recognition task;
FIG. 4 is a schematic diagram of a training flow of the BERT model of the action intention recognition task;
FIG. 5 is a detailed flowchart of step S30 in the second embodiment of the driving method of the avatar model according to the present invention;
fig. 6 is a schematic view of an avatar model without significant changes in motion and expression according to a second embodiment of the driving method of an avatar model of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
The method and the device realize automatic matching of action and expression labels to text through Natural Language Processing (NLP) technology, and realize natural transitions between different actions and expressions through an action and expression frame interpolation algorithm.
For a better understanding of the above technical solutions, exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As an implementation, the hardware architecture of the avatar driving apparatus may be as shown in fig. 1.
The embodiment of the invention relates to a hardware architecture of an avatar driving device, which comprises the following components: a processor 101 (e.g., a CPU), a memory 102, and a communication bus 103, wherein the communication bus 103 is used to enable connection and communication between these components.
The memory 102 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). As shown in fig. 1, a memory 102, which is a computer-readable storage medium, may include a driver of the avatar model therein; and the processor 101 may be configured to call a driver of the avatar model stored in the memory 102 and perform the following operations:
when input information is acquired, determining response information corresponding to the input information;
determining an emotion label and an action label associated with the response information;
and driving the virtual image model to output the expression matched with the emotion label, and driving the virtual image model to output the action matched with the action label.
In one embodiment, the processor 101 may be configured to invoke a driver for the avatar model stored in the memory 102 and perform the following operations:
determining a text vector fusing language prior knowledge in the response information based on a pre-training bidirectional coding representation model;
inputting the text vector into a target linear classifier, and determining an emotion category and an action category in the text vector;
determining the probability of the text vector in each corresponding emotion category and the probability of the text vector in each corresponding action category;
and selecting the emotion category with the maximum probability in the emotion categories as the emotion label, and selecting the action category with the maximum probability in the action categories as the action label.
In one embodiment, the processor 101 may be configured to invoke a driver for the avatar model stored in the memory 102 and perform the following operations:
acquiring preset training samples, wherein the preset training samples comprise emotion recognition training samples and action intention recognition training samples;
updating corresponding parameters in the pre-training bidirectional coding representation model and the initial linear classifier based on the emotion recognition training sample to obtain the emotion recognition linear classifier and the fine-tuned pre-training bidirectional coding representation model, wherein the parameters are the optimal solution of the loss function corresponding to the emotion recognition training sample;
and updating corresponding parameters in the pre-training bidirectional coding representation model and the initial linear classifier based on the action intention recognition training sample to obtain the action intention recognition linear classifier and the fine-tuned pre-training bidirectional coding representation model, wherein the parameters are the optimal solution of the loss function corresponding to the action intention recognition training sample.
In one embodiment, the processor 101 may be configured to invoke a driver for the avatar model stored in the memory 102 and perform the following operations:
determining a first marker position in the text vector;
determining expression characteristics of the text vector according to the sub-text vector corresponding to the first mark position;
inputting the expression features into the emotion recognition linear classifier, and determining the emotion category corresponding to the text vector;
and determining a second marker position and a text length in the text vector;
splicing the text length to the sub-text vector corresponding to the second mark position;
determining the action characteristics of the text vector according to the spliced sub-vectors;
inputting the action features into the action intention recognition linear classifier, and determining the action category of the text vector.
In one embodiment, the processor 101 may be configured to invoke a driver for the avatar model stored in the memory 102 and perform the following operations:
acquiring a deep self-attention network;
inputting large-scale unsupervised data to the deep self-attention network, and training the deep self-attention network through a masking language model and next sentence prediction to generate the pre-training bidirectional coding representation model.
In one embodiment, the processor 101 may be configured to invoke a driver for the avatar model stored in the memory 102 and perform the following operations:
matching the expression corresponding to the virtual image model according to the emotion label, and matching the action corresponding to the model according to the action label;
performing frame interpolation processing on the expression and the action;
and inputting the expression after the frame interpolation and the action after the frame interpolation into the model so as to drive the virtual image model.
In one embodiment, the processor 101 may be configured to invoke a driver for the avatar model stored in the memory 102 and perform the following operations:
extracting a first expression frame and a second expression frame of the expression, wherein the first expression frame is positioned in front of the second expression frame;
determining a corresponding time difference between the first expression frame and the second expression frame;
determining the position offset of each pixel point between the first expression frame and the second expression frame in the time difference;
generating an expression transition frame between the first expression frame and the second expression frame according to the position offset;
inserting the expression transition frame into a corresponding position between the first expression frame and the second expression frame;
and extracting a first action frame and a second action frame of the action, the first action frame preceding the second action frame;
determining a corresponding time difference between the first action frame and the second action frame;
determining the position offset of each pixel point between the first action frame and the second action frame in the time difference;
generating an action transition frame between the first action frame and the second action frame according to the position offset;
inserting the motion transition frame into a corresponding position between the first motion frame and the second motion frame.
In one embodiment, the processor 101 may be configured to invoke a driver for the avatar model stored in the memory 102 and perform the following operations:
determining a dwell time, duration, and naturalness of the action;
if the pause time is within a first preset range, the duration time is greater than a preset time threshold, the naturalness is greater than a preset naturalness threshold, the extraction mode is determined to be sequential frame extraction, and the steps of extracting a first action frame and a second action frame of the action are executed according to the sequential frame extraction;
or, if the pause time is within a second preset range, the duration time is less than or equal to the preset time threshold, the naturalness is less than or equal to the naturalness threshold, the extraction mode is determined to be reverse order frame extraction, and the step of extracting the first action frame and the second action frame of the action is executed according to the reverse order frame extraction.
Based on the hardware architecture of the virtual image driving device based on the artificial intelligence technology, the embodiment of the driving method of the virtual image model is provided.
Referring to fig. 2, in a first embodiment, the method comprises the steps of:
step S10, when input information is acquired, response information corresponding to the input information is determined;
in this embodiment, when the input information is acquired, the corresponding information of the avatar model corresponding to the input information is determined. The input information includes but is not limited to contents such as characters, voice, instructions and the like, for example, a live broadcast scene, when audiences in a live broadcast room enter the live broadcast room, a welcome instruction is triggered, a virtual live broadcast operator (namely, an avatar model) generates corresponding response information according to the welcome instruction, and executes related actions of the response information; for another example, for a classroom scene, when a student asks an AI teacher (i.e., an avatar model), the input information is speech, and the AI teacher generates corresponding response information according to the content in the speech; for another example, when the background operation and maintenance personnel operate and control the AI live broadcast personnel, the input information is characters, and the AI live broadcast personnel generate corresponding response information according to the characters. The response information includes, but is not limited to, an expression response, an action response, and a voice response, and in this embodiment, the expression response and the action response are emphasized.
Step S20, determining emotion labels and action labels associated with the response information;
In this embodiment, the response information needs to be mapped into text data. The mapping task for the response information is divided into an expression mapping task and an action mapping task; both tasks map the response information into a piece of text, analyze the characters in the text, and extract from them the emotion tag and the action tag that characterize the person.
Optionally, the emotion tag and the action tag are extracted by inputting the text into a pre-trained bidirectional encoding representation model (BERT, Bidirectional Encoder Representations from Transformers) based on a deep self-attention network.
It should be noted that, in this embodiment, the BERT model uses a Transformer network as its basic structure, and is pre-trained on large-scale unsupervised data with two pre-training tasks, namely masked language modeling and next sentence prediction, so as to obtain a pre-trained BERT model. The pre-trained BERT model can fully utilize the language prior knowledge learned during unsupervised pre-training and transfer it to the corresponding NLP tasks (namely, the expression mapping task and the action mapping task) during model fine-tuning.
A text vector fused with language prior knowledge is determined from the text through the pre-trained BERT model; that is, the text is input into the BERT model to obtain a text vector in which the language prior knowledge acquired during BERT pre-training is fused. The text vector is then input into a task-dependent target linear classifier attached to the BERT model. Several types of linear classifiers are provided, and they are used to classify the text vector. In this embodiment, since the content to be extracted is the part of the text that characterizes emotion and action, the target linear classifiers corresponding to the text vector are an emotion recognition linear classifier and an action intention recognition linear classifier: the emotion category in the text vector is determined by the emotion recognition linear classifier, and the action category is determined by the action intention recognition linear classifier. At least one emotion category and at least one action category are set in each linear classifier, and during classification the text vector may obtain scores in several emotion categories and several action categories; therefore, the emotion category and the action category with the highest probability are selected as the emotion label and the action label.
Illustratively, for a piece of text such as "welcome friends to come to my live room", automatic text matching yields the emotion categories and probabilities corresponding to the text as follows: happy, with a probability of 80%; surprised, with a probability of 40%; neutral, with a probability of 30%; angry, with a probability of 0%. The "happy" category with the highest probability is selected as the emotion label of the sentence. The action label is selected in the same way.
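As a minimal, non-authoritative sketch of the pipeline described above (not part of the patent text), the following Python code encodes a sentence with a pre-trained BERT model, takes the output vector at the [CLS] position, and feeds it through a linear classifier with softmax to pick the emotion category with the highest probability. The Hugging Face transformers stack, the model name, the category list, and the untrained classifier weights are all illustrative assumptions; in the patent's scheme the BERT model and the classifier would first be fine-tuned on labeled samples.

# Illustrative sketch only: model name, emotion list, and untrained classifier are assumptions.
import torch
from transformers import BertTokenizer, BertModel

EMOTIONS = ["happy", "sad", "surprised", "angry", "fearful", "neutral"]

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")                   # pre-trained BERT (language prior knowledge)
classifier = torch.nn.Linear(bert.config.hidden_size, len(EMOTIONS))    # would be fine-tuned in practice

def predict_emotion_label(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    cls_vector = outputs.last_hidden_state[:, 0]            # vector at the [CLS] (first) position
    probs = torch.softmax(classifier(cls_vector), dim=-1)   # probability of each emotion category
    return EMOTIONS[int(probs.argmax(dim=-1))]              # category with the highest probability

print(predict_emotion_label("welcome friends to come to my live room"))  # e.g. "happy" after fine-tuning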
Optionally, in the training of the target linear classifier, preset training samples are labeled, the BERT model is fine-tuned for the specific scene, and the parameters at the optimal solution of the loss function of the text mapping classification task corresponding to the linear classifier are used as the updated parameters of the BERT model and the linear classifier, so that the target linear classifier and the fine-tuned BERT model corresponding to that category of training samples are obtained. On this basis, the emotion recognition linear classifier and the action intention recognition linear classifier differ in the classification features input into the linear classifier and in the categories of the training samples.
Optionally, the emotion category in the text vector may be determined by taking the first position in the text vector as the first mark position, which corresponds to the [CLS] mark in the text vector; the vector at this position in the last layer of the BERT model can serve as a semantic representation of the whole text for the downstream classification task. The sub-text vector corresponding to the first mark position is taken as the expression feature, and the expression feature is used as the input vector of the linear classifier to determine the emotion category of the text vector. Similarly, the action category in the text vector is determined from a second mark position and the text length, where the second mark position also corresponds to the [CLS] mark at the first position in the text vector. The sub-text vector corresponding to the second mark position, spliced with the text length, is taken as the action feature, and the action feature is used as the input vector of the linear classifier to determine the action category of the text vector.
It should be noted that the purpose of adding the text length as an input to the linear classifier when determining the action category of the text vector is to improve the matching accuracy of the action intention recognition BERT model: since the durations of different actions differ, the text length feature provides useful information for matching actions. For example, in a live e-commerce scene, if background operation and maintenance staff input the text "Is everybody ready to make a purchase? 5, 4, 3, 2, 1", the text is longer, so the intention is more likely to be recognized as a "5, 4, 3, 2, 1" countdown action, whose duration and corresponding text length are both longer; if the input text is "Hello", the text is shorter, so the intention corresponding to "Hello" is more likely to be recognized as a greeting, whose duration and corresponding text length are both shorter.
Illustratively, for the BERT model of which the text mapping classification task is an emotion classification recognition task, refer to fig. 3, and fig. 3 is a schematic diagram of a training flow of the BERT model of the emotion classification recognition task. Firstly, input text data is coded by using a pre-trained BERT model to obtain a text vector fused with pre-trained prior knowledge. Then, the output vector of the first position (namely [ CLS ] mark corresponding position), namely the classification feature, is taken out, and is input into a linear classifier with softmax, so that emotion classification output is obtained. Taking a live broadcast e-commerce scene as an example, aiming at the scene, manually marking a training sample, finely tuning (fine-tuning) the model by using a small sample marked in the specific scene, updating all parameters of the BERT model and the classifier, and obtaining the emotion recognition BERT model by using the updated parameters as parameters corresponding to the optimal solution of the loss function of the emotion classification task. The emotion categories include happiness, sadness, surprise, anger, fear, neutrality and the like. Based on the finely adjusted emotion recognition BERT model, the probability of each emotion category can be obtained for any input text, and then the emotion label corresponding to the text can be predicted.
Illustratively, for the BERT model whose text mapping classification task is the action intention recognition task, refer to fig. 4, which is a schematic diagram of the training flow of the BERT model for the action intention recognition task. First, the input text is encoded with the pre-trained BERT model to obtain a text vector fused with pre-training prior knowledge. Then, the output vector at the first position (namely, the position corresponding to the [CLS] mark) is taken out, the text length feature is spliced onto it, and the result is input into a linear classifier with softmax to obtain the intention category output. Taking a live e-commerce scene as an example, intention categories are designed for the scene, training samples are manually labeled, the model is fine-tuned with the small labeled sample set for the specific scene, all parameters of the BERT model and the classifier are updated, and the parameters corresponding to the optimal solution of the loss function of the action intention recognition task are used to obtain the action intention recognition BERT model. Based on the fine-tuned BERT model, the probability of each intention category can be obtained for any input text, and the action label corresponding to the text can then be predicted.
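To make the text-length splicing concrete, the following sketch (an assumption-laden illustration, not the patent's implementation) concatenates a normalized text-length feature onto the [CLS] vector before the softmax classifier and performs one fine-tuning step in which both the BERT parameters and the classifier parameters are updated against a cross-entropy loss. The model name, the intention categories, the learning rate, and the toy labeled sample are invented for illustration.

# Illustrative sketch only: intention categories, normalization, and training sample are assumptions.
import torch
from transformers import BertTokenizer, BertModel

INTENTS = ["greeting", "like", "cheer", "clap", "heart", "countdown", "neutral", "other"]

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
classifier = torch.nn.Linear(bert.config.hidden_size + 1, len(INTENTS))  # +1 for the text-length feature

def intent_logits(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    cls_vec = bert(**inputs).last_hidden_state[:, 0]          # [CLS] output vector
    length = torch.tensor([[len(text) / 100.0]])              # normalized text-length feature
    return classifier(torch.cat([cls_vec, length], dim=-1))   # spliced feature -> intention logits

# One fine-tuning step on a labeled sample: all BERT and classifier parameters are updated
# toward the optimum of the cross-entropy loss for the action intention recognition task.
optimizer = torch.optim.AdamW(list(bert.parameters()) + list(classifier.parameters()), lr=2e-5)
text, label = "Is everybody ready to make a purchase? 5, 4, 3, 2, 1", INTENTS.index("countdown")
loss = torch.nn.functional.cross_entropy(intent_logits(text), torch.tensor([label]))
loss.backward()
optimizer.step()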
Step S30, driving the virtual image model to output the expression matched with the emotion label, and driving the virtual image model to output the action matched with the action label.
After the emotion label and the action label are determined, driving the virtual image model to output an expression matched with the emotion label, and driving the virtual image model to output an action matched with the action label.
Illustratively, taking a live e-commerce scenario as an example, common actions include: "greeting", "like", "cheering", "clapping", "heart gesture", "neutral", and so on. For text whose intention is identified as "other", the system randomly matches a natural action. In addition, after the system automatically matches an action, the user can modify it according to their own usage, for example, when the user wants the virtual human to make a specified action for certain sentences. Common expressions include "smiling", "surprised", "angry", "sad", and so on. In addition, the user can customize expressions according to their own usage or modify them on the original basis.
In the solution provided by this embodiment, when various kinds of input information are acquired, the response information corresponding to the input information is determined, the associated emotion tag and action tag are identified from the response information, and a virtual person with actions and expressions can be generated based on the emotion tag and the action tag, so that the avatar is driven in a way that more naturally approaches a real person.
Referring to fig. 5, in the second embodiment, based on the first embodiment, the step S30 includes:
step S31, matching the expression corresponding to the virtual image model according to the emotion label, and matching the action corresponding to the model according to the action label;
step S32, performing frame interpolation processing on the expression and the action;
and step S33, inputting the expression after the frame interpolation and the action after the frame interpolation into the model so as to drive the virtual image model.
Optionally, in this embodiment, in order to achieve a better driving effect, the obtained emotion label is matched with its corresponding expression, and the obtained action label is matched with its corresponding action. The expressions and actions serving as matching targets are pre-recorded as required as template data for the actions and expressions of the avatar model. After the template data are obtained, a frame interpolation algorithm is applied to process them. Frame interpolation for actions and for expressions follows the same principle, so the relatively complex action interpolation is taken as the example for explanation.
Action frames need to be selected first. In the selection process, each action comprises a start frame, an intermediate frame, and an end frame, and the start frame and the natural-state frame (a frame without obvious action or expression changes, as shown in fig. 6) are transitioned through the frame interpolation algorithm. For example, for the "greeting" action, there are many frames in the middle of the left-right waving motion, and appropriate frames need to be selected; here, all of these frames are selected so that the synthesized action is more natural. The end frame refers to the frame at which the action returns to the natural state after the action is finished. The start frame and the end frame of the action are transitioned with the natural-state frame through the frame interpolation algorithm, so that the switching between actions is smoother.
Optionally, the frame interpolation mode includes a transition between a natural state frame and an action start frame, the natural state frame is used as a first action frame, the action start frame is used as a second action frame, a position offset of each pixel point between the first action frame and the second action frame is calculated, then, according to the position offset, an action transition frame between the first action frame and the second action frame is generated, and the action transition frame is inserted between action tracks of the first action frame and the second action frame, so that frame interpolation is achieved.
Optionally, the frame interpolation mode includes transition between an end frame of the motion and a natural state frame, the end frame of the motion is used as a first motion frame, the natural state frame is used as a second motion frame, a position offset of each pixel point between the first motion frame and the second motion frame is calculated, then the motion transition frame between the first motion frame and the second motion frame is generated according to the position offset, and the motion transition frame is inserted between motion tracks of the first motion frame and the second motion frame, so that frame interpolation is achieved.
Optionally, because each action contains a large number of candidate frames, and the frame-taking mode and frame-taking order differ between actions, in order to improve frame-taking efficiency this embodiment provides two different frame-taking modes, namely cyclic frame-taking and non-cyclic frame-taking, and the frame-taking mode can be determined according to the category of the action. Non-cyclic frame-taking means that for an action that cannot be cycled, such as a "5, 4, 3, 2, 1" countdown, the frames of the whole action need to be taken in sequence. Cyclic frame-taking means that if an action is nearly symmetrical, such as raising the hand and retracting the hand, only the hand-raising part is taken, and the hand-retracting part is obtained by playing those frames back in reverse. For example, for the "point left" action, only the intermediate frame needs to be determined; frames are taken only for the first half of the action ("extending the hand"), and the "retracting the hand" part completes the whole action by cyclic reversal. On the premise of ensuring motion naturalness, cyclic frame-taking follows the principle that the number of frames finally contained in each action should be as small as possible, and the start frame and the end frame still transition with the natural-state frame through the frame interpolation algorithm. Cyclic frame-taking is divided into sequential frame-taking and reverse-order frame-taking according to the frame-taking order. Specifically, according to the course of the action from the start frame to the intermediate frame, the pause time, duration, and naturalness of the action are determined; if the pause time is within a first preset range, the duration is greater than a preset time threshold, and the naturalness is greater than a preset naturalness threshold, the extraction mode is determined to be sequential frame-taking; if the pause time is within a second preset range, the duration is less than or equal to the preset time threshold, and the naturalness is less than or equal to the naturalness threshold, the extraction mode is determined to be reverse-order frame-taking. A minimal sketch of this decision rule is given below.
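The sketch below only illustrates the decision rule stated above; the preset ranges and thresholds are placeholders, since the patent does not specify their concrete values, and the fallback branch is an assumption.

# Illustrative decision rule for the frame-taking mode. Ranges and thresholds are assumed placeholders.
FIRST_RANGE = (0.5, 2.0)        # assumed first preset range for the pause time (seconds)
SECOND_RANGE = (0.0, 0.5)       # assumed second preset range for the pause time (seconds)
TIME_THRESHOLD = 1.0            # assumed preset duration threshold (seconds)
NATURALNESS_THRESHOLD = 0.8     # assumed preset naturalness threshold (0-1 score)

def choose_frame_taking_mode(pause_time: float, duration: float, naturalness: float) -> str:
    if (FIRST_RANGE[0] <= pause_time <= FIRST_RANGE[1]
            and duration > TIME_THRESHOLD
            and naturalness > NATURALNESS_THRESHOLD):
        return "sequential"      # take the frames of the whole action in order
    if (SECOND_RANGE[0] <= pause_time <= SECOND_RANGE[1]
            and duration <= TIME_THRESHOLD
            and naturalness <= NATURALNESS_THRESHOLD):
        return "reverse"         # take the first half and play it back in reverse
    return "sequential"          # fallback when neither condition applies (assumption)

print(choose_frame_taking_mode(pause_time=1.0, duration=1.5, naturalness=0.9))  # -> sequential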
In some specific embodiments, the position offset of each pixel point between the first action frame and the second action frame (i.e., the displacement of each point after the time t has elapsed) is first calculated, and then the natural-state frame and the offsets are processed through the remapping function cv2.remap() in OpenCV to directly generate the intermediate transition frame, where the interpolation is implemented with the linear interpolation method (cv2.INTER_LINEAR). The function call is constructed as follows:
dst = cv2.remap(img, mapx1, mapy1, cv2.INTER_LINEAR)
where dst represents the calculated transition frame, img represents the natural-state frame, and mapx1 and mapy1 represent the offsets of the coordinates in the x and y directions.
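Building on the cv2.remap() call above, the following sketch shows one way the intermediate transition frames could be generated: cv2.remap() expects absolute sampling coordinates, so the per-pixel x/y offsets are scaled by the elapsed fraction of the time difference and added to a base coordinate grid. The offset fields here are dummy placeholders (in practice they might be estimated, e.g., by optical flow between the two frames); this is an illustration under those assumptions, not the patent's exact implementation.

# Illustrative transition-frame generation; the offset fields and frame sizes are placeholders.
import cv2
import numpy as np

def make_transition_frames(first_frame, dx, dy, num_frames):
    """Generate transition frames by linearly scaling the per-pixel x/y offsets over time."""
    h, w = first_frame.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    frames = []
    for i in range(1, num_frames + 1):
        t = i / (num_frames + 1)                  # elapsed fraction of the time difference
        mapx1 = grid_x + t * dx                   # sampling map in the x direction
        mapy1 = grid_y + t * dy                   # sampling map in the y direction
        frames.append(cv2.remap(first_frame, mapx1, mapy1, cv2.INTER_LINEAR))
    return frames

# Dummy natural-state frame and a uniform 5-pixel offset along x (illustrative only).
img = np.zeros((256, 256, 3), dtype=np.uint8)
dx = np.full((256, 256), 5.0, dtype=np.float32)
dy = np.zeros((256, 256), dtype=np.float32)
transition_frames = make_transition_frames(img, dx, dy, num_frames=4)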
In the technical scheme provided by this embodiment, by matching the corresponding expressions or actions of the corresponding labels, frame interpolation processing is performed between the expressions or actions, so as to achieve a better model driving effect.
Furthermore, it can be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by instructing relevant hardware by a computer program. The computer program includes program instructions, and the computer program may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the avatar driving device to implement the flow steps of the embodiments of the methods described above.
Accordingly, the present invention also provides a computer-readable storage medium storing a driver of an avatar model, which when executed by a processor implements the steps of the driving method of an avatar model as described in the above embodiments.
The computer-readable storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, which can store program codes.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a computer-readable storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, and includes several instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. A method of driving an avatar model, the method comprising:
when input information is acquired, determining response information corresponding to the input information;
determining an emotion label and an action label associated with the response information;
and driving the avatar model to output the expression matched with the emotion label, and driving the avatar model to output the action matched with the action label.
2. The avatar model driving method of claim 1, wherein the step of determining an emotion label and an action label associated with the response information comprises:
determining a text vector fused with language prior knowledge in the response information based on a pre-training bidirectional coding representation model;
inputting the text vector into a target linear classifier, and determining an emotion category and an action category in the text vector;
determining the probability of the text vector in each corresponding emotion category and the probability of the text vector in each corresponding action category;
selecting the emotion category with the maximum probability in the emotion categories as the emotion label, and selecting the action category with the maximum probability in the action categories as the action label.
3. The method of driving an avatar model according to claim 2, wherein said target linear classifier includes an emotion recognition linear classifier and an action intention recognition linear classifier, and said step of inputting said text vector into the target linear classifier, determining an emotion class and an action class in said text vector, is preceded by the steps of:
acquiring preset training samples, wherein the preset training samples comprise emotion recognition training samples and action intention recognition training samples;
updating corresponding parameters in the pre-training bidirectional coding representation model and the initial linear classifier based on the emotion recognition training sample to obtain the emotion recognition linear classifier and the fine-tuned pre-training bidirectional coding representation model, wherein the parameters are the optimal solution of the loss function corresponding to the emotion recognition training sample;
and updating corresponding parameters in the pre-training bidirectional coding representation model and the initial linear classifier based on the action intention recognition training sample to obtain the action intention recognition linear classifier and the fine-tuned pre-training bidirectional coding representation model, wherein the parameters are the optimal solution of the loss function corresponding to the action intention recognition training sample.
4. The avatar model driving method of claim 3, wherein said determining an emotion category and an action category in said text vector comprises:
determining a first mark position in the text vector;
determining the expression characteristics of the text vector according to the sub-text vector corresponding to the first mark position;
inputting the expression features into the emotion recognition linear classifier, and determining the emotion category corresponding to the text vector;
and determining a second marker position and a text length in the text vector;
splicing the text length to the sub-text vector corresponding to the second mark position;
determining the action characteristics of the text vector according to the spliced sub-text vectors;
inputting the action features into the action intention recognition linear classifier, and determining the action category of the text vector.
5. The method of driving an avatar model according to claim 2, wherein said step of determining a text vector incorporating language prior knowledge in said text information based on a pre-trained bi-directional coded representation model is preceded by the steps of:
acquiring a deep self-attention network;
inputting large-scale unsupervised data to the deep self-attention network, and training the deep self-attention network through a masking language model and next sentence prediction to generate the pre-trained bidirectional coding representation model.
6. The avatar model driving method of claim 1, wherein said driving said avatar model to output said emotion label matched expression and driving said avatar model to output said action label matched action comprises:
matching the expression corresponding to the virtual image model according to the emotion label, and matching the action corresponding to the model according to the action label;
performing frame interpolation processing on the expression and the action;
and inputting the expression after the frame interpolation processing and the action after the frame interpolation processing into the model so as to drive the virtual image model.
7. The avatar model driving method of claim 6, wherein said step of performing frame interpolation processing on said expression and said action comprises:
extracting a first expression frame and a second expression frame of the expression, wherein the first expression frame is positioned in front of the second expression frame;
determining a corresponding time difference between the first expression frame and the second expression frame;
determining the position offset of each pixel point between the first expression frame and the second expression frame in the time difference;
generating an expression transition frame between the first expression frame and the second expression frame according to the position offset;
inserting the expression transition frame into a corresponding position between the first expression frame and the second expression frame;
extracting a first action frame and a second action frame of the action, wherein the first action frame is positioned before the second action frame;
determining a corresponding time difference between the first motion frame and the second motion frame;
determining the position offset of each pixel point between the first action frame and the second action frame in the time difference;
generating an action transition frame between the first action frame and the second action frame according to the position offset;
inserting the motion transition frame into a corresponding position between the first motion frame and the second motion frame.
8. The avatar model driving method of claim 7, wherein said step of extracting a first motion frame and a second motion frame of said motion is preceded by:
determining a dwell time, duration, and naturalness of the action;
if the pause time is within a first preset range, the duration time is greater than a preset time threshold, the naturalness is greater than a preset naturalness threshold, the extraction mode is determined to be sequential frame extraction, and the steps of extracting a first action frame and a second action frame of the action are executed according to the sequential frame extraction;
or, if the pause time is within a second preset range, the duration time is less than or equal to the preset time threshold, the naturalness is less than or equal to the naturalness threshold, the extraction mode is determined to be reverse order frame fetching, and the step of extracting the first action frame and the second action frame of the action is executed according to the reverse order frame fetching.
9. An avatar driving apparatus, comprising: a memory, a processor, and an avatar model driver stored on the memory and executable on the processor, the avatar model driver, when executed by the processor, implementing the steps of the avatar model driving method of any one of claims 1-8.
10. A computer-readable storage medium, characterized in that a driver of an avatar model is stored on the computer-readable storage medium, and the driver of the avatar model, when executed by a processor, implements the steps of the driving method of an avatar model according to any one of claims 1 to 8.
CN202210972793.8A 2022-08-15 2022-08-15 Method for driving avatar model, avatar driving apparatus, and storage medium Pending CN115439582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210972793.8A CN115439582A (en) 2022-08-15 2022-08-15 Method for driving avatar model, avatar driving apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210972793.8A CN115439582A (en) 2022-08-15 2022-08-15 Method for driving avatar model, avatar driving apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN115439582A (en) 2022-12-06

Family

ID=84243017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210972793.8A Pending CN115439582A (en) 2022-08-15 2022-08-15 Method for driving avatar model, avatar driving apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN115439582A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115665507A (en) * 2022-12-26 2023-01-31 海马云(天津)信息技术有限公司 Method, apparatus, medium, and device for generating video stream data including avatar
CN115665507B (en) * 2022-12-26 2023-03-21 海马云(天津)信息技术有限公司 Method, apparatus, medium, and device for generating video stream data including avatar

Similar Documents

Publication Publication Date Title
WO2022048403A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
US6526395B1 (en) Application of personality models and interaction with synthetic characters in a computing system
CN107423398B (en) Interaction method, interaction device, storage medium and computer equipment
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN111541908A (en) Interaction method, device, equipment and storage medium
CN110275987B (en) Intelligent teaching consultant generation method, system, equipment and storage medium
US20220270587A1 (en) Speech synthesis method and apparatus, and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN108470188B (en) Interaction method based on image analysis and electronic equipment
CN112329451B (en) Sign language action video generation method, device, equipment and storage medium
CN112765333B (en) Automatic dialogue generation method and system based on emotion and prompt word combination
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN112530218A (en) Many-to-one accompanying intelligent teaching system and teaching method
CN115423908A (en) Virtual face generation method, device, equipment and readable storage medium
CN115439582A (en) Method for driving avatar model, avatar driving apparatus, and storage medium
CN115187704A (en) Virtual anchor generation method, device, equipment and storage medium
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
CN113903067A (en) Virtual object video generation method, device, equipment and medium
CN113314104B (en) Interactive object driving and phoneme processing method, device, equipment and storage medium
CN113779224A (en) Personalized dialogue generation method and system based on user dialogue history
Egges et al. Emotional communication with virtual humans
CN114818609B (en) Interaction method for virtual object, electronic device and computer storage medium
CN116310003A (en) Semantic-driven martial arts action synthesis method
CN115438210A (en) Text image generation method, text image generation device, terminal and computer readable storage medium
CN111310847B (en) Method and device for training element classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination