WO2023226913A1 - Virtual character driving method, apparatus and device based on expression recognition (基于表情识别的虚拟人物驱动方法、装置及设备) - Google Patents

Virtual character driving method, apparatus and device based on expression recognition (基于表情识别的虚拟人物驱动方法、装置及设备)

Info

Publication number
WO2023226913A1
WO2023226913A1 · PCT/CN2023/095446
Authority
WO
WIPO (PCT)
Prior art keywords
expression
user
virtual character
category
target
Prior art date
Application number
PCT/CN2023/095446
Other languages
English (en)
French (fr)
Inventor
马远凯
朱鹏程
张昆才
冷海涛
罗智凌
周伟
钱景
李禹�
王郁菲
Original Assignee
阿里巴巴(中国)有限公司
Priority date
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司 (Alibaba (China) Co., Ltd.)
Publication of WO2023226913A1

Classifications

    • G06V 40/174: Facial expression recognition
    • G06V 40/175: Static expression
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/172: Human faces; classification, e.g. identification

Definitions

  • This application relates to the fields of artificial intelligence, deep learning, machine learning and virtual reality within computer technology, and in particular to a virtual character driving method, apparatus and device based on expression recognition.
  • The traditional interaction between virtual characters and people mainly uses voice as the carrier: the interaction stays at the voice level, and the virtual character cannot understand visual information such as human expressions, so it cannot give corresponding feedback based on them.
  • For example, during the virtual character's broadcast, if the content currently being reported is not the information the person being interacted with wants, that person will show an impatient or even angry expression. A real person would proactively ask a question so that the conversation continues smoothly and effectively, but the virtual character does not have this ability. Likewise, when the user does not speak during the broadcast but clearly shows an intention to interrupt through their expression, the virtual character cannot perform the corresponding interrupting behavior.
  • As a result, the degree of anthropomorphism of virtual characters is low, and the interaction process is neither smooth nor intelligent.
  • This application provides a virtual character driving method, device and equipment based on expression recognition to solve the problem that traditional virtual characters have low anthropomorphism, resulting in unsmooth and unintelligent communication processes.
  • In one aspect, this application provides a virtual character driving method based on expression recognition, including: obtaining a three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to users; obtaining the user's face image in real time during a round of dialogue between the virtual character and the user; inputting the user's face image into a base model and a multi-modal alignment model for facial expression recognition respectively, determining the first expression classification result through the base model and the second expression classification result through the multi-modal alignment model, and determining the target category of the user's current expression according to the two results; if the target category belongs to the preset expression categories and the response triggering condition of the target category is currently met, determining the corresponding driving data according to the response strategy corresponding to the target category; and driving the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • this application provides a virtual character driving device based on expression recognition, including:
  • the rendering model acquisition module is used to obtain the three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to users;
  • a real-time data acquisition module used to acquire the user's face image in real time during a conversation between the virtual character and the user;
  • a real-time expression recognition module, configured to input the user's facial image into a base model and a multi-modal alignment model for facial expression recognition respectively, determine the first expression classification result through the base model and the second expression classification result through the multi-modal alignment model, and determine the target category of the user's current expression according to the first expression classification result and the second expression classification result;
  • a decision-making driving module, configured to determine the corresponding driving data according to the response strategy corresponding to the target category if it is determined that the target category belongs to the preset expression categories and the response triggering condition of the target category is currently met, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • the present application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
  • the memory stores computer-executable instructions;
  • the processor executes the computer-executable instructions stored in the memory to implement the above-mentioned method.
  • the present application provides a computer-readable storage medium in which computer-executable instructions are stored, and when executed by a processor, the computer-executable instructions are used to implement the above-mentioned method.
  • The virtual character driving method, apparatus and device based on expression recognition provided by this application obtain the user's face image in real time during a round of dialogue between the virtual character and the user, input the face image into a base model and a multi-modal alignment model for facial expression recognition respectively, determine the first expression classification result through the base model and the second expression classification result through the multi-modal alignment model, and determine the target category of the user's current expression from the two results, thereby accurately identifying the category of the user's facial expression in real time. When it is determined that the target category belongs to the preset expression categories and the response triggering condition of the target category is currently met, the corresponding driving data is determined according to the response strategy corresponding to that category, and the virtual character is driven to perform the corresponding response behavior according to the driving data and the virtual character's three-dimensional image rendering model, so that the virtual character in the output video stream makes the corresponding response. This adds the ability to recognize the user's expression in real time and to drive the virtual character to respond promptly to the user's facial expressions, improving the degree of anthropomorphism of virtual characters and making the interaction between virtual characters and people smoother and more intelligent.
  • Figure 1 is a system framework diagram of the virtual character driving method based on expression recognition provided by this application;
  • Figure 2 is a flow chart of a virtual character driving method based on expression recognition provided by an embodiment of the present application
  • Figure 3 is a framework diagram of an expression recognition method provided by an exemplary embodiment of the present application.
  • Figure 4 is a flow chart of a virtual character driving method based on expression recognition provided by another embodiment of the present application.
  • Figure 5 is a flow chart of a virtual character driving method based on expression recognition provided by another embodiment of the present application.
  • Figure 6 is a flow chart of a virtual character driving method provided by another embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a virtual character driving device based on expression recognition provided by an exemplary embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
  • Multi-modal interaction: users can communicate with virtual characters through text, voice, expressions, etc.; the virtual characters can understand the user's text, voice, expressions and other information, and can in turn communicate with users through text, voice, expressions, etc.
  • Duplex interaction: a real-time, two-way interaction method in which the user can interrupt the virtual character at any time, and the virtual character can also interrupt its own speech when necessary.
  • Static expression recognition: separating a person's specific expression state from a given static image and judging the type of expression.
  • the virtual character driving method based on expression recognition involves artificial intelligence, deep learning, machine learning, virtual reality and other fields in computer technology, and can be specifically applied to scenarios where virtual characters interact with humans.
  • To this end, this application provides a virtual character driving method based on expression recognition, which obtains the user's face image in real time during a conversation between the virtual character and the user, inputs the user's face image into a base model and a multi-modal alignment model for facial expression recognition respectively, determines the first expression classification result through the base model and the second expression classification result through the multi-modal alignment model, and determines the target category of the user's current expression from the two results, thereby accurately identifying the user's facial expression in real time. When it is determined that the target category belongs to the preset expression categories and the response triggering condition of the target category is currently met, the corresponding driving data is determined according to the response strategy corresponding to the target category, and the virtual character is driven to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character, so that the virtual character responds to the user's expression in a timely manner. This improves the degree of anthropomorphism of virtual characters and makes the communication process between virtual characters and people smoother and more intelligent.
  • Figure 1 is a system framework diagram of the virtual character driving method based on expression recognition provided by this application.
  • The system framework includes the following four modules: a user face image acquisition module, a real-time expression recognition module, a duplex decision-making module, and a virtual character driving module.
  • The user face image acquisition module is used to monitor the video stream on the user side in real time during the interaction between the virtual character and the user, obtain video frames from the user side, and perform face detection on the video frames through a face detection algorithm to obtain the user's face image.
  • The real-time expression recognition module is used to perform expression recognition on the user's face image using the trained base model and multi-modal alignment model for facial expression recognition, and to identify the category and confidence of the user's current expression in the face image.
  • The duplex decision-making module is used to pre-configure the preset expression categories that need to be responded to and the response strategy corresponding to each preset expression category, and, based on the target category and confidence of the user's current expression as well as the current conversation context information, to decide whether the virtual character responds to the user's current expression and what response behavior it makes.
  • Response strategies include making expressions, broadcasting words, performing actions, and so on.
  • The virtual character driving module is used to determine the corresponding driving data according to the response strategy corresponding to the target category, and to drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character, so that the virtual character responds to the user's current expression in a timely manner. This improves the degree of anthropomorphism of virtual characters and makes the communication process between virtual characters and people smoother and more intelligent.
  • The response behavior performed by the avatar in response to the user's expression may include duplex interaction strategies such as interrupting the current broadcast, emotional-care behavior, taking over to assist the dialogue process, and so on.
  • Figure 2 is a flow chart of a virtual character driving method based on expression recognition provided by an embodiment of the present application.
  • the virtual character driving method based on expression recognition provided in this embodiment can be specifically applied to electronic devices that have the function of using virtual characters to interact with humans.
  • the electronic device can be a conversation robot, a terminal or a server, etc.
  • The electronic device can also be implemented using other devices, which is not specifically limited in this embodiment.
  • Step S201 Obtain a three-dimensional image rendering model of the virtual character to provide interactive services to users using the virtual character.
  • the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character.
  • the three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
  • the method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
  • Step S202 During a round of dialogue between the virtual character and the user, obtain the user's face image in real time.
  • the virtual characters can have multiple rounds of dialogue with people.
  • the video stream from the user can be monitored in real time, video frames are sampled at a preset frequency, face detection is performed on the video frames, the face part in the video frame is obtained, and the user's face image is obtained.
  • performing face detection on video frames to obtain the user's face image can be implemented using commonly used face detection algorithms, which are not specifically limited here.
  • The interaction scene between a virtual character and a person is usually a one-to-one conversation, that is, one user interacting with one virtual character. If there are multiple faces in the video frame, the face image of each face in the video frame can be detected, and one of the face images is selected as the face image of the current user based on the area of each face image and the distance between its center point and the center point of the video frame.
  • For example, the face image with the largest area that is closest to the center of the video frame can be used as the face image of the current user: the face image with the largest area is selected first; if several face images share the largest area, the one whose center point is closest to the center point of the video frame is used as the face image of the current user, as sketched below.
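The selection rule just described can be written compactly. Below is a minimal Python sketch, assuming face detections arrive as (x, y, w, h) bounding boxes from whatever face detection algorithm is used; the function name and box format are illustrative, not part of the patent.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) face bounding box in pixels

def select_current_user_face(boxes: List[Box], frame_w: int, frame_h: int) -> Box:
    """Pick one face as the current user's: largest area first,
    ties broken by distance of the face centre to the frame centre.
    Assumes at least one detection is present."""
    cx, cy = frame_w / 2, frame_h / 2

    def area(b: Box) -> int:
        return b[2] * b[3]

    def centre_dist(b: Box) -> float:
        bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
        return ((bx - cx) ** 2 + (by - cy) ** 2) ** 0.5

    # sort by (-area, distance): largest area wins, closest to centre breaks ties
    return sorted(boxes, key=lambda b: (-area(b), centre_dist(b)))[0]
```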
  • expression recognition processing is performed on the user's face image in real time through steps S203-S204, and the expression classification of the user's current expression in the face image is determined.
  • Step S203: Input the user's facial image into a base model and a multi-modal alignment model for facial expression recognition respectively, determine the first expression classification result through the base model, and determine the second expression classification result through the multi-modal alignment model.
  • a method of combining the base model of facial expression recognition and the multi-modal alignment model is used for expression recognition.
  • The user's face image is input into a trained base model for facial expression recognition, and the base model performs expression recognition on the user's face image to obtain the first expression classification result; the user's face image is also input into the trained multi-modal alignment model, which performs expression recognition on the user's face image to obtain the second expression classification result.
  • the base model used for facial expression recognition is a model trained based on the expression classification task.
  • Expression recognition models that perform well on public data sets in the field of facial expression recognition can be used, such as the Deep Alignment Network (DAN), MobileNet, ResNet, etc.
  • This solution introduces a multi-modal alignment model that has been pre-trained on large public image-text data sets, and fine-tunes the pre-trained multi-modal alignment model on a small amount of data annotated with expression categories, thereby obtaining the trained multi-modal alignment model used for expression recognition.
  • the multi-modal alignment model is used for expression recognition to determine expression classification.
  • The multi-modal alignment model can use any one of CoOp (Context Optimization), CLIP (Contrastive Language-Image Pre-training), Prompt Ensembling, PET (Pattern-Exploiting Training) and other models, which are not specifically limited here.
  • Step S204: Determine the target category of the user's current expression based on the first expression classification result and the second expression classification result.
  • Step S205 Determine whether the target category belongs to the preset expression category.
  • preset expression categories can be configured, and response strategies corresponding to each preset expression category can be configured.
  • the default expression category is a configured expression category that requires the virtual character to perform corresponding behaviors.
  • the preset expression categories may include one or more expression categories.
  • the response strategy corresponding to the preset expression category may include one or more response strategies.
  • the response strategy includes a specific implementation method for the virtual character to respond to the expression corresponding to the preset expression category.
  • The type of response strategy corresponding to each preset expression category and the specific implementation of each response strategy are configured independently, and the response strategies corresponding to different preset expression categories can be different.
  • the preset expression categories may include at least one of the following: neutral, sad, angry, happy, scared, disgusted, and surprised.
  • For example, the response strategies for the three preset expression categories "sad", "scared" and "disgusted" may include only interruption strategies; the response strategy for the preset expression category "happy" may include only the acceptance strategy; and the response strategies for the two preset expression categories "angry" and "surprised" may include both interruption strategies and acceptance strategies.
  • preset expression categories as well as the response strategy for each preset expression category, can be set differently according to specific application scenarios, and are not specifically limited here.
  • For example, more preset expression categories can be set; acceptance strategies can also be configured for preset expression categories such as "sad", "scared" and "disgusted"; and it is equally possible not to configure interruption strategies for "angry" and "surprised", and so on. A minimal configuration sketch is given below.
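The category-to-strategy configuration described above can be kept as a simple table. The Python sketch below is illustrative only; the dictionary name, the strategy labels and the helper function are assumptions rather than the patent's actual data structures.

```python
# Minimal configuration sketch (hypothetical names): which expression categories
# the virtual character reacts to, and with which types of response strategy.
PRESET_EXPRESSION_STRATEGIES = {
    #  category   ->  allowed strategy types for that category
    "sad":       {"interrupt"},
    "scared":    {"interrupt"},
    "disgusted": {"interrupt"},
    "happy":     {"accept"},
    "angry":     {"interrupt", "accept"},
    "surprised": {"interrupt", "accept"},
}

def needs_response(target_category: str) -> bool:
    """Step S205: is the recognised category one the character should react to?"""
    return target_category in PRESET_EXPRESSION_STRATEGIES
```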
  • In this step, it is determined whether the target category of the user's current expression belongs to the preset expression categories.
  • If it does, the avatar may need to respond to the user's current expression, and subsequent step S206 continues to be performed.
  • If it does not, the avatar does not need to respond to the user's current expression, and the avatar continues its current processing.
  • Step S206 Determine whether the response triggering condition of the target classification is currently met.
  • Each preset expression category also has a corresponding response triggering condition. Only when the corresponding response triggering condition is met can the virtual character be driven to perform the corresponding response behavior based on the response strategy corresponding to the preset expression category.
  • the response triggering conditions of the preset expression classification can be configured and adjusted according to specific application scenarios, and are not specifically limited here.
  • If the response triggering condition of the target category is currently met, steps S207-S208 are executed to drive the virtual character to perform the corresponding response behavior according to the response strategy corresponding to the target category.
  • If it is not met, the avatar does not need to respond to the user's current expression, and the avatar continues its current processing.
  • Step S207 Determine corresponding driving data according to the response strategy corresponding to the target classification.
  • the response strategy includes specific implementation methods for virtual characters to respond to expressions corresponding to preset expression categories.
  • the response strategies corresponding to the preset expression categories may include: expressions, words, actions, etc. made by the virtual characters.
  • the corresponding driving data is determined according to the response strategy corresponding to the target category.
  • the driving data includes all driving parameters required to drive the virtual character to execute the response strategy corresponding to the target classification.
  • If the response strategy corresponding to the target category includes the virtual character making a prescribed expression, the driving data includes expression driving parameters; if it includes the virtual character performing prescribed actions, the driving data includes action driving parameters; if it includes the virtual character broadcasting prescribed words, the driving data includes voice driving parameters; and if the response strategy includes several response methods among expressions, words and actions, the driving data includes the corresponding multiple driving parameters, which can drive the virtual character to perform the response behaviors corresponding to the response strategy. A sketch of this mapping is given below.
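To illustrate how a response strategy maps to driving data, here is a minimal Python sketch; the ResponseStrategy and DrivingData containers, their fields and the example values are hypothetical, standing in for whatever parameter format the real driving module expects.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ResponseStrategy:
    """Hypothetical container for one category's configured response."""
    expression: Optional[str] = None   # e.g. "concerned"
    words: Optional[str] = None        # e.g. a short spoken line
    action: Optional[str] = None       # e.g. "nod"

@dataclass
class DrivingData:
    """Driving parameters passed on to the character-driving module."""
    expression_params: dict = field(default_factory=dict)
    action_params: list = field(default_factory=list)
    voice_params: dict = field(default_factory=dict)

def build_driving_data(strategy: ResponseStrategy) -> DrivingData:
    """Step S207 sketch: collect only the driving parameters the strategy actually needs."""
    data = DrivingData()
    if strategy.expression:   # prescribed expression -> expression driving parameters
        data.expression_params = {"expression_preset": strategy.expression, "intensity": 0.8}
    if strategy.action:       # prescribed action -> action driving parameters
        data.action_params = [strategy.action]
    if strategy.words:        # prescribed words -> voice driving parameters (TTS input)
        data.voice_params = {"text": strategy.words, "tone": "neutral"}
    return data
```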
  • Step S208 Drive the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character.
  • Specifically, the skeleton model of the virtual character is driven according to the driving data to obtain the skeleton data corresponding to the response behavior, and the skeleton data is rendered according to the three-dimensional image rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior. By rendering the virtual character image data into the output video stream, the virtual character in the output video stream makes the corresponding response behavior, thereby realizing a multi-modal duplex interaction function in which the virtual character responds promptly to the user's facial expressions.
  • In the method of this embodiment, the user's face image is acquired in real time during a conversation between the virtual character and the user, and the user's face image is input into the base model and the multi-modal alignment model for facial expression recognition respectively; the first expression classification result is determined through the base model and the second expression classification result through the multi-modal alignment model, and the target category of the user's current expression is determined based on the two results, thereby accurately identifying the category of the user's facial expression in real time. Based on that category, the corresponding driving data is determined according to the response strategy corresponding to the expression category of the user's expression, and the virtual character is driven to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character, so that the virtual character in the output video stream performs the corresponding response behavior. This adds the ability to recognize the user's expression in real time and to respond promptly to the user's facial expressions, improving the degree of anthropomorphism of virtual characters and making the interaction between virtual characters and people smoother and more intelligent.
  • a method of combining a base model for facial expression recognition and a multi-modal alignment model is used for expression recognition.
  • In step S203, the user's facial image is input into the trained base model for facial expression recognition, and the base model performs expression recognition on the user's facial image to obtain the first expression classification result; the user's facial image is also input into the trained multi-modal alignment model, which performs expression recognition on the user's facial image to obtain the second expression classification result.
  • The first expression classification result includes one set of confidence levels covering all expression categories, that is, the first confidence that the user's current expression belongs to each expression category; the greater the first confidence, the more likely it is that the user's current expression belongs to that category.
  • The second expression classification result includes another set of confidence levels covering all expression categories, that is, the second confidence that the user's current expression belongs to each expression category; the greater the second confidence, the more likely it is that the user's current expression belongs to that category.
  • the target category of the user's current expression and the confidence level that the user's current expression belongs to the target category are determined based on the first confidence level and the second confidence level that the user's current expression belongs to each expression category.
  • For example, the expression category corresponding to the largest confidence in the two sets of confidences can be used as the target category of the user's current expression.
  • Alternatively, the first confidence and the second confidence corresponding to the same expression category in the two sets can be averaged to obtain a fused (third) expression classification result, from which the target category and its confidence are determined, as sketched below.
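Both combination rules described above fit in a few lines. The Python sketch below assumes each model returns one confidence per category as a dictionary; the category names and numbers in the usage example are made up.

```python
from typing import Dict, Tuple

def fuse_expression_results(first: Dict[str, float],
                            second: Dict[str, float],
                            average: bool = True) -> Tuple[str, float]:
    """Step S204 sketch: combine the base-model and multi-modal-alignment confidences,
    either by per-category averaging (the fused "third" result) or by taking the
    single largest confidence across both sets."""
    if average:
        fused = {c: (first[c] + second[c]) / 2 for c in first}
        target = max(fused, key=fused.get)
        return target, fused[target]
    # alternative: the category holding the overall largest confidence wins
    best_first = max(first, key=first.get)
    best_second = max(second, key=second.get)
    if first[best_first] >= second[best_second]:
        return best_first, first[best_first]
    return best_second, second[best_second]

# usage sketch (hypothetical confidences)
first = {"happy": 0.7, "angry": 0.2, "neutral": 0.1}
second = {"happy": 0.6, "angry": 0.3, "neutral": 0.1}
print(fuse_expression_results(first, second))   # ('happy', 0.65) up to float rounding
```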
  • Figure 3 is a framework diagram of an expression recognition method provided by an exemplary embodiment of the present application.
  • In the example of Figure 3, the base model adopts the DAN model for facial expression recognition, and the multi-modal alignment model adopts the CoOp model.
  • The user's face image obtained in real time is input into the DAN model and the CoOp model respectively. In the DAN branch, the face image is encoded by the image encoder, features are extracted by multiple attention modules, the extracted features are fused by an attention fusion module, and classification is performed on the fusion result to obtain the first expression classification result. In the CoOp branch, the face image is encoded by the image encoder to obtain image features, and the text information built into the model is encoded by the text encoder to obtain text features; similarity calculation and classification are then performed on the multi-modal features (text features and image features) to obtain the second expression classification result. The two results are combined to determine the final category of the user's expression. A simplified sketch of the CoOp branch is given below.
  • the pre-trained CoOp model refers to the CoOp model pre-trained on the public data set.
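To make the image-text-similarity branch concrete, the sketch below classifies a face crop by comparing its image features against text features, one prompt per category. It uses the public OpenAI clip package as a stand-in for the pre-trained multi-modal alignment model and fixed hand-written prompts instead of CoOp's learned context vectors; the model variant, prompt wording and category list are assumptions, not the patent's configuration.

```python
import torch
import clip                     # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

CATEGORIES = ["neutral", "sad", "angry", "happy", "scared", "disgusted", "surprised"]
PROMPTS = [f"a photo of a person with a {c} facial expression" for c in CATEGORIES]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_expression(face_img: Image.Image) -> dict:
    """Return a confidence per category from image-text similarity (second result)."""
    image = preprocess(face_img).unsqueeze(0).to(device)
    text = clip.tokenize(PROMPTS).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1).squeeze(0)
    return {c: float(p) for c, p in zip(CATEGORIES, probs)}
```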
  • In the method of this embodiment, facial expression recognition is performed by combining the base model for facial expression recognition with the multi-modal alignment model, which improves the accuracy of expression recognition: the facial expression classification accuracy on a test data set reaches 92.9%.
  • However, because the model used for expression recognition in this embodiment combines a base model and a multi-modal alignment model, it has many parameters and is complex. When hardware resources are limited (for example, only CPU resources and no GPU resources are available), the model's inference time (RT) is high and the efficiency of expression recognition is low.
  • Therefore, before the user's face image is input into the base model and the multi-modal alignment model and the first and second expression classification results are determined, the trained base model and multi-modal alignment model for facial expression recognition are obtained and model distillation is performed on them, so that the expression recognition models are compressed. This reduces the number of model parameters while ensuring that the accuracy of facial expression recognition still meets the requirements, shortens model inference time, improves the efficiency of expression recognition, and enables real-time classification and recognition of facial expressions.
  • After distillation, the classification accuracy of expression recognition is basically maintained (reaching 91.2% on the above test data set), the number of model parameters is reduced, and the single-frame inference time on the CPU is controlled within 30 ms, achieving real-time recognition.
  • Model pruning or other model compression techniques can also be used in place of model distillation to compress the base model and multi-modal alignment model for expression recognition, which is not specifically limited here. A distillation sketch follows.
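Below is a minimal knowledge-distillation training step in PyTorch, assuming a frozen teacher that fuses the base model and the multi-modal alignment model into one set of expression logits and a small student network trained to match them; the temperature, loss weighting and model interfaces are illustrative assumptions, not the patent's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(student: nn.Module,
                      teacher: nn.Module,
                      images: torch.Tensor,
                      labels: torch.Tensor,
                      optimizer: torch.optim.Optimizer,
                      T: float = 4.0,
                      alpha: float = 0.7) -> float:
    """One training step: the student mimics the frozen teacher's soft targets
    while also fitting the hard expression labels."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(images)            # fused logits of base + alignment model

    student_logits = student(images)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         F.softmax(teacher_logits / T, dim=1),
                         reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```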
  • Figure 4 is a flow chart of a virtual character driving method based on expression recognition provided by another embodiment of the present application. Based on any of the above method embodiments, the avatar can be given duplex capability: it can actively or passively interrupt its current broadcast and respond to the user's expression after the interruption to guide the subsequent dialogue, making the interaction between the avatar and the user smoother and more intelligent. As shown in Figure 4, the specific steps of this method are as follows:
  • Step S401 Obtain a three-dimensional image rendering model of the virtual character to provide interactive services to users using the virtual character.
  • Step S402 During a round of dialogue between the virtual character and the user, obtain the user's face image in real time.
  • Step S403: Input the user's face image into a base model and a multi-modal alignment model for facial expression recognition respectively, determine the first expression classification result through the base model, and determine the second expression classification result through the multi-modal alignment model.
  • Step S404 Determine the target classification of the user's current expression based on the first expression classification result and the second expression classification result.
  • Step S405: If the current dialogue state is a state in which the virtual character is outputting and the user is receiving, determine whether the target category belongs to the first preset expression categories.
  • the response strategy of the first preset expression classification includes an interruption strategy.
  • When the interruption strategy is executed, it interrupts the current processing of the virtual character and drives the virtual character to perform the response behavior corresponding to the interruption strategy.
  • In this embodiment, first preset expression categories are set, together with the interruption strategy corresponding to each first preset expression category.
  • the first preset expression category and the interruption strategy corresponding to each first preset expression category can be set and adjusted according to the needs of the actual application scenario, and are not specifically limited here.
  • For example, the first preset expression categories and the interruption strategy corresponding to each first preset expression category can be set as shown in Table 1 below.
  • The interruption strategy does not have to include broadcasting words, making prescribed expressions and performing prescribed actions all at the same time; it may include only any one or any two of them. For example, it may make prescribed expressions and movements without broadcasting words, or broadcast prescribed words and make prescribed expressions without performing any movements, and so on.
  • In this step, it is determined whether the target category of the user's current expression belongs to the first preset expression categories.
  • If it does, the avatar may need to respond to the user's current expression, and subsequent step S406 continues to be performed.
  • If it does not, the avatar does not need to respond to the user's current expression, and the avatar continues its current processing.
  • Step S406 then determines whether the interruption triggering condition corresponding to the target category is currently met. If it is, the virtual character's current output is interrupted in advance, and the virtual character is driven to perform the corresponding interruption response behavior according to the interruption strategy corresponding to the target category.
  • Step S406 Determine whether the interrupt triggering condition corresponding to the target classification is currently met.
  • The interruption triggering condition corresponding to the target category includes at least one of the following: the confidence that the user's current expression belongs to the target category is greater than or equal to the confidence threshold corresponding to the target category; the number of rounds between the current dialogue round and the previous dialogue round that triggered an interruption is greater than or equal to the preset number of rounds.
  • context information can be recorded, and the context information includes information about whether the virtual character was interrupted. According to the current context information, it can be determined whether the number of rounds between the current dialogue round and the previous dialogue round that triggers interruption is greater than or equal to the preset number of rounds.
  • The interruption triggering condition corresponding to each first preset expression category can be set according to the needs of the actual application scenario, so as to avoid affecting the normal interaction between the virtual character and the user and to improve the smoothness and intelligence of their interaction.
  • the interrupt triggering condition corresponding to the first preset expression category may include: the confidence that the user's current expression belongs to the first preset expression category is greater than or equal to the confidence threshold corresponding to the first preset expression category.
  • Only when this condition is met is the execution of the interruption strategy corresponding to the first preset expression category triggered, which avoids affecting the normal interaction between the avatar and the user and improves the smoothness and intelligence of their interaction.
  • the interrupt triggering condition corresponding to the first preset expression category may include: the number of rounds in the interval between the current dialogue round and the previous dialogue round that triggered interruption is greater than or equal to the preset number of rounds, This avoids frequently interrupting the output of the avatar, avoids multiple rounds of interruptions affecting the normal interaction between the avatar and the user, and improves the smoothness and intelligence of the interaction between the avatar and the user.
  • the interrupt triggering condition corresponding to the first preset expression category may include: the confidence that the user's current expression belongs to the first preset expression category is greater than or equal to the confidence threshold corresponding to the first preset expression category, and The number of rounds between the current dialogue round and the previous dialogue round that triggered interruption is greater than or equal to the preset number of rounds.
  • the confidence thresholds corresponding to different first preset expression classifications can be different, which can be set and adjusted according to the needs of actual application scenarios.
  • The preset number of rounds can likewise be set and adjusted according to the needs of actual application scenarios; this embodiment does not specifically limit it here.
  • For example, suppose the interruption triggering conditions corresponding to the target category are that the confidence that the user's current expression belongs to the target category is greater than or equal to the confidence threshold corresponding to the target category, and that the number of rounds between the current dialogue round and the previous dialogue round that triggered an interruption is greater than or equal to the preset number of rounds, with both required to hold at the same time. In this step, it is then judged, based on the confidence that the user's current expression belongs to the target category and the current context information, whether the interruption triggering condition corresponding to the target category is currently met. A check of this kind is sketched below.
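A minimal Python sketch of that check, assuming the duplex decision-making module keeps a small dialogue context record; the threshold values, field names and minimum round gap are placeholders, not the patent's configuration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DialogueContext:
    """Hypothetical context record kept by the duplex decision-making module."""
    current_round: int
    last_interrupt_round: Optional[int]   # None if no interruption has been triggered yet

# per-category thresholds and the round gap are illustrative values
CONFIDENCE_THRESHOLDS = {"angry": 0.8, "sad": 0.75, "scared": 0.75, "disgusted": 0.8, "surprised": 0.85}
MIN_ROUNDS_BETWEEN_INTERRUPTS = 2

def interrupt_condition_met(category: str, confidence: float, ctx: DialogueContext) -> bool:
    """Step S406 sketch: both sub-conditions must hold at the same time."""
    confident_enough = confidence >= CONFIDENCE_THRESHOLDS.get(category, 1.0)
    rounds_apart = (ctx.last_interrupt_round is None or
                    ctx.current_round - ctx.last_interrupt_round >= MIN_ROUNDS_BETWEEN_INTERRUPTS)
    return confident_enough and rounds_apart
```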
  • If the interruption triggering condition is currently met, steps S407-S408 are executed to drive the virtual character to perform the corresponding interruption response behavior according to the interruption strategy corresponding to the target category.
  • If it is not met, the avatar does not need to perform interruption response processing for the user's current expression, and the avatar continues its current processing.
  • Step S407 If the interruption triggering condition corresponding to the target classification is currently met, interrupt the current output of the avatar, and determine the corresponding driving data according to the interruption strategy corresponding to the target classification.
  • The driving data is used to drive the avatar to perform at least one of the following interruption response behaviors: broadcasting words corresponding to the expression category, making expressions with specific emotions, and performing prescribed actions.
  • Accordingly, the interruption strategy may include at least one of the following interruption response behaviors: broadcasting words corresponding to the expression category, making expressions with specific emotions, and performing prescribed actions.
  • the interruption strategies corresponding to different first preset expression categories may include different types and specific contents of interruption response behaviors.
  • the corresponding driving data is determined according to the interruption strategy corresponding to the target category.
  • the driving data includes all driving parameters required to drive the virtual character to execute the interruption strategy corresponding to the target classification.
  • any existing avatar driving method that generates avatar driving data based on a determined strategy can be used, and will not be described in detail here.
  • Step S408 According to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform at least one of the following interrupt response behaviors: broadcasting words for the corresponding expression classification, making expressions with specific emotions, and performing prescribed actions.
  • the three-dimensional image rendering model of the virtual character includes the rendering data required to realize the rendering of the virtual character.
  • the three-dimensional image rendering model based on the virtual character can render the skeletal data of the virtual character into the three-dimensional image of the virtual character displayed to the user.
  • the method provided in this embodiment can be applied in scenarios where virtual characters interact with people.
  • Virtual characters with three-dimensional images are used to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
  • Specifically, the skeleton model of the virtual character is driven according to the driving data to obtain the skeleton data corresponding to the response behavior, and the skeleton data is rendered according to the three-dimensional image rendering model of the virtual character to obtain the virtual character image data corresponding to the response behavior. By rendering the virtual character image data into the output video stream, the virtual character in the output video stream makes the corresponding response behavior, thereby realizing a duplex interaction function in which the virtual character responds promptly to the user's facial expressions.
  • In the method of this embodiment, first preset expression categories and the interruption strategy corresponding to each first preset expression category are set, and the target category of the user's expression is identified in real time. When the target category belongs to the first preset expression categories and the interruption triggering condition is met, the corresponding driving data is determined according to the interruption strategy corresponding to the target category, and the virtual character is driven to perform at least one of the following interruption response behaviors: broadcasting words corresponding to the expression category, making expressions with specific emotions, and performing prescribed actions. This avoids affecting the normal interaction between the virtual character and the user while improving the smoothness and intelligence of that interaction.
  • After step S408, the following steps may be included:
  • Step S409 If the user's voice input is received within the first preset time period and the semantic information of the user's voice input is recognized, the next round of dialogue is started and dialogue processing is performed based on the semantic information of the user's voice input.
  • the first preset duration is generally set to a shorter duration so that the user will not feel a long pause.
  • The first preset duration can be set and adjusted according to the needs of the actual application scenario, such as a few hundred milliseconds, 1 second, or even a few seconds; there is no specific limit here.
  • Step S410 If the user's voice input is not received within the first preset time period, or the semantic information of the user's voice input cannot be recognized, continue the current output of the interrupted virtual character.
  • Optionally, the interrupted current output of the avatar can be continued after a pause of a third preset duration, so as to give the user enough time for voice input.
  • the third preset duration can be hundreds of milliseconds, 1 second, or even several seconds, etc., and can be set and adjusted according to the needs of actual application scenarios, and is not specifically limited here.
  • In this way, if user voice input with recognizable semantic information is received after the interruption, a new round of dialogue is started; if not, the avatar pauses for a certain period of time and then continues its previous broadcast. The interruption response behavior therefore does not affect the normal interaction between the avatar and the user, and the smoothness and intelligence of their interaction are improved. A timing sketch is given below.
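A minimal sketch of the post-interruption timing in steps S409/S410, assuming the dialogue engine exposes callbacks for ASR/NLU results, for starting a new round, and for resuming the broadcast; the callback names, polling loop and durations are placeholders.

```python
import time
from typing import Callable, Optional

def after_interrupt(get_user_semantics: Callable[[], Optional[str]],
                    start_new_round: Callable[[str], None],
                    resume_broadcast: Callable[[], None],
                    first_timeout_s: float = 1.0,
                    pause_before_resume_s: float = 0.5) -> None:
    """Wait briefly for meaningful user speech after an interruption; start a new
    dialogue round if it arrives, otherwise resume the interrupted broadcast after
    an extra pause (the third preset duration)."""
    deadline = time.time() + first_timeout_s
    while time.time() < deadline:
        semantics = get_user_semantics()       # None until ASR/NLU yields semantic info
        if semantics:
            start_new_round(semantics)
            return
        time.sleep(0.05)                       # poll at roughly 20 Hz
    time.sleep(pause_before_resume_s)          # give the user a little more time
    resume_broadcast()
```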
  • Figure 5 is a flow chart of a virtual character driving method based on expression recognition provided by another embodiment of the present application.
  • Based on any of the above method embodiments, the interaction between the virtual character and the person can have duplex capability: the virtual character can actively perform acceptance behaviors according to the user's expression to guide the subsequent dialogue process, making the interaction between the virtual character and the user smoother and more intelligent.
  • the specific steps of this method are as follows:
  • Step S501 Obtain a three-dimensional image rendering model of the virtual character to provide interactive services to users using the virtual character.
  • Step S502 During a round of dialogue between the virtual character and the user, obtain the user's face image in real time.
  • Step S503: Input the user's facial image into a base model and a multi-modal alignment model for facial expression recognition respectively, determine the first expression classification result through the base model, and determine the second expression classification result through the multi-modal alignment model.
  • Step S504 Determine the target classification of the user's current expression based on the first expression classification result and the second expression classification result.
  • Step S505: If the current dialogue state is a state in which the user is inputting and the virtual character is receiving, determine whether the target category belongs to the second preset expression categories.
  • the response strategy of the second preset expression classification includes an acceptance strategy.
  • The acceptance strategy is mainly used in the dialogue state where the user is inputting and the avatar is receiving; it does not significantly interrupt the user's input, and the avatar makes an acceptance response that does not affect the user's input.
  • Driving the avatar, in this dialogue state, according to the user's facial expression during input can simulate the way real humans respond to the other party's expression in real time during an interaction. For this purpose, second preset expression categories are set, together with the acceptance strategy corresponding to each second preset expression category.
  • The second preset expression categories and the acceptance strategy corresponding to each second preset expression category can be set and adjusted according to the needs of the actual application scenario, and are not specifically limited here.
  • For example, the second preset expression categories and the acceptance strategy corresponding to each second preset expression category may be set as shown in Table 2 below.
  • In this step, it is determined whether the target category of the user's current expression belongs to the second preset expression categories.
  • If it does, the avatar may need to respond to the user's current expression, and the subsequent steps continue to be performed.
  • If it does not, the avatar does not need to respond to the user's current expression, and the avatar continues its current processing.
  • Step S506: Determine whether the acceptance triggering condition corresponding to the target category is currently met.
  • The acceptance triggering condition corresponding to the target category includes: the user's expression belongs to the target category in at least N consecutive frames of images, where N is a positive integer whose value is preset for the target category.
  • For example, N may be 5; the value of N may be set and adjusted according to the needs of the actual application scenario, and is not specifically limited in this embodiment.
  • Setting the acceptance triggering condition corresponding to the second preset expression category to require that the user's expression belongs to that category in at least N consecutive frames of images avoids frequent and unnecessary response behaviors of the virtual character and improves the smoothness and intelligence of the interaction between the virtual character and the user, as sketched below.
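A minimal Python sketch of the N-consecutive-frames check; the class name and the fixed N are illustrative, and in practice N would be looked up per category.

```python
from collections import deque
from typing import Deque

class AcceptanceTrigger:
    """Fire only when the same category has been recognised in N consecutive frames."""
    def __init__(self, n_required: int = 5):
        self.n_required = n_required
        self.recent: Deque[str] = deque(maxlen=n_required)

    def update(self, category: str) -> bool:
        """Feed the per-frame recognised category; return True when the trigger fires."""
        self.recent.append(category)
        return (len(self.recent) == self.n_required and
                len(set(self.recent)) == 1)

# usage sketch
trigger = AcceptanceTrigger(n_required=5)
fired = False
for frame_category in ["happy", "happy", "happy", "happy", "happy"]:
    fired = trigger.update(frame_category)
print(fired)   # True: 5 consecutive 'happy' frames
```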
  • If the acceptance triggering condition is currently met, steps S507-S508 are executed to drive the virtual character to perform the corresponding acceptance response behavior according to the acceptance strategy corresponding to the target category.
  • If it is not met, the avatar does not need to perform acceptance response processing for the user's current expression, and the avatar continues its current processing.
  • It should be noted that step S506 is an optional step. When it is determined in step S505 that the target category belongs to the second preset expression categories, step S507 can be executed directly to determine the corresponding driving data according to the acceptance strategy corresponding to the target category, and the virtual character is then driven to perform the acceptance response behavior based on the driving data.
  • Step S507 Determine the corresponding driving data according to the acceptance strategy corresponding to the target classification.
  • The driving data is used to drive the virtual character to perform at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making expressions with specific emotions, and performing prescribed actions.
  • the acceptance strategy may include at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making expressions with specific emotions, and performing prescribed actions. Among them, the types and specific contents of the acceptance response behaviors included in the acceptance strategies corresponding to different second preset expression categories may be different.
  • the corresponding driving data is determined according to the acceptance strategy corresponding to the target category.
  • the driving data includes all driving parameters required to drive the virtual character to execute the undertaking strategy corresponding to the target classification.
  • any existing avatar driving method that generates avatar driving data based on a determined strategy can be used, and will not be described in detail here.
  • Optionally, this step can also be implemented in the following manner: obtain the voice data currently input by the user, identify the user intention information corresponding to the voice data, and determine the emotional polarity corresponding to the user intention information; determine, according to the emotional polarity corresponding to the user intention information and the acceptance strategy, the specific tone and specific emotion used in the acceptance response behavior; and determine the corresponding driving data according to the acceptance strategy corresponding to the target category and the specific tone and specific emotion, where the driving data is used to drive the virtual character to perform at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making an expression with a specific emotion, and performing a prescribed action.
  • Exemplarily, the voice stream input by the user can be acquired in real time; when the target category is determined to belong to the second preset expression category, the voice data input by the user in the most recent period can be acquired and converted into corresponding text information, the user intention information corresponding to the text information is recognized, and the emotional polarity corresponding to that user intention information is determined.
  • the emotional polarity corresponding to the user intention information includes: positive, negative and neutral.
  • Recognizing the user intention information corresponding to the text information can be realized with an existing neural network classification model for Natural Language Understanding (NLU). For example, TextRCNN can be used; this model balances effectiveness against computational cost, achieving high classification accuracy with low model complexity and low inference overhead. Alternative models include TextCNN, Transformer-based classifiers, and so on, which are not described further in this embodiment.
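  • As a hedged sketch of this step, the lookup below encodes the intent-to-polarity grouping described elsewhere in this application (commands and positive statements are positive; questions and other intents are neutral; abuse and negative statements are negative). The English intent label strings and the classifier interface are assumptions; the trained TextRCNN (or alternative) model itself is outside this snippet.

```python
# Intent labels follow the six user-intent types named in the specification;
# the English label strings used here are assumptions.
INTENT_TO_POLARITY = {
    "command": "positive",
    "positive_statement": "positive",
    "question": "neutral",
    "other": "neutral",
    "abuse": "negative",
    "negative_statement": "negative",
}

def polarity_of(intent_label: str) -> str:
    # Unknown intents are treated as neutral so the pipeline never fails hard.
    return INTENT_TO_POLARITY.get(intent_label, "neutral")

def decide_polarity(text: str, classify_intent) -> str:
    # `classify_intent` stands in for a trained NLU model (e.g. TextRCNN)
    # and is assumed to return one of the labels above.
    return polarity_of(classify_intent(text))

# Usage sketch with a dummy classifier:
print(decide_polarity("could you repeat that", lambda t: "question"))  # neutral
```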
  • Exemplarily, the acceptance strategy may also include the correspondence between the emotional polarity of the user intention information and the specific tone of the speech broadcast in the acceptance response behavior, the correspondence between the emotional polarity and the specific emotion of the expression made in the acceptance response behavior, and the correspondence between the emotional polarity and the type of action performed in the acceptance response behavior. Based on the emotional polarity corresponding to the user intention information and the acceptance strategy, the specific tone and specific emotion to be used in the acceptance response behavior can therefore be determined.
  • In this implementation, by recognizing the emotional polarity of the user intention information in the user's current voice input, the specific tone of the broadcast speech and the specific emotion of the expression used when the virtual character performs the acceptance response behavior are determined, so that the virtual character responds to the user's current emotional polarity with a matching tone and emotion. This improves the degree of anthropomorphism of the virtual character, increases the user's enthusiasm for continued interaction, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
  • It should be noted that the acceptance words broadcast under the acceptance strategy are usually short, such as "uh-huh", "yes", "right", "um", "oh", and the like, so that broadcasting them does not affect the user's normal voice input.
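  • The sketch below illustrates this alternative implementation: the acceptance strategy for the target category supplies the short acceptance words and action, while the tone and emotion are chosen from the emotional polarity of the user intention information. All concrete words, tones, emotions, and actions here are hypothetical placeholders, not values defined by this application.

```python
# Hypothetical per-category acceptance strategies (second preset expression categories).
CATEGORY_STRATEGIES = {
    "happy":     {"speech": "uh-huh", "action": "nod"},
    "surprised": {"speech": "oh",     "action": "lean_in"},
}

# Hypothetical polarity -> (tone, emotion) correspondence carried by the strategy.
POLARITY_TONE_EMOTION = {
    "positive": ("upbeat",   "smile"),
    "neutral":  ("calm",     "neutral"),
    "negative": ("soothing", "concerned"),
}

def acceptance_driving_data(target_category: str, polarity: str):
    """Pack the acceptance response into a driving-data dict, or return None
    if no acceptance strategy is configured for this category."""
    strategy = CATEGORY_STRATEGIES.get(target_category)
    if strategy is None:
        return None
    tone, emotion = POLARITY_TONE_EMOTION.get(polarity, POLARITY_TONE_EMOTION["neutral"])
    return {
        "speech_params": {"text": strategy["speech"], "tone": tone},
        "expression_params": {"emotion": emotion},
        "action_params": {"action": strategy["action"]},
    }

print(acceptance_driving_data("happy", "positive"))
```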
  • Step S508: According to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making an expression with a specific emotion, and performing a prescribed action.
  • The three-dimensional image rendering model of the virtual character includes the rendering data required to render the virtual character; based on this model, the skeletal data of the virtual character can be rendered into the three-dimensional image of the virtual character displayed to the user.
  • the method provided in this embodiment can be applied in scenarios where virtual characters interact with people, using virtual characters with three-dimensional images to realize real-time interaction functions between machines and people, so as to provide intelligent services to people.
  • In this embodiment, for the dialogue state in which the user is providing input and the virtual character is receiving it, the virtual character is driven according to the user's facial expressions during the input process, simulating the way real humans respond in real time to the other party's expressions during an interaction. The second preset expression categories and the acceptance strategy corresponding to each of them are configured in advance. By recognizing the target category of the user's expression in real time, determining that the target category belongs to the second preset expression categories, and determining that the acceptance trigger condition corresponding to the target category is currently met, the corresponding driving data is determined from the acceptance strategy corresponding to the target category together with the emotional polarity of the user intention information in the user's input voice data, and the virtual character is driven to perform at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making an expression with a specific emotion, and performing a prescribed action. This avoids disrupting the normal interaction between the virtual character and the user while improving the degree of anthropomorphism of the virtual character and the fluency and intelligence of the interaction between the virtual character and the user.
  • Figure 6 is a flowchart of a virtual character driving method provided by another embodiment of the present application. Based on any of the above method embodiments, the interaction between the virtual character and the user can have duplex capability, and the virtual character can actively produce acceptance responses according to the user's voice input to guide the subsequent dialogue process, making the interaction between the virtual character and the user smoother and more intelligent. As shown in Figure 6, the specific steps of this method are as follows:
  • Step S601: Obtain a three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to users.
  • Step S602: In a round of dialogue between the virtual character and the user, obtain the voice data input by the user in real time.
  • the virtual character can have multiple rounds of dialogue with the person.
  • the voice stream from the user, that is, the voice data input by the user, can be received in real time.
  • Step S603: When it is detected that the silence duration of the voice data input by the user is greater than or equal to the second preset duration, and it is determined that the voice input has not yet ended, convert the voice data into corresponding text information.
  • In this embodiment, Voice Activity Detection (VAD) can be performed in real time on the voice stream input by the user to obtain the silence duration of the user's input (that is, the VAD time).
  • The second preset duration is a shorter duration, less than the silence duration threshold. The silence duration threshold is the silence duration used to determine whether the user's current round of input has ended: when the silence duration of the user's voice input reaches the silence duration threshold, the user's current round of voice input is determined to have ended. For example, the silence duration threshold can be 800 ms and the second preset duration can be 300 ms; the second preset duration can be set and adjusted according to the needs of the actual application scenario and is not specifically limited here.
  • Specifically, when the silence duration of the user's voice input is greater than or equal to the second preset duration but the voice input has not ended (that is, the silence duration is less than the silence duration threshold), the voice data is converted into corresponding text information, and subsequent processing is performed based on that text information.
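  • A minimal sketch of this gating logic is shown below, using the example values from this embodiment (an 800 ms silence threshold for end of turn and a 300 ms second preset duration). The VAD itself, the speech-to-text step, and the downstream acceptance pipeline are assumed to exist and are only represented by placeholders.

```python
SILENCE_END_THRESHOLD_MS = 800   # example end-of-turn threshold from this embodiment
SECOND_PRESET_DURATION_MS = 300  # example "long pause" threshold from this embodiment

def should_run_acceptance(silence_ms: float) -> bool:
    """True when the pause is long enough to warrant an acceptance response
    but the user's turn is not yet considered finished."""
    return SECOND_PRESET_DURATION_MS <= silence_ms < SILENCE_END_THRESHOLD_MS

def on_vad_update(silence_ms: float, voice_to_text, run_acceptance):
    # voice_to_text and run_acceptance are placeholders for the ASR step and
    # the polarity/strategy pipeline of steps S604-S606.
    if should_run_acceptance(silence_ms):
        text = voice_to_text()
        run_acceptance(text)

print(should_run_acceptance(450))  # True: long pause, but turn not over
print(should_run_acceptance(900))  # False: the turn is treated as finished
```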
  • Step S604: Identify the user intention information corresponding to the text information, and determine the emotional polarity corresponding to the user intention information.
  • the emotional polarity corresponding to the user intention information includes: positive, negative and neutral.
  • This step can be implemented through the existing Natural Language Understanding (NLU) algorithm, which will not be described again in this embodiment.
  • Step S605: Determine the corresponding driving data according to the emotional polarity corresponding to the user intention information. Specifically, the corresponding acceptance strategy is determined according to the emotional polarity of the user intention information, and the corresponding driving data is generated according to that acceptance strategy. Exemplarily, acceptance strategies such as those shown in Table 3 can be configured for the different emotional polarities of the user intention information.
  • any existing avatar driving method that generates avatar driving data based on a determined strategy can be used, and will not be described in detail here.
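  • Since the contents of Table 3 are not reproduced here, the following sketch only shows the shape of this lookup from emotional polarity to an acceptance strategy and its driving data; every entry is a hypothetical placeholder rather than a value defined by this application.

```python
# Hypothetical stand-in for Table 3; real tones, words, and actions are configured per deployment.
POLARITY_STRATEGIES = {
    "positive": {"speech": "uh-huh", "tone": "upbeat",   "emotion": "smile",     "action": "nod"},
    "neutral":  {"speech": "um",     "tone": "calm",     "emotion": "neutral",   "action": "nod"},
    "negative": {"speech": "oh",     "tone": "soothing", "emotion": "concerned", "action": "slight_bow"},
}

def driving_data_for_polarity(polarity: str) -> dict:
    # Fall back to the neutral strategy for any unrecognized polarity.
    strategy = POLARITY_STRATEGIES.get(polarity, POLARITY_STRATEGIES["neutral"])
    return {
        "speech_params": {"text": strategy["speech"], "tone": strategy["tone"]},
        "expression_params": {"emotion": strategy["emotion"]},
        "action_params": {"action": strategy["action"]},
    }

print(driving_data_for_polarity("negative"))
```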
  • Step S606: According to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform at least one of the following acceptance response behaviors: broadcasting acceptance words in the prescribed tone configured for the emotional polarity corresponding to the user intention information, making an expression with a specific emotion, and performing a prescribed action. Broadcasting the acceptance words in the prescribed tone configured for the emotional polarity of the user intention information does not affect the user's voice input.
  • In this embodiment, the acceptance strategy may include at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making an expression with a specific emotion, and performing a prescribed action. The types and specific contents of the acceptance response behaviors may differ across the acceptance strategies corresponding to different second preset expression categories.
  • any existing avatar driving method that generates avatar driving data based on a determined strategy can be used, and will not be described in detail here.
  • In this embodiment, the voice data input by the user is obtained in real time during a round of dialogue between the virtual character and the user. When it is detected that the silence duration of the voice data input by the user is greater than or equal to the second preset duration and the voice input has not ended, the user intention information and its corresponding emotional polarity are identified from the voice data, the corresponding acceptance strategy is determined from the emotional polarity of the current user intention information, and the virtual character is driven according to that acceptance strategy to perform at least one of the following acceptance response behaviors: broadcasting acceptance words in the prescribed tone configured for the emotional polarity of the user intention information, making an expression with a specific emotion, and performing a prescribed action. This improves the degree of anthropomorphism of the virtual character without affecting the user's input, and improves the smoothness and intelligence of the interaction between the virtual character and the user.
  • Figure 7 is a schematic structural diagram of a virtual character driving device based on expression recognition provided by an exemplary embodiment of the present application.
  • The virtual character driving device based on expression recognition provided by the embodiments of the present application can execute the processing flow provided by the embodiments of the virtual character driving method based on expression recognition.
  • the virtual character driving device 70 based on expression recognition includes: a rendering model acquisition module 71 , a real-time data acquisition module 72 , a real-time expression recognition module 73 and a decision-making and driving module 74 .
  • the rendering model acquisition module 71 is used to acquire a three-dimensional image rendering model of the virtual character, so as to use the virtual character to provide interactive services to the user.
  • the real-time data acquisition module 72 is used to acquire the user's face image in real time during a round of dialogue between the virtual character and the user.
  • The real-time expression recognition module 73 is used to input the user's facial image into a base model for facial expression recognition and into a multi-modal alignment model, respectively, determine a first expression classification result through the base model and a second expression classification result through the multi-modal alignment model, and determine the target category of the user's current expression based on the first expression classification result and the second expression classification result.
  • The decision-making and driving module 74 is used to determine the corresponding driving data according to the response strategy corresponding to the target category if it is determined that the target category belongs to a preset expression category and the response trigger condition of the target category is currently met, and to drive the virtual character, according to the driving data and the three-dimensional image rendering model of the virtual character, to perform the corresponding response behavior.
  • the device provided by the embodiment of the present application can be specifically used to execute the solution provided by the method embodiment corresponding to Figure 2 above.
  • the specific functions and the technical effects that can be achieved will not be described again here.
  • the first expression classification result includes: a first confidence level that the user's current expression belongs to each expression category, and the second expression classification result includes a second confidence level that the user's current expression belongs to each expression category.
  • When determining the target category of the user's current expression from the first expression classification result and the second expression classification result, the real-time expression recognition module is further used to: determine, according to the first confidence level and the second confidence level with which the user's current expression belongs to each expression category, the target category of the user's current expression and the confidence that the user's current expression belongs to that target category.
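  • One way to realize this, consistent with the confidence-averaging option described in this application, is sketched below; the seven-category label list mirrors the preset expression categories named in the description, and the array shapes and values are assumptions.

```python
import numpy as np

CATEGORIES = ["neutral", "sad", "angry", "happy", "afraid", "disgusted", "surprised"]

def fuse_expression_results(conf_base: np.ndarray, conf_align: np.ndarray):
    """Average the per-category confidences from the base model and the
    multi-modal alignment model, then take the top-scoring category."""
    assert conf_base.shape == conf_align.shape == (len(CATEGORIES),)
    fused = (conf_base + conf_align) / 2.0      # "third confidence" per category
    idx = int(np.argmax(fused))
    return CATEGORIES[idx], float(fused[idx])   # target category + its confidence

# Usage with made-up confidences:
base = np.array([0.05, 0.02, 0.70, 0.10, 0.05, 0.05, 0.03])
align = np.array([0.10, 0.05, 0.60, 0.10, 0.05, 0.05, 0.05])
print(fuse_expression_results(base, align))     # ('angry', ~0.65)
```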
  • In an optional embodiment, the decision-making and driving module is further used to: if the current dialogue state is one in which the virtual character is outputting and the user is receiving, and the target category belongs to a first preset expression category, determine, based on the confidence that the user's current expression belongs to the target category and the current context information, whether the interruption trigger condition corresponding to the target category is currently met, where the first preset expression category has a corresponding interruption strategy; and if it is determined that the interruption trigger condition corresponding to the target category is currently met, interrupt the current output of the virtual character and determine the corresponding driving data according to the interruption strategy corresponding to the target category. The driving data is used to drive the virtual character to perform at least one of the following interruption response behaviors: broadcasting speech for the corresponding expression category, making an expression with a prescribed emotion, and performing a prescribed action.
  • The interruption trigger condition corresponding to the target category includes at least one of the following: the confidence that the user's current expression belongs to the target category is greater than or equal to the confidence threshold corresponding to the target category; and the number of turns between the current dialogue turn and the dialogue turn that last triggered an interruption is greater than or equal to a preset number of turns.
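  • A minimal joint check of these two conditions might look like the following (the worked example in this application treats both as having to hold at the same time); the confidence threshold of 0.8 and the minimum turn gap of 2 are illustrative assumptions, not values prescribed here.

```python
def interruption_trigger_met(confidence, current_turn, last_interrupt_turn,
                             conf_threshold=0.8, min_turn_gap=2):
    """Both conditions from this embodiment: high enough confidence in the
    target category, and enough dialogue turns since the last interruption."""
    confident_enough = confidence >= conf_threshold
    if last_interrupt_turn is None:          # no interruption has happened yet
        far_enough = True
    else:
        far_enough = (current_turn - last_interrupt_turn) >= min_turn_gap
    return confident_enough and far_enough

print(interruption_trigger_met(0.9, current_turn=6, last_interrupt_turn=3))  # True
print(interruption_trigger_met(0.9, current_turn=4, last_interrupt_turn=3))  # False
```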
  • After driving the virtual character to perform the corresponding response behavior according to the driving data and the three-dimensional image rendering model of the virtual character, the decision-making and driving module is further used to: if the user's voice input is received within a first preset duration and the semantic information of that voice input can be recognized, start the next round of dialogue and perform dialogue processing based on that semantic information; and if the user's voice input is not received within the first preset duration, or the semantic information of the voice input cannot be recognized, continue the interrupted current output of the virtual character.
  • In an optional embodiment, the decision-making and driving module is further used to: if the current dialogue state is one in which the user is providing input and the virtual character is receiving it, and the target category belongs to a second preset expression category, determine, based on the target category, whether the acceptance trigger condition corresponding to the target category is currently met, where the second preset expression category has a corresponding acceptance strategy; and if it is determined that the acceptance trigger condition corresponding to the target category is currently met, determine the corresponding driving data according to the acceptance strategy corresponding to the target category. The driving data is used to drive the virtual character to perform at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making an expression with a specific emotion, and performing a prescribed action; broadcasting the acceptance words with a specific tone does not affect the user's voice input.
  • In an optional embodiment, the decision-making and driving module is further used to: identify, from the voice data currently input by the user, the user intention information corresponding to that voice data, and determine the emotional polarity corresponding to the user intention information; determine, from the emotional polarity corresponding to the user intention information and the acceptance strategy, the specific tone and specific emotion to be used in the acceptance response behavior; and determine the corresponding driving data from the acceptance strategy corresponding to the target category together with that specific tone and specific emotion. The driving data is used to drive the virtual character to perform at least one of the following acceptance response behaviors: broadcasting acceptance words with a specific tone, making an expression with a specific emotion, and performing a prescribed action.
  • The acceptance trigger condition corresponding to the target category includes: the user's expression belongs to the target category in at least N consecutive frames of images, where N is a positive integer and a preset value corresponding to the target category.
  • the real-time data acquisition module is also used to: obtain the voice data input by the user in real time during a round of dialogue between the virtual character and the user.
  • The decision-making and driving module is further used to: when it is detected that the silence duration of the voice data input by the user is greater than or equal to the second preset duration and it is determined that the voice input has not ended, convert the voice data into corresponding text information; identify the user intention information corresponding to the text information and determine the emotional polarity corresponding to the user intention information; determine the corresponding driving data according to the emotional polarity corresponding to the user intention information; and, according to the driving data and the three-dimensional image rendering model of the virtual character, drive the virtual character to perform at least one of the following acceptance response behaviors: broadcasting acceptance words in the prescribed tone configured for the emotional polarity corresponding to the user intention information, making an expression with a specific emotion, and performing a prescribed action. Broadcasting the acceptance words in the prescribed tone configured for the emotional polarity of the user intention information does not affect the user's voice input.
  • the real-time expression recognition module is also used to: obtain the trained base model and multi-modal alignment model for facial expression recognition; and perform model distillation on the base model and multi-modal alignment model.
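  • As a hedged illustration of this distillation step, the sketch below shows a standard soft-label knowledge-distillation loss in PyTorch, using the averaged outputs of the two teachers (the base model and the multi-modal alignment model) as the soft target. The student architecture, temperature, and loss weighting are assumptions and are not prescribed by this application.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, base_logits, align_logits,
                      labels, temperature=2.0, alpha=0.5):
    """Combine hard-label cross-entropy with a KL term against the averaged
    soft labels of the two teachers (base model + multi-modal alignment model)."""
    with torch.no_grad():
        teacher_probs = (
            F.softmax(base_logits / temperature, dim=-1)
            + F.softmax(align_logits / temperature, dim=-1)
        ) / 2.0
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Shape sketch: a batch of 4 images over 7 expression categories.
s = torch.randn(4, 7); b = torch.randn(4, 7); a = torch.randn(4, 7)
y = torch.randint(0, 7, (4,))
print(distillation_loss(s, b, a, y).item())
```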
  • the device provided by the embodiments of the present application can be specifically used to execute the solution provided by any of the above method embodiments.
  • the specific functions and the technical effects that can be achieved will not be described again here.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
  • the electronic device 80 includes: a processor 801, and a memory 802 communicatively connected to the processor 801.
  • the memory 802 stores computer execution instructions.
  • the processor executes the computer execution instructions stored in the memory to implement the solution provided by any of the above method embodiments.
  • the specific functions and the technical effects that can be achieved will not be described again here.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • Computer-executable instructions are stored in the computer-readable storage medium; when the computer-executable instructions are executed by a processor, they are used to implement the solution provided by any of the above method embodiments. The specific functions and the achievable technical effects will not be repeated here.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product includes: a computer program.
  • the computer program is stored in a readable storage medium.
  • At least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above method embodiments. The specific functions and the achievable technical effects will not be described again here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application provides an expression-recognition-based virtual character driving method, apparatus, and device, relating to the fields of artificial intelligence, deep learning, machine learning, and virtual reality in computer technology. In the method of the present application, the user's face image is acquired in real time during a dialogue between a virtual character and the user, and the target category of the user's current expression is accurately recognized from the face image by means of a base model for facial expression recognition and a multi-modal alignment model. When it is determined that the target category belongs to a preset expression category and the response trigger condition of the target category is currently met, corresponding driving data is determined according to the response strategy corresponding to the expression category of the user's expression, and the virtual character is driven, according to the driving data and a three-dimensional image rendering model of the virtual character, to perform the corresponding response behavior, so that the virtual character responds to the user's expression in a timely manner, which improves the degree of anthropomorphism of the virtual character and makes the interaction between the virtual character and the user smoother and more intelligent.

Description

基于表情识别的虚拟人物驱动方法、装置及设备
本申请要求于2022年05月23日提交中国专利局、申请号为202210567627.X、申请名称为“基于表情识别的虚拟人物驱动方法、装置及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术中的人工智能、深度学习、机器学习、虚拟现实等领域,尤其涉及一种基于表情识别的虚拟人物驱动方法、装置及设备。
背景技术
传统的虚拟人物与人的交互中主要以语音为载体,虚拟人物与人的交互仅停留在语音层面,不具备理解人的表情等视觉信息的能力,虚拟人物无法根据人的表情做出相应的反馈,例如虚拟人物播报过程中,若虚拟人物当前播报的内容不是作为交互对象的人想要获取的信息时,人会做出不耐烦甚至愤怒的表情,如果是真人交互会主动询问以促使当前对话顺利并有效地进行,但是虚拟人物不具有这种能力;遇到用户无语音打断虚拟人物播报,但表情上有明显打断意图的情况时,虚拟人物无法做出相应的打断行为,虚拟人物拟人化程度低,导致交互过程不顺畅、不智能。
发明内容
本申请提供一种基于表情识别的虚拟人物驱动方法、装置及设备,用以解决传统虚拟人物拟人化程度低,导致沟通过程不顺畅、不智能的问题。
一方面,本申请提供一种基于表情识别的虚拟人物驱动方法,包括:
获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;
在虚拟人物与用户的一轮对话中,实时获取所述用户的人脸图像;
将所述用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过所述基模型确定第一表情分类结果,并通过所述多模态对齐模型确定第二表情分类结果;
根据所述第一表情分类结果和所述第二表情分类结果,确定所述用户当前表情的目标分类;
若确定所述目标分类属于预设表情分类,并且当前满足所述目标分类的响应触发条件,则根据所述目标分类对应的响应策略,确定对应的驱动数据;
根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。
另一方面,本申请提供一种基于表情识别的虚拟人物驱动装置,包括:
渲染模型获取模块,用于获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;
实时数据获取模块,用于在虚拟人物与用户的一轮对话中,实时获取所述用户的人脸图像;
实时表情识别模块,用于将所述用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过所述基模型确定第一表情分类结果,并通过所述多模态对齐模型确定第二表情分类结果;根据所述第一表情分类结果和所述第二表情分类结果,确定所述用户当前表情的目标分类;
决策驱动模块,用于若确定所述目标分类属于预设表情分类,并且当前满足所述目标分类的响应触发条件,则根据所述目标分类对应的响应策略,确定对应的驱动数据;根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。
另一方面,本申请提供一种电子设备,包括:处理器,以及与所述处理器通信连接的存储器;
所述存储器存储计算机执行指令;
所述处理器执行所述存储器存储的计算机执行指令,以实现上述所述的方法。
另一方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机执行指令,所述计算机执行指令被处理器执行时用于实现上述所述的方法。
本申请提供的基于表情识别的虚拟人物驱动方法、装置及设备,通过在虚拟人物与用户的一轮对话中,实时获取用户的人脸图像,将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果;根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类,从而实时地精准地识别用户面部表情的表情分类;基于用户表情的表情分类,在确定目标分类属于预设表情分类并且当前满足目标分类的响应触发条件时,根据用户表情的表情分类对应的响应策略,确定对应的驱动数据,并根据驱动数据和虚拟人物的三维形象渲染模型驱动虚拟人物执行对应的响应行为,使得输出视频流中虚拟人物做出对应的响应行为,增加用户表情的识别能力,并且驱动虚拟人物针对用户的面部表情做出及时响应,提高了虚拟人物拟人化程度,使得虚拟人物与人的交互更顺畅、更智能。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。
图1为本申请提供的基于表情识别的虚拟人物驱动方法的系统框架图;
图2为本申请一实施例提供的基于表情识别的虚拟人物驱动方法流程图;
图3为本申请一示例性实施例提供的表情识别方法的框架图;
图4为本申请另一实施例提供的基于表情识别的虚拟人物驱动方法流程图;
图5为本申请另一实施例提供的基于表情识别的虚拟人物驱动方法流程图;
图6为本申请另一实施例提供的虚拟人物驱动方法流程图;
图7为本申请一示例性实施例提供的基于表情识别的虚拟人物驱动装置的结构示意图;
图8为本申请一示例实施例提供的电子设备的结构示意图。
通过上述附图,已示出本申请明确的实施例,后文中将有更详细的描述。这些附图和文字描述并不是为了通过任何方式限制本申请构思的范围,而是通过参考特定实施例为本领域技术人员说明本申请的概念。
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
首先对本申请所涉及的名词进行解释:
多模态交互:用户可通过文字、语音、表情等方式与虚拟人物交流,虚拟人物可以理解用户文字、语音、表情等信息,并可以反过来通过文字、语音、表情等方式与用户进行交流。
双工交互:实时的、双向的交互方式,用户可以随时打断虚拟人物,虚拟人物也可以在必要的时候打断正在说话的自己。
静态表情识别:从给定的静态图像中分离出人特定的表情状态,给出表情种类的判断。
本申请提供的基于表情识别的虚拟人物驱动方法,涉及计算机技术中的人工智能、深度学习、机器学习、虚拟现实等领域,具体可以应用于虚拟人物与人类交互的场景中。
示例性地,常见的虚拟人物与人类交互的场景包括:智能客服、政务咨询、生活服务、智慧交通、虚拟陪伴人、虚拟主播、虚拟教师、网络游戏等等。
针对传统虚拟人物拟人化程度低,导致沟通过程不顺畅、不智能的问题,本申请提供一种基于表情识别的虚拟人物驱动方法,通过在虚拟人物与用户的一轮对话中,实时获取用户的人脸图像;将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果;根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类,从而精准地、实时地识别用户的面部表情,在确定目标分类属于预设表情分类,并且当前满足目标分类的响应触发条件时,根据目标分类对应的响应策略,确定对应的驱动数据;根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为,从而使得虚拟人物能够针对用户的表情及时地做出相应的响应,提高了虚拟人物拟人化程度,使得虚拟人物与人的沟通过程更顺畅、更智能。
图1为本申请提供的基于表情识别的虚拟人物驱动方法的系统框架图,如图1所示,该系统框架包括以下4个模块:用户人脸图像获取模块、实时表情识别模块、双工决策模块、虚拟人物驱动模块。其中,用户人脸图像获取模块用于:在虚拟人物与用户的交互过程中,实时地监测用户侧的视频流,获取用户侧的视频帧,通过人脸检测算法对视频帧进行人脸检测,得到用户的人脸图像。实时表情识别模块用于:使用训练好的用于人脸表情识别的基模型和多模态对齐模型,对用户的人脸图像进行表情识别,识别出人脸图像中用户当前表情的表情分类及置信度,以实时地、精准地识别出用户的面部表情。双工决策模块用于:预先设置好需做出响应的预设表情分类及每一预设表情分类对应的响应策略,针对用户当前表情的目标分类及置信度,以及当前对话上下文信息,做出虚拟人物是否针对用户当前表情进行响应,以及做出何种响应行为的决策结果。具体地,基于用户当前表情的目标分类,确定用户当前表情的目标分类是否属于预设表情分类,以及当前是否满足目标分类的响应触发条件,并在确定用户当前表情的目标分类属于预设表情分类,并且当前满足目标分类的响应触发条件时,确定目标分类对应的响应策略。其中,响应策略包括做出表情、播报话术、做出动作等。虚拟人物驱动模块用于根据目标分类对应的响应策略,确定对应的驱动数据;根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为,以使虚拟人物针对用户当前表情及时地做出相应的响应行为,提高了虚拟人物拟人化程度,使得虚拟人物与人的沟通过程更顺畅、更智能。
示例性地,在虚拟人物与人的对话过程中,虚拟人物针对用户的表情执行的响应行为可以包括打断当前播报、情感关怀行为、承接辅助对话流程等等双工交互策略。
下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。下面将结合附图,对本申请的实施例进行描述。
图2为本申请一实施例提供的基于表情识别的虚拟人物驱动方法流程图。本实施例提供的基于表情识别的虚拟人物驱动方法具体可以应用于具有使用虚拟人物实现与人类交互功能的电子设备,该电子设备可以是对话机器人、终端或服务器等,在其他实施例中,电子设备还可以采用其他设备实现,本实施例此处不做具体限定。
如图2所示,该方法具体步骤如下:
步骤S201、获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务。
其中,虚拟人物的三维形象渲染模型包括实现虚拟人物渲染所需的渲染数据,基于虚拟人物的三维形象渲染模型可以将虚拟人物的骨骼数据渲染成呈现给用户时展示的虚拟人物的三维形象。
本实施例提供的方法,可以应用于虚拟人物与人交互的场景中,利用具有三维形象的虚拟人物,实现机器与人的实时交互功能,以向人提供智能服务。
步骤S202、在虚拟人物与用户的一轮对话中,实时获取用户的人脸图像。
通常,在虚拟人物与人的交互过程中,虚拟人物可以与人进行多轮的对话,在每一轮 对话过程中,可以通过实时地监测来自用户的视频流,按照预设频率采样视频帧,对视频帧进行人脸检测,获取视频帧中人脸部分,得到用户的人脸图像。
其中,对视频帧进行人脸检测获取用户的人脸图像,可以使用常用的人脸检测算法实现,此处不做具体限定。
通常虚拟人物与人的交互场景通常为一对一的对话场景,也即一个用户与虚拟人物进行交互的场景,如果视频帧中存在多个人脸,则可以检测出视频帧中每一人脸的人脸图像,根据人脸图像的中心点与视频帧的中心点的距离,以及人脸图像的面积,将其中一个人脸图像作为当前用户的人脸图像。
示例性地,可以将人脸图像面积最大且最靠近视频帧中心的人脸图像作为当前用户的人脸图像。
示例性地,若面积最大的人脸图像只有一个,则将该面积最大的人脸图像作为当前用户的人脸图像。若面积最大的人脸图像有多个,也即多个人脸图像的面积相同且面积最大,则根据面积最大的人脸图像的中心点与视频帧的中心点的距离,将与视频帧的中心点的距离最大的人脸图像作为当前用户的人脸图像。
在实时地获取到用户的人脸图像之后,通过步骤S203-S204,实时地对用户的人脸图像进行表情识别处理,确定人脸图像中用户当前表情的表情分类。
步骤S203、将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果。
本实施例中,为了提高表情识别的精准度,采用人脸表情识别的基模型与多模态对齐模型相结合进行表情识别的方法。
具体地,将用户的人脸图像输入训练好的用于人脸表情识别的基模型,通过该基模型对用户的人脸图像进行表情识别,得到第一表情分类结果;并将用户的人脸图像输入训练好的多模态对齐模型,通过多模态对齐模型对用户的人脸图像进行表情识别,得到第二表情分类结果。
其中,用于人脸表情识别的基模型为基于表情分类任务训练得到的模型,可以采用在人脸表情识别领域中在公开数据集上表现较优的表情识别模型,例如,人脸对齐算法(Deep Alignment Network,简称DAN)、MobileNet、ResNet等等。
为了缓解训练数据不足影响表情识别效果的问题,本方案引入已经在大量图文数据的公开数据集上完成预训练的多模态对齐模型,在用于表情识别的少量带有表情分类标注的训练数据上对预训练的多模态对齐模型进行微调,可以得到训练好的多模态对齐模型,该多模态对齐模型用于表情识别确定表情分类。
示例性地,多模态对齐模型可以采用CoOp(Context Optimization)、CLIP(Contrastive Language-Image Pre-training)、Prompt Ensembling、PET(Pattern-Exploting Training)等模型中的任意一种,此处不做具体限定。
步骤S204、根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分 类。
在分别通过用于人脸表情识别的基模型和多模态对齐模型对用户的人脸图像进行表情识别,得到第一表情分类结果和第二表情分类结果之后,综合第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类,以提高表情识别的精准度。
步骤S205、判断目标分类是否属于预设表情分类。
本实施例中,可以配置预设表情分类,并配置每一预设表情分类对应的响应策略。预设表情分类是配置的需要虚拟人物做出相应行为的表情分类。预设表情分类可以包括一种或多种表情分类。
预设表情分类对应的响应策略可以包括一种或多种响应策略,响应策略包含虚拟人物针对对应预设表情分类的表情进行响应的具体实现方式,每一预设表情分类对应的响应策略的种类和每一种响应策略的具体实现方式均独立配置,不同预设表情分类对应的响应策略可以不同。
示例性地,一种可选地实施方式中,预设表情分类可以包括以下至少一种:中性、伤心、生气、高兴、害怕、厌恶、吃惊。其中,“伤心”、“害怕”、“厌恶”这三种预设表情分类的响应策略可以只包括打断策略;“高兴”这种预设表情分类的响应策略可以只包括承接策略;“生气”、“吃惊”这两种预设表情分类的响应策略可以同时包括打断策略和承接策略。
另外,预设表情分类的数量和种类,以及每一种预设表情分类的响应策略可以根据具体应用场景的不同进行不同的设置,此处不做具体限定。例如,其他可选地实施方式中,还可以设置更多其他预设表情分类,“伤心”、“害怕”、“厌恶”等预设表情分类也可以设置承接策略,“生气”、“吃惊”也可以不设置打断策略等等。
在确定用户当前表情的目标分类之后,该步骤中判断用户当前表情的目标分类是否属于预设表情分类。
若目标分类属于预设表情分类,则虚拟人物可能需要针对用户当前表情进行响应处理,继续执行后续步骤S206。
若目标分类不属于预设表情分类,则虚拟人物无需针对用户当前表情进行响应处理,虚拟人物继续当前的处理。
步骤S206、判断当前是否满足目标分类的响应触发条件。
本实施例中,每一预设表情分类还具有对应的响应触发条件,只有在满足对应的响应触发条件时,才可以基于预设表情分类对应的响应策略,驱动虚拟人物做出对应的响应行为。
其中,预设表情分类的响应触发条件,可以根据具体应用场景进行配置和调整,此处不做具体限定。
该步骤中,判断当前是否满足目标分类的响应触发条件,若当前满足目标分类的响应触发条件,执行步骤S207-S208,根据目标分类对应的响应策略,驱动虚拟人物执行对应 的响应行为。
若当前不满足目标分类的响应触发条件,则虚拟人物无需针对用户当前表情进行响应处理,虚拟人物继续当前的处理。
步骤S207、根据目标分类对应的响应策略,确定对应的驱动数据。
其中,响应策略包含虚拟人物针对对应预设表情分类的表情进行响应的具体实现方式。
示例性地,预设表情分类对应的响应策略可以包括:虚拟人物做出的表情、话术、动作等。
在确定目标分类属于预设表情分类,并且当前满足目标分类的响应触发条件时,根据目标分类对应的响应策略,确定对应的驱动数据。该驱动数据包括驱动虚拟人物执行目标分类对应的响应策略所需的所有驱动参数。
示例性地,若目标分类对应的响应策略包括虚拟人物做出规定表情,则该驱动数据包括表情驱动参数;若目标分类对应的响应策略包括虚拟人物做出规定动作,则该驱动数据包括动作驱动参数;若目标分类对应的响应策略包括虚拟人物播报规定话术,则该驱动数据包括语音驱动参数;若目标分类对应的响应策略包括表情、话术和动作中的多种响应方式,则该驱动数据包括对应的多种驱动参数,可以驱动虚拟人物执行响应策略对应的响应行为。
步骤S208、根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。
在根据目标分类对应的响应策略,确定对应的驱动数据之后,根据驱动数据驱动虚拟人物的骨骼模型得到响应行为对应的骨骼数据,根据虚拟人物的三维详细渲染模型对骨骼数据进行渲染,得到响应行为对应的虚拟人物图像数据。通过将虚拟人物图像数据渲染到输出视频流中,使得输出视频流中虚拟人物做出对应的响应行为,从而实现虚拟人物针对用户的面部表情做出及时响应的多模态双工交互功能。
本实施例通过在虚拟人物与用户的一轮对话中,实时获取用户的人脸图像,将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果;根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类,从而实时地精准地识别用户面部表情的表情分类;基于用户表情的表情分类,在确定目标分类属于预设表情分类并且当前满足目标分类的响应触发条件时,根据用户表情的表情分类对应的响应策略,确定对应的驱动数据,并根据驱动数据和虚拟人物的三维形象渲染模型驱动虚拟人物执行对应的响应行为,使得输出视频流中虚拟人物做出对应的响应行为,增加用户表情的实时识别能力,并且驱动虚拟人物针对用户的面部表情做出及时地响应,提高了虚拟人物拟人化程度,使得虚拟人物与人的交互更顺畅、更智能。
一种可选的实施例中,为了提高表情识别的精准度,采用人脸表情识别的基模型与多模态对齐模型相结合进行表情识别的方法。
在上述步骤S203中,将用户的人脸图像输入训练好的用于人脸表情识别的基模型,通过该基模型对用户的人脸图像进行表情识别,得到第一表情分类结果;并将用户的人脸图像输入训练好的多模态对齐模型,通过多模态对齐模型对用户的人脸图像进行表情识别,得到第二表情分类结果。
其中,第一表情分类结果包括所有表情分类对应的一组置信度,包括用户当前表情属于每一表情分类的第一置信度,第一置信度越大表示用户当前表情属于该表情分类的可能性越高。
其中,第二表情分类结果包括所有表情分类对应的另一组置信度,包括用户当前表情属于每一表情分类的第二置信度,第二置信度越大表示用户当前表情属于该表情分类的可能性越高。
在上述步骤S204中,根据用户当前表情属于每一表情分类的第一置信度和第二置信度,确定用户当前表情的目标分类,以及用户当前表情属于目标分类的置信度。
可选地,根据第一表情分类结果和第二表情分类结果,将两组置信度中最大的置信度对应的表情分类作为当前用户表情的目标分类。
可选地,该步骤中可以根据第一表情分类结果和第二表情分类结果,对两组置信度中同一表情分类对应的第一置信度和第二置信度求均值,作为该表情分类的第三置信度;根据各个表情分类的第三置信度,将第三置信度最大的表情分类作为当前用户表情的目标分类。
示例性地,图3为本申请一示例性实施例提供的表情识别方法的框架图,以基模型采用用于人脸表情识别的DAN模型,多模态对齐模型采用用于人脸表情识别的CoOp模型实现为例,如图3所示,将实时获取到的用户的人脸图像分别输入DAN模型和CoOp模型,通过DAN模型的图片编码器对人脸图像进行编码,基于多个注意力模块进行特征提取,基于注意力融合模块融合多个注意力模块提取的特征,并基于融合结果进行分类处理,得到第一表情分类结果;同时,通过CoOp模型的图片编码器对人脸图像进行编码得到图片特征,并基于文本编码器对模型内置的文本信息进行编码得到文本特征,对于多模态特征(包括文本特征和图片特征)进行相似度计算和分类处理,得到第二表情分类结果。综合第一表情分类结果和第二表情分类结果,确定最终分类结果,得到用户表情的表情分类的最终结果。
进一步地,基于当前应用场景,获取具有多种不同表情的人脸图片,以及每一人脸图片的表情分类标签,作为训练数据,对预训练的CoOp模型进行训练直至模型收敛,得到训练好的CoOp模型,训练好的CoOp模型在测试集上的表情分类准确率较高,能够满足当前应用场景的要求。其中,预训练的CoOp模型是指在公开数据集上预训练得到的CoOp模型。
本实施例中通过结合用于人脸表情识别的基模型和多模态对齐模型进行人脸表情识别,提高了表情识别的精准度,在一测试数据集上的人脸表情分类达到92.9%的准确率。
在上述任一方法实施例的基础上,本实施例中用于人脸识别的模型结合了基模型和多模态对齐模型,模型参数量大、模型复杂,在硬件资源受限(例如,只有CPU资源无GPU资源)的情况下,模型的推理耗时(respond time,简称RT)较高,表情识别的效率较低。
为了满足表情识别的实时性要求,提高表情识别的效率,将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果之前,获取训练好的用于人脸表情识别的基模型和多模态对齐模型,对基模型和多模态对齐模型进行模型蒸馏,以对表情识别的模型进行压缩,在保证人脸表情识别的准确率满足要求的情况下减少模型的参数量,减少模型推理耗时,提高表情识别的效率,实现人脸表情的实时分类识别。
通过模型蒸馏,可使得表情识别的分类准确率基本维持(在上述测试数据集上达到91.2%)的情况下,模型的参数量减少,在CPU上单帧推理耗时控制在了30ms以内,达到了实时识别的效果。
可选地,还可以采用模型剪枝技术或其他模型压缩技术替换模型整理,对人脸识别的基模型和多模态对齐模型进行压缩,此处不做具体限定。
图4为本申请另一实施例提供的基于表情识别的虚拟人物驱动方法流程图。在上述任一方法实施例的基础上,可以实现具有双工能力、虚拟人物具有主动或被动打断自己当前播报的能力,并针对用户表情做出打断后的响应行为,以引导后续的对话流程,使得虚拟人物与用户的交互更加流畅、更加智能。如图4所示,该方法具体步骤如下:
步骤S401、获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务。
步骤S402、在虚拟人物与用户的一轮对话中,实时获取用户的人脸图像。
步骤S403、将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果。
步骤S404、根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类。
上述步骤S401-S404与上述步骤S201-S204的实现方式类似,具体实现参见上述实施例的详细介绍,本实施例此处不再赘述。
步骤S405、若当前的对话状态为虚拟人物输出用户接收的状态,判断目标分类是否属于第一预设表情分类。
其中,第一预设表情分类的响应策略包括打断策略。打断策略执行时会打断虚拟人物当前处理,并驱动虚拟人物执行打断策略对应的响应行为。
本实施例中,针对在虚拟人物输出用户接收的对话状态下,需要虚拟人物打断当前处理,并执行针对用户表情的响应行为的情况,设置第一预设表情分类,并设置每一第一预设表情分类对应的打断策略。
第一预设表情分类及每一第一预设表情分类对应的打断策略可以根据实际应用场景的需要进行设置和调整,此处不做具体限定。
示例性地,第一预设表情分及第一预设表情分类对应的打断策略可以设置为如下表1所示。
表1
上述表1仅为一示例,打断策略可以不同时包括播报话术、做出规定表情、做出规定动作的响应行为,可以只包含其中的任意一种或任意两种,例如可以不播报话术,仅仅做出规定表情和动作;或者播报规定话术,做出规定表情,但不做任何动作等等。
在确定用户当前表情的目标分类之后,该步骤中判断用户当前表情的目标分类是否属于第一预设表情分类。
若目标分类属于第一预设表情分类,则虚拟人物可能需要针对用户当前表情进行响应处理,继续执行后续步骤S406。
若目标分类不属于第一预设表情分类,则虚拟人物无需针对用户当前表情进行响应处理,虚拟人物继续当前的处理。
示例性地,基于上述表1的第一预设表情分类及对应的打断策略,假设当前虚拟人物正在基于用户提出的问题播报答复信息,若检测到用户的表情为生气,则确定虚拟人物可能需要针对用户生气的表情做出响应,继续执行后续步骤S406判断当前是否满足目标分类对应的打断触发条件,若当前满足目标分类对应的打断触发条件,则打断虚拟人物的当 前输出,并根据目标分类对应的打断策略,驱动虚拟人物执行对应的打断响应行为。
步骤S406、判断当前是否满足目标分类对应的打断触发条件。
其中,目标分类对应的打断触发条件,包括以下至少一项:用户当前表情属于目标分类的置信度大于或等于目标分类对应的置信度阈值;当前对话轮次与前一次触发打断的对话轮次之间的间隔的轮次数量大于或等于预设轮次数量。
具体地,在虚拟人物与用户的交互过程中,可以记录上下文信息,上下文信息中包括是否被打断的信息。根据当前的上下文信息,可以判断当前对话轮次与前一次触发打断的对话轮次之间的间隔的轮次数量是否大于或等于预设轮次数量。
本实施例中,考虑到虚拟人物打断当前输出做出打断响应行为可能会干扰虚拟人物与用户的正常交互,因此可以根据实际应用场景的需要,设置每一第一预设表情分类对应的打断触发条件,来避免影响虚拟人物与用户的正常交互,提高虚拟人物与用户的交互的流畅度和智能化。
可选地,第一预设表情分类对应的打断触发条件可以包括:用户当前表情属于该第一预设表情分类的置信度大于或等于该第一预设表情分类对应的置信度阈值。这样,只有在用户的表情有较高置信度为第一预设表情分类时,才触发第一预设表情分类对应打断策略的执行,能够避免影响虚拟人物与用户的正常交互,提高虚拟人物与用户的交互的流畅度和智能化。
可选地,第一预设表情分类对应的打断触发条件可以包括:当前对话轮次与前一次触发打断的对话轮次之间的间隔的轮次数量大于或等于预设轮次数量,从而避免频繁地打断虚拟人物的输出,避免出现连续多轮的打断影响虚拟人物与用户的正常交互,提高虚拟人物与用户的交互的流畅度和智能化。
可选地,第一预设表情分类对应的打断触发条件可以包括:用户当前表情属于该第一预设表情分类的置信度大于或等于该第一预设表情分类对应的置信度阈值,并且当前对话轮次与前一次触发打断的对话轮次之间的间隔的轮次数量大于或等于预设轮次数量。这样,只有在用户的表情有较高置信度为第一预设表情分类,并且不会频繁打断虚拟人物的输出的情况下,触发第一预设表情分类对应打断策略的执行,能够更好地避免影响虚拟人物与用户的正常交互,提高虚拟人物与用户的交互的流畅度和智能化。
其中,不同的第一预设表情分类对应的置信度阈值可以不同,具体可以根据实际应用场景的需要进行设置和调整,预设轮次数量可以根据实际应用场景的需要进行设置和调整,本实施例此处不做具体限定。
示例性地,以目标分类对应的打断触发条件为:用户当前表情属于目标分类的置信度大于或等于目标分类对应的置信度阈值;和当前对话轮次与前一次触发打断的对话轮次之间的间隔的轮次数量大于或等于预设轮次数量,这两项同时满足为例,该步骤中,根据用户当前表情属于目标分类的置信度和当前的上下文信息,判断当前是否满足目标分类对应的打断触发条件。
若当前满足目标分类对应的打断触发条件,执行步骤S407-S408,根据目标分类对应的打断策略,驱动虚拟人物执行对应的打断响应行为。
若当前不满足目标分类对应的打断触发条件,则虚拟人物无需针对用户当前表情进行打断响应处理,虚拟人物继续当前的处理。
步骤S407、若当前满足目标分类对应的打断触发条件,则打断虚拟人物的当前输出,并根据目标分类对应的打断策略,确定对应的驱动数据,驱动数据用于驱动虚拟人物执行如下至少一种打断响应行为:播报针对对应表情分类的话术、做出具有特定情绪的表情、做出规定动作。
本实施例中,打断策略中可以包括以下至少一种打断响应行为:播报针对对应表情分类的话术、做出具有特定情绪的表情、做出规定动作。其中,不同的第一预设表情分类对应的打断策略包括的打断响应行为的种类和具体内容可以不同。
示例性地,如表1中所示,“伤心”和“生气”对应打断策略中做出相同的表情和动作,但是播报的话术不同;“害怕”和“厌恶”中做出的表情和动作均不相同,播报的话术也不同。
在确定目标分类属于第一预设表情分类,并且当前满足目标分类的打断触发条件时,根据目标分类对应的打断策略,确定对应的驱动数据。该驱动数据包括驱动虚拟人物执行目标分类对应的打断策略所需的所有驱动参数。
该步骤中,可以采用现有任意一种基于确定的策略生成虚拟人物的驱动数据的虚拟人物驱动方法实现,此处不再做详细地说明。
步骤S408、根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行如下至少一种打断响应行为:播报针对对应表情分类的话术、做出具有特定情绪的表情、做出规定动作。
其中,虚拟人物的三维形象渲染模型包括实现虚拟人物渲染所需的渲染数据,基于虚拟人物的三维形象渲染模型可以将虚拟人物的骨骼数据渲染成呈现给用户时展示的虚拟人物的三维形象。
本实施例提供的方法可以应用于虚拟人物与人交互的场景中,利用具有三维形象的虚拟人物,实现机器与人的实时交互功能,以向人提供智能服务。
在根据目标分类对应的响应策略,确定对应的驱动数据之后,根据驱动数据驱动虚拟人物的骨骼模型得到响应行为对应的骨骼数据,根据虚拟人物的三维详细渲染模型对骨骼数据进行渲染,得到响应行为对应的虚拟人物图像数据。通过将虚拟人物图像数据渲染到输出视频流中,使得输出视频流中虚拟人物做出对应的响应行为,从而实现虚拟人物针对用户的面部表情做出及时响应的双工交互功能。
本实施例中,针对在虚拟人物输出用户接收的对话状态下,需要虚拟人物打断当前处理,并执行针对用户表情的响应行为的情况,设置第一预设表情分类,并设置每一第一预设表情分类对应的打断策略,通过实时识别用户表情的目标分类,在确定目标分类属于第 一预设表情分类,并根据用户当前表情属于目标分类的置信度和当前的上下文信息,确定当前满足目标分类对应的打断触发条件时,根据目标分类对应的响应策略,确定对应的驱动数据,驱动虚拟人物执行如下至少一种打断响应行为:播报针对对应表情分类的话术、做出具有特定情绪的表情、做出规定动作,能够避免影响虚拟人物与用户的正常交互,同时提高虚拟人物与用户的交互的流畅度和智能化。
一种可选地实施方式中,在步骤S408之后,可以包括如下步骤:
步骤S409、若在第一预设时长内接收到用户的语音输入,并识别出用户的语音输入的语义信息,则开启下一轮对话,根据用户的语音输入的语义信息进行对话处理。
其中,第一预设时长一般设置为较短的时长,使得用户不会感觉到长时间的停顿,第一预设时长可以根据实际应用场景的需要进行设置和调整,例如几百毫秒、1秒、甚至几秒等,此处不做具体限定。
步骤S410、若在第一预设时长内未接收到用户的语音输入,或者无法识别出用户的语音输入的语义信息,则继续被打断的虚拟人物的当前输出。
可选地,若在第一预设时长内未接收到用户的语音输入,或者无法识别出用户的语音输入的语义信息,可以在停顿第三预设时长之后,继续被打断的虚拟人物的当前输出,以给用户留出足够的语音输入时间。
其中,第三预设时长可以为几百毫秒、1秒、甚至几秒等,可以根据实际应用场景的需要进行设置和调整,此处不做具体限定。
本实施例中,通过在驱动虚拟人物执行打断响应行为之后,如果在第一预设时长内接收到用户具有语义信息的语音输入,则开启新一轮对话,如果没有接收到用户具有语义信息的语音输入,则可以停顿一定时长后继续虚拟人物之前的播报,以避免打断响应行为影响虚拟人物与用户正常交互,提高虚拟人物与用户的交互的流畅度和智能化。
图5为本申请另一实施例提供的基于表情识别的虚拟人物驱动方法流程图。在上述方法实施例的基础上,虚拟人物与人的交互方案可以具有双工能力、虚拟人物具有根据用户表情进行主动承接功能,以引导后续的对话流程,使得虚拟人物与用户的交互更加流畅、更加智能。如图5所示,该方法具体步骤如下:
步骤S501、获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务。
步骤S502、在虚拟人物与用户的一轮对话中,实时获取用户的人脸图像。
步骤S503、将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果。
步骤S504、根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类。
上述步骤S501-S504与上述步骤S201-S204的实现方式类似,具体实现参见上述实施例的详细介绍,本实施例此处不再赘述。
步骤S505、若当前的对话状态为用户输入虚拟人物接收的状态,判断目标分类是否属 于第二预设表情分类。
其中,第二预设表情分类的响应策略包括承接策略。承接策略主要应用于用户输入虚拟人物接收的对话状态中,不会明显打断用户的输入,虚拟人物做出不会影响用户输入的承接响应行为。
本实施例中,针对在用户输入虚拟人物接收的对话状态下,根据用户输入过程中的面部表情,驱动虚拟人物可以模拟真实人类在交互过程中对对方的表情实时地做出回应的情况,设置第二预设表情分类,并设置每一第二预设表情分类对应的承接策略。
第二预设表情分类及每一第二预设表情分类对应的承接策略可以根据实际应用场景的需要进行设置和调整,此处不做具体限定。
示例性地,第二预设表情分及第二预设表情分类对应的承接策略可以设置为如下表2所示。
表2
在确定用户当前表情的目标分类之后,该步骤中判断用户当前表情的目标分类是否属于第二预设表情分类。
若目标分类属于第二预设表情分类,则虚拟人物可能需要针对用户当前表情进行响应处理,继续执行后续步骤。
若目标分类不属于第二预设表情分类,则虚拟人物无需针对用户当前表情进行响应处理,虚拟人物继续当前的处理。
步骤S506、判断当前是否满足目标分类对应的承接触发条件。
其中,目标分类对应的承接触发条件,包括:至少连续N帧图像中用户的表情均属于目标分类,其中N为正整数,N为目标分类对应的预设值。
示例性地,N可以为5,N的值可以根据实际应用场景的需要进行设置和调整,本实施例此处不做具体限定。
本实施例中,通过设置第二预设表情分类对应的承接触发条件为至少连续N帧图像中 用户的表情均属于该第二预设表情分类,能够避免虚拟人物频繁、不必要的承接响应行为,提高虚拟人物与用户的交互的流畅性和智能化。
该步骤中,若当前满足目标分类对应的承接触发条件,执行步骤S507-S508,根据目标分类对应的承接策略,驱动虚拟人物执行对应的承接响应行为。
若当前不满足目标分类对应的承接触发条件,则虚拟人物无需针对用户当前表情进行承接响应处理,虚拟人物继续当前的处理。
另外,该步骤S506为可选步骤,在其他实施例中,在上述步骤S505中确定目标分类属于第二预设表情分类时,可以直接执行步骤S507,根据目标分类对应的承接策略,确定对应的驱动数据,并根据驱动数据驱动虚拟人物执行承接响应行为。
步骤S507、根据目标分类对应的承接策略,确定对应的驱动数据,驱动数据用于驱动虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作。
本实施例中,承接策略中可以包括以下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作。其中,不同的第二预设表情分类对应的承接策略包括的承接响应行为的种类和具体内容可以不同。
在确定目标分类属于第二预设表情分类,并且当前满足目标分类的承接触发条件时,根据目标分类对应的承接策略,确定对应的驱动数据。该驱动数据包括驱动虚拟人物执行目标分类对应的承接策略所需的所有驱动参数。
该步骤中,可以采用现有任意一种基于确定的策略生成虚拟人物的驱动数据的虚拟人物驱动方法实现,此处不再做详细地说明。
一种可选地实施方式中,该步骤还可以采用如下方式实现:获取用户当前输入的语音数据,并识别语言数据对应的用户意图信息,并确定用户意图信息对应的情感极性;根据用户意图信息对应的情感极性和承接策略,确定承接响应行为使用的特定语气和特定情绪;根据目标分类对应的承接策略和承接响应行为使用的特定语气和特定情绪,确定对应的驱动数据,驱动数据用于驱动虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作。
示例性地,可以实时地获取用户输入的语音流,在确定目标分类属于第二预设表情分类时,可以获取最近一个时段内用户输入的语音数据,将语音数据转换为对应的文本信息;识别文本信息对应的用户意图信息,并确定用户意图信息对应的情感极性。
通常,用户意图一共有6种:“表示命令”,“表示辱骂”,“表示提问”,“正向陈述”,“负向陈述”,“其它意图”。其中:“表示命令”和“正向陈述”可被归类为正向语义,也即情感极性为正向;“表示提问”和“其它意图”可被归类为中性语义,也即情感极性为中性;“表示辱骂”和“负向陈述”可被归类为负向语义,也即情感极性为负向。
其中,用户意图信息对应的情感极性包括:正向、负向和中性。识别文本信息对应的 用户意图信息,可以通过现有的基于自然语言理解(Natural Language Understanding,简称NLU)神经网络分类模型实现,例如可以采用TextRCNN,该模型平衡了效果和模型的计算开支、分类准确率较高且模型复杂度低,推理开销小。另外,该模型的可替代模型还有TextCNN、Transformer等等,本实施例此处不再赘述。
示例性地,承接策略中还可以包括用户意图信息对应的情感极性与承接响应行为中播报话术的特定语气的对应关系,用户意图信息对应的情感极性与承接响应行为中做出表情的特定情绪的对应关系,用户意图信息对应的情感极性与承接响应行为中做出动作类型的对应关系。基于用户意图信息对应的情感极性和承接策略,可以确定承接响应行为使用的特定语气和特定情绪。
这一实施方式中,通过识别用户当前语音输入的用户意图信息对应的情感极性,确定承接策略中虚拟人物做出承接响应行为时播报话术的特定语气和表情的特定情绪,使得虚拟人物能够针对用户当前的情感极性做出具有对应语气和情绪的承接,提高虚拟人物的拟人化程度,能够提高用户继续交互积极性,提高虚拟人物与用户交互的流畅度和智能化。
需要说明的是,承接策略中播报的承接话术通常设置为简短内容,如“嗯嗯”、“是”、“对对”、“嗯”、“哦哦”等,播报承接话术不会影响用户正常语音输入。
步骤S508、根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作。
其中,虚拟人物的三维形象渲染模型包括实现虚拟人物渲染所需的渲染数据,基于虚拟人物的三维形象渲染模型可以将虚拟人物的骨骼数据渲染成呈现给用户时展示的虚拟人物的三维形象。
本实施例提供的方法,可以应用于虚拟人物与人交互的场景中,利用具有三维形象的虚拟人物,实现机器与人的实时交互功能,以向人提供智能服务。
本实施例中,针对在用户输入虚拟人物接收的对话状态下,根据用户输入过程中的面部表情,驱动虚拟人物可以模拟真实人类在交互过程中对对方的表情实时地做出回应的情况,设置第二预设表情分类,并设置每一第二预设表情分类对应的承接策略,通过实时识别用户表情的目标分类,在确定目标分类属于第二预设表情分类,并确定当前满足目标分类对应的承接触发条件时,根据目标分类对应的承接策略,结合用户输入语音数据中用户意图信息的情感极性,确定对应的驱动数据,驱动虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作,能够避免影响虚拟人物与用户的正常交互,同时提高虚拟人物的拟人化程度,提高虚拟人物与用户的交互的流畅度和智能化。
图6为本申请另一实施例提供的虚拟人物驱动方法流程图。在上述任一方法实施例的基础上,虚拟人物与人的交互方案可以具有双工能力、虚拟人物具有根据用户输入语音进行主动承接的功能,以引导后续的对话流程,使得虚拟人物与用户的交互更加流畅、更加 智能。如图6所示,该方法具体步骤如下:
步骤S601、获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务。
该步骤与上述步骤S201一致,此处不再赘述。
步骤S602、在虚拟人物与用户的一轮对话中,实时获取用户输入的语音数据。
通常,在虚拟人物与人的交互过程中,虚拟人物可以与人进行多轮的对话,在每一轮对话过程中,可以实时地接收来自用户的语音流,也即用户输入的语音数据。
步骤S603、当检测到用户输入的语音数据的静默时长大于或等于第二预设时长时,若确定语音输入未结束,则将语音数据转换为对应的文本信息。
本实施例中,可以实时地对用户输入的语音流进行语音活动检测(Voice Activity Detection,简称VAD),得到用户输入的静默时长(也即VAD时间)。
当检测到用户输入的静默时长大于或等于第二预设时长时,若确定此时本轮语音输入未结束,也即是用户在语音输入过程中产生了较长时间的停顿,这种情况下进行后续的承接响应处理,使得虚拟人物做出承接响应行为,以引导后续的对话流程,使得虚拟人物与用户的交互更加流畅、更加智能。
其中,第二预设时长为一个小于静默时长阈值的较短时长,静默时长阈值为判断用户本轮输入是否结束的静默时长,当用户语音输入的静默时长达到静默时长阈值,则确定用户本轮语音输入结束。例如静默时长阈值可以为800ms,第二预设时长可以为300ms。第二预设时长可以根据实际应用场景的需要进行设置和调整,此处不做具体限定。
具体地,当检测到用户输入的语音数据的静默时长大于或等于第二预设时长时,若确定语音输入未结束,也即静默时长小于静默时长阈值,将语音数据转换为对应的文本信息,并基于文本信息进行后续的处理。
步骤S604、识别文本信息对应的用户意图信息,并确定用户意图信息对应的情感极性。
其中,用户意图信息对应的情感极性包括:正向、负向和中性。
该步骤中,可以通过现有的自然语言理解(Natural Language Understanding,简称NLU)算法实现,本实施例此处不再赘述。
步骤S605、根据用户意图信息对应的情感极性,确定对应的驱动数据。
具体地,根据用户意图信息对应的情感极性,确定对应的承接策略;根据对应的承接策略生成对应的驱动数据。
本实施例中,针对用户语音输入发生较长时间的停顿(达到预设静默时长)时用户意图信息对应的情感极性,设置不同情感极性对应的承接策略。用户意图信息的情感极性不同,对应的承接策略不同。
示例性地,根据用户输入语音进行主动承接的功能中,对用户意图信息的不同情感极性,可以设置为表3所示的承接策略。
表3
该步骤中,可以采用现有任意一种基于确定的策略生成虚拟人物的驱动数据的虚拟人物驱动方法实现,此处不再做详细地说明。
步骤S606、根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行如下至少一种承接响应行为:播报针对用户意图信息对应的情感极性配置的规定语气的承接话术、做出具有特定情绪的表情、做出规定动作。
其中,播报针对用户意图信息对应的情感极性配置的规定语气的承接话术不影响用户的语音输入。
本实施例中,承接策略中可以包括以下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作。其中,不同的第二预设表情分类对应的承接策略包括的承接响应行为的种类和具体内容可以不同。
该步骤中,可以采用现有任意一种基于确定的策略生成虚拟人物的驱动数据的虚拟人物驱动方法实现,此处不再做详细地说明。
本实施例中,通过在虚拟人物与用户的一轮对话中,实时获取用户输入的语音数据;当检测到用户输入的语音数据的静默时长大于或等于第二预设时长并且语音输入未结束时,根据语音数据识别用户意图信息及其对应的情感极性,根据当前用户意图信息对应的情感极性,确定对应的承接策略,并驱动虚拟人物根据对应的承接策略做出至少一种承接响应行为:播报针对用户意图信息对应的情感极性配置的规定语气的承接话术、做出具有特定情绪的表情、做出规定动作,能够不影响用户输入的同时提高虚拟人物的拟人化程度,提高虚拟人物与用户的交互的流畅度和智能化。
需要说明的是,在虚拟人物与用户的交互过程中,可以将上述实施例中的至少两种相结合使用,使得用户能感知到虚拟人物在视觉层面的反馈能力,获得“虚拟人物更智能、更聪明”的体感。
图7为本申请一示例性实施例提供的基于表情识别的虚拟人物驱动装置的结构示意图。本申请实施例提供的基于表情识别的虚拟人物驱动装置可以执行基于表情识别的虚拟人 物驱动方法实施例提供的处理流程。如图7所示,基于表情识别的虚拟人物驱动装置70包括:渲染模型获取模块71、实时数据获取模块72、实时表情识别模块73和决策及驱动模块74。
渲染模型获取模块71用于获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务。
实时数据获取模块72用于在虚拟人物与用户的一轮对话中,实时获取用户的人脸图像。
实时表情识别模块73用于将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果;根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类。
决策及驱动模块74用于若确定目标分类属于预设表情分类,并且当前满足目标分类的响应触发条件,则根据目标分类对应的响应策略,确定对应的驱动数据;根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。
本申请实施例提供的装置可以具体用于执行上述图2对应方法实施例所提供的方案,具体功能和所能实现的技术效果此处不再赘述。
一种可选地实施例中,第一表情分类结果包括:用户当前表情属于每一表情分类的第一置信度,第二表情分类结果包括用户当前表情属于每一表情分类的第二置信度。
在根据第一表情分类结果和第二表情分类结果,确定用户当前表情的目标分类时,实时表情识别模块还用于:根据用户当前表情属于每一表情分类的第一置信度和第二置信度,确定用户当前表情的目标分类,以及用户当前表情属于目标分类的置信度。
一种可选地实施例中,若确定目标分类属于预设表情分类,并且当前满足目标分类的响应触发条件,则根据目标分类对应的响应策略,确定对应的驱动数据时,决策及驱动模块还用于:若当前的对话状态为虚拟人物输出用户接收的状态,并且目标分类属于第一预设表情分类,则根据用户当前表情属于目标分类的置信度和当前的上下文信息,确定当前是否满足目标分类对应的打断触发条件,第一预设表情分类具有对应的打断策略;若确定当前满足目标分类对应的打断触发条件,则打断虚拟人物的当前输出,并根据目标分类对应的打断策略,确定对应的驱动数据,驱动数据用于驱动虚拟人物执行如下至少一种打断响应行为:播报针对对应表情分类的话术、做出具有规定情绪的表情、做出规定动作。
一种可选地实施例中,目标分类对应的打断触发条件,包括以下至少一项:用户当前表情属于目标分类的置信度大于或等于目标分类对应的置信度阈值;当前对话轮次与前一次触发打断的对话轮次之间的间隔的轮次数量大于或等于预设轮次数量。
一种可选地实施例中,在根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为时,决策及驱动模块还用于:若在第一预设时长内接收到用户的语音输入,并识别出用户的语音输入的语义信息,则开启下一轮对话,根据用户的语音输入的语义信息进行对话处理;若在第一预设时长内未接收到用户的语音输入,或者无法识别 出用户的语音输入的语义信息,则继续被打断的虚拟人物的当前输出。
一种可选地实施例中,若确定目标分类属于预设表情分类,并且当前满足目标分类的响应触发条件,则根据目标分类对应的响应策略,确定对应的驱动数据时,决策及驱动模块还用于:若当前的对话状态为用户输入虚拟人物接收的状态,并且目标分类属于第二预设表情分类,则根据目标分类,判断当前是否满足目标分类对应的承接触发条件,第二预设表情分类具有对应的承接策略;若确定当前满足目标分类对应的承接触发条件,则根据目标分类对应的承接策略,确定对应的驱动数据,驱动数据用于驱动虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作;其中,播报具有特定语气的承接话术不影响用户的语音输入。
一种可选地实施例中,在根据目标分类对应的承接策略,确定对应的驱动数据,驱动数据用于驱动虚拟人物执行至少一种承接响应行为时,决策及驱动模块还用于:根据用户当前输入的语音数据,识别语言数据对应的用户意图信息,并确定用户意图信息对应的情感极性;根据用户意图信息对应的情感极性和承接策略,确定承接响应行为使用的特定语气和特定情绪;根据目标分类对应的承接策略和承接响应行为使用的特定语气和特定情绪,确定对应的驱动数据,驱动数据用于驱动虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作。
一种可选地实施例中,目标分类对应的承接触发条件包括:至少连续N帧图像中用户的表情均属于目标分类,其中N为正整数,N为目标分类对应的预设值。
一种可选地实施例中,实时数据获取模块还用于:在虚拟人物与用户的一轮对话中,实时获取用户输入的语音数据。
决策及驱动模块还用于:当检测到用户输入的语音数据的静默时长大于或等于第二预设时长时,若确定语音输入未结束,则将语音数据转换为对应的文本信息;识别文本信息对应的用户意图信息,并确定用户意图信息对应的情感极性;根据用户意图信息对应的情感极性,确定对应的驱动数据;根据驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行如下至少一种承接响应行为:播报针对用户意图信息对应的情感极性配置的规定语气的承接话术、做出具有特定情绪的表情、做出规定动作。
其中,播报针对用户意图信息对应的情感极性配置的规定语气的承接话术不影响用户的语音输入。
一种可选地实施例中,在将用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过基模型确定第一表情分类结果,并通过多模态对齐模型确定第二表情分类结果之前,实时表情识别模块还用于:获取训练好的用于人脸表情识别的基模型和多模态对齐模型;对基模型和多模态对齐模型进行模型蒸馏。
本申请实施例提供的装置可以具体用于执行上述任一方法实施例所提供的方案,具体功能和所能实现的技术效果此处不再赘述。
图8为本申请一示例实施例提供的电子设备的结构示意图。如图8所示,该电子设备 80包括:处理器801,以及与处理器801通信连接的存储器802,存储器802存储计算机执行指令。
其中,处理器执行存储器存储的计算机执行指令,以实现上述任一方法实施例所提供的方案,具体功能和所能实现的技术效果此处不再赘述。
本申请实施例还提供一种计算机可读存储介质,计算机可读存储介质中存储有计算机执行指令,计算机执行指令被处理器执行时用于实现上述任一方法实施例所提供的方案,具体功能和所能实现的技术效果此处不再赘述。
本申请实施例还提供了一种计算机程序产品,计算机程序产品包括:计算机程序,计算机程序存储在可读存储介质中,电子设备的至少一个处理器可以从可读存储介质读取计算机程序,至少一个处理器执行计算机程序使得电子设备执行上述任一方法实施例所提供的方案,具体功能和所能实现的技术效果此处不再赘述。
另外,在上述实施例及附图中的描述的一些流程中,包含了按照特定顺序出现的多个操作,但是应该清楚了解,这些操作可以不按照其在本文中出现的顺序来执行或并行执行,仅仅是用于区分开各个不同的操作,序号本身不代表任何的执行顺序。另外,这些流程可以包括更多或更少的操作,并且这些操作可以按顺序执行或并行执行。需要说明的是,本文中的“第一”、“第二”等描述,是用于区分不同的消息、设备、模块等,不代表先后顺序,也不限定“第一”和“第二”是不同的类型。“多个”的含义是两个以上,除非另有明确具体的限定。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求书指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求书来限制。

Claims (13)

  1. 一种基于表情识别的虚拟人物驱动方法,其特征在于,包括:
    获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;
    在虚拟人物与用户的一轮对话中,实时获取所述用户的人脸图像;
    将所述用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过所述基模型确定第一表情分类结果,并通过所述多模态对齐模型确定第二表情分类结果;
    根据所述第一表情分类结果和所述第二表情分类结果,确定所述用户当前表情的目标分类;
    若确定所述目标分类属于预设表情分类,并且当前满足所述目标分类的响应触发条件,则根据所述目标分类对应的响应策略,确定对应的驱动数据;
    根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。
  2. 根据权利要求1所述的方法,其特征在于,所述第一表情分类结果包括:所述用户当前表情属于每一表情分类的第一置信度,所述第二表情分类结果包括所述用户当前表情属于每一表情分类的第二置信度,
    所述根据所述第一表情分类结果和所述第二表情分类结果,确定所述用户当前表情的目标分类,包括:
    根据所述用户当前表情属于每一表情分类的第一置信度和第二置信度,确定所述用户当前表情的目标分类,以及所述用户当前表情属于所述目标分类的置信度。
  3. 根据权利要求1所述的方法,其特征在于,所述若确定所述目标分类属于预设表情分类,并且当前满足所述目标分类的响应触发条件,则根据所述目标分类对应的响应策略,确定对应的驱动数据,包括:
    若当前的对话状态为虚拟人物输出用户接收的状态,并且所述目标分类属于第一预设表情分类,则根据所述用户当前表情属于所述目标分类的置信度和当前的上下文信息,确定当前是否满足所述目标分类对应的打断触发条件,所述第一预设表情分类具有对应的打断策略;
    若确定当前满足所述目标分类对应的打断触发条件,则打断所述虚拟人物的当前输出,并根据所述目标分类对应的打断策略,确定对应的驱动数据,所述驱动数据用于驱动所述虚拟人物执行如下至少一种打断响应行为:播报针对对应表情分类的话术、做出具有规定情绪的表情、做出规定动作。
  4. 根据权利要求3所述的方法,其特征在于,所述目标分类对应的打断触发条件,包括以下至少一项:
    所述用户当前表情属于所述目标分类的置信度大于或等于所述目标分类对应的置信度阈值;
    当前对话轮次与前一次触发打断的对话轮次之间的间隔的轮次数量大于或等于预设 轮次数量。
  5. 根据权利要求3所述的方法,其特征在于,所述根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为之后,还包括:
    若在第一预设时长内接收到用户的语音输入,并识别出所述用户的语音输入的语义信息,则开启下一轮对话,根据所述用户的语音输入的语义信息进行对话处理;
    若在第一预设时长内未接收到所述用户的语音输入,或者无法识别出所述用户的语音输入的语义信息,则继续被打断的所述虚拟人物的当前输出。
  6. 根据权利要求1所述的方法,其特征在于,所述若确定所述目标分类属于预设表情分类,并且当前满足所述目标分类的响应触发条件,则根据所述目标分类对应的响应策略,确定对应的驱动数据,包括:
    若当前的对话状态为用户输入虚拟人物接收的状态,并且所述目标分类属于第二预设表情分类,则根据所述目标分类,判断当前是否满足所述目标分类对应的承接触发条件,所述第二预设表情分类具有对应的承接策略;
    若确定当前满足所述目标分类对应的承接触发条件,则根据所述目标分类对应的承接策略,确定对应的驱动数据,所述驱动数据用于驱动所述虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作;
    其中,所述播报具有特定语气的承接话术不影响所述用户的语音输入。
  7. 根据权利要求6所述的方法,其特征在于,根据所述目标分类对应的承接策略,确定对应的驱动数据,所述驱动数据用于驱动所述虚拟人物执行至少一种所述承接响应行为,包括:
    根据用户当前输入的语音数据,识别所述语言数据对应的用户意图信息,并确定所述用户意图信息对应的情感极性;
    根据所述用户意图信息对应的情感极性和所述承接策略,确定承接响应行为使用的特定语气和特定情绪;
    根据所述目标分类对应的承接策略和承接响应行为使用的特定语气和特定情绪,确定对应的驱动数据,所述驱动数据用于驱动所述虚拟人物执行如下至少一种承接响应行为:播报具有特定语气的承接话术、做出具有特定情绪的表情、做出规定动作。
  8. 根据权利要求6所述的方法,其特征在于,所述目标分类对应的承接触发条件,包括:
    至少连续N帧图像中所述用户的表情均属于所述目标分类,其中N为正整数,N为所述目标分类对应的预设值。
  9. 根据权利要求1-8中任一项所述的方法,其特征在于,所述方法还包括:
    在虚拟人物与用户的一轮对话中,实时获取所述用户输入的语音数据;
    当检测到所述用户输入的语音数据的静默时长大于或等于第二预设时长时,若确定所述语音输入未结束,则将所述语音数据转换为对应的文本信息;
    识别所述文本信息对应的用户意图信息,并确定所述用户意图信息对应的情感极性;
    根据所述用户意图信息对应的情感极性,确定对应的驱动数据;
    根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行如下至少一种承接响应行为:播报针对所述用户意图信息对应的情感极性配置的规定语气的承接话术、做出具有特定情绪的表情、做出规定动作;
    其中,所述播报针对所述用户意图信息对应的情感极性配置的规定语气的承接话术不影响所述用户的语音输入。
  10. 根据权利要求1-8中任一项所述的方法,其特征在于,所述将所述用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过所述基模型确定第一表情分类结果,并通过所述多模态对齐模型确定第二表情分类结果之前,还包括:
    获取训练好的用于人脸表情识别的基模型和多模态对齐模型;
    对所述基模型和所述多模态对齐模型进行模型蒸馏。
  11. 一种基于表情识别的虚拟人物驱动装置,其特征在于,包括:
    渲染模型获取模块,用于获取虚拟人物的三维形象渲染模型,以利用虚拟人物向用户提供交互服务;
    实时数据获取模块,用于在虚拟人物与用户的一轮对话中,实时获取所述用户的人脸图像;
    实时表情识别模块,用于将所述用户的人脸图像分别输入用于人脸表情识别的基模型和多模态对齐模型,通过所述基模型确定第一表情分类结果,并通过所述多模态对齐模型确定第二表情分类结果;根据所述第一表情分类结果和所述第二表情分类结果,确定所述用户当前表情的目标分类;
    决策驱动模块,用于若确定所述目标分类属于预设表情分类,并且当前满足所述目标分类的响应触发条件,则根据所述目标分类对应的响应策略,确定对应的驱动数据;根据所述驱动数据和虚拟人物的三维形象渲染模型,驱动虚拟人物执行对应的响应行为。
  12. 一种电子设备,其特征在于,包括:处理器,以及与所述处理器通信连接的存储器;
    所述存储器存储计算机执行指令;
    所述处理器执行所述存储器存储的计算机执行指令,以实现如权利要求1-10中任一项所述的方法。
  13. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有计算机执行指令,所述计算机执行指令被处理器执行时用于实现如权利要求1-10中任一项所述的方法。
PCT/CN2023/095446 2022-05-23 2023-05-22 基于表情识别的虚拟人物驱动方法、装置及设备 WO2023226913A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210567627.X 2022-05-23
CN202210567627.XA CN114821744A (zh) 2022-05-23 2022-05-23 基于表情识别的虚拟人物驱动方法、装置及设备

Publications (1)

Publication Number Publication Date
WO2023226913A1 true WO2023226913A1 (zh) 2023-11-30

Family

ID=82516364

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095446 WO2023226913A1 (zh) 2022-05-23 2023-05-22 基于表情识别的虚拟人物驱动方法、装置及设备

Country Status (2)

Country Link
CN (1) CN114821744A (zh)
WO (1) WO2023226913A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821744A (zh) * 2022-05-23 2022-07-29 阿里巴巴(中国)有限公司 基于表情识别的虚拟人物驱动方法、装置及设备
CN115578541B (zh) * 2022-09-29 2023-07-25 北京百度网讯科技有限公司 虚拟对象驱动方法及装置、设备、系统、介质和产品
CN115356953B (zh) * 2022-10-21 2023-02-03 北京红棉小冰科技有限公司 虚拟机器人决策方法、系统和电子设备
CN116643675B (zh) * 2023-07-27 2023-10-03 苏州创捷传媒展览股份有限公司 基于ai虚拟人物的智能交互系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919251A (zh) * 2017-01-09 2017-07-04 重庆邮电大学 一种基于多模态情感识别的虚拟学习环境自然交互方法
US20190130244A1 (en) * 2017-10-30 2019-05-02 Clinc, Inc. System and method for implementing an artificially intelligent virtual assistant using machine learning
CN112162628A (zh) * 2020-09-01 2021-01-01 魔珐(上海)信息科技有限公司 基于虚拟角色的多模态交互方法、装置及系统、存储介质、终端
CN114327041A (zh) * 2021-11-26 2022-04-12 北京百度网讯科技有限公司 智能座舱的多模态交互方法、系统及具有其的智能座舱
CN114357135A (zh) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 交互方法、交互装置、电子设备以及存储介质
CN114821744A (zh) * 2022-05-23 2022-07-29 阿里巴巴(中国)有限公司 基于表情识别的虚拟人物驱动方法、装置及设备

Also Published As

Publication number Publication date
CN114821744A (zh) 2022-07-29

Similar Documents

Publication Publication Date Title
WO2023226913A1 (zh) 基于表情识别的虚拟人物驱动方法、装置及设备
US11551804B2 (en) Assisting psychological cure in automated chatting
WO2023226914A1 (zh) 基于多模态数据的虚拟人物驱动方法、系统及设备
KR101925440B1 (ko) 가상현실 기반 대화형 인공지능을 이용한 화상 대화 서비스 제공 방법
CN108000526B (zh) 用于智能机器人的对话交互方法及系统
CN109658928A (zh) 一种家庭服务机器人云端多模态对话方法、装置及系统
CN107870994A (zh) 用于智能机器人的人机交互方法及系统
CN106020488A (zh) 一种面向对话系统的人机交互方法及装置
CN106503786B (zh) 用于智能机器人的多模态交互方法和装置
CN110299152A (zh) 人机对话的输出控制方法、装置、电子设备及存储介质
CN107704612A (zh) 用于智能机器人的对话交互方法及系统
CN112651334B (zh) 机器人视频交互方法和系统
WO2023216765A1 (zh) 多模态交互方法以及装置
CN109800295A (zh) 基于情感词典和词概率分布的情感会话生成方法
Yalçın et al. Evaluating levels of emotional contagion with an embodied conversational agent
CN111339940B (zh) 视频风险识别方法及装置
WO2023226239A1 (zh) 对象情绪的分析方法、装置和电子设备
CN113782010B (zh) 机器人响应方法、装置、电子设备及存储介质
CN116009692A (zh) 虚拟人物交互策略确定方法以及装置
CN114566187B (zh) 操作包括电子装置的系统的方法、电子装置及其系统
CN115543090A (zh) 话题转移方法、装置、电子设备和存储介质
CN114461772A (zh) 数字人交互系统及其方法、装置、计算机可读存储介质
CN112632262A (zh) 一种对话方法、装置、计算机设备及存储介质
CN116843805B (zh) 一种包含行为的虚拟形象生成方法、装置、设备及介质
CN116760942B (zh) 一种全息互动远程会议方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810984

Country of ref document: EP

Kind code of ref document: A1