CN117827001A - Digital virtual person generation method based on cross-modal emotion analysis - Google Patents
Digital virtual person generation method based on cross-modal emotion analysis
- Publication number: CN117827001A
- Application number: CN202410014719.4A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a method and a system for generating a digital virtual person based on cross-modal emotion analysis. Through cross-modal emotion analysis, the method accurately understands the user's emotion and generates corresponding virtual-person expression and language output. It has broad application prospects in virtual reality, games, advertising, and related fields, and improves the emotional interaction between the user and the virtual person.
Description
Technical Field
The invention relates to a method and a system for generating a digital virtual person based on cross-modal emotion analysis, applicable to the fields of computer graphics and artificial intelligence.
Background
A digital virtual person is an artificial entity with a lifelike appearance and intelligent interaction capability, widely used in virtual reality, games, advertising, and other fields. However, conventional generation methods typically analyze user emotion from a single modality, so they often fail to capture the user's emotional and expressive needs accurately and lack emotional expressiveness when interacting with the user. In addition, when processing the user's language input, conventional methods usually rely on simple word segmentation without understanding natural-language semantics, which further degrades the accuracy of emotion capture. A novel method is therefore needed that accurately understands user emotion through cross-modal emotion analysis and generates the corresponding virtual-person expressions and language.
Disclosure of Invention
The invention provides a method and a system for generating a digital virtual person based on cross-modal emotion analysis. Using cross-modal emotion analysis, the method extracts emotion information from the text and images input by the user and generates corresponding virtual-person expression and language output, thereby interacting effectively with the user's emotion.
The invention specifically discloses a digital virtual person generation method based on cross-modal emotion analysis, which comprises the following steps:
S1, data collection and labeling: collect a multimodal dataset containing text, images, and corresponding emotion labels as the basis for model training and evaluation;
S2, text preprocessing: perform word segmentation, stop-word removal, and stemming on the user's text input to facilitate subsequent emotion feature extraction and generation;
S3, image preprocessing;
S4, cross-modal feature extraction;
S5, cross-modal emotion representation learning;
S6, virtual person generation model processing;
S7, model training, model evaluation, and tuning;
S8, digital virtual person generation application: deploy the trained model into practical applications such as virtual reality environments and game characters. According to the text and image information input by the user, the model generates virtual-person expression and language output matching the expressed emotion. Application development uses a virtual-person modeling and rendering engine such as Unity.
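The S1–S8 pipeline above can be sketched as a chain of stages. This is a minimal illustration only; every function name and return value is hypothetical and not taken from the patent:

```python
# Hypothetical skeleton of the S1-S8 pipeline; each stage is a stub.

def collect_and_label():              # S1: multimodal dataset with emotion labels
    return [{"text": "I love this!", "image": [[0.9]], "label": "positive"}]

def preprocess_text(text):            # S2: segmentation, stop-word removal, stemming
    return text.lower().split()

def preprocess_image(img):            # S3: resize / crop / normalize
    return img

def extract_features(tokens, img):    # S4: text and image emotion features
    return {"text_feat": len(tokens), "img_feat": img[0][0]}

def fuse(features):                   # S5: shared cross-modal emotion representation
    return (features["text_feat"], features["img_feat"])

def generate_avatar(representation):  # S6: expression and language output
    return {"expression": "smile", "utterance": "Glad to hear that!"}

def run_pipeline():                   # S7/S8: a trained, deployed model serves this call
    sample = collect_and_label()[0]
    tokens = preprocess_text(sample["text"])
    img = preprocess_image(sample["image"])
    rep = fuse(extract_features(tokens, img))
    return generate_avatar(rep)

print(run_pipeline()["expression"])
```

In a real deployment the stubs would be replaced by the trained models described below, with the last stage driving a rendering engine such as Unity.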
In a preferred scheme, the image preprocessing performs resizing, cropping, and normalization on the user's image input to suit the requirements of subsequent feature extraction and generation.
In a preferred scheme, the cross-modal feature extraction uses deep learning to extract emotion-related features from text and images separately; for example, emotion vocabulary and syntactic-structure features from text, and facial-expression and color features from images. Features are extracted with BERT and a convolutional neural network.
In a preferred scheme, a shared cross-modal emotion representation space maps and fuses the emotion information of text and images. Emotion representation learning uses a Transformer with an attention mechanism.
In a preferred embodiment, the virtual person generation model takes the cross-modal emotion representation as input and generates the corresponding virtual-person expression and language output. Generation may use a generative adversarial network, and the model may be optimized with conditional generation and multimodal fusion.
In a preferred embodiment, the model training uses the annotated multimodal dataset to train the virtual person generation model, optimizing model parameters by minimizing the generation error. Training uses cross-validation.
In a preferred scheme, the model evaluation and tuning evaluates the trained model on a test set and adjusts the model's hyperparameters and structure to improve the accuracy and fidelity of virtual person generation. Generation quality is assessed with perceptual evaluation and user surveys.
The invention realizes a digital virtual person generation method based on cross-modal emotion analysis; through effective emotion representation learning and generation processing, the virtual person can interact emotionally with the user. The method has broad application prospects in virtual reality, games, advertising, and related fields, and improves user experience and emotional interaction.
Drawings
FIG. 1 is a schematic diagram of an embodiment of the present invention;
FIG. 2 is a flow chart of the method;
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a method and a system for generating a digital virtual person based on cross-modal emotion analysis. Using cross-modal emotion analysis, the method extracts emotion information from the text and images input by the user and generates corresponding virtual-person expressions and language output, thereby interacting effectively with the user's emotion.
The implementation steps are as follows:
data collection and labeling: a multimodal dataset containing text, images and corresponding emotion tags is collected as a basis for model training and assessment.
Text preprocessing: perform word segmentation, stop-word removal, stemming, and similar preprocessing on the user's text input to facilitate subsequent emotion feature extraction and generation.
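For English text, this step can be sketched as below. The stop-word set and the suffix-stripping rules are toy illustrations (a real system would use a proper stemmer such as Porter's and a language-appropriate segmenter):

```python
# Illustrative text preprocessing: tokenization, stop-word removal,
# and crude suffix stemming. Word lists are toy examples.
import re

STOP_WORDS = {"the", "is", "a", "an", "this", "so"}

def stem(word):
    # Naive stemmer: strip one common English suffix, as a stand-in
    # for a real stemming algorithm.
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess_text(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess_text("This movie is so amazingly touching"))
# → ['movie', 'amazing', 'touch']
```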
Image preprocessing: perform resizing, cropping, normalization, and similar preprocessing on the user's image input to suit the requirements of subsequent feature extraction and generation.
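These three operations can be sketched on a toy grayscale image stored as a nested list (a real pipeline would use an image library; this only illustrates the transformations):

```python
# Illustrative image preprocessing: center crop, nearest-neighbor
# resize, and scaling of pixel values from [0, 255] to [0, 1].

def center_crop(img, size):
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

def resize(img, size):
    # Nearest-neighbor sampling to a size x size grid.
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def normalize(img):
    return [[px / 255.0 for px in row] for row in img]

image = [[100, 200, 50, 0],
         [25, 255, 75, 10],
         [0, 125, 225, 30],
         [60, 90, 110, 5]]
out = normalize(resize(center_crop(image, 4), 2))
print(out)
```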
Cross-modal feature extraction: use deep learning to extract emotion-related features from the text and the image separately. For example, extract emotion vocabulary and syntactic-structure features from text, and facial-expression and color features from images. Models such as BERT and convolutional neural networks are used for feature extraction.
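The patent names BERT and a CNN for this step. As a lightweight stand-in, the sketch below scores text against a tiny emotion lexicon and summarizes an image by its mean intensity, a crude proxy for color/brightness features; the lexicon entries are illustrative, not from the patent:

```python
# Illustrative per-modality emotion features: lexicon polarity for text,
# mean pixel intensity for a normalized grayscale image.

EMOTION_LEXICON = {"love": 1.0, "great": 0.8, "sad": -0.9, "hate": -1.0}

def text_features(tokens):
    # Average lexicon polarity over the tokens (0.0 if none match).
    scores = [EMOTION_LEXICON[t] for t in tokens if t in EMOTION_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def image_features(img):
    # Mean pixel intensity of a normalized grayscale image.
    pixels = [px for row in img for px in row]
    return sum(pixels) / len(pixels)

print(text_features(["i", "love", "it"]))        # → 1.0
print(image_features([[0.2, 0.4], [0.6, 0.8]]))  # → 0.5
```

In the patented method these scalars would be replaced by dense feature vectors produced by the pretrained encoders.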
Cross-modal emotion representation learning: combine the text and image features, learn a shared cross-modal emotion representation space with deep learning, and map and fuse the emotion information of text and images. Models such as Transformers with attention mechanisms are used for representation learning.
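The patent uses a Transformer for this fusion; the sketch below shows only the softmax-weighted combination at the heart of the attention mechanism, applied to two hypothetical modality feature vectors:

```python
# Illustrative attention-style fusion of a text feature vector and an
# image feature vector into one shared emotion representation.
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(text_vec, image_vec):
    # Score each modality by the magnitude of its features, then take a
    # convex combination of the two vectors with softmax weights.
    scores = [sum(abs(v) for v in text_vec), sum(abs(v) for v in image_vec)]
    w_text, w_img = softmax(scores)
    return [w_text * t + w_img * i for t, i in zip(text_vec, image_vec)]

fused = fuse([0.9, -0.1], [0.3, 0.2])
print(fused)
```

The design choice shown is that the fused vector stays in the same space as its inputs, so a downstream generator can consume either modality alone or the fusion unchanged.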
Virtual person generation model processing: use a neural network model that takes the cross-modal emotion representation as input and generates the corresponding virtual-person expression and language output. Generation may use a model such as a generative adversarial network (GAN), and the generation model may be optimized with methods such as conditional generation and multimodal fusion.
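A full conditional GAN is beyond a short sketch, but the conditioning pattern of its generator can be shown: an emotion label is embedded, concatenated with a noise vector, and mapped through a linear layer to a toy expression-parameter vector. Weights are random and untrained; the structure, not the output quality, is the point:

```python
# Illustrative conditional-generator structure (untrained toy weights).
import random

random.seed(0)
EMOTIONS = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}
NOISE_DIM, OUT_DIM = 3, 4
IN_DIM = NOISE_DIM + 2  # noise plus 2-d emotion embedding
W = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]

def generator(emotion, noise):
    # Condition by concatenating the emotion embedding onto the noise,
    # then apply one linear layer.
    x = noise + EMOTIONS[emotion]
    return [sum(w * v for w, v in zip(row, x)) for row in W]

params = generator("positive", [random.gauss(0, 1) for _ in range(NOISE_DIM)])
print(len(params))  # → 4
```

In the GAN setting, a discriminator receiving the same emotion condition would push these outputs toward realistic, emotion-consistent expression parameters.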
Model training: train the virtual person generation model on the annotated multimodal dataset, optimizing model parameters by minimizing the generation error. Methods such as cross-validation are used for training.
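The two ingredients named here, minimizing generation error and cross-validation, can be sketched with a toy one-parameter model standing in for the patent's multimodal generator:

```python
# Illustrative k-fold cross-validation around a toy model y = w * x
# trained by gradient descent on mean squared "generation error".

def k_fold(data, k):
    # Yield (train, validation) splits.
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, val

def train_model(pairs, lr=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w

data = [(x, 2.0 * x) for x in [1, 2, 3, 4, 5, 6]]  # true weight is 2.0
results = []
for train, val in k_fold(data, 3):
    w = train_model(train)
    err = sum((w * x - y) ** 2 for x, y in val) / len(val)
    results.append((w, err))
print(results)
```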
Model evaluation and tuning: evaluate the trained model on the test set and adjust its hyperparameters and structure to improve the accuracy and fidelity of virtual person generation. Methods such as perceptual evaluation and user surveys are used to assess generation quality.
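An objective score such as held-out label accuracy would complement the perceptual evaluation and user surveys mentioned here; the labels and predictions below are made-up examples:

```python
# Illustrative held-out evaluation: accuracy of predicted emotion
# labels on a test set.

def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

test_labels = ["positive", "negative", "positive", "neutral"]
model_preds = ["positive", "negative", "negative", "neutral"]
print(accuracy(model_preds, test_labels))  # → 0.75
```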
Digital virtual person generation application: deploy the trained model into practical applications such as virtual reality environments and game characters. According to the text and image information input by the user, the model generates virtual-person expression and language output matching the expressed emotion. Application development uses a virtual-person modeling and rendering engine such as Unity.
The invention realizes a digital virtual person generation method based on cross-modal emotion analysis; through effective emotion representation learning and generation processing, the virtual person can interact emotionally with the user. The method has broad application prospects in virtual reality, games, advertising, and related fields, and improves user experience and emotional interaction.
Claims (7)
1. A digital virtual person generation method based on cross-modal emotion analysis, characterized by comprising the following steps:
S1, data collection and labeling: collecting a multimodal dataset containing text, images, and corresponding emotion labels as the basis for model training and evaluation;
S2, text preprocessing: performing word segmentation, stop-word removal, and stemming on the user's text input to facilitate subsequent emotion feature extraction and generation;
S3, image preprocessing;
S4, cross-modal feature extraction;
S5, cross-modal emotion representation learning;
S6, virtual person generation model processing;
S7, model training, model evaluation, and tuning;
S8, digital virtual person generation application: deploying the trained model into practical applications such as virtual reality environments and game characters, wherein, according to the text and image information input by the user, the model generates virtual-person expression and language output matching the expressed emotion, and application development uses a virtual-person modeling and rendering engine such as Unity.
2. The digital virtual person generation method based on cross-modal emotion analysis according to claim 1, wherein the image preprocessing performs resizing, cropping, and normalization on the user's image input to suit the requirements of subsequent feature extraction and generation.
3. The digital virtual person generation method based on cross-modal emotion analysis according to claim 1, wherein the cross-modal feature extraction uses deep learning to extract emotion-related features from text and images separately, for example emotion vocabulary and syntactic-structure features from text, and facial-expression and color features from images, the features being extracted with BERT and a convolutional neural network.
4. The digital virtual person generation method based on cross-modal emotion analysis according to claim 1, wherein the cross-modal emotion representation learning combines the text and image features, learns a shared cross-modal emotion representation space with deep learning, and maps and fuses the emotion information of text and images, the representation learning being performed with a Transformer and an attention mechanism.
5. The digital virtual person generation method based on cross-modal emotion analysis according to claim 1, wherein the virtual person generation model takes the cross-modal emotion representation as input and generates the corresponding virtual-person expression and language output, the generation optionally using a generative adversarial network and the model optionally being optimized with conditional generation and multimodal fusion.
6. The digital virtual person generation method based on cross-modal emotion analysis according to claim 1, wherein the model training uses the annotated multimodal dataset to train the virtual person generation model, optimizing model parameters by minimizing the generation error, with cross-validation used during training.
7. The digital virtual person generation method based on cross-modal emotion analysis according to claim 1, wherein the model evaluation and tuning evaluates the trained model on a test set and adjusts the model's hyperparameters and structure to improve the accuracy and fidelity of virtual person generation, generation quality being assessed with perceptual evaluation and user surveys.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410014719.4A (CN117827001A) | 2024-01-04 | 2024-01-04 | Digital virtual person generation method based on cross-modal emotion analysis |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117827001A | 2024-04-05 |
Family
- ID=90513092
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410014719.4A (pending) | Digital virtual person generation method based on cross-modal emotion analysis | 2024-01-04 | 2024-01-04 |
Legal Events
| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |