CN117995160A - Server, display equipment and digital person generation method - Google Patents

Server, display equipment and digital person generation method

Info

Publication number
CN117995160A
Authority
CN
China
Prior art keywords
emotion
style
coefficient sequence
mouth shape
driving model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311801123.0A
Other languages
Chinese (zh)
Inventor
刘韶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202311801123.0A
Publication of CN117995160A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion
    • G10L21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Quality & Reliability (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Some embodiments of the application provide a server, a display device and a digital person generation method, the method comprising: acquiring a set mouth style and emotion style, and determining a broadcast text and a reply emotion; inputting the broadcast text or broadcast voice into a general mouth shape driving model to obtain a style-free mouth shape coefficient sequence; inputting the mouth style and the style-free mouth shape coefficient sequence into a style mouth shape driving model to obtain a style mouth shape coefficient sequence; inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence; and generating a digital person coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence. By splitting the general mouth shape driving model, the style mouth shape driving model and the style emotion driving model into separate models trained in stages, the embodiments of the application separate mouth shape driving from emotion driving, allow mouth styles and emotion styles of different styles to be combined, and make the emotional expression of the digital person more natural.

Description

Server, display equipment and digital person generation method
Technical Field
The present application relates to the field of digital human interaction technologies, and in particular, to a server, a display device, and a digital human generation method.
Background
With the explosive growth of the metaverse and digital human concepts, 3D digital humans are being applied more and more widely, in scenarios such as 3D movies, 3D games, AR (Augmented Reality)/VR (Virtual Reality), virtual offices and virtual social networking. However, existing 3D digital humans mostly remain at the level of plain broadcasting: they lack an emotional component and a method for controlling emotion. Even the few 3D digital humans that do show some emotion rely on facial motion capture equipment, which makes them impossible to popularize and seriously limits the application range of 3D digital humans.
Disclosure of Invention
According to the server, the display device and the digital person generation method provided by the application, the general mouth shape driving model, the style mouth shape driving model and the style emotion driving model are trained in stages, so that mouth shape driving and emotion driving are separated; in application, mouth styles and emotion styles of different styles can be combined, making the digital person's emotional expression and style transfer more natural.
In a first aspect, some embodiments of the present application provide a server configured to:
acquiring a mouth style and an emotion style set by a user, and determining a broadcast text and a reply emotion;
inputting the broadcast text or a broadcast voice into a general mouth shape driving model to obtain a style-free mouth shape coefficient sequence, wherein the broadcast voice is synthesized based on the broadcast text, and the general mouth shape driving model is trained on multi-person speaking data;
inputting the mouth style and the style-free mouth shape coefficient sequence into a style mouth shape driving model to obtain a style mouth shape coefficient sequence, wherein the style mouth shape driving model is trained by fusing the output of the general mouth shape driving model with mouth style labels;
inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence, wherein the style emotion driving model is trained by fusing the output of the general mouth shape driving model with emotion style labels and emotion labels;
and generating a digital person coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence.
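As an illustration of the five steps above, a minimal Python sketch of the staged inference flow. The model interfaces, argument types and the final additive combination of the two coefficient sequences are assumptions for illustration only, not the application's actual API.

```python
import numpy as np

def generate_digital_person(broadcast_text: str, mouth_style: str, emotion_style: str,
                            reply_emotion: np.ndarray, general_model, style_mouth_model,
                            style_emotion_model) -> np.ndarray:
    """Staged inference sketch mirroring the five steps above; all interfaces are assumed."""
    # Stage 1: style-free mouth shape coefficient sequence (frames x blendshape coefficients).
    style_free_seq = general_model(broadcast_text)

    # Stage 2: style mouth shape coefficient sequence for the chosen mouth style.
    style_mouth_seq = style_mouth_model(mouth_style, style_free_seq)

    # Stage 3: style emotion coefficient sequence for the chosen emotion style and reply emotion.
    style_emotion_seq = style_emotion_model(emotion_style, reply_emotion, style_free_seq)

    # Combine the two sequences; element-wise addition is assumed here, following the
    # description of the third training stage later in the text.
    return style_mouth_seq + style_emotion_seq
```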
In some embodiments, the reply emotion includes an emotion type and an emotion intensity, the emotion type includes basic emotions and compound emotions, and a compound emotion is a combination of several basic emotions. When inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain the style emotion coefficient sequence, the server is further configured to:
input the emotion style, the emotion type, the emotion intensity and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain the style emotion coefficient sequence.
In some embodiments, the style emotion driving model is trained based on emotion intensity data. When generating the emotion intensity data, the server is further configured to:
obtain a residual emotion sequence at the basic emotion target intensity according to basic-emotion target-intensity coefficient sequences of multiple sentences and the emotion-free coefficient sequence;
determine residual emotion sequences at non-target intensities of the basic emotion according to the residual emotion sequence at the target intensity;
determine basic-emotion non-target-intensity coefficient sequences according to the residual emotion sequences at the non-target intensities and the emotion-free coefficient sequence;
calculate a residual emotion sequence of a compound emotion according to the residual emotion sequences of at least two basic emotions at specific intensities;
and determine a compound emotion coefficient sequence according to the residual emotion sequence of the compound emotion and the emotion-free coefficient sequence.
In some embodiments, when inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain the style emotion coefficient sequence, the server is further configured to:
determine an emotion vector sequence for the image frames according to the reply emotion and the number of image frames, wherein at least one group of component values in the emotion vector sequence shows an increasing or decreasing trend, each component of an emotion vector represents a basic emotion, the value of a component represents the intensity of the corresponding basic emotion, and the number of image frames is determined based on the length of the broadcast text or the broadcast voice;
and input the emotion style, the emotion vector sequence of the image frames and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain the style emotion coefficient sequence.
In some embodiments, the server is configured to:
if the mouth style and the emotion style set by the user are empty, acquire a random emotion style;
input the random emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain a style emotion coefficient sequence;
and generate a digital person coefficient sequence based on the style-free mouth shape coefficient sequence and the style emotion coefficient sequence.
In some embodiments, when determining the broadcast text, the server is further configured to:
receive the voice data input by the user and sent by the display device;
and determine the broadcast text according to the voice data.
In some embodiments, when determining the reply emotion, the server is further configured to:
determine the reply emotion based on the voice data or the voice text corresponding to the voice data; or
receive a face image or physiological signal of the user acquired by the display device;
and determine the reply emotion from the face image or the physiological signal.
In some embodiments, when generating the digital person coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence, the server is further configured to:
acquire pre-stored user image data;
map the user image data to a three-dimensional space to obtain an image coefficient sequence;
and generate the digital person coefficient sequence based on the image coefficient sequence, the style mouth shape coefficient sequence and the style emotion coefficient sequence.
In a second aspect, some embodiments of the present application provide a display apparatus, including:
A display configured to display a user interface;
A communicator configured to communicate data with the server;
A controller configured to:
Receiving voice data input by a user;
transmitting the voice data to a server through the communicator;
receiving digital person image data and a broadcast voice issued by the server based on the voice data;
and playing the broadcast voice and displaying the digital person image based on the digital person image data.
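For illustration, a minimal sketch of the controller flow in the second aspect. The helper objects (communicator, display, audio) and message field names are assumptions; the application does not specify a client-side API.

```python
# Hypothetical client-side flow for the display device controller described above.
# All object and field names ("digital_person_frames", "broadcast_audio") are illustrative.

def handle_user_voice(voice_data: bytes, communicator, display, audio) -> None:
    # 1. Forward the captured voice data to the server.
    communicator.send({"type": "voice", "payload": voice_data})

    # 2. Wait for the server's reply: digital person image data plus the broadcast voice.
    reply = communicator.receive()
    frames = reply["digital_person_frames"]   # rendered digital person images
    speech = reply["broadcast_audio"]         # synthesized broadcast voice

    # 3. Play the broadcast voice while displaying the digital person images.
    audio.play(speech)
    for frame in frames:
        display.show(frame)
```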
In a third aspect, some embodiments of the present application provide a digital person generating method, including:
acquiring a mouth style and an emotion style set by a user, and determining a broadcast text and a reply emotion;
inputting the broadcast text or a broadcast voice into a general mouth shape driving model to obtain a style-free mouth shape coefficient sequence, wherein the broadcast voice is synthesized based on the broadcast text, and the general mouth shape driving model is trained on multi-person speaking data;
inputting the mouth style and the style-free mouth shape coefficient sequence into a style mouth shape driving model to obtain a style mouth shape coefficient sequence, wherein the style mouth shape driving model is trained by fusing the output of the general mouth shape driving model with mouth style labels;
inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence, wherein the style emotion driving model is trained by fusing the output of the general mouth shape driving model with emotion style labels and emotion labels;
and generating a digital person coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence.
Some embodiments of the application provide a server, a display device and a digital person generation method. A mouth style and an emotion style set by a user are acquired, and a broadcast text and a reply emotion are determined. The broadcast text or a broadcast voice is input into a general mouth shape driving model to obtain a style-free mouth shape coefficient sequence, wherein the broadcast voice is synthesized based on the broadcast text and the general mouth shape driving model is trained on multi-person speaking data. The mouth style and the style-free mouth shape coefficient sequence are input into a style mouth shape driving model to obtain a style mouth shape coefficient sequence, wherein the style mouth shape driving model is trained by fusing the output of the general mouth shape driving model with mouth style labels. The emotion style, the reply emotion and the style-free mouth shape coefficient sequence are input into a style emotion driving model to obtain a style emotion coefficient sequence, wherein the style emotion driving model is trained by fusing the output of the general mouth shape driving model with emotion style labels and emotion labels. A digital person coefficient sequence is generated based on the style mouth shape coefficient sequence and the style emotion coefficient sequence. By training the general mouth shape driving model, the style mouth shape driving model and the style emotion driving model as separate models in stages, the embodiments of the application separate mouth shape driving from emotion driving, allow mouth styles and emotion styles of different styles to be combined in application, and make the digital person's emotional expression and style transfer more natural.
Drawings
FIG. 1 illustrates an operational scenario between a display device and a control apparatus according to some embodiments;
FIG. 2 illustrates a hardware configuration block diagram of a control device according to some embodiments;
FIG. 3 illustrates a hardware configuration block diagram of a display device according to some embodiments;
FIG. 4 illustrates a software configuration diagram in a display device according to some embodiments;
FIG. 5 illustrates a flow chart of a digital person generation method provided in accordance with some embodiments;
FIG. 6 illustrates a schematic diagram of a digital person entry interface provided in accordance with some embodiments;
FIG. 7 illustrates a schematic diagram of a digital person selection interface provided in accordance with some embodiments;
FIG. 8 illustrates a schematic diagram of a style setup interface provided in accordance with some embodiments;
FIG. 9 illustrates a schematic diagram of an overall style setup interface provided in accordance with some embodiments;
FIG. 10 illustrates a schematic diagram of a combination style setup interface provided in accordance with some embodiments;
FIG. 11 illustrates a schematic diagram of one overall emotion-driven model training provided in accordance with some embodiments;
FIG. 12 illustrates a schematic diagram of one emotion tag arrangement provided in accordance with some embodiments;
FIG. 13 illustrates a schematic diagram of another emotion tag arrangement provided in accordance with some embodiments;
FIG. 14 illustrates a schematic diagram of one image frame corresponding emotion provided in accordance with some embodiments;
FIG. 15 illustrates a digital person generation scheme block diagram provided in accordance with some embodiments;
FIG. 16 illustrates another digital person generation scheme block diagram provided in accordance with some embodiments;
Fig. 17 illustrates yet another digital person generation scheme block diagram provided in accordance with some embodiments.
Detailed Description
For the purposes of making the objects and embodiments of the present application more apparent, an exemplary embodiment of the present application will be described in detail below with reference to the accompanying drawings in which exemplary embodiments of the present application are illustrated, it being apparent that the exemplary embodiments described are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first and second and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The display device provided by the embodiment of the application can have various implementation forms, for example, a television, an intelligent television, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), an electronic desktop (electronic table) and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, and communication between the remote controller and the display device includes infrared protocol communication, Bluetooth protocol communication and other short-range communication modes; the display device 200 is controlled wirelessly or by wire. The user may control the display device 200 by inputting user instructions through keys on the remote control, voice input, control panel input, etc.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device may also be controlled without the smart device or control apparatus described above; for example, it may receive the user's control directly through touch, gestures or the like.
In some embodiments, the display device 200 may also be controlled in manners other than through the control apparatus 100 and the smart device 300; for example, the user's voice commands may be received directly through a module for acquiring voice commands configured inside the display device 200, or through a voice control device configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may communicate via a Local Area Network (LAN), a Wireless Local Area Network (WLAN) or other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction of a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and function as an interaction between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, a first interface for input/output to an nth interface.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is configured to receive image signals output by the controller and display video content, image content, menu manipulation interfaces and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The display 260 may further include a touch screen for receiving control instructions input through actions such as sliding or clicking by the user's finger on the touch screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
A user interface, which may be used to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
The detector 230 is used to collect signals of the external environment or of interaction with the outside. For example, the detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; or the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, user attributes or user interaction gestures; or the detector 230 includes a sound collector, such as a microphone, for receiving external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, etc. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
The modem 210 receives broadcast television signals in a wired or wireless manner and demodulates audio/video signals as well as EPG data signals from among the received broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the controller includes at least one of a central processing unit (Central Processing Unit, CPU), a video processor, an audio processor, a graphics processor (Graphics Processing Unit, GPU), RAM (Random Access Memory, RAM), ROM (Read-Only Memory, ROM), a first interface to an nth interface for input/output, a communication Bus (Bus), and the like.
The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Or the user may input the user command by inputting a specific sound or gesture, the user input interface recognizes the sound or gesture through the sensor, and receives the user input command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user, which enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of a user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a graphically displayed user interface that is related to computer operations. It may be an interface component such as an icon, window, control, etc., displayed in a display screen of the electronic device, where the control may include visual interface components such as icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, widgets, etc.
As shown in fig. 4, the system of the display device is divided into three layers, an application layer, a middleware layer, and a hardware layer, from top to bottom.
The application layer mainly includes the common applications on the television and an application framework (Application Framework). The common applications are mainly applications developed based on a browser, such as HTML5 apps, and native applications (Native Apps).
The application framework (Application Framework) is a complete program model with all the basic functions required by standard application software, such as file access and data exchange, and the interfaces for using these functions (toolbar, status bar, menu, dialog box).
The native application (NATIVE APPS) may support online or offline, message pushing, or local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises a HAL interface, hardware and a driver, wherein the HAL interface is a unified interface for all the television chips to be docked, and specific logic is realized by each chip. The driving mainly comprises: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), and power supply drive, etc.
With the explosive growth of the metaverse and digital human concepts, 3D digital humans are being applied more and more widely, in scenarios such as 3D movies, 3D games, AR/VR, virtual offices and virtual social networking. However, existing 3D digital humans mostly remain at the level of plain broadcasting: they lack an emotional component and a method for controlling emotion. Even the few 3D digital humans that do show some emotion rely on facial motion capture equipment, which makes them impossible to popularize and seriously limits the application range of 3D digital humans.
In order to solve the above technical problems, an embodiment of the present application provides a server 400. As shown in fig. 5, the server 400 performs the steps of:
step S501: acquiring a mouth style and an emotion style set by a user, and determining a broadcasting text and a reply emotion;
The user can set the mouth style and emotion style of the digital person through the style setting interface of the display device or a terminal. The style setting interface includes a no-style control, an overall style control and a combined style control. If an instruction of the user selecting the no-style control is received, the style flag bit is set to a first preset value, for example 0, indicating that no specific or fixed style is set for the selected digital person. If an instruction of the user selecting the overall style control is received, the style flag bit is set to a second preset value, for example 1, indicating that the same mouth style and emotion style are set for the selected digital person. If an instruction of the user selecting the combined style control is received, the style flag bit is set to a third preset value, for example 2, indicating that different mouth styles and emotion styles are set for the selected digital person.
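For illustration, a minimal sketch of how such a style flag bit might be represented; the enum and member names are assumptions, only the 0/1/2 semantics come from the description above.

```python
from enum import IntEnum

class StyleFlag(IntEnum):
    """Hypothetical encoding of the style flag bit described above."""
    NO_STYLE = 0        # no specific or fixed style for the selected digital person
    OVERALL_STYLE = 1   # the same mouth style and emotion style
    COMBINED_STYLE = 2  # separately chosen mouth style and emotion style

# Example: the user picks the combined style control in the UI.
flag = StyleFlag.COMBINED_STYLE
```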
The mouth style includes the mouth style of a specific person or a specific type of person speaking, for example the actor Xiao A, or an exaggerated mouth style. The emotion style includes the emotion style of a specific person or a specific type of person speaking, for example the actor Xiao A, or a mature and steady emotion style.
Illustratively, when the display device 200 displays a user interface, it receives an instruction input by the user to select the control corresponding to the digital person application, where the user interface includes controls corresponding to the applications installed on the display device 200;
in response to the instruction to select the control corresponding to the digital person application, a digital person entry interface as shown in FIG. 6 is displayed. The digital person entry interface includes a voice digital person control 61, a natural dialog control 62, a wake-word-free control 63 and a focus 64.
Upon receiving a user input instruction to select the voice digital person control 61, the display device 200 displays a digital person selection interface. As shown in FIG. 7, the digital person selection interface includes a default character control 71, a Dingding character control 72, a Bottle character control 73, an add control 74 and a focus 75. The user may select the desired digital person to respond to voice commands by moving the focus 75.
When the focus 75 indicates the default character control 71 and an instruction that the user presses the menu key of the control apparatus 100 is received, the display device 200 displays a style setting interface, as shown in FIG. 8. The style setting interface includes a no-style control 81, an overall style control 82 and a combined style control 83. Upon receiving a user instruction to select the overall style control 82, an overall style setting interface as shown in FIG. 9 is displayed. The overall style setting interface includes a mature-and-steady control 91, a young-and-playful control 92, a gentle-and-considerate control 93 and a Xiao A control 94. Upon receiving a user instruction to select the combined style control 83, a combined style setting interface as shown in FIG. 10 is displayed. The combined style setting interface includes a mouth style area 101 and an emotion style area 102. The mouth style area 101 includes an exaggerated control 1011, a rich control 1012, a Xiao A control 1013 and a Xiao B control 1014. The emotion style area 102 includes a mature-and-steady control 1021, a young-and-playful control 1022, a Xiao A control 1023 and a Xiao B control 1024.
It should be noted that a control is a visual object displayed in a display area of the user interface to represent corresponding content such as an icon, thumbnail, video clip or link, and may provide the user with various conventional program contents received through data broadcasting as well as various application and service contents set by the content provider.
The presentation form of a control is typically diverse. For example, a control may include text content and/or an image for displaying a thumbnail related to the text content, or a video clip related to the text. As another example, a control may be the text and/or icon of an application.
The focus is used to indicate that one of the controls has been selected. On the one hand, controls may be selected or controlled by moving the focus object displayed in the display device according to the user's input through the control apparatus 100; for example, the user may select and control controls by moving the focus object between controls with the direction keys on the control apparatus 100. On the other hand, the movement of the controls displayed in the display device may be controlled so that the focus object selects or controls a control; for example, the user may use the direction keys on the control apparatus 100 to move the controls left and right as a whole, so that the focus object can select and control a control while the focus position remains unchanged.
The form in which the focus is identified is typically varied. For example, the position of the focus object may be indicated by magnifying the focused control, by setting a background color for it, or by changing the border line, size, color, transparency, outline and/or the text or image font of the focused control.
The step of determining the broadcast text comprises the following steps:
receiving the voice data input by the user and sent by the display device 200;
After the digital human interactive program is started, the display device 200 receives voice data input by a user.
In some embodiments, the step of initiating a digital human interactive program comprises:
When the digital person entry interface as shown in FIG. 6 is displayed, in response to a user instruction to select the natural dialog control 62, the digital person interaction program is started, and the device waits for the user to input voice data through the control apparatus 100, or controls the sound collector to start collecting the user's voice data. The natural dialog includes a chit-chat mode, i.e., the user can chat casually with the digital person.
In some embodiments, the step of initiating a digital human interactive program comprises:
receiving environmental voice data collected by the sound collector;
when it is detected that the environmental voice data is greater than or equal to a preset volume, judging whether the environmental voice data includes a wake-up word corresponding to the digital person;
if the environmental voice data includes the wake-up word corresponding to the digital person, starting the digital person interaction program and controlling the sound collector to start collecting the user's voice data.
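A minimal sketch of the wake-up check above. The loudness threshold, the use of recognized text for keyword spotting and the wake word itself are assumptions; the application only specifies the volume comparison followed by a wake-word check.

```python
import numpy as np

PRESET_VOLUME = 0.1  # assumed normalized RMS threshold; the actual preset volume is not specified

def should_wake(ambient_audio: np.ndarray, recognized_text: str, wake_word: str) -> bool:
    """Return True if the digital person interaction program should start.

    ambient_audio: mono samples in [-1, 1]; recognized_text: ASR output for the clip (assumed).
    """
    loud_enough = float(np.sqrt(np.mean(ambient_audio ** 2))) >= PRESET_VOLUME
    return loud_enough and (wake_word in recognized_text)

# Example: start voice collection only if the clip is loud enough and contains the wake word.
if should_wake(np.array([0.2, -0.3, 0.25]), "hi assistant what's the weather", "assistant"):
    pass  # start the interaction program and begin collecting the user's voice data
```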
Upon receiving the voice data input by the user, the display device 200 transmits the voice data, the digital person identifier, the style flag bit, and the style type corresponding to the style flag bit to the server 400.
And determining a broadcasting text according to the voice data.
After receiving the voice data, the semantic service of the server 400 recognizes the voice text corresponding to the voice data using voice recognition technology, and then performs semantic understanding, service distribution, vertical-domain analysis, text generation and other processing on the text content to obtain the broadcast text.
In some embodiments, the reply emotion is determined based on the voice data.
The emotional state of the speaker is identified by analyzing the tone, audio features, voice content and the like of the voice. For example, by analyzing characteristics such as pitch, volume and speaking speed, it is possible to determine whether the speaker is angry, happy, sad, neutral, etc. The reply emotion is then determined according to the user's emotion.
In some embodiments, the reply emotion is determined based on the voice text corresponding to the voice data.
The emotional state of the text author is identified by analyzing information such as vocabulary, grammar and semantics in the text. For example, by analyzing emotion vocabulary, emotion intensity, emotion polarity and the like in the text, it can be determined whether the text author is positive, negative, neutral, etc. The reply emotion is then determined according to the user's emotion.
In some embodiments, a face image or physiological signal of the user collected by the display device is received, and the reply emotion is determined from the face image or the physiological signal.
The emotional state of a person is identified by analyzing facial expression features in the face image or video captured by the display device 200. For example, by analyzing the movements and changes of the eyes, eyebrows, mouth and so on in a facial expression, it can be determined whether the person's emotional state is angry, happy, sad, surprised, etc. The emotional state of a person can also be identified by analyzing physiological signals such as heart rate, skin conductance and brain waves; for example, by monitoring changes in heart rate, it can be determined whether the person is stressed, relaxed, excited, etc.
The embodiment of the application designs a staged emotion driving model. The whole model is divided into three stages of modules/models: a general mouth shape driving model in the first stage, a style mouth shape driving model in the second stage and a style emotion driving model in the third stage. The general mouth shape driving model is responsible for the basic, generic mouth movements of speaking; the style mouth shape driving model is responsible for converting the generic basic mouth movements into mouth movements of a specific style; the style emotion driving model is responsible for producing facial emotional activity with a specific style.
As shown in FIG. 11, in the first-stage training, blendshape (blend shape) coefficient sequences extracted from multi-person generic speaking data are used to train an average general model based on a deep network; the first stage outputs a blendshape coefficient sequence that can drive the 3D character model.
In the second-stage training, the result output by the first stage is combined with a style label to train a style mouth shape driving model based on a deep network, which outputs a blendshape coefficient sequence that can drive the 3D model in the second stage. The second-stage training may use blendshape coefficient sequences extracted from speaking data of one or more speaking styles, with a style label set for each style; this speaking data carries no emotion, and each speaker can be regarded as one style.
In the third-stage training, the result output by the first stage, the style label and the emotion label are combined to train a style emotion driving model based on a deep network, which outputs a style emotion blendshape coefficient sequence; this sequence is added to the style mouth shape blendshape coefficient sequence output by the second stage to produce the blendshape coefficient sequence that drives the 3D character model in the third stage. The training data of the third stage are blendshape coefficient sequences extracted from speaking data that has the same styles as the second stage but carries emotion; to keep the style emotion consistent with the style label of the style mouth shape, data for the same style are collected from the same person.
During the second-stage training, the general mouth shape driving model trained in the first stage is frozen; during the third-stage training, both the first-stage general mouth shape driving model and the second-stage style mouth shape driving model are frozen.
The deep networks of the general mouth shape driving model, the style mouth shape driving model and the style emotion driving model may be Transformer models or CNNs (Convolutional Neural Networks).
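A minimal sketch of staged training with frozen earlier stages, using PyTorch as an assumed framework; the module classes, input sizes and learning rates are placeholders, not the application's actual architecture.

```python
import torch
import torch.nn as nn

# Placeholder stage modules; the real models may be Transformers or CNNs.
general_model = nn.GRU(input_size=80, hidden_size=52, batch_first=True)   # stage 1 (assumed)
style_mouth_model = nn.Linear(52 + 8, 52)                                 # stage 2 (assumed)
style_emotion_model = nn.Linear(52 + 8 + 7, 52)                           # stage 3 (assumed)

def freeze(module: nn.Module) -> None:
    """Freeze a trained stage so later-stage training does not update it."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Stage 2 training: freeze the stage-1 general mouth shape driving model.
freeze(general_model)
opt2 = torch.optim.Adam(style_mouth_model.parameters(), lr=1e-4)

# Stage 3 training: freeze stages 1 and 2, train only the style emotion driving model.
freeze(style_mouth_model)
opt3 = torch.optim.Adam(style_emotion_model.parameters(), lr=1e-4)
```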
The style emotion driving model is trained based on emotion intensity data. The emotional intensity data includes blendshape coefficient sequences and emotional tags. The emotion tags include emotion type and emotion intensity.
In some embodiments, the emotion categories mainly comprise 6 basic emotions, namely happiness, anger, sadness, fear, surprise, aversion, and the emotion intensity ranges from 0 to 1.
In some embodiments, the mood tags are mood categories and mood intensity, e.g., mood tags are happy, medium.
In some embodiments, the emotion tag is designed as a vector with a fixed arrangement of components. For example, as shown in FIG. 12, the emotion tag vector indicates that the current emotion is happy with an intensity of 0.3; when all emotion intensities are 0, the current emotion is neutral, i.e., no emotion.
In some embodiments, the emotion tag can control not only a single basic emotion and its intensity but also a compound emotion and its intensity. For example, an anxiety emotion can be composed from anger, sadness, fear and aversion; as shown in FIG. 13, anxiety emotions of different kinds or intensities can be composed by adjusting the intensities of the components.
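As an illustration of the tag layout, a small sketch assuming the component order (happiness, anger, sadness, fear, surprise, aversion) used in the examples later in the description; the exact ordering shown in FIGS. 12 and 13 is not reproduced here.

```python
# Emotion tag vectors, ordered (happiness, anger, sadness, fear, surprise, aversion).
# The ordering is an assumption consistent with the worked examples later in the text.

NEUTRAL = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # all intensities zero: no emotion
HAPPY_LOW = [0.3, 0.0, 0.0, 0.0, 0.0, 0.0]  # single basic emotion, intensity 0.3
ANXIETY = [0.0, 0.2, 0.2, 0.3, 0.0, 0.4]    # compound emotion from anger/sadness/fear/aversion

def is_neutral(tag: list[float]) -> bool:
    return all(v == 0.0 for v in tag)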
In some embodiments, emotion intensity data may be obtained by collecting speech data of different styles, sentences, emotions and intensities. However, collecting data for every intensity is difficult, and because different people understand intensity differently, the accuracy of the collected intensities may be low.
In some embodiments, the emotional intensity data generating method includes:
obtaining a residual emotion sequence at the basic emotion target intensity according to basic-emotion target-intensity coefficient sequences of multiple sentences and the emotion-free coefficient sequence;
collecting speaking data of multiple sentences at the basic emotion target intensity for each style, and extracting the basic-emotion target-intensity coefficient sequences of each style from the speaking data;
performing inference with the collected speaking data as the data source to generate the emotion-free coefficient sequences;
and determining the residual emotion sequence at the basic emotion target intensity as the difference between the basic-emotion target-intensity coefficient sequence and the emotion-free coefficient sequence.
The target intensity may be medium, low or high intensity.
Taking medium intensity as an example, the residual emotion sequence of the basic emotion at medium intensity is calculated.
For the 6 basic emotions of each style, medium-intensity emotion data of several different sentences are collected, the intensity of the collected emotion is marked as 0.5, and the blendshape coefficients of the sequences are extracted and denoted blendshape_E(0.5), where E represents the emotion category and can be any one of the 6 basic emotions (for example, H for happiness, A for anger, S for sadness, F for fear, P for surprise and D for aversion), and 0.5 is the intensity value of the emotion intensity.
The general mouth shape driving model and the style mouth shape driving model are trained with emotion-free data. Inference is then performed with the collected emotion data as the data source: the sentences are input into the trained general mouth shape driving model to obtain a style-free mouth shape coefficient sequence, and the mouth style and the style-free mouth shape coefficient sequence are input into the trained style mouth shape driving model to obtain the driving blendshape coefficient sequence of the style mouth shape driving model, i.e., the emotion-free blendshape coefficient sequence, denoted blendshape_PBS.
Residual emotion delta_ blendshape _e (0.5) corresponding to an emotion with an intensity of 0.5 is calculated as follows:
delta_blendshape_E(0.5)=blendshape_E(0.5)-blendshape_PBS。
Determining residual emotion sequences at non-target intensities of the basic emotion according to the residual emotion sequence at the target intensity;
for each collected sentence of the 6 basic emotions in each style, residual emotions delta_blendshape_E(S) of other intensities are generated, where S represents the intensity and takes values between 0 and 1; for example, ten residual emotions with intensities 0.1, 0.2, 0.3, 0.4, …, 1.0 are generated as follows:
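Assuming, as an interpretation (the generating formula itself is not reproduced here), that the medium-intensity residual is scaled linearly with the desired intensity S:
delta_blendshape_E(S) = (S / 0.5) × delta_blendshape_E(0.5).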
Determining the basic-emotion non-target-intensity coefficient sequences according to the residual emotion sequences at the non-target intensities and the emotion-free coefficient sequence;
emotion data blendshape_E(S) of different basic emotions and different intensities are generated for each style:
blendshape_E(S)=blendshape_PBS+delta_blendshape_E(S)。
Calculating the residual emotion sequence of a compound emotion according to the residual emotion sequences of at least two basic emotions at specific intensities;
residual emotions of compound emotions are generated. Other, more complex emotions such as anxiety and tension can be composed from the 6 basic emotions, and training samples of compound emotions are needed so that the model can support driving compound emotions. The combined residual emotion of multiple emotions is denoted delta_blendshape_[E1(S1)E2(S2)···EN(SN)], where N represents the number of combined emotions, N <= 6.
The compound calculation formula is as follows:
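Assuming, as an interpretation (the formula is not reproduced here) consistent with the component-wise example of FIG. 13, that the compound residual is the component-wise sum of the individual residual emotions:
delta_blendshape_[E1(S1)E2(S2)···EN(SN)] = delta_blendshape_E1(S1) + delta_blendshape_E2(S2) + ··· + delta_blendshape_EN(SN).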
It should be noted that, since the above combination manner can generate many compound emotions and most of them are not needed in practice, the specific compound emotions to generate can be determined according to business needs.
Determining the compound emotion coefficient sequence according to the residual emotion sequence of the compound emotion and the emotion-free coefficient sequence;
emotion data blendshape_[E1(S1)E2(S2)···EN(SN)] of compound emotions and different intensities are generated:
blendshape_[E1(S1)E2(S2)···EN(SN)] = blendshape_PBS + delta_blendshape_[E1(S1)E2(S2)···EN(SN)].
Illustratively, the anxiety coefficient sequence shown in FIG. 13 is blendshape_[A(0.2)S(0.2)F(0.3)D(0.4)].
The emotion intensity data generated according to the embodiments of the application greatly reduce the difficulty of collecting data for every intensity and improve the accuracy of the intensities.
The third-stage style emotion driving model is trained using the emotion intensity S with the generated emotion blendshape_E(S), and the compound intensity vector S1 S2 … SN with the corresponding blendshape_[E1(S1)E2(S2)···EN(SN)], so that both emotion intensities and compound emotions are supported; this also enlarges the training data and improves the accuracy of the third-stage model.
Step S502: inputting the broadcast text or the broadcast voice into the general mouth shape driving model to obtain a style-free mouth shape coefficient sequence;
the broadcast voice is synthesized based on the broadcast text, and the general mouth shape driving model is trained on multi-person speaking data.
Inputting the broadcast text or the broadcast voice into the general mouth shape driving model yields the blendshape coefficient sequence of the generic basic mouth movements for the broadcast text or broadcast voice.
Step S503: inputting the mouth style and the style-free mouth shape coefficient sequence into the style mouth shape driving model to obtain a style mouth shape coefficient sequence;
the style mouth shape driving model is trained by fusing the output of the general mouth shape driving model with mouth style labels.
Inputting the mouth style and the style-free mouth shape coefficient sequence into the style mouth shape driving model yields the blendshape coefficient sequence of mouth movements with the mouth style.
Step S504: inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain a style emotion coefficient sequence;
the style emotion driving model is trained by fusing the output of the general mouth shape driving model with emotion style labels and emotion labels.
Inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model yields the facial emotion blendshape coefficient sequence with the emotion style.
In some embodiments, the reply emotion includes an emotion type and an emotion intensity, where the emotion type includes basic emotions and compound emotions, and a compound emotion is the combination of several basic emotions. The emotion style, the emotion type, the emotion intensity and the style-free mouth shape coefficient sequence are input into the style emotion driving model to obtain the style emotion coefficient sequence.
In some embodiments, an emotion vector is determined based on the reply emotion. The emotion style, the emotion vector, the emotion intensity and the style-free mouth shape coefficient sequence are input into the style emotion driving model to obtain the style emotion coefficient sequence.
Each component of the emotion vector represents a basic emotion, and the value of each component represents the intensity of the corresponding basic emotion. Illustratively, the components of the emotion vector represent, from first to last, the basic emotions happiness, anger, sadness, fear, surprise and aversion.
For example, when the reply emotion is happy with an intensity of 0.3, the emotion vector is [0.3, 0, 0, 0, 0, 0]; when the reply emotion is sad with an intensity of 0.5, the emotion vector is [0, 0, 0.5, 0, 0, 0].
A compound emotion may have no separate intensity, i.e., it uses fixed basic emotion intensities; for example, when the reply emotion is anxiety, the emotion vector is [0, 0.2, 0.2, 0.3, 0, 0.4]. A compound emotion may also have an emotion intensity, in which case the component values can be computed from the emotion intensity and the basic intensities, for example by adding the value corresponding to the emotion intensity to the non-zero components of the compound emotion's basic-intensity emotion vector. For example, if the basic-intensity emotion vector of the compound emotion is [0, 0.2, 0.2, 0.3, 0, 0.4], the emotion vector for a compound emotion intensity of 0.1 is [0, 0.3, 0.3, 0.4, 0, 0.5].
In some embodiments, an emotion vector sequence of the image frames is determined according to the reply emotion and the number of image frames, where the component values in the emotion vector sequence show an increasing or decreasing trend and the number of image frames is determined based on the broadcast text;
The length of the broadcast voice or the broadcast text corresponds to the number of image frames, and the number of the image frames can be determined according to the length of the broadcast voice or the broadcast text. For example, if the length of the broadcast voice is 2s and the frame rate is 60fps, the number of image frames is 2×60=120 frames.
An increasing or decreasing trend of the component values in the emotion vector sequence means that one or more component values increase or decrease over the sequence. The increase or decrease may be linear or non-linear; the application does not limit the manner of increase or decrease.
In some embodiments, the step of determining the emotion vector sequence of the image frames from the reply emotion and the number of image frames includes:
determining the step size as the ratio of the emotion intensity to the number of image frames;
wherein, if the emotion is a compound emotion, the step sizes of the several basic emotions are calculated separately;
and determining the emotion vector sequence of the image frames based on the step size, the emotion intensity and the emotion type.
Illustratively, the reply emotion is of type happy with intensity 0.3 and the number of image frames is 10, so the step size is 0.3/10 = 0.03 and the emotion vector sequence is [0.03, 0, 0, 0, 0, 0], [0.06, 0, 0, 0, 0, 0], [0.09, 0, 0, 0, 0, 0], …, [0.3, 0, 0, 0, 0, 0]. For a reply emotion of type anxiety [0, 0.2, 0.2, 0.3, 0, 0.4] with 10 image frames, the anger step size is 0.02, the sadness step size is 0.02, the fear step size is 0.03 and the aversion step size is 0.04, giving the emotion vector sequence [0, 0.02, 0.02, 0.03, 0, 0.04], [0, 0.04, 0.04, 0.06, 0, 0.08], …, [0, 0.2, 0.2, 0.3, 0, 0.4].
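A minimal sketch of the linear ramp described above (per-frame step equal to the intensity divided by the number of image frames), reproducing the happy example; the function and variable names are illustrative only.

```python
def ramp_emotion_sequence(target: list[float], num_frames: int) -> list[list[float]]:
    """Linearly ramp each non-zero component from 0 to its target intensity over num_frames frames."""
    steps = [v / num_frames for v in target]               # per-component step size
    return [[round(s * (i + 1), 6) for s in steps] for i in range(num_frames)]

# Happy, intensity 0.3, 10 frames -> [0.03, 0, ...], [0.06, 0, ...], ..., [0.3, 0, ...]
seq = ramp_emotion_sequence([0.3, 0, 0, 0, 0, 0], 10)
assert seq[0][0] == 0.03 and seq[-1][0] == 0.3
```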
In some embodiments, the step of determining the sequence of emotion vectors for the image frames from the recovered emotion and the number of image frames comprises:
Judging whether the number of image frames is greater than a preset threshold;
If the number of image frames is not greater than the preset threshold, determining the step size as the ratio of the emotion intensity to the number of image frames;
Determining the emotion vector sequence of the image frames based on the step size, the emotion intensity and the emotion type.
If the number of image frames is greater than the preset threshold, determining the step size as the ratio of the emotion intensity to the preset threshold;
Determining the emotion vector sequence of the frames before the preset threshold based on the step size, the emotion intensity and the emotion type, and determining the emotion vector sequence of the frames after the preset threshold based on the emotion intensity and the emotion type.
Illustratively, the reply emotion is of type anger with intensity 0.5, and the preset threshold is 50 frames. If the number of image frames is 100, the step size is 0.01, and the emotion vector sequence is [0,0.01,0,0,0,0], [0,0.02,0,0,0,0], [0,0.03,0,0,0,0] … [0,0.5,0,0,0,0] … [0,0.5,0,0,0,0], i.e. the first 50 frames increase to 0.5 in steps of 0.01 and the last 50 frames keep the anger intensity of 0.5 unchanged. The emotion per frame is shown in fig. 14. If the number of image frames is 25, the step size is 0.02, and the emotion vector sequence is [0,0.02,0,0,0,0], [0,0.04,0,0,0,0] … [0,0.5,0,0,0,0].
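A sketch of this thresholded variant follows; it reuses the linear ramp above and simply holds the target vector once the preset threshold of ramp frames is exhausted. Names and values are illustrative.

```python
# Thresholded ramp: when there are more frames than the preset threshold, only the first
# `threshold` frames ramp up and the remaining frames hold the target intensity.
def ramp_with_threshold(target: list[float], num_frames: int, threshold: int) -> list[list[float]]:
    ramp_frames = min(num_frames, threshold)
    steps = [t / ramp_frames for t in target]
    seq = [[round(s * (i + 1), 6) for s in steps] for i in range(ramp_frames)]
    seq += [list(target)] * (num_frames - ramp_frames)   # hold the target for the rest
    return seq

seq = ramp_with_threshold([0.0, 0.5, 0.0, 0.0, 0.0, 0.0], num_frames=100, threshold=50)
print(seq[0][1], seq[49][1], seq[99][1])   # 0.01, 0.5, 0.5
```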
The emotion style, the emotion vector sequence of the image frames and the mouth shape coefficient sequence are input into the style emotion driving model to obtain the style emotion coefficient sequence.
In the embodiment of the application, the emotion of a single sentence is the same during data acquisition and model training, that is, all frames in the current sentence share the same emotion; during application, the emotion of each frame in a sentence is arranged and designed, so that the emotion of the whole sentence transitions very naturally.
In some embodiments, the reply emotion of the previous-round dialogue text is obtained, where the previous-round dialogue text refers to the dialogue text of the round immediately preceding the current round, and the interval between the input times of the previous round of dialogue and the current round of dialogue does not exceed a preset duration.
Judging whether the emotion types of the reply emotion of the previous round of dialogue text and the reply emotion of the current round of dialogue text are the same;
If the emotion type of the reply emotion of the previous round of dialogue text is different from that of the reply emotion of the current round of dialogue text, determining a first emotion vector sequence and a second emotion vector sequence according to the number of image frames. The first emotion vector sequence is an emotion vector sequence in which the reply emotion of the previous round of dialogue text transitions to neutral emotion, and the second emotion vector sequence is an emotion vector sequence in which neutral emotion transitions to the reply emotion of the current round of dialogue text.
Illustratively, the reply emotion type of the previous round of dialogue text is sadness with intensity 0.5, the reply emotion type of the current round of dialogue text is happiness with intensity 0.5, and the number of image frames of the current round is 50. It may be set that the first 20% transitions from sadness to neutral and the remaining 80% transitions from neutral to happiness, i.e. sadness decreases linearly from 0.5 to 0.0 over the first 10 frames and happiness increases linearly from 0.0 to 0.5 over the last 40 frames. The first emotion vector sequence is [0,0,0.45,0,0,0], [0,0,0.4,0,0,0] … [0,0,0,0,0,0], and the second emotion vector sequence is [0.0125,0,0,0,0,0], [0.025,0,0,0,0,0] … [0.5,0,0,0,0,0].
It may also be set that the first 20% transitions from sadness to neutral, the middle 20% transitions from neutral to happiness, and the intensity of the last 60% is unchanged, i.e. sadness decreases linearly from 0.5 to 0.0 over the first 10 frames, happiness increases linearly from 0.0 to 0.5 over the middle 10 frames, and happiness stays at 0.5 for the last 30 frames. The first emotion vector sequence is [0,0,0.45,0,0,0], [0,0,0.4,0,0,0] … [0,0,0,0,0,0], and the second emotion vector sequence is [0.05,0,0,0,0,0], [0.1,0,0,0,0,0] … [0.5,0,0,0,0,0] … [0.5,0,0,0,0,0].
If the emotion type of the reply emotion of the previous round of dialogue text is the same as that of the reply emotion of the current round of dialogue text, judging whether the emotion intensities of the two reply emotions are the same;
If the emotion intensity of the reply emotion of the previous round of dialogue text is the same as that of the reply emotion of the current round of dialogue text, determining that the emotion vector sequence of the image frames of the current round of dialogue text uses the emotion vector corresponding to the same reply emotion for every frame;
Illustratively, the reply emotion type of the previous round of dialogue text is happiness with intensity 0.5, the reply emotion type of the current round of dialogue text is happiness with intensity 0.5, and the number of image frames of the current round is 50. The emotion of this round needs no transition, the emotion vectors of the 50 image frames are the same, and the emotion vector sequence of the image frames is [0.5,0,0,0,0,0] … [0.5,0,0,0,0,0].
If the emotion intensity of the reply emotion of the previous round of dialogue text is different from that of the reply emotion of the current round of dialogue text, calculating a step size from the two emotion intensities, and determining the emotion vector sequence of the image frames according to the step size.
Illustratively, the reply emotion type of the previous round of dialogue text is happiness with intensity 0.3, the reply emotion type of the current round of dialogue text is happiness with intensity 0.5, and the number of image frames is 50. The step size is (0.5-0.3)/50=0.004, and the emotion vector sequence of the image frames is [0.304,0,0,0,0,0], [0.308,0,0,0,0,0] … [0.5,0,0,0,0,0]. Alternatively, if the preset threshold is 20 frames, the step size is (0.5-0.3)/20=0.01, and the emotion vector sequence of the image frames is [0.31,0,0,0,0,0], [0.32,0,0,0,0,0] … [0.5,0,0,0,0,0] … [0.5,0,0,0,0,0], with the last 30 frames keeping happiness at 0.5 unchanged.
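The cross-round transition can be sketched as below, assuming the six-component vectors introduced earlier; the 20% fade fraction, the function name and the linear interpolation are illustrative choices, not mandated by the application.

```python
# Sketch of the cross-round transition. Different emotion types: fade the previous emotion
# to neutral over the first fraction of frames, then ramp the current emotion up. Same type,
# different intensity: ramp linearly between the two intensities on the shared component.
def transition_sequence(prev_vec, cur_vec, prev_intensity, cur_intensity,
                        same_type: bool, num_frames: int, fade_fraction: float = 0.2):
    frames = []
    if not same_type:
        fade = int(num_frames * fade_fraction)        # e.g. first 20%: previous emotion -> neutral
        rise = num_frames - fade                      # remaining frames: neutral -> current emotion
        frames += [[v * (1 - (i + 1) / fade) for v in prev_vec] for i in range(fade)]
        frames += [[v * (i + 1) / rise for v in cur_vec] for i in range(rise)]
    else:
        step = (cur_intensity - prev_intensity) / num_frames
        frames += [[(prev_intensity + step * (i + 1)) if v > 0 else 0.0 for v in cur_vec]
                   for i in range(num_frames)]
    return [[round(x, 6) for x in f] for f in frames]

seq = transition_sequence([0, 0, 0.5, 0, 0, 0], [0.5, 0, 0, 0, 0, 0],
                          0.5, 0.5, same_type=False, num_frames=50)
print(seq[0], seq[9], seq[-1])   # sadness fades out over 10 frames, happiness rises over 40
```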
According to the embodiment of the application, the emotion of each frame in a sentence is designed in combination with the arranged reply emotion of the previous round of dialogue text, so that the emotion of the previous dialogue transitions very naturally into the emotion of the current dialogue, abrupt switching between, for example, happiness and sadness is avoided, and the digital person appears more lifelike.
Step S505: generating a digital human coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence.
In some embodiments, the style mouth shape coefficient sequence and the style emotion coefficient sequence are added to obtain the digital human coefficient sequence.
In some embodiments, pre-stored user image data is acquired;
For the pre-stored user image data, collected user pictures or videos may be uploaded to the server through a terminal or the display device, and the server processes them, for example into cartoon images, and stores the result. After receiving the digital person identifier transmitted from the display apparatus 200, the server 400 may determine the user image data corresponding to the digital person identifier.
Mapping the user image data to a three-dimensional space to obtain an image coefficient sequence;
A digital human coefficient sequence is generated based on the image coefficient sequence, the style mouth shape coefficient sequence and the style emotion coefficient sequence, that is, the three sequences are added to obtain the digital human coefficient sequence.
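A minimal sketch of this coefficient fusion is shown below; numpy, the array shapes and the blendshape count are assumptions used only for illustration.

```python
# Sketch of coefficient fusion: the digital human (blendshape) coefficient sequence is the
# element-wise sum of the image, style mouth shape and style emotion coefficient sequences.
import numpy as np

def fuse_coefficients(image_seq: np.ndarray,
                      style_mouth_seq: np.ndarray,
                      style_emotion_seq: np.ndarray) -> np.ndarray:
    """All inputs are (num_frames, num_blendshapes); the result has the same shape."""
    return image_seq + style_mouth_seq + style_emotion_seq

frames, n_bs = 120, 52                       # e.g. 120 frames of 52 blendshape coefficients (assumed)
digital_human_seq = fuse_coefficients(np.zeros((frames, n_bs)),
                                      np.random.rand(frames, n_bs) * 0.1,
                                      np.random.rand(frames, n_bs) * 0.1)
print(digital_human_seq.shape)               # (120, 52)
```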
In some embodiments, the digital person image data comprises digital person image frame data. The server 400 generates digital human image frame data based on the digital human coefficient sequence, and transmits the image frame data to the display apparatus 200 to cause the display apparatus 200 to display a digital human image based on the digital human image frame data.
In some embodiments, the digital human image data comprises a sequence of digital human coefficients. The server 400 transmits the digital person coefficient sequence to the display apparatus 200 to cause the display apparatus 200 to draw and display a digital person image based on the digital person coefficient sequence.
In some embodiments, the broadcast voice is synthesized from the preset voice feature and the broadcast text.
In some embodiments, the broadcast voice is synthesized according to the voice features corresponding to the digital person identifier and the broadcast text. For the voice features corresponding to the digital person identifier, collected user audio may be uploaded to the server through a terminal or the display device, and the server performs voice cloning on the audio data and then stores the result.
The server 400 transmits the broadcasting voice to the display apparatus 200 so that the display apparatus 200 synchronously plays the broadcasting voice and the digital human image.
The embodiment of the application can realize decoupling of the digital person mouth shape and emotion, driving of different types of emotion, driving of different emotion intensities, and transfer driving of emotion styles.
After receiving the style flag bit, if the style flag bit is a first preset value, that is, the mouth shape style and the emotion style set by the user are empty, as shown in fig. 15, the server 400 inputs the broadcast text or the broadcast voice into the universal mouth shape driving model to obtain a style-free mouth shape blendshape coefficient sequence, inputs a random emotion style, the reply emotion and the style-free mouth shape blendshape coefficient sequence into the style emotion driving model to obtain a style emotion blendshape coefficient sequence, and adds the style-free mouth shape blendshape coefficient sequence and the style emotion blendshape coefficient sequence to obtain the digital human blendshape coefficient sequence. The first preset value of the style flag bit is applied to a general emotion driving environment.
In cases where no specific character style or emotion style is required, the embodiment of the application can use the universal mouth shape driving model together with a certain emotion style, which avoids copyright issues and saves the inference time of the style mouth shape.
If the style flag bit is a second preset value, the server 400 also receives the overall style selected by the user. As shown in fig. 16, the broadcast text or the broadcast voice is input into the universal mouth shape driving model to obtain a style-free mouth shape blendshape coefficient sequence. The overall style and the style-free mouth shape blendshape coefficient sequence are input into the style mouth shape driving model to obtain a style mouth shape blendshape coefficient sequence. The overall style, the reply emotion and the style-free mouth shape blendshape coefficient sequence are input into the style emotion driving model to obtain a style emotion blendshape coefficient sequence, and the style mouth shape blendshape coefficient sequence and the style emotion blendshape coefficient sequence are added to obtain the digital human blendshape coefficient sequence. The second preset value of the style flag bit is applied to a customized emotion driving environment.
In application scenarios that require customization of the character image style, the embodiment of the application can record the speaking mouth shape style and emotion style of the character for dedicated training, and use the mouth shape style label and emotion style label of the character image during application, thereby meeting the requirements of customizing and reproducing the character image and style.
If the style flag bit is a third preset value, the server 400 also receives the mouth shape style and the emotion style selected by the user. As shown in fig. 17, the broadcast text or the broadcast voice is input into the universal mouth shape driving model to obtain a style-free mouth shape blendshape coefficient sequence. The mouth shape style and the style-free mouth shape blendshape coefficient sequence are input into the style mouth shape driving model to obtain a style mouth shape blendshape coefficient sequence. The emotion style, the reply emotion and the style-free mouth shape blendshape coefficient sequence are input into the style emotion driving model to obtain a style emotion blendshape coefficient sequence, and the style mouth shape blendshape coefficient sequence and the style emotion blendshape coefficient sequence are added to obtain the digital human blendshape coefficient sequence. The third preset value of the style flag bit is applied to an emotion driving environment for entertainment and innovation.
In environments with strong entertainment requirements, the embodiment of the application can combine different mouth shape styles and emotion styles to form novel effects such as exaggeration and whimsy.
The style flag bit can be changed according to user settings, and can also be set according to the application environment, device type, operating configuration and the like.
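The three flag-bit paths of figs. 15 to 17 can be summarized in the following sketch. The model objects, their call signatures, the flag values and the random style labels are illustrative assumptions; only the data flow between the universal mouth shape driving model, the style mouth shape driving model and the style emotion driving model follows the description above.

```python
# Sketch of the style-flag dispatch. Coefficient sequences are assumed to support
# element-wise addition (e.g. numpy arrays); model callables and flag values are assumed.
import random

def generate_digital_human_coeffs(flag, text_or_speech, reply_emotion,
                                  mouth_model, style_mouth_model, style_emotion_model,
                                  mouth_style=None, emotion_style=None, overall_style=None):
    plain_mouth = mouth_model(text_or_speech)                  # style-free mouth shape sequence
    if flag == 1:                                              # general driving: random emotion style
        emotion_style = random.choice(["style_a", "style_b"])  # hypothetical emotion style labels
        style_emotion = style_emotion_model(emotion_style, reply_emotion, plain_mouth)
        return plain_mouth + style_emotion
    if flag == 2:                                              # customized driving: one overall style
        mouth_style = emotion_style = overall_style
    # flag == 2 or flag == 3: style mouth shape plus style emotion
    style_mouth = style_mouth_model(mouth_style, plain_mouth)
    style_emotion = style_emotion_model(emotion_style, reply_emotion, plain_mouth)
    return style_mouth + style_emotion
```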
The embodiment of the application designs a split-model, emotion-controllable driving scheme that can control the emotion type and the emotion intensity and realize refined emotion control; combined with the designed staged training method, it decouples mouth shape and emotion, so that emotions of different styles can be controlled and emotion styles can be transferred onto mouth shapes of different styles.
According to the embodiment of the application, a data design scheme in which only medium-intensity emotion data is manually collected is used to generate training data of various intensities, avoiding the complexity and intensity inconsistency of collecting emotions at different intensities; meanwhile, training data with compound emotions are designed and generated, expanding the categories of emotions that can be driven.
The embodiment of the application makes full use of the characteristics of split-model emotion driving and designs different application environments, including general emotion driving, customized emotion driving, and entertainment and innovation emotion driving, realizing compatibility of different application modes within the same split-model, staged design scheme.
According to the designed emotion labels, the embodiment of the application can set frame-level emotion labels, and different emotion arrangement modes can be designed in application, so that the emotion varies and transitions with natural rises and falls.
Some embodiments of the present application provide a digital person generation method applicable to a server, the server being configured to: acquire a mouth shape style and an emotion style set by a user, and determine a broadcast text and a reply emotion; input the broadcast text or a broadcast voice into a universal mouth shape driving model to obtain a style-free mouth shape coefficient sequence, the broadcast voice being synthesized based on the broadcast text and the universal mouth shape driving model being trained based on multi-person speaking data; input the mouth shape style and the style-free mouth shape coefficient sequence into a style mouth shape driving model to obtain a style mouth shape coefficient sequence, the style mouth shape driving model being obtained by training on the output result of the universal mouth shape driving model fused with mouth shape style labels; input the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence, the style emotion driving model being obtained by training on the output result of the universal mouth shape driving model fused with emotion style labels and emotion labels; and generate a digital human coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence. According to the embodiment of the application, the universal mouth shape driving model, the style mouth shape driving model and the style emotion driving model are trained in a split-model, staged manner, so that mouth shape driving and emotion driving are separated, and different style mouth shapes and style emotions can be combined in application, making digital human emotion expression and style migration more natural.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A server, configured to:
acquiring a mouth shape style and an emotion style set by a user, and determining a broadcast text and a reply emotion;
inputting the broadcast text or a broadcast voice into a universal mouth shape driving model to obtain a style-free mouth shape coefficient sequence, wherein the broadcast voice is synthesized based on the broadcast text, and the universal mouth shape driving model is trained based on multi-person speaking data;
inputting the mouth shape style and the style-free mouth shape coefficient sequence into a style mouth shape driving model to obtain a style mouth shape coefficient sequence, wherein the style mouth shape driving model is obtained by training on the output result of the universal mouth shape driving model fused with mouth shape style labels;
inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence, wherein the style emotion driving model is obtained by training on the output result of the universal mouth shape driving model fused with emotion style labels and emotion labels;
and generating a digital human coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence.
2. The server of claim 1, wherein the reply emotion comprises an emotion type and an emotion intensity, the emotion type comprises basic emotions and compound emotions, a compound emotion is a composite of a plurality of basic emotions, and the server, in inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain the style emotion coefficient sequence, is further configured to:
And inputting the emotion style, the emotion type, the emotion intensity and the mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence.
3. The server of claim 2, wherein the style emotion driving model is trained based on emotion intensity data, and the server, in generating the emotion intensity data, is further configured to:
obtaining a residual emotion sequence of the basic emotion target intensity according to the basic emotion target intensity coefficient sequence and the non-emotion coefficient sequence of the multiple sentences;
Determining a residual emotion sequence of basic emotion non-target intensity according to the residual emotion sequence of basic emotion target intensity;
determining a basic emotion non-target intensity coefficient sequence according to the residual emotion sequence of the basic emotion non-target intensity and the non-emotion coefficient sequence;
Calculating a residual emotion sequence of the compound emotion according to the residual emotion sequences of at least two basic emotion specific intensities;
and determining a compound emotion coefficient sequence according to the residual emotion sequence of the compound emotion and the non-emotion coefficient sequence.
4. The server of claim 1, wherein the server, in inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain the style emotion coefficient sequence, is further configured to:
determining an emotion vector sequence of image frames according to the reply emotion and the number of image frames, wherein at least one group of component values in the emotion vector sequence shows an increasing or decreasing trend, the components in the emotion vector represent basic emotions, the component values in the emotion vector represent the emotion intensities of the corresponding basic emotions, and the number of image frames is determined based on the length of the broadcast text or the broadcast voice;
And inputting the emotion style, the emotion vector sequence of the image frame and the mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence.
5. The server of claim 1, wherein the server is configured to:
if the mouth shape style and the emotion style set by the user are empty, acquiring a random emotion style;
inputting the random emotion style, the reply emotion and the style-free mouth shape coefficient sequence into the style emotion driving model to obtain a style emotion coefficient sequence;
and generating a digital human coefficient sequence based on the style-free mouth shape coefficient sequence and the style emotion coefficient sequence.
6. The server of claim 1, wherein the server, in determining the broadcast text, is further configured to:
Receiving voice data input by a user sent by display equipment;
And determining a broadcasting text according to the voice data.
7. The server of claim 1, wherein the server, in determining the reply emotion, is further configured to:
determining a reply emotion based on the voice data or a voice text corresponding to the voice data; or
Receiving face images or physiological signals of a user acquired by display equipment;
determining a return emotion from the face image or the physiological signal.
8. The server of claim 1, wherein the server, in generating the digital human coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence, is further configured to:
acquiring prestored user image data;
Mapping the user image data to a three-dimensional space to obtain an image coefficient sequence;
and generating a digital human coefficient sequence based on the image coefficient sequence, the style mouth shape coefficient sequence and the style emotion coefficient sequence.
9. A display device, characterized by comprising:
A display configured to display a user interface;
A communicator configured to communicate data with the server;
A controller configured to:
Receiving voice data input by a user;
transmitting the voice data to a server through the communicator;
receiving digital human image data issued by the server based on the voice data and broadcasting voice;
and playing the broadcasting voice and displaying the digital human image based on the digital human image data.
10. A digital person generation method, comprising:
acquiring a mouth shape style and an emotion style set by a user, and determining a broadcast text and a reply emotion;
inputting the broadcast text or a broadcast voice into a universal mouth shape driving model to obtain a style-free mouth shape coefficient sequence, wherein the broadcast voice is synthesized based on the broadcast text, and the universal mouth shape driving model is trained based on multi-person speaking data;
inputting the mouth shape style and the style-free mouth shape coefficient sequence into a style mouth shape driving model to obtain a style mouth shape coefficient sequence, wherein the style mouth shape driving model is obtained by training on the output result of the universal mouth shape driving model fused with mouth shape style labels;
inputting the emotion style, the reply emotion and the style-free mouth shape coefficient sequence into a style emotion driving model to obtain a style emotion coefficient sequence, wherein the style emotion driving model is obtained by training on the output result of the universal mouth shape driving model fused with emotion style labels and emotion labels;
and generating a digital human coefficient sequence based on the style mouth shape coefficient sequence and the style emotion coefficient sequence.