CN117809679A - Server, display equipment and digital human interaction method - Google Patents

Server, display equipment and digital human interaction method

Info

Publication number
CN117809679A
Authority
CN
China
Prior art keywords
emotion
voice
key point
sequence
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311258675.1A
Other languages
Chinese (zh)
Inventor
李绪送
付爱国
杨善松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202311258675.1A
Publication of CN117809679A
Legal status: Pending

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

Some embodiments of the present application provide a server, a display device, and a digital human interaction method. The method includes: after voice data is received, acquiring user image data and original key point data, and determining a broadcast text according to the voice data; determining a reply emotion based on the voice data; inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence; correspondingly replacing the emotion voice key point sequence into the original key point sequence to generate a face key point sequence; generating digital human image data based on the user image data and the face key point sequence; generating broadcasting voice based on the broadcast text; and sending the broadcasting voice and the digital human image data to the display device. In the embodiments of the present application, the emotion mapping voice driving model maps key point data from a neutral state to other emotions, so that the generated digital human has a mouth shape corresponding to the voice content while its expressions are richer and more natural.

Description

Server, display equipment and digital human interaction method
Technical Field
The present application relates to the technical field of digital human interaction, and in particular to a server, a display device, and a digital human interaction method.
Background
A digital human is a virtual character generated by computer programs and algorithms; it can simulate human language, behavior, emotion, and other characteristics, and is highly intelligent and interactive. Digital humans introduced by industry have been applied in many fields such as travel, finance, hosting, gaming, and video entertainment. Different enterprises have built their own virtual digital human pipelines for their business needs, but because of constraints on resources and effect, they only offer deeply customized solutions for enterprise clients and have no mature, reliable customization solution for consumer-level individual clients.
In a digital human customization scenario for individual clients, the same server must support thousands of different users, with the displayed persona switching constantly. An early approach pre-renders video frames and stores them in memory for users to load and play, but the number of individual clients is so large that memory cannot hold video frame data for thousands of users at the same time. Reading frames from disk on demand instead introduces latency that seriously degrades the interaction experience. Driving a single frame image with key points can therefore help realize digital human customization for individual clients.
Face key points serve as a compressed representation of facial expression and motion and are commonly used as the intermediate state of two-stage talking digital human generation algorithms. Existing algorithms chain modal signals such as voice/text to image signals through key points, mapping voice to key points and key points to images; such a scheme can drive a digital human to produce the lip shape corresponding to the voice/text content. However, key-point-driven digital human expressions are monotonous and limited by the expressive capacity of the key points; the expressions obtained through key point mapping are usually neutral and lack emotional expression.
Disclosure of Invention
Some embodiments of the present application provide a server, a display device, and a digital human interaction method, which use an emotion mapping voice driving model to map key point data from a neutral state to other emotions, so that the generated digital human has a mouth shape corresponding to the voice content while its expressions are richer and more natural.
In a first aspect, some embodiments of the present application provide a server configured to:
after receiving voice data that is input by a user and sent by a display device, acquiring user image data and original key point data corresponding to the user image data, and determining a broadcast text according to the voice data;
determining a reply emotion based on the voice data;
inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence, wherein the emotion voice key point sequence is a sequence of key points related to expression and pronunciation;
correspondingly replacing the emotion voice key point sequence into an original key point sequence to generate a face key point sequence, wherein the original key point sequence comprises a plurality of original key point data;
generating digital human image data based on the user image data and the human face key point sequence;
generating broadcasting voice based on the broadcasting text;
and sending the broadcasting voice and the digital human image data to the display device so that the display device plays the broadcasting voice and displays the digital human image based on the digital human image data.
In some embodiments, the server is configured to:
determining a blink key point sequence, wherein the blink key point sequence is a sequence of blink related key points;
acquiring a preset blink position;
determining at least one target area in the original key point sequence based on the preset blink position, wherein the target area is used for replacing the blink key point sequence;
and correspondingly replacing the blink key point sequence into a target area of the original key point sequence to obtain a key point sequence after blinking.
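As an illustration of the blink replacement described in this embodiment, the following minimal numpy sketch assumes the key point sequence is stored as a (frames, points, 2) array, that the indices of the blink-related key points are known, and that each preset blink position marks the first frame of a target area; these data layouts are assumptions, not the actual implementation.

```python
import numpy as np
from typing import Sequence


def insert_blinks(original_seq: np.ndarray,      # (T, P, 2) key point sequence without blinks
                  blink_seq: np.ndarray,         # (B, E, 2) one blink, eye-related key points only
                  eye_ids: np.ndarray,           # (E,) indices of the blink-related key points
                  blink_starts: Sequence[int]) -> np.ndarray:
    """Replace each target area of the original sequence with the blink key point sequence."""
    out = original_seq.copy()
    blink_len = blink_seq.shape[0]
    for start in blink_starts:                   # preset blink positions
        end = start + blink_len
        if end <= out.shape[0]:                  # the target area must fit inside the sequence
            out[start:end, eye_ids, :] = blink_seq
    return out
```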
In some embodiments, the server performs a corresponding substitution of the emotion voice keypoint sequence into an original keypoint sequence, generating a face keypoint sequence, further configured to:
and correspondingly replacing the emotion voice key point sequence into the key point sequence after blinking to generate a face key point sequence.
In some embodiments, the server performs a corresponding substitution of the emotion voice keypoint sequence into the post-blink keypoint sequence, generating a face keypoint sequence, further configured to:
correspondingly replacing the emotion voice key point sequence into the key point sequence after blinking to generate a key point sequence after emotion processing;
acquiring a head-movement affine matrix fitting sequence;
and generating a human face key point sequence based on the head affine matrix fitting sequence and the key point sequence after emotion processing.
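The combination of the head-movement affine matrix fitting sequence with the key point sequence after emotion processing can be sketched as below, assuming one 2x3 affine matrix per frame applied in homogeneous coordinates; the matrix shape is an assumption.

```python
import numpy as np


def apply_head_motion(keypoint_seq: np.ndarray,   # (T, P, 2) key points after emotion processing
                      affine_seq: np.ndarray) -> np.ndarray:   # (T, 2, 3) per-frame affine matrices
    """Apply the fitted head-motion affine matrix of each frame to that frame's key points."""
    num_frames, num_points, _ = keypoint_seq.shape
    ones = np.ones((num_frames, num_points, 1))
    homo = np.concatenate([keypoint_seq, ones], axis=-1)       # (T, P, 3) homogeneous coordinates
    # x' = A @ [x, y, 1]^T for every point of every frame
    return np.einsum('tij,tpj->tpi', affine_seq, homo)         # (T, P, 2) face key point sequence
```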
In some embodiments, the server performs determining a sequence of blink keypoints, and is further configured to:
copying the original key point data into a plurality of copies;
and inputting the plurality of copies of the original key point data into a key point blink model to obtain a blink key point sequence, wherein the key point blink model is obtained by training on the basis of a voice synthesis model, conditioned on the upper and lower eyelid heights.
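A sketch of only the data-preparation side of this embodiment is given below: the neutral original key point data is copied a number of times and a hypothetical upper/lower eyelid-height curve is formed as the conditioning signal, while the key point blink model itself is treated as an opaque callable. The frame count and the shape of the conditioning curve are assumptions.

```python
import numpy as np
from typing import Callable


def generate_blink_keypoints(original_keypoints: np.ndarray,      # (P, 2) neutral key point data
                             keypoint_blink_model: Callable[[np.ndarray, np.ndarray], np.ndarray],
                             num_frames: int = 12) -> np.ndarray:
    """Copy the neutral key points num_frames times and drive the blink model with an
    eyelid-height condition that closes and then reopens the eyes."""
    copies = np.repeat(original_keypoints[None], num_frames, axis=0)        # (T, P, 2)
    # Hypothetical conditioning signal: upper/lower eyelid distance, 1.0 = open, 0.0 = closed.
    eyelid_height = np.abs(np.cos(np.linspace(0.0, np.pi, num_frames)))     # 1 -> 0 -> 1
    return keypoint_blink_model(copies, eyelid_height)                      # (T, P, 2) blink sequence
```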
In some embodiments, the server performs the obtaining a head-moving affine matrix fitting sequence, further configured to:
extracting key point data in a preset video and normalizing the key point data to obtain a head-movement reference key point sequence;
selecting standard key point data from the head movement reference key point sequence, wherein the standard key point data are key point data of a front face, no expression and no blink;
acquiring a head-movement affine matrix fitting sequence from the standard key point data to the head-movement reference key point sequence by adopting a data fitting algorithm;
storing the head-moving affine matrix fitting sequence to a preset address;
and acquiring a head-movement affine matrix fitting sequence from the preset address.
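One way to realize the "data fitting algorithm" in the steps above is a per-frame least-squares fit from the standard key point data to the head-movement reference key point sequence, as sketched below; using plain least squares and a 2x3 affine parameterization is an assumption. The resulting sequence could then be stored at the preset address and applied per frame as in the earlier sketch.

```python
import numpy as np


def fit_head_motion_affines(standard_kp: np.ndarray,        # (P, 2) frontal, no-expression, no-blink
                            reference_seq: np.ndarray) -> np.ndarray:   # (T, P, 2) normalized reference
    """Fit, for every reference frame, a 2x3 affine matrix A such that frame ~= [x, y, 1] @ A.T."""
    num_points = standard_kp.shape[0]
    src = np.hstack([standard_kp, np.ones((num_points, 1))])    # (P, 3) homogeneous source points
    affines = []
    for frame in reference_seq:                                 # frame: (P, 2)
        solution, *_ = np.linalg.lstsq(src, frame, rcond=None)  # (3, 2) least-squares solution
        affines.append(solution.T)                              # store as (2, 3)
    return np.stack(affines)                                    # (T, 2, 3) head-motion fitting sequence
```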
In some embodiments, the server performs determining a reply emotion based on the speech data, and is further configured to:
determining a reply emotion and emotion intensity based on the voice data;
the server performs the inputting of the broadcast text and the reply emotion into the emotion mapping voice driving model, and is further configured to:
inputting the broadcast text, the reply emotion, and the emotion intensity into the emotion mapping voice driving model.
In some embodiments, the server performs generating digital human image data based on the user image data and the sequence of face keypoints, and is further configured to:
and inputting the user image data and the human face key point sequence into an image generation model to obtain digital human image data.
In a second aspect, some embodiments of the present application provide a display device, including:
a display configured to display a user interface;
a communicator configured to communicate data with the server;
a controller configured to:
after the digital human interaction program is started, receiving voice data input by a user;
transmitting the voice data to a server through the communicator;
receiving the digital human image data and broadcasting voice issued by the server based on the voice data;
and playing the broadcasting voice and displaying the digital human image based on the digital human image data.
In a third aspect, some embodiments of the present application provide a digital human interaction method, including:
after receiving voice data that is input by a user and sent by a display device, acquiring user image data and original key point data corresponding to the user image data, and determining a broadcast text according to the voice data;
determining a reply emotion based on the voice data;
inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence, wherein the emotion voice key point sequence is a sequence of key points related to expression and pronunciation;
correspondingly replacing the emotion voice key point sequence into an original key point sequence to generate a face key point sequence, wherein the original key point sequence comprises a plurality of original key point data;
generating digital human image data based on the user image data and the human face key point sequence;
generating broadcasting voice based on the broadcasting text;
and sending the broadcasting voice and the digital human image data to the display device so that the display device plays the broadcasting voice and displays the digital human image based on the digital human image data.
Some embodiments of the application provide a server, a display device, and a digital human interaction method. After receiving voice data that is input by a user and sent by the display device, the server acquires user image data and original key point data corresponding to the user image data, and determines a broadcast text according to the voice data; determines a reply emotion based on the voice data; inputs the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence, wherein the emotion voice key point sequence is a sequence of key points related to expression and pronunciation; correspondingly replaces the emotion voice key point sequence into an original key point sequence to generate a face key point sequence, wherein the original key point sequence comprises a plurality of original key point data; generates digital human image data based on the user image data and the face key point sequence; generates broadcasting voice based on the broadcast text; and sends the broadcasting voice and the digital human image data to the display device, so that the display device plays the broadcasting voice and displays the digital human image based on the digital human image data. In the embodiments of the application, the emotion mapping voice driving model maps key point data from a neutral state to other emotions, so that the generated digital human has a mouth shape corresponding to the voice content while its expressions are richer and more natural.
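To make the flow above concrete, here is a minimal Python sketch of the server-side pipeline. The injected callables (asr_reply, emotion_classifier, emotion_voice_model, image_generator, tts) and the array shapes are illustrative assumptions standing in for the models described in this application, not their actual interfaces.

```python
from typing import Callable, List, Tuple

import numpy as np


def digital_human_interaction(
    voice_data: bytes,
    user_image: np.ndarray,                      # single user image frame
    original_keypoints: np.ndarray,              # (P, 2) neutral face key points
    asr_reply: Callable[[bytes], str],           # voice data -> broadcast text
    emotion_classifier: Callable[[bytes], str],  # voice data -> reply emotion
    emotion_voice_model: Callable[[str, str], Tuple[np.ndarray, np.ndarray]],
    # (broadcast text, emotion) -> (emotion voice key point sequence (T, K, 2),
    #                               indices (K,) of the replaced key points)
    image_generator: Callable[[np.ndarray, np.ndarray], List[np.ndarray]],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, List[np.ndarray]]:
    broadcast_text = asr_reply(voice_data)
    reply_emotion = emotion_classifier(voice_data)

    # Emotion mapping voice driving model: per-frame key points for the
    # expression- and pronunciation-related face regions.
    emotion_voice_seq, region_ids = emotion_voice_model(broadcast_text, reply_emotion)
    num_frames = emotion_voice_seq.shape[0]

    # Build the original key point sequence by repeating the neutral key points,
    # then replace the emotion/voice-related points frame by frame.
    face_seq = np.repeat(original_keypoints[None], num_frames, axis=0)
    face_seq[:, region_ids, :] = emotion_voice_seq

    frames = image_generator(user_image, face_seq)   # digital human image data
    audio = tts(broadcast_text)                      # broadcasting voice
    return audio, frames                             # sent to the display device
```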
Drawings
FIG. 1 illustrates an operational scenario between a display device and a control apparatus according to some embodiments;
FIG. 2 illustrates a hardware configuration block diagram of a control device according to some embodiments;
FIG. 3 illustrates a hardware configuration block diagram of a display device according to some embodiments;
FIG. 4A illustrates a software configuration diagram in a display device according to some embodiments;
FIG. 4B illustrates a software configuration diagram in a display device according to some embodiments;
FIG. 5 illustrates a flow chart of a digital human interaction provided in accordance with some embodiments;
FIG. 6 illustrates a schematic diagram of a digital human portal interface provided in accordance with some embodiments;
FIG. 7 illustrates a schematic diagram of a digital person selection interface provided in accordance with some embodiments;
FIG. 8 illustrates a flow chart for displaying a digital human interface provided in accordance with some embodiments;
FIG. 9 illustrates a flow chart of one addition of a digital human interface provided in accordance with some embodiments;
FIG. 10 illustrates a schematic diagram of a video recording preparation interface provided in accordance with some embodiments;
FIG. 11 illustrates a schematic diagram of a tone color setting interface provided in accordance with some embodiments;
FIG. 12 illustrates a schematic diagram of an audio recording preparation interface provided in accordance with some embodiments;
FIG. 13 illustrates a schematic diagram of a digital person naming interface provided in accordance with some embodiments;
FIG. 14 illustrates a schematic diagram of another digital person selection interface provided in accordance with some embodiments;
FIG. 15 illustrates a flow chart of digital person customization provided in accordance with some embodiments;
FIG. 16 illustrates a flow chart of one type of voice interaction provided in accordance with some embodiments;
FIG. 17 illustrates an overall architecture diagram provided in accordance with some embodiments;
FIG. 18 illustrates a schematic diagram of a classification model provided in accordance with some embodiments;
FIG. 19 illustrates a schematic diagram of a live data push process provided in accordance with some embodiments;
FIG. 20 illustrates a schematic diagram of a user interface provided in accordance with some embodiments;
FIG. 21 illustrates another digital human interaction timing diagram provided in accordance with some embodiments;
FIG. 22 illustrates a flow chart of a first digital human interaction provided in accordance with some embodiments;
FIG. 23 illustrates a schematic view of a lip provided in accordance with some embodiments;
FIG. 24 illustrates a schematic diagram of one image generation provided in accordance with some embodiments;
FIG. 25 illustrates a flow chart of a second digital human interaction provided in accordance with some embodiments;
FIG. 26 illustrates a schematic diagram of a face key point provided in accordance with some embodiments;
FIG. 27 illustrates a schematic diagram of one type of keypoint blink model input and output provided in accordance with some embodiments;
FIG. 28 illustrates a schematic diagram of an image generation model provided in accordance with some embodiments;
FIG. 29 illustrates a flow chart of a third digital human interaction provided in accordance with some embodiments;
FIG. 30 illustrates a schematic diagram of a speech feature extraction model provided in accordance with some embodiments;
FIG. 31 illustrates a schematic diagram of a distraction network model provided in accordance with some embodiments;
FIG. 32 illustrates a schematic diagram of a prior emotion knowledge extraction model based on a Mel-spectrum, provided in accordance with some embodiments;
fig. 33 illustrates a technical architecture diagram provided in accordance with some embodiments.
Detailed Description
For clarity and ease of implementation of the present application, exemplary implementations of the present application are described clearly and completely below with reference to the accompanying drawings in which those exemplary implementations are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first", "second", and the like in the description, in the claims, and in the above-described figures are used for distinguishing between similar or like objects or entities and are not necessarily intended to limit a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The display device provided in the embodiment of the application may have various implementation forms, for example, may be a television, an intelligent television, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), an electronic desktop (electronic table), and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display apparatus 200 through the terminal 300 or the control device 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the display device includes infrared protocol communication or bluetooth protocol communication, and other short-range communication modes, and the display device 200 is controlled by a wireless or wired mode. The user may control the display device 200 by inputting user instructions through keys on a remote control, voice input, control panel input, etc.
In some embodiments, the terminal 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on the terminal.
In some embodiments, the display device may receive instructions not using the terminal or control device described above, but rather receive control of the user by touch or gesture, or the like.
In some embodiments, the display device 200 may also perform control in a manner other than the control apparatus 100 and the terminal 300, for example, the voice instruction control of the user may be directly received through a module for acquiring a voice instruction configured inside the display device 200, or the voice instruction control of the user may be received through a voice control device provided outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be permitted to make communication connections via a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be a cluster, or may be multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction of a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and function as an interaction between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, a first interface for input/output to an nth interface.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is configured to receive image signals output from the controller and to display video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The display 260 further includes a touch screen, and the touch screen is used for receiving an action input control instruction such as sliding or clicking of a finger of a user on the touch screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
A user interface, which may be used to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, etc. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
The modem 210 receives broadcast television signals through a wired or wireless reception manner, and demodulates audio and video signals, such as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments the controller includes at least one of a central processing unit (Central Processing Unit, CPU), video processor, audio processor, graphics processor (Graphics Processing Unit, GPU), RAM (Random Access Memory, RAM), ROM (Read-Only Memory, ROM), first to nth interfaces for input/output, a communication Bus (Bus), etc.
The user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user, which enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
As shown in fig. 4A, the system of the display device is divided into three layers, an application layer, a middleware layer, and a hardware layer, from top to bottom.
The application layer mainly comprises common applications on the television, and an application framework (Application Framework), wherein the common applications are mainly applications developed based on Browser, such as: HTML5 APPs; native applications (Native APPs);
the application framework (Application Framework) is a complete program model with all the basic functions required by standard application software, such as: file access, data exchange, and the interface for the use of these functions (toolbar, status column, menu, dialog box).
Native applications (Native APPs) may support online or offline, message pushing, or local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises a HAL interface, hardware and a driver, wherein the HAL interface is a unified interface for all the television chips to be docked, and specific logic is realized by each chip. The driving mainly comprises: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), and power supply drive, etc.
Referring to FIG. 4B, in some embodiments, the system is divided into four layers, from top to bottom: an application layer (application layer), an application framework layer (Application Framework layer), an Android Runtime (Android runtime) and system library layer (system runtime layer), and a kernel layer.
In some embodiments, at least one application program is running in the application program layer, and these application programs may be a Window (Window) program of an operating system, a system setting program, a clock program, or the like; or may be an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an application programming interface (application programming interface, API) and programming framework for the application. The application framework layer includes a number of predefined functions. The application framework layer corresponds to a processing center that decides to let the applications in the application layer act. Through the API interface, the application program can access the resources in the system and acquire the services of the system in the execution.
As shown in fig. 4B, the application framework layer in the embodiment of the present application includes a manager (Manager), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) used to interact with all activities that are running in the system; a Location Manager (Location Manager) used to provide system services or applications with access to the system location service; a Package Manager (Package Manager) used to retrieve various information about application packages currently installed on the device; a Notification Manager (Notification Manager) used to control the display and clearing of notification messages; and a Window Manager (Window Manager) used to manage the icons, windows, toolbars, wallpaper, and desktop components on the user interface.
In some embodiments, the activity manager is used to manage the lifecycle of the individual applications as well as common navigation and rollback functions, such as controlling the exit, opening, and back operations of applications. The window manager is used to manage all window programs, for example obtaining the size of the display screen, judging whether a status bar exists, locking the screen, capturing screenshots, and controlling changes of the display window (for example, shrinking the display window, shaking the display, or distorting the display).
In some embodiments, the system runtime layer provides support for the upper layer, the framework layer, and when the framework layer is in use, the android operating system runs the C/C++ libraries contained in the system runtime layer to implement the functions to be implemented by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4B, the kernel layer contains at least one of the following drivers: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), and power supply drive, etc.
A digital human is a virtual character generated by computer programs and algorithms; it can simulate human language, behavior, emotion, and other characteristics, and is highly intelligent and interactive. Digital humans introduced by industry have been applied in many fields such as travel, finance, hosting, gaming, and video entertainment. Different enterprises have built their own virtual digital human pipelines for their business needs, but because of constraints on resources and effect, they only offer deeply customized solutions for enterprise clients and have no mature, reliable customization solution for consumer-level individual clients.
In a digital human customization scenario for individual clients, the same server must support thousands of different users, with the displayed persona switching constantly. An early approach pre-renders video frames and stores them in memory for users to load and play, but the number of individual clients is so large that memory cannot hold video frame data for thousands of users at the same time. Reading frames from disk on demand instead introduces latency that seriously degrades the interaction experience. Driving a single frame image with key points can therefore help realize digital human customization for individual clients.
Face key points serve as a compressed representation of facial expression and motion and are commonly used as the intermediate state of two-stage talking digital human generation algorithms. Existing algorithms chain modal signals such as voice/text to image signals through key points, mapping voice to key points and key points to images; such a scheme can drive a digital human to produce the lip shape corresponding to the voice/text content. However, key-point-driven digital human expressions are monotonous and limited by the expressive capacity of the key points; the expressions obtained through key point mapping are usually neutral and lack emotional expression.
The embodiment of the application provides a digital human interaction method, as shown in fig. 5.
Step S501: the terminal 300 establishes an association relationship with the display device 200 through the server 400;
in some embodiments, the server 400 establishes a connection relationship with the display device 200 and the terminal 300, respectively, such that the display device 200 establishes an association relationship with the terminal 300.
Wherein, the step of the server 400 establishing a connection relationship with the display device 200 includes:
the server 400 establishes a long connection with the display device 200;
the purpose of the server 400 and the display device 200 establishing a long connection is that the server 400 can push the customized status of the digital person and the like to the display device 200 in real time.
A long connection means that multiple data packets can be transmitted continuously over one connection; if no data packet is transmitted while the connection is held, both sides are required to send link detection (heartbeat) packets. A long connection allows many exchanges after the connection is established once, saving network overhead; it keeps the communication state with only one handshake and authentication, improving communication efficiency; and it supports bidirectional data transmission, so the server can actively push digital human customization data to the display device, achieving real-time communication.
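The long-connection behavior described above (heartbeat link-detection packets plus server-initiated push) can be illustrated with Python's asyncio as in the sketch below; the newline-delimited JSON message format, the port, and the timeout value are assumptions, not the protocol actually used between the server 400 and the display device 200.

```python
import asyncio
import json

HEARTBEAT_INTERVAL = 30  # assumed seconds of silence before sending a link-detection packet

connections = {}  # device_id -> StreamWriter, so customization status can be pushed later


async def handle_display_device(reader: asyncio.StreamReader,
                                writer: asyncio.StreamWriter) -> None:
    device_id = (await reader.readline()).decode().strip()   # first line identifies the device
    connections[device_id] = writer
    try:
        while True:
            try:
                data = await asyncio.wait_for(reader.readline(), timeout=HEARTBEAT_INTERVAL)
                if not data:          # connection closed by the display device
                    break
            except asyncio.TimeoutError:
                # No packet during the hold period: send a link-detection (heartbeat) packet.
                writer.write(b'{"type": "ping"}\n')
                await writer.drain()
    finally:
        connections.pop(device_id, None)
        writer.close()


async def push_customization_status(device_id: str, status: dict) -> None:
    """Server-initiated push of digital person customization status over the long connection."""
    writer = connections.get(device_id)
    if writer is not None:
        writer.write((json.dumps({"type": "status", "data": status}) + "\n").encode())
        await writer.drain()


async def main() -> None:
    server = await asyncio.start_server(handle_display_device, "0.0.0.0", 8765)
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
```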
In some embodiments, the server 400 establishes a long connection with the display device 200 after receiving the display device 200 power-on message.
In some embodiments, server 400 establishes a long connection with display device 200 after receiving a message that display device 200 enables voice digital person services.
In some embodiments, the server 400 establishes a long connection with the display device 200 after receiving the instruction to send the add digital person to the display device 200.
The server 400 receives request data sent by the display device 200, wherein the request data includes a device identification of the display device 200.
After receiving the request data, the server 400 determines whether an identification code corresponding to the device identifier exists in the database, where the identification code is used to characterize the device information of the display device 200, and the identification code may be a plurality of random numbers or letters, may be a bar code, or may be a two-dimensional code.
If the identification code corresponding to the device identifier exists in the database, the identification code is sent to the display device 200, so that the display device 200 displays the identification code on the add digital human interface.
If the identification code corresponding to the device identifier does not exist in the database, an identification code corresponding to the device identifier is created, the device identifier and the identification code are stored in the database correspondingly, and the identification code is sent to the display device 200, so that the display device 200 displays the identification code on the add digital human interface.
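The lookup-or-create logic for the identification code can be sketched as follows, with an in-memory dictionary standing in for the database and a six-character random code standing in for whatever code format is actually used; both are assumptions.

```python
import secrets
import string

_code_db = {}  # device identifier -> identification code (stand-in for the database)


def get_or_create_identification_code(device_id: str) -> str:
    """Return the identification code for a display device, creating one if absent."""
    code = _code_db.get(device_id)
    if code is None:
        # The code may equally be rendered later as a bar code or a two-dimensional code.
        code = "".join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(6))
        _code_db[device_id] = code    # store the device identifier and code correspondingly
    return code                       # sent back for display on the add digital human interface
```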
To clarify the interaction process of the server 400 establishing a connection with the display device 200, the following embodiments are disclosed:
after receiving an instruction of opening the digital person entry interface input by a user, the display device 200 controls the display 260 to display the digital person entry interface, wherein the digital person entry interface comprises a voice digital person control;
illustratively, as shown in FIG. 6, the digital person entry interface includes a voice digital person control 61, a natural dialog control 62, a wake-free word control 63, and a focus 64.
It should be noted that a control is a visual object displayed in a display area of the user interface of the display device 200 to represent corresponding content such as an icon, a thumbnail, a video clip, or a link. Controls can provide the user with various conventional program content received through data broadcasting, as well as various application and service content configured by the content provider.
The presentation form of the control is typically diversified. For example, the controls may include text content and/or images for displaying thumbnails related to the text content, or video clips related to the text. As another example, the control may be text and/or an icon of an application.
The focus is used to indicate that any of the controls has been selected. In one aspect, the control may be selected or controlled by controlling movement of the display focus object in the display device 200 according to user input through the control apparatus 100. Such as: the user may select and control controls by directional keying movement of the control focus object between controls on the control device 100. On the other hand, the movement of each control displayed in the display apparatus 200 may be controlled to cause the focus object to select or control the control according to the input of the user through the control device 100. Such as: the user can control the controls to move left and right together through the direction keys on the control device 100, so that the focus object can select and control the controls while the position of the focus object is kept unchanged.
The form of identification of the focal point is typically varied. For example, the position of the focus object may be achieved or identified by zooming in on the item, and also by setting the background color of the item, or may be identified by changing the border line, size, color, transparency, outline, and/or font of the text or image of the focus item.
After receiving an instruction of selecting a voice digital person control from a user input, the display device 200 controls the display 260 to display a digital person selection interface, wherein the digital person selection interface comprises at least one digital person control and an adding control, the digital person control is displayed in a name corresponding to a digital person image and the digital person image, and the adding control is used for adding a new digital person image, tone and name.
Illustratively, in FIG. 6, upon receiving a user input of an instruction to select the voice digital person control 61, the display device 200 displays a digital person selection interface. As shown in fig. 7, the digital person selection interface includes a default character control 71, digital person controls 72 and 73, an add control 74, and a focus 75. The user may select a desired digital person as the digital person that responds to voice commands by moving the position of the focus 75.
In some embodiments, the flow of the display device 200 displaying the digital human interface is shown in FIG. 8. Upon receiving an instruction from the user to open the digital person entry interface (home page), the digital person application of the display device 200 requests data from the voice zone; the voice zone acquires home page configuration information (home page data) from the operation side and sends the home page data to the digital person application, so that the digital person application controls the display 260 to display the digital person home page. The digital person application may also directly send a virtual digital person account request; after receiving the request, the voice zone obtains preset data, such as default digital person account information, from the operation side and obtains cloud-stored digital person account data from the algorithm service of the server 400. If default supplementary parameters exist, the preset data, the cloud-stored digital person account data, and the supplementary parameters are sent to the digital person application together, so that the digital person application controls the display 260 to display the digital person selection interface after receiving the instruction for displaying that interface. Alternatively, after the digital person home page is displayed, the digital person application may send the virtual digital person account request only after receiving a user instruction to display the digital person selection interface, and then display the digital person selection interface directly after receiving the preset data, the cloud-stored digital person account data, and the supplementary parameters.
The voice zone faces the server 400. Based on the operation support platform, it provides operationally configurable management of the background default data items and configuration items, and completes the protocol-based delivery of the data required by the display device 200. The voice zone also links the display device 200 with the algorithm service of the server 400: the display device 200 reports data parameters, and the voice zone completes instruction parsing, relays the interaction to the algorithm backend, and parses and delivers the data stored in the backend, finally realizing the data docking process of the full link.
After receiving the instruction of selecting the add control from the user input, the display device 200 sends request data carrying the device identifier of the display device 200 to the customized central control service of the server 400.
The customized central control service calls the interface of the target application program to judge whether the identification code corresponding to the device identifier exists in the database; if it exists, the identification code is sent to the display device 200. If the identification code corresponding to the device identifier does not exist in the database, an identification code is created and sent to the display device 200. The target application program is an application program with an identification-code recognition function.
The display device 200 receives the identification code issued by the server 400 and displays it on the add-on digital human interface.
Illustratively, in FIG. 7, upon receiving an instruction from a user input to select the add control 74, the display device 200 displays an add digital human interface. As shown in fig. 9, the add-on digital human interface includes a two-dimensional code 91.
Wherein, the step of establishing a connection relationship with the terminal 300 by the server 400 includes:
the server 400 receives the identification code uploaded by the terminal 300;
judging whether or not there is a display device 200 corresponding to the identification code;
if there is the display device 200 corresponding to the identification code, an association relationship between the terminal 300 and the display device 200 is established to transmit the data uploaded by the terminal 300 to the display device 200 after being processed by the server 400.
To clarify the interaction procedure of the server 400 to establish a connection with the terminal 300, the following embodiments are disclosed:
after receiving the instruction of opening the target application program from the user input, the terminal 300 starts the target application program and displays the home page interface corresponding to the target application program, wherein the home page interface includes a "scan" control.
After receiving an instruction of selecting the "scan" control from the user input, the terminal 300 displays a code-scanning interface.
The terminal 300 uploads the identification code, for example, the two-dimensional code, to the server 400 after scanning the identification code displayed by the display device 200. Wherein the user can aim the camera of the terminal 300 at the identification code displayed on the digital human interface on the display device 200.
If the identification code is in the form of digits or letters, the home page interface includes an identification code control; after receiving an instruction from the user selecting the identification code control, the terminal 300 displays an identification code input interface, and the digits or letters displayed by the display device 200 are entered into the identification code input interface to upload the identification code to the server 400.
The server 400 judges whether or not there is a display device corresponding to the identification code; if there is the display device 200 corresponding to the identification code, an association relationship between the terminal 300 and the display device 200 is established to transmit the data uploaded by the terminal 300 to the display device 200 after being processed by the server 400. If the display device 200 corresponding to the identification code does not exist, a message of failure in identification is transmitted to the terminal 300, so that the terminal 300 displays an error message.
The server 400 transmits a message of successful recognition to the terminal 300 upon determining that there is the display device 200 corresponding to the identification code. The terminal 300 displays a start page, wherein the start page starts to enter the digital person customization process.
In some embodiments, the launch page includes a digital portrait selection interface. The digital persona selection interface includes at least one default persona control and a custom persona control. After receiving an instruction of selecting a custom avatar control from a user input, the terminal 300 displays a video recording preparation interface, which includes a recording control. Illustratively, as shown in FIG. 10, the video recording preparation interface includes a video recording notice 101 and a start recording control 102.
In some embodiments, the start page may also be a video recording preparation interface.
In some embodiments, the step of establishing an association between the terminal 300 and the display device 200 through the server 400 includes:
the server 400 receives the user account and the password uploaded by the terminal 300 and sends a message of successful login after verifying that the user account and the password are correct, so that the terminal 300 can acquire data corresponding to the user account.
The server 400 receives the user account and the password uploaded by the display device 200, and after verifying that the user account and the password are correct, sends a message of successful login, so that the display device 200 can acquire data corresponding to the user account. The terminal 300 and the display device 200 log in with the same user account. By logging in with the same user account, the terminal 300 establishes an association relationship with the display device 200, so that data updated by the terminal 300 can be synchronized to the display device 200. For example, digital-person-related data customized at the terminal 300 may be synchronized to the display device 200.
Step S502: the terminal 300 uploads the image data and the audio data to the server 400;
The image data includes videos or pictures shot by the user, videos or pictures selected by the user from the album, and videos or pictures downloaded from a website.
In some embodiments, the terminal 300 receives a video or picture captured by the user and uploads it to the server 400.
Illustratively, in FIG. 10, upon receiving a user instruction selecting the start recording control 102, a video is recorded using the media component of the terminal 300. To avoid having to record multiple times because face detection fails, the recording interface displays a suggested face position, and the terminal 300 may perform a preliminary check of the face position. After recording finishes, the recorded video can be previewed repeatedly. After receiving a user instruction confirming the upload, the terminal 300 transmits the recorded video to the server 400.
In some embodiments, the terminal 300 may send the taken user photograph to the server.
In some embodiments, the terminal 300 may select one of the user photos or user videos from the album and upload the user photos or user videos to the server 400.
The server 400 receives the image data uploaded by the terminal;
detecting whether face points in the image data are qualified or not;
after receiving the image data uploaded by the terminal, the customized central control service invokes the algorithm service to check the face point positions.
If the face point in the image data is detected to be qualified, sending an image detection qualified message to the terminal 300;
if the face point in the image data is detected to be unqualified, an image detection unqualified message is sent to the terminal, so that the terminal 300 prompts the user to upload again.
Face point location detection may use an algorithm to check whether all key points of the face fall within a predetermined area.
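A minimal version of this check, assuming the detected key points are given as an (N, 2) array and the predetermined area is an axis-aligned rectangle:

```python
import numpy as np
from typing import Tuple


def face_points_qualified(keypoints: np.ndarray,            # (N, 2) detected face key points (x, y)
                          area: Tuple[float, float, float, float]) -> bool:
    """Return True if every key point lies inside the predetermined (x_min, y_min, x_max, y_max) area."""
    x_min, y_min, x_max, y_max = area
    inside_x = (keypoints[:, 0] >= x_min) & (keypoints[:, 0] <= x_max)
    inside_y = (keypoints[:, 1] >= y_min) & (keypoints[:, 1] <= y_max)
    return bool(np.all(inside_x & inside_y))
```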
After receiving the image detection qualified message, the terminal 300 displays an online special effect page.
On the online special effect page, the user can upload the original video or original photo to the server 400, that is, use the original video or photo as the digital human avatar; or the user can select a favorite special effect style, drag or tap to adjust the special effect intensity, and upload the processed video or photo to the server 400, that is, use the video or photo with the special effect applied as the digital human avatar. During special effect making, the user can touch the lower right corner of the effect image at any time to compare it with the original image. Special effect making uses picture preloading, monitors the loading progress of picture resources, and sets the picture layer hierarchy.
After the image data passes the face point verification and is successfully uploaded to the server 400, the terminal 300 displays a tone setting interface. The tone setting interface comprises at least one preset recommended tone control and a custom tone control;
The terminal 300 receives an instruction of selecting a preset recommended tone control input by a user, transmits an identification corresponding to the preset recommended tone to the server 400, and displays a digital person naming interface.
In some embodiments, the terminal 300 displays an audio recording selection interface after receiving a user input of an instruction to select a custom tone control, the audio recording selection interface including an adult control and a child control.
Illustratively, as shown in FIG. 11, the tone setting interface includes two preset recommended tone controls 111 and 112 and a custom tone control 113. Upon receiving a user input selecting the custom tone control 113, an audio recording preparation interface is displayed, as shown in fig. 12; it includes recording notes 121, an adult control 122, and a child control 123. After receiving a user input selecting either the adult control 122 or the child control 123, the corresponding flow is entered. Upon receiving a user input selecting the preset recommended tone control 111, a digital person naming interface is displayed, as shown in fig. 13.
After receiving the instruction of selecting the adult control from the user input, the terminal 300 displays an environment sound detection interface.
The terminal 300 collects environmental sounds for a preset period of time and transmits the user-recorded environmental recording sounds to the server 400.
The server 400 receives the environmental record sound uploaded by the terminal 300;
detecting whether the environment recorded sound is qualified or not;
after the customized central control service receives the environmental record sound uploaded by the terminal 300, an algorithm service is invoked to detect whether the environmental record sound is qualified.
Detecting whether the environment recording sound is qualified or not, comprising the following steps:
acquiring a noise value of an environmental recorded sound;
judging whether the noise value exceeds a preset threshold value;
if the noise value exceeds a preset threshold value, determining that the environmental recorded sound is unqualified;
and if the noise value does not exceed the preset threshold value, determining that the environment recording sound is qualified.
If the environmental recording sound is detected to be qualified, sending an environmental sound qualification message and a target text required for recording the audio to the terminal 300;
if the environmental recording sound is detected to be unqualified, an environmental sound unqualified message is sent to the terminal 300, so that the terminal 300 prompts the user to select a quiet space for re-recording.
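The application does not fix a particular noise metric, so the sketch below assumes the noise value is the RMS level of the environmental recording in dBFS and uses an assumed threshold:

```python
import numpy as np

NOISE_THRESHOLD_DB = -50.0   # assumed preset threshold (dBFS); quieter rooms give lower values


def environment_sound_qualified(samples: np.ndarray,
                                threshold_db: float = NOISE_THRESHOLD_DB) -> bool:
    """samples: mono PCM samples of the environmental recording, normalized to [-1, 1]."""
    rms = np.sqrt(np.mean(np.square(samples, dtype=np.float64)))
    noise_db = 20.0 * np.log10(max(rms, 1e-12))   # RMS level in dBFS as the noise value
    return noise_db <= threshold_db               # qualified if the noise value does not exceed the threshold
```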
After receiving the environment sound qualification message and the target text required for recording the audio, the terminal 300 displays the target text, wherein the target text may be chosen as text that brings out the timbre characteristics of the user.
The terminal 300 receives the audio of the user reading the target text and transmits the audio to the server 400. The terminal 300 may send the audio data to the server 400 each time audio of a preset duration has been received, so that the server 400 can send the recognition result back to the terminal 300, achieving real-time text recognition.
The server 400 receives audio of the user reading the target text;
identifying a user text corresponding to the audio;
calculating the qualification rate according to the target text and the user text;
the step of calculating the qualification rate according to the target text and the user text comprises the following steps:
comparing the target text with the user text to obtain the word number of the correct word in the user text;
and determining the qualification rate as the ratio of the number of words of the correct words to the number of words in the target text.
Judging whether the qualification rate is smaller than a preset value;
if the qualification rate is smaller than the preset value, sending a voice uploading failure message to the terminal 300 so that the terminal 300 prompts the user to re-record the audio of the read target text;
in some embodiments, during real-time recognition the target text is compared with the user text while the text is being read, so as to determine wrongly read, multi-read (extra), and missed words; these words are labeled and sent to the terminal 300, so that the terminal 300 displays them in different colors or fonts.
If the qualification rate is not less than the preset value, a voice uploading success message is sent to the terminal 300, so that the terminal 300 displays the next target text or the voice recording completion information.
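The qualification-rate computation and the marking of wrongly read, multi-read and missed words described above can be sketched as follows. This is a minimal illustration that assumes character-level alignment with Python's difflib; the original only states that the two texts are compared, without naming an alignment algorithm, and the function names are hypothetical.

```python
import difflib

def qualification_rate(target_text: str, user_text: str) -> float:
    """Ratio of correctly read characters to the number of characters in the target text."""
    matcher = difflib.SequenceMatcher(a=target_text, b=user_text)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / max(len(target_text), 1)

def label_reading_errors(target_text: str, user_text: str) -> dict:
    """Mark wrongly read, multi-read (extra) and missed characters so the terminal
    can display them in different colors or fonts."""
    matcher = difflib.SequenceMatcher(a=target_text, b=user_text)
    wrong, extra, missed = [], [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            wrong.extend(target_text[i1:i2])    # read incorrectly
        elif tag == "insert":
            extra.extend(user_text[j1:j2])      # multi-read, not in the target text
        elif tag == "delete":
            missed.extend(target_text[i1:i2])   # in the target text but not read
    return {"wrong": wrong, "extra": extra, "missed": missed}

# The upload succeeds only when qualification_rate(...) is not less than the preset value.
```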
After a preset number of target texts are read and qualified, the audio acquisition process is finished, and the terminal 300 displays a digital person naming interface.
The server 400 receives audio data corresponding to a preset number of target texts.
After receiving the instruction of selecting the child control from the user input, the terminal 300 also displays an environmental sound detection interface, and the environmental sound detection steps are the same as those when selecting the adult control.
If the environmental recording sound is detected to be qualified, an environmental sound qualification message and the lead-reading audio required for recording are transmitted to the terminal 300.
The terminal 300 can automatically play the lead-reading audio, and the user can listen to it repeatedly. Upon receiving an instruction of the user pressing the record key, the terminal records the audio of the user reading along and sends it to the server 400.
The server 400 receives the follow-along audio of the user;
identifying a user text corresponding to the audio;
calculating the qualification rate according to the target text corresponding to the lead-reading audio and the user text corresponding to the follow-along audio;
judging whether the qualification rate is smaller than a preset value;
if the qualification rate is smaller than the preset value, sending a voice uploading failure message to the terminal 300, so that the terminal 300 prompts the user to re-record the follow-along audio corresponding to the lead-reading audio. While the text is being read, the target text is compared with the user text to determine the wrongly read, multi-read and missed words, which are labeled and sent to the terminal 300, so that the terminal 300 displays them in different colors or fonts.
If the qualification rate is not less than the preset value, a voice uploading success message is sent to the terminal 300, so that the terminal 300 plays the next lead-reading audio or displays the voice recording completion information.
After receiving the voice recording completion information, the terminal 300 displays a digital person naming interface.
In some embodiments, after receiving an instruction from the user to select the custom tone control, the terminal 300 may allow the user to upload a piece of audio data. The server 400 detects a noise value after receiving the audio data; if the noise value exceeds a preset threshold, it sends an upload failure message to the terminal 300, so that the terminal 300 prompts the user to re-upload. If the noise value does not exceed the preset threshold, an upload success message is sent to the terminal 300, so that the terminal 300 displays a digital person naming interface.
The terminal 300, upon receiving the digital person name input by the user, transmits the digital person name to the server 400.
Illustratively, as shown in FIG. 13, the digital person naming interface includes an input box 131, a wake word control 132, a complete creation control 133, and a trained digital person avatar 134. The wake word control 132 is used to determine whether the display device wake word is set at the same time. If the wake word control 132 is selected, the digital person name is set as the wake word of the display device 200. Illustratively, the digital person naming rule when the name is set as a wake-up word of the display device is: 4-5 Chinese characters, avoiding reduplicated words (such as "small music"), avoiding spoken phrases (such as "I'm back"), and containing no sensitive words. If the wake word control 132 is not selected, the digital person name is not set as the wake word of the display device. Illustratively, the digital person naming rule when the name is not set as a wake-up word of the display device 200 is: at most 5 characters, Chinese, English and numerals may be used, and sensitive words are avoided. A digital person name created by a display device or a user account cannot be repeated.
After receiving an instruction from the user to select the complete creation control 133, the digital person name is transmitted to the server 400. If the server 400 detects that the digital person name transmitted by the user passes the audit, it transmits a creation success message to the terminal 300, and the terminal 300 may display a prompt of successful creation. If the server 400 detects that the digital person name transmitted by the user fails the audit, it transmits a creation failure message and the failure reason to the terminal 300, and the terminal 300 may display a prompt indicating the cause of the failure and asking the user to rename.
Step S503: the server 400 determines digital person image data based on the image data and digital person voice features based on the audio data.
Image preprocessing is performed on the short video or the user photo uploaded by the user to obtain digital human image data. Image preprocessing is the process of sorting each image and delivering it to an identification module for recognition; in image analysis, it is the processing performed on the input image before feature extraction, segmentation, and matching. The main purposes of image preprocessing are to eliminate extraneous information in the image, recover useful real information, enhance the detectability of relevant information, and simplify the data as much as possible, thereby improving the reliability of feature extraction, image segmentation, matching and recognition. Through the related algorithms, the embodiment of the application realizes a high-fidelity, high-definition interactive image for the customized figure.
In some embodiments, the digital human image data includes a 2D digital human image and face keypoint coordinate information that provides data support for digital human voice keypoint drivers.
In some embodiments, the digital human image data includes digital human parameters, such as 3D BS (Blend Shape) parameters. The digital person parameters are offsets of the face key points relative to a basic model, so that the display apparatus 200 can draw the digital person image based on the basic model and the digital person parameters.
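A minimal sketch of how such offset parameters could be applied on top of a base model is shown below. It assumes a classic blend-shape formulation (base mesh plus a weighted sum of deltas); the shapes, function name and the exact parameterization are assumptions, since the original only states that the parameters are offsets on the basic model.

```python
import numpy as np

def blend_face(base_vertices: np.ndarray,
               blendshape_deltas: np.ndarray,
               weights: np.ndarray) -> np.ndarray:
    """Evaluate a blend-shape face: base mesh plus a weighted sum of per-vertex deltas.

    base_vertices: (V, 3), blendshape_deltas: (K, V, 3), weights: (K,).
    The 3D BS parameters mentioned above would play the role of `weights`.
    """
    # tensordot contracts the K axis: result has shape (V, 3)
    return base_vertices + np.tensordot(weights, blendshape_deltas, axes=1)
```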
And training a human voice cloning model by utilizing the audio data uploaded by the user to obtain tone parameters conforming to the tone of the user. During voice synthesis, the broadcasting text can be input into a human voice cloning model embedded with tone parameters, and broadcasting voice conforming to the tone of the user is obtained.
In order to support digital human voice interaction, the embodiment of the application adds phoneme duration prediction on top of a general speech synthesis architecture so as to drive the downstream digital human face key points. To support digital human figure customization, few-sample timbre customization is realized on the basis of a multi-speaker speech synthesis model, and human voice cloning is achieved by fine-tuning a small number of model parameters with 1-10 sentences of user voice samples.
The digital person image can select the real person image or the cartoon image, and can also select to simultaneously create the real person image and the cartoon image.
When receiving image data uploaded by the terminal 300 for which face point detection has not yet been performed, the server 400 can determine, as indicated by the user, whether to train the real person figure or the cartoon figure; that is, training of the real person figure or cartoon figure and face point detection proceed simultaneously. If face point detection fails, training of the real person figure or cartoon figure is stopped. If face point detection succeeds, the time spent waiting for digital person training can be shortened.
In some embodiments, the server 400 transmits the trained real person and cartoon figures to the terminal 300, so that the terminal 300 displays the digital human figures and is available for the user to select.
The terminal 300 receives and displays the trained real person figure, and can provide the user with operations such as beautifying the figure and adding special effects, as well as options such as making a cartoon figure or re-recording the video, so that the user can obtain the desired digital person figure.
Step S504: the server 400 transmits the digital person image data to the display device 200 in association with the terminal 300 to cause the display device 200 to display the digital person image based on the digital person image data.
In some embodiments, the digital person image may be displayed directly at the digital person selection interface after the 2D digital person image is received.
In some embodiments, after receiving the digital person parameters, a digital person image is drawn based on the base model and the digital person parameters, and the digital person image is displayed at a digital person selection interface.
In some embodiments, the server 400 may further transmit the digital person name corresponding to the digital person image data to the display device 200 in association with the terminal 300, so that the display device 200 displays the digital person name at the corresponding location of the digital person image.
In some embodiments, the server 400, upon receiving the digital person name uploaded by the terminal 300, transmits the avatar and the digital person name to the display device 200 and displays it on the digital person selection interface. The digital person is identified with "in training" and may also identify training time, an exemplary digital person selection interface is shown in FIG. 14. After the training is completed, the server 400 transmits the final image obtained by the training to the display device 200 to update the display.
In some embodiments, a target voice (e.g., a greeting) generated based on the digital person voice feature may also be sent to the display device 200, so that upon receiving a user control to move the focus to the digital person, a voice with that digital person's timbre can be played. For example, in FIG. 7, when focus 75 moves to the digital person control 72, a greeting voice with that digital person's timbre (e.g., "Hello, I am <digital person name>") is played.
In some embodiments, a target voice is generated based on the digital human voice feature, and a key point sequence is determined based on the target voice; image data is synthesized from the key point sequence and the digital figure data, and the image data and the target voice are transmitted to the display device 200 and saved locally by the display device 200. The digital person control is displayed using the first frame (first parameter) or a designated frame (designated parameter) in the image data, or the displayed image is drawn based on the first parameter or the designated parameter in the image data; when the user moves the focus to the digital person control, the image data and the target voice are played.
In some embodiments, receiving user input to manage digital person instructions while the display device 200 displays a digital person selection interface;
in response to a user entering an instruction to manage a digital person, the control display 260 displays a digital person management interface that includes a delete control, a modify control, and a disable control corresponding to at least one digital person.
And if the instruction of selecting the deletion control is received by the user, deleting the related data corresponding to the digital person.
If an instruction of selecting the forbidden control is received by the user input, relevant data corresponding to the digital person is reserved and marked as forbidden.
If an instruction of selecting the modification control is received from the user, the display 260 is controlled to display the modification identification code, and after the modification identification code is scanned by the terminal 300, the user video or photo can be re-uploaded at the terminal 300 to change the image of the digital person, and/or the user audio can be re-uploaded at the terminal 300 to change the voice characteristics of the digital person, and/or the name/wake-up word of the digital person can be changed at the terminal 300.
In the process of customizing the digital person, the user may exit the customization process halfway at any time; the target application program of the terminal 300 records data in real time and caches it on the server, so that each piece of user data is recorded. When the user re-enters partway through, the target application program acquires the previously recorded data from the server, which makes it convenient for the user to continue the operation and avoids re-recording. If the user is not satisfied, the user can also choose to re-record at any time.
The embodiments of the present application do not limit the order in which video recordings, audio recordings, and digital persons are named.
In some embodiments, a schematic diagram of digital human interaction is shown in FIG. 15. The display device 200 displays a two-dimensional code. After scanning the two-dimensional code, the terminal 300 receives the recorded video and audio of the user. The terminal 300 transmits the recorded video and audio to the server 400, and the server 400 obtains the customized data of the digital person, including the image and voice characteristics of the digital person, through the human voice cloning technology and the image preprocessing technology. The server 400 transmits the digital person image to the terminal 300 and the display device 200, respectively. The display device 200 presents a digital human figure on a user interface.
In some embodiments, the display apparatus 200 and the terminal 300 do not need to establish an association relationship. The add digital person interface of fig. 9 also includes a local upload control 92. Upon receiving an instruction of the user selecting the local upload control 92, the camera of the display device 200 is started to shoot image data of the user, or local videos and pictures are displayed for the user to select locally stored image data. The image data is uploaded to the server 400, the server 400 performs face point detection and digital figure data generation, and the display device 200 displays the digital human figure based on the digital figure data sent by the server 400. Similarly, the sound collector of the display device 200 may collect the environmental sound, and the display device 200 may send it to the server 400 for environmental sound detection. Audio of the user reading the target text may also be transmitted to the server 400 through the sound collector of the display apparatus 200 or the voice collection function of the control device 100, and the server 400 generates the digital human voice feature.
The embodiments of the present application further refine some of the functions of the server 400. The server 400 performs the following steps, as shown in fig. 16.
Step S1601: the receiving display device 200 transmits voice data input by a user;
after the digital human interactive program is started, the display device 200 receives voice data input by a user.
In some embodiments, the step of initiating a digital human interactive program comprises:
when the display device 200 displays a user interface, receiving an instruction of selecting a control corresponding to the digital person application, which is input by a user, wherein the user interface comprises a control corresponding to the installation application of the display device 200;
in response to a user entered instruction to select a digital person application corresponding control, a digital person entry interface as shown in FIG. 6 is displayed.
In response to a user input selecting the natural conversation control 62, a digital human interactive program is started, waiting for the user to input voice data through the control device 100, or controlling the sound collector to start collecting the user's voice data. The natural conversation includes a chit-chat mode, i.e., the user can casually chat with the digital person.
In some embodiments, the step of initiating a digital human interactive program comprises:
receiving environmental voice data collected by a sound collector;
when the environment voice data is detected to be larger than or equal to a preset volume or the sound signal time interval of the environment voice data is detected to be larger than or equal to a preset threshold, judging whether the environment voice data comprises a wake-up word corresponding to a digital person or not;
If the environment voice data comprises a wake-up word corresponding to a digital person, a digital person interaction program is started, a sound collector is controlled to start collecting voice data of a user, and a voice receiving frame is displayed on a floating layer of a current user interface;
if the environmental voice data does not include the wake-up word corresponding to the digital person, the related operation of displaying the voice receiving frame is not performed.
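A minimal sketch of the wake-up decision described above is shown below. It assumes the ambient voice data has already been transcribed upstream and that wake words are the registered digital person names; the volume threshold and function name are illustrative assumptions.

```python
def should_wake(ambient_text: str, volume_db: float,
                digital_person_names: set,
                volume_threshold_db: float = -30.0) -> bool:
    """Start the digital human interactive program only if the ambient speech is loud
    enough and contains a registered digital person name used as the wake word."""
    if volume_db < volume_threshold_db:
        return False                                   # too quiet: ignore
    return any(name in ambient_text for name in digital_person_names)

# Example (hypothetical): should_wake("Xiaole, play some music", -12.0, {"Xiaole"}) -> True
```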
In some embodiments, the digital human interactive program and a voice assistant may both be installed in the display device 200. Upon receiving an instruction from the user to set the digital human interactive program as the default interactive program, it is set as the default; received voice data may then be forwarded to the digital human interactive program, which sends the voice data to the server 400. Alternatively, voice data may be received directly by the digital human interactive program and sent to the server 400.
In some embodiments, after the digital human interactive program is initiated, voice data entered by the user pressing a voice key of the control device 100 is received.
Wherein, the voice data collection is started after the user starts to press the voice key of the control device 100, and the voice data collection is ended after the user stops to press the voice key of the control device 100.
In some embodiments, after the digital human interactive program is started, when the voice receiving frame is displayed on the floating layer of the current user interface, the sound collector is controlled to start collecting voice data input by a user. If voice data is not received for a long time, the digital human interactive program may be turned off and the display of the voice receiving frame may be canceled.
In some embodiments, display device 200 receives voice data entered by a user and transmits the voice data and the user selected digital person identification to server 400. Digital person identification is used to characterize the image, voice characteristics, name, etc. of a digital person.
In some embodiments, after the display device 200 receives voice data input by the user, the voice data and the device identification of the display device 200 are transmitted to the server 400. The server 400 obtains the digital person identifier corresponding to the device identifier from the database. It should be noted that, when the display device 200 detects that the user changes the digital person of the display device 200, the changed digital person identifier is sent to the server 400, so that the server 400 changes the digital person identifier corresponding to the device identifier in the database to the modified digital person identifier. According to the embodiment of the application, the user does not need to upload the digital person identifier every time, and the digital person identifier can be directly obtained from the database.
In some embodiments, the user may select the digital person desired to be used through a digital person image displayed through a digital person selection interface as shown in FIG. 7.
In some embodiments, each created digital person has a unique digital person name that can be set as a wake word, and the digital person selected by the user can be determined from the wake word included in the ambient voice data.
In some embodiments, the voice data received by the display device 200 input by the user is streaming audio data in nature. After receiving the voice data, the display device 200 transmits the voice data to the sound processing module, and performs acoustic processing on the voice data through the sound processing module. The acoustic processing includes sound source localization, denoising, sound quality enhancement, and the like. The sound source localization is used for enhancing or preserving the signals of target speakers under the condition of multi-person speaking, suppressing the signals of other speakers, tracking the speakers and carrying out subsequent voice directional pickup. Denoising is used to remove environmental noise in speech data, and the like. The sound quality enhancement is used to increase the intensity of the speaker's voice when it is low. The purpose of the acoustic processing is to obtain a cleaner and clearer sound of the target speaker in the voice data. The acoustically processed voice data is transmitted to the server 400.
In some embodiments, the display device 200, upon receiving voice data input by a user, directly transmits to the server 400, acoustically processes the voice data by the server 400, and transmits the acoustically processed voice data to the semantic service. After performing processing such as speech recognition and semantic understanding on the received speech data, the server 400 transmits the processed speech data to the display device 200.
Step S1602: recognizing a voice text corresponding to the voice data;
the semantic service of the server 400 recognizes a voice text corresponding to the voice data using a voice recognition technology after receiving the voice data.
Step S1603: determining emotion types and emotion reasons based on the voice text;
as shown in fig. 17, the server 400 includes a dialogue emotion recognition module, which is functionally divided into several sub-modules: a text-based emotion primary classification module, emotion secondary classification module and emotion cause extraction module; a common sense knowledge acquisition module based on a common sense knowledge graph; an emotion disambiguation module for resolving conflicts among different modalities; and an emotion inference module fusing social common sense. The emotion type and emotion cause after comprehensive decision are finally output as the result of the emotion recognition module, providing a basis and support for downstream emotion dialogue management and emotion expression.
In some embodiments, the emotion type includes a primary emotion.
In some embodiments, the emotion type includes a primary emotion and a secondary emotion. The primary emotion is the emotion category, and the secondary emotion is finer-grained emotion information combining the primary emotion with an event, a psychological state, or an emotion intensity degree.
The dialogue emotion recognition module supports two types of text input; either type can be input alone, or both can be input simultaneously.
1) Natural language text: the voice text recognized from the voice data by the speech recognition engine (ASR) of the server 400. For a natural language query request entered by the user, the user's emotion and the cause or event that triggered the emotion are identified. For example, for the input query = "I worked overtime today and I'm so tired", the recognition result is primary emotion label = {sad}, secondary emotion sub_label = {state_tired}, and emotion cause causal_span = {worked overtime}.
2) Structured text data: results of the multi-modal recognition engine of the server 400. User emotions and events identified from modalities such as gestures, limb movements, facial expressions, voiceprint features and sound are input to the system in structured text format. For example, in a fitness scene where the accuracy of the user's follow-along actions is low, the input is: emotion_label = {sad}, sub_label = {state_tired}, event = {high error rate in sports movements}.
The emotion primary classification module performs primary classification on the user request (voice text) to obtain the primary emotion. The required emotion categories are set according to the actual use scene. By way of example, by analyzing a large number of user logs, the emotion categories are defined as happy, sad, angry, and fuzzy. The primary classifier is a typical NLP (Natural Language Processing) classification task, implemented using neural network models such as RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory network), or by fine-tuning the weights of the linear layer and activation function on top of a pre-trained model such as BERT or ELECTRA. The voice text is input into the primary classifier to obtain the primary emotion.
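A minimal sketch of such a primary classifier, assuming the pre-trained-model route with the Hugging Face transformers library, is shown below. The checkpoint name, label set ordering and helper function are illustrative assumptions; in practice the model would first be fine-tuned on emotion-labeled dialogue data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["happy", "sad", "angry", "fuzzy"]   # the four primary emotions from the example above

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS))  # fine-tuned weights assumed

def classify_primary_emotion(voice_text: str) -> str:
    """Run the primary emotion classifier on one utterance."""
    inputs = tokenizer(voice_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```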
In some embodiments, the emotion classification module includes an emotion secondary classifier, which may be defined as a separate class hierarchy and implemented as an ordinary multi-class classification task. The voice text is input into the emotion secondary classifier to obtain the secondary emotion.
In some embodiments, the information contained in the dialogue is further mined on the basis of the primary emotion classification, finer-grained dialogue emotion recognition is performed, and the user request is secondarily classified from the point of view of psychological states. The secondary classification and the primary classification are defined as one hierarchical emotion system; each primary emotion is further split into secondary emotions combining events, psychological states and emotion intensity in the scene. "Angry" as a primary emotion can be subdivided into: angry-event-traffic jam, angry-state-rage, etc.; "happy" can be subdivided into: happy-event-love, happy-state-excitement, etc.; "fuzzy" can be subdivided into: fuzzy-state-embarrassment, fuzzy-state-confusion, etc.
The emotion category is designed by referring to general classification and combining the application scene of the product and the task requirement. Illustratively, 36 secondary emotions are defined.
The hierarchical emotion system has the advantages that progressive relation exists between the primary emotion and the secondary emotion, and the result of primary emotion classification can be used as priori knowledge of the secondary emotion to be added into the classifier, so that accuracy and robustness of fine-grained emotion classification are improved.
As shown in FIG. 18, the model of the secondary emotion classifier includes a tokenizer, a Transformer Encoder layer, and a linear classifier. The tokenizer and the Encoder layer may use the pre-trained language model BERT. The categories of the primary emotion are added to the tokenizer's vocabulary as special tokens, the model is trained with emotion-labeled data, and the weights of the linear classifier are fine-tuned to obtain the secondary emotion classifier.
The primary emotion and the voice text are spliced and input into the secondary emotion classifier to obtain the secondary emotion.
For example, the primary emotion is "angry" and the voice text is "The road is jammed again today, it is like this every Monday". The spliced data "[angry] The road is jammed again today, it is like this every Monday" is input into the secondary emotion classifier to obtain the secondary emotion "angry-event-traffic jam".
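A minimal sketch of how the primary emotion label can be added to the vocabulary as a special token and spliced in front of the voice text is given below, again assuming the Hugging Face transformers library; the token strings, checkpoint and the secondary label count are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

PRIMARY_TOKENS = ["[happy]", "[sad]", "[angry]", "[fuzzy]"]   # primary emotions as special tokens

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
tokenizer.add_special_tokens({"additional_special_tokens": PRIMARY_TOKENS})

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=36)        # e.g. 36 secondary emotions, as defined above
model.resize_token_embeddings(len(tokenizer))  # make room for the newly added tokens

def secondary_classifier_input(primary_emotion: str, voice_text: str):
    """Splice '[primary] voice_text' as the secondary-classifier input."""
    return tokenizer(f"[{primary_emotion}] {voice_text}",
                     return_tensors="pt", truncation=True)
```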
The emotion cause extraction module extracts emotion cause text fragments from the dialogue. Emotion cause extraction can be handled as a sequence labeling task. Illustratively, the sentence to be labeled is segmented, encoded into word vectors, and input into the model. The model adopts a structure of 12 layers of Transformer Encoder connected to one CRF (Conditional Random Field) layer, where the output of each layer serves as the input of the next, and the label sequence with the highest probability is finally obtained as the prediction result. For example: query (voice text) = "I worked overtime today and I'm so tired"; the emotion cause recognition result is {"O", "O", "O", "B-Causer", "I-Causer", "O", "O", "O", "O"}. B-Causer represents the starting word of the emotion cause, I-Causer represents an intermediate word of the emotion cause, and O represents a non-emotion-cause word.
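Given such a B-Causer/I-Causer/O label sequence, recovering the cause span is a simple decoding step. The sketch below assumes a single contiguous cause span as in the example; the upstream Transformer+CRF tagger itself is not shown and the function name is hypothetical.

```python
def extract_cause_span(tokens: list[str], labels: list[str]) -> str:
    """Collect the tokens tagged B-Causer / I-Causer into the emotion cause span."""
    span = [tok for tok, lab in zip(tokens, labels)
            if lab in ("B-Causer", "I-Causer")]
    return "".join(span)

# Example matching the text above (tokens are illustrative):
# extract_cause_span(["I", "worked", "overtime", "today", "so", "tired"],
#                    ["O", "B-Causer", "I-Causer", "O", "O", "O"]) -> "workedovertime"
```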
Based on the open-source large-scale common sense knowledge base ATOMIC (a knowledge base of event-centered inference knowledge tuples related to social interaction, in which entities are connected by nine relations covering events and psychological states for given events), a large number of event triples related to the product use scenes are supplemented to serve as the basic common sense knowledge base of the emotion dialogue system.
In some embodiments, if no emotion cause can be extracted after the voice text is input to the emotion cause extraction module, that is, every word in the voice text is labeled as a non-emotion-cause word, the voice text is input to the common sense knowledge acquisition module. The common sense knowledge acquisition module identifies events in the voice text and queries the causal relations (event triples) of those events in the social common sense base, which serve as the basis for subsequent comprehensive reasoning and dialogue strategy formulation.
In some embodiments, a knowledge reasoning model is pre-trained based on the common sense base. Based on the events identified in the dialogue, the reasoning model produces representations of the different relations, which serve respectively as psychological knowledge and event knowledge; these are input into the emotion classification model or the empathetic reply model as external knowledge to enhance the dialogue representation of the downstream model.
By way of example, the voice text "I failed my exam today" contains no emotion words itself; the hidden emotion in it is easily missed by the emotion classification model, and no emotion cause can be extracted. However, by means of the common sense graph, the event "X failed the exam" can be matched to triples such as (exam failed, xReact, depressed) and (exam failed, xWant, to be comforted); this tuple information is then sent to the comprehensive reasoning sub-module to correct the emotion recognition result, or the knowledge obtained through the pre-trained model is provided to the emotion classification model, so that the emotion and emotion cause of the sentence "I failed my exam today" can be correctly recognized.
In some embodiments, the server 400 is processed by the common sense knowledge acquisition module after receiving the structured text data sent by the display device 200.
Upon detecting that a specific event is triggered, the display device 200 actively sends the event to the server 400 in the form of structured text data. For example: the display device 200 detects that the accuracy of the user's follow-along actions in a fitness scene is lower than a preset value, and transmits the exercise event and the action accuracy to the server 400. The common sense knowledge acquisition module searches causal relations based on the action accuracy event and can determine the emotion type and emotion cause.
In some embodiments, the emotion recognition module further includes a non-text modality based emotion module, such as a speech emotion recognition module, an image emotion recognition module, and a physiological signal emotion recognition module.
The speech emotion recognition module recognizes the emotion state of the speaker by analyzing the tone, audio characteristics, speech content, and the like in the speech data. For example, by analyzing characteristics of pitch, volume, speech speed, and the like in voice data, it is possible to determine whether a speaker is anger, happy, sad, neutral, or the like.
The image emotion recognition module recognizes the emotion state of a person by analyzing facial expression features in a face image or video. For example, by analyzing movements and changes of eyes, eyebrows, mouth, and the like in a facial expression, it is possible to determine whether an emotional state of a person is anger, happy, sad, surprise, or the like.
The physiological signal emotion recognition module recognizes an emotion state of a person by analyzing physiological signals of the person, such as heart rate, skin conductance, brain waves, and the like. For example, by monitoring changes in heart rate, it may be determined whether a person is stressed, relaxed, excited, or the like.
In most cases, emotion obtained by different mode emotion recognition modules is consistent. However, when emotion is inconsistent, the emotion disambiguation module is required to perform disambiguation processing.
A step of emotion disambiguation, comprising:
acquiring a first primary emotion of a first mode emotion recognition module and a second primary emotion of a second mode emotion recognition module;
judging whether the first primary emotion and the second primary emotion are opposite in polarity, wherein opposite polarity means that one emotion is positive and the other is negative.
If the polarity of the first primary emotion is opposite to that of the second primary emotion, comparing the result priority of the first-mode emotion recognition module with that of the second-mode emotion recognition module;
if the result priority of the first-mode emotion recognition module is greater than that of the second-mode emotion recognition module, determining that the output primary emotion is the first primary emotion;
if the result priority of the first-mode emotion recognition module is lower than that of the second-mode emotion recognition module, determining that the output primary emotion is the second primary emotion.
If the first primary emotion and the second primary emotion are not opposite in polarity, emotion fusion is performed on the first primary emotion and the second primary emotion, and the fused emotion is output as the primary emotion; emotion complementation among the modes is the main strategy in this case.
For example, the voice text is "You are so awesome" with emotion classified as "happy", while the image emotion recognition module recognizes "rolling eyes" and classifies the emotion as "angry". Combined with everyday experience, the facial expression reflects the speaker's real inner attitude, so the result priority of the image emotion recognition module is higher than that of the text emotion recognition module; therefore, the output primary emotion is "angry". If the emotion recognition result of one mode is "fuzzy" and that of another mode is "happy", "fuzzy" and "happy" can be fused: the emotion lies between "fuzzy" and "happy" and can be represented by emotion intensity. For example, if the emotion intensity of "fuzzy" is 0 and that of "happy" is 1 (strong), an emotion between "fuzzy" and "happy" may have intensity 0.33 (general) or 0.66 (normal).
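The disambiguation logic above can be sketched as a small priority-and-fusion rule. The modality priorities, polarity sets and intensity values below are illustrative assumptions drawn from the example, not a definitive specification.

```python
MODALITY_PRIORITY = {"image": 3, "speech": 2, "text": 1}   # higher = higher result priority (illustrative)
POSITIVE, NEGATIVE = {"happy"}, {"sad", "angry"}

def polarity(emotion: str) -> int:
    return 1 if emotion in POSITIVE else (-1 if emotion in NEGATIVE else 0)  # 0 covers "fuzzy"/neutral

def disambiguate(emotion_a: str, modality_a: str, emotion_b: str, modality_b: str):
    """Return (primary_emotion, intensity) after cross-modal disambiguation."""
    if polarity(emotion_a) * polarity(emotion_b) < 0:        # opposite positive/negative polarity
        winner = emotion_a if MODALITY_PRIORITY[modality_a] >= MODALITY_PRIORITY[modality_b] else emotion_b
        return winner, 1.0
    if polarity(emotion_a) == 0 and polarity(emotion_b) != 0:
        return emotion_b, 0.33                               # fused emotion between "fuzzy" and the other
    if polarity(emotion_b) == 0 and polarity(emotion_a) != 0:
        return emotion_a, 0.33
    return emotion_a, 1.0                                    # same polarity: keep the result, full intensity

# disambiguate("happy", "text", "angry", "image") -> ("angry", 1.0), matching the example above.
```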
The emotion reasoning module is equivalent to the decision sub-module of the whole emotion cognition module. Its main role is to summarize the results of the sub-modules, combining the primary emotion, secondary emotion and emotion cause identified in the multi-modal dialogue with the event tuple information retrieved from the social common sense graph, and finally output the comprehensively decided primary emotion, secondary emotion and emotion cause to the downstream modules.
The emotion inference module mainly integrates information and makes a decision when the emotion classification result is inconsistent with the emotion derived from social common sense. The decision method is to set priorities for the sub-modules of the emotion cognition module; when emotions are inconsistent, the result with the higher priority prevails. For example, the voice text is "I failed my exam" and the emotion is classified as "neutral", but querying the common sense graph yields (exam failed, xReact, depressed); that is, from the perspective of social common sense the emotion at this time is "depressed". The emotion priority of social common sense is higher than that of the text, so the emotion is mapped into the primary emotion label system and the final output is: {user emotion category: sad, secondary emotion: sad-event-exam, emotion cause: exam failed}.
In some embodiments, the phonetic text comprises current phonetic text.
In some embodiments, the voice text includes the current voice text and a preset number of preceding historical voice texts and replies. For example, the voice text includes the current voice text and the two voice texts and replies preceding it, spliced together.
In some embodiments, the voice text includes a current voice text and a historical voice text and reply for a pre-set duration of the current voice text. For example, the phonetic text includes the current phonetic text and replies within the first 3s of the current phonetic text.
In some embodiments, the voice text includes a current voice text and a previous preset number of historical voice texts and replies within a previous preset time period of the current voice text. For example, the phonetic text includes the current phonetic text and the phonetic text and replies of the last three pieces within the first 3s of the current phonetic text.
Step S1604: determining a reply emotion and a reply strategy according to the voice text and emotion types;
as shown in fig. 17, the server 400 includes a dialogue management module. Dialogue management (Dialog Management, DM) is a central control module of great importance in the task-oriented man-machine dialogue technology framework, and generally comprises two sub-modules: dialogue state tracking (Dialog State Tracking, DST) and dialogue policy learning (Dialog Policy Learning, DPL). In real life, a conversation between people feels comfortable not only because the content is mutually interesting, but also because the emotions resonate. Therefore, in an emotion dialogue system, a central control module is likewise needed: upstream it connects the emotion recognition module and the semantic understanding engine (corresponding to the natural language understanding module in a traditional dialogue system), and downstream it starts the corresponding reply generation module.
Besides tracking information such as user intention and slots and deciding the system's dialogue actions, the emotion factor is also considered. Meanwhile, in order to make the dialogue more personalized, user portrait features and the system persona are used, and finally the emotion state and reply strategy to be adopted by the emotion dialogue system are decided. Emotion dialogue management likewise consists of the two sub-modules DST and DPL.
(1) Emotion dialogue state tracking: according to the emotion cognition results of the emotion cognition module on the user, combined with the dialogue context and user preference, the emotion state to be adopted by the emotion dialogue system in the next step is predicted. For example, 17 kinds of reply emotions such as excitement, sadness, worry, curiosity, disappointment and self-blame are designed; the system's emotion state multiplexes the emotion system used in user emotion cognition, including primary emotion and secondary emotion, and in addition the decision basis is tracked as the state reason. Besides the classification task for the emotion state, because the previous module provides highly interpretable information such as the user's emotion and its cause, a series of logic rules can be carefully designed in the online system, and emotion state tracking can be performed by integrating the semantic understanding result, the business query result and the user's character features.
(2) Empathetic reply strategy learning: if the reply of the dialogue system merely carries emotion, it is not enough to convey the signal "I understand you" to the user. Sometimes, when the user expresses anger, it helps empathy for the system to appropriately express some anger as well; but empathy does not depend entirely on emotion. For example, asking a question is a common psychological empathy strategy: a psychological counselor often communicates with a visitor by asking questions, which on the one hand encourages the visitor to actively disclose more real thoughts, and on the other hand conveys the signals "I am listening" and "I care about your experience". The application can design empathetic reply strategies of different granularities according to the capability of the system's downstream NLG (Natural Language Generation) module. Illustratively, 14 reply strategies including comfort, approval, suggestion, questioning and objection are defined, and a classification task is designed to calculate the confidence with which each strategy should be adopted.
In some embodiments, the step of determining a reply emotion and reply strategy from the phonetic text and emotion type comprises:
And searching a reply emotion and a reply strategy corresponding to the primary emotion or the secondary emotion in a mapping table, wherein the mapping table comprises mapping relations of the primary emotion, the secondary emotion, the reply emotion and the reply strategy.
The mapping relation in the mapping table is many-to-many, namely, the emotion of the user has various system reply emotions and reply strategies as candidates.
The mapping table is shown in table 1, for example.
TABLE 1
If the secondary emotion is sad-state-tired, the reply emotion is worry and the reply strategy is comfort. If the primary emotion is fuzzy, the reply emotion is curiosity and the reply strategy is questioning.
In some embodiments, the reply emotion, emotion intensity and reply strategy corresponding to the primary emotion or the secondary emotion are looked up in a mapping table, and the mapping table comprises mapping relations of the primary emotion, the secondary emotion, the reply emotion, emotion intensity and reply strategy.
The mapping table is shown in table 2, for example.
TABLE 2
Primary emotion | Secondary emotion   | Reply emotion | Emotion intensity | Reply strategy
Happy           | Happy-event-love    | Excitement    | Strong            | Blessing
Sad             | Sad-state-tired     | Worry         | Normal            | Comfort
Angry           | Angry-event-scolded | Self-blame    | Strong            | Suggestion
Fuzzy           | Neutral             | Curiosity     | General           | Questioning
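A minimal in-memory sketch of the mapping-table lookup is shown below. The keys and values mirror the rows of Table 2; the fallback behavior when a secondary emotion is missing is an assumption, since the original only states that the table maps primary/secondary emotion to reply emotion, intensity and strategy.

```python
REPLY_MAPPING = {
    ("happy", "happy-event-love"):    ("excitement", "strong", "blessing"),
    ("sad",   "sad-state-tired"):     ("worry",      "normal", "comfort"),
    ("angry", "angry-event-scolded"): ("self-blame", "strong", "suggestion"),
    ("fuzzy", "neutral"):             ("curiosity",  "general", "questioning"),
}

def lookup_reply(primary: str, secondary: str):
    """Look up (reply emotion, emotion intensity, reply strategy); fall back to a
    primary-emotion-only match if the secondary emotion is not in the table."""
    if (primary, secondary) in REPLY_MAPPING:
        return REPLY_MAPPING[(primary, secondary)]
    for (p, _), value in REPLY_MAPPING.items():
        if p == primary:
            return value
    return ("curiosity", "general", "questioning")   # default entry, illustrative
```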
In some embodiments, the step of determining a reply emotion and reply strategy from the phonetic text and emotion type comprises:
acquiring a user portrait or a system persona, wherein the user portrait is a character label of the user determined based on the user's historical voice dialogues, and the system persona is a character label set by the user;
obtaining a mapping table corresponding to the user portrait or the system persona;
and searching a reply emotion and a reply strategy corresponding to the primary emotion or the secondary emotion in the mapping table.
Different mapping tables are set for different character types. For example: if querying the user portrait or system persona shows that the user's character label is "impatient", the mapping table for impatient users does not contain exploratory reply emotions such as "curiosity", nor the reactive "angry" reply emotion or the "questioning" reply strategy, so as to avoid making the user restless and aggravating negative emotions. If the dialogue system persona is set to "gentle", or the user's character label is "optimistic, loves to chat", more reply emotions and reply strategies are retained in the mapping table.
In some embodiments, the step of determining a reply emotion and reply strategy from the phonetic text and emotion type comprises:
acquiring the character label corresponding to the user portrait or the system persona;
searching the mapping table for the reply emotion and reply strategy corresponding to the character label and the primary emotion or secondary emotion. The mapping table comprises mapping relations of primary emotion, secondary emotion, reply emotion, reply strategy and character label.
The mapping table is shown in table 3, for example.
TABLE 3
Primary emotion | Secondary emotion   | Reply emotion | Reply strategy | Character label
Happy           | Happy-event-love    | Excitement    | Blessing       | Quick-tempered, ...
Happy           | Happy-event-love    | Excitement    | Questioning    | Gentle and considerate, optimistic, ...
Sad             | Sad-state-tired     | Worry         | Comfort        | Gentle and considerate, optimistic, ...
Angry           | Angry-event-scolded | Self-blame    | Suggestion     | Gentle and considerate, optimistic, ...
Fuzzy           | Neutral             | Curiosity     | Questioning    | Gentle and considerate, optimistic, ...
If the secondary emotion is happy-event-love and the character label is "optimistic", the reply emotion is excitement and the reply strategy is questioning. If the secondary emotion is happy-event-love and the character label is "quick-tempered", the reply emotion is excitement and the reply strategy is blessing.
In some embodiments, the step of determining a reply emotion and reply strategy from the phonetic text and emotion type comprises:
Acquiring a service demand identifier;
the user's intent, such as watching movies, listening to music, checking the weather, or device control, is understood. If the user's business need can be satisfied, the service demand identifier is set to a preset value, such as 1. If the user's business need cannot be satisfied, the service demand identifier is set to a non-preset value, such as 0. That is, the service demand identifier being the preset value indicates that the user's business need is fulfilled, and the identifier not being the preset value indicates that it is not fulfilled.
Judging whether the service demand identifier is a preset value or not;
if the service demand mark is a preset value, searching a reply emotion and a reply strategy corresponding to the primary emotion or the secondary emotion in the mapping table;
if the service demand identifier is not the preset value, determining that the reply emotion is a preset emotion and the reply strategy is a preset strategy. The preset emotion is a reply emotion corresponding to an unfulfilled business need, for example sadness or self-blame; the preset strategy is a reply strategy corresponding to an unfulfilled business need, for example suggestion.
For example, the voice text is "I am so happy today, help me order a bouquet of flowers delivered home" and the user emotion is "happy". If the system supports this skill and finds a flower delivery merchant, it can reply with an "excited" or "happy" emotion according to the general reply emotion mapping; but if the system does not support flower delivery or finds no relevant merchant, the reply emotion should be adjusted to "sad", "self-blame", etc. to express apology to the user.
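A minimal sketch of this branch, reusing the lookup_reply sketch after Table 2, is shown below; the fallback emotion/strategy values and the boolean form of the service demand identifier are illustrative assumptions.

```python
def decide_reply_emotion(primary: str, secondary: str, business_fulfilled: bool):
    """Adjust the reply emotion/strategy when the user's business need cannot be met.

    `business_fulfilled` plays the role of the service demand identifier described
    above (preset value = fulfilled).
    """
    if business_fulfilled:
        return lookup_reply(primary, secondary)        # normal mapping-table path
    return ("self-blame", "normal", "suggestion")      # apologetic reply when the need is unsupported
```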
In some embodiments, the step of determining a reply emotion and reply strategy from the phonetic text and emotion type comprises:
searching a to-be-selected reply emotion and a to-be-selected reply strategy corresponding to the first emotion or the second emotion in the mapping table;
determining that the reply emotion which is the most similar to the reply emotion corresponding to the last voice text in the reply emotion to be selected is the reply emotion of the voice text;
and determining a reply strategy corresponding to the reply emotion of the current voice text as the reply strategy of the current voice text.
During a continuous conversation with the user, the conversion speed of the reply emotion needs to be controlled; the reply emotion cannot be selected randomly from the candidate set, so as to prevent the user from feeling that the digital person's mood is erratic. According to pre-designed emotion state transition rules, the state is transitioned using an emotion similar to that of the last round, or a fuzzy emotion.
For example, the reply emotion corresponding to the last voice text is "happy", and the candidate reply emotions are "sad" and "fuzzy". "Fuzzy" should be selected as the current reply emotion.
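One possible way to realize "most similar to the previous reply emotion" is to place the reply emotions on a rough valence scale and pick the closest candidate; the valence scores below are purely illustrative assumptions, standing in for the pre-designed emotion state transition rules mentioned above.

```python
VALENCE = {"excitement": 1.0, "happy": 0.8, "curiosity": 0.3, "fuzzy": 0.0,
           "worry": -0.4, "self-blame": -0.5, "sad": -0.6}   # illustrative scores

def pick_transition_emotion(previous_reply: str, candidates: list[str]) -> str:
    """Choose the candidate reply emotion closest to the previous one so the
    digital person's mood does not swing abruptly between turns."""
    prev = VALENCE.get(previous_reply, 0.0)
    return min(candidates, key=lambda e: abs(VALENCE.get(e, 0.0) - prev))

# pick_transition_emotion("happy", ["sad", "fuzzy"]) -> "fuzzy", matching the example above.
```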
In some embodiments, the step of determining a reply emotion and reply strategy from the phonetic text and emotion type comprises:
acquiring a user portrait or a system persona;
if the user portrait or the system persona is the target character, searching the mapping table for the candidate reply emotions and candidate reply strategies corresponding to the primary emotion or secondary emotion; the target character may refer to an emotionally neutral character, such as mature and steady.
Determining that the reply emotion which is the most similar to the reply emotion corresponding to the last voice text in the reply emotion to be selected is the reply emotion of the voice text;
and determining a reply strategy corresponding to the reply emotion of the current voice text as the reply strategy of the current voice text.
In some embodiments, the step of determining a reply emotion and reply strategy from the phonetic text and emotion type comprises:
inputting the voice text and the first-level emotion into a first linear layer classifier to obtain a reply emotion;
and inputting the voice text and the second emotion into a second linear layer classifier to obtain a reply strategy.
The structure of the first linear layer classifier and the second linear layer classifier is shown in fig. 18.
Determining a reply emotion through a classification task:
Input: input = {[CLS] [em1] q1 [SEP] [re_em1] r1 [SEP] [em2] q2 [SEP]}, where [CLS] and [SEP] are special tokens in the vocabulary. The output vector corresponding to [CLS] is used as the semantic representation of the whole context and fed into the linear layer for the classification task, and [SEP] is the separator between sentences. [emi] and qi represent the primary emotion and text of the i-th user input, and [re_emi] and ri represent the system reply emotion and reply utterance; the input is a variable-length sequence. When the user makes the first request, input = {[CLS] [em1] q1 [SEP]}; once the user has had a dialogue round with the system, the input becomes input = {[CLS] [em1] q1 [SEP] [re_em1] r1 [SEP] [em2] q2 [SEP]}. For example: {[CLS] [angry] The road is jammed again today, it happens every Monday [SEP] [angry] That is so annoying, I would be angry too [SEP] [sad] I am so tired now [SEP]}.
Output: the reply emotion category for the next turn, such as [empathy].
Determining a reply strategy through classification tasks:
The input is: [CLS] the user request of the current dialogue [SEP] the secondary emotion [SEP] whether the cause is known [SEP]. For example, input = {[CLS] The road is jammed again today, it happens every Monday [SEP] angry-event-traffic jam [SEP] yes [SEP]}.
And (3) outputting: the system replies to the policy.
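A minimal sketch of how these two classifier inputs could be assembled is shown below. The helper functions are assumptions; the token layout follows the formats described above.

```python
def build_emotion_history_input(turns: list[tuple[str, str]]) -> str:
    """Splice dialogue history into the [CLS]/[SEP] format used by the reply-emotion
    classifier; `turns` alternates (emotion, text) for user requests and system replies."""
    parts = ["[CLS]"]
    for emotion, text in turns:
        parts.append(f"[{emotion}] {text} [SEP]")
    return " ".join(parts)

def build_strategy_input(user_request: str, secondary_emotion: str, cause_known: bool) -> str:
    """Input for the reply-strategy classifier: request, secondary emotion, cause known or not."""
    return (f"[CLS] {user_request} [SEP] {secondary_emotion} [SEP] "
            f"{'yes' if cause_known else 'no'} [SEP]")

# Example: build_strategy_input("The road is jammed again today", "angry-event-traffic jam", True)
```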
Step S1605: inputting the voice text, emotion type, emotion cause, reply emotion and reply strategy into a generative pre-trained Transformer model to obtain a broadcast text;
as shown in fig. 17, the server 400 includes an emotion expression module. The emotion expression module generates dialogue reply text with emotional color, synthesizes speech audio with emotion, and controls the facial expression of the digital person, so that the dialogue system has human-like emotion expression capability. Empathetic reply text generation is a typical NLG task: a targeted reply is generated based on the system emotion state and empathetic reply strategy decided by the dialogue management module, together with the emotion type and emotion cause identified by the emotion cognition module. The biggest difference from the NLG module in a traditional dialogue system is that empathetic reply generation is concerned not only with the relevance of the dialogue content but also with the communication strategy and emotion state, while also taking the system persona into account.
Empathetic reply generation in the text modality is very similar to the natural language generation task in a traditional dialogue system. The most effective NLG is based on pre-trained models with the GPT (Generative Pre-Trained Transformer) structure, either fine-tuned on a dialogue generation dataset or driven by task prompt words on large models such as GPT-3.5 and GPT-4. Because large models have weaker emotion expression capability and are overly sensitive to prompt words, the embodiment of the application adopts fine-tuning on the pre-trained model GPT-2. Besides the conventional dialogue text, the model input also includes the secondary emotion user_emotion_sublevel and emotion cause emo_cause_span obtained from the emotion cognition module, and the reply emotion emotion_reply_label and reply strategy reply_policy obtained from the dialogue management module. These are spliced to the dialogue text {tok1 tok2 tok3 tok4} through [SEP], namely input = {[CLS] tok1 tok2 tok3 tok4 [SEP] user_emotion_sublevel [SEP] emo_cause_span [SEP] emotion_reply_label [SEP] reply_policy [SEP]}, and then sent to the GPT-2 structure model for training. GPT uses an autoregressive mechanism and generates word by word during prediction: the input serves as the starting text of the segment, and the next word is predicted based on the existing sequence until a terminator is predicted or the preset maximum length is reached.
The voice text, the secondary emotion, the emotion cause, the reply emotion and the reply strategy are input into the generative pre-trained Transformer model to obtain the broadcast text.
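A minimal sketch of this conditioned generation step, assuming the Hugging Face transformers library and a publicly available Chinese GPT-2 checkpoint as a stand-in for the fine-tuned model, is shown below. The checkpoint name, sampling settings and field splicing helper are illustrative assumptions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative checkpoint; the real system would load its fine-tuned GPT-2 weights.
tokenizer = AutoTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

def generate_broadcast_text(voice_text, secondary_emotion, emotion_cause,
                            reply_emotion, reply_strategy, max_new_tokens=50):
    """Splice the conditioning fields onto the dialogue text with [SEP] markers and let
    the model generate the empathetic reply autoregressively."""
    prompt = (f"[CLS] {voice_text} [SEP] {secondary_emotion} [SEP] "
              f"{emotion_cause} [SEP] {reply_emotion} [SEP] {reply_strategy} [SEP]")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=True, top_p=0.9)
    # Keep only the newly generated continuation as the broadcast text.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```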
In some embodiments, after determining the emotion type and emotion cause, it is judged whether they are a target emotion type and target emotion cause; the target emotion type and target emotion cause refer to emotion types and emotion causes whose occurrence frequency, determined from the user logs, is higher than a threshold.
If the emotion type and emotion cause are the target emotion type and target emotion cause, the broadcast text corresponding to them is obtained from a reply template; the broadcast text in the reply template is created manually and updated periodically.
And if the emotion type and the emotion cause are not the target emotion type and the target emotion cause, executing the step of determining a reply emotion and a reply strategy according to the voice text and the emotion type.
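A minimal sketch of this fallback logic is given below, assuming a template table keyed by (emotion type, emotion cause) pairs mined from the user log; the names and example entries are illustrative only.

```python
REPLY_TEMPLATES = {
    # hypothetical entries, curated manually and refreshed from the user log
    ("angry", "traffic jam"): "Being stuck in traffic is really frustrating...",
}

def get_broadcast_text(emotion_type, emotion_cause, generate_reply):
    """Use a curated template for high-frequency cases, else run the model."""
    template = REPLY_TEMPLATES.get((emotion_type, emotion_cause))
    if template is not None:
        return template                                   # fast, controllable path
    return generate_reply(emotion_type, emotion_cause)    # co-emotion model path
```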
The embodiment of the application thus saves the inference time of the co-emotion reply generation model, while making the replies of the dialogue system safer and more controllable.
On top of a conventional conversation robot's ability to fully understand the user's intention, the embodiment can accurately locate the user's emotion and the reason that produced it, infer a logically consistent emotion with the help of multi-modal emotion features and their fusion, and form an appropriate reply according to common sense, improving the emotional relevance of the conversation.
Step S1606: synthesizing broadcasting voice according to the broadcasting text;
in some embodiments, the broadcast voice is synthesized according to the preset voice feature and the broadcast text.
In some embodiments, the broadcast voice is synthesized according to the voice features corresponding to the digital person identifiers and the broadcast text.
And inputting the broadcasting text into a voice clone model corresponding to the trained digital person identifier to obtain broadcasting voice with digital person tone. The broadcast voice is a sequence of audio frames.
In some embodiments, the broadcasting voice with emotion color is synthesized according to the voice characteristics, the reply emotion and the broadcasting text corresponding to the digital person identification.
In some embodiments, after synthesizing the broadcast voice, determining a sequence of keypoints from the broadcast voice;
and carrying out data preprocessing such as denoising on the broadcast voice to obtain voice characteristics. The voice features are input into an encoder to obtain high-level semantic features, the high-level semantic features are input into a decoder, and the decoder is combined with a real joint point sequence to generate a predicted joint point sequence and generate digital human limb actions.
And synthesizing digital human image data according to the key point sequence and the user image data.
In some embodiments, after synthesizing the broadcast voice, determining a key point sequence according to the broadcast text and the reply emotion;
And synthesizing digital human image data according to the key point sequence and the user image data.
In some embodiments, after synthesizing the broadcast voice, determining a digital person parameter sequence according to the broadcast text and the reply emotion, wherein the digital person parameter sequence is a parameter sequence of digital person image, lip shape, expression, action and the like. And synthesizing the digital human image data according to the digital human parameter sequence and the digital human basic model.
In some embodiments, the digital human image data is a sequence of digital human image frames. The digital person data is all image frame sequences and audio frame sequences.
In some embodiments, the digital person data includes a base model (pre-processed image or cartoon image), a sequence of digital person data, and a sequence of audio frames.
Step S1607: and sending the broadcasting voice to the display equipment so that the display equipment plays the broadcasting voice.
In some embodiments, the server 400 sends the broadcast voice to the display device 200, and the display device 200 plays the broadcast voice.
In some embodiments, the digital human image data is a sequence of digital human image frames. The push stream central control service relies on the live broadcast channel to push the digital human image frame sequence and the broadcast voice code to the live broadcast room to complete the digital human push stream.
In some embodiments, the live data push process is shown in fig. 19. The terminal 300 sends a request to the live channel to establish a live channel; a live room is created and returned to the push stream central control service. The push stream central control service transmits the live data consisting of the driving image frame sequence and the broadcast voice through the live channel to the display device 200 in a live push/pull stream mode, and the display device 200 plays them.
The push stream central control service is an important part of the drive display and terminal presentation of the digital person, is responsible for the drive and display of the virtual image, and reflects the customization and drive effect of the whole digital person.
The push stream central control service handles the following display device requests: 1) restart: the push stream central control service interrupts the current video playback, re-applies for a room instance, verifies the validity and sensitivity of the customized image, records the instance state, creates a live room and distributes the broadcast, completing the live preparation; 2) query: the push stream central control service asynchronously processes the request content and performs voice synthesis, key point prediction, image synthesis and live-room push stream actions until the image frame group and the audio frame group have been pushed, then completes the live broadcast, destroys the room and recycles the instance; 3) stop: the push stream central control service interrupts the current video playback, destroys the room and recycles the instance.
In order to ensure the real-time performance of digital person driving, the received request content is subjected to digital person synthesis data in real time by adopting a live broadcast technology and is pushed to a live broadcast room, so that the instant playing of a playing end is realized.
In addition, the push stream central control service uses an instance pool mechanism. The same authentication information applies for a unique instance; the instance pool automatically recycles released instances for other devices to use. Instances that are abnormal or not recycled for too long are automatically discovered and destroyed by the instance pool, and new instances are recreated, so that the number of healthy instances in the pool is maintained.
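A toy sketch of such an instance pool is shown below; it is illustrative only, assuming in-process locking and a simple age-based reaping rule rather than the production push stream service.

```python
import threading
import time
import uuid

class InstancePool:
    """Toy instance pool keyed by authentication info (illustrative only)."""

    def __init__(self, max_age_s=600):
        self._lock = threading.Lock()
        self._in_use = {}          # auth_token -> (instance_id, acquire_time)
        self._idle = []            # recycled instance ids
        self._max_age_s = max_age_s

    def acquire(self, auth_token):
        with self._lock:
            if auth_token in self._in_use:         # one instance per auth info
                return self._in_use[auth_token][0]
            instance_id = self._idle.pop() if self._idle else str(uuid.uuid4())
            self._in_use[auth_token] = (instance_id, time.time())
            return instance_id

    def release(self, auth_token):
        with self._lock:
            entry = self._in_use.pop(auth_token, None)
            if entry is not None:
                self._idle.append(entry[0])        # recycle for other devices

    def reap(self):
        """Destroy instances that are abnormal or held for too long."""
        with self._lock:
            now = time.time()
            for token, (_, t0) in list(self._in_use.items()):
                if now - t0 > self._max_age_s:
                    del self._in_use[token]        # a fresh instance is created later
```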
The display device 200 injects the received encoded digital human image frame sequence and the broadcast voice into the decoder to decode, and synchronously plays the decoded image frame and the broadcast voice, that is, the digital human image and voice.
In some embodiments, the digital person image data includes digital person parameters and a base model. The server 400 transmits the digital human image data and the broadcasting voice to the display apparatus 200, and the display apparatus 200 draws and renders the digital human image based on the digital human image data and synchronously displays the drawn digital human image while the broadcasting voice is being played.
In some embodiments, the server 400, after recognizing the voice data, issues, in addition to the digital person data, requested user interface data or media asset data, etc., in relation to the voice data. The display device 200 displays user interface data issued by the server 400 and digital person data at a designated location. Illustratively, when the user inputs "what is today's weather", the user interface of the display device 200 is as shown in FIG. 20.
In some embodiments, the digital person image is displayed at the user interface layer.
In some embodiments, the digital human image display is displayed in an upper floating layer on the user interface layer.
In some embodiments, the user interface layer is located at an upper layer of the video layer. The digital human image is displayed in a preset area of the video layer, a target area is drawn on the user interface layer, the target area is in a transparent state, and the preset area is overlapped with the target area in position, so that the digital human image of the video layer can be displayed to a user.
In some embodiments, a digital human interaction timing diagram is shown in FIG. 21. After receiving the voice data, the display device 200 transmits the voice data to the semantic service, which transmits the semantic result to the display device 200. The display device 200 initiates a request to the push center control service, generates image synthesis data through voice synthesis, key point prediction, image synthesis service, and the like after the push center control service responds, and pushes the image synthesis data and the audio data to the live broadcasting room. The display device 200 may obtain live data from a live room. When the pushing queue is empty, the push stream central control service automatically ends the push stream and exits the live broadcasting room. The display device 200 detects a no-action timeout, ends the live broadcast, and exits the live broadcast room.
The embodiment of the application supports high-fidelity, small-sample, low-resource customization of a universal digital person for enterprise users and individual users, and provides a novel anthropomorphic intelligent interaction system based on the customized digital person's image and voice. The digital person images include 2D realistic images, 2D cartoon images, 3D realistic images, etc. The user enters the terminal customization process by scanning an application program code; the exclusive digital person image is customized by collecting a few seconds of the user's video or self-shot pictures, and the exclusive digital person voice is customized by collecting 1-10 sentences of the user's audio data. After customization is completed, the image and voice can be selected and switched through the display device 200, and interaction based on voice and text is provided using the selected image and timbre. During interaction, the display device 200 receives the user request; a response (broadcast text) is generated by perception and cognition algorithm services based on semantic understanding, voice analysis, co-emotion understanding and the like; the response is output in video and audio form through the digital person image and voice; the audio and video data are generated by algorithm services such as voice synthesis, face driving and image generation, and are coordinated and forwarded by the push stream central control service to the target display device to complete one round of interaction.
By adopting deep learning technology, the embodiment of the application accurately identifies the user's emotion, recognizes fine-grained emotion around the psychological state and topic event, and also identifies the reason behind the user's emotion; at the same time, the system adopts a strategy based on active listening and appropriate questioning to generate a targeted system reply with reasonable emotional logic, providing the user with a personalized and warm interaction experience.
The embodiments of the present application further refine some of the functions of the server 400. The server 400 performs the following steps, as shown in fig. 22.
Step S2201: the receiving display device 200 transmits voice data input by a user;
step S2202: acquiring user image data and determining a broadcasting text according to voice data;
the server 400 is provided with an avatar database. The image database stores image data uploaded by the user, and the digital person identifier, the user account number or the device identifier of the display device 200 may be stored in correspondence with the image data. The image data corresponding thereto may be acquired according to the digital person identification uploaded by the display device 200, the user account number, or the device identification of the display device 200.
In some embodiments, the image data is a face single frame image, and the user image data (face single frame image) may be directly acquired. The face single frame image can be directly uploaded by a user, the face single frame image can also be a picture which is obtained by processing the picture uploaded by the user through image matting, face enhancement and the like, and the face single frame image can also be a picture which is obtained by processing the video data uploaded by the user through target image frame selection, image matting, face enhancement and the like. The face single frame image may also be a stylized (cartoonized) picture uploaded by the user.
In some embodiments, the avatar data is video data including a face, and the step of acquiring the user image data includes:
acquiring video data;
and selecting a target image frame from the video data, wherein the head position of the human image of the target image frame is in the middle of the image, the eyes of the human image in the target image frame are in an open state, and the facial outline and the five sense organs of the target image frame are complete. The target image frame is user image data.
The semantic service of the server 400 recognizes text contents corresponding to the voice data using a voice recognition technique after receiving the voice data. And carrying out semantic understanding, service distribution, vertical domain analysis, text generation and other processing on the text content to obtain the broadcasting text.
Step S2203: mapping the user image data to a three-dimensional space to obtain image coefficients;
the user image data is a single frame face image. The single frame face image is mapped to a 3D BS (Blend Shape) dimension.
Illustratively, the 3D BS coefficients include four parts: α (image coefficient), β (expression coefficient), γ (mouth motion coefficient) and θ (head motion coefficient). Wherein,
α ∈ R^100: identity is controlled by 100 groups of BS coefficients; adjusting α changes face shape attributes such as gender and face fullness (relative to the base model).
β ∈ R^100: expressions are controlled by 100 groups of BS coefficients; adjusting β controls facial expressions such as closing the left eye or furrowing the eyebrows (relative to the base setting).
γ ∈ R^20: 20 groups of mouth-related BS coefficients control the mouth movements related to speaking; adjusting γ changes the pronunciation lip shape, as shown in fig. 23, where the upper-left image is the mouth shape of the phoneme "a".
θ ∈ R^6: head movements such as raising, lowering and turning the head are controlled by 6 groups of parameters; adjusting θ changes the head motion.
The embodiment of the application uses the 226 groups of 3D BS coefficients [α, β, γ, θ] to describe the 3D mesh information of different speakers (α), different expressions (β), different voice contents (γ) and different head actions (θ).
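The split of the 226 coefficients can be pictured with a small sketch; the group sizes follow the description above (100 identity, 100 expression, 20 mouth, 6 head), while the flat array layout and function names are assumptions for illustration.

```python
import numpy as np

ALPHA, BETA, GAMMA, THETA = 100, 100, 20, 6   # identity / expression / mouth / head

def pack_bs(alpha, beta, gamma, theta):
    """Pack the four coefficient groups into one 226-dim 3D BS vector per frame."""
    assert alpha.shape == (ALPHA,) and beta.shape == (BETA,)
    assert gamma.shape == (GAMMA,) and theta.shape == (THETA,)
    return np.concatenate([alpha, beta, gamma, theta])    # shape (226,)

def unpack_bs(bs226):
    """Split a 226-dim BS vector back into (alpha, beta, gamma, theta)."""
    return bs226[:100], bs226[100:200], bs226[200:220], bs226[220:226]
```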
The mapping from the picture to the three-dimensional space can adopt 3DMM (3D Morphable models, three-dimensional deformable face model), EMOCA (Emotion Driven Monocular Face Capture and Animation, emotion-driven monocular face capturing and animation) and other technologies, and the main functions are to map a single-frame RGB picture to 3D coefficients (image coefficients and expression coefficients), and then reconstruct the 3D coefficients into a mesh structure in the 3-dimensional space through a BS (base template).
Taking the EMOCA algorithm as an example, the input is a single-frame face image and the output is the image coefficient α_0 and the expression coefficient β_0, where β_0 is the coefficient corresponding to the neutral expression. If the user's expression in the single-frame face image is not neutral, it needs to be adjusted to the neutral expression, that is, the expression coefficient is set to β_0.
Step S2204: determining an emotion coefficient sequence;
the digital human expression is controlled by the expression coefficient beta.
In some embodiments, the step of determining the sequence of emotion coefficients comprises:
determining an emotion change sequence corresponding to the broadcast text;
In some embodiments, a database of emotion change sequences is preset. For example, human emotions are classified into seven types: surprise, sadness, anger, fear, happiness, aversion and neutrality, and each emotion corresponds to a group of emotion change sequences {β_0, β_1, ..., β_n, ..., β_1, β_0}; to keep the emotion change continuous, the sequence starts from β_0 and ends at β_0. Before determining the emotion coefficient sequence corresponding to the broadcast text, the emotion type (reply emotion) corresponding to the broadcast text is determined according to the voice data input by the user. The steps for determining the reply emotion have been described in detail above and are not repeated here. After the emotion type corresponding to the broadcast text is determined, the emotion change sequence corresponding to the emotion type is directly looked up in the database.
In some embodiments, the step of determining the emotion change sequence corresponding to the broadcast text includes:
inputting the voice data into a voice emotion classification model to obtain emotion feature vectors;
And inputting the emotion feature vector into a mapping model to obtain an emotion change sequence.
The voice data contain user emotion information. An emotion feature vector E can be obtained through a voice emotion classification model, or from the emotion embedding label before voice synthesis; to let E control β, a mapping model from the emotion feature vector E to β needs to be trained. The model can adopt an MLP (Multilayer Perceptron) architecture or a Transformer architecture. The emotion label is used to make β carry emotion color, yielding the emotion change sequence {β_0, β_1, ..., β_n, ..., β_1, β_0}.
The training method of the emotion feature vector-beta mapping model is as follows:
1) Data preparation: a performer with good expressive ability is selected; 6 hours of neutral corpus are recorded, plus 1 hour of parallel emotion data for each of the six non-neutral emotions (anger, surprise, fear, etc.).
2) Data processing: the expression coefficient β is obtained frame by frame through a 3DMM method or the EMOCA model, and parallel data of different emotion states are aligned in the time dimension using a DTW (Dynamic Time Warping) algorithm.
3) Emotion label presetting: emotion labels are set in one-hot mode; neutrality and the other six emotions are represented by a 7-dimensional vector, for example happiness as [0,1,0,0,0,0,0] and neutrality as [1,0,0,0,0,0,0].
4) Mapping model: a Transformer structure is adopted; the FastSpeech structure of the speech synthesis model can be used as a reference.
5) Training: the input is the neutral β plus the emotion label, the emotion β corresponding to the emotion label is used as the training label, the loss is the MSE (mean square error) between the label β and the output β, and an emotion discriminator is used for additional control.
In the interactive calling process, the length T of the broadcast voice needs to be determined, and frame-skipping extraction or linear interpolation is performed on a certain emotion change sequence so that the extracted sequence β_select has length T. The final emotion coefficient sequence is β_T = {β_0, β_1, ..., β_T}.
Determining the duration of broadcasting voice according to the text length of the broadcasting text;
and determining the emotion coefficient sequence based on the emotion change sequence and the time length.
The step of determining the emotion coefficient sequence based on the emotion change sequence and the duration includes:
judging whether the duration of broadcasting voice is less than or equal to a target value, wherein the target value is the total duration of emotion change sequence display;
if the duration of the broadcast voice is less than or equal to the target value, removing the emotion coefficients of the target number in the emotion change sequence to obtain an emotion coefficient sequence, wherein the target number is determined by the duration, the target value and the frame rate;
target number= | broadcast voice duration-target value| x frame rate;
The target number of emotion coefficients may be randomly removed from the emotion change sequence, or may be removed from the emotion change sequence at fixed intervals.
Illustratively, the target value is 2s, the broadcast voice is 1.6s, the frame rate is 25fps, and the emotion change sequence includes 50 β coefficients. Target number = |broadcast voice duration - target value| x frame rate = |1.6 - 2| x 25 = 10. 10 β coefficients can be randomly removed from the 50 β coefficients, or 1 β coefficient can be removed from every 5 β coefficients; the remaining 40 β coefficients form the emotion coefficient sequence.
If the duration of the broadcast voice is greater than the target value, inserting the emotion coefficients of the target number into the emotion change sequence to obtain an emotion coefficient sequence.
The random position in the emotion change sequence can be added with a coefficient which is the same as the previous (or the next) emotion coefficient, or the fixed position or the random position in the emotion change sequence can be added with a middle value of the corresponding coefficient of the front and rear positions where the position is located.
Illustratively, the target value is 2s, the broadcast voice is 2.4s, the frame rate is 25fps, and the emotion change sequence includes 50 β coefficients. Target number = |broadcast voice duration - target value| x frame rate = |2.4 - 2| x 25 = 10. 10 β coefficients can be randomly inserted among the 50 β coefficients, or 1 β coefficient can be inserted every 5 β coefficients; the resulting 60 β coefficients form the emotion coefficient sequence.
In some embodiments, if the duration of the broadcast voice is greater than the target value, it is determined whether the duration of the broadcast voice is less than or equal to a limit value, the limit value being the longest duration supported by the emotion coefficient sequence;
if the duration of the broadcast voice is less than or equal to the limit value, inserting the emotion coefficients of the target number into the emotion change sequence to obtain an emotion coefficient sequence.
If the duration of the broadcast voice is greater than the limit value, determining an emotion coefficient sequence according to the duration of the broadcast voice, wherein the determining mode is to sequentially select from the first emotion coefficient of the emotion change sequence, and after the last emotion coefficient is selected, sequentially select from the first emotion coefficient again until the duration of the broadcast voice is reached.
Illustratively, the target value is 3s, the broadcast voice is 4s, the frame rate is 25fps, and the emotion change sequence includes 50 β coefficients. The emotion coefficient sequence is 50 + 50 β coefficients, that is, the emotion change sequence followed by the emotion change sequence again.
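The duration adjustment described above can be sketched as follows; the fixed-interval removal/insertion strategy and the wrap-around behaviour beyond the limit value follow the text, while function and variable names are illustrative assumptions.

```python
def fit_emotion_sequence(change_seq, speech_duration, target_value,
                         frame_rate=25, limit_value=None):
    """Resize an emotion change sequence {b0 ... bn ... b0} to the broadcast length."""
    seq = list(change_seq)
    if limit_value is not None and speech_duration > limit_value:
        # wrap around: keep re-reading the change sequence until long enough
        n_frames = int(round(speech_duration * frame_rate))
        return [seq[i % len(seq)] for i in range(n_frames)]

    target_num = int(round(abs(speech_duration - target_value) * frame_rate))
    step = max(len(seq) // max(target_num, 1), 1)
    if speech_duration <= target_value:
        # remove target_num coefficients at roughly fixed intervals
        drop = set(list(range(0, len(seq), step))[:target_num])
        return [c for i, c in enumerate(seq) if i not in drop]
    # otherwise insert target_num coefficients by duplicating a neighbour
    out, inserted = [], 0
    for i, c in enumerate(seq):
        out.append(c)
        if inserted < target_num and (i + 1) % step == 0:
            out.append(c)
            inserted += 1
    return out

# Example from the text: 50 coefficients, 2 s target, 1.6 s speech, 25 fps -> 40 left
trimmed = fit_emotion_sequence(list(range(50)), 1.6, 2.0)
```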
In some embodiments, the step of determining the sequence of emotion coefficients comprises:
inputting the voice data into a voice emotion classification model to obtain emotion feature vectors;
and inputting the emotion feature vector and the duration of the broadcast voice into a mapping model to obtain an emotion coefficient sequence.
An emotion feature vector E can be obtained through a voice emotion classification model, or from the emotion embedding label before voice synthesis, so that E controls β. A mapping model from the emotion feature vector E and the broadcast voice duration to β needs to be trained to obtain the sequence β_T = {β_0, β_1, ..., β_T} corresponding to the broadcast duration. The model may use an MLP architecture or a Transformer architecture. The emotion label is used to make β carry emotion color. A duration label is added to the training input in step 5) of the mapping model training described above.
Step S2205: generating digital human image data based on the broadcasting text, the image coefficient and the emotion coefficient sequence;
the method for generating the digital human image data based on the broadcasting text, the image coefficient and the emotion coefficient sequence comprises the following steps:
inputting the broadcasting text into a voice driving model to obtain a mouth action coefficient sequence;
The embodiment of the application adopts phoneme driving to generate the mouth motion coefficients γ related to the speaking content in the 3D BS coefficients. The deep learning network adopts the speech synthesis network architecture FastSpeech 2; the input is the text content and the output is the coefficient sequence γ_T = {γ_0, γ_1, ..., γ_T}.
1) Phoneme time alignment: the phoneme inventory ct ∈ {"pad", "/", "AA0" ... "x", "z", "zh"} contains 67 English phonemes and 120 Chinese phonemes plus 4 special symbols (space, silence, padding, prosodic boundary), 191 in total. A section of voice data of length T is processed at a 16000 Hz sampling rate into MFCC (Mel Frequency Cepstrum Coefficient) features with a time window of 200 and dimension 39, giving an MFCC of shape [T*80, 39]; the text corresponding to the voice is discretized into a phoneme list Phonemes, e.g. "hello" -> {eos n i2 h aa3 uu3 eos}. Then the MFCC [T*80, 39] and the corresponding Phonemes are force-aligned using Viterbi decoding based on a TDNN (Time Delay Neural Network) to obtain the duration list Duration {n1, n2 ... nm} of the corresponding phonemes, where m equals the length of Phonemes and n1+n2+...+nm = T*80.
2) 3D BS γ spatial alignment: Mesh data are collected at a frame rate of 25 fps. The Mesh point-face structure is first standardized, and then the Mesh structure is aligned in the spatial dimension: the Mesh is scaled by a certain ratio into a space with a fixed value range, a standard face is selected (the central axis through the eyes and nose is x = 0), and all key points are maximally overlapped with 8 groups of anchor points on the nose bridge through rotation and translation. Finally, the spatially aligned Mesh parameters BS [T*25, 226] are obtained (226 comprises α 100, β 100, γ 20 and θ 6), and γ is selected as the training label.
3) Sound-picture alignment: through the above two steps, Phonemes ({eos n i2 h aa3 uu3 eos}), Duration {n1, n2 ... nm} (n1+n2+...+nm = T*80) and BS [T*25, 20] are obtained. Before entering the model, the 80*T and 25*T time axes need to be aligned; bilinear interpolation can be used to map the 25*T-frame, 20-dimensional BS to 80*T frames to achieve alignment (see the interpolation sketch below), obtaining the Input {Phonemes, Duration, BS} required by the model input.
4) BS driving: the phoneme driving model selects the FastSpeech model based on the Transformer structure, and the training data Input {Phonemes, Duration, BS} is extracted from the 4D capture data. The input data are Phonemes {ph1, ph2 ... phm}, Duration {n1, n2 ... nm} and BS [T*80, 20]; the output is the BS offset Delta [T*80, 20] relative to the reference BS. The loss is the MSE between (Delta + reference BS) and the ground-truth BS.
5) And obtaining a voice driving model through training.
During inference on the broadcast text, the phoneme sequence Phonemes and the duration list Duration corresponding to the broadcast text are obtained through the phoneme library and input into the voice driving model to obtain the face key point offset corresponding to the voice segment, that is, the mouth motion coefficient sequence γ_T = {γ_0, γ_1, ..., γ_T}.
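As a sketch of the time alignment in step 3) above, the 25 fps BS sequence can be interpolated onto the 80 frames-per-second MFCC time axis; the use of numpy linear interpolation here is an assumption about the concrete tool, not the patented procedure.

```python
import numpy as np

def align_bs_to_mfcc(bs_25fps, n_mfcc_frames):
    """Interpolate a [T*25, 20] BS sequence onto the [T*80]-frame MFCC time axis."""
    t_src = np.linspace(0.0, 1.0, num=bs_25fps.shape[0])
    t_dst = np.linspace(0.0, 1.0, num=n_mfcc_frames)
    cols = [np.interp(t_dst, t_src, bs_25fps[:, d]) for d in range(bs_25fps.shape[1])]
    return np.stack(cols, axis=1)                     # [T*80, 20]

# e.g. a 2 s clip: 50 BS frames (25 fps) -> 160 aligned frames (80 frames/s)
aligned = align_bs_to_mfcc(np.zeros((50, 20)), 160)
```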
Digital human image data is generated based on the image coefficient, the emotion coefficient sequence, and the mouth motion coefficient sequence.
In some embodiments, the digital human image data includes a figure coefficient, a sequence of emotion coefficients, a sequence of mouth motion coefficients, and user image data.
In some embodiments, the step of generating digital human image data based on the image coefficient, the sequence of emotion coefficients, and the sequence of mouth motion coefficients comprises:
inputting user image data into an encoder to obtain coding characteristics;
inputting the image coefficient, the emotion coefficient sequence, the mouth motion coefficient sequence and the coding characteristic into an image generation network to obtain a driving image frame sequence, namely digital human image data.
As shown in fig. 24, the image coefficient sequence α_T = {α_0, α_0, ..., α_0}, the emotion coefficient sequence β_T = {β_0, β_1, ..., β_T} and the mouth motion coefficient sequence γ_T = {γ_0, γ_1, ..., γ_T} are acquired. The user image data is passed through an encoder, optionally a VGG (Visual Geometry Group) encoder, to obtain the encoding feature F0. F0, α_T, β_T and γ_T are output to the image generation network to obtain the driving image frame sequence. The image generation network can be a Face-vid2vid network or the Generator layer of a StyleGAN-series network.
In some embodiments, α_T = {α_0, α_0, ..., α_0}, β_T = {β_0, β_1, ..., β_T} and γ_T = {γ_0, γ_1, ..., γ_T} are added together to obtain the final BS parameters, and the driving image frame sequence is generated based on the final BS parameters.
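A minimal sketch of assembling the per-frame driving coefficients described above is given below; the concatenated layout, and the encoder/generator placeholders standing in for the VGG encoder and Face-vid2vid/StyleGAN generator, are illustrative assumptions.

```python
import numpy as np

def build_driving_coeffs(alpha0, beta_seq, gamma_seq):
    """Stack per-frame BS parameters: identity repeated, expression and mouth per frame."""
    frames = [np.concatenate([alpha0, beta_t, gamma_t])      # assumed [100+100+20] layout
              for beta_t, gamma_t in zip(beta_seq, gamma_seq)]
    return np.stack(frames)                                  # [T, 220]

def render_digital_human(user_image, coeff_frames, encoder, generator):
    """encoder/generator are callables standing in for the image generation network."""
    f0 = encoder(user_image)                                 # encoding feature F0
    return [generator(f0, c) for c in coeff_frames]          # driving image frame sequence
```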
In some embodiments, the step of generating digital human image data based on the broadcast text, the avatar coefficient, and the emotion coefficient sequence comprises:
determining a head motion coefficient sequence;
in 3D space, head movements are controlled by θ coefficients.
A step of determining a sequence of head motion coefficients, comprising:
determining a head action change sequence corresponding to the broadcasting text;
The head action database is preset, and the head actions include nodding (up and down), shaking (left and right) and turning (left and right rotation), among others. The head action database includes at least one group of head action change sequences, and a head action change sequence includes the sequence corresponding to at least one action. Each group of head action sequences may be represented by {θ_0, θ_1, ..., θ_n, ..., θ_1, θ_0}. A group of head action change sequences may be selected from the head action database randomly or by specification.
In the interactive calling process, the length T of the broadcast voice needs to be determined, and frame-skipping extraction or linear interpolation is performed on a certain head action change sequence so that the extracted sequence θ_select has length T. The final head motion coefficient sequence is θ_T = {θ_0, θ_1, ..., θ_T}.
Determining the duration of broadcasting voice according to the text length of the broadcasting text;
and determining a head motion coefficient sequence based on the head motion change sequence and the duration.
A step of determining a head motion coefficient sequence based on the head motion change sequence and the time length, comprising:
judging whether the duration of broadcasting voice is less than or equal to a target value, wherein the target value is the total duration of the head action coefficient sequence display;
if the duration of the broadcast voice is less than or equal to the target value, removing the target number of head motion coefficients from the head action change sequence to obtain a head action coefficient sequence, wherein the target number is determined by the duration, the target value and the frame rate;
target number= | broadcast voice duration-target value| x frame rate;
the target number of head motion coefficients may be randomly removed in the head motion variation sequence, or may be removed at fixed intervals in the head motion variation sequence.
Illustratively, the target value is 2s, the broadcast voice is 1.6s, the frame rate is 25fps, and the head motion variation sequence includes 50 θ coefficients. Target number = |broadcast voice duration - target value| x frame rate = |1.6 - 2| x 25 = 10. 10 θ coefficients can be randomly removed from the 50 θ coefficients, or 1 θ coefficient can be removed from every 5 θ coefficients; the remaining 40 θ coefficients form the head motion coefficient sequence.
If the duration of the broadcast voice is greater than the target value, inserting a target number of head motion coefficients into the head motion change sequence to obtain a head motion coefficient sequence.
The random position in the head motion change sequence can be added with a coefficient which is the same as the previous (or the next) head motion coefficient, or the fixed position or the random position in the head motion change sequence can be added with a middle value of the corresponding coefficient of the front and rear positions of the position.
Illustratively, the target value is 2s, the broadcast voice is 2.4s, the frame rate is 25fps, and the head motion variation sequence includes 50 θ coefficients. Target number = |broadcast voice duration - target value| x frame rate = |2.4 - 2| x 25 = 10. 10 θ coefficients can be randomly inserted among the 50 θ coefficients, or 1 θ coefficient can be inserted every 5 θ coefficients; the resulting 60 θ coefficients form the head motion coefficient sequence.
In some embodiments, if the duration of the broadcast voice is greater than the target value, it is determined whether the duration of the broadcast voice is less than or equal to a limit value, the limit value being the longest duration supported by the head motion coefficient sequence;
if the duration of the broadcast voice is less than or equal to the limit value, inserting a target number of head motion coefficients into the head motion change sequence to obtain a head motion coefficient sequence.
If the duration of the broadcast voice is greater than the limit value, determining a head action coefficient sequence according to the duration of the broadcast voice, wherein the determination mode is that the head action coefficient sequence is sequentially selected from the first head action coefficient of the head action change sequence, and after the last head action coefficient is selected, the head action coefficient sequence is restarted to continue to be sequentially selected from the first head action coefficient until the duration of the broadcast voice is reached.
Illustratively, the target value is 3s, the broadcast voice is 4s, the frame rate is 25fps, and the head motion variation sequence includes 50 θ coefficients. The head motion coefficient sequence is 50 + 50 θ coefficients, that is, a head motion change sequence followed by a head motion change sequence. Two different head motion change sequences can also be randomly selected.
Digital human image data is generated based on the broadcast text, the image coefficient, the emotion coefficient sequence and the head action coefficient sequence.
Generating digital human image data based on the broadcast text, the image coefficient, the emotion coefficient sequence and the head action coefficient sequence, wherein the method comprises the following steps of:
inputting the broadcasting text into a voice driving model to obtain a mouth action coefficient sequence;
digital human image data is generated based on the image coefficient, the emotion coefficient sequence, the mouth motion coefficient sequence, and the head motion coefficient sequence.
In some embodiments, the digital human image data includes a figure coefficient, an emotion coefficient sequence, a mouth motion coefficient sequence, a head motion coefficient sequence, and user image data.
In some embodiments, the step of generating digital human image data based on the image coefficient, the emotion coefficient sequence, the mouth motion coefficient sequence, and the head motion coefficient sequence comprises:
inputting user image data into an encoder to obtain coding characteristics;
inputting the image coefficient, the emotion coefficient sequence, the mouth motion coefficient sequence, the head motion coefficient sequence and the coding features into an image generation network to obtain a driving image frame sequence, namely digital human image data.
As shown in fig. 24, the image coefficient sequence α_T = {α_0, α_0, ..., α_0}, the emotion coefficient sequence β_T = {β_0, β_1, ..., β_T}, the mouth motion coefficient sequence γ_T = {γ_0, γ_1, ..., γ_T} and the head motion coefficient sequence θ_T = {θ_0, θ_1, ..., θ_T} are acquired. After the user image data is passed through the encoder to obtain the encoding feature F0, F0, α_T, β_T, γ_T and θ_T are output to the image generation network to obtain the driving image frame sequence.
In some embodiments, α_T = {α_0, α_0, ..., α_0}, β_T = {β_0, β_1, ..., β_T}, γ_T = {γ_0, γ_1, ..., γ_T} and θ_T = {θ_0, θ_1, ..., θ_T} are added together to obtain the final BS parameters, and the driving image frame sequence is generated based on the final BS parameters.
Step S2206: generating broadcasting voice based on the broadcasting text;
in some embodiments, the broadcast voice is synthesized from the preset voice feature and the broadcast text.
In some embodiments, synthesizing broadcasting voice according to voice characteristics corresponding to the digital person identifiers and broadcasting texts;
step S2207: the broadcasting voice and the digital person image data are transmitted to the display apparatus 200 so that the display apparatus 200 plays the broadcasting voice and displays the digital person image based on the digital person image data.
In some embodiments, the digital human image data includes a sequence of drive image frames, play drive image frames, and broadcast voice.
In some embodiments, the digital human image data includes a figure coefficient, an emotion coefficient sequence, a mouth motion coefficient sequence, a head motion coefficient sequence, and user image data. The display device 200 inputs the user image data to the encoder to obtain the encoding characteristics; inputting the image coefficient, the emotion coefficient sequence, the mouth motion coefficient sequence, the head motion coefficient sequence and the coding features into an image generation network to obtain a driving image frame sequence, and playing the driving image frame and broadcasting voice.
The embodiment of the application provides a digital human head motion and emotion control method that uses a three-dimensional topological structure as an intermediate state: first, a single-frame picture of the user is acquired and mapped to the 3D BS parameter space by a 3D reconstruction algorithm, so that controllable adjustment of head motion and expression can be realized in the BS space; then an image-generation deep learning architecture is used, and a mapping model from the BS space to the final image is obtained through training. In this way, highly anthropomorphic emotion and head motion editing of a single-frame image is realized, effectively improving the expressive capability of the digital person.
The embodiments of the present application further refine some of the functions of the server 400. The server 400 performs the following steps, as shown in fig. 25.
Step S2501: the receiving display device 200 transmits voice data input by a user;
step S2502: acquiring user image data and original key point data corresponding to the user image data, and determining a broadcasting text according to the voice data;
the original key point data can be stored in the image database together with the user image data, and the original key point data is 3D key point data.
The face key point detection algorithm MediaPipe Face Mesh is adopted, the face surface and depth are obtained through a machine learning algorithm, and the face mesh structure is drawn through 468 key points.
In the embodiment of the application, 153 key points which are strongly related to facial actions such as expression and pronunciation are selected from 468 key points, and specifically include: facial contour (36), mouth (40), eyes (30), eyebrows (22), nose (25), as shown in fig. 26.
Step S2503: determining a reply emotion based on the voice data;
the method for determining the reply emotion has been described in detail above and is not described in detail here.
Step S2504: inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence, wherein the emotion voice key point sequence is a sequence of key points related to expression and pronunciation;
Taking the 7 emotions of neutrality, happiness, sadness, anger, surprise, contempt and fear as an example, the data processing and training process of the emotion mapping voice driving model is as follows:
1) Parallel corpus recording: a speaker speaks the same sentences under different emotion states; 4 hours of neutral corpus are recorded (of which 1 hour is parallel corpus), plus 1 hour for each of the 6 emotion states, giving 10 hours of training data in total.
2) Parallel corpus alignment: the pronunciation length of the same speaker differs considerably under different emotion states, so the parallel corpus is aligned at the voice level through a dynamic time warping (DTW) algorithm. The aligned data correspond to different emotion states of the visual key points under the same phoneme state.
3) The parallel corpus pairs processed in 2) comprise: the neutral-state key points Fls[N, 153*3], the emotion-state key points Fls_e[N, 153*3] and the emotion-state labels Labels[N, 7], where N is the aligned sequence length; the emotion-state labels adopt one-hot mode, i.e. a 7-dimensional feature vector represents the six emotions plus neutrality.
4) The model uses the FastSpeech (speech synthesis model) Encoder. The inputs are Fls[N, 153*3] and Labels[N, 7]; Labels[N, 7] is expanded through an MLP layer to L_hidden[N, 153*3].
5) L_hidden[N, 153*3] and Fls[N, 153*3] are spliced together and passed through the FastSpeech Encoder layers to obtain the output Out[N, 153*3].
6) The MSE loss is calculated between Out[N, 153*3] and Fls_e[N, 153*3], and the model parameters are adjusted through back propagation until the training process converges.
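A PyTorch-style sketch of this mapping is given below for orientation; it substitutes a generic Transformer encoder for the FastSpeech Encoder and uses illustrative layer sizes, so it is a sketch under assumptions rather than the patented model.

```python
import torch
import torch.nn as nn

class EmotionMappingModel(nn.Module):
    """Map neutral key points + one-hot emotion label to emotion-state key points."""

    def __init__(self, kp_dim=153 * 3, n_emotions=7, d_model=256):
        super().__init__()
        self.label_mlp = nn.Sequential(
            nn.Linear(n_emotions, d_model), nn.ReLU(), nn.Linear(d_model, kp_dim))
        self.proj_in = nn.Linear(kp_dim * 2, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.proj_out = nn.Linear(d_model, kp_dim)

    def forward(self, fls_neutral, labels):
        # fls_neutral: [B, N, 459]; labels: [B, N, 7] one-hot, repeated per frame
        l_hidden = self.label_mlp(labels)               # [B, N, 459]
        x = torch.cat([fls_neutral, l_hidden], dim=-1)  # splice the two streams
        h = self.encoder(self.proj_in(x))
        return self.proj_out(h)                         # Out[B, N, 459]

# Training step (sketch): loss = nn.MSELoss()(model(Fls, Labels), Fls_e)
```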
And inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence.
In some embodiments, the reply emotion and emotion intensity are determined based on the voice data;
the method for determining the emotion intensity has been described in detail above and is not described in detail here.
And inputting the broadcast text, the reply emotion and the emotion intensity into an emotion mapping voice driving model to obtain an emotion voice key point sequence.
Recorded data are used in the training process of the emotion mapping voice driving model. During data recording, the performer is required to express the same emotion at 3 different intensity levels; for example, the happy emotion can be divided into three states: mild, normal and strong. During training, the intensity is multiplied, in parameter form, into the latent space of the emotion encoding model, with mild: 0.33, normal: 0.66 and strong: 1. The trained model can then take the broadcast text, the reply emotion and the emotion intensity as input and output the emotion voice key point sequence.
Step S2505: correspondingly replacing the emotion voice key point sequence into an original key point sequence to generate a face key point sequence, wherein the original key point sequence comprises a plurality of original key point data;
The original key point sequence is a sequence formed by a plurality of original key point data, and the number of key point data groups of the original key point sequence is the same as that of the emotion voice key point sequence. The number of emotion voice key points is smaller than or equal to the number of original key points.
Correspondingly replacing the emotion voice key point sequence into the original key point sequence means that, at the same position in the sequence, the original key point data are replaced with the emotion voice key point data; during replacement, only the original key point data representing the same key points are replaced with the corresponding emotion voice key point data.
Illustratively, the emotion voice key point sequence and the original key point sequence each have 10 groups of key point data. The first group of the emotion voice key point sequence includes 20 key point data (such as eye coordinates and mouth coordinates), and the first group of the original key point sequence includes 153 key point data (such as eye coordinates and nose coordinates); only the 20 key point data in the first group of the emotion voice key point sequence are substituted into the first group of the original key point sequence. The replacement rule is that data of the same part replace each other (for example, eye data replace eye data), and the remaining 133 key point data are unchanged.
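The index-wise replacement can be sketched as follows; the index list identifying which of the 153 key points are expression/pronunciation-related is a placeholder assumption.

```python
import numpy as np

# Placeholder for the real indices of the expression/pronunciation-related
# key points among the 153 face key points.
EMO_IDX = np.arange(20)

def merge_keypoints(original_seq, emo_voice_seq, emo_idx=EMO_IDX):
    """original_seq: [N, 153, 3]; emo_voice_seq: [N, len(emo_idx), 3]."""
    merged = original_seq.copy()
    merged[:, emo_idx, :] = emo_voice_seq   # same frame, same key point, replaced
    return merged
```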
A blink sequence can be designed through geometric construction, but because a manually designed blink sequence changes the key point positions linearly, it is difficult to obtain a blink segment that matches the real state, so the blink looks unnatural.
In order to solve the above technical problems, in the embodiment of the present application, before replacing the emotion voice key point sequence into the original key point sequence and generating the face key point sequence, the server 400 performs:
determining a blink key point sequence, wherein the blink key point sequence is a sequence of blink related key points;
a step of determining a sequence of blink keypoints, comprising:
copying the original key point data into a plurality of data;
and inputting a plurality of original keypoints into a keypoint blink model to obtain a blink keypoint sequence, wherein the keypoint blink model is obtained by training on the basis of a voice synthesis model and on the condition of upper and lower eyelid heights.
The data processing and training process of the key point blink model is as follows:
A key point blink model is established through deep learning. The model input is the non-blink key point sequence shown in the Input row of fig. 27, and the model output is the blink state sequence shown in the Label row of fig. 27. The training data require that the recorded subject keep the head absolutely still and that each video frame sequence contain only one natural blink.
100 groups of blink videos of different people are collected with a 1:1 male-to-female ratio; the key point sequences are extracted from the videos and then normalized to obtain the Label. If a key point sequence contains N frames, the input is obtained by duplicating the key points of the first frame of the sequence N times.
In order to improve the robustness of the model, data augmentation is adopted: the existing video data are processed by the StyleGAN stylized adversarial generation network to obtain cartoon-style segments with different eye sizes, which are then added to the training data.
The key point blink model takes a non-blink segment as input and outputs a blink segment, and can also make appropriate predictions for related areas such as the eyebrows and cheeks. The model can be a time-series model such as an RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory network); here the speech synthesis model FastSpeech Encoder is adopted, and the upper and lower eyelid heights are added as generation conditions as required. The specific training process is as follows:
1) Construct the input key point sequence Fls_i[N, 153, 3] and the label key point sequence Fls_l[N, 153, 3], and compute from Fls_l the corresponding upper and lower eyelid height feature sequence H[N, 1].
2) Set the batch size to b and obtain the model input data Inp[b, M, 153*3], H[b, M, 1] and the Ground Truth Gt[b, M, 153*3], where M is the maximum of the b values of N and sequences shorter than M are zero-padded; H[b, M, 1] is the upper and lower eyelid height sequence corresponding to Inp.
3) H[b, M, 1] is expanded through two MLP layers to the latent space feature H_hidden[b, M, 153*3].
4) H_hidden and Inp[b, M, 153*3] are concatenated and input into the FastSpeech Encoder layer to obtain the output Out[b, M, 153*3].
5) The MSE loss is calculated between Out[b, M, 153*3] and Gt[b, M, 153*3], and the model parameters are adjusted through back propagation until the training process converges.
Acquiring a preset blink position;
in some embodiments, the blink positions may be randomly set, but it is desirable to ensure that the interval between blink positions is greater than the length of the blink keypoint sequence, i.e. that the two blink keypoint sequences cannot overlap. It is also ensured that the sequence of blink keypoints is complete.
In some embodiments, the blink positions may be fixed, such as one blink position per preset length of interval.
Determining at least one target area in the original key point sequence based on a preset blink position, wherein the target area is used for replacing the blink key point sequence;
in some embodiments, the target length after the preset blink position is determined as the target area, and the target length is the length of the blink key point sequence.
In some embodiments, the target length before the preset blink position is determined as the target area, and the target length is the length of the blink key point sequence.
In some embodiments, the target length before and after the preset blink position is determined as the target area, and the target length is the length/2 of the blink key point sequence.
And correspondingly replacing the blink key point sequence into a target area of the original key point sequence to obtain the key point sequence after blinking.
Correspondingly replacing the blink key point sequence into the original key point sequence means that, at the same position in the sequence, the original key point data are replaced with the blink key point data.
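A minimal sketch of splicing blink segments into the original sequence at the preset blink positions is shown below; the blink-related key point indices and the non-overlap handling are illustrative assumptions.

```python
import numpy as np

def insert_blinks(original_seq, blink_seq, blink_positions, blink_idx):
    """Splice blink segments into the original key point sequence.

    original_seq: [N, 153, 3]; blink_seq: [L, 153, 3]; blink_idx: indices of the
    blink-related key points (placeholder); blink_positions should be spaced more
    than L frames apart so that two blink segments never overlap.
    """
    out = original_seq.copy()
    L = blink_seq.shape[0]
    for pos in blink_positions:
        if pos + L <= out.shape[0]:                  # keep the blink segment complete
            out[pos:pos + L, blink_idx, :] = blink_seq[:, blink_idx, :]
    return out
```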
After the key point sequence after blinking is obtained, the broadcast text and the reply emotion are input into the emotion mapping voice driving model to obtain the emotion voice key point sequence, and the emotion voice key point sequence is correspondingly replaced into the key point sequence after blinking to generate the face key point sequence.
Most face key points are 2D data; natural and smooth head movements are difficult to obtain from 2D data, so the head movements appear stiff.
To solve the above problem, in the embodiment of the present application, after inputting the broadcast text and the reply emotion into the emotion mapping voice driving model to obtain the emotion voice key point sequence, the server 400 executes:
correspondingly replacing the emotion voice key point sequence into the original key point sequence or the key point sequence after blinking to generate an emotion processed key point sequence;
Acquiring a head-movement affine matrix fitting sequence;
the step of obtaining the head-moving affine matrix fitting sequence comprises the following steps:
extracting key point data in a preset video and normalizing the key point data to obtain a head-movement reference key point sequence;
selecting standard key point data in a head movement reference key point sequence, wherein the standard key point data are key point data of a front face, no expression and no blink;
acquiring a head-movement affine matrix fitting sequence from the standard key point data to the head-movement reference key point sequence by adopting a data fitting algorithm;
storing the head-moving affine matrix fitting sequence to a preset address;
and obtaining a head-movement affine matrix fitting sequence from a preset address.
And generating a human face key point sequence based on the head-movement affine matrix fitting sequence and the key point sequence after emotion processing.
The method for embedding the head motion sequence comprises the following steps:
Coordinate system definition: a coordinate system for the face key points is designed, with the nose tip key point as the origin, the line passing through the origin and perpendicular to the face plane as the X axis, the Y axis passing through the origin parallel to the line through the two eye corners, and the Z axis passing through the origin orthogonal to the X and Y axes, so that natural and smooth head movement is realized at the key point level.
Affine matrix: the affine matrix defines a transformation matrix from the standard-state key points Fls to the transformed key points Fls'. The transformation comprises four aspects: rotation about the X axis, rotation about the Y axis, rotation about the Z axis and translation. The combination of these four transformations can be represented by the affine matrix D[3, 4].
Matrix fitting and application: the matrix fitting and practical application can be divided into the following steps:
1) Selecting a suitable head-moving video segment, which can be a presenter report or a presenter interaction segment.
2) Extract the key point coordinates of the video segment and perform normalization to obtain the head-movement reference key point sequence Fls_ref[N, 153*3].
3) Select one frame of standard key points Fls_stand[1, 153*3] from Fls_ref[N, 153*3], generally a frame without redundant information, i.e. a frontal face with no expression and no blink.
4) Use a data fitting algorithm in scikit-learn to obtain the affine matrix sequence D_ref[N, 3, 4] from Fls_stand[1, 153*3] to Fls_ref[N, 153*3].
5) Multiply D_ref[N, 3, 4] frame by frame with the emotion-processed key point sequence, giving the digital person exactly the same head action as the selected video.
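The fitting and application of the affine matrix sequence can be sketched as follows; plain numpy least squares stands in for the scikit-learn fitting mentioned above, so the details are illustrative assumptions.

```python
import numpy as np

def fit_affine_sequence(fls_stand, fls_ref):
    """fls_stand: [153, 3]; fls_ref: [N, 153, 3] -> affine sequence D_ref: [N, 3, 4]."""
    src = np.hstack([fls_stand, np.ones((fls_stand.shape[0], 1))])   # [153, 4] homogeneous
    mats = []
    for frame in fls_ref:                      # least squares: src @ X ~= frame, X: [4, 3]
        X, *_ = np.linalg.lstsq(src, frame, rcond=None)
        mats.append(X.T)                       # store as [3, 4]
    return np.stack(mats)

def apply_affine_sequence(d_ref, kp_seq):
    """d_ref: [N, 3, 4]; kp_seq: [N, 153, 3] -> head-moved key points [N, 153, 3]."""
    homo = np.concatenate([kp_seq, np.ones(kp_seq.shape[:2] + (1,))], axis=-1)
    return np.einsum('nij,nkj->nki', d_ref, homo)
```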
In some embodiments, the sequence embedding order is blink, then emotion, then head movement: the blink key point sequence is embedded first because it involves fewer key points and a smaller variation amplitude, while the head movement involves many key points and a large variation range and is therefore adjusted last, ensuring that the resulting image is more accurate.
Step S2506: generating digital human image data based on the user image data and the human face key point sequence;
in some embodiments, the digital human image data is a sequence of digital human image frames. A step of generating digital human image data based on user image data and a sequence of human face keypoints, comprising:
And inputting the user image data and the human face key point sequence into an image generation model to obtain digital human image data.
The image generation model is shown in fig. 28. The model comprises an encoding structure, a decoding structure, a generator and a discriminator. During training, the key point sequence is obtained through a face key point detection algorithm; the training data comprise more than one hundred thousand screened images from Celeb2 and FFHQ; the losses include a discriminator loss, a generator loss and the like; the optimizer is Adam, the learning rate is set between 10^-3 and 10^-5, and a cosine function adjusts the learning rate according to the epoch. An initially usable effect can be achieved after about 20 hours of training on an Nvidia A100 GPU.
And inputting the user image data and the human face key point sequence into an image generation model to obtain a digital human image sequence frame.
In some embodiments, the digital human image data includes user image data and a sequence of human face keypoints.
Step S2507: generating broadcasting voice based on the broadcasting text;
step S2508: the broadcasting voice and the digital person image data are transmitted to the display apparatus 200 so that the display apparatus 200 plays the broadcasting voice and displays the digital person image based on the digital person image data.
In some embodiments, the digital human image data comprises a sequence of digital human image frames. The display device 200 plays the digital human image frames and the broadcast voice.
In some embodiments, the digital human image data includes user image data and a sequence of human face keypoints. The display device 200 inputs the user image data and the face key point sequence into the image generation model to obtain a digital human image frame sequence, plays the digital human image frame and broadcasts the voice.
According to the embodiments of the present application, a deep analysis of the expressive capability of the human face key points is performed, and the amount of information contained in the key point sequence is increased stage by stage, finally realizing control of detailed information such as facial expression, head movement and blinking through the key points. Based on a Transformer model and a parallel-corpus training mode, conversion from neutral-expression key points to different expression styles such as happiness, anger and sadness is realized, with controllable intensity. Based on three-dimensional key points and an affine matrix nonlinear fitting algorithm, natural and smooth head movements such as turning, shaking and tilting the head are realized. Based on a time-sequence generation scheme, a segment of open-eye key points is input and a blink key point sequence segment of corresponding length is output, realizing natural and coherent blink actions.
The embodiments of the present application further refine some of the functions of the server 400. The server 400 performs the following steps, as shown in fig. 29.
Step S2901: receiving voice data input by a user and sent by the display device 200;
Step S2902: determining a broadcast voice based on the voice data;
step S2903: extracting voice characteristics of broadcast voice;
The server 400 includes a voice feature extraction module. The voice feature extraction module, based on an Encoder-Decoder structure, extracts the voice features of the broadcast voice. The Encoder-Decoder voice feature extraction module may be WeNet, wav2vec 2.0, etc. Taking wav2vec 2.0 as an example, wav2vec 2.0 is used as the base network to extract the voice feature data. The wav2vec 2.0 model is an end-to-end self-supervised pre-training architecture proposed for speech recognition tasks; by adding only a linear layer and fine-tuning, it can obtain excellent results on various speech recognition tasks. The pre-trained encoder of the wav2vec 2.0 model is selected to extract the voice features; because the wav2vec 2.0 model has high accuracy on ASR, the voice features output by its encoder carry better semantics.
The wav2vec 2.0 model is shown in fig. 30. The broadcast voice passes sequentially through CNN Blocks (convolutional neural network blocks), LayerNorm (normalization layer), dropout_feature (feature dropout), post_conv (post convolution), LayerNorm (normalization layer) and a Transformer to obtain the final feature vectors, namely the voice features. The CNN Blocks include GELU (Gaussian Error Linear Unit), GroupNorm (group normalization) and Conv1d (one-dimensional convolution). Post_conv includes GELU, GroupNorm and Conv1d.
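A minimal sketch of extracting voice features with a pre-trained wav2vec 2.0 encoder is shown below, using the Hugging Face transformers library; the checkpoint name is an assumption, since the application does not specify one.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# assumed checkpoint; any pre-trained wav2vec 2.0 encoder could be used
MODEL_NAME = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def extract_speech_features(waveform, sample_rate=16000):
    """Run the broadcast voice through the pre-trained wav2vec 2.0 encoder
    (CNN blocks + Transformer) and return the final feature vectors."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state   # (1, frames, hidden_dim) voice features
```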
Step S2904: determining mouth shape parameters based on the speech features;
In some embodiments, the mapping from voice features to mouth shape parameters is achieved by a deep-learning voice-driven model. The voice-driven model includes an encoder and a decoder. The voice features are input into the encoder to obtain high-level semantic features; the high-level semantic features are input into the decoder, and the decoder generates the mouth shape parameters corresponding to the voice features.
The mouth shape parameters include a FLAME model mouth shape parameter.
The FLAME model is a general 3D face model. Its input is a parameter vector containing shape parameters, pose parameters and expression parameters, and its output is a three-dimensional face mesh. The expression parameters of the FLAME model can represent facial expression changes such as smiling, frowning and mouth opening, so the advantage of the FLAME model in emotion expression can be used to complete voice-driven 3D facial animation.
FLAME follows the representation of the body model SMPL (Skinned Multi-Person Linear): it is based on LBS (Linear Blend Skinning) and incorporates blendshapes as its representation, comprising 5023 vertices and 4 joints. Specifically, FLAME splits the human head into four parts, namely the left eyeball, the right eyeball, the jaw and the neck, which can rotate around their defined joints to form a new three-dimensional representation. The parameters of the FLAME model are of three types: shape, pose and expression, which can be expressed as β ∈ R^|β|, θ ∈ R^|θ| and ψ ∈ R^|ψ| respectively. The FLAME model can thus be understood as a mapping M(β, θ, ψ): R^(|β|×|θ|×|ψ|) → R^(3N) from these three parameters to the three-dimensional model, where N is the number of vertices (5023). The FLAME model explicitly models the rotation of the neck and the eyeballs through LBS and can represent rich expressions, so it has great advantages in face reconstruction tasks.
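For context, the standard FLAME formulation (quoted from the original FLAME paper as background, not as part of this application) writes this mapping as a linear-blend-skinning function applied to a template mesh offset by shape, pose and expression blendshapes:

```latex
M(\vec{\beta},\vec{\theta},\vec{\psi})
  = W\!\left(T_P(\vec{\beta},\vec{\theta},\vec{\psi}),\,
             \mathbf{J}(\vec{\beta}),\, \vec{\theta},\, \mathcal{W}\right),
\qquad
T_P(\vec{\beta},\vec{\theta},\vec{\psi})
  = \bar{\mathbf{T}} + B_S(\vec{\beta};\mathcal{S})
    + B_P(\vec{\theta};\mathcal{P}) + B_E(\vec{\psi};\mathcal{E})
```

where W is the LBS function, T̄ the template mesh, J(β) the joint regressor, 𝒲 the blend weights, and B_S, B_P, B_E the shape, pose and expression blendshapes.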
The FLAME parameter learning decodes the input voice feature or emotion expression feature into the corresponding FLAME parameter through a Decoder.
In some embodiments, the Decoder for FLAME mouth shape parameter learning consists of two parts: a biased causal multi-head self-attention with periodic positional encoding (PPE), used to generalize to longer input sequences; and a biased multi-modal multi-head attention, used to align the audio and motion modalities. During training, the encoder converts the original audio X into an audio feature sequence A_T aligned with the number of visual frames. Conditioned on A_T and the past facial parameters \hat{F}_{T-1}, the decoder predicts the facial motion of the current frame \hat{F}_T, which can be expressed as:

\hat{F}_T = f_1(A_T, \hat{F}_{T-1})

where f_1 is the decoder, A_T is the audio feature at time T, and \hat{F}_{T-1} is the facial parameter learned at time T−1.
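A highly simplified sketch of this autoregressive prediction is given below. It keeps the structure \hat{F}_T = f_1(A_T, \hat{F}_{T-1}) but replaces the biased causal attention, periodic positional encoding and alignment bias with a plain multi-head attention over past predictions; the class name and dimensions are assumptions.

```python
import torch
from torch import nn

class MouthShapeDecoder(nn.Module):
    """Sketch of the autoregressive FLAME mouth-shape decoder: the current
    audio feature A_t queries the sequence of past facial-parameter
    predictions, i.e. F_t = f1(A_t, F_{t-1}, ...)."""
    def __init__(self, audio_dim=768, flame_dim=56, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.flame_proj = nn.Linear(flame_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out = nn.Linear(hidden, flame_dim)

    def forward(self, audio_feats):                        # (B, T, audio_dim)
        B, T, _ = audio_feats.shape
        a = self.audio_proj(audio_feats)
        preds = [audio_feats.new_zeros(B, 1, self.out.out_features)]  # F_0
        for t in range(T):
            q = a[:, t:t + 1]                              # current audio frame as query
            kv = self.flame_proj(torch.cat(preds, dim=1))  # past facial parameters
            ctx, _ = self.attn(q, kv, kv)
            preds.append(self.out(ctx))                    # F_t
        return torch.cat(preds[1:], dim=1)                 # (B, T, flame_dim)
```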
Step S2905: determining emotion parameters and acquiring user image data;
In some embodiments, the step of determining an emotion parameter includes
Determining a reply emotion according to the voice data;
The steps for determining the reply emotion have been described in detail above and are not repeated here.
And determining emotion parameters according to the reply emotion.
And inputting the broadcast text or broadcast voice and the reply emotion into an emotion parameter learning module to obtain emotion parameters. The emotion parameters are FLAME model emotion parameters.
In some embodiments, the step of determining an emotion parameter includes
Determining priori emotion knowledge;
In the training stage, in order to enable the model to encode and decode the emotion changes in the input voice, frame-level emotion information can first be obtained using an emotion recognition model as priori emotion knowledge, which facilitates model training.
In some embodiments, the step of determining a priori emotional knowledge comprises:
acquiring user face data acquired by display equipment;
simultaneously with the user inputting the voice data, the display apparatus 200 may activate a camera through which the user face data is collected, and the display apparatus 200 transmits the user face data to the server 400.
And inputting the face data of the user into a distraction network model to obtain priori emotion knowledge, wherein the distraction network model comprises a feature clustering network, a multi-head attention network and an attention fusion network.
The priori emotion knowledge extraction method based on two-dimensional images is as follows. Compared with audio signal data, reliable emotion information is easier to decode from 2D image data and easier for the model to learn. The priori emotion knowledge in the image is extracted using an existing DAN (distraction network) model and a pre-trained model. As shown in fig. 31, the DAN network model includes a feature clustering network (Feature Clustering Network, FCN), a multi-head cross attention network (Multi-head cross Attention Network, MAN) and an attention fusion network (Attention Fusion Network). Features are first extracted and clustered through the feature clustering network, where an affinity loss is applied to increase the inter-class difference and reduce the intra-class variance. Next, the multi-head attention network attends to multiple face regions simultaneously through a series of spatial attention (Spatial Attention, SA) and channel attention (Channel Attention, CA) units. Finally, the attention fusion network adjusts the attention-map feature vectors by enhancing the differences between the attention maps and outputs class confidences. Since the final output of the network model is emotion category confidences, which cannot represent priori emotion knowledge, the network output before the softmax layer in the attention fusion network is taken as the priori emotion knowledge.
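Because the priori emotion knowledge is taken from the activations just before the softmax layer rather than from the class confidences, a generic way to obtain it from any pre-trained classifier (DAN included) is a forward hook on the final classification layer, as sketched below; the function name and interface are assumptions, not the actual implementation of this application.

```python
import torch
from torch import nn

def emotion_prior_from_classifier(classifier: nn.Module,
                                  final_layer: nn.Module,
                                  face_batch: torch.Tensor) -> torch.Tensor:
    """Run a pre-trained emotion classifier and return the activations that
    feed its final classification (pre-softmax) layer, used here as the
    priori emotion knowledge instead of the emotion category confidences."""
    captured = {}

    def hook(_module, inputs, _output):
        captured["prior"] = inputs[0].detach()     # input to the last layer

    handle = final_layer.register_forward_hook(hook)
    with torch.no_grad():
        classifier(face_batch)                     # forward pass only to trigger the hook
    handle.remove()
    return captured["prior"]
```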
In some embodiments, the step of determining a priori emotional knowledge comprises:
extracting a mel spectrum of the broadcast voice;
the Mel spectrum is processed by a convolutional neural network and a feature aggregation unit to obtain semantic features;
the semantic features are passed through a bidirectional gated recurrent unit and a global-local attention module to obtain characteristic features;
acquiring common characteristics in voice characteristics;
fusing the common features and the characteristic features through an interaction attention module to obtain fused features;
and inputting the common characteristics, the characteristic characteristics and the fusion characteristics into a classifier to obtain priori emotion knowledge.
The priori emotion knowledge extraction method based on the voice Mel spectrum is as follows. As shown in fig. 32, the logarithmic Mel spectrogram obtained by preprocessing the audio signal of the broadcast voice is used as the input feature; emotion-related semantic features are extracted through a convolutional neural network and a feature aggregation unit, and emotion-related characteristic features Y are then obtained through a Bi-GRU (bidirectional gated recurrent unit) and a global-local attention module. The Wav2Vec2.0 pre-trained model learns a common representation X from the broadcast voice through self-supervised learning and is transferred to the speech emotion recognition task by fine-tuning. In a joint network, the common features from the pre-trained model and the characteristic features from the spectrum-based model are fused through different interaction attention modules, so that the emotion information in the voice is better utilized. Finally, the common features, characteristic features and fusion features are each passed through a classifier to produce speech emotion recognition predictions. Since the content of these predictions is again emotion category confidences, the input before the classifier is taken as the priori emotion knowledge.
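A minimal sketch of the log-Mel preprocessing step is shown below with torchaudio; the window, hop and Mel-bin sizes are assumptions, since the application does not specify them.

```python
import torch
import torchaudio

# Mel spectrogram transform for 16 kHz broadcast voice (parameter values assumed)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, samples) mono audio at 16 kHz -> (1, 80, frames) log-Mel spectrogram."""
    mel = mel_transform(waveform)
    return torch.log(mel + 1e-6)   # logarithmic Mel spectrogram used as the input feature
```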
Inputting priori emotion knowledge and voice characteristics into an emotion predictor to obtain emotion expression characteristics;
After the priori emotion knowledge is obtained, it is used as the ground truth, and the emotion representation in the voice features is predicted through a Bi-LSTM (bidirectional long short-term memory) network. The loss function in the network training process is designed to maximize the mutual information between the predicted emotion representation and the priori emotion knowledge.
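A bare-bones sketch of such a Bi-LSTM emotion predictor is given below; the feature dimensions, the mean pooling and the omission of the mutual-information loss are all simplifying assumptions.

```python
import torch
from torch import nn

class BiLSTMEmotionPredictor(nn.Module):
    """Maps frame-level voice features to an utterance-level emotion
    representation that is trained against the priori emotion knowledge."""
    def __init__(self, feat_dim=768, hidden=128, emo_dim=7):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, emo_dim)

    def forward(self, voice_feats):           # (B, T, feat_dim)
        out, _ = self.bilstm(voice_feats)     # (B, T, 2 * hidden)
        return self.head(out.mean(dim=1))     # (B, emo_dim) emotion representation
```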
An emotion parameter is determined based on the emotion expression characteristics.
And inputting the emotion expression characteristics into an emotion parameter learning module to obtain emotion parameters.
FLAME emotion parameter learning: the emotion feature vector M_T output by the emotion predictor is a 7-dimensional emotion vector (exemplary emotion types include the 7 categories of neutral, anger, disgust, fear, happiness, sadness and surprise). Before being input into the CNN network f_2, it needs to be projected into the embedding space through a learnable two-dimensional matrix W, after which f_2 decodes it to generate the corresponding emotion FLAME parameter \hat{E}_T, which can be expressed as:

\hat{E}_T = f_2(M_T W)
The embodiments of the present application provide two different methods for extracting priori emotion knowledge for voice: the priori emotion knowledge is extracted from the Mel spectrum data of the voice, or from the 2D video data of the speaker, together with a pre-trained emotion recognition model, and serves as the ground truth of the emotion predictor, guiding the whole training process of the voice emotion predictor. The priori emotion knowledge gives the output of the emotion predictor more accurate emotion semantics, thereby improving the accuracy of the FLAME emotion parameters output by the emotion enhancement network.
In some embodiments, the step of determining an emotion parameter comprises:
determining priori emotion knowledge;
inputting the priori emotion knowledge and the voice characteristics into an emotion predictor to obtain emotion expression characteristics;
acquiring emotion intensity;
in some embodiments, the emotion intensity may be set by the user and the set emotion intensity may be saved to a preset address, from which the emotion intensity is obtained.
In some embodiments, the emotion intensity is determined from the voice data. The steps for determining the emotion intensity have been described in detail above and are not repeated here.
Determining an emotional intensity expression signature based on the emotional expression signature and the emotional intensity;
wherein, the step of determining the emotion intensity expression feature based on the emotion expression feature and the emotion intensity comprises:
calculating distances between emotion expression features and a plurality of emotion clustering centers, wherein the emotion clustering centers are obtained by clustering priori emotion knowledge vectors of different emotions and samples;
determining the minimum distance in the distances and an emotion clustering center corresponding to the minimum distance;
and determining the emotion intensity expression characteristics according to the emotion expression characteristics, the emotion intensity, the minimum distance and the emotion clustering center corresponding to the minimum distance.
The server further includes an emotion degree control module. The user may input a personalized emotion control degree ε. After an emotion personalization degree ε and a broadcast voice test sample are received, the voice features are extracted and the emotion predictor outputs the emotion expression feature A_test; the emotion intensity expression feature under personalized degree control is then obtained through the following steps:
In the training stage, the priori emotion knowledge vectors of all emotions and all samples are clustered to obtain the cluster center of each emotion {C_i, i = 0, ..., 6}, where i is the index of the emotion category, 0 denotes the neutral emotion and the others are, in order, anger, disgust, fear, happiness, sadness and surprise, and C_i is the corresponding emotion cluster center. At the same time, the distance D_i, i = 1, ..., 6, of each cluster center C_i from the neutral emotion cluster center is computed.
For the emotion expression feature A_test output by the emotion predictor, compute its distance d_i, i = 1, ..., 6, to each cluster center C_i, and take the C_i corresponding to the smallest d_i as the emotion center closest to the voice signal.
The emotion intensity expression feature under personalized degree control is calculated according to the following formula:

A_{test,final} = A_test + (ε − (1 − d_i / D_i)) · C_i
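A small numpy sketch of this emotion-degree control step is shown below; the function name and the use of Euclidean distance are assumptions.

```python
import numpy as np

def personalized_emotion_feature(a_test, eps, centers, neutral_center):
    """Apply the personalized emotion-degree control.

    a_test:          emotion expression feature output by the emotion predictor
    eps:             user-set personalized emotion control degree
    centers:         cluster centers C_i, i = 1..6, of the non-neutral emotions
    neutral_center:  cluster center C_0 of the neutral emotion
    """
    big_d = np.linalg.norm(centers - neutral_center, axis=1)   # D_i
    small_d = np.linalg.norm(centers - a_test, axis=1)         # d_i
    i = int(np.argmin(small_d))                                # closest emotion center
    return a_test + (eps - (1 - small_d[i] / big_d[i])) * centers[i]
```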
an emotion parameter is determined based on the emotion intensity expression characteristics.
And inputting the emotion intensity expression characteristics into an emotion parameter learning module to obtain emotion parameters.
In order to make the generated facial emotion more personalized, an emotion degree control module is added in the prediction stage. According to the user's required degree of emotion expression, and based on the emotion features predicted by the emotion predictor and the cluster centers of the emotion labels, a voice emotion expression with a personalized emotion degree is obtained and then input into the emotion parameter learning module, so that the finally generated facial emotion animation is more natural, vivid and personalized.
Step S2906: generating digital human image data based on the user image data, the emotion parameters and the mouth shape parameters;
in some embodiments, the digital human image data comprises a sequence of digital human image frames. A step of generating digital human image data based on user image data, emotion parameters, and mouth shape parameters, comprising:
fusing the emotion parameters and the mouth shape parameters to obtain fusion parameters;
After the decoder obtains the mouth shape parameter and the emotion parameter, the two parameters are clearly independent of each other. They can be fused by summation, i.e. the fused FLAME parameter is the sum of the mouth shape parameter and the emotion parameter:

\hat{F}^{fuse}_T = \hat{F}_T + \hat{E}_T
digital human image data, i.e., a sequence of digital human image frames, is generated based on the user image data and the fusion parameters.
After the fused FLAME parameter \hat{F}^{fuse}_T at time T is received, the trained FLAME model converts it into the corresponding 3D face vertices \hat{Y}_T.
To improve the accuracy of driving the mouth shape and facial expression, the loss function used throughout the training process can be expressed as:

L = α · Σ_t ||face_mask ⊙ (Y_t − \hat{Y}_t)||^2 + β · Σ_t ||lip_mask ⊙ (Y_t − \hat{Y}_t)||^2

where lip_mask and face_mask are vertex masks of the mouth region and the face region respectively (a mask entry is 1 if the vertex belongs to the region, otherwise 0), Y_t is the ground-truth vector of all face vertices at time t (5023 × 3 dimensions), \hat{Y}_t is the predicted vector of all face vertices at time t (5023 × 3 dimensions), and α, β are the loss weights (between 0 and 1) of the face-region and mouth-region vertices, set according to the actual driving effect.
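A sketch of this masked vertex loss in PyTorch is given below; the exact reduction (mean over frames and vertices) is an assumption, since only the masks, weights and vertex dimensions are stated above.

```python
import torch

def driving_loss(pred_verts, true_verts, lip_mask, face_mask, alpha=0.5, beta=0.5):
    """Weighted vertex loss for mouth-shape and facial-expression driving.

    pred_verts, true_verts: (T, 5023, 3) predicted / ground-truth face vertices
    lip_mask, face_mask:    (5023,) 0/1 vertex masks of the mouth and face regions
    alpha, beta:            loss weights (between 0 and 1) of the face / mouth terms
    """
    sq_err = (pred_verts - true_verts).pow(2).sum(dim=-1)   # (T, 5023) per-vertex error
    face_term = (face_mask * sq_err).mean()
    lip_term = (lip_mask * sq_err).mean()
    return alpha * face_term + beta * lip_term
```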
In some embodiments, the digital human image data includes user image data and fusion parameters.
According to the embodiments of the present application, the emotion features of the speaker are predicted from the audio data and then decoded into FLAME expression parameters through the emotion parameter learning module. The FLAME expression parameters are then combined with the learned voice mouth shape FLAME parameters and input into the differentiable rendering module to generate the 3D facial animation.
Step S2907: the broadcasting voice and the digital person image data are transmitted to the display apparatus 200 so that the display apparatus 200 plays the broadcasting voice and displays the digital person image based on the digital person image data.
In some embodiments, the digital human image data comprises a sequence of digital human image frames. The display device 200 plays digital human image frames and broadcasts voice.
In some embodiments, the digital human image data includes user image data and fusion parameters. The display device 200 generates a sequence of digital human image frames based on the user image data and the fusion parameters, plays the digital human image frames, and broadcasts voice.
The technical architecture is shown in fig. 33. The input voice is used to extract a Mel spectrum which is fed into a pre-trained emotion recognition model, or image data is fed into the pre-trained emotion recognition model, to obtain the priori emotion knowledge. After the voice features of the input voice are extracted, they are input into the emotion predictor and into the mouth shape parameter learning model within the parameter learning model. The emotion predictor determines the emotion expression feature based on the voice features and the priori emotion knowledge; this feature can either be input directly into the emotion parameter learning model within the parameter learning model, or be input into the emotion degree control module. After obtaining the emotion control degree and the emotion expression feature, the emotion degree control module outputs the emotion intensity expression feature to the emotion parameter learning model. The emotion parameter learning model determines the emotion parameters based on the emotion expression feature or the emotion intensity expression feature. The mouth shape parameter learning model determines the mouth shape parameters based on the voice features. The parameter learning model fuses the emotion parameters and the mouth shape parameters to obtain the fusion parameters. The parameter driving module generates a 3D emotion animation sequence based on the fusion parameters.
In some embodiments, after the voice data input by the user and sent by the display device is received, the voice features of the voice data are extracted, the mouth shape parameters are determined based on the voice features, the emotion parameters are determined and the user image data is acquired; the digital human image data is then generated based on the user image data, the emotion parameters and the mouth shape parameters, so that the effect of the digital person imitating the user's speech can be achieved.
According to the embodiments of the present application, priori emotion knowledge is obtained from voice Mel spectrum data or speaker video using a pre-trained emotion recognition model; an audio emotion predictor is then trained under the guidance of this priori emotion knowledge to obtain the emotion expression features of the voice, which are input into a neural network and decoded into FLAME emotion parameters. At the same time, the emotion expression features are combined with the learned mouth shape FLAME parameters and can be differentiably rendered to generate the 3D facial animation. The embodiments of the present application further add an emotion degree control module, which can generate an emotion feature expression of personalized degree from the predicted voice emotion features and the priori-knowledge cluster center of each emotion label, so that the generated digital human image is more vivid and personalized.
Some embodiments of the present application provide a digital human interaction method, the method being applicable to a server 400, the server 400 being configured to: after receiving voice data input by a user sent by the display device 200, acquiring user image data and original key point data corresponding to the user image data, and determining a broadcasting text according to the voice data; determining a reply emotion based on the voice data; inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence; correspondingly replacing the emotion voice key point sequence into the original key point sequence to generate a face key point sequence; generating digital human image data based on the user image data and the human face key point sequence; generating broadcasting voice based on the broadcasting text; the broadcasting voice and the digital person image data are transmitted to the display apparatus 200 so that the display apparatus 200 plays the broadcasting voice and displays the digital person image based on the digital person image data. According to the embodiment of the application, the mapping of key point data from neutrality to other emotion is realized through the emotion mapping voice driving model, so that the generated digital person has a mouth shape corresponding to voice content, and meanwhile, the expression is richer and more natural.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A server, configured to:
after receiving voice data input by a user sent by display equipment, acquiring user image data and original key point data corresponding to the user image data, and determining a broadcasting text according to the voice data;
Determining a reply emotion based on the speech data;
inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence, wherein the emotion voice key point sequence is a sequence of key points related to expression and pronunciation;
correspondingly replacing the emotion voice key point sequence into an original key point sequence to generate a face key point sequence, wherein the original key point sequence comprises a plurality of original key point data;
generating digital human image data based on the user image data and the human face key point sequence;
generating broadcasting voice based on the broadcasting text;
and sending the broadcasting voice and the digital human image data to the display device so that the display device plays the broadcasting voice and displays the digital human image based on the digital human image data.
2. The server of claim 1, wherein the server is configured to:
determining a blink key point sequence, wherein the blink key point sequence is a sequence of blink related key points;
acquiring a preset blink position;
determining at least one target area in the original key point sequence based on the preset blink position, wherein the target area is used for replacing the blink key point sequence;
And correspondingly replacing the blink key point sequence into a target area of the original key point sequence to obtain a key point sequence after blinking.
3. The server of claim 2, wherein the server performs a corresponding substitution of the emotion voice keypoint sequence into an original keypoint sequence, generating a face keypoint sequence, further configured to:
and correspondingly replacing the emotion voice key point sequence into the key point sequence after blinking to generate a face key point sequence.
4. The server of claim 3, wherein the server performs a corresponding substitution of the emotion voice keypoint sequence into the post-blink keypoint sequence to generate a face keypoint sequence, further configured to:
correspondingly replacing the emotion voice key point sequence into the key point sequence after blinking to generate a key point sequence after emotion processing;
acquiring a head-movement affine matrix fitting sequence;
and generating a human face key point sequence based on the head affine matrix fitting sequence and the key point sequence after emotion processing.
5. The server of claim 2, wherein the server performs determining a sequence of blink keypoints, and is further configured to:
Copying the original key point data into a plurality of data;
and inputting a plurality of original keypoints into a keypoint blink model to obtain a blink keypoint sequence, wherein the keypoint blink model is obtained by training on the basis of a voice synthesis model and on the condition of upper and lower eyelid heights.
6. The server of claim 4, wherein the server performing acquiring a head-movement affine matrix fitting sequence is further configured to:
extracting key point data in a preset video and normalizing the key point data to obtain a head-movement reference key point sequence;
selecting standard key point data from the head movement reference key point sequence, wherein the standard key point data are key point data of a front face, no expression and no blink;
acquiring a head-movement affine matrix fitting sequence from the standard key point data to the head-movement reference key point sequence by adopting a data fitting algorithm;
storing the head-moving affine matrix fitting sequence to a preset address;
and acquiring a head-movement affine matrix fitting sequence from the preset address.
7. The server of claim 1, wherein the server performing determining a reply emotion based on the voice data is further configured to:
Determining a reply emotion and emotion intensity based on the voice data;
the server performing inputting the broadcast text and the reply emotion into an emotion mapping voice driving model is further configured to:
and inputting the broadcast text, the reply emotion and the emotion intensity into an emotion mapping voice driving model.
8. The server of claim 1, wherein the server performs generating digital human image data based on the user image data and the sequence of human face keypoints is further configured to:
and inputting the user image data and the human face key point sequence into an image generation model to obtain digital human image data.
9. A display device, characterized by comprising:
a display configured to display a user interface;
a communicator configured to communicate data with the server;
a controller configured to:
receiving voice data input by a user;
transmitting the voice data to a server through the communicator;
receiving digital human image data issued by the server based on the voice data and broadcasting voice;
and playing the broadcasting voice and displaying the digital human image based on the digital human image data.
10. A digital human interaction method, comprising:
after receiving voice data input by a user sent by display equipment, acquiring user image data and original key point data corresponding to the user image data, and determining a broadcasting text according to the voice data;
determining a reply emotion based on the speech data;
inputting the broadcast text and the reply emotion into an emotion mapping voice driving model to obtain an emotion voice key point sequence, wherein the emotion voice key point sequence is a sequence of key points related to expression and pronunciation;
correspondingly replacing the emotion voice key point sequence into an original key point sequence to generate a face key point sequence, wherein the original key point sequence comprises a plurality of original key point data;
generating digital human image data based on the user image data and the human face key point sequence;
generating broadcasting voice based on the broadcasting text;
and sending the broadcasting voice and the digital human image data to the display device so that the display device plays the broadcasting voice and displays the digital human image based on the digital human image data.