WO2021232875A1 - Method and apparatus for driving digital person, and electronic device - Google Patents

Method and apparatus for driving digital person, and electronic device

Info

Publication number
WO2021232875A1
Authority
WO
WIPO (PCT)
Prior art keywords
action
target
parameter
text
target action
Prior art date
Application number
PCT/CN2021/078242
Other languages
French (fr)
Chinese (zh)
Inventor
樊博
Original Assignee
北京搜狗科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京搜狗科技发展有限公司 filed Critical 北京搜狗科技发展有限公司
Publication of WO2021232875A1
Priority to US17/989,323 (published as US20230082830A1)



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Definitions

  • the present disclosure relates to the field of software technology, and in particular to a method, device and electronic equipment for driving digital humans.
  • A digital human (also called a virtual human, hyper-realistic human, or photo-realistic human) is a comprehensive rendering technology that uses computers to simulate real humans. Because people are extremely familiar with real humans (the well-known uncanny valley phenomenon), the difficulty of realizing a digital human does not increase linearly but exponentially: a 3D static model may look very real, yet a single blink of an eye can immediately make it look unreal. How to make the movements of digital humans more delicate and realistic has become a technical problem that urgently needs to be solved in the development of digital humans.
  • The purpose of the present disclosure is, at least in part, to provide a method, an apparatus, and an electronic device for driving a digital human, which are used to solve the technical problem of abrupt changes in digital human motions in the prior art and to improve the fineness of digital human motion changes.
  • A method for driving a digital human includes: obtaining a target action corresponding to a target text; obtaining, when the digital human is driven to output speech based on the target text, a reference action to be executed by the digital human before the target action is executed; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital human to output speech based on the target text, driving the digital human to perform the target action according to the modified target action parameter.
  • Before the target action corresponding to the target text is acquired, the method further includes: acquiring the target action corresponding to a text to be processed; converting the text to be processed into the target text through a speech synthesis markup language; and inserting a tag of the target action into the target text.
  • the acquiring a target action corresponding to the text to be processed includes: acquiring a preset keyword in the text to be processed; acquiring a predetermined action corresponding to the preset keyword as the target action.
  • The obtaining the target action corresponding to the text to be processed includes: performing semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and obtaining a predetermined action corresponding to the action intention as the target action.
  • The modifying the target action parameter of the target action according to the reference action parameter of the reference action includes: obtaining, from a preset action library, at least one form of the target action and the action parameters of each form, the action parameters including a start action parameter and a termination action parameter; obtaining, according to the action parameters of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action parameters, as the target action parameter; and modifying the target action parameter according to the reference action parameter so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced.
  • the motion parameter is a bone position parameter or a muscle movement parameter.
  • the target action is a facial expression or a physical action.
  • An apparatus for driving a digital human includes: an acquiring unit, configured to acquire a target action corresponding to a target text and to acquire, when the digital human is driven to output speech based on the target text, a reference action to be performed by the digital human before the target action is executed; an adjustment unit, configured to modify the target action parameter of the target action based on the reference action parameter of the reference action; and a driving unit, configured to drive the digital human to perform the target action according to the modified target action parameter in the process of driving the digital human to output speech based on the target text.
  • The apparatus further includes: a recognition unit, configured to obtain the target action corresponding to a text to be processed before the target action corresponding to the target text is acquired; and an insertion unit, configured to convert the text to be processed into the target text and insert the tag of the target action into the target text.
  • the recognition unit is configured to: obtain a preset keyword in the to-be-processed text; obtain a predetermined action corresponding to the preset keyword as the target action.
  • the recognition unit is further configured to: perform semantic recognition on the to-be-processed text to obtain the action intention contained in the to-be-processed text; and obtain a predetermined action corresponding to the action intention as the target action .
  • The adjustment unit is configured to: obtain at least one form of the target action and the action parameters of each form from a preset action library, the action parameters including a start action parameter and a termination action parameter; obtain, according to the action parameters of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action parameters as the target action parameter; and modify the target action parameter according to the reference action parameter so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced.
  • the motion parameter is a bone position parameter or a muscle movement parameter.
  • the target action is a facial expression or a physical action.
  • An embodiment of the present disclosure provides a method for driving a digital human: obtain a target action corresponding to a target text; obtain a reference action performed by the digital human before performing the target action when the digital human is driven to output speech based on the target text; modify the target action parameters of the target action according to the reference action parameters of the reference action, so that the target action and the reference action are as close as possible; and, in the process of driving the digital human based on the target text, drive the digital human to execute the target action with the modified parameters after the reference action. This enables the digital human to switch seamlessly from its current action state to the target action, making the action change process natural and delicate, which solves the technical problem of sudden changes in the digital human's movements in the prior art and improves the fineness of movement changes.
  • Fig. 1 shows a schematic flowchart of a method for generating digital-human-driven text according to one or more embodiments of the present disclosure;
  • Fig. 2 shows a schematic flowchart of a method for driving a digital human according to one or more embodiments of the present disclosure;
  • Fig. 3 shows a block diagram of an apparatus for driving a digital human according to one or more embodiments of the present disclosure;
  • Fig. 4 shows a schematic structural diagram of an electronic device according to one or more embodiments of the present disclosure.
  • the present disclosure provides a method for driving a digital human.
  • The inserted action is adjusted based on the reference action of the digital human, so that the action change process between the reference action and the inserted action is natural and delicate, thereby solving the problem of abrupt digital human action changes in the prior art.
  • An embodiment of the present disclosure provides a method for generating digital-human-driven text, which includes:
  • S10: Acquire a target action corresponding to the text to be processed.
  • S12: Convert the to-be-processed text into the target text by using a speech synthesis markup language, and insert the tag of the target action into the target text.
  • The text content of the text to be processed needs to be converted into speech and output. When the text is converted into speech output, it is necessary, for example, to output the action "wave hand" at the same time as the voice "wave hand" is output.
  • S10 acquires a target action corresponding to the text to be processed.
  • the target action may be one or more than one. This embodiment does not limit the specific number of target actions.
  • S10 may obtain the target action corresponding to the text to be processed through any one or more of the following methods:
  • Method 1: Obtain the preset keywords in the text to be processed.
  • The preset keywords can be body motion keywords or facial expression keywords, for example: "wave hand", "shake head", "smile", "sad", etc.
  • A predetermined action corresponding to a preset keyword is acquired as a target action, and the target action can be a facial expression or a physical action.
  • The actions in the action library can be obtained through data collection devices such as cameras and three-dimensional scanners that capture real-person actions, or they can be extracted from existing videos.
  • Method 2: Perform semantic recognition on the text to be processed to obtain the action intention contained in the text to be processed; obtain the predetermined action corresponding to the action intention as the target action.
  • Through semantic recognition, the intention of the text to be processed can be obtained more accurately and comprehensively, rather than being limited to explicit action words. For example, for the text "Today's sun is bright and beautiful, the air is fresh and refreshing", although the text does not mention any action, according to the meaning of the entire text, "the sun is bright and beautiful" may correspond to an action intention of raising the head, and "the air is fresh" may correspond to an action intention of taking a deep breath.
  • the corresponding predetermined actions are obtained.
  • an action library can be established in advance to store the correspondence between each action intention and each action, as well as the action parameters of each action, so that the predetermined action corresponding to the action intention can be quickly obtained from the action library.
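The action library described above can be sketched as a simple lookup table from action intentions to predetermined actions and their action parameters. All intent names, action names, and parameter values below are illustrative assumptions, not taken from this disclosure:

```python
# A minimal sketch of a pre-built action library: each entry maps a
# recognized action intention to a predetermined action and its action
# parameters (start and termination). Names and values are hypothetical.
ACTION_LIBRARY = {
    "raise_head": {"action": "raise_head", "start": [0.0, 0.0], "end": [0.3, 0.1]},
    "deep_breath": {"action": "deep_breath", "start": [0.0, 0.0], "end": [0.1, 0.2]},
    "wave_hand": {"action": "wave_hand", "start": [0.2, 0.1], "end": [0.8, 0.6]},
}

def lookup_action(intent: str):
    """Quickly obtain the predetermined action for an intention, or None."""
    entry = ACTION_LIBRARY.get(intent)
    return entry["action"] if entry else None
```

Because the correspondence is precomputed, obtaining the predetermined action at drive time is a constant-time lookup rather than a search.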
  • Method 3: Manually mark the text to be processed and insert action identifiers.
  • Different action identifiers correspond to different target actions.
  • the action identifier in the text to be processed is searched, and the corresponding target action can be obtained according to the action identifier obtained by the search.
  • Speech Synthesis Markup Language (SSML) is an XML-based markup language. Compared with synthesizing plain text, using SSML can enrich the synthesized content and bring more variety to the final synthesis result.
  • The SSML markup language is used to convert the target text: the text to be converted is placed in a <speak></speak> tag, and each speech synthesis task includes one <speak></speak> tag.
  • this embodiment also inserts the tag of the target action into the target text through the SSML markup language, so that the target text can not only control what the speech synthesis reads, but also control the output of corresponding actions when reading the speech.
  • The label of the target action can be the action name, in which case the corresponding action parameters are obtained according to the action name. Alternatively, the target action parameters can be directly inserted into the target text as the label, so that when the digital human is driven, the target action parameters can be obtained directly.
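The tagging step can be sketched as follows. Note that standard SSML defines no action element, so the `<action>` tag name used here is an assumed custom extension, and the triggering keyword is hypothetical:

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, action_name: str, keyword: str) -> str:
    """Wrap the text to be processed in a <speak> element and insert a
    hypothetical <action> tag immediately before the keyword that
    triggered the target action, yielding the target text."""
    marked = escape(text).replace(
        escape(keyword),
        f'<action name="{action_name}"/>{escape(keyword)}',
        1,  # tag only the first occurrence of the keyword
    )
    return f"<speak>{marked}</speak>"
```

For example, `build_ssml("Goodbye everyone", "wave_hand", "Goodbye")` yields a `<speak>` document in which the action tag precedes the word that should be spoken while the action is performed.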
  • an embodiment of the present disclosure provides a method for driving a digital human, and the method includes:
  • In the process of driving the digital human to output speech based on the target text, the digital human may usually be in a common state, namely the reference state.
  • For example, for a digital human broadcasting news, the reference state may be standing upright or sitting at a desk without expression, or, owing to the habits of news broadcasters, habitually performing certain actions. For this reason, when actions are inserted during the broadcast, there may be a large difference between the two adjacent actions, causing sudden changes in the actions.
  • This embodiment obtains in advance the target action in the target text and the reference action that the digital human performs before executing the target action, and modifies the target action based on the reference action so that the target action is as close as possible to the reference action, thereby solving the technical problem of abrupt action changes caused by large action differences.
  • S20 may directly search and obtain the action label of the target action from the target text, and obtain the corresponding target action according to the action label.
  • the target text may contain one or more target action labels.
  • S22 obtains the reference action of the digital human before it performs the target action.
  • Specifically, the location feature of the target action in the target text can be obtained first (for example, between the keywords x1 and x2), together with the duration feature of the target text, which is generated according to the phoneme features corresponding to the target text. The duration feature and the location feature of the target action are then used to obtain the first time point at which the target action is executed, that is, at which point in the total duration of the voice broadcast the target action is executed. Finally, according to the first time point, the reference action of the digital human at the adjacent time point before the first time point is obtained.
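The time-point computation above can be sketched as summing phoneme durations up to the action's position. The flat list representation and the index argument are illustrative assumptions:

```python
def action_time_point(phoneme_durations, action_phoneme_index):
    """Locate the first time point at which the target action is executed:
    accumulate the phoneme durations (the duration feature, in seconds)
    up to the phoneme where the action tag appears in the target text."""
    return sum(phoneme_durations[:action_phoneme_index])
```

Given this time point, the reference action is whatever action the digital human is performing at the adjacent preceding time point.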
  • the reference action may be a basic action corresponding to the reference state of the digital human, or it may be a habitual action adopted in the voice input process, or may be other target actions in the target text.
  • An action usually includes a basic action and a characteristic action, which correspond to basic action parameters and characteristic action parameters respectively. The basic action can change according to the scene, while the characteristic action generally does not change with the scene. For example, the characteristic action of "goodbye" is the forearm driving the palm to swing, while the basic action includes the movements of the upper arms, head, feet, and so on.
  • the difference between the action parameters refers to the total difference obtained by subtracting and accumulating the corresponding parameters in the action parameters.
  • basic action parameter V = [x11~x1n, y11~y1m, z11~z1k]
  • basic action parameter W = [x21~x2n, y21~y2m, z21~z2k]
  • The difference between the two basic action parameters is Σ(x1i − x2i) + Σ(y1j − y2j) + Σ(z1l − z2l), that is, the total obtained by subtracting the corresponding components and accumulating the results.
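As a minimal sketch, the accumulation of component-wise differences can be written as follows. Taking absolute values so that opposite-signed offsets do not cancel is an assumption added here; the text itself writes plain differences:

```python
def parameter_difference(v, w):
    """Total difference between two action-parameter vectors V and W:
    subtract corresponding components and accumulate the results.
    Absolute values keep positive and negative offsets from cancelling."""
    assert len(v) == len(w), "parameter vectors must have equal length"
    return sum(abs(a - b) for a, b in zip(v, w))
```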
  • the action parameter referred to in this embodiment may be a bone position parameter or a muscle movement parameter of a digital human, where the muscle movement parameter includes a muscle contraction parameter and a muscle relaxation parameter.
  • Which parameter to obtain is determined according to the driving model of the digital human. If the driving model of the digital human is a muscle binding model, then the muscle motion parameters are used; if the driving model of the digital human is a skeletal animation, then the bone position parameters are used.
  • the following takes the bone position parameter as an example to explain in detail the modification of the target action parameter of the target action:
  • the first step is to obtain the action parameters of the target action.
  • a type of action in the action library may correspond to many different forms.
  • the action “goodbye” may include a wave of "goodbye” on the chest, a wave of "goodbye” on the side of the body, and a wave of "goodbye” above the head.
  • One form corresponds to a set of action parameters (collectively referred to as action parameters), and each set of action parameters is divided into start action parameters, intermediate action parameters, and termination action parameters according to their timing.
  • Each set of action parameters corresponds to a complete action.
  • This embodiment obtains at least one target action, that is, at least one form of the target action, together with the action parameters of each form, from the preset action library. According to the start action parameter of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action are obtained as the target action parameters; that is, from the multiple forms of the action, the target action with the smallest difference from the reference action is obtained. For example, if the reference action is "hands crossed in front of the chest", then when selecting the target action "goodbye", it is more appropriate to choose the "goodbye" waved in front of the chest: the difference between the hand and arm bone position parameters of these two actions is the smallest, and the action change is natural and real.
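This first step can be sketched as a minimum search over the candidate forms. The form names and parameter vectors below are illustrative assumptions:

```python
def select_form(reference_end, candidate_forms):
    """Among the forms of a target action, pick the one whose start
    action parameters differ least from the termination action
    parameters of the reference action.

    candidate_forms maps a form name to a (start, end) pair of
    parameter lists; reference_end is the reference action's
    termination parameters."""
    def diff(a, b):
        # accumulated absolute component-wise difference
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(candidate_forms,
               key=lambda name: diff(candidate_forms[name][0], reference_end))
```

For instance, with a reference action ending near the chest, a "goodbye" form that starts at the chest is selected over one that starts above the head.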
  • the second step is to modify the target action parameters.
  • After the target action parameters of the target action are determined, the target action parameters are further modified according to the reference action parameters, so that the difference between the modified target action parameters and the basic action parameters corresponding to the reference action parameters is reduced. In this way, the difference between the modified target action and the reference action is as small as possible, and the basic actions overlap as much as possible.
  • Specifically, the basic action parameters in the target action parameters can be modified into the basic action parameters in the reference action parameters, so that the difference between the modified target action parameters and the reference action parameters is minimal and the basic action of the reference action coincides with that of the modified target action.
  • For example, the action parameters corresponding to the upper-arm movement in the target action can be modified into the action parameters corresponding to the upper-arm movement in the reference action, or the difference between the two can be reduced.
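This second step can be sketched as copying the reference action's basic-action components into the target's parameter vector. Which indices count as "basic" (e.g. upper arms, head, feet) is model-specific, so passing them in explicitly is an assumption:

```python
def modify_target_params(target, reference, basic_indices):
    """Copy the reference action's basic-action parameters into the
    target action's parameters so the two basic actions coincide,
    leaving the target's characteristic parameters untouched."""
    modified = list(target)
    for i in basic_indices:
        modified[i] = reference[i]
    return modified
```

After this modification, the characteristic action (e.g. the palm swing of "goodbye") is preserved while the basic action matches the reference action, so the transition does not jump.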
  • S26 is further executed to drive the digital human according to the modified target action parameters.
  • Specifically, the duration feature can be obtained according to the target text; the target speech sequence corresponding to the target text can be obtained according to the duration feature; the target action sequence of the target text can be obtained according to the duration feature and the modified parameters of all target actions contained in the target text; and the target speech sequence and the target action sequence are input into the driving model of the digital human to drive the digital human to output the corresponding speech and actions.
  • After executing the target action, the digital human can be further driven to perform the reference action, that is, to return from the target action to the reference action.
  • the reference action parameter of the reference action can be added after the target action parameter when generating the action sequence.
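Appending the reference action parameters after each target action, as described above, can be sketched as follows; the list-of-parameter-vectors representation of the action sequence is an illustrative assumption:

```python
def build_action_sequence(reference_params, target_actions):
    """Assemble the action sequence sent to the driving model: each
    modified target action is followed by the reference action's
    parameters, so the digital human returns to its reference state
    after every inserted action."""
    sequence = []
    for params in target_actions:
        sequence.append(params)
        sequence.append(reference_params)  # return to the reference action
    return sequence
```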
  • In the embodiments of the present disclosure, the target action carried in the text is obtained and the tag of the target action is inserted into the text, so that when the digital human is driven by the text, the inserted action tag drives the digital human to perform the corresponding action, realizing text-driven actions for the digital human. Furthermore, the reference action before the execution of the target action is obtained, and the action parameters of the target action are modified according to the action parameters of the reference action, so that the difference between the target action and the reference action is small and the conversion from the reference action to the target action is natural and coordinated. This solves the technical problem of abrupt digital human action conversion in the prior art and increases the delicacy of digital human action conversion.
  • a device for driving a digital human is also provided. Please refer to FIG. 3.
  • the device includes:
  • the obtaining unit 31 is configured to obtain a target action corresponding to the target text; obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
  • the adjustment unit 32 is configured to modify the target action parameter of the target action according to the reference action parameter of the reference action;
  • the driving unit 33 is configured to drive the digital person to perform the target action according to the modified target action parameter in the process of driving the digital person to output speech based on the target text.
  • the target action is a facial expression or a physical action.
  • the action parameters are bone position parameters or muscle motion parameters.
  • the device further includes: an identification unit 34 and an insertion unit 35.
  • the recognition unit 34 is configured to obtain the target action corresponding to the text to be processed before obtaining the target action corresponding to the target text;
  • The inserting unit 35 is configured to convert the text to be processed into the target text through a speech synthesis markup language, and insert the label of the target action into the target text.
  • the recognition unit 34 may use any of the following methods to recognize and acquire the target action:
  • Manner 1: Obtain a preset keyword in the to-be-processed text; obtain a predetermined action corresponding to the preset keyword as the target action.
  • Manner 2: Perform semantic recognition on the to-be-processed text to obtain the action intention contained in the to-be-processed text; acquire a predetermined action corresponding to the action intention as the target action.
  • When the adjustment unit 32 modifies the action parameters, it can obtain at least one form of the target action and the action parameters of each form from a preset action library, the action parameters including a start action parameter and a termination action parameter; obtain, according to the action parameters of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action parameters as the target action parameters; and modify the target action parameters according to the reference action parameters so that the difference between the modified target action parameters and the basic action parameters corresponding to the reference action parameters is reduced.
  • FIG. 4 shows a block diagram of an electronic device 800 for implementing a method for driving a digital person according to one or more embodiments of the present disclosure.
  • the electronic device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
  • The electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • the processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing element 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the foregoing method.
  • the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
  • the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
  • the memory 804 is configured to store various types of data to support operations in the device 800. Examples of these data include instructions for any application or method to operate on the electronic device 800, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable and Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.
  • the power supply component 806 provides power for various components of the electronic device 800.
  • the power supply component 806 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 800.
  • The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • The audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (MIC), and when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
  • The audio component 810 further includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
  • the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • the sensor component 814 includes one or more sensors for providing the electronic device 800 with various aspects of state evaluation.
  • The sensor component 814 can detect the on/off state of the device 800 and the relative positioning of components, for example, the display and the keypad of the electronic device 800. The sensor component 814 can also detect a position change of the electronic device 800 or a component of the electronic device 800, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800.
  • the sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
  • the electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, to perform the above methods.
  • non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, which can be executed by the processor 820 of the electronic device 800 to complete the foregoing method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • provided is a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by the processor of a mobile terminal, the mobile terminal is enabled to execute a method for driving a digital person.
  • the method includes: acquiring a target action corresponding to a target text; obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.

Abstract

A method and apparatus for driving a digital person, and an electronic device. The method comprises: acquiring a target action corresponding to a target text (S20); obtaining, when driving a digital person to output speech on the basis of the target text, a reference action to be performed by the digital person before performing the target action (S22); modifying a target action parameter of the target action according to a reference action parameter of the reference action (S24); and, in the process of driving the digital person to output speech on the basis of the target text, driving the digital person to perform the target action according to the modified target action parameter (S26). In the technical solution, a corresponding action is obtained on the basis of text, and the action parameter of the target action corresponding to the text is modified according to the reference action of the digital person, so that the digital person's transition from the reference action to the target action is natural and delicate; this solves the prior-art technical problem of abrupt changes in the digital person's actions and improves the smoothness of those changes.

Description

Method, Apparatus, and Electronic Device for Driving a Digital Person
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202010420678.0, filed on May 18, 2020 and entitled "Method, Apparatus, and Electronic Device for Driving a Digital Person", the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of software technology, and in particular to a method, an apparatus, and an electronic device for driving a digital person.
Background
In the present disclosure, a digital person (Digital Human) refers to a comprehensive rendering technology that uses computers to simulate real humans; it is also called a virtual human, a hyper-realistic human, or a photorealistic human. Because people are extremely familiar with real humans, which gives rise to the well-known uncanny valley phenomenon, the difficulty of making a digital person realistic grows exponentially rather than linearly: a static 3D model may look very real, yet become unreal the moment it speaks or blinks. How to make the movements of a digital person more delicate and realistic has become a technical problem to be urgently solved in the current development of digital persons.
Summary
An object of the present disclosure is, at least in part, to provide a method, an apparatus, and an electronic device for driving a digital person, so as to solve the technical problem of abrupt motion changes of digital persons in the prior art and to improve the smoothness of such motion changes.
In a first aspect of the present disclosure, a method for driving a digital person is provided. The method includes: acquiring a target action corresponding to a target text; obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
In some embodiments, before the target action corresponding to the target text is acquired, the method further includes: acquiring a target action corresponding to a text to be processed; converting the text to be processed into the target text through a speech synthesis markup language; and inserting a tag of the target action into the target text.
In some embodiments, acquiring the target action corresponding to the text to be processed includes: acquiring a preset keyword in the text to be processed; and acquiring a predetermined action corresponding to the preset keyword as the target action.
In some embodiments, acquiring the target action corresponding to the text to be processed includes: performing semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and acquiring a predetermined action corresponding to the action intention as the target action.
In some embodiments, adjusting the target action parameter of the target action according to the reference action parameter of the reference action includes: acquiring, from a preset action library, at least one target action and an action parameter of each target action, where the action parameter includes a start action parameter and an end action parameter; acquiring, according to the action parameter of each target action, the action parameter whose start action parameter has the smallest difference from the end action parameter in the reference action parameter, as the target action parameter; and modifying the target action parameter according to the reference action parameter, so that the difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
In some embodiments, the action parameter is a bone position parameter or a muscle movement parameter.
In some embodiments, the target action is a facial expression or a body movement.
In a second aspect of the present disclosure, an apparatus for driving a digital person is provided. The apparatus includes: an acquiring unit, configured to acquire a target action corresponding to a target text and to obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; an adjusting unit, configured to modify a target action parameter of the target action according to a reference action parameter of the reference action; and a driving unit, configured to drive, in the process of driving the digital person to output speech based on the target text, the digital person to perform the target action according to the modified target action parameter.
In some embodiments, the apparatus further includes: a recognition unit, configured to acquire a target action corresponding to a text to be processed before the target action corresponding to the target text is acquired; and an insertion unit, configured to convert the text to be processed into the target text through a speech synthesis markup language and insert a tag of the target action into the target text.
In some embodiments, the recognition unit is configured to: acquire a preset keyword in the text to be processed; and acquire a predetermined action corresponding to the preset keyword as the target action.
In some embodiments, the recognition unit is further configured to: perform semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and acquire a predetermined action corresponding to the action intention as the target action.
In some embodiments, the adjusting unit is configured to: acquire, from a preset action library, at least one target action and an action parameter of each target action, where the action parameter includes a start action parameter and an end action parameter; acquire, according to the action parameter of each target action, the action parameter whose start action parameter has the smallest difference from the end action parameter in the reference action parameter, as the target action parameter; and modify the target action parameter according to the reference action parameter, so that the difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
In some embodiments, the action parameter is a bone position parameter or a muscle movement parameter.
In some embodiments, the target action is a facial expression or a body movement.
An embodiment of the present disclosure provides a method for driving a digital person: a target action corresponding to a target text is acquired; a reference action performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text is obtained; a target action parameter of the target action is modified according to a reference action parameter of the reference action, so that the target action is as close as possible to the reference action; and, in the process of driving the digital person based on the target text, the digital person is driven to perform, after the reference action, the target action with the modified action parameter. In this way, the digital person switches to the target action seamlessly, taking its current action state as the reference, and the action transition is natural and delicate. This solves the technical problem of abrupt motion changes of digital persons in the prior art and improves the smoothness of their motion changes.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a method for generating a text for driving a digital person according to one or more embodiments of the present disclosure;
Fig. 2 is a schematic flowchart of a method for driving a digital person according to one or more embodiments of the present disclosure;
Fig. 3 is a block diagram of an apparatus for driving a digital person according to one or more embodiments of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to one or more embodiments of the present disclosure.
Detailed Description
The present disclosure provides a method for driving a digital person, in which an inserted action is adjusted based on a reference action of the digital person, so that the transition between the reference action and the inserted action is natural and delicate, thereby solving the technical problem of abrupt motion changes of digital persons in the prior art.
The main implementation principles and specific implementations of the technical solutions of the embodiments of the present application, as well as the beneficial effects they can achieve, are described in detail below with reference to the accompanying drawings.
Embodiments
Referring to Fig. 1, an embodiment of the present disclosure provides a method for generating a text for driving a digital person. The method includes:
S10: acquiring a target action corresponding to a text to be processed;
S12: converting the text to be processed into a target text through a speech synthesis markup language, and inserting a tag of the target action into the target text.
In an embodiment of the present disclosure, the text content of the text to be processed is to be converted into speech for output. In the process of outputting the speech, actions corresponding to the text content may also need to be output. For example, suppose the text to be processed is "Please wave your hand like me and say hello to a friend far away"; when this text is converted into speech output, the action "wave hand" needs to be performed at the moment the speech "wave your hand" is output. In S10, the target action corresponding to the text to be processed is acquired; there may be one or more target actions, and this embodiment does not limit their specific number.
Specifically, S10 may acquire the target action corresponding to the text to be processed in any one or more of the following ways:
Way 1: acquiring a preset keyword in the text to be processed. The preset keyword may be a body movement keyword or a facial expression keyword, for example: "wave", "shake head", "smile", or "sad". A predetermined action corresponding to the preset keyword is acquired as the target action; the target action may be a facial expression or a body movement. An action library is created in advance to store the correspondence between keywords and actions, as well as the action parameters of each action, such as bone position parameters and muscle movement parameters. The actions in the action library may be obtained by capturing real-person motions with data collection devices such as cameras and three-dimensional scanners, or may be extracted from existing videos.
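Way 1 can be sketched as a simple keyword lookup against the preset action library. The keywords, action names, and parameter values below are hypothetical illustrations, not data from the patent:

```python
# Minimal sketch of Way 1: a preset action library mapping keywords to
# predetermined actions. All entries here are invented for illustration.
ACTION_LIBRARY = {
    "wave": {"action": "wave_hand", "bone_params": [0.2, 0.8, 0.5]},
    "smile": {"action": "smile", "muscle_params": [0.6, 0.1]},
    "shake head": {"action": "shake_head", "bone_params": [0.0, 0.3, 0.3]},
}

def find_target_actions(text: str) -> list:
    """Scan the text to be processed for preset keywords and return
    the predetermined actions registered for them."""
    return [entry["action"]
            for keyword, entry in ACTION_LIBRARY.items()
            if keyword in text]

print(find_target_actions("Please wave your hand like me"))  # ['wave_hand']
```

A real system would also record where in the text each keyword occurs, so the action can later be aligned with the synthesized speech.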
Way 2: performing semantic recognition on the text to be processed to obtain an action intention contained in the text, and acquiring a predetermined action corresponding to the action intention as the target action. Semantic recognition captures the intention of the text more accurately and comprehensively than action keywords alone. For example, for the text "The sun is bright and the air is fresh today, which is refreshing", no action is explicitly mentioned, but according to the meaning of the text, "the sun is bright" may correspond to an intention to raise the head, and "the air is fresh" may correspond to an intention to take a breath; the corresponding predetermined actions are acquired according to these action intentions. Likewise, an action library may be established in advance to store the correspondence between action intentions and actions, as well as the action parameters of each action, so that the predetermined action corresponding to an action intention can be quickly obtained from the library.
Way 3: manually annotating the text to be processed by inserting action identifiers, where different action identifiers correspond to different target actions. When the target action is acquired, the action identifiers in the text to be processed are searched for, and the corresponding target actions are acquired according to the found identifiers.
After the target action is acquired, S12 is performed for text conversion and action insertion, so that the resulting target text can be recognized by a speech synthesis service and the corresponding services can be provided. Speech Synthesis Markup Language (SSML) is an XML-based markup language; compared with synthesizing plain text, SSML enriches the synthesized content and brings more variation to the final synthesis result. In this embodiment, the text to be converted is placed within a <speak></speak> tag, and each speech synthesis task contains one <speak></speak> tag. In the process of converting to the target text, this embodiment also inserts the tag of the target action into the target text through the SSML markup language, so that the target text not only controls what the speech synthesis reads, but also controls the output of the corresponding action while the speech is read.
It should be noted that the tag of the target action may be the action name, in which case the corresponding action parameter is acquired according to the action name when the digital person is subsequently driven; alternatively, the target action parameter itself may be inserted into the target text as the tag, so that the parameter can be obtained directly when the digital person is driven.
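S12 can be sketched as building one SSML document per synthesis task. The patent does not name the action tag, so the `<action>` element and its `name` attribute below are assumptions; real TTS providers use their own vendor-specific SSML extensions:

```python
# Sketch of S12: wrap the whole task in a single <speak> tag and insert a
# hypothetical <action> tag before the text span the action is attached to.
from xml.sax.saxutils import escape

def to_ssml(segments) -> str:
    """segments: list of (text, action_name) pairs; action_name may be None.
    Text content is XML-escaped; action tags are emitted verbatim."""
    parts = []
    for text, action in segments:
        if action:
            parts.append('<action name="%s"/>' % action)
        parts.append(escape(text))
    return "<speak>%s</speak>" % "".join(parts)

ssml = to_ssml([("Please ", None), ("wave your hand like me", "wave_hand")])
print(ssml)
# <speak>Please <action name="wave_hand"/>wave your hand like me</speak>
```

Escaping the text content keeps the document well-formed even when the input text itself contains characters such as `<` or `&`.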
Referring to Fig. 2, an embodiment of the present disclosure provides a method for driving a digital person. The method includes:
S20: acquiring a target action corresponding to a target text;
S22: obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
S24: modifying a target action parameter of the target action according to a reference action parameter of the reference action;
S26: in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
In the process of text-driven speech output, the digital person is usually in a common state, i.e., a base state. For example, for a digital person broadcasting news, the base state may be standing or sitting at a desk, facing forward and broadcasting without expression; it may also involve some habitual movements mimicking those of a news anchor. Therefore, when an action is inserted during the broadcast, the two successive actions may differ greatly, producing the technical problem of abrupt action changes. In this embodiment, the target action in the target text and the reference action the digital person is in before performing the target action are obtained in advance, and the target action is modified based on the reference action so that the target action is as close as possible to the reference action, thereby solving the technical problem of abrupt action changes caused by large action differences.
In an embodiment of the present disclosure, S20 may directly search the target text for the action tag of the target action and acquire the corresponding target action according to the tag. The target text may contain the tags of one or more target actions. When S20 is performed, one action may be acquired at a time according to a tag, or the multiple target actions corresponding to the target text may be acquired at once to form a target action sequence; steps S22 to S26 are then performed for each target action.
S22 obtains the reference action of the digital person before the target action is performed. Specifically, the position feature of the target action in the target text may be obtained first, for example, between keywords x1 and x2, together with the duration feature of the target text, which is generated according to the phoneme features corresponding to the target text. According to the duration feature of the target text and the position feature of the target action, the first time point at which the target action is performed is obtained, i.e., at which point in the total duration of the voice broadcast the target action is performed; then, according to this first time point, the reference action of the digital person at the adjacent time point before the first time point is acquired. For example, assuming that the execution time point of the target action is 00:50:45, the reference action performed by the digital person at 00:50:44 is acquired. The reference action may be the basic action corresponding to the base state the digital person is usually in, a habitual action adopted during speech output, or another target action in the target text.
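The timing step of S22 can be sketched as follows. The per-phoneme durations, the one-second frame granularity, and the frame store are assumptions for illustration; the patent only specifies that the duration feature is derived from the phoneme features:

```python
# Sketch of S22's timing lookup: locate the target action's execution time
# point from phoneme durations, then fetch the action the digital person
# performs at the adjacent time point just before it.
def execution_time_point(durations, anchor_index) -> float:
    """Sum the phoneme durations preceding the action's anchor position
    (e.g. the keyword the action is attached to)."""
    return sum(durations[:anchor_index])

def reference_action(frames, time_point):
    """frames: {time_point_in_seconds: action}, sampled once per second."""
    return frames[int(time_point) - 1]

durations = [0.2, 0.3, 0.25, 0.25, 0.5]  # hypothetical phoneme durations (s)
t = execution_time_point(durations, 4)   # action anchored before phoneme 4
frames = {0: "base_state", 1: "hands_crossed"}
print(t, reference_action(frames, t))    # 1.0 base_state
```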
After the reference action is obtained, S24 is performed to modify the action parameter: the target action parameter is modified according to the reference action parameter, so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced. An action usually consists of a basic action and a characteristic action, corresponding to basic action parameters and characteristic action parameters respectively. The basic action may change with the scene, while the characteristic action generally does not. For example, the characteristic action of "goodbye" is generally the forearm driving the palm to wave, while the basic action includes the movements of the upper arm, head, feet, and so on. When the target action parameter is modified, the basic action parameters in the target action parameter may be modified according to the basic action parameters in the reference action parameter. The difference between two action parameters refers to the total difference obtained by subtracting the corresponding parameters and accumulating the results. Suppose the basic action parameter V = [x11..x1n, y11..y1m, z11..z1k] and the basic action parameter W = [x21..x2n, y21..y2m, z21..z2k]; then the difference between the two basic action parameters = ∑(x1i - x2i) + ∑(y1j - y2j) + ∑(z1l - z2l).
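The difference measure above can be sketched directly. Note one hedged choice: the text literally sums signed differences, but a signed sum lets opposite deviations cancel, so the sketch accumulates absolute values; this is an implementation assumption, not something the patent specifies:

```python
# Total difference between two basic action parameter vectors: subtract
# corresponding entries and accumulate. Absolute values are used so that
# opposite deviations cannot cancel out (an assumed refinement).
def parameter_difference(v, w) -> float:
    """Total difference between basic action parameters V and W."""
    return sum(abs(a - b) for a, b in zip(v, w))

v = [0.1, 0.5, 0.9]  # e.g. bone position parameters of the reference action
w = [0.2, 0.5, 0.6]
print(parameter_difference(v, w))  # ~0.4, up to floating-point error
```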
In an embodiment of the present disclosure, the action parameter may be a bone position parameter or a muscle movement parameter of the digital person, where the muscle movement parameter includes a muscle contraction parameter and a muscle relaxation parameter. Which parameter is acquired depends on the driving model of the digital person: if the driving model is a muscle binding model, muscle movement parameters are used; if the driving model is a skeletal animation, bone position parameters are used. The modification of the target action parameter of the target action is described in detail below, taking the bone position parameter as an example.
Step 1: acquiring the action parameter of the target action. One type of action in the action library may correspond to multiple different forms. For example, the action "goodbye" may include waving "goodbye" in front of the chest, waving "goodbye" at the side of the body, and waving "goodbye" above the head. Each form corresponds to a set of action parameters (collectively referred to as the action parameter), and each set is divided by time into a start action parameter, intermediate action parameters, and an end action parameter; each set corresponds to a complete action. To make the movement changes of the digital person natural and delicate, this embodiment acquires, from the preset action library, at least one target action, i.e., at least one form of the target action, together with the action parameter of each target action; then, according to the start action parameter of each target action, the action parameter whose start action parameter has the smallest difference from that of the reference action is acquired as the target action parameter, i.e., the form that differs least from the reference action is selected from the multiple forms. For example, if the reference action is "arms crossed in front of the chest", then when the target action "goodbye" is selected, waving "goodbye" in front of the chest is more appropriate: the difference between the arm bone position parameters of these two actions is the smallest, and the action transition is natural and realistic.
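Step 1 can be sketched as a minimum search over the available forms. The form names and bone position values are hypothetical:

```python
# Select, among several forms of the same action class, the one whose start
# action parameter is closest to the reference action's parameter vector.
def select_variant(reference_params, variants):
    """variants: list of dicts with 'name' and 'params', where params[0]
    is the start action parameter vector of that form."""
    def start_difference(variant):
        return sum(abs(a - b)
                   for a, b in zip(variant["params"][0], reference_params))
    return min(variants, key=start_difference)

goodbye_forms = [
    {"name": "wave_at_chest", "params": [[0.4, 0.5], [0.6, 0.7]]},
    {"name": "wave_overhead", "params": [[0.9, 1.0], [1.0, 1.0]]},
]
reference = [0.45, 0.5]  # arms crossed in front of the chest (hypothetical)
print(select_variant(reference, goodbye_forms)["name"])  # wave_at_chest
```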
Step 2: modifying the target action parameter. After the target action parameter of the target action is determined, it is further modified according to the reference action parameter, so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced; in this way, the difference between the modified target action and the reference action is as small as possible, and the basic actions coincide as much as possible. In some embodiments, when the target action parameter is modified, the basic action parameters in the target action parameter may be replaced with the basic action parameters in the reference action parameter; the difference between the modified target action parameter and the reference action parameter is then minimal, and the reference action coincides with the basic action of the modified target action. For example, for the reference action "arms crossed in front of the chest" and the target action of waving "goodbye" in front of the chest, the action parameter corresponding to the upper-arm movement in the target action may be replaced with that of the upper-arm movement in the reference action, or the difference between the two may be reduced.
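Step 2 can be sketched as overwriting the basic part of the parameter vector while leaving the characteristic part untouched. The split of the vector into basic and characteristic indices is an assumed convention:

```python
# Overwrite the basic action parameters of the selected target action with
# those of the reference action; characteristic (forearm/palm) parameters
# stay unchanged so the action remains recognizable.
def modify_target_params(target_params, reference_params, basic_indices):
    modified = list(target_params)
    for i in basic_indices:
        modified[i] = reference_params[i]  # basic parts now coincide
    return modified

target = [0.7, 0.2, 0.9, 0.4]    # e.g. [upper_arm, head, forearm, palm]
reference = [0.3, 0.1, 0.0, 0.0]
print(modify_target_params(target, reference, basic_indices=[0, 1]))
# [0.3, 0.1, 0.9, 0.4]
```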
After S24, S26 is executed: drive the digital person according to the modified target action parameters. Specifically, when the digital person is driven on the basis of the target text, a duration feature may be obtained from the target text; a target speech sequence corresponding to the target text is obtained from the duration feature; a target action sequence for the target text is obtained from the duration feature and the modified parameters of all the target actions contained in the target text; and the target speech sequence and the target action sequence are input into the digital person's driving model, which drives the digital person to output the corresponding speech and actions. In this embodiment, after the target action has been performed, the digital person may further be driven to perform the reference action, that is, to return from the target action to the reference action. In a specific implementation, this is achieved simply by appending the reference action parameters of the reference action after the target action parameters when the action sequence is generated.
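The sequence-assembly detail above — appending the reference action's parameters after the target action's so that the digital person returns to the reference pose — might look like the following sketch; the list-of-dicts representation of the action sequence is an assumption for illustration.

```python
def build_action_sequence(reference_params, target_action_params_list):
    """Assemble the action sequence for the driving model: each modified
    target action in order, followed by the reference action's parameters
    so the digital person returns to the reference pose afterwards."""
    sequence = list(target_action_params_list)
    sequence.append(reference_params)  # return from the target action to the reference
    return sequence

ref = {"pose": "arms_crossed"}
targets = [{"pose": "wave_at_chest_start"}, {"pose": "wave_at_chest_end"}]
seq = build_action_sequence(ref, targets)
print([s["pose"] for s in seq])
# ['wave_at_chest_start', 'wave_at_chest_end', 'arms_crossed']
```

In a full system this sequence would be time-aligned against the duration feature of the target text before being fed to the driving model together with the speech sequence.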
In the above technical solution, the target action carried in the text is obtained by recognizing the semantics and/or keywords of the text, and a tag for that target action is inserted into the text, so that when the digital person is driven by the text, the inserted action tag drives the digital person to perform the corresponding action; the text thus drives the digital person's actions. Further, for the target action corresponding to the text, the reference action performed before the target action is obtained, and the action parameters of the target action are modified according to the action parameters of the reference action, reducing the difference between the target action and the reference action. When the digital person transitions from the reference action to the target action, the transition is therefore natural and coordinated, which solves the prior-art technical problem of abrupt action transitions and makes the digital person's action transitions smoother.
In one aspect of the present disclosure, an apparatus for driving a digital person is also provided. Referring to FIG. 3, the apparatus includes:
an obtaining unit 31, configured to obtain a target action corresponding to a target text, and to obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
an adjustment unit 32, configured to modify a target action parameter of the target action according to a reference action parameter of the reference action; and
a driving unit 33, configured to drive the digital person to perform the target action according to the modified target action parameter in the process of driving the digital person to output speech based on the target text.
In some embodiments, the target action is a facial expression or a body action, and the action parameter is a bone position parameter or a muscle movement parameter.
In some embodiments, the apparatus further includes a recognition unit 34 and an insertion unit 35. The recognition unit 34 is configured to obtain, before the target action corresponding to the target text is obtained, the target action corresponding to the text to be processed; the insertion unit 35 is configured to convert the text to be processed into the target text through a speech synthesis markup language, and to insert the tag of the target action into the target text.
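The insertion unit's tag insertion might be sketched as follows. The `<action>` element is purely illustrative: standard SSML defines no such tag, and a real system would use its own extension markup or something like SSML's `<mark>` element.

```python
def to_tagged_ssml(text, actions):
    """Wrap the text to be processed in (simplified) SSML-style markup and
    insert an action tag before each keyword that triggered a target action.

    actions: list of (keyword, action_name) pairs found in the text.
    """
    tagged = text
    for keyword, action in actions:
        # Insert the action tag just before the first occurrence of the keyword
        idx = tagged.lower().find(keyword)
        if idx != -1:
            tagged = tagged[:idx] + f'<action name="{action}"/>' + tagged[idx:]
    return f"<speak>{tagged}</speak>"

print(to_tagged_ssml("goodbye everyone", [("goodbye", "wave")]))
# <speak><action name="wave"/>goodbye everyone</speak>
```

When the driving model later walks the target text, each embedded tag marks where the corresponding action should begin relative to the synthesized speech.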
In an embodiment of the present disclosure, the recognition unit 34 may recognize and obtain the target action in either of the following ways:
Way 1: obtain a preset keyword from the text to be processed, and obtain the predetermined action corresponding to the preset keyword as the target action.
Way 2: perform semantic recognition on the text to be processed to obtain the action intention contained in it, and obtain the predetermined action corresponding to that action intention as the target action.
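Way 1 can be illustrated with a toy keyword table; the table entries and function names are hypothetical, and a real system would query the preset action library rather than a hard-coded dict.

```python
# Hypothetical keyword-to-action table; the real action library and its
# entries are not specified in the disclosure.
KEYWORD_ACTIONS = {
    "goodbye": "wave",
    "hello": "nod",
    "welcome": "open_arms",
}

def find_target_actions(text):
    """Way 1: scan the text to be processed for preset keywords and map
    each hit to its predetermined action."""
    lowered = text.lower()
    return [(kw, act) for kw, act in KEYWORD_ACTIONS.items() if kw in lowered]

print(find_target_actions("Welcome, and goodbye!"))
# [('goodbye', 'wave'), ('welcome', 'open_arms')]
```

Way 2 would replace the substring scan with a semantic model that classifies the sentence's action intention before the same library lookup.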
In some embodiments, when modifying the action parameters, the adjustment unit 32 may obtain from the preset action library at least one target action together with the action parameters of each target action, the action parameters including an initial action parameter and a terminal action parameter; obtain, according to the action parameters of each target action, as the target action parameter the action parameters corresponding to the initial action parameter that differs least from the terminal action parameter among the reference action parameters; and modify the target action parameter according to the reference action parameter, so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced.
As for the apparatus in the embodiments of the present disclosure, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
FIG. 4 shows a block diagram of an electronic device 800 for implementing the method for driving a digital person according to one or more embodiments of the present disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 4, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or some of the steps of the method described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and the other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 806 provides power for the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing state evaluations of various aspects of the electronic device 800. For example, the sensor component 814 can detect the on/off state of the device 800 and the relative positioning of components, for instance of the display and keypad of the electronic device 800; the sensor component 814 can also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and changes in its temperature. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the method described above.
In yet another aspect of the present disclosure, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, which can be executed by the processor 820 of the electronic device 800 to complete the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Also provided is a non-transitory computer-readable storage medium whose instructions, when executed by the processor of a mobile terminal, enable the mobile terminal to perform a method for driving a digital person, the method including: obtaining a target action corresponding to a target text; obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptive changes of the present invention that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in this disclosure. The specification and the embodiments are to be regarded as exemplary only, the true scope and spirit of the present invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise constructions described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (16)

  1. A method for driving a digital person, comprising:
    obtaining a target action corresponding to a target text;
    obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
    modifying a target action parameter of the target action according to a reference action parameter of the reference action; and
    in a process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
  2. The method according to claim 1, wherein before the obtaining of the target action corresponding to the target text, the method further comprises:
    obtaining a target action corresponding to a text to be processed; and
    converting the text to be processed into the target text through a speech synthesis markup language, and inserting a tag of the target action into the target text.
  3. The method according to claim 2, wherein the obtaining of the target action corresponding to the text to be processed comprises:
    obtaining a preset keyword from the text to be processed; and
    obtaining a predetermined action corresponding to the preset keyword as the target action.
  4. The method according to claim 2, wherein the obtaining of the target action corresponding to the text to be processed comprises:
    performing semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and
    obtaining a predetermined action corresponding to the action intention as the target action.
  5. The method according to claim 1, wherein the adjusting of the target action parameter of the target action according to the reference action parameter of the reference action comprises:
    obtaining, from a preset action library, at least one target action and an action parameter of each target action, the action parameter comprising an initial action parameter and a terminal action parameter;
    obtaining, according to the action parameter of each target action, an action parameter corresponding to an initial action parameter having a smallest difference from the terminal action parameter among the reference action parameters, as the target action parameter; and
    modifying the target action parameter according to the reference action parameter, so that a difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
  6. The method according to any one of claims 1 to 5, wherein the action parameter is a bone position parameter or a muscle movement parameter.
  7. The method according to any one of claims 1 to 5, wherein the target action is a facial expression or a body action.
  8. An apparatus for driving a digital person, comprising:
    an obtaining unit, configured to obtain a target action corresponding to a target text, and to obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
    an adjustment unit, configured to modify a target action parameter of the target action according to a reference action parameter of the reference action; and
    a driving unit, configured to drive, in a process of driving the digital person to output speech based on the target text, the digital person to perform the target action according to the modified target action parameter.
  9. The apparatus according to claim 8, further comprising:
    a recognition unit, configured to obtain, before the target action corresponding to the target text is obtained, a target action corresponding to a text to be processed; and
    an insertion unit, configured to convert the text to be processed into the target text through a speech synthesis markup language, and to insert a tag of the target action into the target text.
  10. The apparatus according to claim 9, wherein the recognition unit is configured to:
    obtain a preset keyword from the text to be processed; and
    obtain a predetermined action corresponding to the preset keyword as the target action.
  11. The apparatus according to claim 9, wherein the recognition unit is further configured to:
    perform semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and
    obtain a predetermined action corresponding to the action intention as the target action.
  12. The apparatus according to claim 8, wherein the adjustment unit is configured to:
    obtain, from a preset action library, at least one target action and an action parameter of each target action, the action parameter comprising an initial action parameter and a terminal action parameter;
    obtain, according to the action parameter of each target action, an action parameter corresponding to an initial action parameter having a smallest difference from the terminal action parameter among the reference action parameters, as the target action parameter; and
    modify the target action parameter according to the reference action parameter, so that a difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
  13. The apparatus according to any one of claims 8 to 12, wherein the action parameter is a bone position parameter or a muscle movement parameter.
  14. The apparatus according to any one of claims 8 to 12, wherein the target action is a facial expression or a body action.
  15. An electronic device, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising operation instructions for performing the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps corresponding to the method according to any one of claims 1 to 7.
PCT/CN2021/078242 2020-05-18 2021-02-26 Method and apparatus for driving digital person, and electronic device WO2021232875A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/989,323 US20230082830A1 (en) 2020-05-18 2022-11-17 Method and apparatus for driving digital human, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010420678.0 2020-05-18
CN202010420678.0A CN113689530B (en) 2020-05-18 2020-05-18 Method and device for driving digital person and electronic equipment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078248 Continuation WO2021232878A1 (en) 2020-05-18 2021-02-26 Virtual anchor face swapping method and apparatus, electronic device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/989,323 Continuation US20230082830A1 (en) 2020-05-18 2022-11-17 Method and apparatus for driving digital human, and electronic device

Publications (1)

Publication Number Publication Date
WO2021232875A1 true WO2021232875A1 (en) 2021-11-25

Family

ID=78575522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078242 WO2021232875A1 (en) 2020-05-18 2021-02-26 Method and apparatus for driving digital person, and electronic device

Country Status (2)

Country Link
CN (1) CN113689530B (en)
WO (1) WO2021232875A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116708920A (en) * 2022-06-30 2023-09-05 北京生数科技有限公司 Video processing method, device and storage medium applied to virtual image synthesis
CN117808942A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Semantic strong-correlation 3D digital human action generation method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661308B (en) * 2022-11-03 2024-03-19 北京百度网讯科技有限公司 Method, apparatus, electronic device, and storage medium for driving digital person

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102497513A (en) * 2011-11-25 2012-06-13 中山大学 Video virtual hand language system facing digital television
US20150187112A1 (en) * 2013-12-27 2015-07-02 Toonimo, Inc. System and Method for Automatic Generation of Animation
CN108665492A (en) * 2018-03-27 2018-10-16 北京光年无限科技有限公司 A kind of Dancing Teaching data processing method and system based on visual human
CN110688008A (en) * 2019-09-27 2020-01-14 贵州小爱机器人科技有限公司 Virtual image interaction method and device
CN110880198A (en) * 2018-09-06 2020-03-13 百度在线网络技术(北京)有限公司 Animation generation method and device
CN111145322A (en) * 2019-12-26 2020-05-12 上海浦东发展银行股份有限公司 Method, apparatus and computer-readable storage medium for driving avatar

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1256931A1 (en) * 2001-05-11 2002-11-13 Sony France S.A. Method and apparatus for voice synthesis and robot apparatus


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116708920A (en) * 2022-06-30 2023-09-05 北京生数科技有限公司 Video processing method, device and storage medium applied to virtual image synthesis
CN116708920B (en) * 2022-06-30 2024-04-19 北京生数科技有限公司 Video processing method, device and storage medium applied to virtual image synthesis
CN117808942A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Semantic strong-correlation 3D digital human action generation method and system

Also Published As

Publication number Publication date
CN113689530B (en) 2023-10-20
CN113689530A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
US11503377B2 (en) Method and electronic device for processing data
CN109637518B (en) Virtual anchor implementation method and device
WO2021232875A1 (en) Method and apparatus for driving digital person, and electronic device
CN108363706B (en) Method and device for man-machine dialogue interaction
CN111726536A (en) Video generation method and device, storage medium and computer equipment
TWI255141B (en) Method and system for real-time interactive video
US20080165195A1 (en) Method, apparatus, and software for animated self-portraits
CN112199016B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111954063B (en) Content display control method and device for video live broadcast room
CN110322760B (en) Voice data generation method, device, terminal and storage medium
JP7209851B2 (en) Image deformation control method, device and hardware device
US20210029304A1 (en) Methods for generating video, electronic device and storage medium
WO2019153925A1 (en) Searching method and related device
CN109819167B (en) Image processing method and device and mobile terminal
US10893203B2 (en) Photographing method and apparatus, and terminal device
EP3340077B1 (en) Method and apparatus for inputting expression information
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN110794964A (en) Interaction method and device for virtual robot, electronic equipment and storage medium
US11076091B1 (en) Image capturing assistant
US20240022772A1 (en) Video processing method and apparatus, medium, and program product
CN113709548B (en) Image-based multimedia data synthesis method, device, equipment and storage medium
CN111292743B (en) Voice interaction method and device and electronic equipment
CN114339393A (en) Display processing method, server, device, system and medium for live broadcast picture
CN115225756A (en) Method for determining target object, shooting method and device
CN114495988B (en) Emotion processing method of input information and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21809522

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21809522

Country of ref document: EP

Kind code of ref document: A1