WO2021232875A1 - Method and apparatus for driving digital person, and electronic device - Google Patents

Method and apparatus for driving digital person, and electronic device

Info

Publication number
WO2021232875A1
Authority
WO
WIPO (PCT)
Prior art keywords
action
target
parameter
text
target action
Prior art date
Application number
PCT/CN2021/078242
Other languages
French (fr)
Chinese (zh)
Inventor
樊博
Original Assignee
北京搜狗科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京搜狗科技发展有限公司 filed Critical 北京搜狗科技发展有限公司
Publication of WO2021232875A1
Priority to US17/989,323 (published as US20230082830A1)



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Definitions

  • the present disclosure relates to the field of software technology, and in particular to a method, device and electronic equipment for driving digital humans.
  • A digital human (also called a virtual human, hyper-realistic human, or photo-realistic human) is a comprehensive rendering technology that uses computers to simulate real humans. Because people are extremely familiar with real humans (the well-known uncanny valley phenomenon), the difficulty of realizing a digital human does not increase linearly but exponentially: a 3D static model may look very real, yet a single blink of an eye can immediately make it look unreal. How to make the movements of digital humans more delicate and realistic has become a technical problem that urgently needs to be solved in the development of digital humans.
  • The purpose of the present disclosure is, at least in part, to provide a method, an apparatus, and an electronic device for driving a digital human, which are used to solve the technical problem of abrupt changes in digital human motions in the prior art and to improve the fineness of digital human motion changes.
  • A method for driving a digital human includes: obtaining a target action corresponding to a target text; obtaining, when the digital human is driven to output speech based on the target text, a reference action to be executed by the digital human before the target action is executed; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital human to output speech based on the target text, driving the digital human to perform the target action according to the modified target action parameter.
  • Before the target action corresponding to the target text is acquired, the method further includes: acquiring the target action corresponding to a text to be processed; converting the text to be processed into the target text through a speech synthesis markup language; and inserting a tag of the target action into the target text.
  • the acquiring a target action corresponding to the text to be processed includes: acquiring a preset keyword in the text to be processed; acquiring a predetermined action corresponding to the preset keyword as the target action.
  • The obtaining the target action corresponding to the text to be processed includes: performing semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and obtaining a predetermined action corresponding to the action intention as the target action.
  • The modifying the target action parameter of the target action according to the reference action parameter of the reference action includes: obtaining, from a preset action library, at least one form of the target action and the action parameters of each form, the action parameters including a start action parameter and a termination action parameter; obtaining, according to the action parameters of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action parameters, as the target action parameter; and modifying the target action parameter according to the reference action parameter so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced.
  • the motion parameter is a bone position parameter or a muscle movement parameter.
  • the target action is a facial expression or a physical action.
  • An apparatus for driving a digital human includes: an acquiring unit, configured to acquire a target action corresponding to a target text and to acquire, when the digital human is driven to output speech based on the target text, a reference action to be performed by the digital human before the target action is executed; an adjustment unit, configured to modify the target action parameter of the target action based on the reference action parameter of the reference action; and a driving unit, configured to drive the digital human to perform the target action according to the modified target action parameter in the process of driving the digital human to output speech based on the target text.
  • The apparatus further includes: a recognition unit, configured to obtain the target action corresponding to a text to be processed before the target action corresponding to the target text is acquired; and an insertion unit, configured to convert the text to be processed into the target text and insert the tag of the target action into the target text.
  • the recognition unit is configured to: obtain a preset keyword in the to-be-processed text; obtain a predetermined action corresponding to the preset keyword as the target action.
  • the recognition unit is further configured to: perform semantic recognition on the to-be-processed text to obtain the action intention contained in the to-be-processed text; and obtain a predetermined action corresponding to the action intention as the target action .
  • The adjustment unit is configured to: obtain at least one form of the target action and the action parameters of each form from a preset action library, the action parameters including a start action parameter and a termination action parameter; obtain, according to the action parameters of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action parameters as the target action parameter; and modify the target action parameter according to the reference action parameter so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced.
  • the motion parameter is a bone position parameter or a muscle movement parameter.
  • the target action is a facial expression or a physical action.
  • An embodiment of the present disclosure provides a method for driving a digital human: obtain a target action corresponding to a target text; obtain a reference action performed by the digital human before performing the target action when the digital human is driven to output speech based on the target text; modify the target action parameters of the target action according to the reference action parameters of the reference action, so that the target action and the reference action are as close as possible; and, in the process of driving the digital human based on the target text, drive the digital human to execute the target action with the modified parameters after the reference action. This enables the digital human to switch seamlessly from its current action state to the target action, making the action change process natural and delicate, which solves the technical problem of sudden changes in the digital human's movements in the prior art and improves the fineness of movement changes.
  • Fig. 1 shows a schematic flowchart of a method for generating digital-human-driven text according to one or more embodiments of the present disclosure;
  • Fig. 2 shows a schematic flowchart of a method for driving a digital human according to one or more embodiments of the present disclosure;
  • Fig. 3 shows a block diagram of an apparatus for driving a digital human according to one or more embodiments of the present disclosure;
  • Fig. 4 shows a schematic structural diagram of an electronic device according to one or more embodiments of the present disclosure.
  • the present disclosure provides a method for driving a digital human.
  • The inserted action is adjusted based on the reference action of the digital human, so that the action change process between the reference action and the inserted action is natural and delicate, thereby solving the problem of abrupt digital human action changes in the prior art.
  • An embodiment of the present disclosure provides a method for generating digital-human-driven text, which includes:
  • S10: Acquire a target action corresponding to the text to be processed.
  • S12: Convert the to-be-processed text into the target text by using a speech synthesis markup language, and insert the tag of the target action into the target text.
  • The text content of the text to be processed needs to be converted into speech and output. When the text is converted into speech output, it is necessary, for example, to output the action "wave hand" at the same time as the voice "wave hand" is output.
  • S10 acquires a target action corresponding to the text to be processed.
  • the target action may be one or more than one. This embodiment does not limit the specific number of target actions.
  • S10 may obtain the target action corresponding to the text to be processed through any one or more of the following methods:
  • Method 1: Obtain the preset keywords in the text to be processed.
  • The preset keywords can be body motion keywords or facial expression keywords, for example: "wave hand", "shake head", "smile", "sad", etc.
  • A predetermined action corresponding to a preset keyword is acquired as a target action, and the target action can be a facial expression or a physical action.
  • The actions in the action library can be obtained through data collection devices such as cameras and three-dimensional scanners that capture real-person actions, or they can be extracted from existing videos.
  • Method 2: Perform semantic recognition on the text to be processed to obtain the action intention contained in the text to be processed; obtain the predetermined action corresponding to the action intention as the target action.
  • Through semantic recognition, the intention of the text to be processed can be obtained more accurately and comprehensively, rather than being limited to explicit action words. For example, for the text "Today's sun is bright and beautiful, the air is fresh and refreshing", although the text does not mention any action, according to the meaning of the entire text, "the sun is bright and beautiful" may correspond to an action intention of raising the head, and "the air is fresh" may correspond to an action intention of taking a deep breath.
  • the corresponding predetermined actions are obtained.
  • an action library can be established in advance to store the correspondence between each action intention and each action, as well as the action parameters of each action, so that the predetermined action corresponding to the action intention can be quickly obtained from the action library.
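The action library described above can be sketched as a simple lookup table from action intentions to predetermined actions and their action parameters. All intent names, action names, and parameter values below are illustrative assumptions, not taken from this disclosure:

```python
# A minimal sketch of a pre-built action library: each entry maps a
# recognized action intention to a predetermined action and its action
# parameters (start and termination). Names and values are hypothetical.
ACTION_LIBRARY = {
    "raise_head": {"action": "raise_head", "start": [0.0, 0.0], "end": [0.3, 0.1]},
    "deep_breath": {"action": "deep_breath", "start": [0.0, 0.0], "end": [0.1, 0.2]},
    "wave_hand": {"action": "wave_hand", "start": [0.2, 0.1], "end": [0.8, 0.6]},
}

def lookup_action(intent: str):
    """Quickly obtain the predetermined action for an intention, or None."""
    entry = ACTION_LIBRARY.get(intent)
    return entry["action"] if entry else None
```

Because the correspondence is precomputed, obtaining the predetermined action at drive time is a constant-time lookup rather than a search.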
  • Method 3: Manually mark the text to be processed and insert action identifiers.
  • Different action identifiers correspond to different target actions.
  • the action identifier in the text to be processed is searched, and the corresponding target action can be obtained according to the action identifier obtained by the search.
  • Speech Synthesis Markup Language (SSML) is an XML-based markup language. Compared with synthesizing plain text, using SSML can enrich the synthesized content and bring more variety to the final synthesis result.
  • The SSML markup language is used to convert the target text: the text to be converted is placed in a <speak></speak> tag, and each speech synthesis task includes one <speak></speak> tag.
  • this embodiment also inserts the tag of the target action into the target text through the SSML markup language, so that the target text can not only control what the speech synthesis reads, but also control the output of corresponding actions when reading the speech.
  • The label of the target action can be the action name, in which case the corresponding action parameters are obtained according to the action name. Alternatively, the target action parameters can be directly inserted into the target text as the label, so that when the digital human is driven, the target action parameters can be obtained directly.
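The tagging step can be sketched as follows. Note that standard SSML defines no action element, so the `<action>` tag name used here is an assumed custom extension, and the triggering keyword is hypothetical:

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, action_name: str, keyword: str) -> str:
    """Wrap the text to be processed in a <speak> element and insert a
    hypothetical <action> tag immediately before the keyword that
    triggered the target action, yielding the target text."""
    marked = escape(text).replace(
        escape(keyword),
        f'<action name="{action_name}"/>{escape(keyword)}',
        1,  # tag only the first occurrence of the keyword
    )
    return f"<speak>{marked}</speak>"
```

For example, `build_ssml("Goodbye everyone", "wave_hand", "Goodbye")` yields a `<speak>` document in which the action tag precedes the word that should be spoken while the action is performed.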
  • an embodiment of the present disclosure provides a method for driving a digital human, and the method includes:
  • In the process of driving the digital human to output speech based on the target text, the digital human may usually be in a common state, namely the reference state.
  • For example, for a digital human broadcasting news, the reference state may be standing upright or sitting at a desk without expression, or, owing to the habits of news broadcasters, habitually performing certain actions. For this reason, when actions are inserted during the broadcast, there may be a large difference between the two adjacent actions, causing sudden changes in the actions.
  • This embodiment obtains in advance the target action in the target text and the reference action that the digital human performs before executing the target action, and modifies the target action based on the reference action so that the target action is as close as possible to the reference action, thereby solving the technical problem of abrupt action changes caused by large action differences.
  • S20 may directly search and obtain the action label of the target action from the target text, and obtain the corresponding target action according to the action label.
  • the target text may contain one or more target action labels.
  • S22 obtains the reference action of the digital human before it performs the target action.
  • Specifically, the location feature of the target action in the target text can be obtained first (for example, between the keywords x1 and x2), together with the duration feature of the target text, which is generated according to the phoneme features corresponding to the target text. The duration feature and the location feature of the target action are then used to obtain the first time point at which the target action is executed, that is, at which point in the total duration of the voice broadcast the target action is executed. Finally, according to the first time point, the reference action of the digital human at the adjacent time point before the first time point is obtained.
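The time-point computation above can be sketched as summing phoneme durations up to the action's position. The flat list representation and the index argument are illustrative assumptions:

```python
def action_time_point(phoneme_durations, action_phoneme_index):
    """Locate the first time point at which the target action is executed:
    accumulate the phoneme durations (the duration feature, in seconds)
    up to the phoneme where the action tag appears in the target text."""
    return sum(phoneme_durations[:action_phoneme_index])
```

Given this time point, the reference action is whatever action the digital human is performing at the adjacent preceding time point.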
  • the reference action may be a basic action corresponding to the reference state of the digital human, or it may be a habitual action adopted in the voice input process, or may be other target actions in the target text.
  • An action usually includes a basic action and a characteristic action, which correspond to basic action parameters and characteristic action parameters respectively. The basic action can change according to the scene, while the characteristic action generally does not change with the scene. For example, the characteristic action of "goodbye" is the forearm driving the palm to swing, while the basic action includes the movements of the upper arms, head, feet, and so on.
  • the difference between the action parameters refers to the total difference obtained by subtracting and accumulating the corresponding parameters in the action parameters.
  • basic action parameter V = [x11~x1n, y11~y1m, z11~z1k]
  • basic action parameter W = [x21~x2n, y21~y2m, z21~z2k]
  • The difference between the two basic action parameters is Σ(x1i − x2i) + Σ(y1j − y2j) + Σ(z1l − z2l), that is, the total obtained by subtracting the corresponding components and accumulating the results.
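As a minimal sketch, the accumulation of component-wise differences can be written as follows. Taking absolute values so that opposite-signed offsets do not cancel is an assumption added here; the text itself writes plain differences:

```python
def parameter_difference(v, w):
    """Total difference between two action-parameter vectors V and W:
    subtract corresponding components and accumulate the results.
    Absolute values keep positive and negative offsets from cancelling."""
    assert len(v) == len(w), "parameter vectors must have equal length"
    return sum(abs(a - b) for a, b in zip(v, w))
```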
  • the action parameter referred to in this embodiment may be a bone position parameter or a muscle movement parameter of a digital human, where the muscle movement parameter includes a muscle contraction parameter and a muscle relaxation parameter.
  • Which parameter to obtain is determined according to the driving model of the digital human. If the driving model of the digital human is a muscle binding model, then the muscle motion parameters are used; if the driving model of the digital human is a skeletal animation, then the bone position parameters are used.
  • the following takes the bone position parameter as an example to explain in detail the modification of the target action parameter of the target action:
  • the first step is to obtain the action parameters of the target action.
  • a type of action in the action library may correspond to many different forms.
  • the action “goodbye” may include a wave of "goodbye” on the chest, a wave of "goodbye” on the side of the body, and a wave of "goodbye” above the head.
  • One form corresponds to a set of action parameters (collectively referred to as action parameters), and each set of action parameters is divided into start action parameters, intermediate action parameters, and termination action parameters according to their timing.
  • Each set of action parameters corresponds to a complete action.
  • This embodiment obtains at least one target action, that is, at least one form of the target action, together with the action parameters of each form, from the preset action library. According to the start action parameter of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action are obtained as the target action parameters; that is, from the multiple forms of the action, the target action with the smallest difference from the reference action is obtained. For example, if the reference action is "hands crossed in front of the chest", then when selecting the target action "goodbye", it is more appropriate to choose the "goodbye" waved in front of the chest: the difference between the hand and arm bone position parameters of these two actions is the smallest, and the action change is natural and real.
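This first step can be sketched as a minimum search over the candidate forms. The form names and parameter vectors below are illustrative assumptions:

```python
def select_form(reference_end, candidate_forms):
    """Among the forms of a target action, pick the one whose start
    action parameters differ least from the termination action
    parameters of the reference action.

    candidate_forms maps a form name to a (start, end) pair of
    parameter lists; reference_end is the reference action's
    termination parameters."""
    def diff(a, b):
        # accumulated absolute component-wise difference
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(candidate_forms,
               key=lambda name: diff(candidate_forms[name][0], reference_end))
```

For instance, with a reference action ending near the chest, a "goodbye" form that starts at the chest is selected over one that starts above the head.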
  • the second step is to modify the target action parameters.
  • After the target action parameters of the target action are determined, the target action parameters are further modified according to the reference action parameters, so that the difference between the modified target action parameters and the basic action parameters corresponding to the reference action parameters is reduced. In this way, the difference between the modified target action and the reference action is as small as possible, and the basic actions overlap as much as possible.
  • Specifically, the basic action parameters in the target action parameters can be modified into the basic action parameters in the reference action parameters, so that the difference between the modified target action parameters and the reference action parameters is minimal and the basic action of the reference action coincides with that of the modified target action.
  • For example, the action parameters corresponding to the upper-arm movement in the target action can be modified into the action parameters corresponding to the upper-arm movement in the reference action, or the difference between the two can be reduced.
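This second step can be sketched as copying the reference action's basic-action components into the target's parameter vector. Which indices count as "basic" (e.g. upper arms, head, feet) is model-specific, so passing them in explicitly is an assumption:

```python
def modify_target_params(target, reference, basic_indices):
    """Copy the reference action's basic-action parameters into the
    target action's parameters so the two basic actions coincide,
    leaving the target's characteristic parameters untouched."""
    modified = list(target)
    for i in basic_indices:
        modified[i] = reference[i]
    return modified
```

After this modification, the characteristic action (e.g. the palm swing of "goodbye") is preserved while the basic action matches the reference action, so the transition does not jump.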
  • S26 is further executed to drive the digital human according to the modified target action parameters.
  • Specifically, the duration feature can be obtained according to the target text; the target speech sequence corresponding to the target text can be obtained according to the duration feature; the target action sequence of the target text can be obtained according to the duration feature and the modified parameters of all target actions contained in the target text; and the target speech sequence and the target action sequence are input into the driving model of the digital human to drive the digital human to output the corresponding speech and actions.
  • After executing the target action, the digital human can be further driven to perform the reference action, that is, to return from the target action to the reference action.
  • the reference action parameter of the reference action can be added after the target action parameter when generating the action sequence.
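Appending the reference action parameters after each target action, as described above, can be sketched as follows; the list-of-parameter-vectors representation of the action sequence is an illustrative assumption:

```python
def build_action_sequence(reference_params, target_actions):
    """Assemble the action sequence sent to the driving model: each
    modified target action is followed by the reference action's
    parameters, so the digital human returns to its reference state
    after every inserted action."""
    sequence = []
    for params in target_actions:
        sequence.append(params)
        sequence.append(reference_params)  # return to the reference action
    return sequence
```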
  • In the embodiments of the present disclosure, the target action carried in the text is obtained and the tag of the target action is inserted into the text, so that when the digital human is driven by the text, the inserted action tag drives the digital human to perform the corresponding action, realizing text-driven actions for the digital human. Furthermore, the reference action before the execution of the target action is obtained, and the action parameters of the target action are modified according to the action parameters of the reference action, so that the difference between the target action and the reference action is small and the conversion from the reference action to the target action is natural and coordinated. This solves the technical problem of abrupt digital human action conversion in the prior art and increases the delicacy of digital human action conversion.
  • a device for driving a digital human is also provided. Please refer to FIG. 3.
  • the device includes:
  • the obtaining unit 31 is configured to obtain a target action corresponding to the target text; obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
  • the adjustment unit 32 is configured to modify the target action parameter of the target action according to the reference action parameter of the reference action;
  • the driving unit 33 is configured to drive the digital person to perform the target action according to the modified target action parameter in the process of driving the digital person to output speech based on the target text.
  • the target action is a facial expression or a physical action.
  • the action parameters are bone position parameters or muscle motion parameters.
  • the device further includes: an identification unit 34 and an insertion unit 35.
  • the recognition unit 34 is configured to obtain the target action corresponding to the text to be processed before obtaining the target action corresponding to the target text;
  • The inserting unit 35 is configured to convert the text to be processed into the target text through a speech synthesis markup language, and insert the label of the target action into the target text.
  • the recognition unit 34 may use any of the following methods to recognize and acquire the target action:
  • Manner 1: Obtain a preset keyword in the to-be-processed text; obtain a predetermined action corresponding to the preset keyword as the target action.
  • Manner 2: Perform semantic recognition on the to-be-processed text to obtain the action intention contained in the to-be-processed text; acquire a predetermined action corresponding to the action intention as the target action.
  • When the adjustment unit 32 modifies the action parameters, it can obtain at least one form of the target action and the action parameters of each form from a preset action library, the action parameters including a start action parameter and a termination action parameter; obtain, according to the action parameters of each form, the action parameters whose start action parameter has the smallest difference from the termination action parameter of the reference action parameters as the target action parameters; and modify the target action parameters according to the reference action parameters so that the difference between the modified target action parameters and the basic action parameters corresponding to the reference action parameters is reduced.
  • FIG. 4 shows a block diagram of an electronic device 800 for implementing a method for driving a digital person according to one or more embodiments of the present disclosure.
  • the electronic device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
  • The electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • the processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing element 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the foregoing method.
  • the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
  • the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
  • the memory 804 is configured to store various types of data to support operations in the device 800. Examples of these data include instructions for any application or method to operate on the electronic device 800, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable and Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.
  • the power supply component 806 provides power for various components of the electronic device 800.
  • the power supply component 806 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 800.
  • The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • The audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (MIC), and when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
  • The audio component 810 further includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
  • the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • the sensor component 814 includes one or more sensors for providing the electronic device 800 with various aspects of state evaluation.
  • The sensor component 814 can detect the on/off state of the device 800 and the relative positioning of components, for example, the display and the keypad of the electronic device 800. The sensor component 814 can also detect a position change of the electronic device 800 or a component of the electronic device 800, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800.
  • the sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
  • the electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, to perform the above methods.
  • non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, which can be executed by the processor 820 of the electronic device 800 to complete the foregoing method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • provided is a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by the processor of a mobile terminal, the mobile terminal is enabled to execute a method for driving a digital person.
  • the method includes: acquiring a target action corresponding to a target text; obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.

Abstract

A method and apparatus for driving a digital person, and an electronic device. The method comprises: acquiring a target action corresponding to a target text (S20); obtaining, when driving a digital person to output speech on the basis of the target text, a reference action to be performed by the digital person before performing the target action (S22); modifying a target action parameter of the target action according to a reference action parameter of the reference action (S24); and, in the process of driving the digital person to output speech on the basis of the target text, driving the digital person to perform the target action according to the modified target action parameter (S26). In the technical solution, a corresponding action is obtained on the basis of text, and the action parameter of the target action corresponding to the text is modified according to the reference action of the digital person, so that the digital person's transition from the reference action to the target action is natural and delicate; this solves the prior-art technical problem of abrupt changes in the digital person's actions and improves the smoothness of those changes.

Description

Method, Apparatus, and Electronic Device for Driving a Digital Person
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202010420678.0, filed on May 18, 2020 and entitled "Method, Apparatus, and Electronic Device for Driving a Digital Person", the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of software technology, and in particular to a method, an apparatus, and an electronic device for driving a digital person.
Background
In the present disclosure, a digital person (Digital Human) refers to a comprehensive rendering technology that uses computers to simulate real humans; it is also called a virtual human, a hyper-realistic human, or a photorealistic human. Because people are extremely familiar with real humans, which gives rise to the well-known uncanny valley phenomenon, the difficulty of making a digital person realistic grows exponentially rather than linearly: a static 3D model may look very real, yet become unreal the moment it speaks or blinks. How to make the movements of a digital person more delicate and realistic has become a technical problem to be urgently solved in the current development of digital persons.
Summary
An object of the present disclosure is, at least in part, to provide a method, an apparatus, and an electronic device for driving a digital person, so as to solve the technical problem of abrupt motion changes of digital persons in the prior art and to improve the smoothness of such motion changes.
In a first aspect of the present disclosure, a method for driving a digital person is provided. The method includes: acquiring a target action corresponding to a target text; obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
In some embodiments, before the target action corresponding to the target text is acquired, the method further includes: acquiring a target action corresponding to a text to be processed; converting the text to be processed into the target text through a speech synthesis markup language; and inserting a tag of the target action into the target text.
In some embodiments, acquiring the target action corresponding to the text to be processed includes: acquiring a preset keyword in the text to be processed; and acquiring a predetermined action corresponding to the preset keyword as the target action.
In some embodiments, acquiring the target action corresponding to the text to be processed includes: performing semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and acquiring a predetermined action corresponding to the action intention as the target action.
In some embodiments, adjusting the target action parameter of the target action according to the reference action parameter of the reference action includes: acquiring, from a preset action library, at least one target action and an action parameter of each target action, where the action parameter includes a start action parameter and an end action parameter; acquiring, according to the action parameter of each target action, the action parameter whose start action parameter has the smallest difference from the end action parameter in the reference action parameter, as the target action parameter; and modifying the target action parameter according to the reference action parameter, so that the difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
In some embodiments, the action parameter is a bone position parameter or a muscle movement parameter.
In some embodiments, the target action is a facial expression or a body movement.
In a second aspect of the present disclosure, an apparatus for driving a digital person is provided. The apparatus includes: an acquiring unit, configured to acquire a target action corresponding to a target text and to obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; an adjusting unit, configured to modify a target action parameter of the target action according to a reference action parameter of the reference action; and a driving unit, configured to drive, in the process of driving the digital person to output speech based on the target text, the digital person to perform the target action according to the modified target action parameter.
In some embodiments, the apparatus further includes: a recognition unit, configured to acquire a target action corresponding to a text to be processed before the target action corresponding to the target text is acquired; and an insertion unit, configured to convert the text to be processed into the target text through a speech synthesis markup language and insert a tag of the target action into the target text.
In some embodiments, the recognition unit is configured to: acquire a preset keyword in the text to be processed; and acquire a predetermined action corresponding to the preset keyword as the target action.
In some embodiments, the recognition unit is further configured to: perform semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and acquire a predetermined action corresponding to the action intention as the target action.
In some embodiments, the adjusting unit is configured to: acquire, from a preset action library, at least one target action and an action parameter of each target action, where the action parameter includes a start action parameter and an end action parameter; acquire, according to the action parameter of each target action, the action parameter whose start action parameter has the smallest difference from the end action parameter in the reference action parameter, as the target action parameter; and modify the target action parameter according to the reference action parameter, so that the difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
In some embodiments, the action parameter is a bone position parameter or a muscle movement parameter.
In some embodiments, the target action is a facial expression or a body movement.
An embodiment of the present disclosure provides a method for driving a digital person: a target action corresponding to a target text is acquired; a reference action performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text is obtained; a target action parameter of the target action is modified according to a reference action parameter of the reference action, so that the target action is as close as possible to the reference action; and, in the process of driving the digital person based on the target text, the digital person is driven to perform, after the reference action, the target action with the modified action parameter. In this way, the digital person switches to the target action seamlessly, taking its current action state as the reference, and the action transition is natural and delicate. This solves the technical problem of abrupt motion changes of digital persons in the prior art and improves the smoothness of their motion changes.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a method for generating a text for driving a digital person according to one or more embodiments of the present disclosure;
Fig. 2 is a schematic flowchart of a method for driving a digital person according to one or more embodiments of the present disclosure;
Fig. 3 is a block diagram of an apparatus for driving a digital person according to one or more embodiments of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to one or more embodiments of the present disclosure.
Detailed Description
The present disclosure provides a method for driving a digital person, in which an inserted action is adjusted based on a reference action of the digital person, so that the transition between the reference action and the inserted action is natural and delicate, thereby solving the technical problem of abrupt motion changes of digital persons in the prior art.
The main implementation principles and specific implementations of the technical solutions of the embodiments of the present application, as well as the beneficial effects they can achieve, are described in detail below with reference to the accompanying drawings.
Embodiments
Referring to Fig. 1, an embodiment of the present disclosure provides a method for generating a text for driving a digital person. The method includes:
S10: acquiring a target action corresponding to a text to be processed;
S12: converting the text to be processed into a target text through a speech synthesis markup language, and inserting a tag of the target action into the target text.
In an embodiment of the present disclosure, the text content of the text to be processed is to be converted into speech for output. In the process of outputting the speech, actions corresponding to the text content may also need to be output. For example, suppose the text to be processed is "Please wave your hand like me and say hello to a friend far away"; when this text is converted into speech output, the action "wave hand" needs to be performed at the moment the speech "wave your hand" is output. In S10, the target action corresponding to the text to be processed is acquired; there may be one or more target actions, and this embodiment does not limit their specific number.
Specifically, S10 may acquire the target action corresponding to the text to be processed in any one or more of the following ways:
Way 1: acquiring a preset keyword in the text to be processed. The preset keyword may be a body movement keyword or a facial expression keyword, for example: "wave", "shake head", "smile", or "sad". A predetermined action corresponding to the preset keyword is acquired as the target action; the target action may be a facial expression or a body movement. An action library is created in advance to store the correspondence between keywords and actions, as well as the action parameters of each action, such as bone position parameters and muscle movement parameters. The actions in the action library may be obtained by capturing real-person motions with data collection devices such as cameras and three-dimensional scanners, or may be extracted from existing videos.
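Way 1 can be sketched as a simple keyword lookup against the preset action library. The keywords, action names, and parameter values below are hypothetical illustrations, not data from the patent:

```python
# Minimal sketch of Way 1: a preset action library mapping keywords to
# predetermined actions. All entries here are invented for illustration.
ACTION_LIBRARY = {
    "wave": {"action": "wave_hand", "bone_params": [0.2, 0.8, 0.5]},
    "smile": {"action": "smile", "muscle_params": [0.6, 0.1]},
    "shake head": {"action": "shake_head", "bone_params": [0.0, 0.3, 0.3]},
}

def find_target_actions(text: str) -> list:
    """Scan the text to be processed for preset keywords and return
    the predetermined actions registered for them."""
    return [entry["action"]
            for keyword, entry in ACTION_LIBRARY.items()
            if keyword in text]

print(find_target_actions("Please wave your hand like me"))  # ['wave_hand']
```

A real system would also record where in the text each keyword occurs, so the action can later be aligned with the synthesized speech.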
Way 2: performing semantic recognition on the text to be processed to obtain an action intention contained in the text, and acquiring a predetermined action corresponding to the action intention as the target action. Semantic recognition captures the intention of the text more accurately and comprehensively than action keywords alone. For example, for the text "The sun is bright and the air is fresh today, which is refreshing", no action is explicitly mentioned, but according to the meaning of the text, "the sun is bright" may correspond to an intention to raise the head, and "the air is fresh" may correspond to an intention to take a breath; the corresponding predetermined actions are acquired according to these action intentions. Likewise, an action library may be established in advance to store the correspondence between action intentions and actions, as well as the action parameters of each action, so that the predetermined action corresponding to an action intention can be quickly obtained from the library.
Way 3: manually annotating the text to be processed by inserting action identifiers, where different action identifiers correspond to different target actions. When the target action is acquired, the action identifiers in the text to be processed are searched for, and the corresponding target actions are acquired according to the found identifiers.
After the target action is acquired, S12 is performed for text conversion and action insertion, so that the resulting target text can be recognized by a speech synthesis service and the corresponding services can be provided. Speech Synthesis Markup Language (SSML) is an XML-based markup language; compared with synthesizing plain text, SSML enriches the synthesized content and brings more variation to the final synthesis result. In this embodiment, the text to be converted is placed within a <speak></speak> tag, and each speech synthesis task contains one <speak></speak> tag. In the process of converting to the target text, this embodiment also inserts the tag of the target action into the target text through the SSML markup language, so that the target text not only controls what the speech synthesis reads, but also controls the output of the corresponding action while the speech is read.
It should be noted that the tag of the target action may be the action name, in which case the corresponding action parameter is acquired according to the action name when the digital person is subsequently driven; alternatively, the target action parameter itself may be inserted into the target text as the tag, so that the parameter can be obtained directly when the digital person is driven.
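S12 can be sketched as building one SSML document per synthesis task. The patent does not name the action tag, so the `<action>` element and its `name` attribute below are assumptions; real TTS providers use their own vendor-specific SSML extensions:

```python
# Sketch of S12: wrap the whole task in a single <speak> tag and insert a
# hypothetical <action> tag before the text span the action is attached to.
from xml.sax.saxutils import escape

def to_ssml(segments) -> str:
    """segments: list of (text, action_name) pairs; action_name may be None.
    Text content is XML-escaped; action tags are emitted verbatim."""
    parts = []
    for text, action in segments:
        if action:
            parts.append('<action name="%s"/>' % action)
        parts.append(escape(text))
    return "<speak>%s</speak>" % "".join(parts)

ssml = to_ssml([("Please ", None), ("wave your hand like me", "wave_hand")])
print(ssml)
# <speak>Please <action name="wave_hand"/>wave your hand like me</speak>
```

Escaping the text content keeps the document well-formed even when the input text itself contains characters such as `<` or `&`.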
Referring to Fig. 2, an embodiment of the present disclosure provides a method for driving a digital person. The method includes:
S20: acquiring a target action corresponding to a target text;
S22: obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
S24: modifying a target action parameter of the target action according to a reference action parameter of the reference action;
S26: in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
In the process of text-driven speech output, the digital person is usually in a common state, i.e., a base state. For example, for a digital person broadcasting news, the base state may be standing or sitting at a desk, facing forward and broadcasting without expression; it may also involve some habitual movements mimicking those of a news anchor. Therefore, when an action is inserted during the broadcast, the two successive actions may differ greatly, producing the technical problem of abrupt action changes. In this embodiment, the target action in the target text and the reference action the digital person is in before performing the target action are obtained in advance, and the target action is modified based on the reference action so that the target action is as close as possible to the reference action, thereby solving the technical problem of abrupt action changes caused by large action differences.
In an embodiment of the present disclosure, S20 may directly search the target text for the action tag of the target action and acquire the corresponding target action according to the tag. The target text may contain the tags of one or more target actions. When S20 is performed, one action may be acquired at a time according to a tag, or the multiple target actions corresponding to the target text may be acquired at once to form a target action sequence; steps S22 to S26 are then performed for each target action.
S22 obtains the reference action of the digital person before the target action is performed. Specifically, the position feature of the target action in the target text may be obtained first, for example, between keywords x1 and x2, together with the duration feature of the target text, which is generated according to the phoneme features corresponding to the target text. According to the duration feature of the target text and the position feature of the target action, the first time point at which the target action is performed is obtained, i.e., at which point in the total duration of the voice broadcast the target action is performed; then, according to this first time point, the reference action of the digital person at the adjacent time point before the first time point is acquired. For example, assuming that the execution time point of the target action is 00:50:45, the reference action performed by the digital person at 00:50:44 is acquired. The reference action may be the basic action corresponding to the base state the digital person is usually in, a habitual action adopted during speech output, or another target action in the target text.
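The timing step of S22 can be sketched as follows. The per-phoneme durations, the one-second frame granularity, and the frame store are assumptions for illustration; the patent only specifies that the duration feature is derived from the phoneme features:

```python
# Sketch of S22's timing lookup: locate the target action's execution time
# point from phoneme durations, then fetch the action the digital person
# performs at the adjacent time point just before it.
def execution_time_point(durations, anchor_index) -> float:
    """Sum the phoneme durations preceding the action's anchor position
    (e.g. the keyword the action is attached to)."""
    return sum(durations[:anchor_index])

def reference_action(frames, time_point):
    """frames: {time_point_in_seconds: action}, sampled once per second."""
    return frames[int(time_point) - 1]

durations = [0.2, 0.3, 0.25, 0.25, 0.5]  # hypothetical phoneme durations (s)
t = execution_time_point(durations, 4)   # action anchored before phoneme 4
frames = {0: "base_state", 1: "hands_crossed"}
print(t, reference_action(frames, t))    # 1.0 base_state
```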
After the reference action is obtained, S24 is performed to modify the action parameter: the target action parameter is modified according to the reference action parameter, so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced. An action usually consists of a basic action and a characteristic action, corresponding to basic action parameters and characteristic action parameters respectively. The basic action may change with the scene, while the characteristic action generally does not. For example, the characteristic action of "goodbye" is generally the forearm driving the palm to wave, while the basic action includes the movements of the upper arm, head, feet, and so on. When the target action parameter is modified, the basic action parameters in the target action parameter may be modified according to the basic action parameters in the reference action parameter. The difference between two action parameters refers to the total difference obtained by subtracting the corresponding parameters and accumulating the results. Suppose the basic action parameter V = [x11..x1n, y11..y1m, z11..z1k] and the basic action parameter W = [x21..x2n, y21..y2m, z21..z2k]; then the difference between the two basic action parameters = ∑(x1i - x2i) + ∑(y1j - y2j) + ∑(z1l - z2l).
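The difference measure above can be sketched directly. Note one hedged choice: the text literally sums signed differences, but a signed sum lets opposite deviations cancel, so the sketch accumulates absolute values; this is an implementation assumption, not something the patent specifies:

```python
# Total difference between two basic action parameter vectors: subtract
# corresponding entries and accumulate. Absolute values are used so that
# opposite deviations cannot cancel out (an assumed refinement).
def parameter_difference(v, w) -> float:
    """Total difference between basic action parameters V and W."""
    return sum(abs(a - b) for a, b in zip(v, w))

v = [0.1, 0.5, 0.9]  # e.g. bone position parameters of the reference action
w = [0.2, 0.5, 0.6]
print(parameter_difference(v, w))  # ~0.4, up to floating-point error
```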
In an embodiment of the present disclosure, the action parameter may be a bone position parameter or a muscle movement parameter of the digital person, where the muscle movement parameter includes a muscle contraction parameter and a muscle relaxation parameter. Which parameter is acquired depends on the driving model of the digital person: if the driving model is a muscle binding model, muscle movement parameters are used; if the driving model is a skeletal animation, bone position parameters are used. The modification of the target action parameter of the target action is described in detail below, taking the bone position parameter as an example.
Step 1: acquiring the action parameter of the target action. One type of action in the action library may correspond to multiple different forms. For example, the action "goodbye" may include waving "goodbye" in front of the chest, waving "goodbye" at the side of the body, and waving "goodbye" above the head. Each form corresponds to a set of action parameters (collectively referred to as the action parameter), and each set is divided by time into a start action parameter, intermediate action parameters, and an end action parameter; each set corresponds to a complete action. To make the movement changes of the digital person natural and delicate, this embodiment acquires, from the preset action library, at least one target action, i.e., at least one form of the target action, together with the action parameter of each target action; then, according to the start action parameter of each target action, the action parameter whose start action parameter has the smallest difference from that of the reference action is acquired as the target action parameter, i.e., the form that differs least from the reference action is selected from the multiple forms. For example, if the reference action is "arms crossed in front of the chest", then when the target action "goodbye" is selected, waving "goodbye" in front of the chest is more appropriate: the difference between the arm bone position parameters of these two actions is the smallest, and the action transition is natural and realistic.
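Step 1 can be sketched as a minimum search over the available forms. The form names and bone position values are hypothetical:

```python
# Select, among several forms of the same action class, the one whose start
# action parameter is closest to the reference action's parameter vector.
def select_variant(reference_params, variants):
    """variants: list of dicts with 'name' and 'params', where params[0]
    is the start action parameter vector of that form."""
    def start_difference(variant):
        return sum(abs(a - b)
                   for a, b in zip(variant["params"][0], reference_params))
    return min(variants, key=start_difference)

goodbye_forms = [
    {"name": "wave_at_chest", "params": [[0.4, 0.5], [0.6, 0.7]]},
    {"name": "wave_overhead", "params": [[0.9, 1.0], [1.0, 1.0]]},
]
reference = [0.45, 0.5]  # arms crossed in front of the chest (hypothetical)
print(select_variant(reference, goodbye_forms)["name"])  # wave_at_chest
```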
Step 2: modifying the target action parameter. After the target action parameter of the target action is determined, it is further modified according to the reference action parameter, so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced; in this way, the difference between the modified target action and the reference action is as small as possible, and the basic actions coincide as much as possible. In some embodiments, when the target action parameter is modified, the basic action parameters in the target action parameter may be replaced with the basic action parameters in the reference action parameter; the difference between the modified target action parameter and the reference action parameter is then minimal, and the reference action coincides with the basic action of the modified target action. For example, for the reference action "arms crossed in front of the chest" and the target action of waving "goodbye" in front of the chest, the action parameter corresponding to the upper-arm movement in the target action may be replaced with that of the upper-arm movement in the reference action, or the difference between the two may be reduced.
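Step 2 can be sketched as overwriting the basic part of the parameter vector while leaving the characteristic part untouched. The split of the vector into basic and characteristic indices is an assumed convention:

```python
# Overwrite the basic action parameters of the selected target action with
# those of the reference action; characteristic (forearm/palm) parameters
# stay unchanged so the action remains recognizable.
def modify_target_params(target_params, reference_params, basic_indices):
    modified = list(target_params)
    for i in basic_indices:
        modified[i] = reference_params[i]  # basic parts now coincide
    return modified

target = [0.7, 0.2, 0.9, 0.4]    # e.g. [upper_arm, head, forearm, palm]
reference = [0.3, 0.1, 0.0, 0.0]
print(modify_target_params(target, reference, basic_indices=[0, 1]))
# [0.3, 0.1, 0.9, 0.4]
```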
After S24, S26 is executed: drive the digital person according to the modified target action parameters. Specifically, when the digital person is driven on the basis of the target text, a duration feature may be obtained from the target text; a target speech sequence corresponding to the target text is obtained from the duration feature; a target action sequence for the target text is obtained from the duration feature and the modified parameters of all the target actions contained in the target text; and the target speech sequence and the target action sequence are input into the digital person's driving model, which drives the digital person to output the corresponding speech and actions. In this embodiment, after the target action has been performed, the digital person may further be driven to perform the reference action, that is, to return from the target action to the reference action. In a specific implementation, this is achieved simply by appending the reference action parameters of the reference action after the target action parameters when the action sequence is generated.
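The sequence-assembly detail above — appending the reference action's parameters after the target action's so that the digital person returns to the reference pose — might look like the following sketch; the list-of-dicts representation of the action sequence is an assumption for illustration.

```python
def build_action_sequence(reference_params, target_action_params_list):
    """Assemble the action sequence for the driving model: each modified
    target action in order, followed by the reference action's parameters
    so the digital person returns to the reference pose afterwards."""
    sequence = list(target_action_params_list)
    sequence.append(reference_params)  # return from the target action to the reference
    return sequence

ref = {"pose": "arms_crossed"}
targets = [{"pose": "wave_at_chest_start"}, {"pose": "wave_at_chest_end"}]
seq = build_action_sequence(ref, targets)
print([s["pose"] for s in seq])
# ['wave_at_chest_start', 'wave_at_chest_end', 'arms_crossed']
```

In a full system this sequence would be time-aligned against the duration feature of the target text before being fed to the driving model together with the speech sequence.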
In the above technical solution, the target action carried in the text is obtained by recognizing the semantics and/or keywords of the text, and a tag for that target action is inserted into the text, so that when the digital person is driven by the text, the inserted action tag drives the digital person to perform the corresponding action; the text thus drives the digital person's actions. Further, for the target action corresponding to the text, the reference action performed before the target action is obtained, and the action parameters of the target action are modified according to the action parameters of the reference action, reducing the difference between the target action and the reference action. When the digital person transitions from the reference action to the target action, the transition is therefore natural and coordinated, which solves the prior-art technical problem of abrupt action transitions and makes the digital person's action transitions smoother.
In one aspect of the present disclosure, an apparatus for driving a digital person is also provided. Referring to FIG. 3, the apparatus includes:
an obtaining unit 31, configured to obtain a target action corresponding to a target text, and to obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
an adjustment unit 32, configured to modify a target action parameter of the target action according to a reference action parameter of the reference action; and
a driving unit 33, configured to drive the digital person to perform the target action according to the modified target action parameter in the process of driving the digital person to output speech based on the target text.
In some embodiments, the target action is a facial expression or a body action, and the action parameter is a bone position parameter or a muscle movement parameter.
In some embodiments, the apparatus further includes a recognition unit 34 and an insertion unit 35. The recognition unit 34 is configured to obtain, before the target action corresponding to the target text is obtained, the target action corresponding to the text to be processed; the insertion unit 35 is configured to convert the text to be processed into the target text through a speech synthesis markup language, and to insert the tag of the target action into the target text.
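The insertion unit's tag insertion might be sketched as follows. The `<action>` element is purely illustrative: standard SSML defines no such tag, and a real system would use its own extension markup or something like SSML's `<mark>` element.

```python
def to_tagged_ssml(text, actions):
    """Wrap the text to be processed in (simplified) SSML-style markup and
    insert an action tag before each keyword that triggered a target action.

    actions: list of (keyword, action_name) pairs found in the text.
    """
    tagged = text
    for keyword, action in actions:
        # Insert the action tag just before the first occurrence of the keyword
        idx = tagged.lower().find(keyword)
        if idx != -1:
            tagged = tagged[:idx] + f'<action name="{action}"/>' + tagged[idx:]
    return f"<speak>{tagged}</speak>"

print(to_tagged_ssml("goodbye everyone", [("goodbye", "wave")]))
# <speak><action name="wave"/>goodbye everyone</speak>
```

When the driving model later walks the target text, each embedded tag marks where the corresponding action should begin relative to the synthesized speech.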
In an embodiment of the present disclosure, the recognition unit 34 may recognize and obtain the target action in either of the following ways:
Way 1: obtain a preset keyword from the text to be processed, and obtain the predetermined action corresponding to the preset keyword as the target action.
Way 2: perform semantic recognition on the text to be processed to obtain the action intention contained in it, and obtain the predetermined action corresponding to that action intention as the target action.
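Way 1 can be illustrated with a toy keyword table; the table entries and function names are hypothetical, and a real system would query the preset action library rather than a hard-coded dict.

```python
# Hypothetical keyword-to-action table; the real action library and its
# entries are not specified in the disclosure.
KEYWORD_ACTIONS = {
    "goodbye": "wave",
    "hello": "nod",
    "welcome": "open_arms",
}

def find_target_actions(text):
    """Way 1: scan the text to be processed for preset keywords and map
    each hit to its predetermined action."""
    lowered = text.lower()
    return [(kw, act) for kw, act in KEYWORD_ACTIONS.items() if kw in lowered]

print(find_target_actions("Welcome, and goodbye!"))
# [('goodbye', 'wave'), ('welcome', 'open_arms')]
```

Way 2 would replace the substring scan with a semantic model that classifies the sentence's action intention before the same library lookup.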
In some embodiments, when modifying the action parameters, the adjustment unit 32 may obtain from the preset action library at least one target action together with the action parameters of each target action, the action parameters including an initial action parameter and a terminal action parameter; obtain, according to the action parameters of each target action, as the target action parameter the action parameters corresponding to the initial action parameter that differs least from the terminal action parameter among the reference action parameters; and modify the target action parameter according to the reference action parameter, so that the difference between the modified target action parameter and the basic action parameter corresponding to the reference action parameter is reduced.
As for the apparatus in the embodiments of the present disclosure, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
FIG. 4 shows a block diagram of an electronic device 800 for implementing the method for driving a digital person according to one or more embodiments of the present disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 4, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or some of the steps of the method described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and the other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 806 provides power for the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing state evaluations of various aspects of the electronic device 800. For example, the sensor component 814 can detect the on/off state of the device 800 and the relative positioning of components, for instance of the display and keypad of the electronic device 800; the sensor component 814 can also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and changes in its temperature. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the method described above.
In yet another aspect of the present disclosure, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, which can be executed by the processor 820 of the electronic device 800 to complete the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Also provided is a non-transitory computer-readable storage medium whose instructions, when executed by the processor of a mobile terminal, enable the mobile terminal to perform a method for driving a digital person, the method including: obtaining a target action corresponding to a target text; obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text; modifying a target action parameter of the target action according to a reference action parameter of the reference action; and, in the process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptive changes of the present invention that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in this disclosure. The specification and the embodiments are to be regarded as exemplary only, the true scope and spirit of the present invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise constructions described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (16)

  1. A method for driving a digital person, comprising:
    obtaining a target action corresponding to a target text;
    obtaining a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
    modifying a target action parameter of the target action according to a reference action parameter of the reference action; and
    in a process of driving the digital person to output speech based on the target text, driving the digital person to perform the target action according to the modified target action parameter.
  2. The method according to claim 1, wherein before the obtaining of the target action corresponding to the target text, the method further comprises:
    obtaining a target action corresponding to a text to be processed; and
    converting the text to be processed into the target text through a speech synthesis markup language, and inserting a tag of the target action into the target text.
  3. The method according to claim 2, wherein the obtaining of the target action corresponding to the text to be processed comprises:
    obtaining a preset keyword from the text to be processed; and
    obtaining a predetermined action corresponding to the preset keyword as the target action.
  4. The method according to claim 2, wherein the obtaining of the target action corresponding to the text to be processed comprises:
    performing semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and
    obtaining a predetermined action corresponding to the action intention as the target action.
  5. The method according to claim 1, wherein the adjusting of the target action parameter of the target action according to the reference action parameter of the reference action comprises:
    obtaining, from a preset action library, at least one target action and an action parameter of each target action, the action parameter comprising an initial action parameter and a terminal action parameter;
    obtaining, according to the action parameter of each target action, an action parameter corresponding to an initial action parameter having a smallest difference from the terminal action parameter among the reference action parameters, as the target action parameter; and
    modifying the target action parameter according to the reference action parameter, so that a difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
  6. The method according to any one of claims 1 to 5, wherein the action parameter is a bone position parameter or a muscle movement parameter.
  7. The method according to any one of claims 1 to 5, wherein the target action is a facial expression or a body action.
  8. An apparatus for driving a digital person, comprising:
    an obtaining unit, configured to obtain a target action corresponding to a target text, and to obtain a reference action to be performed by the digital person before performing the target action when the digital person is driven to output speech based on the target text;
    an adjustment unit, configured to modify a target action parameter of the target action according to a reference action parameter of the reference action; and
    a driving unit, configured to drive, in a process of driving the digital person to output speech based on the target text, the digital person to perform the target action according to the modified target action parameter.
  9. The apparatus according to claim 8, further comprising:
    a recognition unit, configured to obtain, before the target action corresponding to the target text is obtained, a target action corresponding to a text to be processed; and
    an insertion unit, configured to convert the text to be processed into the target text through a speech synthesis markup language, and to insert a tag of the target action into the target text.
  10. The apparatus according to claim 9, wherein the recognition unit is configured to:
    obtain a preset keyword from the text to be processed; and
    obtain a predetermined action corresponding to the preset keyword as the target action.
  11. The apparatus according to claim 9, wherein the recognition unit is further configured to:
    perform semantic recognition on the text to be processed to obtain an action intention contained in the text to be processed; and
    obtain a predetermined action corresponding to the action intention as the target action.
  12. The apparatus according to claim 8, wherein the adjustment unit is configured to:
    obtain, from a preset action library, at least one target action and an action parameter of each target action, the action parameter comprising an initial action parameter and a terminal action parameter;
    obtain, according to the action parameter of each target action, an action parameter corresponding to an initial action parameter having a smallest difference from the terminal action parameter among the reference action parameters, as the target action parameter; and
    modify the target action parameter according to the reference action parameter, so that a difference between the modified target action parameter and a basic action parameter corresponding to the reference action parameter is reduced.
  13. The apparatus according to any one of claims 8 to 12, wherein the action parameter is a bone position parameter or a muscle movement parameter.
  14. The apparatus according to any one of claims 8 to 12, wherein the target action is a facial expression or a body action.
  15. An electronic device, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising operation instructions for performing the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps corresponding to the method according to any one of claims 1 to 7.
PCT/CN2021/078242 2020-05-18 2021-02-26 Method and apparatus for driving digital person, and electronic device WO2021232875A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/989,323 US20230082830A1 (en) 2020-05-18 2022-11-17 Method and apparatus for driving digital human, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010420678.0 2020-05-18
CN202010420678.0A CN113689530B (en) 2020-05-18 2020-05-18 Method and device for driving digital person and electronic equipment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078248 Continuation WO2021232878A1 (en) 2020-05-18 2021-02-26 Virtual anchor face swapping method and apparatus, electronic device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/989,323 Continuation US20230082830A1 (en) 2020-05-18 2022-11-17 Method and apparatus for driving digital human, and electronic device

Publications (1)

Publication Number Publication Date
WO2021232875A1 true WO2021232875A1 (en) 2021-11-25

Family

ID=78575522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078242 WO2021232875A1 (en) 2020-05-18 2021-02-26 Method and apparatus for driving digital person, and electronic device

Country Status (2)

Country Link
CN (1) CN113689530B (en)
WO (1) WO2021232875A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116708920A (en) * 2022-06-30 2023-09-05 北京生数科技有限公司 Video processing method, device and storage medium applied to virtual image synthesis
CN117808942A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Semantic strong-correlation 3D digital human action generation method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661308B (en) * 2022-11-03 2024-03-19 北京百度网讯科技有限公司 Method, apparatus, electronic device, and storage medium for driving digital person

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102497513A (en) * 2011-11-25 2012-06-13 中山大学 Video virtual hand language system facing digital television
US20150187112A1 (en) * 2013-12-27 2015-07-02 Toonimo, Inc. System and Method for Automatic Generation of Animation
CN108665492A (en) * 2018-03-27 2018-10-16 北京光年无限科技有限公司 A kind of Dancing Teaching data processing method and system based on visual human
CN110688008A (en) * 2019-09-27 2020-01-14 贵州小爱机器人科技有限公司 Virtual image interaction method and device
CN110880198A (en) * 2018-09-06 2020-03-13 百度在线网络技术(北京)有限公司 Animation generation method and device
CN111145322A (en) * 2019-12-26 2020-05-12 上海浦东发展银行股份有限公司 Method, apparatus and computer-readable storage medium for driving avatar

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1256931A1 (en) * 2001-05-11 2002-11-13 Sony France S.A. Method and apparatus for voice synthesis and robot apparatus


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116708920A (en) * 2022-06-30 2023-09-05 北京生数科技有限公司 Video processing method, device and storage medium applied to virtual image synthesis
CN116708920B (en) * 2022-06-30 2024-04-19 北京生数科技有限公司 Video processing method, device and storage medium applied to virtual image synthesis
CN117808942A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Semantic strong-correlation 3D digital human action generation method and system

Also Published As

Publication number Publication date
CN113689530B (en) 2023-10-20
CN113689530A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
US11503377B2 (en) Method and electronic device for processing data
CN109637518B (en) Virtual anchor implementation method and device
WO2021232875A1 (en) Method and apparatus for driving digital person, and electronic device
CN108363706B (en) Method and device for man-machine dialogue interaction
CN111726536A (en) Video generation method and device, storage medium and computer equipment
TWI255141B (en) Method and system for real-time interactive video
US20080165195A1 (en) Method, apparatus, and software for animated self-portraits
CN112199016B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111954063B (en) Content display control method and device for video live broadcast room
CN110322760B (en) Voice data generation method, device, terminal and storage medium
JP7209851B2 (en) Image deformation control method, device and hardware device
US20210029304A1 (en) Methods for generating video, electronic device and storage medium
WO2019153925A1 (en) Searching method and related device
CN109819167B (en) Image processing method and device and mobile terminal
US10893203B2 (en) Photographing method and apparatus, and terminal device
EP3340077B1 (en) Method and apparatus for inputting expression information
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN110794964A (en) Interaction method and device for virtual robot, electronic equipment and storage medium
US11076091B1 (en) Image capturing assistant
US20240022772A1 (en) Video processing method and apparatus, medium, and program product
CN113709548B (en) Image-based multimedia data synthesis method, device, equipment and storage medium
CN111292743B (en) Voice interaction method and device and electronic equipment
CN114339393A (en) Display processing method, server, device, system and medium for live broadcast picture
CN115225756A (en) Method for determining target object, shooting method and device
CN114495988B (en) Emotion processing method of input information and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21809522

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21809522

Country of ref document: EP

Kind code of ref document: A1