WO2021196643A1 - Method, apparatus, device and storage medium for driving an interactive object - Google Patents
Method, apparatus, device and storage medium for driving an interactive object
- Publication number
- WO2021196643A1 (PCT/CN2020/129770)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phoneme
- posture
- feature information
- parameter value
- interactive object
- Prior art date
Links
- 230000002452 interceptive effect Effects 0.000 title claims abstract description 183
- 238000000034 method Methods 0.000 title claims abstract description 61
- 238000005070 sampling Methods 0.000 claims description 78
- 108091026890 Coding region Proteins 0.000 claims description 43
- 238000013528 artificial neural network Methods 0.000 claims description 42
- 230000001815 facial effect Effects 0.000 claims description 38
- 210000001097 facial muscle Anatomy 0.000 claims description 23
- 230000033001 locomotion Effects 0.000 claims description 20
- 230000006403 short-term memory Effects 0.000 claims description 17
- 230000003993 interaction Effects 0.000 claims description 16
- 230000007787 long-term memory Effects 0.000 claims description 15
- 230000015654 memory Effects 0.000 claims description 13
- 238000004590 computer program Methods 0.000 claims description 11
- 230000009471 action Effects 0.000 claims description 9
- 230000008859 change Effects 0.000 claims description 9
- 238000012549 training Methods 0.000 claims description 8
- 238000002372 labelling Methods 0.000 claims description 3
- 230000000704 physical effect Effects 0.000 claims description 3
- 238000012545 processing Methods 0.000 description 12
- 230000014509 gene expression Effects 0.000 description 11
- 210000003205 muscle Anatomy 0.000 description 9
- 230000008569 process Effects 0.000 description 9
- 230000006870 function Effects 0.000 description 8
- 238000010586 diagram Methods 0.000 description 7
- 230000004044 response Effects 0.000 description 6
- 230000001360 synchronised effect Effects 0.000 description 5
- 238000009877 rendering Methods 0.000 description 4
- 230000008602 contraction Effects 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 238000000926 separation method Methods 0.000 description 2
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000008921 facial expression Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 230000000644 propagated effect Effects 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 239000000758 substrate Substances 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0484—Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
- G06F3/04847—Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G06V40/176—Dynamic expression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
Definitions
- a method for driving an interactive object the interactive object being displayed in a display device
- the method includes: acquiring a phoneme sequence corresponding to the sound driving data of the interactive object; acquiring a posture parameter value of the interactive object that matches the phoneme sequence; and controlling the posture of the interactive object displayed by the display device according to the posture parameter value.
- the method further includes: controlling the display device to output voice and/or text according to the phoneme sequence.
- performing feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence includes: for each of the multiple phonemes contained in the phoneme sequence, generating a coding sequence corresponding to that phoneme; obtaining feature information of the coding sequence of the phoneme according to the code values of the coding sequence and the duration corresponding to the phoneme; and obtaining the feature information of the phoneme sequence according to the feature information of the coding sequences respectively corresponding to the multiple phonemes.
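The per-phoneme coding step described above can be sketched as follows. This is an illustrative interpretation, not the patent's implementation: the phoneme labels, timings, and 10 ms time step are invented for the example.

```python
# Sketch of per-phoneme coding sequences: for each distinct phoneme, a
# binary sequence over discrete time points is built, holding a first
# value (1) at time points where that phoneme is being uttered and a
# second value (0) everywhere else.

def encode_phoneme_sequence(timed_phonemes, dt=0.01):
    """timed_phonemes: list of (phoneme, start_s, end_s) tuples.
    Returns {phoneme: [0/1 per time point]} covering the whole sequence."""
    total = max(end for _, _, end in timed_phonemes)
    n_steps = int(round(total / dt))
    codes = {p: [0] * n_steps for p, _, _ in timed_phonemes}
    for phoneme, start, end in timed_phonemes:
        for i in range(int(round(start / dt)), int(round(end / dt))):
            codes[phoneme][i] = 1  # first value: phoneme present here
    return codes

# Hypothetical timed phoneme sequence (labels and timings invented).
seq = [("x", 0.0, 0.1), ("i", 0.1, 0.3), ("j", 0.3, 0.4)]
codes = encode_phoneme_sequence(seq)
```

The per-phoneme feature information would then be derived from these binary sequences together with each phoneme's duration.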
- the neural network includes a long short-term memory (LSTM) network and a fully connected network
- inputting the sampling feature information corresponding to the first sampling time into a pre-trained neural network to obtain the posture parameter value of the interactive object corresponding to the sampling feature information includes: inputting the sampling feature information corresponding to the first sampling time into the long short-term memory network, which outputs associated feature information according to the sampling feature information preceding the first sampling time; inputting the associated feature information into the fully connected network; and determining the posture parameter value corresponding to the associated feature information according to the classification result of the fully connected network, where each category in the classification result corresponds to a set of posture parameter values.
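As a rough illustration of the LSTM-plus-fully-connected pipeline, here is a minimal NumPy sketch. It is not the patent's actual network: all dimensions, weights, and the posture-parameter table are arbitrary placeholders, and only forward inference is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
F, H, C = 8, 16, 4                    # feature dim, hidden dim, number of classes (all invented)
Wx = rng.normal(0, 0.1, (4 * H, F))   # input weights for the i, f, g, o gates
Wh = rng.normal(0, 0.1, (4 * H, H))   # recurrent weights
b = np.zeros(4 * H)
Wfc = rng.normal(0, 0.1, (C, H))      # fully connected classifier
pose_table = rng.normal(0, 1, (C, 10))  # each category -> one set of posture parameter values

def lstm_fc(samples):
    h = np.zeros(H); c = np.zeros(H)
    for x in samples:                  # sampling feature info, in time order
        z = Wx @ x + Wh @ h + b
        i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)    # "associated feature information"
    logits = Wfc @ h                   # classification over categories
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    return pose_table[int(probs.argmax())]  # posture parameters of the winning category

pose = lstm_fc(rng.normal(size=(5, F)))
```

The key structural point from the description is preserved: the LSTM summarizes feature information up to the current sampling time, and the fully connected network classifies that summary into one of several categories, each mapped to a fixed set of posture parameter values.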
- the method further includes: performing feature encoding on the phoneme sequence samples to obtain feature information corresponding to a second sampling time, and labeling the feature information with corresponding posture parameter values to obtain feature information samples; and training an initial neural network according to the feature information samples, the trained neural network being obtained after the change of the network loss satisfies a convergence condition, where the network loss includes the difference between the posture parameter values predicted by the initial neural network and the labeled posture parameter values.
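The network loss used for training, which claim 13 details as a 2-norm of the prediction error plus a 1-norm of the predicted posture parameter values, can be written as a short NumPy function. The `l1_weight` combination factor is an assumption; the patent does not specify how the two terms are weighted.

```python
import numpy as np

def network_loss(pred, label, l1_weight=0.01):
    # 2-norm of the difference between predicted and labeled posture
    # parameter values, plus a 1-norm of the prediction itself
    # (a regularization term, per claim 13). l1_weight is invented.
    return np.linalg.norm(pred - label, 2) + l1_weight * np.linalg.norm(pred, 1)

loss = network_loss(np.array([1.0, -2.0]), np.array([1.0, 1.0]), l1_weight=0.1)
```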
- a driving apparatus for an interactive object, the interactive object being displayed in a display device, the apparatus including: a phoneme sequence acquisition unit for acquiring the phoneme sequence corresponding to the sound driving data of the interactive object; a parameter acquisition unit for acquiring the posture parameter value of the interactive object matching the phoneme sequence; and a driving unit for controlling, according to the posture parameter value, the posture of the interactive object displayed by the display device.
- an electronic device includes a memory and a processor, where the memory is used to store computer instructions that can be run on the processor, and the processor, when executing the computer instructions, implements the method for driving interactive objects described in any of the embodiments provided in the present disclosure.
- a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object according to any one of the embodiments provided in the present disclosure is realized.
- the phoneme sequence corresponding to the sound driving data of the interactive object displayed by the display device is obtained, the posture parameter value of the interactive object matching the phoneme sequence is obtained, and the posture of the interactive object displayed by the display device is controlled according to that posture parameter value, so that the interactive object makes a posture matching its communication with, or response to, the target object. The target object thus feels that it is communicating with the interactive object, and the interactive experience between the target object and the interactive object is improved.
- FIG. 3 is a schematic diagram of a process of feature encoding for a phoneme sequence proposed by at least one embodiment of the present disclosure
- Fig. 6 is a schematic structural diagram of an electronic device provided by at least one embodiment of the present disclosure.
- the interactive objects may be displayed through terminal devices, which may be televisions, all-in-one machines with display functions, projectors, virtual reality (VR) devices, augmented reality (AR) devices, etc.; the present disclosure does not limit the specific form of the terminal device.
- Fig. 1 shows a display device proposed by at least one embodiment of the present disclosure.
- the display device has a transparent display screen, and a stereoscopic picture can be displayed on the transparent display screen to present a virtual scene and interactive objects with a stereoscopic effect.
- the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters.
- the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen.
- the display device is configured with a memory and a processor, and the memory is used to store computer instructions that can run on the processor.
- the processor is used to implement the method for driving the interactive object provided in the present disclosure when the computer instruction is executed, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.
- FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 203.
- Step 201 Obtain a phoneme sequence corresponding to the sound-driven data of the interactive object.
- the sound-driven data may be driving data generated by a server or terminal device according to the actions, expressions, identity, preferences, etc. of the target object interacting with the interactive object, or it may be sound-driven data retrieved by the terminal device from its internal memory. The present disclosure does not limit the acquisition method of the sound-driven data.
- the posture parameter value of the interactive object matching the phoneme sequence can be obtained according to the acoustic characteristics of the phoneme sequence; it is also possible to perform feature encoding on the phoneme sequence and determine the posture parameter value corresponding to the resulting feature code, thereby determining the posture parameter value corresponding to the phoneme sequence.
- Step 203 Control the posture of the interactive object displayed by the display device according to the posture parameter value.
- the posture parameter value matches the phoneme sequence corresponding to the sound-driven data of the interactive object, and the posture of the interactive object is controlled according to the posture parameter value, so that the posture of the interactive object matches the communication or response made to the target object. For example, when the interactive object uses voice to communicate with or respond to the target object, the gestures it makes are synchronized with the output voice, giving the target object the feeling that the interactive object is speaking.
- the posture parameter value of the interactive object matching the phoneme sequence is obtained, and the posture of the interactive object displayed by the display device is controlled according to that value, so that the interactive object makes a matching posture when communicating with or responding to the target object. In this way, the target object feels that it is communicating with the interactive object, and its interactive experience is improved.
- the method is applied to a terminal, and the terminal processes sound-driven data of an interactive object, generates a posture parameter value of the interactive object, and performs rendering using a three-dimensional rendering engine according to the posture parameter value, An animation of the interactive object is obtained, and the terminal can display the animation to communicate or respond to the target object.
- the display device may be controlled to output speech and/or display text according to the phoneme sequence; and while doing so, the posture of the interactive object displayed by the display device can be controlled according to the posture parameter value.
- a time window is moved over the phoneme sequence, with a set duration as the step size of each movement, and the phonemes within the time window are output at each movement. For example, the length of the time window may be set to 1 second and the step size to 0.1 second.
- at a set position of the time window, the posture parameter value corresponding to the phoneme (or to the feature information of the phoneme) at that position is obtained, and this posture parameter value is used to control the posture of the interactive object; the set position is located a set duration from the start position of the time window. For example, when the length of the time window is 1 s, the set position may be 0.5 s from the start position of the time window.
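A minimal sketch of the sliding time window under the example settings above (1 s window, 0.1 s step, set position 0.5 s from the window start). Representing the sequence as one phoneme label per time point is an assumption made for the example.

```python
def slide_window(phonemes, fps=100, window_s=1.0, step_s=0.1, offset_s=0.5):
    """phonemes: one phoneme label per time point (len = duration * fps).
    Yields (window_contents, phoneme_at_set_position) per window move."""
    win, step, off = int(window_s * fps), int(step_s * fps), int(offset_s * fps)
    for start in range(0, len(phonemes) - win + 1, step):
        window = phonemes[start:start + win]
        yield window, window[off]   # set position: fixed offset from window start

# Invented 2-second sequence: 1 s of "a" followed by 1 s of "b".
ph = ["a"] * 100 + ["b"] * 100
outs = [p for _, p in slide_window(ph)]
```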
- a Gaussian filter may be used to perform a Gaussian convolution operation on the temporally consecutive values of phonemes j, i1, and ie4 in the coding sequences 321, 322, and 323, respectively, to obtain the feature information of the coding sequences. That is, performing the Gaussian convolution operation on the temporally consecutive values of each phoneme through the Gaussian filter smooths the transitions of the code values in each coding sequence from the second value to the first value and from the first value to the second value.
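The Gaussian smoothing of a 0/1 coding sequence might look like the following sketch. The kernel parameters (`sigma`, `radius`) are assumptions; the patent does not specify the filter's width.

```python
import numpy as np

def gaussian_smooth(code, sigma=2.0, radius=4):
    # Build a normalized Gaussian kernel and convolve it with the 0/1
    # coding sequence, so the on/off transitions become gradual.
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(np.asarray(code, float), kernel, mode="same")

# A 0.3 s sequence (invented): phoneme absent, present for 10 steps, absent.
smooth = gaussian_smooth([0] * 10 + [1] * 10 + [0] * 10)
```

After smoothing, values ramp up before the phoneme onset and decay after its offset instead of jumping between the two code values.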
- the facial posture parameters may include facial muscle control coefficients.
- the facial motion of the interactive object may be associated with the body posture, that is, the facial posture parameter value corresponding to the facial motion may be associated with the body posture.
- the body posture may include body movements, gesture movements, walking postures, etc.
- for the interactive object, the driving data of the body posture associated with the facial posture parameter values is obtained; and while outputting sound according to the phoneme sequence, the interactive object is driven to make body actions according to that driving data. That is, while the interactive object is driven to make facial actions according to its sound driving data, the driving data of the associated body posture is also obtained from the facial posture parameter values corresponding to those facial actions, so that while the sound is output, the interactive object can be driven to make the corresponding facial and body movements synchronously. This makes the speaking state of the interactive object more vivid and natural, and improves the interactive experience of the target object.
- a phoneme sequence sample is obtained; the phoneme sequence sample includes posture parameter values of the interactive object marked at second sampling times spaced at a set time interval.
- the dotted line represents the second sampling time, and the posture parameter value of the interactive object is marked at each second sampling time.
- the device further includes an output unit for controlling the display device to output speech and/or display text according to the phoneme sequence.
- the parameter acquisition unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence; and obtain the posture parameter value of the interactive object corresponding to the feature information of the phoneme sequence.
- when generating the coding sequences corresponding to the multiple phonemes contained in the phoneme sequence, the parameter acquisition unit is specifically configured to: detect whether a first phoneme corresponds to each time point, the first phoneme being any one of the multiple phonemes; and obtain the coding sequence corresponding to the first phoneme by setting the code value at each time point where the first phoneme is present to a first value, and the code value at each time point without the first phoneme to a second value.
- the posture parameters include facial posture parameters
- the facial posture parameters include facial muscle control coefficients, which are used to control the motion state of at least one facial muscle;
- the driving unit is specifically configured to: drive the interactive object, according to the facial muscle control coefficient values matching the phoneme sequence, to make facial actions that match each phoneme in the phoneme sequence.
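To illustrate how facial muscle control coefficients could drive facial actions, here is a toy sketch. The muscle names and the interpolation update are invented for illustration; they are not the patent's actual facial rig or update rule. Each coefficient controls the motion state of one muscle.

```python
# Illustrative only: muscle names and the linear-interpolation update are
# assumptions. Each facial posture parameter value is treated as a
# control coefficient in [0, 1] for one facial muscle.

MUSCLES = ["jaw_open", "lip_pucker", "lip_stretch", "brow_raise"]

def apply_coefficients(current, target, alpha=0.5):
    """Move the current muscle state a fraction alpha toward the
    coefficients matched to the current phoneme, so facial actions
    transition smoothly between phonemes."""
    return {m: current[m] + alpha * (target[m] - current[m]) for m in MUSCLES}

state = {m: 0.0 for m in MUSCLES}                       # neutral face
target = {"jaw_open": 0.8, "lip_pucker": 0.0,
          "lip_stretch": 0.4, "brow_raise": 0.1}        # coefficients for one phoneme (invented)
state = apply_coefficients(state, target)
```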
- the processing and logic flow described in this specification can be executed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating according to input data and generating output.
- the processing and logic flow can also be executed by a dedicated logic circuit, such as FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), and the device can also be implemented as a dedicated logic circuit.
- the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Processing Or Creating Images (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Description
Claims (20)
- A method for driving an interactive object, the interactive object being displayed in a display device, the method comprising: acquiring a phoneme sequence corresponding to sound driving data of the interactive object; acquiring a posture parameter value of the interactive object matching the phoneme sequence; and controlling, according to the posture parameter value, a posture of the interactive object displayed by the display device.
- The method according to claim 1, further comprising: controlling, according to the phoneme sequence, the display device to output speech and/or display text.
- The method according to claim 1 or 2, wherein acquiring the posture parameter value of the interactive object matching the phoneme sequence comprises: performing feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence; and acquiring the posture parameter value of the interactive object corresponding to the feature information of the phoneme sequence.
- The method according to claim 3, wherein performing feature encoding on the phoneme sequence to obtain the feature information of the phoneme sequence comprises: for each of multiple phonemes contained in the phoneme sequence, generating a coding sequence corresponding to the phoneme; obtaining feature information of the coding sequence corresponding to the phoneme according to code values of the coding sequence and a duration corresponding to the phoneme; and obtaining the feature information of the phoneme sequence according to the feature information of the coding sequences respectively corresponding to the multiple phonemes.
- The method according to claim 4, wherein generating, for each of the multiple phonemes contained in the phoneme sequence, the coding sequence corresponding to the phoneme comprises: detecting whether the phoneme corresponds to each time point; and obtaining the coding sequence corresponding to the phoneme by setting the code value at each time point where the phoneme is present to a first value, and setting the code value at each time point where the phoneme is absent to a second value.
- The method according to claim 4 or 5, wherein obtaining the feature information of the coding sequences respectively corresponding to the multiple phonemes according to the code values of the coding sequences and the durations respectively corresponding to the multiple phonemes comprises: for each of the multiple phonemes, performing a Gaussian convolution operation on the temporally consecutive values of the phoneme in its coding sequence using a Gaussian filter, to obtain the feature information of the coding sequence corresponding to the phoneme.
- The method according to any one of claims 1 to 6, wherein posture parameters comprise facial posture parameters, the facial posture parameters comprise facial muscle control coefficients, and the facial muscle control coefficients are used to control a motion state of at least one facial muscle; and controlling, according to the posture parameter value, the posture of the interactive object displayed by the display device comprises: driving, according to facial muscle control coefficient values matching the phoneme sequence, the interactive object to make facial actions matching each phoneme in the phoneme sequence.
- The method according to claim 7, further comprising: acquiring driving data of a body posture associated with the facial posture parameter values; and driving, according to the driving data of the body posture associated with the facial posture parameter values, the interactive object to make body actions.
- The method according to claim 3, wherein acquiring the posture parameter value of the interactive object corresponding to the feature information of the phoneme sequence comprises: sampling the feature information of the phoneme sequence at a set time interval to obtain sampling feature information corresponding to a first sampling time; and inputting the sampling feature information corresponding to the first sampling time into a pre-trained neural network, to obtain the posture parameter value of the interactive object corresponding to the sampling feature information.
- The method according to claim 9, wherein the pre-trained neural network comprises a long short-term memory network and a fully connected network, and inputting the sampling feature information corresponding to the first sampling time into the pre-trained neural network to obtain the posture parameter value of the interactive object corresponding to the sampling feature information comprises: inputting the sampling feature information corresponding to the first sampling time into the long short-term memory network, which outputs associated feature information according to sampling feature information preceding the first sampling time; inputting the associated feature information into the fully connected network; and determining, according to a classification result of the fully connected network, the posture parameter value corresponding to the associated feature information, wherein each category in the classification result corresponds to a set of the posture parameter values.
- The method according to claim 9 or 10, wherein the neural network is obtained by training with phoneme sequence samples; and the method further comprises: acquiring a video segment of a character uttering speech; acquiring, according to the video segment, multiple first image frames containing the character and multiple audio frames respectively corresponding to the multiple first image frames; converting the first image frames into second image frames containing the interactive object, and acquiring posture parameter values corresponding to the second image frames; labeling the audio frames corresponding to the first image frames according to the posture parameter values corresponding to the second image frames; and obtaining the phoneme sequence samples according to the audio frames labeled with the posture parameter values.
- The method according to claim 11, further comprising: performing feature encoding on the phoneme sequence samples to obtain feature information corresponding to a second sampling time, and labeling the feature information with corresponding posture parameter values to obtain feature information samples; and training an initial neural network according to the feature information samples, the neural network being obtained after a change of a network loss satisfies a convergence condition, wherein the network loss comprises a difference between the posture parameter values predicted by the initial neural network and the labeled posture parameter values.
- The method according to claim 12, wherein the network loss comprises a 2-norm of the difference between the posture parameter values predicted by the initial neural network and the labeled posture parameter values; and the network loss further comprises a 1-norm of the posture parameter values predicted by the initial neural network.
- An apparatus for driving an interactive object, the interactive object being displayed in a display device, the apparatus comprising: a phoneme sequence acquisition unit configured to acquire a phoneme sequence corresponding to sound driving data of the interactive object; a parameter acquisition unit configured to acquire a posture parameter value of the interactive object matching the phoneme sequence; and a driving unit configured to control, according to the posture parameter value, a posture of the interactive object displayed by the display device.
- The apparatus according to claim 14, wherein the parameter acquisition unit is configured to: for each of multiple phonemes contained in the phoneme sequence, generate a coding sequence corresponding to the phoneme; obtain feature information of the coding sequence corresponding to the phoneme according to code values of the coding sequence and a duration corresponding to the phoneme; and obtain feature information of the phoneme sequence according to the feature information of the coding sequences respectively corresponding to the multiple phonemes; wherein generating, for each of the multiple phonemes contained in the phoneme sequence, the coding sequence corresponding to the phoneme comprises: detecting whether the phoneme corresponds to each time point; and obtaining the coding sequence corresponding to the phoneme by setting the code value at each time point where the phoneme is present to a first value, and setting the code value at each time point where the phoneme is absent to a second value.
- The apparatus according to claim 14 or 15, wherein posture parameters comprise facial posture parameters, the facial posture parameters comprise facial muscle control coefficients, and the facial muscle control coefficients are used to control a motion state of at least one facial muscle; the driving unit is configured to: drive, according to facial muscle control coefficient values matching the phoneme sequence, the interactive object to make facial actions matching each phoneme in the phoneme sequence; and the apparatus further comprises an action driving unit configured to: acquire driving data of a body posture associated with the facial posture parameter values; and drive, according to the driving data of the body posture associated with the facial posture parameter values, the interactive object to make body actions.
- The apparatus according to claim 15, wherein, in acquiring the posture parameter value of the interactive object corresponding to the feature information of the phoneme sequence, the parameter acquisition unit is configured to: sample the feature information of the phoneme sequence at a set time interval to obtain sampling feature information corresponding to a first sampling time; and input the sampling feature information corresponding to the first sampling time into a pre-trained neural network to obtain the posture parameter value of the interactive object corresponding to the sampling feature information, wherein the neural network comprises a long short-term memory network and a fully connected network; and, in inputting the sampling feature information corresponding to the first sampling time into the pre-trained neural network to obtain the posture parameter value of the interactive object corresponding to the sampling feature information, the parameter acquisition unit is configured to: input the sampling feature information corresponding to the first sampling time into the long short-term memory network, which outputs associated feature information according to sampling feature information preceding the first sampling time; input the associated feature information into the fully connected network; and determine, according to a classification result of the fully connected network, the posture parameter value corresponding to the associated feature information, wherein each category in the classification result corresponds to a set of the posture parameter values.
- The apparatus according to claim 17, wherein the neural network is obtained by training with phoneme sequence samples; the apparatus further comprises a sample acquisition unit configured to: acquire a video segment of a character uttering speech; acquire, according to the video segment, multiple first image frames containing the character and multiple audio frames corresponding to the multiple first image frames; convert the first image frames into second image frames containing the interactive object, and acquire posture parameter values corresponding to the second image frames; label the audio frames corresponding to the first image frames according to the posture parameter values corresponding to the second image frames; and obtain the phoneme sequence samples according to the audio frames labeled with the posture parameter values; and the apparatus further comprises a training unit configured to: perform feature encoding on the phoneme sequence samples to obtain feature information corresponding to the second sampling time, and label the feature information with corresponding posture parameter values to obtain feature information samples; and train an initial neural network according to the feature information samples, the neural network being obtained after a change of a network loss satisfies a convergence condition, wherein the network loss comprises a difference between the posture parameter values predicted by the initial neural network and the labeled posture parameter values, the network loss comprises a 2-norm of the difference between the predicted posture parameter values and the labeled posture parameter values, and the network loss further comprises a 1-norm of the predicted posture parameter values.
- An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method according to any one of claims 1 to 13 when executing the computer instructions.
- A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 13.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020217027717A KR20210124312A (ko) | 2020-03-31 | 2020-11-18 | Method, apparatus, device and recording medium for driving an interactive object |
JP2021549867A JP2022531057A (ja) | 2020-03-31 | 2020-11-18 | Method, apparatus, device and recording medium for driving an interactive object |
SG11202109464YA SG11202109464YA (en) | 2020-03-31 | 2020-11-18 | Methods and apparatuses for driving interaction objects, devices and storage media |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010245761.9 | 2020-03-31 | ||
CN202010245761.9A CN111459450A (zh) | 2020-03-31 | 2020-03-31 | Method, apparatus, device and storage medium for driving an interactive object |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021196643A1 (zh) | 2021-10-07 |
Family
ID=71682375
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/129770 WO2021196643A1 (zh) | 2020-03-31 | 2020-11-18 | 交互对象的驱动方法、装置、设备以及存储介质 |
Country Status (6)
Country | Link |
---|---|
JP (1) | JP2022531057A (zh) |
KR (1) | KR20210124312A (zh) |
CN (1) | CN111459450A (zh) |
SG (1) | SG11202109464YA (zh) |
TW (1) | TWI766499B (zh) |
WO (1) | WO2021196643A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114972589A (zh) * | 2022-05-31 | 2022-08-30 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and apparatus for driving a virtual digital figure |
WO2023116208A1 (zh) * | 2021-12-24 | 2023-06-29 | Shanghai SenseTime Intelligent Technology Co., Ltd. | Digital object generation method, apparatus, device and storage medium |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460785B (zh) * | 2020-03-31 | 2023-02-28 | Beijing SenseTime Technology Development Co., Ltd. | Method, apparatus, device and storage medium for driving an interactive object |
CN111459450A (zh) * | 2020-03-31 | 2020-07-28 | Beijing SenseTime Technology Development Co., Ltd. | Method, apparatus, device and storage medium for driving an interactive object |
CN113314104B (zh) * | 2021-05-31 | 2023-06-20 | Beijing SenseTime Technology Development Co., Ltd. | Interactive object driving and phoneme processing method, apparatus, device and storage medium |
CN114283227B (zh) * | 2021-11-26 | 2023-04-07 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method, apparatus, electronic device and readable storage medium for driving a virtual character |
CN114741561A (zh) * | 2022-02-28 | 2022-07-12 | SenseTime International Pte. Ltd. | Action generation method, apparatus, electronic device and storage medium |
TWI799223B (zh) * | 2022-04-01 | 2023-04-11 | National Taichung University of Science and Technology | Virtual reality system for muscle-strength assessment teaching |
CN115662388A (zh) * | 2022-10-27 | 2023-01-31 | Vivo Mobile Communication Co., Ltd. | Avatar face driving method, apparatus, electronic device and medium |
CN116524896A (zh) * | 2023-04-24 | 2023-08-01 | Beijing University of Posts and Telecommunications | Articulatory inversion method and system based on physiological modeling of pronunciation |
CN116665695B (zh) * | 2023-07-28 | 2023-10-20 | Tencent Technology (Shenzhen) Co., Ltd. | Virtual object lip-shape driving method, related apparatus and medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170352351A1 (en) * | 2014-10-29 | 2017-12-07 | Kyocera Corporation | Communication robot |
CN109599113A (zh) * | 2019-01-22 | 2019-04-09 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and apparatus for processing information |
CN110009716A (zh) * | 2019-03-28 | 2019-07-12 | NetEase (Hangzhou) Network Co., Ltd. | Facial expression generation method, apparatus, electronic device and storage medium |
CN110413841A (zh) * | 2019-06-13 | 2019-11-05 | Shenzhen Zhuiyi Technology Co., Ltd. | Polymorphic interaction method, apparatus, system, electronic device and storage medium |
CN110531860A (zh) * | 2019-09-02 | 2019-12-03 | Tencent Technology (Shenzhen) Co., Ltd. | Artificial-intelligence-based animated character driving method and apparatus |
CN111145777A (zh) * | 2019-12-31 | 2020-05-12 | Suzhou AISpeech Information Technology Co., Ltd. | Avatar display method, apparatus, electronic device and storage medium |
CN111459450A (zh) * | 2020-03-31 | 2020-07-28 | Beijing SenseTime Technology Development Co., Ltd. | Method, apparatus, device and storage medium for driving an interactive object |
CN111459452A (zh) * | 2020-03-31 | 2020-07-28 | Beijing SenseTime Technology Development Co., Ltd. | Method, apparatus, device and storage medium for driving an interactive object |
CN111460785A (zh) * | 2020-03-31 | 2020-07-28 | Beijing SenseTime Technology Development Co., Ltd. | Method, apparatus, device and storage medium for driving an interactive object |
CN111541908A (zh) * | 2020-02-27 | 2020-08-14 | Beijing SenseTime Technology Development Co., Ltd. | Interaction method, apparatus, device and storage medium |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002077755A (ja) * | 2000-08-29 | 2002-03-15 | Sharp Corporation | Agent interface device |
JP2003058908A (ja) * | 2001-08-10 | 2003-02-28 | Minolta Co., Ltd. | Face image control method and apparatus, computer program, and recording medium |
JP2015038725A (ja) * | 2013-07-18 | 2015-02-26 | Japan Advanced Institute of Science and Technology | Speech animation generation apparatus, method, and program |
JP5913394B2 (ja) * | 2014-02-06 | 2016-04-27 | PS Solutions Corp. | Audio synchronization processing apparatus, audio synchronization processing program, audio synchronization processing method, and audio synchronization system |
JP2015166890A (ja) * | 2014-03-03 | 2015-09-24 | Sony Corporation | Information processing apparatus, information processing system, information processing method, and program |
CN106056989B (zh) * | 2016-06-23 | 2018-10-16 | Guangdong Genius Technology Co., Ltd. | Language learning method and apparatus, and terminal device |
CN107704169B (zh) * | 2017-09-26 | 2020-11-17 | Beijing Guangnian Wuxian Technology Co., Ltd. | State management method and system for a virtual human |
CN107861626A (zh) * | 2017-12-06 | 2018-03-30 | Beijing Guangnian Wuxian Technology Co., Ltd. | Method and system for waking up an avatar |
CN108942919B (zh) * | 2018-05-28 | 2021-03-30 | Beijing Guangnian Wuxian Technology Co., Ltd. | Virtual-human-based interaction method and system |
CN110176284A (zh) * | 2019-05-21 | 2019-08-27 | Hangzhou Normal University | Virtual-reality-based rehabilitation training method for apraxia of speech |
CN110609620B (zh) * | 2019-09-05 | 2020-11-17 | Shenzhen Zhuiyi Technology Co., Ltd. | Avatar-based human-computer interaction method, apparatus and electronic device |
CN110647636B (zh) * | 2019-09-05 | 2021-03-19 | Shenzhen Zhuiyi Technology Co., Ltd. | Interaction method, apparatus, terminal device and storage medium |
CN110866609B (zh) * | 2019-11-08 | 2024-01-30 | Tencent Technology (Shenzhen) Co., Ltd. | Explanation information acquisition method, apparatus, server and storage medium |
-
2020
- 2020-03-31 CN CN202010245761.9A patent/CN111459450A/zh active Pending
- 2020-11-18 KR KR1020217027717A patent/KR20210124312A/ko not_active Application Discontinuation
- 2020-11-18 JP JP2021549867A patent/JP2022531057A/ja not_active Ceased
- 2020-11-18 SG SG11202109464YA patent/SG11202109464YA/en unknown
- 2020-11-18 WO PCT/CN2020/129770 patent/WO2021196643A1/zh active Application Filing
- 2020-12-24 TW TW109145886A patent/TWI766499B/zh active
Also Published As
Publication number | Publication date |
---|---|
TW202138993A (zh) | 2021-10-16 |
JP2022531057A (ja) | 2022-07-06 |
KR20210124312A (ko) | 2021-10-14 |
SG11202109464YA (en) | 2021-11-29 |
TWI766499B (zh) | 2022-06-01 |
CN111459450A (zh) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021196643A1 (zh) | Method, apparatus, device and storage medium for driving an interactive object | |
WO2021169431A1 (zh) | Interaction method and apparatus, electronic device and storage medium | |
WO2021196644A1 (zh) | Method, apparatus, device and storage medium for driving an interactive object | |
WO2021196646A1 (zh) | Method, apparatus, device and storage medium for driving an interactive object | |
WO2021196645A1 (zh) | Method, apparatus, device and storage medium for driving an interactive object | |
US11514634B2 (en) | Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses | |
WO2022106654A2 (en) | Methods and systems for video translation | |
WO2023284435A1 (zh) | Method and apparatus for generating animation | |
US20230082830A1 (en) | Method and apparatus for driving digital human, and electronic device | |
WO2022252890A1 (zh) | Interactive object driving and phoneme processing method, apparatus, device and storage medium | |
CN113689879A (zh) | Method and apparatus for driving a virtual human in real time, electronic device, and medium | |
RU2721180C1 (ru) | Method for generating an animated head model from a speech signal, and electronic computing device implementing it | |
CN113689880A (zh) | Method and apparatus for driving a virtual human in real time, electronic device, and medium | |
WO2021196647A1 (zh) | Method, apparatus, device and storage medium for driving an interactive object | |
Heisler et al. | Making an android robot head talk | |
KR102514580B1 (ko) | Video transition method, apparatus and computer program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 2021549867 Country of ref document: JP Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 20217027717 Country of ref document: KR Kind code of ref document: A |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20928302 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20928302 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 521430714 Country of ref document: SA |