WO2021196644A1 - Method, apparatus and device for driving interactive object, and storage medium - Google Patents
- Publication number
- WO2021196644A1 (PCT/CN2020/129793)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phoneme
- sequence
- interactive object
- feature code
- control vector
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- The present disclosure relates to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for driving interactive objects.
- the embodiments of the present disclosure provide a driving solution for interactive objects.
- A method for driving an interactive object, comprising: obtaining a phoneme sequence corresponding to text data; obtaining a control parameter value of at least one partial region of an interactive object matching the phoneme sequence; and controlling the posture of the interactive object according to the acquired control parameter value.
- The method further includes: controlling the display device displaying the interactive object to display text according to the text data, and/or controlling the display device to output speech according to the phoneme sequence corresponding to the text data.
- The control parameter of the local area of the interactive object includes the posture control vector of the local area, and obtaining the control parameter value of at least one local area of the interactive object matching the phoneme sequence includes: performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; obtaining a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining the attitude control vector of at least one local area of the interactive object corresponding to the feature code.
- Performing feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence includes: for each of the multiple phonemes contained in the phoneme sequence, generating a sub-coding sequence corresponding to the phoneme; and obtaining the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the multiple phonemes.
- Generating a sub-coding sequence corresponding to the phoneme includes: detecting whether the phoneme is present at each time point; and obtaining the sub-coding sequence corresponding to the phoneme by setting the coding value at each time point where the phoneme is present to the first value and setting the coding value at each time point where the phoneme is not present to the second value.
- The method further includes: for the sub-coding sequence corresponding to each phoneme of the multiple phonemes, using a Gaussian filter to perform a Gaussian convolution operation on the continuous values of the phoneme in time.
- Controlling the posture of the interactive object according to the obtained control parameter value includes: obtaining a sequence of posture control vectors corresponding to the second coding sequence; and controlling the posture of the interactive object according to the sequence of posture control vectors.
- The method further includes: in the case that the time interval between the phonemes in the phoneme sequence is greater than a set threshold, controlling the posture of the interactive object according to a set control parameter value of the local area.
- Acquiring the attitude control vector of at least one local area of the interactive object corresponding to the feature code includes: inputting the feature code into a pre-trained recurrent neural network to obtain the attitude control vector of at least one local area of the interactive object corresponding to the feature code.
- The recurrent neural network is obtained through training on feature code samples; the method further includes: obtaining a video segment in which a character utters speech, and obtaining a plurality of first image frames containing the character according to the video segment; extracting the corresponding voice segment from the video segment, obtaining a sample phoneme sequence according to the voice segment, and performing feature encoding on the sample phoneme sequence; obtaining a feature code of at least one phoneme corresponding to the first image frame; converting the first image frame into a second image frame containing the interactive object, and obtaining the attitude control vector value of at least one local area corresponding to the second image frame; and annotating the feature code corresponding to the first image frame according to the attitude control vector value to obtain the feature code sample.
- The method further includes: training an initial recurrent neural network according to the feature code samples, and obtaining the trained recurrent neural network once the change in the network loss satisfies the convergence condition, wherein the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the recurrent neural network and the annotated attitude control vector value.
- A driving apparatus for an interactive object, including: a first acquiring unit for acquiring a phoneme sequence corresponding to text data; a second acquiring unit for acquiring a control parameter value of at least one partial region of the interactive object matching the phoneme sequence; and a driving unit for controlling the posture of the interactive object according to the acquired control parameter value.
- An electronic device includes a memory and a processor; the memory is used to store computer instructions that can be run on the processor, and the processor is used to implement the method for driving interactive objects described in any of the embodiments provided in the present disclosure when executing the computer instructions.
- a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the method for driving an interactive object according to any one of the embodiments provided in the present disclosure is realized.
- The driving method, apparatus, device, and computer-readable storage medium for an interactive object obtain a phoneme sequence corresponding to text data and obtain the control parameter value of at least one partial region of an interactive object matching the phoneme sequence to control the posture of the interactive object, so that the interactive object makes a posture that matches the phonemes corresponding to the text data. The posture includes facial posture and body posture, so that the target object has the sense that the interactive object is speaking the text content, which enhances the interactive experience between the target object and the interactive object.
- FIG. 1 is a schematic diagram of a display device in a method for driving interactive objects proposed by at least one embodiment of the present disclosure
- FIG. 2 is a flowchart of a method for driving interactive objects proposed by at least one embodiment of the present disclosure
- FIG. 3 is a schematic diagram of a process of feature encoding for a phoneme sequence proposed by at least one embodiment of the present disclosure
- FIG. 4 is a schematic structural diagram of a driving device for interactive objects proposed in at least one embodiment of the present disclosure
- FIG. 5 is a schematic structural diagram of an electronic device proposed in at least one embodiment of the present disclosure.
- At least one embodiment of the present disclosure provides a method for driving interactive objects.
- the driving method may be executed by electronic devices such as a terminal device or a server.
- The terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet, or a game console.
- the server includes a local server or a cloud server, etc., and the method can also be implemented by a processor calling computer-readable instructions stored in a memory.
- the interaction object may be any virtual image capable of interacting with the target object.
- the interactive object may be a virtual character, or may also be a virtual animal, virtual item, cartoon image, or other virtual images capable of implementing interactive functions.
- the display form of the interactive object may be 2D or 3D, which is not limited in the present disclosure.
- the target object may be a user, a robot, or other smart devices.
- the interaction manner between the interaction object and the target object may be an active interaction manner or a passive interaction manner.
- the target object can make a demand by making gestures or body movements, and trigger the interactive object to interact with it by means of active interaction.
- the interactive object may actively greet the target object, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.
- The interactive objects may be displayed through terminal devices, which may be televisions, all-in-one machines with display functions, projectors, virtual reality (VR) devices, augmented reality (AR) devices, etc. The present disclosure does not limit the specific form of the terminal device.
- Fig. 1 shows a display device proposed by at least one embodiment of the present disclosure.
- the display device has a transparent display screen, and a stereoscopic picture can be displayed on the transparent display screen to present a virtual scene and interactive objects with a stereoscopic effect.
- the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters.
- the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen.
- the display device is configured with a memory and a processor, and the memory is used to store computer instructions that can run on the processor.
- the processor is used to implement the method for driving the interactive object provided in the present disclosure when the computer instruction is executed, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.
- In response to sound-driven data used to drive the interactive object to output voice, the interactive object may emit a specified voice to the target object.
- the terminal device can generate sound-driven data according to the actions, expressions, identities, preferences, etc. of the target object around the terminal device to drive the interactive object to respond by issuing a specified voice, thereby providing anthropomorphic services for the target object.
- the sound-driven data can also be generated in other ways, for example, generated by the server and sent to the terminal device.
- At least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the interaction experience between the target object and the interactive object.
- FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 203.
- Step 201: Obtain a phoneme sequence corresponding to the text data.
- the text data may be driving data used to drive the interactive object.
- the drive data can be drive data generated by the server or terminal device according to the actions, expressions, identity, preferences, etc. of the target object interacting with the interactive object, or drive data called by the terminal device from the internal memory.
- the present disclosure does not limit the method of obtaining the text data.
- the phoneme corresponding to the morpheme can be obtained according to the morphemes contained in the text, so as to obtain the phoneme sequence corresponding to the text.
- the phoneme is the smallest phonetic unit divided according to the natural attributes of the speech, and a pronunciation action of a real person can form a phoneme.
- In response to the text being a Chinese text, the Chinese text can be converted into pinyin, a phoneme sequence can be generated from the pinyin, and a timestamp can be generated for each phoneme.
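The pinyin-to-phoneme step can be sketched as follows. This is only an illustration under stated assumptions: the function names, the initial/final split rule, and the fixed 100 ms duration per phoneme are hypothetical and are not taken from the disclosure, which does not specify how timestamps are produced.

```python
# Pinyin initials (y and w are treated as initials here for simplicity).
# Two-letter initials come first so "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a tone-numbered pinyin syllable into [initial, final]."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # syllable without an initial, e.g. "a1"


def phoneme_sequence(syllables, phoneme_ms=100):
    """Return (phoneme, start_ms, end_ms) tuples.

    Assumption: every phoneme lasts phoneme_ms milliseconds; a real system
    would take durations from a speech synthesizer or forced aligner.
    """
    sequence = []
    start = 0
    for syllable in syllables:
        for phoneme in split_syllable(syllable):
            sequence.append((phoneme, start, start + phoneme_ms))
            start += phoneme_ms
    return sequence


timed = phoneme_sequence(["ji1", "jie4"])
```

With the syllables `ji1` and `jie4`, the sketch yields the phonemes j, i1, j, and ie4, each with a start and end time.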
- Step 202: Obtain a control parameter value of at least one partial region of an interactive object that matches the phoneme sequence.
- the local area is obtained by dividing the whole (including face and/or body) of the interactive object.
- the control of one or more local areas of the face may correspond to a series of facial expressions or actions of the interactive object.
- the control of the eye area may correspond to the facial actions of the interactive object such as opening, closing, blinking, and changing the perspective;
- the control of the mouth area can correspond to facial actions such as closing the mouth of the interactive object and opening the mouth to different degrees.
- the control of one or more local areas of the body may correspond to a series of physical actions of the interactive object.
- the control of the leg area may correspond to the actions of the interactive object such as walking, jumping, and kicking.
- the control parameter of the local area of the interactive object includes the posture control vector of the local area.
- the attitude control vector of each local area is used to drive the local area of the interactive object to perform actions.
- Different posture control vector values correspond to different motions or motion ranges. For example, for the posture control vector of the mouth area, one set of posture control vector values can make the mouth of the interactive object slightly open, and another set of posture control vector values can make the mouth of the interactive object open wider.
- the corresponding local areas can make different actions or actions with different amplitudes.
- The local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the posture control vector values of all the local areas can be obtained; when only the expression of the interactive object needs to be controlled, the posture control vector value of the local area corresponding to the face can be obtained.
- the control parameter value corresponding to the feature code can be determined, thereby determining the control parameter value corresponding to the phoneme sequence.
- Different encoding methods can reflect different characteristics of the phoneme sequence. The present disclosure does not limit the specific encoding method.
- the corresponding relationship between the feature code of the phoneme sequence corresponding to the text data and the control parameter value of the interactive object can be established in advance, so that the corresponding control parameter value can be obtained through the text data.
- the specific method for obtaining the control parameter value matching the feature code of the phoneme sequence of the text data will be described in detail later.
- Step 203: Control the posture of the interactive object according to the acquired control parameter value.
- the control parameter value such as the posture control vector value
- When the display device that displays the interactive object is controlled to display text according to the text data, and/or to output speech according to the phoneme sequence corresponding to the text data, the gesture made by the interactive object is synchronized with the output speech and/or the displayed text, thereby giving the target object the feeling that the interactive object is speaking.
- The posture of the interactive object can thus be controlled so that the interactive object makes a gesture that matches the phonemes corresponding to the text data. The gesture includes facial gestures and body gestures, so that the target object feels that the interactive object is speaking the text content, and the interactive experience of the target object is improved.
- the method is applied to a server, including a local server or a cloud server.
- The server processes text data, generates control parameter values of the interactive object, and uses a three-dimensional rendering engine to perform rendering according to the control parameter values to obtain the animation of the interactive object.
- The server may send the animation to the terminal for display to communicate with or respond to the target object, and may also send the animation to the cloud, so that the terminal can obtain the animation from the cloud to communicate with or respond to the target object.
- the control parameter value may also be sent to the terminal, so that the terminal completes the process of rendering, generating animation, and performing display.
- The method is applied to a terminal: the terminal processes text data, generates control parameter values of the interactive object, and renders the interactive object using a three-dimensional rendering engine according to the control parameter values to obtain the animation of the interactive object. The terminal can display the animation to communicate with or respond to the target object.
- the display device displaying the interactive object may be controlled to display text according to the text data, and/or the display device may be controlled to output speech according to the phoneme sequence corresponding to the text data.
- When the speech and/or text output according to the text data is synchronized with the control of the gesture of the interactive object according to the control parameter value, the gesture made by the interactive object is synchronized with the output speech and/or displayed text, giving the target object the feeling that the interactive object is speaking.
- In the case where the control parameter of at least one local area of the interactive object includes a posture control vector, the posture control vector can be obtained in the following manner.
- the coding sequence corresponding to the phoneme sequence of the text data is called the first coding sequence, that is, the first coding sequence is obtained by performing feature coding on the phoneme sequence.
- a sub-coding sequence corresponding to each phoneme is generated.
- For a first phoneme, the coding value at each time point where the first phoneme is present is set to the first value, and the coding value at each time point without the first phoneme is set to the second value; the sub-coding sequence corresponding to the first phoneme is obtained after the coding value at every time point has been assigned.
- For example, the coding value at a time point where the first phoneme is present may be set to 1, and the coding value at a time point where the first phoneme is not present may be set to 0.
- The same applies to each phoneme contained in the phoneme sequence: the coding value at each time point where the phoneme is present is set to the first value, the coding value at each time point without the phoneme is set to the second value, and the sub-coding sequence corresponding to the phoneme is obtained after the coding value at every time point has been assigned.
- the first coding sequence corresponding to the phoneme sequence is obtained according to the respective sub-coding sequences corresponding to the multiple phonemes.
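The binary encoding described above can be sketched as below. The sampling step of 50 ms, the function name, and the timed phoneme list are assumptions made for illustration; the disclosure only fixes the idea of a first value where the phoneme is present and a second value elsewhere.

```python
def sub_coding_sequences(timed_phonemes, step_ms=50, first_value=1, second_value=0):
    """Build one sub-coding sequence per distinct phoneme.

    timed_phonemes: list of (phoneme, start_ms, end_ms).
    Time points are sampled every step_ms (an assumed sampling step).
    """
    total_ms = max(end for _, _, end in timed_phonemes)
    num_points = total_ms // step_ms
    sequences = {}
    for phoneme, start, end in timed_phonemes:
        # Repeated phonemes (such as the second "j") share one sequence.
        row = sequences.setdefault(phoneme, [second_value] * num_points)
        for t in range(num_points):
            if start <= t * step_ms < end:
                row[t] = first_value
    return sequences


# Hypothetical timing for the phonemes j, i1, j, ie4.
timed = [("j", 0, 100), ("i1", 100, 200), ("j", 200, 300), ("ie4", 300, 400)]
first_coding_sequence = sub_coding_sequences(timed)
```

Stacking the per-phoneme rows gives a matrix with one row per phoneme and one column per time point, which plays the role of the first coding sequence in this sketch.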
- A Gaussian filter may be used to perform a Gaussian convolution operation on the continuous values of the first phoneme in time, so as to filter and smooth the matrix corresponding to the feature encoding and smooth the transition of the mouth area when one phoneme changes to the next.
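A minimal sketch of such smoothing on a single sub-coding sequence is shown below. The kernel width (`sigma`), the kernel radius, and the edge handling (clamping to the nearest valid sample) are assumptions; the disclosure does not state these parameters.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(sequence, sigma=1.0, radius=2):
    """Convolve a sub-coding sequence with a Gaussian kernel over time."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(sequence)
    smoothed = []
    for t in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            # Clamp indices at the edges (an assumed boundary rule).
            idx = min(max(t + k - radius, 0), n - 1)
            acc += weight * sequence[idx]
        smoothed.append(acc)
    return smoothed


smoothed = smooth([0, 0, 0, 1, 1, 1, 0, 0, 0])
```

After smoothing, the jump from the second value to the first value is replaced by a ramp of intermediate values, which is the property the later steps rely on.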
- FIG. 3 shows a schematic diagram of a driving method of interactive objects proposed by at least one embodiment of the present disclosure.
- The phoneme sequence 310 contains phonemes j, i1, j, and ie4 (for brevity, only some phonemes are shown), and the corresponding sub-coding sequences 321, 322, and 323 are obtained for the phonemes j, i1, and ie4, respectively.
- In each sub-coding sequence, the code value at a time point where the phoneme is present is set to a first value (for example, 1), and the code value at a time point without the phoneme is set to a second value (for example, 0).
- Taking the sub-coding sequence 321 as an example, its value is the first value 1 at the time points where phoneme j is present in the phoneme sequence 310, and the second value 0 at the time points without phoneme j. All the sub-coding sequences constitute the first coding sequence 320.
- a feature code corresponding to at least one phoneme is obtained.
- The characteristic information of the sub-coding sequences 321, 322, and 323 can be obtained from the durations of the corresponding phonemes, for example the duration of i1 in the sub-coding sequence 322 and the duration of ie4 in the sub-coding sequence 323.
- A Gaussian filter may be used to perform Gaussian convolution operations on the consecutive values of phonemes j, i1, and ie4 in the sub-coding sequences 321, 322, and 323, respectively, to smooth the feature encoding and obtain the smoothed first coding sequence 330. That is, the Gaussian convolution operation is performed on the continuous values of each phoneme in time through the Gaussian filter, so that the change of the code value in each coding sequence from the second value to the first value, or from the first value to the second value, becomes smooth.
- The values of the coding sequence then also include intermediate values, such as 0.2, 0.3, etc., and the posture control vectors obtained from these intermediate values make the transitions in the interactive character's movements and expressions smoother and more natural, improving the interactive experience of the target object.
- the feature code corresponding to at least one phoneme may be obtained by performing a sliding window on the first code sequence.
- the first coding sequence may be a coding sequence after a Gaussian convolution operation.
- a sliding window is performed on the coding sequence with a time window of a set length and a set step size, and the feature code in the time window is used as the feature code of the corresponding at least one phoneme.
- the second code sequence can be obtained. Since the duration of each phoneme is different, and the duration of each phoneme is different in proportion to the length of the time window, the number of phonemes corresponding to the feature code in the time window may be 1, 2 or even more depending on the position of the time window. As shown in FIG.
- M is a positive integer, and its value is determined according to the length of the first coding sequence, the length of the time window, and the sliding step of the time window.
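The sliding-window step can be sketched as follows. The window length, the step, and the rule that only full windows are kept (no padding) are assumptions for illustration; under that rule the number of feature codes is M = floor((L - W) / S) + 1 for sequence length L, window length W, and step S.

```python
def sliding_windows(coding_sequence, window, step):
    """Slice a coding sequence (rows = phonemes, columns = time points)
    into feature codes using a time window of fixed length and step.
    Only full windows are kept (an assumed boundary rule)."""
    length = len(coding_sequence[0])
    feature_codes = []
    for start in range(0, length - window + 1, step):
        feature_codes.append([row[start:start + window] for row in coding_sequence])
    return feature_codes


# Unsmoothed example rows for the phonemes j, i1, ie4 (8 time points).
coding = [
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
second_coding_sequence = sliding_windows(coding, window=4, step=2)
```

In this example each window spans two phonemes, which illustrates why the number of phonemes covered by one feature code depends on the window position.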
- attitude control vector of at least one partial region of the interactive object corresponding to the feature code is acquired.
- According to feature code 1, feature code 2, feature code 3,..., feature code M, the corresponding attitude control vector 1, attitude control vector 2, attitude control vector 3,..., attitude control vector M can be obtained respectively, thereby obtaining the sequence 350 of attitude control vectors.
- The sequence 350 of attitude control vectors and the second coding sequence 340 are aligned in time. Since each feature code in the second coding sequence is obtained according to at least one phoneme in the phoneme sequence, each control vector in the sequence 350 of attitude control vectors is also obtained based on at least one phoneme in the phoneme sequence.
- While the speech for the phoneme sequence corresponding to the text data is played, the interactive object is driven to make actions according to the sequence of posture control vectors; that is, the interactive object can be driven to emit the sound corresponding to the text content while making actions synchronized with the sound, which gives the target object the feeling that the interactive object is speaking and improves the interactive experience of the target object.
- The attitude control vector value before a set time can be set to a default value; that is, when the phoneme sequence has just started to be played, the interactive object makes a default action, and after the set time, the interactive object is driven to make actions using the sequence of posture control vectors obtained according to the first coding sequence.
- Feature code 1 starts to be output at time t0, and the default attitude control vector is used before time t0.
- The length of the time window is related to the amount of information contained in the feature code. In the case where the amount of information contained in the time window is relatively large, the recurrent neural network will output a relatively uniform result. If the time window is too long, the expression of the interactive object may not correspond to part of the text; if the time window is too short, the expression of the interactive object may appear rigid when speaking. Therefore, the duration of the time window needs to be determined according to the minimum duration of the phonemes corresponding to the text data, so that the actions made by the driven interactive object have a stronger correlation with the sound.
- the sliding step length of the time window is related to the time interval (frequency) of obtaining the attitude control vector, that is, it is related to the frequency of driving the interactive object to make an action.
- the length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.
- When the time interval between phonemes in the phoneme sequence is greater than a set threshold, the interactive object is driven to make actions according to the set posture control vector of the local area. That is, when the interactive character pauses for a long time while speaking, the interactive object is driven to make a set action. For example, when the output voice pauses for a long time, the interactive object can be made to smile or sway its body slightly, so that it does not stand upright without expression during a long pause. This makes the speech of the interactive object more natural and smooth and improves the interaction between the target object and the interactive object.
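Detecting such pauses can be sketched as below. The function name, the timed phoneme tuples, and the 500 ms threshold are hypothetical; the disclosure only states that a set threshold on the interval between phonemes is used.

```python
def long_pauses(timed_phonemes, threshold_ms):
    """Return (start_ms, end_ms) intervals between consecutive phonemes
    whose gap exceeds threshold_ms. During these intervals a caller would
    apply the preset posture control vector (for example, a smile)."""
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(timed_phonemes, timed_phonemes[1:]):
        if next_start - prev_end > threshold_ms:
            pauses.append((prev_end, next_start))
    return pauses


# Hypothetical timing with a 700 ms silence before the last phoneme.
timed = [("j", 0, 100), ("i1", 100, 200), ("j", 900, 1000)]
pauses = long_pauses(timed, threshold_ms=500)
```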
- The feature code may be input to a pre-trained recurrent neural network, and the recurrent neural network outputs the attitude control vector of at least one local area of the interactive object corresponding to the feature code according to the first coding sequence. Since the recurrent neural network is a time-recurrent network, it can learn the historical information of the input feature codes and output the attitude control vector of the at least one local area according to the feature code sequence.
- the characteristic coding sequence includes a first coding sequence and a second coding sequence.
- the recurrent neural network may be, for example, a long short-term memory network (Long Short-Term Memory, LSTM).
- A pre-trained recurrent neural network is used to obtain the posture control vector of at least one local area of the interactive object corresponding to the feature code, and the historical feature information of the feature code is merged with the current feature information, so that the historical attitude control vectors influence the change of the current attitude control vector, making the expression changes and body movements of the interactive character smoother and more natural.
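How history influences the current output can be illustrated with a very small recurrent cell. Note the substitution: this is a plain tanh recurrent cell, not the LSTM mentioned in the text, and the weights are arbitrary untrained numbers chosen only for the illustration.

```python
import math


def rnn_forward(feature_codes, w_in, w_rec, w_out):
    """Run a simple recurrent cell over a sequence of flattened feature codes.

    hidden_t = tanh(w_in @ code_t + w_rec @ hidden_{t-1})
    output_t = w_out @ hidden_t   (stand-in for a posture control vector)
    """
    hidden = [0.0] * len(w_rec)
    outputs = []
    for code in feature_codes:
        new_hidden = []
        for j in range(len(hidden)):
            s = sum(w_in[j][i] * code[i] for i in range(len(code)))
            s += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(s))
        hidden = new_hidden
        outputs.append([sum(w_out[o][j] * hidden[j] for j in range(len(hidden)))
                        for o in range(len(w_out))])
    return outputs


# Arbitrary illustrative weights: 2 inputs, 2 hidden units, 1 output.
w_in = [[0.5, -0.5], [0.3, 0.8]]
w_rec = [[0.1, 0.2], [0.0, 0.4]]
w_out = [[1.0, -1.0]]
outputs = rnn_forward([[1, 0], [1, 0]], w_in, w_rec, w_out)
```

The same feature code is fed in twice, yet the two outputs differ because the second step also sees the hidden state left by the first. This is the mechanism by which earlier feature codes shape the current posture control vector.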
- the recurrent neural network can be trained in the following manner.
- Feature code samples are obtained. Each feature code sample is annotated with a ground-truth value, which is the posture control vector value of at least one local area of the interactive object.
- An initial recurrent neural network is trained on the feature code samples, and the trained recurrent neural network is obtained once the change in network loss satisfies the convergence condition, where the network loss includes the difference between the posture control vector value of the at least one local area predicted by the recurrent neural network and the ground-truth value.
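The loss and stopping rule described above can be sketched as follows. The text only says that the loss includes the difference between predicted and annotated values and that training stops when the change in loss satisfies a convergence condition; the mean squared error and the tolerance-based check used here are assumptions chosen for illustration, as are the function names.

```python
from typing import List, Sequence


def pose_loss(predicted: Sequence[float], labeled: Sequence[float]) -> float:
    """Mean squared difference between a predicted and an annotated
    posture control vector (one assumed form of the network loss)."""
    if len(predicted) != len(labeled):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, labeled)) / len(predicted)


def has_converged(loss_history: List[float], tolerance: float) -> bool:
    """Assumed convergence condition: the loss changed by less than
    `tolerance` between the last two recorded values."""
    if len(loss_history) < 2:
        return False
    return abs(loss_history[-1] - loss_history[-2]) < tolerance


example_loss = pose_loss([1.0, 2.0], [1.0, 4.0])
```

In a training loop, `pose_loss` would be averaged over a batch and appended to `loss_history` after each epoch, with training stopping once `has_converged` returns true.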
- feature code samples can be obtained by the following method.
- the manner of encoding the sample phoneme sequence is the same as the encoding manner of the phoneme sequence corresponding to the text data described above.
- the feature code of at least one phoneme corresponding to the first image frame is obtained.
- the at least one phoneme may be a phoneme within a set range of the appearance time of the first image frame.
- The first image frame is converted into a second image frame containing the interactive object, and the posture control vector value of at least one local area corresponding to the second image frame is obtained.
- The posture control vector value may include the posture control vector values of all the local areas, or the posture control vector values of only some of the local areas.
- The image frame of the real person can be converted into a second image frame containing the image represented by the interactive object, and the posture control vector of each local area of the real person corresponds to the posture control vector of each local area of the interactive object, so that the posture control vector of each local area of the interactive object in the second image frame can be obtained.
- The feature code of the at least one phoneme corresponding to the first image frame obtained above is annotated with the posture control vector value to obtain a feature code sample.
- The video segment of a character is split into a plurality of corresponding first image frames and voice segments, and each first image frame containing the real person is converted into a second image frame containing the interactive object to obtain the posture control vector corresponding to the feature code of the phoneme. The feature code therefore corresponds well to the posture control vector, which yields high-quality feature code samples and brings the actions of the interactive object closer to the real actions of the corresponding character.
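The pairing of image frames with phoneme feature codes described above can be sketched as follows. This is a hedged illustration only: the millisecond time stamps, the `half_range_ms` parameter standing in for the "set range" around a frame's appearance time, the function names, and the toy data are assumptions, and the posture control vectors are assumed to have already been extracted from the converted second image frames.

```python
from typing import List, Tuple

Code = List[float]
Pose = List[float]


def codes_near_frame(frame_time_ms: int, code_times_ms: List[int],
                     codes: List[Code], half_range_ms: int) -> List[Code]:
    """Collect the feature codes whose time stamp lies within
    `half_range_ms` of the frame's appearance time."""
    return [code for t, code in zip(code_times_ms, codes)
            if abs(t - frame_time_ms) <= half_range_ms]


def build_samples(frame_times_ms: List[int], frame_poses: List[Pose],
                  code_times_ms: List[int], codes: List[Code],
                  half_range_ms: int) -> List[Tuple[List[Code], Pose]]:
    """Annotate the feature codes around each frame with that frame's
    posture control vector value, giving (feature codes, label) samples."""
    samples = []
    for frame_time, pose in zip(frame_times_ms, frame_poses):
        window = codes_near_frame(frame_time, code_times_ms, codes, half_range_ms)
        if window:  # skip frames with no phoneme nearby
            samples.append((window, pose))
    return samples


code_times_ms = [0, 100, 200, 300]
codes = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
samples = build_samples([100, 300], [[0.5], [0.9]],
                        code_times_ms, codes, half_range_ms=100)
```

Each resulting sample pairs the feature codes around a frame with the posture control vector value obtained for that frame, which is the annotated form used for training.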
- FIG. 4 shows a schematic structural diagram of a driving device for interactive objects according to at least one embodiment of the present disclosure.
- The device may include: a first acquiring unit 401, configured to acquire a phoneme sequence corresponding to text data; a second acquiring unit 402, configured to acquire a control parameter value of at least one local area of an interactive object matching the phoneme sequence; and a driving unit 403, configured to control the posture of the interactive object according to the acquired control parameter value.
- The device further includes an output unit for controlling the display device that displays the interactive object to display text according to the text data, and/or controlling the display device to output speech according to the phoneme sequence corresponding to the text data.
- The second acquiring unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; obtain a feature code corresponding to at least one phoneme according to the first coding sequence; and obtain a posture control vector of at least one local area of the interactive object corresponding to the feature code.
- When performing feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence, the second acquiring unit is specifically configured to: generate, for each of the multiple phonemes included in the phoneme sequence, a sub-coding sequence corresponding to that phoneme; and obtain the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the multiple phonemes.
- When generating the sub-coding sequence corresponding to each of the multiple phonemes included in the phoneme sequence, the second acquiring unit is specifically configured to: detect whether a first phoneme is present at each time point, the first phoneme being any one of the multiple phonemes; and obtain the sub-coding sequence corresponding to the first phoneme by setting the code value at each time point where the first phoneme is present to a first value and the code value at each time point where the first phoneme is absent to a second value.
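The sub-coding sequences described above can be sketched as follows. This is an illustrative sketch under stated assumptions: time is discretized into integer steps, each phoneme occurrence is given as `(symbol, start_step, end_step)` with an exclusive end, and the first and second values are taken to be 1 and 0. The function name `encode_phonemes` is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical timed phoneme: (symbol, start_step, end_step), end exclusive.
TimedPhoneme = Tuple[str, int, int]


def encode_phonemes(timed: List[TimedPhoneme], num_steps: int,
                    present: float = 1.0,
                    absent: float = 0.0) -> Dict[str, List[float]]:
    """Build one sub-coding sequence per distinct phoneme: `present` at
    time steps where the phoneme occurs, `absent` everywhere else."""
    sub_sequences: Dict[str, List[float]] = {}
    for symbol, start, end in timed:
        row = sub_sequences.setdefault(symbol, [absent] * num_steps)
        for t in range(max(start, 0), min(end, num_steps)):
            row[t] = present
    return sub_sequences


timed = [("j", 0, 2), ("i", 2, 4), ("j", 5, 6)]
sub = encode_phonemes(timed, num_steps=6)

# Stack the per-phoneme rows (in a fixed symbol order) to form the
# first coding sequence for the whole phoneme sequence.
first_coding = [sub[symbol] for symbol in sorted(sub)]
```

Stacking the rows in a fixed order gives a phoneme-by-time matrix, so every time step has one value per phoneme.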
- The device further includes a filtering unit configured to, for the sub-coding sequence corresponding to each of the multiple phonemes, use a Gaussian filter to perform a Gaussian convolution operation on the temporally continuous values of that phoneme. For example, for the sub-coding sequence corresponding to a first phoneme, a Gaussian filter is used to perform a Gaussian convolution operation on the temporally continuous values of the first phoneme, the first phoneme being any one of the multiple phonemes.
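The Gaussian convolution over a sub-coding sequence can be sketched as follows. This is a minimal sketch: the kernel width (`sigma`), the truncation `radius`, and zero padding at the sequence borders are assumptions, since the text does not specify the filter parameters.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Discrete Gaussian weights over [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(sequence: List[float], kernel: List[float]) -> List[float]:
    """Convolve a sub-coding sequence with the kernel, treating values
    outside the sequence as zero."""
    radius = len(kernel) // 2
    smoothed = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t + k - radius
            if 0 <= idx < len(sequence):
                acc += weight * sequence[idx]
        smoothed.append(acc)
    return smoothed


kernel = gaussian_kernel(sigma=1.0, radius=2)
binary_sequence = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
smoothed = smooth(binary_sequence, kernel)
```

The hard 0/1 edges of the binary sequence become gradual ramps, so the coded value rises before a phoneme starts and falls after it ends, which is what allows smoother transitions between mouth shapes.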
- When acquiring the feature code corresponding to at least one phoneme according to the first coding sequence, the second acquiring unit is specifically configured to: slide a time window of a set length over the first coding sequence with a set step size, use the feature code within the time window as the feature code of the corresponding at least one phoneme, and obtain the second coding sequence according to the multiple feature codes obtained by completing the sliding.
- the driving unit is specifically configured to: obtain a sequence of a posture control vector corresponding to the second coding sequence; and control the posture of the interactive object according to the sequence of the posture control vector.
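The sliding-window step can be sketched as follows. The phoneme-by-time list layout, the function name `sliding_windows`, and the window and step values (measured in time steps) are assumptions made for illustration; in practice the window length and step size are set for the interaction scenario.

```python
from typing import List

Row = List[float]


def sliding_windows(first_coding: List[Row], window: int,
                    step: int) -> List[List[Row]]:
    """Slide a window of `window` time steps over a [phoneme][time]
    coding sequence with stride `step`. Each returned item is the slice
    of every phoneme row inside one window, i.e. one feature code."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    num_steps = len(first_coding[0]) if first_coding else 0
    return [
        [row[start:start + window] for row in first_coding]
        for start in range(0, num_steps - window + 1, step)
    ]


first_coding = [
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],  # sub-coding sequence of one phoneme
    [1.0, 1.0, 0.0, 0.0, 0.0, 1.0],  # sub-coding sequence of another phoneme
]
second_coding = sliding_windows(first_coding, window=4, step=2)
```

Each element of `second_coding` is one feature code. Feeding these in order to the network would yield the sequence of posture control vectors used to drive the interactive object.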
- The device further includes a pause driving unit, which is used to control the posture of the interactive object according to the set control parameter value of the local area when the time interval between phonemes in the phoneme sequence is greater than a set threshold.
- When acquiring the posture control vector of at least one local area of the interactive object corresponding to the feature code, the second acquiring unit is specifically configured to: input the feature code into a pre-trained recurrent neural network to obtain the posture control vector of at least one local area of the interactive object corresponding to the feature code.
- The neural network is obtained through training on phoneme sequence samples. The device further includes a sample acquisition unit for: acquiring a video segment of a character uttering speech, and acquiring, from the video segment, a plurality of first image frames containing the character; extracting the corresponding voice segment from the video segment, obtaining a sample phoneme sequence according to the voice segment, and performing feature encoding on the sample phoneme sequence; obtaining the feature code of at least one phoneme corresponding to the first image frame; transforming the first image frame into a second image frame containing the interactive object, and obtaining the posture control vector value of at least one local area corresponding to the second image frame; and annotating the feature code corresponding to the first image frame with the posture control vector value to obtain a feature code sample.
- The device further includes a training unit for training the initial recurrent neural network according to the feature code samples and obtaining the trained recurrent neural network after the change in network loss satisfies the convergence condition, wherein the network loss includes the difference between the posture control vector value of the at least one local area predicted by the recurrent neural network and the annotated posture control vector value.
- At least one embodiment of this specification also provides an electronic device.
- the device includes a memory and a processor.
- the memory is used to store computer instructions that can run on the processor.
- The processor is used to implement the method for driving an interactive object described in any embodiment of the present disclosure when executing the computer instructions.
- At least one embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object according to any embodiment of the present disclosure is realized.
- One or more embodiments of this specification can be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
- The embodiments of the subject matter and functional operations described in this specification can be implemented in: digital electronic circuits, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them.
- The embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing device or to control the operation of the data processing device.
- The program instructions may be encoded on artificially generated propagated signals, such as machine-generated electrical, optical, or electromagnetic signals, which are generated to encode information and transmit it to a suitable receiver device for execution by the data processing device.
- the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the processing and logic flow described in this specification can be executed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating according to input data and generating output.
- the processing and logic flow can also be executed by a dedicated logic circuit, such as FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), and the device can also be implemented as a dedicated logic circuit.
- Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
- the central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
- the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
- A computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such a mass storage device to receive data from it, transmit data to it, or both.
- the computer does not have to have such equipment.
- The computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by or incorporated into a dedicated logic circuit.
Claims (20)
- A method for driving an interactive object, comprising: obtaining a phoneme sequence corresponding to text data; obtaining a control parameter value of at least one local area of an interactive object matching the phoneme sequence; and controlling a posture of the interactive object according to the obtained control parameter value.
- The method according to claim 1, further comprising: controlling, according to the text data, a display device that displays the interactive object to display text, and/or controlling the display device to output speech according to the phoneme sequence corresponding to the text data.
- The method according to claim 1 or 2, wherein the control parameter of the local area of the interactive object includes a posture control vector of the local area; and obtaining the control parameter value of at least one local area of the interactive object matching the phoneme sequence comprises: performing feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; obtaining a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining a posture control vector of at least one local area of the interactive object corresponding to the feature code.
- The method according to claim 3, wherein performing feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence comprises: for each phoneme of multiple phonemes included in the phoneme sequence, generating a sub-coding sequence corresponding to the phoneme; and obtaining the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the multiple phonemes.
- The method according to claim 4, wherein, for each phoneme of the multiple phonemes included in the phoneme sequence, generating the sub-coding sequence corresponding to the phoneme comprises: detecting whether the phoneme is present at each time point; and obtaining the sub-coding sequence corresponding to the phoneme by setting the code value at each time point where the phoneme is present to a first value and the code value at each time point where the phoneme is absent to a second value.
- The method according to claim 5, further comprising: for the sub-coding sequence corresponding to each phoneme of the multiple phonemes, using a Gaussian filter to perform a Gaussian convolution operation on the temporally continuous values of the phoneme.
- The method according to any one of claims 3 to 6, wherein obtaining the feature code corresponding to at least one phoneme according to the first coding sequence comprises: sliding a time window of a set length over the first coding sequence with a set step size, using the feature code within the time window as the feature code of the corresponding at least one phoneme, and obtaining a second coding sequence according to the plurality of feature codes obtained by completing the sliding; and controlling the posture of the interactive object according to the obtained control parameter value comprises: obtaining a sequence of posture control vectors corresponding to the second coding sequence; and controlling the posture of the interactive object according to the sequence of posture control vectors.
- The method according to any one of claims 1 to 7, further comprising: in a case where the time interval between the phonemes in the phoneme sequence is greater than a set threshold, controlling the posture of the interactive object according to a set control parameter value of the local area.
- The method according to claim 3, wherein obtaining the posture control vector of at least one local area of the interactive object corresponding to the feature code comprises: inputting the feature code into a pre-trained recurrent neural network to obtain the posture control vector of at least one local area of the interactive object corresponding to the feature code.
- The method according to claim 9, wherein the recurrent neural network is obtained by training on feature code samples; and the method further comprises: acquiring a video segment of a character uttering speech, and acquiring, according to the video segment, a plurality of first image frames containing the character; extracting a corresponding speech segment from the video segment, obtaining a sample phoneme sequence according to the speech segment, and performing feature encoding on the sample phoneme sequence; obtaining a feature code of at least one phoneme corresponding to the first image frame; transforming the first image frame into a second image frame containing the interactive object, and obtaining a posture control vector value of at least one local area corresponding to the second image frame; and annotating the feature code corresponding to the first image frame according to the posture control vector value to obtain the feature code sample.
- The method according to claim 10, further comprising: training an initial recurrent neural network according to the feature code samples, and obtaining the recurrent neural network after the change in network loss satisfies a convergence condition, wherein the network loss includes the difference between the posture control vector value of the at least one local area predicted by the recurrent neural network and the annotated posture control vector value.
- An apparatus for driving an interactive object, comprising: a first acquiring unit, configured to acquire a phoneme sequence corresponding to text data; a second acquiring unit, configured to acquire a control parameter value of at least one local area of an interactive object matching the phoneme sequence; and a driving unit, configured to control a posture of the interactive object according to the acquired control parameter value.
- The apparatus according to claim 12, further comprising an output unit, configured to control, according to the text data, a display device that displays the interactive object to display text, and/or control the display device to output speech according to the phoneme sequence corresponding to the text data.
- The apparatus according to claim 12 or 13, wherein the second acquiring unit is configured to: perform feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; obtain a feature code corresponding to at least one phoneme according to the first coding sequence; and obtain a posture control vector of at least one local area of the interactive object corresponding to the feature code; wherein performing feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence comprises: for each phoneme of multiple phonemes included in the phoneme sequence, generating a sub-coding sequence corresponding to the phoneme; and obtaining the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the multiple phonemes.
- The apparatus according to claim 14, wherein, when obtaining the feature code corresponding to at least one phoneme according to the first coding sequence, the second acquiring unit is configured to: slide a time window of a set length over the coding sequence with a set step size, use the feature code within the time window as the feature code of the corresponding at least one phoneme, and obtain a second coding sequence according to the plurality of feature codes obtained by completing the sliding; and the driving unit is configured to: obtain a sequence of posture control vectors corresponding to the second coding sequence; and control the posture of the interactive object according to the sequence of posture control vectors.
- The apparatus according to any one of claims 12 to 15, further comprising: a pause driving unit, configured to control the posture of the interactive object according to a set control parameter value of the local area in a case where the time interval between the phonemes in the phoneme sequence is greater than a set threshold.
- The apparatus according to claim 14, wherein, when obtaining the posture control vector of at least one local area of the interactive object corresponding to the feature code, the second acquiring unit is configured to: input the feature code into a pre-trained recurrent neural network to obtain the posture control vector of at least one local area of the interactive object corresponding to the feature code.
- The apparatus according to claim 17, further comprising a sample acquisition unit, configured to: acquire a video segment of a character uttering speech, and acquire, according to the video segment, a plurality of first image frames containing the character; extract a corresponding speech segment from the video segment, obtain a sample phoneme sequence according to the speech segment, and perform feature encoding on the sample phoneme sequence; obtain a feature code of at least one phoneme corresponding to the first image frame; transform the first image frame into a second image frame containing the interactive object, and obtain a posture control vector value of at least one local area corresponding to the second image frame; and annotate the feature code corresponding to the first image frame according to the posture control vector value to obtain the feature code sample; the apparatus further comprising a training unit, configured to train an initial recurrent neural network according to the feature code samples and obtain the recurrent neural network after the change in network loss satisfies a convergence condition, wherein the network loss includes the difference between the posture control vector value of the at least one local area predicted by the recurrent neural network and the annotated posture control vector value.
- An electronic device, comprising a memory and a processor, wherein the memory is used to store computer instructions that can run on the processor, and the processor is used to implement the method according to any one of claims 1 to 11 when executing the computer instructions.
- A computer-readable storage medium having a computer program stored thereon, wherein the method according to any one of claims 1 to 11 is implemented when the computer program is executed by a processor.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020217027692A KR20210124307A (en) | 2020-03-31 | 2020-11-18 | Interactive object driving method, apparatus, device and recording medium |
SG11202111909QA SG11202111909QA (en) | 2020-03-31 | 2020-11-18 | Methods and apparatuses for driving an interactive object, devices and storage media |
JP2021549562A JP2022530935A (en) | 2020-03-31 | 2020-11-18 | Interactive target drive methods, devices, devices, and recording media |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010245802.4A CN111460785B (en) | 2020-03-31 | 2020-03-31 | Method, device and equipment for driving interactive object and storage medium |
CN202010245802.4 | 2020-03-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021196644A1 (en) | 2021-10-07 |
Family
ID=71683475
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/129793 WO2021196644A1 (en) | 2020-03-31 | 2020-11-18 | Method, apparatus and device for driving interactive object, and storage medium |
Country Status (6)
Country | Link |
---|---|
JP (1) | JP2022530935A (en) |
KR (1) | KR20210124307A (en) |
CN (1) | CN111460785B (en) |
SG (1) | SG11202111909QA (en) |
TW (1) | TW202138992A (en) |
WO (1) | WO2021196644A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115409920A (en) * | 2022-08-30 | 2022-11-29 | 重庆爱车天下科技有限公司 | Virtual object lip driving system |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111459450A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111460785B (en) * | 2020-03-31 | 2023-02-28 | 北京市商汤科技开发有限公司 | Method, device and equipment for driving interactive object and storage medium |
KR102601159B1 (en) * | 2022-09-30 | 2023-11-13 | 주식회사 아리아스튜디오 | Virtual human interaction generating device and method therof |
CN115662388A (en) * | 2022-10-27 | 2023-01-31 | 维沃移动通信有限公司 | Avatar face driving method, apparatus, electronic device and medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109377540A (en) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Synthetic method, device, storage medium, processor and the terminal of FA Facial Animation |
CN110136698A (en) * | 2019-04-11 | 2019-08-16 | 北京百度网讯科技有限公司 | For determining the method, apparatus, equipment and storage medium of nozzle type |
CN110876024A (en) * | 2018-08-31 | 2020-03-10 | 百度在线网络技术(北京)有限公司 | Method and device for determining lip action of avatar |
CN111145322A (en) * | 2019-12-26 | 2020-05-12 | 上海浦东发展银行股份有限公司 | Method, apparatus and computer-readable storage medium for driving avatar |
CN111459452A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111459450A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111460785A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111459454A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003058908A (en) * | 2001-08-10 | 2003-02-28 | Minolta Co Ltd | Method and device for controlling face image, computer program and recording medium |
CN102609969B (en) * | 2012-02-17 | 2013-08-07 | 上海交通大学 | Method for processing face and speech synchronous animation based on Chinese text drive |
JP2015038725A (en) * | 2013-07-18 | 2015-02-26 | 国立大学法人北陸先端科学技術大学院大学 | Utterance animation generation device, method, and program |
JP5913394B2 (en) * | 2014-02-06 | 2016-04-27 | Psソリューションズ株式会社 | Audio synchronization processing apparatus, audio synchronization processing program, audio synchronization processing method, and audio synchronization system |
JP2015166890A (en) * | 2014-03-03 | 2015-09-24 | ソニー株式会社 | Information processing apparatus, information processing system, information processing method, and program |
CN106056989B (en) * | 2016-06-23 | 2018-10-16 | 广东小天才科技有限公司 | A kind of interactive learning methods and device, terminal device |
CN107704169B (en) * | 2017-09-26 | 2020-11-17 | 北京光年无限科技有限公司 | Virtual human state management method and system |
CN107891626A (en) * | 2017-11-07 | 2018-04-10 | 嘉善中奥复合材料有限公司 | Urea-formaldehyde moulding powder compression molding system |
CN110176284A (en) * | 2019-05-21 | 2019-08-27 | 杭州师范大学 | Speech apraxia recovery training method based on virtual reality |
2020
Date | Office | Application number | Publication | Status |
---|---|---|---|---|
2020-03-31 | CN | CN202010245802.4A | CN111460785B | Active |
2020-11-18 | WO | PCT/CN2020/129793 | WO2021196644A1 | Application Filing |
2020-11-18 | JP | JP2021549562A | JP2022530935A | Pending |
2020-11-18 | KR | KR1020217027692A | KR20210124307A | Application Discontinuation (not active) |
2020-11-18 | SG | SG11202111909QA | SG11202111909QA | Unknown |
2020-12-16 | TW | TW109144447A | TW202138992A | Unknown |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110876024A (en) * | 2018-08-31 | 2020-03-10 | 百度在线网络技术(北京)有限公司 | Method and device for determining lip action of avatar |
CN109377540A (en) * | 2018-09-30 | 2019-02-22 | 网易(杭州)网络有限公司 | Method, device, storage medium, processor and terminal for synthesizing facial animation |
CN110136698A (en) * | 2019-04-11 | 2019-08-16 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for determining mouth shape |
CN111145322A (en) * | 2019-12-26 | 2020-05-12 | 上海浦东发展银行股份有限公司 | Method, apparatus and computer-readable storage medium for driving avatar |
CN111459452A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111459450A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111460785A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
CN111459454A (en) * | 2020-03-31 | 2020-07-28 | 北京市商汤科技开发有限公司 | Interactive object driving method, device, equipment and storage medium |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115409920A (en) * | 2022-08-30 | 2022-11-29 | 重庆爱车天下科技有限公司 | Virtual object lip driving system |
Also Published As
Publication number | Publication date |
---|---|
CN111460785A (en) | 2020-07-28 |
SG11202111909QA (en) | 2021-11-29 |
CN111460785B (en) | 2023-02-28 |
TW202138992A (en) | 2021-10-16 |
KR20210124307A (en) | 2021-10-14 |
JP2022530935A (en) | 2022-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021196644A1 (en) | | Method, apparatus and device for driving interactive object, and storage medium |
WO2021169431A1 (en) | | Interaction method and apparatus, and electronic device and storage medium |
WO2021196643A1 (en) | | Method and apparatus for driving interactive object, device, and storage medium |
WO2021196646A1 (en) | | Interactive object driving method and apparatus, device, and storage medium |
JP7227395B2 (en) | | Interactive object driving method, apparatus, device, and storage medium |
US11514634B2 (en) | | Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses |
CN112528936B (en) | | Video sequence arrangement method, device, electronic equipment and storage medium |
WO2022252890A1 (en) | | Interaction object driving and phoneme processing methods and apparatus, device and storage medium |
WO2021232876A1 (en) | | Method and apparatus for driving virtual human in real time, and electronic device and medium |
CN113689880A (en) | | Method, device, electronic equipment and medium for driving virtual human in real time |
WO2021196647A1 (en) | | Method and apparatus for driving interactive object, device, and storage medium |
CN110166844B (en) | | Data processing method and device for data processing |
Heisler et al. | | Making an android robot head talk |
KR102514580B1 (en) | | Video transition method, apparatus and computer program |
CN116958328A (en) | | Method, device, equipment and storage medium for synthesizing mouth shape |
Legal Events
Code | Title | Description |
---|---|---|
ENP | Entry into the national phase | Ref document number: 2021549562; Country of ref document: JP; Kind code of ref document: A |
ENP | Entry into the national phase | Ref document number: 20217027692; Country of ref document: KR; Kind code of ref document: A |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20929350; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 20929350; Country of ref document: EP; Kind code of ref document: A1 |
WWE | WIPO information: entry into national phase | Ref document number: 521430720; Country of ref document: SA |