WO2021196646A1 - Method and apparatus for driving an interactive object, device, and storage medium - Google Patents

Method and apparatus for driving an interactive object, device, and storage medium

Info

Publication number
WO2021196646A1
Authority
WO
WIPO (PCT)
Prior art keywords
acoustic feature
sequence
interactive object
speech
control vector
Prior art date
Application number
PCT/CN2020/129814
Other languages
English (en)
Chinese (zh)
Inventor
吴文岩
吴潜溢
钱晨
王宇欣
Original Assignee
北京市商汤科技开发有限公司
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to KR1020217015867A (patent KR20210124182A)
Priority to JP2021529000A (patent JP2022530726A)
Publication of WO2021196646A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04847 Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for driving interactive objects.
  • the embodiments of the present disclosure provide a driving solution for interactive objects.
  • a method for driving an interactive object, comprising: obtaining a sequence of voice frames contained in a voice segment; obtaining a control parameter of at least one local area of the interactive object corresponding to the sequence of voice frames; and controlling the posture of the interactive object according to the acquired control parameter.
  • the method further includes: controlling the display device displaying the interactive object to output voice and/or display text according to the voice segment.
  • the control parameter of the local area of the interactive object includes the posture control vector of the local area; acquiring the control parameter of at least one local area of the interactive object corresponding to the voice frame sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence; acquiring an acoustic feature corresponding to at least one voice frame according to the first acoustic feature sequence; and acquiring the posture control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
  • acquiring, according to the first acoustic feature sequence, the acoustic feature corresponding to at least one speech frame includes: sliding a time window of a set length over the first acoustic feature sequence with a set step size, and taking the acoustic feature vectors within the time window as the acoustic feature of the corresponding at least one speech frame; and obtaining a second acoustic feature sequence from the plurality of acoustic features obtained once the sliding is complete.
  • controlling the posture of the interactive object according to the control parameter includes: obtaining a sequence of posture control vectors corresponding to the second acoustic feature sequence; and controlling the posture of the interactive object according to the sequence of posture control vectors.
  • acquiring the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature includes: inputting the acoustic feature into a pre-trained recurrent neural network to obtain the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the recurrent neural network is obtained through training on acoustic feature samples; the method further includes obtaining an acoustic feature sample, which specifically includes: obtaining a video segment of a character uttering a voice, and extracting the corresponding voice segment from the video segment; sampling the video segment to obtain multiple first image frames containing the character, and sampling the voice segment to obtain multiple voice frames; obtaining the acoustic feature corresponding to the voice frame that corresponds to the first image frame; converting the first image frame into a second image frame containing the interactive object, and obtaining the attitude control vector value of at least one local area corresponding to the second image frame; and annotating the acoustic feature corresponding to the first image frame with the attitude control vector value to obtain the acoustic feature sample.
  • the method further includes: training an initial recurrent neural network according to the acoustic feature samples, and obtaining the trained recurrent neural network once the change in the network loss satisfies a convergence condition, wherein the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the recurrent neural network and the annotated attitude control vector value.
  • a driving device for an interactive object, comprising: a first acquisition unit for acquiring a sequence of speech frames contained in a speech segment; a second acquisition unit for acquiring the control parameter of at least one local area of the interactive object corresponding to the speech frame sequence; and a driving unit configured to control the posture of the interactive object according to the acquired control parameter.
  • an electronic device includes a memory and a processor, the memory is used to store computer instructions that can be run on the processor, and the processor is used to implement, when executing the computer instructions, the method for driving interactive objects described in any of the embodiments provided in the present disclosure.
  • a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for driving an interactive object according to any one of the embodiments provided in the present disclosure.
  • the driving method, apparatus, device, and computer-readable storage medium for an interactive object obtain the voice frame sequence contained in the voice segment, determine the control parameter value of at least one local area of the interactive object according to the voice frame sequence, and control the posture of the interactive object accordingly, so that the interactive object makes a posture that matches the voice segment; the target object then feels that it is communicating with the interactive object, which improves the interactive experience between the target object and the interactive object.
  • FIG. 1 is a schematic diagram of a display device in a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a process of feature encoding a speech frame sequence proposed by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of a driving device for interactive objects proposed in at least one embodiment of the present disclosure
  • FIG. 5 is a schematic structural diagram of an electronic device proposed in at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a method for driving interactive objects.
  • the driving method may be executed by electronic devices such as a terminal device or a server.
  • the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet, or a game console.
  • the server includes a local server or a cloud server, etc., and the method can also be implemented by a processor calling computer-readable instructions stored in a memory.
  • the interaction object may be any virtual image capable of interacting with the target object.
  • the interactive object may be a virtual character, or may also be a virtual animal, virtual item, cartoon image, or other virtual images capable of implementing interactive functions.
  • the display form of the interactive object may be 2D or 3D, which is not limited in the present disclosure.
  • the target object may be a user, a robot, or other smart devices.
  • the interaction manner between the interaction object and the target object may be an active interaction manner or a passive interaction manner.
  • the target object can make a demand by making gestures or body movements, and trigger the interactive object to interact with it by means of active interaction.
  • the interactive object may actively greet the target object, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.
  • the interactive objects may be displayed through terminal devices, which may be televisions, all-in-one machines with display functions, projectors, virtual reality (VR) devices, augmented reality (AR) devices, etc.; the present disclosure does not limit the specific form of the terminal device.
  • FIG. 1 shows a display device proposed by at least one embodiment of the present disclosure.
  • the display device has a transparent display screen, and a stereoscopic picture can be displayed on the transparent display screen to present a virtual scene and interactive objects with a stereoscopic effect.
  • the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters.
  • the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen.
  • the display device is configured with a memory and a processor, and the memory is used to store computer instructions that can run on the processor.
  • the processor is used to implement the method for driving the interactive object provided in the present disclosure when the computer instruction is executed, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.
  • in response to sound-driven data used to drive the interactive object to output voice, the interactive object may emit a specified voice to the target object.
  • the terminal device can generate sound-driven data according to the actions, expressions, identities, preferences, etc. of the target object around the terminal device to drive the interactive object to communicate or respond by emitting a specified voice, thereby providing anthropomorphic services for the target object.
  • the sound-driven data can also be generated in other ways, for example, generated by the server and sent to the terminal device.
  • At least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the interaction experience between the target object and the interactive object.
  • FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 203.
  • in step 201, a sequence of speech frames contained in the speech segment is obtained.
  • the voice segment may be a voice segment corresponding to the sound-driven data of the interactive object, and the sound-driven data may include audio data (voice data), text, and so on.
  • the sound-driven data may be driving data generated by a server or a terminal device according to the actions, expressions, identity, preferences, etc. of the target object interacting with the interactive object, or may be sound-driven data called by the terminal device from the internal memory.
  • the present disclosure does not limit the acquisition method of the sound-driven data.
  • the speech frame sequence contained in the speech segment may be obtained by framing the speech segment.
  • framing divides the voice segment into multiple voice frames, which are arranged in time order to form the voice frame sequence.
  • the number of sampling points per speech frame (its duration) and the frame shift (which determines the overlap between adjacent frames) can be chosen according to the driving requirements of the interactive object, and are not limited in the present disclosure.
  • FIG. 3 shows a schematic diagram of a driving method of interactive objects proposed by at least one embodiment of the present disclosure. The voice segment signal is divided into frames, and the resulting voice frame sequence is shown in FIG. 3.
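  • Dividing a voice segment into time-ordered, overlapping frames can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_into_frames`, the 16 kHz sample rate, the 400-sample frame length, and the 160-sample frame shift are all assumptions chosen for the example, and dropping the incomplete trailing frame is likewise an assumption.

```python
def split_into_frames(samples, frame_len, frame_shift):
    """Split a 1-D list of samples into overlapping frames in time order.

    frame_len and frame_shift are measured in samples. A shift smaller than
    the frame length makes adjacent frames overlap. Trailing samples that do
    not fill a whole frame are dropped (an assumption of this sketch).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


# Hypothetical settings: 16 kHz audio, 25 ms frames (400 samples),
# 10 ms frame shift (160 samples), i.e. roughly 100 frames per second.
one_second = [0.0] * 16000
frames = split_into_frames(one_second, frame_len=400, frame_shift=160)
```

  • A shorter frame shift yields more frames per second and therefore a finer time resolution for the features derived from them, at the cost of more computation.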
  • in step 202, the control parameter value of at least one local area of the interactive object corresponding to the speech frame sequence is acquired.
  • the local area is obtained by dividing the whole (including face and/or body) of the interactive object.
  • the control of one or more local areas of the face may correspond to a series of facial expressions or actions of the interactive object.
  • the control of the eye area may correspond to facial actions of the interactive object such as opening the eyes, closing the eyes, blinking, and changing the viewing angle;
  • the control of the mouth area may correspond to facial actions of the interactive object such as closing the mouth and opening the mouth to different degrees.
  • the control of one or more local areas of the body may correspond to a series of physical actions of the interactive object.
  • the control of the leg area may correspond to the actions of the interactive object such as walking, jumping, and kicking.
  • the control parameter of the local area of the interactive object includes the posture control vector of the local area.
  • the attitude control vector of each local area is used to drive the local area of the interactive object to perform actions.
  • Different posture control vector values correspond to different motions or motion ranges. For example, for the posture control vector of the mouth area, one set of posture control vector values can make the mouth of the interactive object slightly open, and another set of posture control vector values can make the mouth of the interactive object open wider.
  • the corresponding local areas can make different actions or actions with different amplitudes.
  • the local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the posture control vectors of all the local areas can be obtained; when only the expression of the interactive object needs to be controlled, the posture control vector of the local area corresponding to the face can be obtained.
  • the control parameter value of at least one local area of the interactive object can be determined according to the acoustic features of the speech frame sequence, and the control parameter value can also be determined according to other features of the speech frame sequence.
  • the corresponding relationship between a certain characteristic of the speech frame sequence and the control parameter value of the interactive object can be established in advance, and the corresponding control parameter value can be obtained when the speech frame sequence is obtained.
  • the specific method for obtaining the control parameter value of the interaction object matching the speech frame sequence will be described in detail later.
  • in step 203, the posture of the interactive object is controlled according to the acquired control parameter value.
  • the control parameter value (for example, the attitude control vector value) is matched with the speech frame sequence contained in the speech segment. For example, when the display device showing the interactive object is outputting the voice segment, or is displaying the text corresponding to the voice segment, the gesture made by the interactive object is synchronized with the output voice and/or the displayed text, giving the target object the feeling that the interactive object is speaking.
  • the posture of the interactive object is controlled so that the interactive object makes a gesture matching the speech segment; the target object therefore feels that it is communicating with the interactive object, and the interactive experience of the target object is improved.
  • the method is applied to a server, including a local server or a cloud server.
  • the server processes the speech segment, generates the control parameter value of the interactive object, and renders with a three-dimensional rendering engine according to the control parameter value to obtain the animation of the interactive object.
  • the server may send the animation to the terminal for display to communicate or respond to the target object, and may also send the animation to the cloud, so that the terminal can obtain the animation from the cloud to communicate or respond to the target object.
  • the control parameter value may also be sent to the terminal, so that the terminal completes the process of rendering, generating animation, and performing display.
  • the method is applied to a terminal, and the terminal processes the speech segment, generates the control parameter value of the interactive object, and uses the three-dimensional rendering engine to render according to the control parameter value to obtain the animation of the interactive object; the terminal can display the animation to communicate or respond to the target object.
  • the display device displaying the interactive object may be controlled to output voice and/or display text according to the voice segment. And while outputting voice and/or displaying text, the gesture of the interactive object displayed by the display device can be controlled according to the control parameter value.
  • when the voice and/or text output according to the voice segment is kept synchronized with the posture of the interactive object controlled according to the control parameter value, the gesture made by the interactive object is synchronized with the output voice and/or the displayed text, and the target object is given the feeling that the interactive object is speaking.
  • the attitude control vector may be obtained in the following manner.
  • the acoustic feature sequence corresponding to the speech frame sequence is referred to as the first acoustic feature sequence.
  • the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel Frequency Cepstral Coefficients (MFCC), and so on.
  • the first acoustic feature sequence is obtained by processing the entire speech frame sequence.
  • the speech frame sequence can be subjected to windowing, fast Fourier transform, filtering, logarithmic processing, and discrete cosine transform to obtain the MFCC coefficients corresponding to each speech frame.
  • the first acoustic feature sequence is obtained by processing the entire speech frame sequence, and reflects the overall acoustic feature of the speech segment.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence.
  • the first acoustic feature sequence includes the MFCC coefficients of each speech frame.
  • the first acoustic feature sequence obtained according to the speech frame sequence is shown in FIG. 3.
  • the acoustic feature corresponding to at least one speech frame is acquired.
  • the same number of feature vectors, corresponding to the at least one voice frame, may be used as the acoustic feature of those voice frames.
  • the same number of feature vectors may form a feature matrix, and the feature matrix is the acoustic feature corresponding to the at least one speech frame.
  • the N feature vectors in the first acoustic feature sequence form the acoustic features of the corresponding N speech frames; where N is a positive integer.
  • the first acoustic feature sequence may include multiple acoustic features, and the speech frames corresponding to each of the acoustic features may partially overlap.
  • the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature is acquired.
  • the attitude control vector of the at least one local area can be obtained.
  • the local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the posture control vectors of all the local areas can be obtained; when only the expression of the interactive object needs to be controlled, the posture control vector of the local area corresponding to the face can be obtained.
  • while the voice segment is played, the interactive object is driven to make actions according to the attitude control vector corresponding to each acoustic feature obtained from the first acoustic feature sequence, so that while the terminal device outputs sound, the interactive object performs actions that match it, including facial actions, expressions, and body actions, and the target object feels that the interactive object is speaking.
  • since the attitude control vector is related to the acoustic features of the output sound, driving according to the attitude control vector gives the expressions and body movements of the interactive object an emotional component, which makes the speaking process of the interactive object more natural and vivid and improves the interactive experience between the target object and the interactive object.
  • the acoustic feature corresponding to the at least one speech frame may be acquired by performing a sliding window on the first acoustic feature sequence.
  • the acoustic feature vectors within the time window are used as the acoustic feature of the corresponding same number of speech frames, thereby obtaining the acoustic feature corresponding to these speech frames.
  • the second acoustic feature sequence can be obtained according to the obtained multiple acoustic features.
  • the speech frame sequence includes 100 speech frames per second, the length of the time window is 1 s, and the step length is 0.04 s. Since each feature vector in the first acoustic feature sequence corresponds to a speech frame, correspondingly, the first acoustic feature sequence also includes 100 feature vectors per second. During the window sliding process on the first acoustic feature sequence, 100 feature vectors in the time window are obtained each time as the acoustic features of the corresponding 100 speech frames.
  • acoustic feature 1, corresponding to the 1st to 100th speech frames, and acoustic feature 2, corresponding to the 5th to 104th speech frames, are obtained in turn (a 0.04 s step advances the window by 4 frames).
  • after traversing the first acoustic feature sequence, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, thereby obtaining the second acoustic feature sequence, where M is a positive integer whose value is determined by the number of frames in the speech frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
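  • The sliding-window step can be sketched as follows, using the figures from the example in the text (100 feature vectors per second, a 1 s window, a 0.04 s step). The function name `sliding_windows`, the 3 s input, and the choice to drop an incomplete final window are assumptions made for illustration. Under these assumptions, with T feature vectors, a window of W vectors, and a step of S vectors, M = floor((T - W) / S) + 1.

```python
def sliding_windows(feature_vectors, window_len, step):
    """Group per-frame feature vectors into overlapping windows.

    Each window is a list of window_len consecutive feature vectors (a
    feature matrix). Windows are returned in time order; an incomplete
    final window is dropped (an assumption of this sketch).
    """
    if window_len <= 0 or step <= 0:
        raise ValueError("window_len and step must be positive")
    windows = []
    start = 0
    while start + window_len <= len(feature_vectors):
        windows.append(feature_vectors[start:start + window_len])
        start += step
    return windows


# Figures from the text: 100 frames per second, 1 s window, 0.04 s step.
frames_per_second = 100
window_len = round(1.0 * frames_per_second)   # 100 feature vectors per window
step = round(0.04 * frames_per_second)        # window advances by 4 vectors

# Hypothetical 3 s of speech; each "feature vector" is just its frame index.
first_sequence = [[i] for i in range(300)]
second_sequence = sliding_windows(first_sequence, window_len, step)
num_windows = len(second_sequence)            # M = (300 - 100) // 4 + 1
```

  • Each element of `second_sequence` plays the role of one acoustic feature in the second acoustic feature sequence; consecutive elements share all but `step` feature vectors.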
  • for acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding attitude control vector 1, attitude control vector 2, ..., attitude control vector M can be obtained respectively, so as to obtain the sequence of attitude control vectors.
  • the sequence of the attitude control vector and the second acoustic feature sequence are aligned in time.
  • acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained from N feature vectors in the first acoustic feature sequence. Therefore, while the voice frames are played, the interactive object can be driven to make actions according to the sequence of posture control vectors.
  • the attitude control vector before the set time can be set to a default value; that is, when playback of the speech frame sequence has just started, the interactive object makes a default action, and after the set time, the interactive object is driven to make actions using the sequence of posture control vectors obtained from the first acoustic feature sequence.
  • acoustic feature 1 is output at time t0, and subsequent acoustic features are output at intervals of 0.04 s, corresponding to the step size: acoustic feature 2 is output at t1, acoustic feature 3 at t2, and so on, until acoustic feature M is output at t(M-1). Correspondingly, the period ti~t(i+1) corresponds to feature vector (i+1), where i is an integer smaller than (M-1). Before t0, the attitude control vector is the default attitude control vector.
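  • The timing described above can be sketched as a lookup from playback time to the active attitude control vector. The function name `control_vector_at`, the millisecond units, the example value of t0, and holding the last vector after the final interval are assumptions made for this illustration; only the default-before-t0 behaviour and the fixed output interval come from the text.

```python
def control_vector_at(t_ms, t0_ms, step_ms, control_vectors, default_vector):
    """Return the control vector that is active at playback time t_ms.

    Before t0_ms the default vector is used. From t0_ms on, the vector with
    0-based index k is active during
    [t0_ms + k * step_ms, t0_ms + (k + 1) * step_ms).
    After the last interval the last vector is held (an assumption).
    """
    if t_ms < t0_ms or not control_vectors:
        return default_vector
    k = int((t_ms - t0_ms) // step_ms)
    return control_vectors[min(k, len(control_vectors) - 1)]


# Hypothetical values: first vector becomes available at t0 = 1000 ms,
# and a new vector follows every 40 ms (the 0.04 s step size).
vectors = ["v1", "v2", "v3"]
active_at_start = control_vector_at(0, 1000, 40, vectors, "default")
active_at_t0 = control_vector_at(1000, 1000, 40, vectors, "default")
active_at_t1 = control_vector_at(1040, 1000, 40, vectors, "default")
```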
  • the interactive object is driven to make actions according to the sequence of posture control vectors while the voice segment is played, so that the actions of the interactive object are synchronized with the output sound; this gives the target object the feeling that the interactive object is speaking and improves the interactive experience between the target object and the interactive object.
  • the length of the time window is related to the amount of information contained in the acoustic feature. The greater the length of the time window, the more information it contains, and the stronger the correlation between the actions and sounds that drive the interactive object.
  • the sliding step length of the time window is related to the time interval (frequency) of obtaining the attitude control vector, that is, it is related to the frequency of driving the interactive object to make an action.
  • the length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.
  • the acoustic feature includes L-dimensional Mel frequency cepstral coefficients (MFCC), where L is a positive integer.
  • MFCC represents the distribution of the energy of the speech signal in different frequency ranges.
  • the L-dimensional MFCC can be obtained by converting the data of multiple speech frames in the speech frame sequence to the frequency domain and applying a Mel filter bank with L sub-bands.
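  • A minimal, standard-library sketch of computing L-dimensional MFCC for a single speech frame follows. It uses the common textbook recipe (window, power spectrum, triangular Mel filter bank with L sub-bands, logarithm, DCT-II); the Hamming window, the naive DFT, the Mel-scale formula, the energy floor of 1e-10, and all function names are assumptions for illustration rather than details taken from the disclosure. Frames need at least two samples.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame, then return |DFT|^2 for bins 0..n/2 (naive DFT)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_bands, n_fft, sample_rate):
    """Build num_bands triangular filters spaced evenly on the Mel scale."""
    num_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    mel_points = [i * top / (num_bands + 1) for i in range(num_bands + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for b in range(1, num_bands + 1):
        left, center, right = bins[b - 1], bins[b], bins[b + 1]
        filt = [0.0] * num_bins
        for k in range(left, center):       # rising edge
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def mfcc(frame, sample_rate, num_bands):
    """Return num_bands coefficients: power spectrum -> Mel bank -> log -> DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_bands, len(frame), sample_rate)
    # Floor the band energy so that empty bands do not produce log(0).
    log_energies = [math.log(max(sum(w * p for w, p in zip(filt, spec)), 1e-10))
                    for filt in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_bands)
                for j, e in enumerate(log_energies))
            for c in range(num_bands)]


# Hypothetical frame: 64 samples of a 440 Hz tone at 8 kHz, L = 8 sub-bands.
frame = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(64)]
coefficients = mfcc(frame, sample_rate=8000, num_bands=8)
```

  • In practice a fast Fourier transform and an established audio library would replace the naive DFT used here; the sketch only shows how L sub-bands lead to L coefficients per frame.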
  • by inputting the acoustic feature into a pre-trained recurrent neural network, the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature can be obtained. Since a recurrent neural network is a time-recursive neural network, it can learn the historical information of the input acoustic features, and output the attitude control vector of the at least one local area according to the acoustic feature sequence.
  • the acoustic feature sequence includes a first acoustic feature sequence and a second acoustic feature sequence.
  • a pre-trained recurrent neural network is used to obtain the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature; it fuses the historical feature information of the acoustic features with the current feature information, so that historical attitude control vectors influence the change in the current attitude control vector, making the expression changes and body movements of the interactive character smoother and more natural.
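  • How a recurrent network carries history from one acoustic feature to the next can be sketched with a single-layer Elman-style cell. The disclosure does not specify the network architecture, so the class `TinyRNN`, its layer sizes, the tanh activation, and the random (untrained) weights are all assumptions; each acoustic feature is also assumed to have been flattened into one input vector.

```python
import math
import random


class TinyRNN:
    """Elman-style cell: h_t = tanh(W_x x_t + W_h h_(t-1) + b), y_t = W_y h_t."""

    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_x = matrix(hidden_size, input_size)
        self.w_h = matrix(hidden_size, hidden_size)
        self.b = [0.0] * hidden_size
        self.w_y = matrix(output_size, hidden_size)

    def step(self, x, h_prev):
        """Combine the current input with the previous hidden state."""
        h = [math.tanh(sum(w * xi for w, xi in zip(self.w_x[j], x))
                       + sum(w * hi for w, hi in zip(self.w_h[j], h_prev))
                       + self.b[j])
             for j in range(self.hidden_size)]
        y = [sum(w * hj for w, hj in zip(row, h)) for row in self.w_y]
        return y, h

    def run(self, feature_sequence):
        """Map a sequence of acoustic features to a sequence of control vectors."""
        h = [0.0] * self.hidden_size
        outputs = []
        for x in feature_sequence:
            y, h = self.step(x, h)
            outputs.append(y)
        return outputs


# Hypothetical sizes: 3-dimensional inputs, 4 hidden units, 2 control values.
rnn = TinyRNN(input_size=3, hidden_size=4, output_size=2)
outputs = rnn.run([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
```

  • The first and third inputs are identical, yet their outputs differ because the hidden state at the third step still reflects the earlier inputs; this is the sense in which historical information influences the current attitude control vector.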
  • the recurrent neural network can be trained in the following manner.
  • an acoustic feature sample is obtained, the acoustic feature sample is annotated with a true value, and the true value is a posture control vector value of at least one local area of the interactive object.
  • the initial recurrent neural network is trained according to the acoustic feature samples, and the recurrent neural network is obtained once the change in the network loss satisfies the convergence condition, wherein the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the recurrent neural network and the true value.
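  • The stopping rule can be sketched independently of the network architecture. The example below deliberately uses a plain linear model as a stand-in for the recurrent network, a mean squared error as the "difference" between predicted and annotated values, and a fixed tolerance on the change in loss as the convergence condition; all three choices, and every name in the code, are assumptions made for illustration.

```python
import numpy as np


def train_until_converged(features, targets, lr=0.1, tol=1e-8, max_epochs=10000):
    """Gradient descent that stops when the change in loss falls below tol."""
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((features.shape[1], targets.shape[1])) * 0.01
    previous_loss = np.inf
    for epoch in range(max_epochs):
        error = features @ weights - targets          # predicted minus annotated values
        loss = float(np.mean(error ** 2))             # network loss (assumed MSE)
        if abs(previous_loss - loss) < tol:           # convergence condition on the change
            break
        previous_loss = loss
        weights -= lr * 2.0 * features.T @ error / error.size
    return weights, loss, epoch


rng = np.random.default_rng(42)
sample_features = rng.standard_normal((200, 4))       # synthetic acoustic features
true_mapping = rng.standard_normal((4, 3))
annotated_values = sample_features @ true_mapping     # synthetic annotated control vectors
learned, final_loss, epochs_run = train_until_converged(sample_features, annotated_values)
print(epochs_run, final_loss)
```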
  • the acoustic feature samples can be obtained by the following method.
  • the video segment is sampled according to the first sampling period to obtain multiple first image frames containing the character;
  • the voice segment is sampled according to the second sampling period to obtain multiple voice frames.
  • the second sampling period is less than the first sampling period, that is, the frequency of sampling the voice segment is higher than the frequency of sampling the video segment, so that one first image frame can correspond to the acoustic feature of at least one voice frame.
  • the acoustic feature corresponding to the at least one speech frame corresponding to the first image frame is acquired.
  • the number of speech frames corresponding to one first image frame in the training process is the same as the number of speech frames corresponding to the acoustic feature obtained in the aforementioned driving process, and the method of obtaining the acoustic feature in the training process is the same as in the aforementioned driving process.
  • the first image frame is converted into a second image frame containing the interactive object, and the attitude control vector value of at least one local area corresponding to the second image frame is obtained.
  • the attitude control vector value may include the attitude control vector value of all the local areas, and may also include the attitude control vector value of some of the local areas.
  • the image frame of the real person can be converted into a second image frame containing the image represented by the interactive object, and the posture control vector of each local area of the real person corresponds to the posture control vector of each local area of the interactive object, so that the posture control vector of each local area of the interactive object in the second image frame can be obtained.
  • an acoustic feature corresponding to the first image frame is annotated to obtain an acoustic feature sample.
  • the video segment of a character is split into corresponding multiple first image frames and multiple voice frames, and the first image frame containing the real person is converted into a second image frame containing the interactive object to obtain the attitude control vector corresponding to the acoustic feature of at least one speech frame, so that the acoustic feature corresponds well to the attitude control vector. This yields high-quality acoustic feature samples and brings the action of the interactive object closer to the real action of the corresponding character.
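  • The pairing of image frames with voice frames under the two sampling periods can be sketched as follows. The function name `pair_frames`, the use of integer milliseconds, and the choice to centre the voice frames on the image-frame timestamp are assumptions; the disclosure only requires that the second sampling period be shorter than the first so that each image frame corresponds to at least one voice frame.

```python
def pair_frames(duration_ms, image_period_ms, voice_period_ms, frames_per_image):
    """Map each image-frame index to the indices of the voice frames paired with it."""
    if voice_period_ms >= image_period_ms:
        raise ValueError("the voice sampling period must be shorter than the image period")
    num_images = duration_ms // image_period_ms
    num_voice = duration_ms // voice_period_ms
    half = frames_per_image // 2
    pairs = {}
    for image_index in range(num_images):
        # Voice frame sampled at (or just before) the image-frame timestamp.
        center = (image_index * image_period_ms) // voice_period_ms
        start = max(0, center - half)
        stop = min(num_voice, start + frames_per_image)
        pairs[image_index] = list(range(start, stop))
    return pairs


# One second of video at 25 image frames per second, voice frames every 10 ms.
pairs = pair_frames(duration_ms=1000, image_period_ms=40,
                    voice_period_ms=10, frames_per_image=4)
print(pairs[0], pairs[1])
```

  • Using the same `frames_per_image` at training time and at driving time keeps the number of speech frames per acoustic feature consistent between the two processes, as the description requires.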
  • FIG. 4 shows a schematic structural diagram of a device for driving an interactive object according to at least one embodiment of the present disclosure.
  • the device may include: a first obtaining unit 401, configured to obtain a sequence of speech frames contained in a speech segment;
  • the second acquiring unit 402 is configured to acquire a control parameter of at least one partial region of the interaction object corresponding to the speech frame sequence;
  • the driving unit 403 is configured to control the posture of the interaction object according to the acquired control parameter.
  • the device further includes an output unit for controlling the display device displaying the interactive object to output voice and/or display text according to the voice segment.
  • the control parameter of the local area of the interactive object includes a posture control vector of the local area; the second acquiring unit is specifically configured to: acquire the first acoustic feature sequence corresponding to the speech frame sequence; acquire the acoustic feature corresponding to at least one speech frame according to the first acoustic feature sequence; and acquire the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence, and after acquiring at least one voice frame corresponding to the first acoustic feature sequence
  • the second acquiring unit is specifically configured to: acquire a sequence of posture control vectors corresponding to the second acoustic feature sequence; and control the posture of the interactive object according to the sequence of posture control vectors.
  • the driving unit is specifically configured to: obtain a sequence of a posture control vector corresponding to the second acoustic feature sequence; and control the posture of the interactive object according to the sequence of the posture control vector.
  • the second acquiring unit, when acquiring the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature, is specifically configured to: input the acoustic feature into a pre-trained recurrent neural network to obtain a posture control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the recurrent neural network is obtained through acoustic feature sample training;
  • the device further includes a sample acquisition unit, configured to: extract the voice segment of the character from the acquired video segment; sample the video segment to obtain a plurality of first image frames containing the character; sample the voice segment to obtain a plurality of speech frames; obtain the acoustic feature of the speech frame corresponding to the first image frame;
  • the first image frame is converted into a second image frame containing the interactive object, and the attitude control vector value of at least one local area corresponding to the second image frame is obtained;
  • the acoustic feature corresponding to the first image frame is annotated to obtain an acoustic feature sample.
  • the device further includes a training unit configured to train the initial recurrent neural network according to the acoustic feature samples and to obtain the recurrent neural network once the change in network loss satisfies the convergence condition, wherein the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the initial recurrent neural network and the labeled attitude control vector value.
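  • The way the first obtaining unit 401, the second acquiring unit 402, and the driving unit 403 cooperate can be sketched as a simple pipeline. The class `DrivingDevice` and the toy callables plugged into it are hypothetical placeholders; real units would perform speech framing, feature extraction with network inference, and rendering respectively.

```python
from typing import Callable, List, Sequence


class DrivingDevice:
    """Hypothetical composition of units 401, 402 and 403."""

    def __init__(
        self,
        first_obtaining_unit: Callable[[Sequence[float]], List[Sequence[float]]],
        second_acquiring_unit: Callable[[List[Sequence[float]]], List[float]],
        driving_unit: Callable[[List[float]], None],
    ):
        self.first_obtaining_unit = first_obtaining_unit
        self.second_acquiring_unit = second_acquiring_unit
        self.driving_unit = driving_unit

    def run(self, speech_segment: Sequence[float]) -> List[float]:
        frames = self.first_obtaining_unit(speech_segment)   # speech frame sequence
        params = self.second_acquiring_unit(frames)          # control parameters
        self.driving_unit(params)                            # control the posture
        return params


applied = []   # records what the toy driving unit received
device = DrivingDevice(
    first_obtaining_unit=lambda segment: [segment[i:i + 2] for i in range(0, len(segment), 2)],
    second_acquiring_unit=lambda frames: [sum(frame) / len(frame) for frame in frames],
    driving_unit=applied.extend,
)
result = device.run([0.0, 2.0, 4.0, 6.0])
print(result)
```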
  • At least one embodiment of this specification also provides an electronic device.
  • the device includes a memory and a processor.
  • the memory is used to store computer instructions that can run on the processor.
  • the processor implements the method for driving an interactive object described in any embodiment of the present disclosure when executing the computer instructions.
  • At least one embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object according to any embodiment of the present disclosure is realized.
  • one or more embodiments of this specification can be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • the embodiments of the subject matter and functional operations described in this specification can be implemented in the following: digital electronic circuits, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them.
  • the embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing device or to control the operation of the data processing device.
  • the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information and transmit it to a suitable receiver device for execution by the data processing device.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processing and logic flow described in this specification can be executed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating according to input data and generating output.
  • the processing and logic flow can also be executed by a dedicated logic circuit, such as FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), and the device can also be implemented as a dedicated logic circuit.
  • Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • the computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such a mass storage device to receive data from it, transmit data to it, or both.
  • the computer does not have to have such equipment.
  • the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by or incorporated into a dedicated logic circuit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method and apparatus for driving an interactive object, a device, and a storage medium. The method comprises the steps of: obtaining a sequence of speech frames contained in a speech segment; obtaining a control parameter of at least one local area of an interactive object corresponding to the sequence of speech frames; and controlling the posture of said local area of the interactive object according to the obtained control parameter.
PCT/CN2020/129814 2020-03-31 2020-11-18 Procédé et appareil de commande d'objet interactif, dispositif et support de stockage WO2021196646A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020217015867A KR20210124182A (ko) 2020-03-31 2020-11-18 인터렉티브 대상 구동 방법, 장치, 디바이스 및 기록 매체
JP2021529000A JP2022530726A (ja) 2020-03-31 2020-11-18 インタラクティブ対象駆動方法、装置、デバイス、及び記録媒体

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010247276.5A CN111459454B (zh) 2020-03-31 2020-03-31 交互对象的驱动方法、装置、设备以及存储介质
CN202010247276.5 2020-03-31

Publications (1)

Publication Number Publication Date
WO2021196646A1 true WO2021196646A1 (fr) 2021-10-07

Family

ID=71678881

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129814 WO2021196646A1 (fr) 2020-03-31 2020-11-18 Procédé et appareil de commande d'objet interactif, dispositif et support de stockage

Country Status (5)

Country Link
JP (1) JP2022530726A (fr)
KR (1) KR20210124182A (fr)
CN (2) CN111459454B (fr)
TW (1) TW202139052A (fr)
WO (1) WO2021196646A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116208A1 (fr) * 2021-12-24 2023-06-29 上海商汤智能科技有限公司 Procédé et appareil de génération d'objet numérique, et dispositif et support de stockage

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459454B (zh) * 2020-03-31 2021-08-20 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质
CN111460785B (zh) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质
CN112527115B (zh) * 2020-12-15 2023-08-04 北京百度网讯科技有限公司 用户形象生成方法、相关装置及计算机程序产品
CN113050859B (zh) * 2021-04-19 2023-10-24 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质
CN113314104B (zh) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 交互对象驱动和音素处理方法、装置、设备以及存储介质
CN114283227B (zh) * 2021-11-26 2023-04-07 北京百度网讯科技有限公司 虚拟人物的驱动方法、装置、电子设备及可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120284029A1 (en) * 2011-05-02 2012-11-08 Microsoft Corporation Photo-realistic synthesis of image sequences with lip movements synchronized with speech
CN110136698A (zh) * 2019-04-11 2019-08-16 北京百度网讯科技有限公司 用于确定嘴型的方法、装置、设备和存储介质
CN110288682A (zh) * 2019-06-28 2019-09-27 北京百度网讯科技有限公司 用于控制三维虚拟人像口型变化的方法和装置
CN111459454A (zh) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3674875B2 (ja) * 1994-10-24 2005-07-27 株式会社イメージリンク アニメーションシステム
JP3212578B2 (ja) * 1999-06-30 2001-09-25 インタロボット株式会社 身体的音声反応玩具
JP2001034785A (ja) * 1999-07-16 2001-02-09 Atr Media Integration & Communications Res Lab 仮想変身装置
JP2003248837A (ja) * 2001-11-12 2003-09-05 Mega Chips Corp 画像作成装置、画像作成システム、音声生成装置、音声生成システム、画像作成用サーバ、プログラム、および記録媒体
JP4543263B2 (ja) * 2006-08-28 2010-09-15 株式会社国際電気通信基礎技術研究所 アニメーションデータ作成装置及びアニメーションデータ作成プログラム
CN102609969B (zh) * 2012-02-17 2013-08-07 上海交通大学 基于汉语文本驱动的人脸语音同步动画的处理方法
JP2015166890A (ja) * 2014-03-03 2015-09-24 ソニー株式会社 情報処理装置、情報処理システム、情報処理方法及びプログラム
US9818409B2 (en) * 2015-06-19 2017-11-14 Google Inc. Context-dependent modeling of phonemes
CN106056989B (zh) * 2016-06-23 2018-10-16 广东小天才科技有限公司 一种语言学习方法及装置、终端设备
CN109789550B (zh) * 2016-07-27 2023-05-30 华纳兄弟娱乐公司 基于小说或表演中的先前角色描绘的社交机器人的控制
JP6945375B2 (ja) * 2017-07-27 2021-10-06 株式会社バンダイナムコエンターテインメント 画像生成装置及びプログラム
CN107704169B (zh) * 2017-09-26 2020-11-17 北京光年无限科技有限公司 虚拟人的状态管理方法和系统
JP2019078857A (ja) * 2017-10-24 2019-05-23 国立研究開発法人情報通信研究機構 音響モデルの学習方法及びコンピュータプログラム
CN107861626A (zh) * 2017-12-06 2018-03-30 北京光年无限科技有限公司 一种虚拟形象被唤醒的方法及系统
JP7140984B2 (ja) * 2018-02-16 2022-09-22 日本電信電話株式会社 非言語情報生成装置、非言語情報生成モデル学習装置、方法、及びプログラム
WO2019160105A1 (fr) * 2018-02-16 2019-08-22 日本電信電話株式会社 Dispositif de génération d'informations non verbales, dispositif d'apprentissage de modèle de génération d'informations non verbales, procédé et programme
CN108942919B (zh) * 2018-05-28 2021-03-30 北京光年无限科技有限公司 一种基于虚拟人的交互方法及系统
CN110310662A (zh) * 2019-05-21 2019-10-08 平安科技(深圳)有限公司 音节自动标注方法、装置、计算机设备及存储介质
CN110176284A (zh) * 2019-05-21 2019-08-27 杭州师范大学 一种基于虚拟现实的言语失用症康复训练方法
CN110400251A (zh) * 2019-06-13 2019-11-01 深圳追一科技有限公司 视频处理方法、装置、终端设备及存储介质
CN110503942A (zh) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 一种基于人工智能的语音驱动动画方法和装置
CN110794964A (zh) * 2019-10-22 2020-02-14 深圳追一科技有限公司 虚拟机器人的交互方法、装置、电子设备及存储介质
CN110929762B (zh) * 2019-10-30 2023-05-12 中科南京人工智能创新研究院 一种基于深度学习的肢体语言检测与行为分析方法及系统
CN110815258B (zh) * 2019-10-30 2023-03-31 华南理工大学 基于电磁力反馈和增强现实的机器人遥操作系统和方法

Also Published As

Publication number Publication date
CN111459454B (zh) 2021-08-20
JP2022530726A (ja) 2022-07-01
CN113672194A (zh) 2021-11-19
KR20210124182A (ko) 2021-10-14
TW202139052A (zh) 2021-10-16
CN111459454A (zh) 2020-07-28

Similar Documents

Publication Publication Date Title
WO2021196646A1 (fr) Procédé et appareil de commande d'objet interactif, dispositif et support de stockage
WO2021169431A1 (fr) Procédé et appareil d'interaction, et dispositif électronique et support de stockage
TWI766499B (zh) 互動物件的驅動方法、裝置、設備以及儲存媒體
TWI760015B (zh) 互動物件的驅動方法、裝置、設備以及儲存媒體
WO2021196644A1 (fr) Procédé, appareil et dispositif permettant d'entraîner un objet interactif, et support d'enregistrement
JP7193015B2 (ja) コミュニケーション支援プログラム、コミュニケーション支援方法、コミュニケーション支援システム、端末装置及び非言語表現プログラム
CN110162598B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
CN113299312A (zh) 一种图像生成方法、装置、设备以及存储介质
CN113689879A (zh) 实时驱动虚拟人的方法、装置、电子设备及介质
RU2721180C1 (ru) Способ генерации анимационной модели головы по речевому сигналу и электронное вычислительное устройство, реализующее его
TW202248994A (zh) 互動對象驅動和音素處理方法、設備以及儲存媒體
CN113050859A (zh) 交互对象的驱动方法、装置、设备以及存储介质
WO2021196647A1 (fr) Procédé et appareil permettant de commander un objet interactif, dispositif, et support de stockage
Pham et al. Learning continuous facial actions from speech for real-time animation
CN110166844B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
CN116958328A (zh) 口型合成方法、装置、设备及存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021529000

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20928143

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20928143

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 521430187

Country of ref document: SA