WO2021196646A1 - Interactive object driving method and apparatus, device, and storage medium

Publication number
WO2021196646A1
WO2021196646A1 PCT/CN2020/129814 CN2020129814W WO2021196646A1 WO 2021196646 A1 WO2021196646 A1 WO 2021196646A1 CN 2020129814 W CN2020129814 W CN 2020129814W WO 2021196646 A1 WO2021196646 A1 WO 2021196646A1
Authority
WO
WIPO (PCT)
Prior art keywords
acoustic feature
sequence
interactive object
speech
control vector
Prior art date
Application number
PCT/CN2020/129814
Other languages
French (fr)
Chinese (zh)
Inventor
吴文岩
吴潜溢
钱晨
王宇欣
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to JP2021529000A (JP2022530726A)
Priority to KR1020217015867A (KR20210124182A)
Publication of WO2021196646A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04847 Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14 Digital output to display device; Cooperation and interconnection of the display device with other functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for driving interactive objects.
  • the embodiments of the present disclosure provide a driving solution for interactive objects.
  • a method for driving an interactive object is provided, comprising: obtaining a sequence of voice frames contained in a voice segment; obtaining a control parameter of at least one local area of the interactive object corresponding to the sequence of voice frames; and controlling the posture of the interactive object according to the acquired control parameter.
  • the method further includes: controlling the display device displaying the interactive object to output voice and/or display text according to the voice segment.
  • the control parameter of the local area of the interactive object includes the posture control vector of the local area; acquiring the control parameter of at least one local area of the interactive object corresponding to the speech frame sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence; acquiring an acoustic feature corresponding to at least one voice frame according to the first acoustic feature sequence; and acquiring the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
  • acquiring, according to the first acoustic feature sequence, the acoustic feature corresponding to at least one speech frame includes: sliding a time window of a set length over the first acoustic feature sequence with a set step size, taking the acoustic feature vectors within the time window as the acoustic feature of the corresponding at least one speech frame, and obtaining a second acoustic feature sequence from the plurality of acoustic features produced by the completed sliding window.
  • controlling the posture of the interactive object according to the control parameter includes: obtaining a sequence of posture control vectors corresponding to the second acoustic feature sequence; and controlling the posture of the interactive object according to the sequence of posture control vectors.
  • acquiring the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature includes: inputting the acoustic feature into a pre-trained recurrent neural network to obtain the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the recurrent neural network is obtained through training on acoustic feature samples; the method further includes obtaining an acoustic feature sample, which specifically includes: obtaining a video segment of a character uttering a voice, and extracting the corresponding voice segment from the video segment; sampling the video segment to obtain multiple first image frames containing the character, and sampling the voice segment to obtain multiple voice frames; obtaining the acoustic feature corresponding to the voice frame that corresponds to the first image frame; converting the first image frame into a second image frame containing the interactive object, and obtaining the attitude control vector value of at least one local area corresponding to the second image frame; and annotating the acoustic feature corresponding to the first image frame according to the attitude control vector value, to obtain the acoustic feature sample.
  • the method further includes: training an initial recurrent neural network according to the acoustic feature samples, and obtaining the trained recurrent neural network once the change in the network loss satisfies a convergence condition, wherein the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the recurrent neural network and the annotated attitude control vector value.
  • a driving device for an interactive object is provided, comprising: a first acquisition unit for acquiring a sequence of speech frames contained in a speech segment; a second acquisition unit for acquiring the control parameter of at least one local area of the interactive object corresponding to the speech frame sequence; and a driving unit configured to control the posture of the interactive object according to the acquired control parameter.
  • an electronic device includes a memory and a processor; the memory is used to store computer instructions that can be run on the processor, and the processor, when executing the computer instructions, implements the method for driving interactive objects described in any of the embodiments provided in the present disclosure.
  • a computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the method for driving an interactive object according to any one of the embodiments provided in the present disclosure.
  • the driving method, apparatus, device, and computer-readable storage medium for an interactive object obtain the voice frame sequence contained in the voice segment, determine the control parameter value of at least one local area of the interactive object according to the voice frame sequence, and control the posture of the interactive object accordingly, so that the interactive object makes a posture that matches the voice segment. The target object thus feels that it is communicating with the interactive object, which improves the interactive experience between the target object and the interactive object.
  • FIG. 1 is a schematic diagram of a display device in a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of a process of feature encoding a speech frame sequence proposed by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of a driving device for interactive objects proposed in at least one embodiment of the present disclosure
  • FIG. 5 is a schematic structural diagram of an electronic device proposed in at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a method for driving interactive objects.
  • the driving method may be executed by electronic devices such as a terminal device or a server.
  • the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet, or a game.
  • the server includes a local server or a cloud server, etc., and the method can also be implemented by a processor calling computer-readable instructions stored in a memory.
  • the interaction object may be any virtual image capable of interacting with the target object.
  • the interactive object may be a virtual character, or may also be a virtual animal, virtual item, cartoon image, or other virtual images capable of implementing interactive functions.
  • the display form of the interactive object may be 2D or 3D, which is not limited in the present disclosure.
  • the target object may be a user, a robot, or other smart devices.
  • the interaction manner between the interaction object and the target object may be an active interaction manner or a passive interaction manner.
  • the target object can make a demand by making gestures or body movements, and trigger the interactive object to interact with it by means of active interaction.
  • the interactive object may actively greet the target object, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.
  • the interactive objects may be displayed through terminal devices, which may be televisions, all-in-one machines with display functions, projectors, virtual reality (VR) devices, augmented reality (AR) devices, etc.; the present disclosure does not limit the specific form of the terminal device.
  • Fig. 1 shows a display device proposed by at least one embodiment of the present disclosure.
  • the display device has a transparent display screen, and a stereoscopic picture can be displayed on the transparent display screen to present a virtual scene and interactive objects with a stereoscopic effect.
  • the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters.
  • the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen.
  • the display device is configured with a memory and a processor, and the memory is used to store computer instructions that can run on the processor.
  • the processor is used to implement the method for driving the interactive object provided in the present disclosure when the computer instruction is executed, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.
  • the interactive object in response to the sound-driven data used to drive the interactive object to output voice, the interactive object may emit a specified voice to the target object.
  • the terminal device can generate sound-driven data according to the actions, expressions, identities, preferences, etc. of the target object around the terminal device to drive the interactive object to communicate or respond by emitting a specified voice, thereby providing anthropomorphic services for the target object.
  • the sound-driven data can also be generated in other ways, for example, generated by the server and sent to the terminal device.
  • At least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the interaction experience between the target object and the interactive object.
  • FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 203.
  • step 201 a sequence of speech frames contained in the speech segment is obtained.
  • the voice segment may be a voice segment corresponding to the sound-driven data of the interactive object, and the sound-driven data may include audio data (voice data), text, and so on.
  • the sound-driven data may be driving data generated by a server or a terminal device according to the actions, expressions, identity, preferences, etc. of the target object interacting with the interactive object, or may be sound-driven data called by the terminal device from the internal memory.
  • the present disclosure does not limit the acquisition method of the sound-driven data.
  • the speech frame sequence contained in the speech segment may be obtained by performing framing processing on the speech segment.
  • framing divides the voice segment into multiple voice frames, which are arranged in time order to form a voice frame sequence.
  • the number of sampling points per speech frame (its duration) and the frame shift (which determines the overlap between adjacent frames) can be chosen according to the driving requirements of the interactive object; neither is limited in the present disclosure.
  • FIG. 3 shows a schematic diagram of a driving method of interactive objects proposed by at least one embodiment of the present disclosure. The voice segment signal is split into frames, and the resulting voice frame sequence is shown in FIG. 3.
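The splitting of a voice segment into a time-ordered frame sequence can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_into_frames`, the 16 kHz sample rate, the 25 ms frame length, and the 10 ms frame shift are all assumptions chosen for the example, since the disclosure leaves the frame length and frame shift open.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Split a 1-D sample sequence into overlapping frames, in time order."""
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped here.
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


# Hypothetical numbers: 16 kHz audio, 25 ms frames, 10 ms frame shift.
sample_rate = 16000
frame_length = sample_rate * 25 // 1000   # 400 samples per frame
frame_shift = sample_rate * 10 // 1000    # 160 samples between frame starts
signal = [0.0] * sample_rate              # one second of silence as a stand-in
frames = split_into_frames(signal, frame_length, frame_shift)
```

With a 10 ms shift this produces roughly 100 frames per second of audio; a larger shift reduces both the overlap between frames and the number of frames.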
  • step 202 the control parameter value of at least one partial region of the interactive object corresponding to the speech frame sequence is acquired.
  • the local area is obtained by dividing the whole (including face and/or body) of the interactive object.
  • the control of one or more local areas of the face may correspond to a series of facial expressions or actions of the interactive object.
  • the control of the eye area may correspond to facial actions of the interactive object such as opening the eyes, closing the eyes, blinking, and changing the viewing angle;
  • the control of the mouth area can correspond to facial actions such as closing the mouth of the interactive object and opening the mouth to different degrees.
  • the control of one or more local areas of the body may correspond to a series of physical actions of the interactive object.
  • the control of the leg area may correspond to the actions of the interactive object such as walking, jumping, and kicking.
  • the control parameter of the local area of the interactive object includes the posture control vector of the local area.
  • the attitude control vector of each local area is used to drive the local area of the interactive object to perform actions.
  • Different posture control vector values correspond to different motions or motion ranges. For example, for the posture control vector of the mouth area, one set of posture control vector values can make the mouth of the interactive object slightly open, and another set of posture control vector values can make the mouth of the interactive object open wider.
  • by driving the interactive object with different posture control vector values, the corresponding local areas can make different actions or actions with different amplitudes.
  • the local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the posture control vector of all the local areas can be obtained; when the expression of the interactive object needs to be controlled, the posture control vector of the local area corresponding to the face can be obtained.
  • control parameter value of at least one local area of the interactive object can be determined according to the acoustic characteristics of the speech frame sequence, and the control parameter value can also be determined according to other characteristics of the speech frame sequence.
  • the corresponding relationship between a certain characteristic of the speech frame sequence and the control parameter value of the interactive object can be established in advance, and the corresponding control parameter value can be obtained when the speech frame sequence is obtained.
  • the specific method for obtaining the control parameter value of the interaction object matching the speech frame sequence will be described in detail later.
  • step 203 the posture of the interactive object is controlled according to the acquired control parameter value.
  • the control parameter value (for example, the attitude control vector value) is matched with the speech frame sequence contained in the speech segment. For example, when the display device showing the interactive object is outputting the voice segment, or is displaying the text corresponding to the voice segment, the gesture made by the interactive object is synchronized with the output voice and/or the displayed text, giving the target object the feeling that the interactive object is speaking.
  • the posture of the interactive object is controlled so that the interactive object makes a gesture matching the speech segment; the target object thus feels that it is communicating with the interactive object, and the interactive experience of the target object is improved.
  • the method is applied to a server, including a local server or a cloud server.
  • the server processes the speech segment, generates the control parameter value of the interactive object, and renders with a three-dimensional rendering engine according to the control parameter value to obtain the animation of the interactive object.
  • the server may send the animation to the terminal for display to communicate with or respond to the target object, and may also send the animation to the cloud, so that the terminal can obtain the animation from the cloud to communicate with or respond to the target object.
  • the control parameter value may also be sent to the terminal, so that the terminal completes the process of rendering, generating animation, and performing display.
  • the method is applied to a terminal: the terminal processes the speech segment, generates the control parameter value of the interactive object, and uses the three-dimensional rendering engine to render according to the control parameter value to obtain the animation of the interactive object; the terminal can display the animation to communicate with or respond to the target object.
  • the display device displaying the interactive object may be controlled to output voice and/or display text according to the voice segment, and, while it does so, the gesture of the interactive object displayed by the display device can be controlled according to the control parameter value.
  • because the control parameter value is matched with the voice segment, when the voice and/or text output according to the voice segment is kept in step with the posture control based on the control parameter value, the gesture made by the interactive object is synchronized with the output voice and/or the displayed text, giving the target object the feeling that the interactive object is speaking.
  • the attitude control vector may be obtained in the following manner.
  • the acoustic feature sequence corresponding to the speech frame sequence is referred to as the first acoustic feature sequence.
  • the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, and Mel Frequency Cepstral Coefficients (MFCC).
  • the first acoustic feature sequence is obtained by processing the entire speech frame sequence.
  • the speech frame sequence can undergo windowing, fast Fourier transform, filtering, logarithmic processing, and discrete cosine transform to obtain the MFCC coefficients corresponding to each speech frame.
  • the first acoustic feature sequence is obtained by processing the entire speech frame sequence, and reflects the overall acoustic feature of the speech segment.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence.
  • the first acoustic feature sequence includes the MFCC coefficients of each speech frame.
  • the first acoustic feature sequence obtained according to the speech frame sequence is shown in FIG. 3.
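The chain of windowing, Fourier transform, Mel filtering, logarithm, and discrete cosine transform described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the function names, the Hamming window, the triangular filter shape, the 16 kHz sample rate, and the choice of 13 filters are all assumptions, and a naive DFT stands in for the fast Fourier transform to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive DFT for clarity; a real system would use an FFT.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # num_filters + 2 edge points, converted to fractional bin positions.
    edges = [mel_to_hz(top * i / (num_filters + 1)) / nyquist * (num_bins - 1)
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for b in range(num_bins):
            if left < b <= center:
                weights.append((b - left) / (center - left))
            elif center < b < right:
                weights.append((right - b) / (right - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate, num_filters):
    """Return num_filters cepstral coefficients for one speech frame."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spectrum), sample_rate)
    # Log of the energy in each mel subband (a small floor avoids log(0)).
    log_energy = [math.log(max(sum(w * p for w, p in zip(weights, spectrum)), 1e-10))
                  for weights in bank]
    # DCT-II turns the log subband energies into cepstral coefficients.
    return [sum(e * math.cos(math.pi * k * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energy))
            for k in range(num_filters)]


# Two overlapping 400-sample frames of a 440 Hz tone as stand-in speech frames.
sample_rate = 16000
frames = [[math.sin(2.0 * math.pi * 440.0 * (start + i) / sample_rate)
           for i in range(400)] for start in (0, 160)]
first_acoustic_feature_sequence = [mfcc(f, sample_rate, num_filters=13) for f in frames]
```

Each entry of `first_acoustic_feature_sequence` is the feature vector of one speech frame, so the list has one vector per frame, as the text describes for the first acoustic feature sequence.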
  • the acoustic feature corresponding to at least one speech frame is acquired.
  • the feature vectors corresponding to the at least one voice frame (one feature vector per voice frame) may be used as the acoustic feature of those voice frames.
  • the same number of feature vectors may form a feature matrix, and the feature matrix is the acoustic feature corresponding to the at least one speech frame.
  • the N feature vectors in the first acoustic feature sequence form the acoustic features of the corresponding N speech frames; where N is a positive integer.
  • the first acoustic feature sequence may include multiple acoustic features, and the speech frames corresponding to each of the acoustic features may partially overlap.
  • attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature is acquired.
  • the attitude control vector of the at least one local area can be obtained.
  • the local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the posture control vector of all the local areas can be obtained; when the expression of the interactive object needs to be controlled, the posture control vector of the local area corresponding to the face can be obtained.
  • while the voice segment is played, the interactive object is driven to make actions according to the attitude control vector corresponding to each acoustic feature obtained from the first acoustic feature sequence, so that while the terminal device outputs sound, the interactive object performs actions that match the output sound, including facial actions, expressions, and body actions, and the target object feels that the interactive object is speaking.
  • because the attitude control vector is related to the acoustic features of the output sound, driving according to the attitude control vector gives the expressions and body movements of the interactive object an emotional component, making the speaking process of the interactive object more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
  • the acoustic feature corresponding to the at least one speech frame may be acquired by performing a sliding window on the first acoustic feature sequence.
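  • by sliding a time window of a set length over the first acoustic feature sequence with a set step size, the acoustic feature vectors in the time window at each position are used as the acoustic feature of the corresponding same number of speech frames, thereby obtaining the acoustic features corresponding to these speech frames.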
  • the second acoustic feature sequence can be obtained according to the obtained multiple acoustic features.
  • the speech frame sequence includes 100 speech frames per second, the length of the time window is 1 s, and the step length is 0.04 s. Since each feature vector in the first acoustic feature sequence corresponds to a speech frame, correspondingly, the first acoustic feature sequence also includes 100 feature vectors per second. During the window sliding process on the first acoustic feature sequence, 100 feature vectors in the time window are obtained each time as the acoustic features of the corresponding 100 speech frames.
  • with a step of 0.04 s, that is, 4 speech frames, acoustic feature 1 corresponding to the 1st to 100th speech frames and acoustic feature 2 corresponding to the 5th to 104th speech frames are obtained in turn.
  • after traversing the first acoustic feature sequence, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, which together form the second acoustic feature sequence, where M is a positive integer whose value is determined by the number of frames in the speech frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
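The sliding window over the first acoustic feature sequence can be sketched with the numbers from the example above: 100 feature vectors per second, a 1 s window (100 vectors), and a 0.04 s step (4 vectors). The function name `sliding_window_features`, the 13-dimensional stand-in vectors, and the 3 s input length are assumptions made for illustration only.

```python
def sliding_window_features(first_sequence, window_size, step):
    """Group consecutive feature vectors into overlapping windows."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    second_sequence = []
    start = 0
    while start + window_size <= len(first_sequence):
        # Each window is a feature matrix made of window_size feature vectors.
        second_sequence.append(first_sequence[start:start + window_size])
        start += step
    return second_sequence


frames_per_second = 100
window_size = 1 * frames_per_second   # 1 s window -> 100 feature vectors
step = 4                              # 0.04 s step at 100 frames/s -> 4 vectors
# Stand-in for 3 s of 13-dimensional feature vectors, tagged with the frame index.
first_sequence = [[float(i)] * 13 for i in range(300)]
second_sequence = sliding_window_features(first_sequence, window_size, step)
m = len(second_sequence)              # number of acoustic features M
```

Under these assumptions M equals (T - window_size) // step + 1 for a sequence of T feature vectors, which shows how M depends on the frame count, the window length, and the step size.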
  • from acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding attitude control vector 1, attitude control vector 2, ..., attitude control vector M can be obtained respectively, giving the sequence of attitude control vectors.
  • the sequence of the attitude control vector and the second acoustic feature sequence are aligned in time.
  • acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained from N feature vectors in the first acoustic feature sequence. Therefore, while the voice frames are played, the interactive object can be driven to make actions according to the sequence of posture control vectors.
  • the attitude control vector before the set time can be set to a default value; that is, when the speech frame sequence has just started playing, the interactive object makes a default action, and after the set time the interactive object is driven with the sequence of posture control vectors obtained from the first acoustic feature sequence.
  • acoustic feature 1 starts to be output at time t0, and subsequent acoustic features are output at intervals of 0.04 s, corresponding to the step size: acoustic feature 2 is output at time t1, acoustic feature 3 at time t2, and so on until acoustic feature M is output at time t(M-1). Correspondingly, the period from ti to t(i+1) corresponds to feature vector (i+1), where i is an integer smaller than (M-1); before the set time, the attitude control vector is the default attitude control vector.
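The timing rule above (a default vector before the set time, then one control vector per 0.04 s step) can be sketched as a lookup from playback time to control vector. The function name `control_vector_at`, the use of integer milliseconds, the choice of 1000 ms for t0, and holding the last vector after the sequence ends are assumptions made for this illustration.

```python
def control_vector_at(t_ms, t0_ms, step_ms, control_vectors, default_vector):
    """Pick the attitude control vector that applies at playback time t_ms."""
    if t_ms < t0_ms or not control_vectors:
        # Before the set time no control vector is available yet.
        return default_vector
    # The interval [t_i, t_(i+1)) maps to control vector i+1 (list index i).
    index = (t_ms - t0_ms) // step_ms
    # Hold the last vector once the sequence is exhausted (an assumption).
    return control_vectors[min(index, len(control_vectors) - 1)]


vectors = ["vector 1", "vector 2", "vector 3"]
at_start = control_vector_at(500, 1000, 40, vectors, "default")
at_t0 = control_vector_at(1000, 1000, 40, vectors, "default")
at_t1 = control_vector_at(1040, 1000, 40, vectors, "default")
```

Integer milliseconds are used instead of floating-point seconds so that a time exactly on an interval boundary always falls into the later interval.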
  • the interactive object is driven to make actions according to the sequence of gesture control vectors while the voice segment is played, so that the actions of the interactive object are synchronized with the output sound; this gives the target object the feeling that the interactive object is speaking and improves the interactive experience between the target object and the interactive object.
  • the length of the time window is related to the amount of information contained in the acoustic feature. The greater the length of the time window, the more information it contains, and the stronger the correlation between the sound and the actions the interactive object is driven to make.
  • the sliding step length of the time window is related to the time interval (frequency) of obtaining the attitude control vector, that is, it is related to the frequency of driving the interactive object to make an action.
  • the length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.
  • the acoustic feature includes L-dimensional Mel frequency cepstral coefficients (MFCC), where L is a positive integer.
  • MFCC represents the distribution of the energy of the speech signal in different frequency ranges.
  • the L-dimensional MFCC can be obtained by converting multiple speech frames in the speech frame sequence to the frequency domain and applying a Mel filter bank with L subbands.
  • by inputting the acoustic feature into a pre-trained recurrent neural network, the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature can be obtained. Since the recurrent neural network is a time-recursive neural network, it can learn the historical information of the input acoustic features and output the attitude control vector of the at least one local area according to the acoustic feature sequence.
  • the acoustic feature sequence includes a first acoustic feature sequence and a second acoustic feature sequence.
  • a pre-trained recurrent neural network is used to obtain the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature, and the historical feature information of the acoustic features is fused with the current feature information, so that historical attitude control vectors influence the change of the current attitude control vector, making the expression changes and body movements of the interactive character smoother and more natural.
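How a recurrent network lets earlier acoustic features influence the current attitude control vector can be illustrated with a tiny Elman-style cell. This is only a sketch: the disclosure does not specify the network architecture, the class name `TinyRecurrentNet` and all dimensions are hypothetical, each acoustic feature is assumed to be already flattened into a vector, and the weights are random rather than pre-trained.

```python
import math
import random


class TinyRecurrentNet:
    """Elman-style recurrent cell: the hidden state carries history forward."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_in = matrix(hidden_dim, input_dim)     # input -> hidden
        self.w_rec = matrix(hidden_dim, hidden_dim)   # previous hidden -> hidden
        self.w_out = matrix(output_dim, hidden_dim)   # hidden -> control vector
        self.hidden_dim = hidden_dim

    def forward(self, feature_sequence):
        """Map a sequence of acoustic features to a sequence of control vectors."""
        hidden = [0.0] * self.hidden_dim
        outputs = []
        for feature in feature_sequence:
            # The new hidden state mixes the current input with the previous state.
            hidden = [
                math.tanh(
                    sum(w * x for w, x in zip(self.w_in[j], feature))
                    + sum(w * h for w, h in zip(self.w_rec[j], hidden))
                )
                for j in range(self.hidden_dim)
            ]
            outputs.append([sum(w * h for w, h in zip(row, hidden)) for row in self.w_out])
        return outputs


net = TinyRecurrentNet(input_dim=6, hidden_dim=4, output_dim=3)
sequence = [[0.1] * 6, [0.2] * 6, [0.1] * 6]
control_vectors = net.forward(sequence)
```

The first and third inputs are identical, yet their outputs differ because the hidden state at the third step still reflects the earlier inputs; this is the mechanism by which history smooths the predicted motion.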
  • the recurrent neural network can be trained in the following manner.
  • an acoustic feature sample is obtained, the acoustic feature sample is annotated with a true value, and the true value is a posture control vector value of at least one local area of the interactive object.
  • the initial recurrent neural network is trained according to the acoustic feature samples, and the trained recurrent neural network is obtained once the change in the network loss satisfies the convergence condition, wherein the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the recurrent neural network and the true value.
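The network loss and the convergence condition can be sketched as below. The disclosure only says that the loss includes the difference between predicted and annotated values; using the mean squared difference, the function names, and the tolerance of 1e-4 are assumptions made for this example.

```python
def network_loss(predicted, annotated):
    """Mean squared difference between predicted and annotated control vectors."""
    total = 0.0
    count = 0
    for pred_vec, true_vec in zip(predicted, annotated):
        for p, t in zip(pred_vec, true_vec):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no values to compare")
    return total / count


def has_converged(loss_history, tolerance=1e-4):
    """True once the loss changed by less than tolerance between the last two epochs."""
    return len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < tolerance


example_loss = network_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

A training loop would append the loss of each epoch to `loss_history` and stop once `has_converged` returns true.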
  • the acoustic feature samples can be obtained by the following method.
  • the video segment is sampled according to the first sampling period to obtain multiple first image frames containing the character;
  • the voice segment is sampled according to the second sampling period to obtain multiple voice frames.
  • the second sampling period is less than the first sampling period, that is, the frequency of sampling the voice segment is higher than the frequency of sampling the video segment, so that one first image frame can correspond to the acoustic feature of at least one voice frame.
  • the acoustic feature corresponding to the at least one speech frame corresponding to the first image frame is acquired.
  • the number of speech frames corresponding to a first image frame in the training process is the same as the number of speech frames corresponding to the acoustic features obtained in the aforementioned driving process, and the method of obtaining the acoustic features in the training process is the same as in the aforementioned driving process.
  • the first image frame is converted into a second image frame containing the interactive object, and the attitude control vector value of at least one local area corresponding to the second image frame is obtained.
  • the attitude control vector value may include the attitude control vector value of all the local areas, and may also include the attitude control vector value of some of the local areas.
  • the image frame of the real person can be converted into a second image frame containing the image represented by the interactive object, and the posture control vector of each local area of the real person corresponds to the posture control vector of each local area of the interactive object, so that the posture control vector of each local area of the interactive object in the second image frame can be obtained.
  • an acoustic feature corresponding to the first image frame is annotated to obtain an acoustic feature sample.
  • the video segment of a character is split into corresponding multiple first image frames and multiple voice frames, and the first image frame containing the real person is converted into a second image frame containing the interactive object to obtain the attitude control vector corresponding to the acoustic feature of at least one speech frame, so that the acoustic feature has a better correspondence with the attitude control vector and high-quality acoustic feature samples are obtained, bringing the action of the interactive object closer to the real action of the corresponding character.
  • FIG. 4 shows a schematic structural diagram of a device for driving an interactive object according to at least one embodiment of the present disclosure.
  • the device may include: a first obtaining unit 401, configured to obtain a sequence of speech frames contained in a speech segment;
  • the second acquiring unit 402 is configured to acquire a control parameter of at least one partial region of the interaction object corresponding to the speech frame sequence;
  • the driving unit 403 is configured to control the posture of the interaction object according to the acquired control parameter.
  • the device further includes an output unit for controlling the display device displaying the interactive object to output voice and/or display text according to the voice segment.
  • control parameter of the local area of the interactive object includes a posture control vector of the local area
  • the second acquiring unit is specifically configured to: acquire the first acoustic feature sequence corresponding to the speech frame sequence; acquire the acoustic feature corresponding to at least one speech frame according to the first acoustic feature sequence; and acquire the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence, and after acquiring at least one voice frame corresponding to the first acoustic feature sequence
  • the second acquiring unit is specifically configured to: acquire a sequence of posture control vectors corresponding to the second acoustic feature sequence; and control the posture of the interactive object according to the sequence of posture control vectors.
  • the driving unit is specifically configured to: obtain a sequence of a posture control vector corresponding to the second acoustic feature sequence; and control the posture of the interactive object according to the sequence of the posture control vector.
  • the second acquiring unit, when acquiring the attitude control vector of at least one local area of the interactive object corresponding to the acoustic feature, is specifically configured to: input the acoustic feature into a pre-trained recurrent neural network to obtain the posture control vector of at least one local area of the interactive object corresponding to the acoustic feature.
  • the recurrent neural network is obtained through acoustic feature sample training;
  • the device further includes a sample acquisition unit, configured to: extract the voice segment of the character from the acquired video segment; sample the video segment to obtain a plurality of first image frames containing the character; sample the speech segment to obtain a plurality of speech frames; and obtain the acoustic feature of the speech frame corresponding to the first image frame;
  • the first image frame is converted into a second image frame containing the interactive object, and the attitude control vector value of at least one local area corresponding to the second image frame is obtained;
  • the acoustic feature corresponding to the first image frame is annotated to obtain an acoustic feature sample.
  • the device further includes a training unit for training the initial recurrent neural network according to the acoustic feature samples, and training to obtain the recurrent neural network after the change in network loss satisfies the convergence condition, wherein
  • the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the recurrent neural network and the labeled attitude control vector value.
  • At least one embodiment of this specification also provides an electronic device.
  • the device includes a memory and a processor.
  • the memory is used to store computer instructions that can run on the processor.
  • the method for driving interactive objects described in any embodiment of the present disclosure is realized by computer instructions.
  • At least one embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object according to any embodiment of the present disclosure is realized.
  • one or more embodiments of this specification can be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • the embodiments of the subject matter and functional operations described in this specification can be implemented in the following: digital electronic circuits, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them.
  • the embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing device or to control the operation of the data processing device.
  • the program instructions may be encoded on artificially generated propagated signals, such as machine-generated electrical, optical or electromagnetic signals, which are generated to encode information and transmit it to a suitable receiver device for execution by the data processing device.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processing and logic flow described in this specification can be executed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating according to input data and generating output.
  • the processing and logic flow can also be executed by a dedicated logic circuit, such as FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), and the device can also be implemented as a dedicated logic circuit.
  • Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • the computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such mass storage devices to receive data from them, send data to them, or both.
  • the computer does not have to have such equipment.
  • the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by or incorporated into a dedicated logic circuit.
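
The MFCC extraction mentioned in the list above (conversion of speech frames to the frequency domain followed by a Mel filter with L subbands) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the function names, the Hamming window, the 16 kHz sampling rate, the frame size, and the choice L = 13 are not taken from the disclosure, and the placeholder frames are random data.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_bands, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale (one row per subband)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_bands + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)
    bank = np.zeros((num_bands, n_fft // 2 + 1))
    for band in range(num_bands):
        left, center, right = bin_edges[band:band + 3]
        for k in range(left, center):          # rising edge of the triangle
            bank[band, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge of the triangle
            bank[band, k] = (right - k) / (right - center)
    return bank

def mfcc(frames, sample_rate, num_bands):
    """Return one num_bands-dimensional coefficient vector per speech frame."""
    n_fft = frames.shape[1]
    windowed = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2 / n_fft   # frequency domain
    energies = power @ mel_filterbank(num_bands, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)     # small offset avoids log(0)
    n = np.arange(num_bands)
    dct_basis = np.cos(np.pi * np.outer(n, n + 0.5) / num_bands)  # unnormalized DCT-II
    return log_energies @ dct_basis.T

# Placeholder speech frames: 98 frames of 400 samples each (assumed sizes).
frames = np.random.default_rng(0).standard_normal((98, 400))
coefficients = mfcc(frames, sample_rate=16000, num_bands=13)
```

Each row of `coefficients` plays the role of the acoustic feature vector of one speech frame.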

Abstract

An interactive object driving method and apparatus, a device, and a storage medium. The method comprises: obtaining a speech frame sequence contained in a speech segment; obtaining a control parameter of at least one partial area of an interactive object corresponding to the speech frame sequence; and controlling the posture of the at least one partial area of the interactive object according to the control parameter obtained.

Description

Driving method, apparatus, device, and storage medium for an interactive object
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 2020102472765, filed on March 31, 2020, the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for driving an interactive object.
Background
Human-computer interaction mostly takes input through keystrokes, touch, and voice, and responds by presenting images, text, or virtual characters on a display screen. At present, virtual characters are mostly developed as improvements on voice assistants.
Summary of the invention
Embodiments of the present disclosure provide a solution for driving an interactive object.
According to an aspect of the present disclosure, a method for driving an interactive object is provided, the method including: acquiring a speech frame sequence contained in a speech segment; acquiring a control parameter of at least one local area of the interactive object corresponding to the speech frame sequence; and controlling the posture of the interactive object according to the acquired control parameter.
In combination with any embodiment provided in the present disclosure, the method further includes: controlling, according to the speech segment, a display device that displays the interactive object to output speech and/or display text.
In combination with any embodiment provided in the present disclosure, the control parameter of a local area of the interactive object includes a posture control vector of the local area; acquiring the control parameter of at least one local area of the interactive object corresponding to the speech frame sequence includes: acquiring a first acoustic feature sequence corresponding to the speech frame sequence; acquiring, according to the first acoustic feature sequence, an acoustic feature corresponding to at least one speech frame; and acquiring a posture control vector of at least one local area of the interactive object corresponding to the acoustic feature.
In combination with any embodiment provided in the present disclosure, the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence; acquiring, according to the first acoustic feature sequence, the acoustic feature corresponding to at least one speech frame includes: sliding a time window of a set length over the first acoustic feature sequence with a set step size, taking the acoustic feature vectors within the time window as the acoustic feature of the corresponding at least one speech frame, and obtaining a second acoustic feature sequence from the plurality of acoustic features obtained once the sliding is complete.
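The sliding-window step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `sliding_window_features`, the window length of 4 frames, the step of 2 frames, and the choice to concatenate the vectors inside each window into a single vector are all assumptions.

```python
import numpy as np

def sliding_window_features(first_sequence, window_length, step):
    """Build the second acoustic feature sequence from the first one.

    first_sequence: array of shape (num_frames, feature_dim), one acoustic
    feature vector per speech frame.
    """
    first_sequence = np.asarray(first_sequence)
    num_frames = first_sequence.shape[0]
    if num_frames < window_length:
        raise ValueError("sequence is shorter than the time window")
    windows = []
    for start in range(0, num_frames - window_length + 1, step):
        window = first_sequence[start:start + window_length]
        # Assumption: the vectors inside the window are concatenated.
        windows.append(window.reshape(-1))
    return np.stack(windows)

# Placeholder first acoustic feature sequence: 10 frames, 3 dimensions each.
first = np.arange(10 * 3, dtype=np.float64).reshape(10, 3)
second = sliding_window_features(first, window_length=4, step=2)
```

With these assumed sizes, the windows start at frames 0, 2, 4, and 6, so `second` contains four acoustic features of 12 dimensions each.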
In combination with any embodiment provided in the present disclosure, controlling the posture of the interactive object according to the control parameter includes: acquiring a sequence of posture control vectors corresponding to the second acoustic feature sequence; and controlling the posture of the interactive object according to the sequence of posture control vectors.
In combination with any embodiment provided in the present disclosure, acquiring the posture control vector of at least one local area of the interactive object corresponding to the acoustic feature includes: inputting the acoustic feature into a pre-trained recurrent neural network to obtain the posture control vector of at least one local area of the interactive object corresponding to the acoustic feature.
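As a rough sketch of how a recurrent neural network can map a sequence of acoustic features to a sequence of posture control vectors, the following uses a simple Elman-style recurrent cell. The cell type, the dimensions, and the randomly initialized weights are assumptions made purely for illustration; the disclosure does not specify the network architecture, and a real system would load trained weights.

```python
import numpy as np

def rnn_forward(features, w_in, w_rec, w_out, b_h, b_out):
    """Run a simple recurrent cell over the acoustic feature sequence.

    The hidden state carries information from earlier features, so each
    output vector depends on the history as well as the current input.
    """
    hidden = np.zeros(w_rec.shape[0])
    outputs = []
    for x in features:
        hidden = np.tanh(w_in @ x + w_rec @ hidden + b_h)
        outputs.append(w_out @ hidden + b_out)
    return np.stack(outputs)

rng = np.random.default_rng(0)
feature_dim, hidden_dim, pose_dim = 12, 16, 5    # assumed sizes
# Untrained placeholder weights standing in for a pre-trained network.
w_in = rng.standard_normal((hidden_dim, feature_dim)) * 0.1
w_rec = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
w_out = rng.standard_normal((pose_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)
b_out = np.zeros(pose_dim)

second_sequence = rng.standard_normal((4, feature_dim))   # placeholder features
pose_vectors = rnn_forward(second_sequence, w_in, w_rec, w_out, b_h, b_out)
```

Each row of `pose_vectors` is one posture control vector; because the hidden state is carried forward, changing an earlier feature also changes later outputs.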
In combination with any embodiment provided in the present disclosure, the recurrent neural network is obtained by training on acoustic feature samples; the method further includes acquiring the acoustic feature samples, which specifically includes: acquiring a video segment in which a character utters speech, and extracting the corresponding speech segment from the video segment; sampling the video segment to obtain a plurality of first image frames containing the character, and sampling the speech segment to obtain a plurality of speech frames; acquiring the acoustic feature of the speech frame corresponding to the first image frame; converting the first image frame into a second image frame containing the interactive object, and acquiring the posture control vector value of at least one local area corresponding to the second image frame; and annotating the acoustic feature corresponding to the first image frame with the posture control vector value to obtain the acoustic feature sample.
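One way to picture the correspondence between first image frames and speech frames is the following sketch, which assumes that the speech segment is sampled with a shorter period than the video segment. The function name `align_frames` and the 40 ms and 10 ms periods are illustrative assumptions.

```python
def align_frames(num_image_frames, image_period_ms, speech_period_ms):
    """For each image frame, list the indices of the speech frames that
    start within that image frame's sampling interval."""
    if speech_period_ms > image_period_ms:
        raise ValueError("speech must be sampled at least as often as video")
    alignment = []
    for i in range(num_image_frames):
        start_ms = i * image_period_ms
        end_ms = start_ms + image_period_ms
        first = -(-start_ms // speech_period_ms)   # ceiling division
        last = -(-end_ms // speech_period_ms)      # exclusive upper bound
        alignment.append(list(range(first, last)))
    return alignment

# Assumed periods: one image frame every 40 ms, one speech frame every 10 ms.
alignment = align_frames(num_image_frames=3, image_period_ms=40, speech_period_ms=10)
```

Under these assumptions each image frame corresponds to four speech frames, whose acoustic features would then be annotated with the posture control vector value obtained for that image frame.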
In combination with any embodiment provided in the present disclosure, the method further includes: training an initial recurrent neural network according to the acoustic feature samples, and obtaining the recurrent neural network after the change in the network loss satisfies a convergence condition, wherein the network loss includes the difference between the posture control vector value of the at least one local area predicted by the recurrent neural network and the annotated posture control vector value.
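The training criterion can be illustrated with the sketch below. To keep the example short, a linear model trained by gradient descent stands in for the recurrent neural network; the mean squared difference as the network loss, the learning rate, and the tolerance on the change in loss are all assumptions, since the disclosure does not fix these choices.

```python
import numpy as np

def network_loss(predicted, annotated):
    """Mean squared difference between predicted and annotated values."""
    return float(np.mean((predicted - annotated) ** 2))

def train_until_converged(features, labels, lr=0.1, tol=1e-8, max_steps=10000):
    """Gradient descent on a stand-in linear model; stops once the change
    in the network loss between two steps falls below `tol`."""
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((features.shape[1], labels.shape[1])) * 0.01
    previous = network_loss(features @ weights, labels)
    for step in range(1, max_steps + 1):
        error = features @ weights - labels
        grad = 2.0 * features.T @ error / error.size   # gradient of the loss
        weights -= lr * grad
        current = network_loss(features @ weights, labels)
        if abs(previous - current) < tol:              # convergence condition
            return weights, current, step
        previous = current
    return weights, previous, max_steps

# Synthetic acoustic feature samples annotated with posture control vector values.
data_rng = np.random.default_rng(1)
features = data_rng.standard_normal((64, 6))
true_weights = data_rng.standard_normal((6, 3))
labels = features @ true_weights
trained, final_loss, steps = train_until_converged(features, labels)
```

The same stopping rule applies when the stand-in model is replaced by a recurrent network trained with a deep learning framework.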
According to an aspect of the present disclosure, an apparatus for driving an interactive object is provided, the apparatus including: a first acquiring unit, configured to acquire a speech frame sequence contained in a speech segment; a second acquiring unit, configured to acquire a control parameter of at least one local area of the interactive object corresponding to the speech frame sequence; and a driving unit, configured to control the posture of the interactive object according to the acquired control parameter.
According to an aspect of the present disclosure, an electronic device is provided, the device including a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement, when executing the computer instructions, the method for driving an interactive object according to any embodiment provided in the present disclosure.
According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the method for driving an interactive object according to any embodiment provided in the present disclosure.
The method, apparatus, device, and computer-readable storage medium for driving an interactive object according to one or more embodiments of the present disclosure acquire the speech frame sequence contained in a speech segment and determine, according to the speech frame sequence, control parameter values of at least one local area of the interactive object to control the posture of the interactive object, so that the interactive object makes a posture that matches the speech segment. This gives the target object the feeling that the interactive object is communicating with it, and improves the interaction experience between the target object and the interactive object.
Description of the drawings
In order to explain more clearly the technical solutions in one or more embodiments of this specification or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in one or more embodiments of this specification, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a display device in the method for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 2 is a flowchart of the method for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the process of feature encoding a speech frame sequence proposed by at least one embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of the apparatus for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of the electronic device proposed by at least one embodiment of the present disclosure.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. On the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C can mean including any one or more elements selected from the set consisting of A, B, and C.
At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertising machine, an all-in-one machine, or a vehicle-mounted terminal; the server includes a local server, a cloud server, or the like. The method can also be implemented by a processor calling computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the interactive object may be any virtual image capable of interacting with a target object. In an embodiment, the interactive object may be a virtual character, or a virtual animal, a virtual item, a cartoon image, or any other virtual image capable of implementing interactive functions. The interactive object may be presented in 2D or 3D form, which is not limited in the present disclosure. The target object may be a user, a robot, or another smart device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object can express a demand by making gestures or body movements, actively triggering the interactive object to interact with it. In another example, the interactive object may actively greet the target object, prompt the target object to make an action, and so on, so that the target object interacts with the interactive object passively.
The interactive object may be displayed by a terminal device, which may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, or the like; the present disclosure does not limit the specific form of the terminal device.
FIG. 1 shows a display device proposed by at least one embodiment of the present disclosure. As shown in FIG. 1, the display device has a transparent display screen on which a stereoscopic picture can be displayed to present a virtual scene and an interactive object with a stereoscopic effect. For example, the interactive object displayed on the transparent display screen in FIG. 1 includes a virtual cartoon character. In some embodiments, the terminal device described in the present disclosure may also be the above display device with a transparent display screen. The display device is configured with a memory and a processor; the memory is used to store computer instructions executable on the processor, and the processor is used to implement, when executing the computer instructions, the method for driving an interactive object provided in the present disclosure, so as to drive the interactive object displayed on the transparent display screen to communicate with or respond to the target object.
In some embodiments, in response to sound driving data used to drive the interactive object to output speech, the interactive object may utter a specified speech to the target object. The terminal device can generate sound driving data according to the actions, expressions, identity, preferences, and so on of a target object near the terminal device, to drive the interactive object to communicate or respond by uttering the specified speech, thereby providing an anthropomorphic service for the target object. It should be noted that the sound driving data can also be generated in other ways, for example, generated by a server and sent to the terminal device.
During the interaction between the interactive object and the target object, when the interactive object is driven to utter the specified speech according to the sound driving data, it may not be possible to drive the interactive object to make facial movements synchronized with the specified speech, so that the interactive object appears stiff and unnatural when speaking, which degrades the interaction experience between the target object and the interactive object. Based on this, at least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the experience of the target object when interacting with the interactive object.
FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 203.
In step 201, a speech frame sequence contained in a speech segment is acquired.
The speech segment may be the speech segment corresponding to the sound driving data of the interactive object, and the sound driving data may include audio data (speech data), text, and so on. The sound driving data may be driving data generated by a server or a terminal device according to the actions, expressions, identity, preferences, and so on of the target object interacting with the interactive object, or may be sound driving data retrieved by the terminal device from internal memory. The present disclosure does not limit how the sound driving data is acquired.
In the embodiments of the present disclosure, the speech frame sequence contained in the speech segment may be obtained by framing the speech segment, that is, dividing the speech segment into a plurality of speech frames and arranging the speech frames in time order to form the speech frame sequence. The number of sampling points (duration) in each speech frame and the frame shift (which determines the degree of overlap between frames) can be determined according to the driving requirements for the interactive object, which is not limited in the present disclosure.
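As a minimal sketch of dividing a speech segment into speech frames, the following assumes a mono waveform held in a NumPy array. The function name `split_into_frames`, the 16 kHz sampling rate, and the frame length of 400 samples with a frame shift of 160 samples are illustrative assumptions rather than values from the disclosure.

```python
import numpy as np

def split_into_frames(signal, frame_length, frame_shift):
    """Split a 1-D speech signal into frames arranged in time order."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_length:
        # Pad short segments so that at least one full frame is produced.
        signal = np.pad(signal, (0, frame_length - len(signal)))
    num_frames = 1 + (len(signal) - frame_length) // frame_shift
    starts = np.arange(num_frames) * frame_shift
    # Row i holds the samples of the i-th speech frame.
    return np.stack([signal[s:s + frame_length] for s in starts])

sample_rate = 16000                                   # assumed sampling rate in Hz
segment = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s placeholder
frames = split_into_frames(segment, frame_length=400, frame_shift=160)
```

Because the frame shift is smaller than the frame length, consecutive frames overlap; a larger shift reduces the overlap and the number of frames.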
图3示出了本公开至少一个实施例提出的交互对象的驱动方法的示意图。对于语音段信号进行分段/分频处理,所得到的语音帧序列如图3所示。FIG. 3 shows a schematic diagram of a driving method of interactive objects proposed by at least one embodiment of the present disclosure. Perform segmentation/frequency division processing on the voice segment signal, and the resulting voice frame sequence is shown in Figure 3.
In step 202, control parameter values of at least one local region of the interactive object corresponding to the speech frame sequence are acquired.
The local regions are obtained by dividing the whole of the interactive object (including the face and/or the body). Control of one or more local regions of the face may correspond to a series of facial expressions or actions of the interactive object. For example, control of the eye region may correspond to facial actions such as opening the eyes, closing the eyes, blinking, and changing the viewing angle; as another example, control of the mouth region may correspond to facial actions such as closing the mouth and opening the mouth to different degrees. Control of one or more local regions of the body may correspond to a series of body actions of the interactive object; for example, control of the leg region may correspond to actions such as walking, jumping, and kicking.
The control parameters of a local region of the interactive object include a posture control vector of that local region. The posture control vector of each local region is used to drive that local region of the interactive object to perform actions. Different posture control vector values correspond to different actions or action amplitudes. For example, for the posture control vector of the mouth region, one set of posture control vector values may make the mouth of the interactive object open slightly, while another set may make the mouth open wide. By driving the interactive object with different posture control vector values, the corresponding local region can be made to perform different actions or actions of different amplitudes.
The local regions may be selected according to the actions of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to act at the same time, the posture control vectors of all local regions may be acquired; when only the expression of the interactive object needs to be controlled, the posture control vectors of the local regions corresponding to the face may be acquired.
In the embodiments of the present disclosure, the control parameter values of at least one local region of the interactive object may be determined according to acoustic features of the speech frame sequence, or according to other features of the speech frame sequence.
In the embodiments of the present disclosure, a correspondence between a certain feature of the speech frame sequence and the control parameter values of the interactive object may be established in advance, so that once the speech frame sequence is obtained, the corresponding control parameter values can be obtained. The specific method for acquiring the control parameter values of the interactive object that match the speech frame sequence is described in detail later.
In step 203, the posture of the interactive object is controlled according to the acquired control parameter values.
The control parameter values, for example the posture control vector values, match the speech frame sequence contained in the speech segment. For example, when the display device showing the interactive object is outputting the speech segment, or is displaying text corresponding to the speech segment, the posture made by the interactive object is synchronized with the output speech and/or the displayed text, giving the target object the impression that the interactive object is speaking.
In the embodiments of the present disclosure, the speech frame sequence contained in the speech segment is acquired, and the control parameter values of at least one local region of the interactive object are determined according to the speech frame sequence in order to control the posture of the interactive object. The interactive object thus makes postures that match the speech segment, so that the target object feels that the interactive object is communicating with it, which improves the target object's interactive experience.
In some embodiments, the method is applied to a server, including a local server or a cloud server. The server processes the speech segment, generates the control parameter values of the interactive object, and renders with a three-dimensional rendering engine according to the control parameter values to obtain an animation of the interactive object. The server may send the animation to a terminal for display in order to communicate with or respond to the target object, or may send the animation to the cloud so that the terminal can obtain the animation from the cloud in order to communicate with or respond to the target object. After generating the control parameter values of the interactive object, the server may alternatively send the control parameter values to the terminal, so that the terminal performs the rendering, animation generation, and display.
In some embodiments, the method is applied to a terminal. The terminal processes the speech segment, generates the control parameter values of the interactive object, and renders with a three-dimensional rendering engine according to the control parameter values to obtain an animation of the interactive object. The terminal may display the animation in order to communicate with or respond to the target object.
In some embodiments, the display device showing the interactive object may be controlled to output speech and/or display text according to the speech segment. While the speech is being output and/or the text is being displayed, the posture of the interactive object shown by the display device may be controlled according to the control parameter values.
In the embodiments of the present disclosure, the control parameter values match the speech frame sequence of the speech segment. Therefore, when the output of speech and/or text according to the speech segment and the control of the posture of the interactive object according to the control parameter values are performed synchronously, the posture made by the interactive object is synchronized with the output speech and/or the displayed text, giving the target object the impression that the interactive object is speaking.
In some embodiments, where the control parameters of at least one local region of the interactive object include a posture control vector, the posture control vector may be obtained in the following manner.
First, the acoustic feature sequence corresponding to the speech frame sequence is acquired. To distinguish it from the acoustic feature sequence mentioned later, the acoustic feature sequence corresponding to the speech frame sequence is referred to here as the first acoustic feature sequence.
In the embodiments of the present disclosure, the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel Frequency Cepstral Coefficients (MFCC), and so on.
The first acoustic feature sequence is obtained by processing the speech frame sequence as a whole. Taking MFCC features as an example, the MFCC coefficients corresponding to each speech frame may be obtained by applying windowing, fast Fourier transform, filtering, logarithmic processing, and discrete cosine transform to each speech frame in the speech frame sequence.
Because the first acoustic feature sequence is obtained by processing the speech frame sequence as a whole, it reflects the overall acoustic characteristics of the speech segment.
In the embodiments of the present disclosure, the first acoustic feature sequence contains an acoustic feature vector corresponding to each speech frame in the speech frame sequence. Taking MFCC as an example, the first acoustic feature sequence contains the MFCC coefficients of each speech frame. The first acoustic feature sequence obtained from the speech frame sequence is shown in FIG. 3.
Next, the acoustic feature corresponding to at least one speech frame is acquired according to the first acoustic feature sequence.
Where the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence, the feature vectors corresponding to the at least one speech frame, equal in number to those speech frames, may be taken as the acoustic feature of those speech frames. These feature vectors may form a feature matrix, and the feature matrix is the acoustic feature corresponding to the at least one speech frame.
Taking FIG. 3 as an example, N feature vectors in the first acoustic feature sequence form the acoustic feature of the corresponding N speech frames, where N is a positive integer. Multiple acoustic features may be obtained from the first acoustic feature sequence, and the speech frames corresponding to different acoustic features may partially overlap.
Finally, the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature is acquired.
According to the obtained acoustic feature corresponding to the at least one speech frame, the posture control vector of at least one local region can be acquired. The local regions may be selected according to the actions of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to act at the same time, the posture control vectors of all local regions may be acquired; when only the expression of the interactive object needs to be controlled, the posture control vectors of the local regions corresponding to the face may be acquired.
While the speech segment is being played, the interactive object is driven to act according to the posture control vectors corresponding to the acoustic features obtained from the first acoustic feature sequence. In this way, while the terminal device outputs sound, the interactive object makes actions that match the output sound, including facial actions, expressions, and body actions, so that the target object feels that the interactive object is speaking. Moreover, because the posture control vectors are related to the acoustic features of the output sound, driving according to the posture control vectors gives the expressions and body actions of the interactive object an emotional component, making the speaking process of the interactive object more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
In some embodiments, the acoustic feature corresponding to the at least one speech frame may be acquired by sliding a window over the first acoustic feature sequence.
A time window of a set length is slid over the first acoustic feature sequence with a set step size, and the acoustic feature vectors within the time window are taken as the acoustic feature of the corresponding, equal number of speech frames, thereby obtaining the acoustic feature jointly corresponding to those speech frames. After the sliding is complete, a second acoustic feature sequence is obtained from the resulting multiple acoustic features.
Taking the driving method shown in FIG. 3 as an example, the speech frame sequence includes 100 speech frames per second, the length of the time window is 1 s, and the step size is 0.04 s. Since each feature vector in the first acoustic feature sequence corresponds to a speech frame, the first acoustic feature sequence likewise includes 100 feature vectors per second. While the window is slid over the first acoustic feature sequence, the 100 feature vectors within the time window are obtained each time as the acoustic feature of the corresponding 100 speech frames. A step of 0.04 s corresponds to 4 speech frames, so moving the time window over the first acoustic feature sequence in steps of 0.04 s yields acoustic feature 1 corresponding to the 1st to 100th speech frames, acoustic feature 2 corresponding to the 5th to 104th speech frames, and so on. After the first acoustic feature sequence has been traversed, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, which form the second acoustic feature sequence. Here M is a positive integer whose value is determined by the number of frames in the speech frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
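The sliding-window construction of the second acoustic feature sequence can be sketched as below, using the figures from the example (100 feature vectors per second, a 1 s window, a 0.04 s step). The function name `sliding_windows`, the three-second input, and the one-element placeholder "feature vectors" that merely record their frame number are assumptions made for illustration; only full windows are kept.

```python
def sliding_windows(feature_vectors, window_size, step):
    """Group per-frame feature vectors into overlapping windows (feature matrices)."""
    windows = []
    start = 0
    # Only full windows are kept; the tail shorter than window_size is ignored.
    while start + window_size <= len(feature_vectors):
        windows.append(feature_vectors[start:start + window_size])
        start += step
    return windows


frames_per_second = 100
window_size = round(1.0 * frames_per_second)   # 1 s window  -> 100 feature vectors
step = round(0.04 * frames_per_second)         # 0.04 s step -> 4 feature vectors

# Placeholder first acoustic feature sequence: 3 s of speech, each "vector"
# just stores its 1-based frame number so the window contents are easy to read.
first_sequence = [[frame_number] for frame_number in range(1, 301)]
second_sequence = sliding_windows(first_sequence, window_size, step)

# With T feature vectors, M = (T - window_size) // step + 1 full windows.
M = (len(first_sequence) - window_size) // step + 1
```

Under these assumptions the first window covers frames 1 to 100 and the second covers frames 5 to 104, and M depends only on the frame count, the window length, and the step size.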
According to acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding posture control vector 1, posture control vector 2, ..., posture control vector M can be obtained respectively, thereby obtaining a sequence of posture control vectors.
As shown in FIG. 3, the sequence of posture control vectors is aligned in time with the second acoustic feature sequence, and acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained from N feature vectors in the first acoustic feature sequence. Therefore, while the speech frames are being played, the interactive object can be driven to act according to the sequence of posture control vectors.
Assuming that output of the acoustic features begins at a set moment in the first time window, the posture control vector before the set moment may be set to a default value. That is, when playback of the speech frame sequence has just started, the interactive object makes a default action, and after the set moment the interactive object is driven to act by the sequence of posture control vectors obtained from the first acoustic feature sequence.
Taking FIG. 3 as an example, acoustic feature 1 is output starting at time t0, and acoustic features are output at intervals of 0.04 s, the time corresponding to the step size: acoustic feature 2 is output starting at time t1, acoustic feature 3 starting at time t2, and so on until acoustic feature M is output at time t(M-1). Correspondingly, the time period ti to t(i+1) corresponds to posture control vector (i+1), where i is a non-negative integer smaller than (M-1), and before time t0 the posture control vector is the default posture control vector.
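The timing rule above amounts to a lookup from playback time to a posture control vector. The following sketch works in integer milliseconds to avoid floating-point rounding; the function name `control_vector_at`, the string placeholders for vectors, and the choice to hold the last vector after t(M-1) are assumptions for illustration, not details given in the disclosure.

```python
def control_vector_at(t_ms, t0_ms, step_ms, vectors, default):
    """Return the posture control vector that is active at playback time t_ms."""
    if t_ms < t0_ms:
        # Before t0 the interactive object holds its default posture.
        return default
    index = (t_ms - t0_ms) // step_ms
    # Assumption: after the last interval, keep using the final vector.
    return vectors[min(index, len(vectors) - 1)]


# Placeholder sequence of M = 3 posture control vectors, 40 ms apart, t0 = 1000 ms.
vectors = ["vector 1", "vector 2", "vector 3"]
at_start = control_vector_at(500, 1000, 40, vectors, "default")
at_t0 = control_vector_at(1000, 1000, 40, vectors, "default")
at_t1 = control_vector_at(1040, 1000, 40, vectors, "default")
```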
In the embodiments of the present disclosure, the interactive object is driven to act according to the sequence of posture control vectors while the speech segment is being played, so that the actions of the interactive object are synchronized with the output sound. This gives the target object the impression that the interactive object is speaking and improves the interactive experience between the target object and the interactive object.
The length of the time window is related to the amount of information contained in the acoustic feature. The longer the time window, the more information it contains, and the stronger the correlation between the sound and the actions the interactive object is driven to make. The step size of the sliding time window is related to the time interval (frequency) at which posture control vectors are obtained, that is, to the frequency at which the interactive object is driven to act. The length of the time window and the step size may be set according to the actual interaction scenario, so that the expressions and actions made by the interactive object are more strongly correlated with the sound and are more vivid and natural.
In some embodiments, the acoustic features include L-dimensional Mel Frequency Cepstral Coefficients (MFCC), where L is a positive integer. MFCC represents the distribution of the energy of the speech signal over different frequency ranges. L-dimensional MFCC can be obtained by converting the data of multiple speech frames in the speech frame sequence to the frequency domain and applying a Mel filter bank comprising L sub-bands. The posture control vectors are acquired according to the MFCC of the speech segment, and the interactive object is driven to perform facial and body actions according to the posture control vectors. This gives the expressions and body actions of the interactive object an emotional component and makes the speaking process of the interactive object more natural and vivid, thereby improving the interactive experience of the target object.
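To make the MFCC pipeline concrete (windowing, Fourier transform, Mel filtering, logarithm, discrete cosine transform), here is a deliberately small pure-Python sketch for one speech frame. Several details are assumptions not stated in the disclosure: the Hamming window, the triangular filter shape, the common 2595·log10(1 + f/700) Mel formula, the small constant added before the logarithm, and the naive DFT that stands in for a fast Fourier transform. The name `mfcc_frame` and the test tone are hypothetical.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mfcc_frame(frame, sample_rate, num_bands):
    """Return num_bands cepstral coefficients for one frame (needs len(frame) > 1)."""
    n = len(frame)
    # 1. Windowing (Hamming window, an assumed choice).
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # 2. Power spectrum via a naive DFT; real systems would use an FFT.
    power = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        power.append(abs(acc) ** 2 / n)
    # 3. Triangular Mel filter bank with num_bands sub-bands.
    top = hz_to_mel(sample_rate / 2)
    mel_points = [j * top / (num_bands + 1) for j in range(num_bands + 2)]
    bins = [int((n + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    energies = []
    for b in range(1, num_bands + 1):
        left, centre, right = bins[b - 1], bins[b], bins[b + 1]
        energy = 0.0
        for k in range(left, centre):       # rising edge of the triangle
            energy += power[k] * (k - left) / (centre - left)
        for k in range(centre, right):      # falling edge of the triangle
            energy += power[k] * (right - k) / (right - centre)
        energies.append(energy)
    # 4. Logarithm (small constant avoids log(0) for empty bands).
    log_energies = [math.log(e + 1e-10) for e in energies]
    # 5. Discrete cosine transform (DCT-II) of the log energies.
    return [sum(log_energies[m] * math.cos(math.pi * c * (m + 0.5) / num_bands)
                for m in range(num_bands))
            for c in range(num_bands)]


# Hypothetical frame: 64 samples of a 440 Hz tone at 8 kHz, with L = 8 sub-bands.
frame = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(64)]
coeffs = mfcc_frame(frame, sample_rate=8000, num_bands=8)
```

Applying this to every speech frame would give one L-dimensional vector per frame, which is the shape the first acoustic feature sequence is described as having.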
In some embodiments, the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature may be obtained by inputting the acoustic feature into a pre-trained recurrent neural network. Because a recurrent neural network is a time-recursive neural network, it can learn historical information about the input acoustic features and output the posture control vector of the at least one local region according to the acoustic feature sequence. Here, the acoustic feature sequence includes the first acoustic feature sequence and the second acoustic feature sequence.
In the embodiments of the present disclosure, a pre-trained recurrent neural network is used to obtain the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature, and the historical feature information and current feature information of the acoustic features are fused. As a result, historical posture control vectors influence the change in the current posture control vector, making the expression changes and body actions of the interactive character smoother and more natural.
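The way a recurrent network carries history from one acoustic feature to the next can be illustrated with a minimal Elman-style forward pass. This is only a sketch under stated assumptions: the disclosure does not specify the network architecture, so the class `TinyRNN`, its sizes, the tanh activation, and the random untrained weights are all hypothetical, and each acoustic feature (a feature matrix) is assumed to have been flattened into a single input vector.

```python
import math
import random


class TinyRNN:
    """Untrained Elman-style recurrent network, for illustration only."""

    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_in = matrix(hidden_size, input_size)     # input  -> hidden
        self.w_rec = matrix(hidden_size, hidden_size)   # hidden -> hidden (history)
        self.w_out = matrix(output_size, hidden_size)   # hidden -> posture vector

    def forward(self, features):
        """Map a sequence of flattened acoustic features to posture control vectors."""
        hidden = [0.0] * self.hidden_size
        outputs = []
        for x in features:
            # The new hidden state mixes the current input with the previous state,
            # which is how earlier acoustic features influence the current output.
            hidden = [
                math.tanh(
                    sum(w * v for w, v in zip(self.w_in[j], x))
                    + sum(w * h for w, h in zip(self.w_rec[j], hidden))
                )
                for j in range(self.hidden_size)
            ]
            outputs.append([sum(w * h for w, h in zip(row, hidden)) for row in self.w_out])
        return outputs


# Feed the same (hypothetical) acoustic feature twice in a row.
network = TinyRNN(input_size=3, hidden_size=4, output_size=2)
outputs = network.forward([[1.0, 0.5, -0.5], [1.0, 0.5, -0.5]])
```

Even with identical inputs, the second output differs from the first because the hidden state already encodes the first step, which is the smoothing effect the paragraph above attributes to history.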
In some embodiments, the recurrent neural network may be trained in the following manner.
First, acoustic feature samples are acquired. Each acoustic feature sample is annotated with a ground-truth value, and the ground-truth value is the posture control vector value of at least one local region of the interactive object.
After the acoustic feature samples are obtained, an initial recurrent neural network is trained on the acoustic feature samples, and the recurrent neural network is obtained once the change in the network loss satisfies a convergence condition. The network loss includes the difference between the posture control vector value of the at least one local region predicted by the recurrent neural network and the ground-truth value.
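The two ingredients of this training criterion, a loss measuring the gap between predicted and annotated posture control vector values and a stopping test on the change in that loss, might look as follows. The mean-squared-error form, the tolerance, and the patience count are assumptions; the disclosure only requires that the loss include the difference and that its change satisfy some convergence condition.

```python
def posture_loss(predicted, target):
    """Mean squared difference between predicted and annotated vector values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def has_converged(loss_history, tolerance=1e-4, patience=3):
    """True once the loss changed by less than tolerance over the last patience steps."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(a - b) < tolerance for a, b in zip(recent, recent[1:]))


# Hypothetical values: one prediction against its annotation, and two loss curves.
example_loss = posture_loss([1.0, 2.0], [1.0, 4.0])
still_training = has_converged([1.0, 0.9, 0.5, 0.4, 0.3])
finished = has_converged([1.0, 0.5, 0.50001, 0.50002, 0.50002])
```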
In some embodiments, the acoustic feature samples may be acquired by the following method.
First, a video segment of a character uttering speech is acquired, and the corresponding speech segment is extracted from the video segment. For example, a video segment of a real person speaking may be acquired.
Next, the video segment is sampled according to a first sampling period to obtain multiple first image frames containing the character, and the speech segment is sampled according to a second sampling period to obtain multiple speech frames.
The second sampling period is shorter than the first sampling period; that is, the speech segment is sampled at a higher frequency than the video segment, so that one first image frame can correspond to the acoustic feature of at least one speech frame.
After that, the acoustic feature corresponding to the at least one speech frame that corresponds to the first image frame is acquired. Note that the number of speech frames corresponding to one first image frame during training is the same as the number of speech frames corresponding to an acoustic feature during the driving process described above, and the method of acquiring the acoustic feature during training is also the same as in the driving process.
Then, the first image frame is converted into a second image frame containing the interactive object, and the posture control vector value of at least one local region corresponding to the second image frame is acquired. The posture control vector value may include the posture control vector values of all local regions, or of only some of the local regions.
Taking a first image frame containing a real person as an example, the image frame of the real person may be converted into a second image frame containing the figure represented by the interactive object. The posture control vectors of the local regions of the real person correspond to the posture control vectors of the local regions of the interactive object, so the posture control vectors of the local regions of the interactive object in the second image frame can be acquired.
Finally, the acoustic feature corresponding to the first image frame is annotated according to the posture control vector value, and an acoustic feature sample is obtained.
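The pairing of image frames with windows of speech frames can be sketched as below. The disclosure does not say where the window sits relative to the image frame, so this sketch assumes the window starts at the speech frame aligned with the image frame; the name `build_samples`, the ratio of 4 speech frames per image frame, the window of 8 frames, and the placeholder labels are likewise hypothetical.

```python
def build_samples(feature_vectors, posture_values, frames_per_image, window_size):
    """Pair each image frame's posture values with a window of per-frame features."""
    samples = []
    for image_index, posture in enumerate(posture_values):
        # Assumption: the window begins at the speech frame aligned with this image.
        start = image_index * frames_per_image
        window = feature_vectors[start:start + window_size]
        if len(window) < window_size:
            break  # not enough speech frames left for a full window
        samples.append((window, posture))
    return samples


# Placeholder data: 20 per-frame feature vectors and 6 annotated image frames.
feature_vectors = [[i] for i in range(20)]
posture_values = ["p0", "p1", "p2", "p3", "p4", "p5"]
samples = build_samples(feature_vectors, posture_values, frames_per_image=4, window_size=8)
```

Using the same window size here as in the driving process keeps the training inputs shaped like the inputs the network sees at run time, which is the consistency requirement stated above.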
In the embodiments of the present disclosure, the video segment of a character is split into corresponding multiple first image frames and multiple speech frames, and the first image frame containing a real person is converted into a second image frame containing the interactive object in order to obtain the posture control vector corresponding to the acoustic feature of at least one speech frame. The acoustic features therefore correspond well to the posture control vectors, which yields high-quality acoustic feature samples and brings the actions of the interactive object closer to the real actions of the corresponding character.
FIG. 4 shows a schematic structural diagram of an apparatus for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 4, the apparatus may include: a first acquiring unit 401, configured to acquire a speech frame sequence contained in a speech segment; a second acquiring unit 402, configured to acquire control parameters of at least one local region of the interactive object corresponding to the speech frame sequence; and a driving unit 403, configured to control the posture of the interactive object according to the acquired control parameters.
In some embodiments, the apparatus further includes an output unit, configured to control the display device showing the interactive object to output speech and/or display text according to the speech segment.
In some embodiments, the control parameters of a local region of the interactive object include a posture control vector of the local region, and the second acquiring unit is specifically configured to: acquire a first acoustic feature sequence corresponding to the speech frame sequence; acquire the acoustic feature corresponding to at least one speech frame according to the first acoustic feature sequence; and acquire the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature.
In some embodiments, the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence, and when acquiring the acoustic feature corresponding to at least one speech frame according to the first acoustic feature sequence, the second acquiring unit is specifically configured to: slide a time window of a set length over the first acoustic feature sequence with a set step size, take the acoustic feature vectors within the time window as the acoustic feature of the corresponding, equal number of speech frames, and obtain a second acoustic feature sequence from the multiple acoustic features obtained once the sliding is complete.
In some embodiments, the driving unit is specifically configured to: acquire a sequence of posture control vectors corresponding to the second acoustic feature sequence; and control the posture of the interactive object according to the sequence of posture control vectors.
In some embodiments, when acquiring the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature, the second acquiring unit is specifically configured to: input the acoustic feature into a pre-trained recurrent neural network to obtain the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature.
In some embodiments, the recurrent neural network is obtained by training on acoustic feature samples. The apparatus further includes a sample acquiring unit, configured to: extract, from an acquired video segment, the speech segment of speech uttered by a character; sample the video segment to obtain multiple first image frames containing the character, and sample the speech segment to obtain multiple speech frames; acquire the acoustic feature of the speech frames corresponding to the first image frame; convert the first image frame into a second image frame containing the interactive object, and acquire the posture control vector value of at least one local region corresponding to the second image frame; and annotate the acoustic feature corresponding to the first image frame according to the posture control vector value to obtain an acoustic feature sample.
In some embodiments, the apparatus further includes a training unit, configured to train an initial recurrent neural network on the acoustic feature samples and obtain the recurrent neural network once the change in the network loss satisfies a convergence condition, where the network loss includes the difference between the posture control vector value of the at least one local region predicted by the initial recurrent neural network and the annotated posture control vector value.
At least one embodiment of this specification further provides an electronic device. As shown in FIG. 5, the device includes a memory and a processor; the memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, implements the method for driving an interactive object according to any embodiment of the present disclosure.
At least one embodiment of this specification further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method for driving an interactive object according to any embodiment of the present disclosure.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on how it differs from the others. In particular, because the data processing device embodiment is substantially similar to the method embodiment, it is described only briefly; for related details, refer to the corresponding description of the method embodiment.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform the corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special-purpose logic circuitry.
Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The basic components of a computer are a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or is operatively coupled to such mass storage devices to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what is claimed, but rather as describing features of specific embodiments of particular inventions. Certain features described in this specification in the context of multiple embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations, and even initially claimed as such, one or more features of a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of the various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing describes only preferred embodiments of one or more embodiments of this specification and is not intended to limit them. Any modification, equivalent replacement, or improvement made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.

Claims (16)

  1. A method for driving an interactive object, comprising:
    acquiring a speech frame sequence contained in a speech segment;
    acquiring a control parameter value of at least one local region of the interactive object corresponding to the speech frame sequence; and
    controlling a posture of the interactive object according to the acquired control parameter value.
  2. The method according to claim 1, further comprising: controlling, according to the speech segment, a display device that displays the interactive object to output speech and/or display text.
  3. The method according to claim 1 or 2, wherein the control parameter of a local region of the interactive object comprises a posture control vector of the local region;
    acquiring the control parameter of at least one local region of the interactive object corresponding to the speech frame sequence comprises:
    acquiring a first acoustic feature sequence corresponding to the speech frame sequence;
    acquiring, according to the first acoustic feature sequence, an acoustic feature corresponding to at least one speech frame; and
    acquiring a posture control vector of at least one local region of the interactive object corresponding to the acoustic feature.
  4. The method according to claim 3, wherein the first acoustic feature sequence comprises an acoustic feature vector corresponding to each speech frame in the speech frame sequence;
    acquiring, according to the first acoustic feature sequence, the acoustic feature corresponding to at least one speech frame comprises:
    sliding a time window of a set length over the first acoustic feature sequence with a set step size, taking the acoustic feature vectors within the time window as the acoustic feature of the corresponding at least one speech frame, and obtaining a second acoustic feature sequence from the plurality of acoustic features obtained once the sliding is complete;
    controlling the posture of the interactive object according to the acquired control parameter comprises:
    acquiring a sequence of posture control vectors corresponding to the second acoustic feature sequence; and
    controlling the posture of the interactive object according to the sequence of posture control vectors.
  5. The method according to claim 3, wherein acquiring the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature comprises:
    inputting the acoustic feature into a pre-trained recurrent neural network to obtain the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature.
  6. The method according to claim 5, wherein the recurrent neural network is trained on acoustic feature samples;
    the acoustic feature samples are obtained as follows:
    acquiring a video segment in which a character speaks, and extracting from the video segment the speech segment in which the character speaks; sampling the video segment to obtain a plurality of first image frames containing the character; and sampling the speech segment to obtain a plurality of speech frames;
    acquiring the acoustic feature of the speech frame corresponding to the first image frame;
    converting the first image frame into a second image frame containing the interactive object, and acquiring a posture control vector value of at least one local region corresponding to the second image frame; and
    labeling the acoustic feature corresponding to the first image frame with the posture control vector value to obtain the acoustic feature sample.
  7. The method according to claim 6, further comprising: training an initial recurrent neural network on the acoustic feature samples, and obtaining the recurrent neural network once the change in the network loss satisfies a convergence condition, wherein the network loss comprises the difference between the posture control vector value of the at least one local region predicted by the recurrent neural network and the labeled posture control vector value.
  8. An apparatus for driving an interactive object, comprising:
    a first acquisition unit configured to acquire a speech frame sequence contained in a speech segment;
    a second acquisition unit configured to acquire a control parameter of at least one local region of the interactive object corresponding to the speech frame sequence; and
    a driving unit configured to control a posture of the interactive object according to the acquired control parameter.
  9. The apparatus according to claim 8, further comprising an output unit configured to control, according to the speech segment, a display device that displays the interactive object to output speech and/or display text.
  10. The apparatus according to claim 8 or 9, wherein the control parameter of a local region of the interactive object comprises a posture control vector of the local region, and the second acquisition unit is configured to:
    acquire a first acoustic feature sequence corresponding to the speech frame sequence;
    acquire, according to the first acoustic feature sequence, an acoustic feature corresponding to at least one speech frame; and
    acquire a posture control vector of at least one local region of the interactive object corresponding to the acoustic feature.
  11. The apparatus according to claim 10, wherein the first acoustic feature sequence comprises an acoustic feature vector corresponding to each speech frame in the speech frame sequence;
    when acquiring, according to the first acoustic feature sequence, the acoustic feature corresponding to at least one speech frame, the second acquisition unit is configured to:
    slide a time window of a set length over the first acoustic feature sequence with a set step size, take the acoustic feature vectors within the time window as the acoustic feature of the corresponding at least one speech frame, and obtain a second acoustic feature sequence from the plurality of acoustic features obtained once the sliding is complete;
    the driving unit is configured to:
    acquire a sequence of posture control vectors corresponding to the second acoustic feature sequence; and
    control the posture of the interactive object according to the sequence of posture control vectors.
  12. The apparatus according to claim 10, wherein, when acquiring the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature, the second acquisition unit is configured to: input the acoustic feature into a pre-trained recurrent neural network to obtain the posture control vector of at least one local region of the interactive object corresponding to the acoustic feature.
  13. The apparatus according to claim 12, wherein the recurrent neural network is trained on acoustic feature samples;
    the apparatus further comprises a sample acquisition unit configured to:
    acquire a video segment in which a character speaks, and extract the corresponding speech segment from the video segment; sample the video segment to obtain a plurality of first image frames containing the character; and sample the speech segment to obtain a plurality of speech frames;
    acquire the acoustic feature of the speech frame corresponding to the first image frame;
    convert the first image frame into a second image frame containing the interactive object, and acquire a posture control vector value of at least one local region corresponding to the second image frame; and
    label the acoustic feature corresponding to the first image frame with the posture control vector value to obtain the acoustic feature sample.
  14. The apparatus according to claim 13, further comprising a training unit configured to train an initial recurrent neural network on the acoustic feature samples and to obtain the recurrent neural network once the change in the network loss satisfies a convergence condition, wherein the network loss comprises the difference between the posture control vector value of the at least one local region predicted by the recurrent neural network and the labeled posture control vector value.
  15. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method according to any one of claims 1 to 7 when executing the computer instructions.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
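
By way of example only, and without limiting the claims, the sliding-window step recited in claims 4 and 11 can be sketched as below. The sketch measures the window length and the step size in speech frames rather than in seconds, and it accepts any callable as a placeholder for the pre-trained recurrent neural network. The names `sliding_windows`, `posture_vector_sequence`, and `model`, and the frame-based units, are assumptions of this sketch.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Window = List[Vector]


def sliding_windows(
    first_sequence: Sequence[Vector], window_length: int, step: int
) -> List[Window]:
    """Slide a fixed-length window over the first acoustic feature sequence.

    Each window gathers the acoustic feature vectors of `window_length`
    consecutive speech frames; the returned list plays the role of the
    second acoustic feature sequence.
    """
    if window_length <= 0 or step <= 0:
        raise ValueError("window_length and step must be positive")
    windows: List[Window] = []
    last_start = len(first_sequence) - window_length
    for start in range(0, last_start + 1, step):
        window = first_sequence[start:start + window_length]
        windows.append([list(vector) for vector in window])
    return windows


def posture_vector_sequence(
    second_sequence: Sequence[Window], model: Callable[[Window], Vector]
) -> List[Vector]:
    """Map every window to one posture control vector using `model`."""
    return [model(window) for window in second_sequence]
```

With five single-value feature vectors, a window length of 3, and a step of 1, `sliding_windows` produces three overlapping windows. A sequence shorter than the window produces none, since this sketch does not pad.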
PCT/CN2020/129814 2020-03-31 2020-11-18 Interactive object driving method and apparatus, device, and storage medium WO2021196646A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021529000A JP2022530726A (en) 2020-03-31 2020-11-18 Interactive target drive methods, devices, devices, and recording media
KR1020217015867A KR20210124182A (en) 2020-03-31 2020-11-18 Interactive object driving method, apparatus, device and recording medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010247276.5A CN111459454B (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium
CN202010247276.5 2020-03-31

Publications (1)

Publication Number Publication Date
WO2021196646A1 true WO2021196646A1 (en) 2021-10-07

Family

ID=71678881

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129814 WO2021196646A1 (en) 2020-03-31 2020-11-18 Interactive object driving method and apparatus, device, and storage medium

Country Status (5)

Country Link
JP (1) JP2022530726A (en)
KR (1) KR20210124182A (en)
CN (2) CN111459454B (en)
TW (1) TW202139052A (en)
WO (1) WO2021196646A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116208A1 (en) * 2021-12-24 2023-06-29 上海商汤智能科技有限公司 Digital object generation method and apparatus, and device and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459454B (en) * 2020-03-31 2021-08-20 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111460785B (en) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 Method, device and equipment for driving interactive object and storage medium
CN112527115B (en) * 2020-12-15 2023-08-04 北京百度网讯科技有限公司 User image generation method, related device and computer program product
CN113050859B (en) * 2021-04-19 2023-10-24 北京市商汤科技开发有限公司 Driving method, device and equipment of interaction object and storage medium
CN113314104B (en) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium
CN114283227B (en) * 2021-11-26 2023-04-07 北京百度网讯科技有限公司 Virtual character driving method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120284029A1 (en) * 2011-05-02 2012-11-08 Microsoft Corporation Photo-realistic synthesis of image sequences with lip movements synchronized with speech
CN110136698A (en) * 2019-04-11 2019-08-16 北京百度网讯科技有限公司 For determining the method, apparatus, equipment and storage medium of nozzle type
CN110288682A (en) * 2019-06-28 2019-09-27 北京百度网讯科技有限公司 Method and apparatus for controlling the variation of the three-dimensional portrait shape of the mouth as one speaks
CN111459454A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3674875B2 (en) * 1994-10-24 2005-07-27 株式会社イメージリンク Animation system
JP3212578B2 (en) * 1999-06-30 2001-09-25 インタロボット株式会社 Physical voice reaction toy
JP2001034785A (en) * 1999-07-16 2001-02-09 Atr Media Integration & Communications Res Lab Virtual transformation device
JP2003248837A (en) * 2001-11-12 2003-09-05 Mega Chips Corp Device and system for image generation, device and system for sound generation, server for image generation, program, and recording medium
JP4543263B2 (en) * 2006-08-28 2010-09-15 株式会社国際電気通信基礎技術研究所 Animation data creation device and animation data creation program
CN102609969B (en) * 2012-02-17 2013-08-07 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
JP2015166890A (en) * 2014-03-03 2015-09-24 ソニー株式会社 Information processing apparatus, information processing system, information processing method, and program
CN106056989B (en) * 2016-06-23 2018-10-16 广东小天才科技有限公司 A kind of interactive learning methods and device, terminal device
EP3490761A4 (en) * 2016-07-27 2020-04-01 Warner Bros. Entertainment Inc. Control of social robot based on prior character portrayal in fiction or performance
JP6945375B2 (en) * 2017-07-27 2021-10-06 株式会社バンダイナムコエンターテインメント Image generator and program
CN107704169B (en) * 2017-09-26 2020-11-17 北京光年无限科技有限公司 Virtual human state management method and system
JP2019078857A (en) * 2017-10-24 2019-05-23 国立研究開発法人情報通信研究機構 Method of learning acoustic model, and computer program
CN107861626A (en) * 2017-12-06 2018-03-30 北京光年无限科技有限公司 The method and system that a kind of virtual image is waken up
JP7140984B2 (en) * 2018-02-16 2022-09-22 日本電信電話株式会社 Nonverbal information generation device, nonverbal information generation model learning device, method, and program
JP7157340B2 (en) * 2018-02-16 2022-10-20 日本電信電話株式会社 Nonverbal information generation device, nonverbal information generation model learning device, method, and program
CN108942919B (en) * 2018-05-28 2021-03-30 北京光年无限科技有限公司 Interaction method and system based on virtual human
CN110176284A (en) * 2019-05-21 2019-08-27 杭州师范大学 A kind of speech apraxia recovery training method based on virtual reality
CN110400251A (en) * 2019-06-13 2019-11-01 深圳追一科技有限公司 Method for processing video frequency, device, terminal device and storage medium
CN110503942A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 A kind of voice driven animation method and device based on artificial intelligence
CN110794964A (en) * 2019-10-22 2020-02-14 深圳追一科技有限公司 Interaction method and device for virtual robot, electronic equipment and storage medium
CN110815258B (en) * 2019-10-30 2023-03-31 华南理工大学 Robot teleoperation system and method based on electromagnetic force feedback and augmented reality
CN110929762B (en) * 2019-10-30 2023-05-12 中科南京人工智能创新研究院 Limb language detection and behavior analysis method and system based on deep learning

Also Published As

Publication number Publication date
CN113672194A (en) 2021-11-19
KR20210124182A (en) 2021-10-14
JP2022530726A (en) 2022-07-01
CN111459454B (en) 2021-08-20
CN111459454A (en) 2020-07-28
TW202139052A (en) 2021-10-16

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021529000

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20928143

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20928143

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 521430187

Country of ref document: SA