WO2021196645A1 - Method, apparatus and device for driving interactive object, and storage medium - Google Patents

Method, apparatus and device for driving interactive object, and storage medium

Info

Publication number
WO2021196645A1
Authority
WO
WIPO (PCT)
Prior art keywords
interactive object
data
driving
sequence
control parameter
Prior art date
Application number
PCT/CN2020/129806
Other languages
French (fr)
Chinese (zh)
Inventor
张子隆 (Zhang Zilong)
吴文岩 (Wu Wenyan)
吴潜溢 (Wu Qianyi)
许亲亲 (Xu Qinqin)
Original Assignee
Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SenseTime Technology Development Co., Ltd.
Priority to JP2021556973A (patent JP7227395B2)
Priority to KR1020217031139A (patent KR102707613B1)
Publication of WO2021196645A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a method, apparatus, and device for driving interactive objects, and a storage medium.
  • the embodiments of the present disclosure provide a driving solution for interactive objects.
  • a method for driving an interactive object is provided, comprising: obtaining driving data of the interactive object and determining a driving mode of the driving data; in response to the driving mode, obtaining the control parameter value of the interactive object according to the driving data; and controlling the posture of the interactive object according to the control parameter value.
  • the method further includes: controlling the display device to output voice and/or display text according to the driving data.
  • the determining the driving mode corresponding to the driving data includes: obtaining a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, in response to detecting that a voice data unit includes target data, determining that the driving mode of the driving data is the first driving mode, where the target data corresponds to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, using the preset control parameter value corresponding to the target data as the control parameter value of the interactive object.
  • the target data includes a keyword or key phrase, and the keyword or key phrase corresponds to preset control parameter values of a set action of the interactive object; or,
  • the target data includes a syllable, and the syllable corresponds to preset control parameter values for a set mouth movement of the interactive object.
  • the determining the driving mode corresponding to the driving data includes: obtaining a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, if it is not detected that a voice data unit includes target data, determining that the driving mode of the driving data is the second driving mode, where the target data corresponds to a preset control parameter value of the interactive object.
  • obtaining the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, obtaining characteristic information of at least one voice data unit in the voice data sequence; and obtaining the control parameter value of the interactive object corresponding to the characteristic information.
  • the voice data sequence includes a phoneme sequence
  • the acquiring feature information of at least one voice data unit in the voice data sequence includes: performing feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; obtaining a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining feature information of the at least one phoneme according to the feature code.
  • the voice data sequence includes a voice frame sequence
  • the acquiring characteristic information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquiring, according to the first acoustic feature sequence, an acoustic feature vector corresponding to at least one voice frame; and obtaining, according to the acoustic feature vector, feature information corresponding to the at least one voice frame.
  • the control parameter of the interactive object includes a facial posture parameter
  • the facial posture parameter includes a facial muscle control coefficient
  • the facial muscle control coefficient is used to control the motion state of at least one facial muscle
  • the obtaining the control parameter value of the interactive object according to the driving data includes: obtaining the facial muscle control coefficient of the interactive object according to the driving data; and the controlling the posture of the interactive object according to the control parameter value includes: driving the interactive object, according to the acquired facial muscle control coefficient, to make facial actions matching the driving data.
  • the method further includes: acquiring driving data of the body posture associated with the facial posture parameter value; and driving, according to the driving data of the body posture associated with the facial posture parameter value, the interactive object to make body movements.
  • the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object; the obtaining the control parameter value of the interactive object according to the driving data includes: obtaining a control vector of at least one local area of the interactive object according to the driving data; and the controlling the posture of the interactive object according to the control parameter value includes: controlling the facial movements and/or body movements of the interactive object according to the obtained control vector of the at least one local area.
  • the obtaining the control parameter value of the interactive object corresponding to the characteristic information includes: inputting the characteristic information into a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the characteristic information.
  • a device for driving an interactive object is provided.
  • the interactive object is displayed in a display device.
  • the device includes: a first acquiring unit configured to acquire driving data of the interactive object and determine the driving mode of the driving data; a second acquiring unit configured to acquire, in response to the driving mode, a control parameter value of the interactive object according to the driving data; and a driving unit configured to control the posture of the interactive object according to the control parameter value.
  • an electronic device includes a memory and a processor, where the memory is used to store computer instructions runnable on the processor, and the processor is configured to, when executing the computer instructions, implement the method for driving interactive objects described in any of the embodiments provided in the present disclosure.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the method for driving an interactive object according to any one of the embodiments provided in the present disclosure is realized.
  • the method, apparatus, device, and computer-readable storage medium for driving an interactive object according to the present disclosure obtain the control parameter value of the interactive object according to the driving mode of the driving data of the interactive object, thereby controlling the posture of the interactive object. For different driving modes, the control parameter value of the interactive object can be obtained in different ways, so that the interactive object displays a posture matching the content of the driving data and/or the corresponding voice; the target object thus feels that it is communicating with the interactive object, and the interactive experience between the target object and the interactive object is improved.
  • FIG. 1 is a schematic diagram of a display device in a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a process of feature encoding for a phoneme sequence proposed by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a process of obtaining control parameter values according to a phoneme sequence according to at least one embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of a process of obtaining control parameter values according to a sequence of speech frames proposed by at least one embodiment of the present disclosure
  • FIG. 6 is a schematic structural diagram of a driving device for interactive objects proposed in at least one embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram of an electronic device provided by at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a method for driving interactive objects.
  • the driving method may be executed by electronic devices such as a terminal device or a server.
  • the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, or a game console.
  • the server includes a local server or a cloud server, etc., and the method can also be implemented by a processor calling computer-readable instructions stored in a memory.
  • the interaction object may be any virtual image capable of interacting with the target object.
  • the interactive object may be a virtual character, or may also be a virtual animal, virtual item, cartoon image, or other virtual images capable of implementing interactive functions.
  • the display form of the interactive object may be 2D or 3D, which is not limited in the present disclosure.
  • the target object may be a user, a robot, or other smart devices.
  • the interaction manner between the interaction object and the target object may be an active interaction manner or a passive interaction manner.
  • the target object can make a demand by making gestures or body movements, and trigger the interactive object to interact with it by means of active interaction.
  • the interactive object may actively greet the target object, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.
  • the interactive objects may be displayed through terminal devices, which may be televisions, all-in-one machines with display functions, projectors, virtual reality (VR) devices, augmented reality (AR) devices, and the like; the present disclosure does not limit the specific form of the terminal device.
  • Fig. 1 shows a display device proposed by at least one embodiment of the present disclosure.
  • the display device has a transparent display screen, and a stereoscopic picture can be displayed on the transparent display screen to present a virtual scene and interactive objects with a stereoscopic effect.
  • the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters.
  • the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen.
  • the display device is configured with a memory and a processor, and the memory is used to store computer instructions that can run on the processor.
  • the processor is used to implement the method for driving the interactive object provided in the present disclosure when the computer instruction is executed, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.
  • in response to driving data for driving the interactive object to output voice, the interactive object may emit a specified voice to the target object.
  • the terminal device can generate driving data according to the actions, expressions, identities, preferences, etc. of the target object around the terminal device to drive the interactive object to communicate or respond by issuing a specified voice, thereby providing anthropomorphic services for the target object.
  • the sound driving data can also be generated in other ways, for example, generated by the server and sent to the terminal device.
  • when the interactive object is driven to emit a specified voice according to the driving data, it may not be possible to drive the interactive object to make facial movements synchronized with the specified voice, making the interactive object dull, rigid, and unnatural when uttering the voice and affecting the interactive experience between the target object and the interactive object. Based on this, at least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the interaction experience between the target object and the interactive object.
  • FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure.
  • the interactive object is displayed on a display device.
  • the method includes steps 201 to 203.
  • step 201 the driving data of the interactive object is acquired, and the driving mode of the driving data is determined.
  • the driving data may include audio data (voice data), text, and so on.
  • the driving data may be generated by the server or terminal device according to the actions, expressions, identity, preferences, etc. of the target object interacting with the interactive object, or may be driving data directly obtained by the terminal device, such as driving data called from an internal memory.
  • the present disclosure does not limit the acquisition method of the driving data.
  • the driving mode of the driving data can be determined.
  • the voice data sequence corresponding to the drive data may be obtained according to the type of the drive data, where the voice data sequence includes multiple voice data units.
  • the voice data unit may be formed in units of characters or words, or may be formed in units of phonemes or syllables.
  • corresponding to a text type of driving data, the character sequence, word sequence, etc. corresponding to the driving data can be obtained;
  • corresponding to an audio type of driving data, the phoneme sequence, syllable sequence, speech frame sequence, and so on corresponding to the driving data can be obtained.
  • audio data and text data can be converted to each other. For example, the audio data is converted into text data and then the voice data unit is divided, or the text data is converted into audio data and then the voice data unit is divided, which is not limited in the present disclosure.
  • when it is detected that the voice data unit includes target data, it can be determined that the driving mode of the driving data is the first driving mode, where the target data corresponds to a preset control parameter value of the interactive object.
  • the target data may be a set keyword or key phrase, etc., and the keyword or key phrase corresponds to a preset control parameter value of a set action of the interactive object.
  • each piece of target data is matched with a set action in advance, and each set action is controlled by its corresponding control parameter values, so each piece of target data matches the control parameter values of a set action.
  • for example, if the voice data unit contains "wave" in text form and/or "wave" in voice form, it can be determined that the driving data contains target data.
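  • The Python sketch below (not from the patent; the keyword list, parameter values, and function name are invented for illustration) scans the voice data units for target data and falls back to the second driving mode when none is found:

```python
# Hypothetical sketch: choosing a driving mode by scanning voice data units
# for target data (keywords); keywords and parameter values are invented.
PRESET_CONTROL_PARAMS = {
    "wave": [0.0, 0.8, 0.3],  # assumed control parameter values for "wave"
    "nod":  [0.5, 0.1, 0.0],  # assumed control parameter values for "nod"
}

def select_driving_mode(voice_data_units):
    """Return ("first", params) if any unit contains target data, else ("second", None)."""
    for unit in voice_data_units:
        for keyword, params in PRESET_CONTROL_PARAMS.items():
            if keyword in unit:
                return "first", params
    return "second", None

mode, params = select_driving_mode(["hello there", "please wave goodbye"])
# mode == "first"; params are the preset control parameter values for "wave"
```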
  • the target data includes a syllable
  • the syllable corresponds to a preset control parameter value for setting a mouth movement of the interactive object.
  • the syllable in the target data belongs to a pre-divided syllable type, and one syllable type matches one set mouth shape.
  • a syllable is a phonetic unit formed by a combination of at least one phoneme, and the syllable includes a syllable of a pinyin language and a syllable of a non-pinyin language (for example, Chinese).
  • a syllable type refers to syllables whose pronunciation actions are the same or basically the same, and one syllable type can correspond to one action of the interactive object.
  • a syllable type may correspond to a set mouth shape when the interactive object speaks, that is, it corresponds to a pronunciation action.
  • different syllable types are matched with different control parameter values for setting the mouth shape.
  • for example, the pinyin syllables of the "ma", "man", and "mang" type can be regarded as the same type because their pronunciation actions are basically the same, and they can all correspond to the control parameter values of the "mouth open" mouth shape used when the interactive object speaks.
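  • As a sketch of this syllable-type matching (Python; the syllable table, type names, and parameter values are invented for illustration, not taken from the patent):

```python
# Hypothetical sketch: syllables with basically the same pronunciation action
# share a syllable type, and each type maps to mouth-shape control parameter
# values; all names and numbers here are invented.
SYLLABLE_TYPE = {"ma": "open", "man": "open", "mang": "open", "bo": "round"}
MOUTH_PARAMS = {"open": {"jaw_open": 0.7}, "round": {"lip_round": 0.6}}

def mouth_params_for(syllable):
    syllable_type = SYLLABLE_TYPE.get(syllable)
    return MOUTH_PARAMS.get(syllable_type)  # None if not target data

print(mouth_params_for("man"))  # {'jaw_open': 0.7}
```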
  • if it is not detected that the voice data unit includes target data, it may be determined that the driving mode of the driving data is the second driving mode, where the target data corresponds to the preset control parameter value of the interactive object.
  • the first driving mode and the second driving mode are only examples, and the embodiments of the present disclosure do not limit the specific driving modes.
  • step 202 in response to the driving mode, the control parameter value of the interactive object is obtained according to the driving data.
  • for different driving modes, the control parameter value of the interactive object can be obtained in a corresponding manner.
  • the preset control parameter value corresponding to the target data may be used as the control parameter value of the interaction object.
  • the preset control parameter value corresponding to the target data such as "wave" contained in the voice data sequence may be used as the control parameter value of the interactive object.
  • in response to the second driving mode, characteristic information of at least one voice data unit in the voice data sequence may be acquired, and the control parameter value of the interactive object corresponding to the characteristic information may be acquired. That is, when the target data is not detected in the voice data sequence, the corresponding control parameter value can be obtained according to the characteristic information of the voice data unit.
  • the feature information may include feature information of a voice data unit obtained by performing feature encoding on the voice data sequence, feature information of a voice data unit obtained according to the acoustic feature information of the voice data sequence, and so on.
  • step 203 the posture of the interactive object is controlled according to the control parameter value.
  • control parameters of the interactive object include facial posture parameters, and the facial posture parameters include facial muscle control coefficients, and the facial muscle control coefficients are used to control the motion state of at least one facial muscle.
  • for example, the facial muscle control coefficient of the interactive object may be acquired according to the driving data, and the interactive object may be driven, according to the acquired facial muscle control coefficient, to make facial actions matching the driving data.
  • control parameter value of the interactive object includes a control vector of at least one local area of the interactive object.
  • a control vector of at least one local area of the interactive object can be acquired according to the driving data, and the facial movements and/or body movements of the interactive object can be controlled according to the acquired control vector of the at least one local area.
  • by obtaining the control parameter value of the interactive object according to the driving mode of the driving data of the interactive object, the posture of the interactive object is controlled; for different driving modes, the control parameter value of the interactive object can be obtained in different ways, so that the interactive object shows a posture matching the content of the driving data and/or the corresponding voice. The target object thus has the feeling of communicating with the interactive object, and the interactive experience between the target object and the interactive object is improved.
  • the display device may also be controlled to output voice and/or display text according to the driving data. And while outputting voice and/or displaying text, the gesture of the interactive object can be controlled according to the control parameter value.
  • since the voice and/or text output according to the driving data is synchronized with the control of the posture according to the control parameter value, the gesture made by the interactive object is also synchronized with the output voice and/or displayed text, thereby giving the target object the feeling that the interactive object is communicating with it.
  • the speech data sequence includes a phoneme sequence.
  • the audio data may be split into multiple audio frames, and the audio frames may be combined according to their states to form phonemes; the phonemes formed from the audio data constitute a phoneme sequence.
  • the phoneme is the smallest phonetic unit divided according to the natural attributes of the speech, and a pronunciation action of a real person can form a phoneme.
  • for text-type driving data, the phonemes corresponding to the morphemes contained in the text can be obtained, so as to obtain the corresponding phoneme sequence.
  • the feature information of at least one voice data unit in the voice data sequence may be obtained by: performing feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence; obtaining, according to the first coding sequence, the feature code corresponding to at least one phoneme; and obtaining the feature information of the at least one phoneme according to the feature code.
  • Figure 3 shows a schematic diagram of the process of feature encoding on a phoneme sequence.
  • the phoneme sequence 310 contains phonemes j, i1, and ie4 (for brevity, only some phonemes are shown), and corresponding code sequences 321, 322, and 323 are obtained for phonemes j, i1, and ie4, respectively.
  • for each code sequence, the coding value at a time point where the corresponding phoneme is present is set to a first value (for example, 1), and the coding value at a time point where the phoneme is not present is set to a second value (for example, 0).
  • taking the code sequence 321 as an example, at time points where phoneme j is present, its value is set to the first value, 1; at time points where phoneme j is not present, its value is set to the second value, 0. All the code sequences 321, 322, and 323 together constitute the total coding sequence 320.
  • according to the coding values of the code sequences 321, 322, and 323 corresponding to phonemes j, i1, and ie4, and the durations of the corresponding phonemes in the three code sequences (that is, the duration of j in code sequence 321, the duration of i1 in code sequence 322, and the duration of ie4 in code sequence 323), the characteristic information of the code sequences 321, 322, and 323 can be obtained.
  • Gaussian filters may be used to perform Gaussian convolution operations on the temporally consecutive values of phonemes j, i1, and ie4 in the code sequences 321, 322, and 323, respectively, to obtain the feature information of the code sequences. That is, the Gaussian filter performs a Gaussian convolution operation on the temporally continuous values of each phoneme, so that the transitions of the coding values in each code sequence from the second value to the first value, or from the first value to the second value, become smooth. A Gaussian convolution operation is performed on each of the code sequences 321, 322, and 323 to obtain the feature values of each code sequence; these feature values constitute the parameters of the feature information, and the set of the feature information of the code sequences yields the feature information 330 corresponding to the phoneme sequence 310. Those skilled in the art should understand that other operations may also be performed on each code sequence to obtain its characteristic information, which is not limited in the present disclosure.
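  • A minimal sketch of this encoding and smoothing (Python with NumPy/SciPy; the phoneme timeline and the Gaussian sigma are assumptions for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Each phoneme gets a 0/1 code sequence over the timeline (1 while that
# phoneme sounds), and a Gaussian filter smooths each sequence so that the
# transitions pass through intermediate values such as 0.2 or 0.3.
timeline = ["j", "j", "i1", "i1", "i1", "j", "ie4", "ie4"]  # hypothetical frames
phonemes = ["j", "i1", "ie4"]

code_sequences = {p: np.array([1.0 if t == p else 0.0 for t in timeline])
                  for p in phonemes}
feature_info = {p: gaussian_filter1d(seq, sigma=1.0)
                for p, seq in code_sequences.items()}
# feature_info["j"] now rises and falls smoothly instead of jumping 0 <-> 1
```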
  • in the embodiments of the present disclosure, the characteristic information of the code sequence is obtained according to the duration of each phoneme in the phoneme sequence, so that the transitions of the code sequence are smooth; for example, besides 0 and 1, the code sequence also takes intermediate values such as 0.2, 0.3, and so on. The posture parameter values obtained from these intermediate values make the posture changes of the interactive character smoother and more natural, and in particular make the expression changes of the interactive character smoother and more natural, improving the interactive experience of the target object.
  • the facial posture parameters may include facial muscle control coefficients.
  • a facial muscle model is obtained by dividing the facial muscles of the interactive object; by controlling each muscle (region) obtained by the division through its corresponding facial muscle control coefficient, that is, performing contraction/expansion control on it, the face of the interactive character can be made to show various expressions.
  • the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the control coefficient ranges from 0 to 1, and different values within this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the mouth can be opened and closed vertically. For the left mouth corner muscle, the control coefficient likewise ranges from 0 to 1, and different values within this range correspond to contraction/expansion states of the left mouth corner muscle; by changing this value, a lateral change of the mouth can be achieved.
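  • A hypothetical sketch of applying such coefficients (the muscle names and the face_model API are placeholders, not from the patent):

```python
# Drive a facial muscle model with control coefficients in [0, 1]; the
# face_model object and its set_contraction method are assumed placeholders.
def apply_muscle_coefficients(face_model, coefficients):
    for muscle, value in coefficients.items():
        clamped = min(max(value, 0.0), 1.0)       # keep within the 0-1 range
        face_model.set_contraction(muscle, clamped)

# e.g. apply_muscle_coefficients(face_model,
#          {"upper_lip": 0.6, "left_mouth_corner": 0.2})
# would open the mouth vertically and stretch it slightly sideways
```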
  • while outputting sound according to the phoneme sequence, the interactive object is driven to make facial expressions according to the facial muscle control coefficients corresponding to the phoneme sequence, so that when the display device outputs the sound, the interactive object synchronously makes the expressions that utter the sound; the target object thus feels that the interactive object is speaking, and the interactive experience of the target object is improved.
  • the facial motion of the interactive object may be associated with the body posture, that is, the facial posture parameter value corresponding to the facial motion may be associated with the body posture.
  • the body posture may include body movements, gesture movements, walking postures, and so on.
  • the driving data of the body posture associated with the facial posture parameter value may be acquired, and while the sound is output according to the phoneme sequence, the interactive object may be driven to make body movements according to the driving data of the body posture associated with the facial posture parameter value. That is, while the interactive object is driven to make a facial action according to its driving data, the driving data of the associated body posture is also obtained according to the facial posture parameter value corresponding to the facial action, so that when the sound is output, the interactive object can be driven to make the corresponding facial and body movements synchronously, making the speaking state of the interactive object more vivid and natural and improving the interactive experience of the target object.
  • a time window is moved along the phoneme sequence, and the phonemes within the time window are output at each move, with a set duration used as the step size of each move of the time window. For example, the length of the time window can be set to 1 second and the set duration to 0.1 second.
  • while the phonemes are output, the posture parameter value corresponding to the phoneme, or to the feature information of the phoneme, at a set position of the time window is obtained, and this posture parameter value is used to control the posture of the interactive object; the set position is the position at a set distance from the start of the time window. For example, when the length of the time window is 1 s, the set position may be 0.5 s from the start of the time window.
  • with each move of the time window, the posture of the interactive object is controlled by the posture parameter value corresponding to the set position of the time window, so that the posture of the interactive object is synchronized with the output voice and the target object feels that the interactive object is speaking.
  • since the set duration is the step size, changing the set duration changes the time interval (frequency) at which the posture parameter value is obtained, thereby changing the frequency at which the interactive object changes its posture.
  • the set duration can be set according to the actual interactive scene, so that the posture of the interactive object changes more naturally.
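  • The timing described above can be sketched as follows (Python; the numbers come from the example in the text, while get_param_at is a hypothetical lookup of the posture parameter value at a given time):

```python
# 1 s window sliding in 0.1 s steps; the posture parameter value is read at a
# set position 0.5 s from the window start.
WINDOW_LEN, STEP, SET_POS = 1.0, 0.1, 0.5

def posture_value_stream(get_param_at, total_duration):
    start = 0.0
    while start + WINDOW_LEN <= total_duration:
        yield get_param_at(start + SET_POS)  # value used to control the pose now
        start += STEP                        # move the window by the step size
```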
  • the posture of the interactive object can be controlled by acquiring the control vector of at least one partial area of the interactive object.
  • the local area is obtained by dividing the whole (including face and/or body) of the interactive object.
  • the control of one or more local areas of the face may correspond to a series of facial expressions or actions of the interactive object.
  • the control of the eye area may correspond to the facial actions of the interactive object such as opening, closing, blinking, and changing the perspective;
  • the control of the mouth area can correspond to facial actions such as closing the mouth of the interactive object and opening the mouth to different degrees.
  • the control of one or more local areas of the body may correspond to a series of physical actions of the interactive object.
  • the control of the leg area may correspond to the actions of the interactive object such as walking, jumping, and kicking.
  • the control parameter of the local area of the interactive object includes the control vector of the local area.
  • the attitude control vector of each local area is used to drive the local area of the interactive object to perform actions.
  • Different control vector values correspond to different actions or action ranges. For example, for the control vector of the mouth area, one set of control vector values can make the mouth of the interactive object slightly open, while another set of control vector values can make the mouth of the interactive object open wider. By driving the interactive objects with different control vector values, the corresponding local areas can perform different actions or actions with different amplitudes.
  • the local area can be selected according to the actions of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the control vectors of all the local areas can be obtained; when the expression of the interactive object needs to be controlled, the control vectors of the local areas corresponding to the face can be obtained.
  • the feature code corresponding to at least one phoneme may be obtained by performing a sliding window on the first code sequence.
  • the first coding sequence may be a coding sequence after a Gaussian convolution operation.
  • a sliding window is performed on the coding sequence with a time window of a set length and a set step size, and the feature code in the time window is used as the feature code of the corresponding at least one phoneme.
  • from the multiple feature codes obtained in this way, a second coding sequence can be obtained.
  • FIG. 4 by sliding a time window of a set length on the first coding sequence 420 or the smoothed first coding sequence 430, feature code 1, feature code 2, feature code 3 are obtained respectively, and so on, After traversing the first coding sequence, feature codes 1, feature codes 2, feature codes 3,..., Feature codes M are obtained, and thus a second code sequence 440 is obtained.
  • M is a positive integer, and its value is determined according to the length of the first coding sequence, the length of the time window, and the sliding step of the time window.
  • according to feature code 1, feature code 2, feature code 3, ..., feature code M, the corresponding control vector 1, control vector 2, control vector 3, ..., control vector M can be obtained, thereby obtaining a sequence 450 of control vectors.
  • the sequence 450 of control vectors and the second coding sequence 440 are aligned in time; since each feature code in the second coding sequence is obtained according to at least one phoneme in the phoneme sequence, each control vector in the sequence 450 of control vectors is likewise obtained from at least one phoneme in the phoneme sequence.
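  • A sketch of this windowing (Python with NumPy; the window length, step, and the stand-in "model" are assumptions, with the real mapping provided by the pre-trained network described later):

```python
import numpy as np

# Slide a fixed-length window over the first coding sequence to obtain
# feature codes 1..M, then map each feature code to a control vector.
first_code_seq = np.random.rand(300, 3)          # hypothetical smoothed codes
win_len, step = 100, 10

feature_codes = [first_code_seq[i:i + win_len]   # feature code 1 .. M
                 for i in range(0, len(first_code_seq) - win_len + 1, step)]
model = lambda code: code.mean(axis=0)           # stand-in for the real model
control_vectors = [model(code) for code in feature_codes]
# control_vectors is aligned in time with the second coding sequence
```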
  • while playing the phoneme sequence corresponding to the text data, the interactive object is driven to make actions according to the sequence of control vectors; that is, the interactive object can be driven to emit the sound corresponding to the text content and to make actions synchronized with that sound, giving the target object the feeling that the interactive object is speaking and improving the interactive experience between the target object and the interactive object.
  • because the first feature code only becomes available after the time window has slid for some time, the control vector before a set time can be set to a default value; that is, when playback of the phoneme sequence has just started, the interactive object makes a default action, and after the set time the sequence of control vectors obtained according to the first coding sequence is used to drive the interactive object to make actions. Taking FIG. 4 as an example, feature code 1 is output at time t0, and before time t0 the default control vector is output.
  • the length of the time window is related to the amount of information contained in the feature code: the more information the time window contains, the more uniform the result output by the recurrent neural network. If the length of the time window is too large, the expression of the interactive object may fail to correspond to parts of the text; if it is too small, the interactive object may appear rigid when speaking. Therefore, the duration of the time window should be determined according to the minimum duration of the phonemes corresponding to the text data, so that the actions the interactive object is driven to make correlate more strongly with the sound.
  • the sliding step of the time window is related to the time interval (frequency) for obtaining the control vector, that is, it is related to the frequency with which the interactive object is driven to make an action.
  • the length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.
  • when the time interval between phonemes in the phoneme sequence is greater than a set threshold, the interactive object is driven to make actions according to a set control vector of the local area; that is, when the interactive character pauses for a long time, the interactive object is driven to make a set action. For example, when the output voice pauses for a long time, the interactive object can be made to show a smiling expression or to swing its body slightly, to avoid the interactive object standing upright without expression during a long pause; this makes the interactive object speak more naturally and smoothly and improves the interactive experience of the target object.
  • the voice data sequence includes a voice frame sequence
  • acquiring feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquiring an acoustic feature vector corresponding to at least one voice frame according to the first acoustic feature sequence; and obtaining the feature information corresponding to the at least one voice frame according to the acoustic feature vector.
  • control parameter of at least one local area of the interactive object may be determined according to the acoustic characteristics of the speech frame sequence, and the control parameter may also be determined according to other characteristics of the speech frame sequence.
  • the acoustic feature sequence corresponding to the speech frame sequence is referred to as the first acoustic feature sequence.
  • the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-frequency cepstral coefficients (MFCC), and so on.
  • the first acoustic feature sequence is obtained by processing the entire speech frame sequence. Taking MFCC as an example, the speech frame sequence can be subjected to windowing, fast Fourier transform, Mel filtering, logarithmic processing, and discrete cosine processing to obtain the MFCC coefficients corresponding to each speech frame.
  • the first acoustic feature sequence is obtained by processing the entire voice frame sequence, and reflects the overall acoustic feature of the voice data sequence.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence.
  • the first acoustic feature sequence includes the MFCC coefficients of each speech frame.
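  • A sketch of extracting such a first acoustic feature sequence (Python with librosa; the file name, sample rate, and frame settings are assumptions, not values from the patent):

```python
import librosa

# Load audio and compute MFCCs: windowing, FFT, Mel filtering, log, and DCT
# are performed internally by librosa.feature.mfcc.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
# mfcc.shape == (13, num_frames): one 13-dimensional acoustic feature vector
# per speech frame, i.e. the first acoustic feature sequence
```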
  • the first acoustic feature sequence obtained according to the speech frame sequence is shown in FIG. 5.
  • the acoustic feature corresponding to at least one speech frame is acquired.
  • the feature vectors corresponding to the at least one voice frame, equal in number to those voice frames, may be used as the acoustic feature of those voice frames; these feature vectors can form a feature matrix, and the feature matrix is the acoustic feature of the at least one speech frame.
  • the N feature vectors in the first acoustic feature sequence form the acoustic features of the corresponding N speech frames; where N is a positive integer.
  • the first acoustic feature sequence may include a plurality of acoustic features, and the speech frames corresponding to the respective acoustic features may partially overlap.
  • the control vector of at least one local area can be acquired.
  • the local area can be selected according to the actions of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the control vectors of all the local areas can be obtained; when the expression of the interactive object needs to be controlled, the control vectors of the local areas corresponding to the face can be obtained.
  • while playing the voice data sequence, the interactive object is driven to act according to the control vectors corresponding to the acoustic features obtained from the first acoustic feature sequence, so that while the terminal device outputs sound, the interactive object performs actions matching the output sound, including facial actions, expressions, and body actions, and the target object feels that the interactive object is speaking.
  • since the control vector is related to the acoustic features of the output sound, driving according to the control vector gives the expressions and body movements of the interactive object an emotional character, making the speaking process of the interactive object more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
  • the acoustic feature corresponding to the at least one speech frame may be acquired by performing a sliding window on the first acoustic feature sequence.
  • the acoustic feature vectors within the time window are used as the acoustic feature of the corresponding, equal number of speech frames, so that the acoustic features corresponding to these speech frames are acquired.
  • the second acoustic feature sequence can be obtained according to the obtained multiple acoustic features.
  • the speech frame sequence includes 100 speech frames per second, the length of the time window is 1 s, and the step length is 0.04 s. Since each feature vector in the first acoustic feature sequence corresponds to a speech frame, correspondingly, the first acoustic feature sequence also includes 100 feature vectors per second. During the window sliding process on the first acoustic feature sequence, 100 feature vectors in the time window are obtained each time as the acoustic features of the corresponding 100 speech frames.
  • in this way, acoustic feature 1 corresponding to the 1st to 100th speech frames and acoustic feature 2 corresponding to the 5th to 104th speech frames are obtained, and so on.
  • after traversing the first acoustic feature sequence, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, thereby obtaining the second acoustic feature sequence, where M is a positive integer whose value is determined by the number of frames in the speech frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
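  • The example numbers above can be sketched as follows (Python with NumPy; the feature values are random placeholders):

```python
import numpy as np

# 100 feature vectors per second, a 1 s (100-vector) window, a 0.04 s
# (4-vector) step.
first_seq = np.random.rand(500, 13)        # 5 s of hypothetical MFCC vectors
win, step = 100, 4
M = (len(first_seq) - win) // step + 1     # number of acoustic features
second_seq = [first_seq[i * step : i * step + win] for i in range(M)]
# second_seq[0] covers frames 1-100, second_seq[1] covers frames 5-104, ...
```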
  • according to acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding control vector 1, control vector 2, ..., control vector M can be obtained respectively, so as to obtain the sequence of control vectors.
  • the sequence of the control vector and the second acoustic feature sequence are aligned in time.
  • since acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained based on N feature vectors in the first acoustic feature sequence, the interactive object can be driven to perform actions according to the sequence of control vectors while the voice frames are played.
  • similarly, the control vector before the set time can be set to a default value; that is, when playback of the speech frame sequence has just started, the interactive object makes a default action, and after the set time the interactive object is driven to make actions using the sequence of control vectors obtained according to the first acoustic feature sequence.
  • taking FIG. 5 as an example, acoustic feature 1 is output at time t0, and subsequent acoustic features are output at intervals of 0.04 s, corresponding to the step size: acoustic feature 2 is output at t1, acoustic feature 3 at t2, and so on, until acoustic feature M is output at t(M-1). Correspondingly, acoustic feature (i+1) corresponds to the time period t(i) to t(i+1), where i is an integer smaller than (M-1); before t0, the control vector is the default control vector.
  • thus, while the voice data sequence is played, the interactive object is driven to make actions according to the sequence of control vectors, so that the actions of the interactive object are synchronized with the output sound; the target object has the feeling that the interactive object is speaking, and the interactive experience between the target object and the interactive object is improved.
  • the length of the time window is related to the amount of information contained in the acoustic feature. The greater the length of the time window, the more information it contains, and the stronger the correlation between the actions and sounds that drive the interactive object.
  • the sliding step of the time window is related to the time interval (frequency) for obtaining the control vector, that is, it is related to the frequency with which the interactive object is driven to make an action.
  • the length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.
  • the acoustic feature includes Mel frequency cepstral coefficients MFCC in L dimensions, where L is a positive integer.
  • MFCC represents the distribution of the energy of the speech signal in different frequency ranges.
  • the MFCC of L dimensions can be obtained by converting multiple speech frame data in the speech frame sequence to the frequency domain and using a Mel filter including L subbands.
  • the control vector is obtained according to the MFCC of the voice data sequence, and the interactive object is driven to make facial and body actions according to the control vector, so that the expressions and body movements of the interactive object carry emotional character, making the speaking process of the interactive object more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
  • the characteristic information of the voice data unit may be input to a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the characteristic information.
  • since the recurrent neural network is a time-recursive neural network, it can learn the historical information of the input feature information and output control parameters according to the sequence of voice units; the control parameter may be, for example, a facial posture control parameter or a control vector of at least one local area.
  • in the embodiments of the present disclosure, a pre-trained recurrent neural network is used to obtain the control parameters corresponding to the feature information of the voice data unit; because it fuses the related historical feature information with the current feature information, the historical control parameters influence the change of the current control parameters, making the expression changes and body movements of the interactive character smoother and more natural.
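  • A minimal sketch of such a network (PyTorch; the patent only says "recurrent neural network", so the GRU, the dimensions, and the linear output head are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ControlParamNet(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=128, param_dim=10):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, param_dim)

    def forward(self, features):        # features: (batch, time, feat_dim)
        out, _ = self.rnn(features)     # the hidden state carries history
        return self.head(out)           # (batch, time, param_dim) control values

net = ControlParamNet()
params = net(torch.randn(1, 50, 13))    # control parameter values per time step
```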
  • the recurrent neural network can be trained in the following manner.
  • the characteristic information sample can be obtained in the following manner.
  • acquire a video segment of a character emitting voice, and extract the corresponding voice segment of the character from the video segment; for example, a video segment in which a real person is speaking can be obtained. Sample the video segment to obtain multiple first image frames, and sample the voice segment to obtain multiple voice frames.
  • the first image frame is converted into a second image frame containing the interactive object, and the control parameter value of the interactive object corresponding to the second image frame is obtained.
  • according to the control parameter value, annotate the feature information corresponding to the first image frame to obtain a feature information sample.
  • the feature information includes feature codes of phonemes
  • the control parameters include facial muscle control coefficients.
  • the feature information includes a feature code of a phoneme
  • the control parameter includes at least one partial control vector of the interactive object.
  • the characteristic information includes acoustic characteristics of the speech frame
  • the control parameter includes at least one partial control vector of the interactive object.
  • the feature information sample is not limited to the above, and corresponding to various features of various types of voice data units, corresponding feature information samples can be obtained.
  • the initial recurrent neural network is trained according to the feature information samples, and the trained recurrent neural network is obtained after the change of the network loss satisfies the convergence condition, where the network loss includes the difference between the control parameter values predicted by the recurrent neural network and the annotated control parameter values.
  • in the embodiments of the present disclosure, the video segment of a character is split into corresponding first image frames and voice frames, the first image frames containing the real person are converted into second image frames containing the interactive object, and the second image frames are used to obtain the control parameter values corresponding to the feature information of at least one voice frame; the feature information thus corresponds well to the control parameter values, yielding high-quality feature information samples and making the posture of the interactive object closer to the real posture of the corresponding character.
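  • A sketch of the training procedure implied above (PyTorch; the MSE loss, the Adam optimizer, and the data loader are assumptions; ControlParamNet is the sketch shown earlier):

```python
import torch

model = ControlParamNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

def train_epoch(dataloader):
    # each batch pairs feature-information samples with the control parameter
    # values annotated from the converted second image frames
    for features, target_params in dataloader:
        pred = model(features)
        loss = loss_fn(pred, target_params)   # prediction vs. annotation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# training stops once the change in the loss satisfies the convergence condition
```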
  • FIG. 6 shows a schematic structural diagram of a driving device for an interactive object according to at least one embodiment of the present disclosure.
  • the device may include: a first acquiring unit 601 configured to acquire driving data of the interactive object and determine the driving mode of the driving data; a second acquiring unit 602 configured to acquire, in response to the driving mode, the control parameter value of the interactive object according to the driving data; and a driving unit 603 configured to control the posture of the interactive object according to the control parameter value.
  • the device further includes an output unit for controlling the display device to output voice and/or display text according to the driving data.
  • in some embodiments, when determining the driving mode corresponding to the driving data, the first acquiring unit is specifically configured to: acquire a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, if it is detected that a voice data unit includes target data, determine that the driving mode of the driving data is the first driving mode, where the target data corresponds to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, using the preset control parameter value corresponding to the target data as the control parameter value of the interactive object.
  • the target data includes a keyword or key phrase, and the keyword or key phrase corresponds to preset control parameter values of a set action of the interactive object; or, the target data includes a syllable, and the syllable corresponds to preset control parameter values for a set mouth movement of the interactive object.
  • when identifying the driving mode of the driving data, the first acquiring unit is specifically configured to: acquire a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, if it is not detected that a voice data unit includes target data, determine that the driving mode of the driving data is the second driving mode, where the target data corresponds to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring characteristic information of at least one voice data unit in the voice data sequence, and obtaining the control parameter value of the interactive object corresponding to the characteristic information.
  • the voice data sequence includes a phoneme sequence
  • when acquiring the characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence; obtain the feature code corresponding to at least one phoneme according to the first coding sequence; and obtain the feature information of the at least one phoneme according to the feature code.
  • the voice data sequence includes a voice frame sequence
  • when acquiring characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: acquire a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquire an acoustic feature vector corresponding to at least one voice frame according to the first acoustic feature sequence; and obtain the feature information corresponding to the at least one voice frame according to the acoustic feature vector.
  • the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and a facial muscle control coefficient is used to control the motion state of at least one facial muscle; the second acquiring unit is specifically configured to: acquire the facial muscle control coefficients of the interactive object according to the driving data; the driving unit is specifically configured to: drive the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients;
  • the apparatus further includes a limb driving unit, configured to: acquire driving data of a body posture associated with the facial posture parameter values; and drive the interactive object to make limb movements according to the driving data of the body posture associated with the facial posture parameter values.
  • the control parameter of the interactive object includes a control vector of at least one local area of the interactive object; when acquiring the control parameter value of the interactive object according to the driving data, the second acquiring unit is specifically configured to: acquire a control vector of at least one local area of the interactive object according to the driving data; and the driving unit is specifically configured to: control facial actions and/or limb actions of the interactive object according to the acquired control vector of the at least one local area.
  • an electronic device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to, when executing the computer instructions, implement the method for driving an interactive object described in any of the embodiments provided in the present disclosure.
  • a computer-readable storage medium has a computer program stored thereon, and when the program is executed by a processor, the method for driving an interactive object according to any one of the embodiments provided in the present disclosure is implemented.
  • At least one embodiment of this specification also provides an electronic device. The device includes a memory and a processor; the memory is configured to store computer instructions executable on the processor, and when the processor executes the computer instructions, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
  • At least one embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object according to any embodiment of the present disclosure is implemented.
  • one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
  • the embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them.
  • the embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. The program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by the data processing apparatus.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processing and logic flows described in this specification can be executed by one or more programmable computers executing one or more computer programs, which perform corresponding functions by operating on input data and generating output.
  • the processing and logic flows can also be executed by special-purpose logic circuitry, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special-purpose logic circuitry.
  • Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • the computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such mass storage devices to receive data from them, transfer data to them, or both.
  • however, the computer does not have to have such devices.
  • the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method, apparatus and device for driving an interactive object, and a storage medium. The interactive object is displayed in a display device, and the method comprises: obtaining driving data of the interactive object, and determining the driving mode of the driving data (201); in response to the driving mode, obtaining a control parameter value of the interactive object according to the driving data (202); and controlling the posture of the interactive object according to the control parameter value (203).

Description

Method, apparatus and device for driving an interactive object, and storage medium
Cross-reference to related applications
This application is filed on the basis of the Chinese patent application with application number 2020102461120, filed on March 31, 2020, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a method, apparatus, and device for driving an interactive object, and a storage medium.
Background
Most human-computer interaction is based on key-press, touch, and voice input, with responses presented as images, text, or virtual characters on a display screen. At present, virtual characters are mostly built on top of voice assistants and only output the voice of the device.
Summary
The embodiments of the present disclosure provide a driving solution for interactive objects.
According to an aspect of the present disclosure, a method for driving an interactive object is provided, the interactive object being displayed in a display device. The method includes: acquiring driving data of the interactive object and determining a driving mode of the driving data; in response to the driving mode, acquiring a control parameter value of the interactive object according to the driving data; and controlling a posture of the interactive object according to the control parameter value.
With reference to any embodiment provided by the present disclosure, the method further includes: controlling the display device to output voice and/or display text according to the driving data.
With reference to any embodiment provided by the present disclosure, determining the driving mode corresponding to the driving data includes: acquiring a voice data sequence corresponding to the driving data according to the type of the driving data, the voice data sequence including a plurality of voice data units; and in response to detecting that a voice data unit includes target data, determining that the driving mode of the driving data is a first driving mode, the target data corresponding to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, taking the preset control parameter value corresponding to the target data as the control parameter value of the interactive object.
With reference to any embodiment provided by the present disclosure, the target data includes a keyword or key character, the keyword or key character corresponding to a preset control parameter value of a set action of the interactive object; or the target data includes a syllable, the syllable corresponding to a preset control parameter value of a set mouth-shape action of the interactive object.
With reference to any embodiment provided by the present disclosure, determining the driving mode corresponding to the driving data includes: acquiring a voice data sequence corresponding to the driving data according to the type of the driving data, the voice data sequence including a plurality of voice data units; and if it is not detected that any voice data unit includes target data, determining that the driving mode of the driving data is a second driving mode, the target data corresponding to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence, and acquiring the control parameter value of the interactive object corresponding to the feature information.
With reference to any embodiment provided by the present disclosure, the voice data sequence includes a phoneme sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence includes: performing feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; acquiring a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining feature information of the at least one phoneme according to the feature code.
With reference to any embodiment provided by the present disclosure, the voice data sequence includes a speech frame sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the speech frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each speech frame in the speech frame sequence; acquiring an acoustic feature vector corresponding to at least one speech frame according to the first acoustic feature sequence; and obtaining feature information corresponding to the at least one speech frame according to the acoustic feature vector.
With reference to any embodiment provided by the present disclosure, the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and a facial muscle control coefficient is used to control the motion state of at least one facial muscle. Acquiring the control parameter value of the interactive object according to the driving data includes: acquiring the facial muscle control coefficients of the interactive object according to the driving data. Controlling the posture of the interactive object according to the control parameter value includes: driving the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients.
With reference to any embodiment provided by the present disclosure, the method further includes: acquiring driving data of a body posture associated with the facial posture parameters; and driving the interactive object to make limb movements according to the driving data of the body posture associated with the facial posture parameter values.
With reference to any embodiment provided by the present disclosure, the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object. Acquiring the control parameter value of the interactive object according to the driving data includes: acquiring a control vector of at least one local area of the interactive object according to the driving data. Controlling the posture of the interactive object according to the control parameter value includes: controlling facial actions and/or limb actions of the interactive object according to the acquired control vector of the at least one local area.
With reference to any embodiment provided by the present disclosure, acquiring the control parameter value of the interactive object corresponding to the feature information includes: inputting the feature information into a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the feature information.
According to an aspect of the present disclosure, an apparatus for driving an interactive object is provided, the interactive object being displayed in a display device. The apparatus includes: a first acquiring unit, configured to acquire driving data of the interactive object and determine a driving mode of the driving data; a second acquiring unit, configured to acquire a control parameter value of the interactive object according to the driving data in response to the driving mode; and a driving unit, configured to control a posture of the interactive object according to the control parameter value.
According to an aspect of the present disclosure, an electronic device is provided. The device includes a memory and a processor; the memory is configured to store computer instructions executable on the processor, and the processor is configured to, when executing the computer instructions, implement the method for driving an interactive object described in any of the embodiments provided in the present disclosure.
According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the method for driving an interactive object according to any of the embodiments provided in the present disclosure is implemented.
With the method, apparatus, device, and computer-readable storage medium for driving an interactive object according to one or more embodiments of the present disclosure, the control parameter value of the interactive object is acquired according to the driving mode of the driving data of the interactive object, so as to control the posture of the interactive object. For different driving modes, the control parameter value of the interactive object can be acquired in different ways, so that the interactive object displays a posture matching the content of the driving data and/or the corresponding voice, which gives the target object the feeling of communicating with the interactive object and improves the interactive experience between the target object and the interactive object.
Description of the drawings
In order to explain one or more embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of this specification, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a schematic diagram of a display device in a method for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a process of feature encoding a phoneme sequence proposed by at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a process of obtaining control parameter values from a phoneme sequence proposed by at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a process of obtaining control parameter values from a speech frame sequence proposed by at least one embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an apparatus for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. On the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The term "and/or" in this document merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set formed by A, B, and C.
At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertising machine, an all-in-one machine, or a vehicle-mounted terminal; the server includes a local server, a cloud server, and the like. The method may also be implemented by a processor calling computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the interactive object may be any virtual image capable of interacting with a target object. In an embodiment, the interactive object may be a virtual character, or may be a virtual animal, a virtual item, a cartoon image, or any other virtual image capable of implementing interactive functions. The display form of the interactive object may be 2D or 3D, which is not limited by the present disclosure. The target object may be a user, a robot, or another smart device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object may express a demand by making gestures or body movements and trigger the interactive object to interact with it by way of active interaction. In another example, the interactive object may actively greet the target object or prompt the target object to make an action, so that the target object interacts with the interactive object in a passive manner.
The interactive object may be displayed through a terminal device, and the terminal device may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, and the like; the present disclosure does not limit the specific form of the terminal device.
FIG. 1 shows a display device proposed by at least one embodiment of the present disclosure. As shown in FIG. 1, the display device has a transparent display screen on which a stereoscopic picture can be displayed to present a virtual scene and an interactive object with a stereoscopic effect. For example, the interactive objects displayed on the transparent display screen in FIG. 1 include a virtual cartoon character. In some embodiments, the terminal device described in the present disclosure may also be the above display device with a transparent display screen. The display device is configured with a memory and a processor; the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method for driving an interactive object provided in the present disclosure when executing the computer instructions, so as to drive the interactive object displayed on the transparent display screen to communicate with or respond to the target object.
In some embodiments, in response to driving data for driving the interactive object to output voice, the interactive object may emit a specified voice to the target object. The terminal device may generate driving data according to the actions, expressions, identity, preferences, and the like of the target object around the terminal device, so as to drive the interactive object to communicate or respond by emitting the specified voice, thereby providing an anthropomorphic service for the target object. It should be noted that the sound driving data may also be generated in other ways, for example, generated by a server and sent to the terminal device.
During the interaction between the interactive object and the target object, when the interactive object is driven to emit a specified voice according to the driving data, it may not be possible to drive the interactive object to make facial actions synchronized with the specified voice, so that the interactive object appears dull and unnatural when uttering the voice, which affects the interactive experience between the target object and the interactive object. Based on this, at least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the experience of the target object interacting with the interactive object.
FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. The interactive object is displayed in a display device. As shown in FIG. 2, the method includes steps 201 to 203.
In step 201, driving data of the interactive object is acquired, and a driving mode of the driving data is determined.
In the embodiments of the present disclosure, the driving data may include audio data (voice data), text, and so on. The driving data may be generated by a server or the terminal device according to the actions, expressions, identity, preferences, and the like of the target object interacting with the interactive object, or may be directly acquired by the terminal device, for example, driving data called from an internal memory. The present disclosure does not limit the way in which the driving data is acquired.
The driving mode of the driving data can be determined according to the type of the driving data and the information contained in the driving data.
In one example, a voice data sequence corresponding to the driving data may be acquired according to the type of the driving data, where the voice data sequence includes a plurality of voice data units. A voice data unit may be formed in units of characters or words, or in units of phonemes or syllables. For driving data of the text type, a character sequence, a word sequence, and the like corresponding to the driving data can be obtained; for driving data of the audio type, a phoneme sequence, a syllable sequence, a speech frame sequence, and the like corresponding to the driving data can be obtained. In an embodiment, audio data and text data can be converted into each other; for example, the audio data is converted into text data before the voice data units are divided, or the text data is converted into audio data before the voice data units are divided, which is not limited by the present disclosure.
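The split into voice data units can be pictured as follows in Python. This is a minimal sketch: `text_to_phonemes` and `audio_to_frames` are illustrative placeholders, since the disclosure does not fix a particular front end for either data type.

```python
def text_to_phonemes(text):
    # placeholder grapheme-to-phoneme front end; a real system would use a
    # pronunciation dictionary or a G2P model
    return list(text)

def audio_to_frames(samples, frame_len=320):
    # split raw audio samples into fixed-length speech frames (20 ms at 16 kHz)
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

def to_voice_data_sequence(driving_data, data_type):
    if data_type == "text":
        return text_to_phonemes(driving_data)
    if data_type == "audio":
        return audio_to_frames(driving_data)
    raise ValueError(f"unsupported driving data type: {data_type}")
```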
When it is detected that a voice data unit includes target data, it can be determined that the driving mode of the driving data is a first driving mode, the target data corresponding to a preset control parameter value of the interactive object.
The target data may be a set keyword or key character, and the keyword or key character corresponds to a preset control parameter value of a set action of the interactive object.
In the embodiments of the present disclosure, each piece of target data is matched with a set action in advance, and each set action is realized by being controlled with corresponding control parameter values; therefore, each piece of target data is matched with the control parameter values of a set action. Taking the keyword "wave" as an example, if the voice data units contain "wave" in text form and/or "wave" in voice form, it can be determined that the driving data contains the target data, as shown in the sketch below.
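A minimal sketch of this first-driving-mode lookup follows; the keyword "wave" and the parameter names are illustrative assumptions, not values fixed by the disclosure.

```python
PRESET_CONTROL_PARAMS = {
    # target data -> preset control parameter values of a set action
    "wave": {"right_arm_raise": 0.8, "wrist_swing": 0.6},
}

def detect_driving_mode(voice_data_units):
    """Return ("first", params) if any unit contains target data, else ("second", None)."""
    for unit in voice_data_units:
        for target, params in PRESET_CONTROL_PARAMS.items():
            if target in unit:
                return "first", params
    return "second", None

mode, params = detect_driving_mode(["hello", "please wave", "goodbye"])
# mode == "first"; params are used directly as the interactive object's control values
```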
Exemplarily, the target data includes a syllable, and the syllable corresponds to a preset control parameter value of a set mouth-shape action of the interactive object.
The syllable corresponding to the target data belongs to a pre-divided syllable type, and one syllable type matches one set mouth shape. A syllable is a phonetic unit formed by combining at least one phoneme, and syllables include syllables of pinyin languages and syllables of non-pinyin languages (for example, Chinese). One syllable type refers to syllables whose pronunciation actions are consistent or basically consistent, and one syllable type may correspond to one action of the interactive object. In an embodiment, one syllable type may correspond to one set mouth shape of the interactive object when speaking, that is, to one pronunciation action. In this way, different syllable types are matched with the control parameter values of different set mouth shapes. For example, the pinyin syllables "ma", "man", and "mang" can be regarded as the same type because their pronunciation actions are basically consistent, and they can all correspond to the control parameter values of the "open mouth" shape of the interactive object when speaking.
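The syllable-type matching then amounts to a two-step table lookup, as in the hedged sketch below; the groupings and the parameter name `jaw_open` are assumptions for illustration only.

```python
# syllables with basically the same pronunciation action share one syllable type
SYLLABLE_TYPE = {"ma": "open", "man": "open", "mang": "open", "yi": "spread"}
# each syllable type maps to preset control parameter values of one set mouth shape
MOUTH_PARAMS = {"open": {"jaw_open": 0.7}, "spread": {"jaw_open": 0.2}}

def mouth_params_for(syllable):
    syllable_type = SYLLABLE_TYPE.get(syllable)
    return MOUTH_PARAMS.get(syllable_type) if syllable_type else None

print(mouth_params_for("mang"))  # {'jaw_open': 0.7}
```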
If it is not detected that any voice data unit includes target data, it can be determined that the driving mode of the driving data is a second driving mode, the target data corresponding to a preset control parameter value of the interactive object.
Those skilled in the art should understand that the above first driving mode and second driving mode are only examples, and the embodiments of the present disclosure do not limit the specific driving modes.
In step 202, in response to the driving mode, the control parameter value of the interactive object is acquired according to the driving data.
For each driving mode of the driving data, the control parameter value of the interactive object can be acquired in a corresponding manner.
In one example, in response to the first driving mode determined in step 201, the preset control parameter value corresponding to the target data may be taken as the control parameter value of the interactive object. For example, for the first driving mode, the preset control parameter value corresponding to the target data (such as "wave") contained in the voice data sequence may be taken as the control parameter value of the interactive object.
In one example, in response to the second driving mode determined in step 201, feature information of at least one voice data unit in the voice data sequence may be acquired, and the control parameter value of the interactive object corresponding to the feature information may be acquired. That is, if no target data is detected in the voice data sequence, the corresponding control parameter value can be acquired according to the feature information of the voice data units. The feature information may include feature information of a voice data unit obtained by performing feature encoding on the voice data sequence, feature information of a voice data unit obtained according to acoustic feature information of the voice data sequence, and so on.
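In code, the second driving mode reduces to a feature-extraction step followed by a learned mapping. Both callables below are placeholders, under the assumption that a feature extractor and a pretrained control network (the disclosure later mentions a recurrent neural network for this mapping) are supplied by the caller.

```python
import numpy as np

def control_params_from_features(voice_data_units, extract_features, control_net):
    # one feature vector per voice data unit, then one control parameter
    # vector per unit from the pretrained mapping
    features = np.stack([extract_features(unit) for unit in voice_data_units])
    return control_net(features)
```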
In step 203, the posture of the interactive object is controlled according to the control parameter value.
In some embodiments, the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and a facial muscle control coefficient is used to control the motion state of at least one facial muscle. In an embodiment, the facial muscle control coefficients of the interactive object may be acquired according to the driving data, and the interactive object may be driven to make facial actions matching the driving data according to the acquired facial muscle control coefficients.
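A toy driving hook might look like the following; `FaceModel.set_contraction` is a hypothetical rendering API and the muscle names are illustrative.

```python
class FaceModel:
    """Stand-in for a renderable face model; set_contraction is hypothetical."""
    def set_contraction(self, muscle, value):
        print(f"{muscle} -> {value:.2f}")

def apply_facial_muscle_coefficients(face_model, coefficients):
    # clamp each coefficient to [0, 1] and apply it to the named muscle region
    for muscle, value in coefficients.items():
        face_model.set_contraction(muscle, max(0.0, min(1.0, value)))

apply_facial_muscle_coefficients(FaceModel(), {"upper_lip": 0.4, "left_mouth_corner": 0.1})
```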
In some embodiments, the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object. In an embodiment, a control vector of at least one local area of the interactive object can be acquired according to the driving data, and facial actions and/or limb actions of the interactive object can be controlled according to the acquired control vector of the at least one local area.
The control parameter value of the interactive object is acquired according to the driving mode of the driving data of the interactive object, so as to control the posture of the interactive object. For different driving modes, the control parameter value of the interactive object can be acquired in different ways, so that the interactive object displays a posture matching the content of the driving data and/or the corresponding voice, which gives the target object the feeling of communicating with the interactive object and improves the interactive experience between the target object and the interactive object.
In some embodiments, the display device may also be controlled to output voice and/or display text according to the driving data, and the posture of the interactive object may be controlled according to the control parameter value while the voice is output and/or the text is displayed.
In the embodiments of the present disclosure, since the control parameter values match the driving data, when outputting voice and/or displaying text according to the driving data is synchronized with controlling the posture of the interactive object according to the control parameter values, the posture made by the interactive object is also synchronized with the output voice and/or the displayed text, giving the target object the feeling that the interactive object is communicating with it.
In some embodiments, the voice data sequence includes a phoneme sequence. In response to the driving data including audio data, phonemes can be formed by splitting the audio data into multiple audio frames and combining the audio frames according to their states; the phonemes formed from the audio data then form a phoneme sequence. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech, and one pronunciation action of a real person can form one phoneme. In response to the driving data being text, the phonemes corresponding to the morphemes contained in the text can be obtained, so as to obtain the corresponding phoneme sequence.
In some embodiments, the feature information of at least one voice data unit in the voice data sequence can be acquired in the following way: performing feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; acquiring a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining feature information of the at least one phoneme according to the feature code.
FIG. 3 shows a schematic diagram of the process of feature encoding a phoneme sequence. As shown in FIG. 3, the phoneme sequence 310 contains the phonemes j, i1, j, ie4 (for brevity, only some phonemes are shown), and for each kind of phoneme j, i1, ie4, a corresponding coding sequence 321, 322, 323 is obtained. In each coding sequence, the coding value at a time point where the phoneme occurs is set to a first value (for example, 1), and the coding value at a time point where the phoneme does not occur is set to a second value (for example, 0). Taking the coding sequence 321 as an example, at time points in the phoneme sequence 310 where the phoneme j occurs, the value of the coding sequence 321 is set to the first value 1; at time points where the phoneme j does not occur, the value of the coding sequence 321 is set to the second value 0. All the coding sequences 321, 322, 323 constitute the total coding sequence 320.
According to the coding values of the coding sequences 321, 322, 323 corresponding to the phonemes j, i1, ie4, and the durations of the corresponding phonemes in these three coding sequences, that is, the duration of j in the coding sequence 321, the duration of i1 in the coding sequence 322, and the duration of ie4 in the coding sequence 323, the feature information of the coding sequences 321, 322, 323 can be obtained.
For example, a Gaussian filter may be used to perform a Gaussian convolution operation on the temporally continuous values of the phonemes j, i1, ie4 in the coding sequences 321, 322, 323, respectively, to obtain the feature information of the coding sequences. That is, a Gaussian convolution operation is performed on the temporally continuous values of each phoneme through a Gaussian filter, so that the transition stages in each coding sequence, from the second value to the first value or from the first value to the second value, become smooth. A Gaussian convolution operation is performed on each of the coding sequences 321, 322, 323 to obtain the feature values of each coding sequence, where the feature values are the parameters constituting the feature information, and the feature information 330 corresponding to the phoneme sequence 310 is obtained from the set of feature information of the coding sequences. Those skilled in the art should understand that other operations may also be performed on the coding sequences to obtain their feature information, which is not limited by the present disclosure.
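A minimal sketch of this encoding-and-smoothing step follows, assuming the phoneme sequence is sampled on a uniform time grid and using SciPy's 1-D Gaussian filter for the convolution; the timings and sigma are free choices, not values from the disclosure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def encode_phoneme_sequence(timed_phonemes, num_steps, sigma=2.0):
    """timed_phonemes: list of (phoneme, start_step, end_step) tuples."""
    phonemes = sorted({p for p, _, _ in timed_phonemes})
    codes = {p: np.zeros(num_steps) for p in phonemes}
    for p, start, end in timed_phonemes:
        codes[p][start:end] = 1.0          # first value where the phoneme sounds
    # the Gaussian convolution smooths the 0 -> 1 and 1 -> 0 transitions
    smoothed = [gaussian_filter1d(codes[p], sigma) for p in phonemes]
    return np.stack(smoothed)              # feature information of the sequence

features = encode_phoneme_sequence([("j", 0, 3), ("i1", 3, 8), ("ie4", 11, 16)], 20)
```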
In the embodiments of the present disclosure, the feature information of the coding sequences is obtained according to the duration of each kind of phoneme in the phoneme sequence, so that the transition stages of the coding sequences are smooth; for example, besides 0 and 1, the values of the coding sequences also take intermediate values such as 0.2, 0.3, and so on. The posture parameter values acquired according to these intermediate values make the posture changes of the interactive character, and in particular its expression changes, more gentle and natural, which improves the interactive experience of the target object.
In some embodiments, the facial posture parameters may include facial muscle control coefficients.
From an anatomical point of view, the movement of the human face is the result of the coordinated deformation of the various facial muscles. Therefore, a facial muscle model is obtained by dividing the facial muscles of the interactive object, and the movement of each muscle (area) obtained by the division is controlled through a corresponding facial muscle control coefficient, that is, contraction/expansion control is performed on it, so that the face of the interactive character can make various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the value range of its control coefficient is 0 to 1, and different values within this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the mouth can be opened and closed vertically. For the left mouth corner muscle, the value range of its control coefficient is 0 to 1, and different values within this range correspond to the contraction/expansion states of the left mouth corner muscle; by changing this value, the mouth can be changed laterally.
While sound is output according to the phoneme sequence, the interactive object is driven to make facial expressions according to the facial muscle control coefficients corresponding to the phoneme sequence, so that while the display device outputs the sound, the interactive object synchronously makes the expressions of emitting that sound, giving the target object the feeling that the interactive object is speaking and improving the interactive experience of the target object.
In some embodiments, the facial actions of the interactive object may be associated with a body posture, that is, the facial posture parameter values corresponding to a facial action may be associated with the body posture, and the body posture may include limb actions, gesture actions, walking postures, and so on.
During the driving of the interactive object, driving data of the body posture associated with the facial posture parameter values is acquired; while the sound is output according to the phoneme sequence, the interactive object is driven to make limb movements according to the driving data of the body posture associated with the facial posture parameter values. That is, while the interactive object is driven to make facial actions according to the driving data of the interactive object, the driving data of the associated body posture is also acquired according to the facial posture parameter values corresponding to those facial actions, so that when the sound is output, the interactive object can be driven to make the corresponding facial actions and limb actions synchronously, making the speaking state of the interactive object more vivid and natural and improving the interactive experience of the target object.
Since the output of sound needs to remain continuous, in an embodiment, a time window is moved over the phoneme sequence, and the phonemes within the time window during each movement are output, with a set duration as the step size of each movement of the time window. For example, the length of the time window may be set to 1 second, and the set duration to 0.1 second. While the phonemes within the time window are output, the posture parameter value corresponding to the phoneme, or to the feature information of the phoneme, at a set position of the time window is acquired, and the posture of the interactive object is controlled using that posture parameter value. The set position is at a set duration from the starting position of the time window; for example, when the length of the time window is set to 1 s, the set position may be 0.5 s from the starting position of the time window. With each movement of the time window, while the phonemes within the time window are output, the posture of the interactive object is controlled by the posture parameter value corresponding to the set position of the time window, so that the posture of the interactive object is synchronized with the output voice, giving the target object the feeling that the interactive object is speaking.
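The windowed playback can be sketched as a scheduling loop; the 1 s window, 0.1 s step, and 0.5 s in-window offset are the example values from the text, and `params_at` stands in for the lookup of posture parameter values at a time point.

```python
def drive_with_time_window(duration_s, params_at, window_s=1.0, step_s=0.1, offset_s=0.5):
    t = 0.0
    schedule = []
    while t + window_s <= duration_s:
        # phonemes in [t, t + window_s) are output; the posture uses the value
        # at the window's set position t + offset_s
        schedule.append((t, params_at(t + offset_s)))
        t += step_s
    return schedule

# e.g. drive_with_time_window(3.0, params_at=lambda t: {"mouth_open": 0.5})
```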
By changing the set duration, the time interval (frequency) of acquiring the posture parameter values can be changed, thereby changing the frequency at which the interactive object makes postures. The set duration can be set according to the actual interaction scenario, so that the posture changes of the interactive object are more natural.
In some embodiments, the posture of the interactive object can be controlled by acquiring a control vector of at least one local area of the interactive object.
A local area is obtained by dividing the whole of the interactive object (including face and/or body). The control of one or more local areas of the face may correspond to a series of facial expressions or actions of the interactive object; for example, the control of the eye area may correspond to facial actions of the interactive object such as opening the eyes, closing the eyes, blinking, and changing the viewing angle, and the control of the mouth area may correspond to facial actions of the interactive object such as closing the mouth and opening the mouth to different degrees. The control of one or more local areas of the body may correspond to a series of limb actions of the interactive object; for example, the control of the leg area may correspond to actions of the interactive object such as walking, jumping, and kicking.
The control parameters of a local area of the interactive object include the control vector of the local area. The posture control vector of each local area is used to drive the local area of the interactive object to perform actions. Different control vector values correspond to different actions or action amplitudes. For example, for the control vector of the mouth area, one set of control vector values may make the mouth of the interactive object open slightly, while another set of control vector values may make the mouth of the interactive object open wide. By driving the interactive object with different control vector values, the corresponding local areas can be made to perform different actions or actions with different amplitudes.
The local areas can be selected according to the actions of the interactive object to be controlled. For example, when the face and limbs of the interactive object need to be controlled to act at the same time, the control vectors of all local areas can be acquired; when the expression of the interactive object needs to be controlled, the control vectors of the local areas corresponding to the face can be acquired.
In some embodiments, the feature code corresponding to at least one phoneme can be acquired by sliding a window over the first coding sequence, where the first coding sequence may be a coding sequence after the Gaussian convolution operation.
A window is slid over the coding sequence with a time window of a set length and a set step size, and the feature code within the time window is taken as the feature code of the corresponding at least one phoneme; after the sliding is completed, a second coding sequence can be obtained from the multiple feature codes thus acquired. As shown in FIG. 4, by sliding a time window of a set length over the first coding sequence 420, or over the smoothed first coding sequence 430, feature code 1, feature code 2, feature code 3, and so on are obtained; after the first coding sequence is traversed, feature code 1, feature code 2, feature code 3, ..., feature code M are obtained, and the second coding sequence 440 is thereby obtained, where M is a positive integer whose value is determined according to the length of the first coding sequence, the length of the time window, and the sliding step size of the time window.
According to feature code 1, feature code 2, feature code 3, ..., feature code M, the corresponding control vector 1, control vector 2, control vector 3, ..., control vector M can be obtained, thereby obtaining the sequence of control vectors 450.
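A hedged sketch of this windowing over the first coding sequence follows; `control_model` is a placeholder for the pretrained mapping from a feature code to a control vector, and the window and step sizes are arbitrary example values.

```python
import numpy as np

def control_vector_sequence(first_coding_seq, control_model, win=10, step=5):
    # first_coding_seq: array of shape (num_phoneme_types, num_steps), e.g. the
    # smoothed first coding sequence; each window is one feature code
    feature_codes = [
        first_coding_seq[:, i:i + win]
        for i in range(0, first_coding_seq.shape[1] - win + 1, step)
    ]
    # control vector 1 .. control vector M, aligned with the second coding sequence
    return [control_model(code) for code in feature_codes]

# e.g. vectors = control_vector_sequence(features, lambda code: code.mean(axis=1))
```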
控制向量的序列450与第二编码序列440在时间上是对齐的,由于所述第二编码序列中的每个特征编码是根据音素序列中的至少一个音素获得的,因此控制向量的序列450中的每个特征向量同样是根据音素序列中的至少一个音素获得的。在播放文本数据所对应的音素序列的同时,根据所述控制向量的序列驱动所述交互对象做出动作,即能够实现驱动交互对象发出文本内容所对应的声音的同时,做出与声音同步的动作,给目标对象以所述交互对象正在说话的感觉,提升了目标对象与交互对象的交互体验。The sequence 450 of the control vector and the second coding sequence 440 are aligned in time. Since each feature code in the second coding sequence is obtained according to at least one phoneme in the phoneme sequence, the sequence 450 of the control vector Each feature vector of is also obtained from at least one phoneme in the phoneme sequence. While playing the phoneme sequence corresponding to the text data, the interactive object is driven to make an action according to the sequence of the control vector, that is, it can drive the interactive object to emit the sound corresponding to the text content, and make the sound synchronized with the sound. The action gives the target object the feeling that the interactive object is speaking, and improves the interactive experience between the target object and the interactive object.
Assuming the feature codes begin to be output at a set time within the first time window, the control vectors before that set time may be set to a default value; that is, when the phoneme sequence first starts playing, the interactive object performs a default action, and only after the set time is it driven to act by the sequence of control vectors derived from the first encoding sequence. Taking FIG. 4 as an example, feature code 1 begins to be output at time t0, and before t0 the default control vector is output.
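A small sketch of this timeline logic; the output start time t0, the step size, and the default vector are assumed inputs rather than values fixed by the disclosure.

    def control_vector_at(t, t0, step, control_vectors, default_vector):
        # Before the first window's output time t0, fall back to the default
        # vector so the object holds a neutral action; afterwards pick the
        # control vector whose output interval covers time t.
        if t < t0:
            return default_vector
        i = min(int((t - t0) / step), len(control_vectors) - 1)
        return control_vectors[i]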
The length of the time window is related to the amount of information contained in the feature codes. When the time window contains a large amount of information, processing by the recurrent neural network yields a relatively smooth result. If the time window is too long, the interactive object's expression while speaking may fail to correspond to parts of the text; if it is too short, the expression may appear stiff. The duration of the time window therefore needs to be determined according to the minimum duration of the phonemes corresponding to the text data, so that the actions driven on the interactive object correlate more strongly with the sound.
The step size of the sliding window is related to the time interval (frequency) at which control vectors are obtained, i.e., the frequency at which the interactive object is driven to act. The length and step size of the time window can be set according to the actual interaction scene, so that the expressions and actions of the interactive object are more closely related to the sound and appear more vivid and natural.
In some embodiments, when the time interval between phonemes in the phoneme sequence is greater than a set threshold, the interactive object is driven to act according to a set control vector for the local area. That is, when the interactive character pauses for a relatively long time while speaking, the interactive object is driven to perform a set action. For example, during a long pause in the output speech, the interactive object can be made to smile or sway its body slightly, avoiding an expressionless, rigid stance during long pauses; this makes the interactive object's speaking process more natural and fluid and improves the target object's interactive experience.
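One way this pause handling could look, assuming per-phoneme start times and a preset "idle" control vector; both names are hypothetical stand-ins.

    def insert_idle_actions(phoneme_times, threshold, idle_vector, vectors):
        # phoneme_times: start time of each phoneme. Where the gap between
        # consecutive phonemes exceeds the threshold, substitute the preset
        # idle control vector (e.g. a slight smile or gentle body sway).
        out = list(vectors)
        for i in range(1, len(phoneme_times)):
            if phoneme_times[i] - phoneme_times[i - 1] > threshold:
                out[i] = idle_vector
        return out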
In some embodiments, the voice data sequence includes a voice frame sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquiring, according to the first acoustic feature sequence, the acoustic feature vector corresponding to at least one voice frame; and obtaining, according to the acoustic feature vector, the feature information corresponding to the at least one voice frame.
In the embodiments of the present disclosure, the control parameters of at least one local area of the interactive object may be determined according to the acoustic features of the voice frame sequence, or according to other features of the voice frame sequence.
First, the acoustic feature sequence corresponding to the voice frame sequence is acquired. To distinguish it from acoustic feature sequences mentioned later, the acoustic feature sequence corresponding to the voice frame sequence is referred to here as the first acoustic feature sequence.
In the embodiments of the present disclosure, the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-frequency cepstral coefficients (MFCCs), and so on.
The first acoustic feature sequence is obtained by processing the voice frame sequence as a whole. Taking MFCC features as an example, the MFCC coefficients corresponding to each voice frame can be obtained by windowing each voice frame in the sequence, applying a fast Fourier transform, mel filtering, taking logarithms, and applying a discrete cosine transform.
Because the first acoustic feature sequence is obtained by processing the voice frame sequence as a whole, it reflects the overall acoustic characteristics of the voice data sequence.
In the embodiments of the present disclosure, the first acoustic feature sequence contains an acoustic feature vector corresponding to each voice frame in the voice frame sequence. Taking MFCCs as an example, the first acoustic feature sequence contains the MFCC coefficients of each voice frame. The first acoustic feature sequence obtained from the voice frame sequence is shown in FIG. 5.
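The per-frame MFCC pipeline described above (windowing, FFT, mel filtering, log, DCT) is what standard audio libraries implement. The sketch below uses librosa; the sample rate, frame length, hop, and number of coefficients are assumptions for illustration, not values specified in the disclosure.

    import librosa

    def first_acoustic_feature_sequence(wav_path, n_mfcc=13):
        # Load the speech, then compute one MFCC vector per frame:
        # windowing -> FFT -> mel filtering -> log -> DCT, as described above.
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)  # 25 ms / 10 ms
        return mfcc.T  # shape (num_frames, n_mfcc): one vector per voice frame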
Next, the acoustic feature corresponding to at least one voice frame is acquired according to the first acoustic feature sequence.
Where the first acoustic feature sequence includes an acoustic feature vector for each voice frame in the voice frame sequence, the same number of feature vectors as the at least one voice frame may be taken as the acoustic feature of those voice frames. These feature vectors can form a feature matrix, and that feature matrix constitutes the acoustic feature of the at least one voice frame.
Taking FIG. 5 as an example, N feature vectors in the first acoustic feature sequence form the acoustic feature of the corresponding N voice frames, where N is a positive integer. The first acoustic feature sequence may include multiple such acoustic features, and the voice frames corresponding to the respective acoustic features may partially overlap.
Finally, the control vector of at least one local area of the interactive object corresponding to the acoustic feature is acquired.
For the acquired acoustic feature corresponding to at least one voice frame, the control vector of at least one local area can be obtained. The local areas can be selected according to the actions of the interactive object that need to be controlled; for example, when the face and limbs of the interactive object need to act simultaneously, the control vectors of all local areas can be obtained, and when only the expression needs to be controlled, the control vectors of the local areas corresponding to the face can be obtained.
By driving the interactive object to act according to the control vectors corresponding to the acoustic features obtained from the first acoustic feature sequence while the voice data sequence is played, the terminal device can output sound while the interactive object performs actions that match that sound, including facial actions, expressions, and limb actions, giving the target object the feeling that the interactive object is speaking. Moreover, since the control vectors are related to the acoustic features of the output sound, driving the interactive object according to them lends emotional nuance to its expressions and limb actions, making its speech appear more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
In some embodiments, the acoustic feature corresponding to the at least one voice frame may be obtained by sliding a window over the first acoustic feature sequence.
By sliding a time window of a set length over the first acoustic feature sequence with a set step size, the acoustic feature vectors within the time window are taken as the acoustic feature of the corresponding (same number of) voice frames, yielding the acoustic feature jointly corresponding to those frames. After the sliding is complete, a second acoustic feature sequence is obtained from the resulting acoustic features.
Taking the driving method of the interactive object shown in FIG. 5 as an example, the voice frame sequence contains 100 voice frames per second, the time window is 1 s long, and the step size is 0.04 s. Since each feature vector in the first acoustic feature sequence corresponds to a voice frame, the first acoustic feature sequence likewise contains 100 feature vectors per second. During the sliding, the 100 feature vectors within the time window are obtained each time as the acoustic feature of the corresponding 100 voice frames. By moving the time window over the first acoustic feature sequence in steps of 0.04 s, acoustic feature 1 corresponding to voice frames 1 to 100 and acoustic feature 2 corresponding to voice frames 5 to 104 are obtained, and so on; after traversing the first acoustic feature sequence, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, yielding the second acoustic feature sequence. Here M is a positive integer whose value is determined by the number of frames in the voice frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
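The numbers in this example can be reproduced directly; the sketch below assumes the first acoustic feature sequence is an array of per-frame feature vectors, with window and step converted from seconds to frames as above.

    frames_per_sec = 100
    win_len = int(1.0 * frames_per_sec)    # 1 s window  -> 100 feature vectors
    step = int(0.04 * frames_per_sec)      # 0.04 s step -> 4 feature vectors

    def second_acoustic_feature_sequence(first_seq):
        # first_seq: (T, L) per-frame acoustic feature vectors (e.g. MFCCs).
        return [first_seq[s:s + win_len]
                for s in range(0, len(first_seq) - win_len + 1, step)]

    # For a 10 s clip (T = 1000): M = (1000 - 100) // 4 + 1 = 226 windows.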
From acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding control vector 1, control vector 2, ..., control vector M can be obtained respectively, yielding the sequence of control vectors.
As shown in FIG. 5, the sequence of control vectors is aligned in time with the second acoustic feature sequence. Acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained from N feature vectors in the first acoustic feature sequence; therefore, the interactive object can be driven to act according to the sequence of control vectors while the voice frames are played.
Assuming the acoustic features begin to be output at a set time within the first time window, the control vectors before that set time may be set to a default value; that is, when the voice frame sequence first starts playing, the interactive object performs a default action, and only after the set time is it driven to act by the sequence of control vectors derived from the first acoustic feature sequence.
Taking FIG. 5 as an example, acoustic feature 1 begins to be output at time t0, with subsequent acoustic features output at intervals of 0.04 s (the time corresponding to the step size): acoustic feature 2 from time t1, acoustic feature 3 from time t2, and so on, until acoustic feature M is output at time t(M-1). Correspondingly, the period from ti to t(i+1) corresponds to acoustic feature (i+1), where i is an integer less than (M-1); before time t0, the default control vector is used.
In the embodiments of the present disclosure, by driving the interactive object to act according to the sequence of control vectors while the voice data sequence is played, the actions of the interactive object are synchronized with the output sound, giving the target object the feeling that the interactive object is speaking and improving the interactive experience between the target object and the interactive object.
The length of the time window is related to the amount of information contained in the acoustic feature: the longer the window, the more information it contains, and the stronger the correlation between the driven actions and the sound. The step size of the sliding window is related to the time interval (frequency) at which control vectors are obtained, i.e., the frequency at which the interactive object is driven to act. The length and step size of the time window can be set according to the actual interaction scene, so that the expressions and actions of the interactive object are more closely related to the sound and appear more vivid and natural.
In some embodiments, the acoustic feature includes L-dimensional Mel-frequency cepstral coefficients (MFCCs), where L is a positive integer. MFCCs describe the distribution of the energy of a speech signal across frequency ranges; the L-dimensional MFCCs can be obtained by converting the voice frame data in the voice frame sequence to the frequency domain and applying a mel filter bank comprising L sub-bands. By obtaining control vectors from the MFCCs of the voice data sequence and driving the facial and limb actions of the interactive object according to those vectors, the expressions and limb actions of the interactive object acquire emotional nuance, making its speech more natural and vivid and improving the interactive experience between the target object and the interactive object.
In some embodiments, the feature information of the voice data unit may be input into a pre-trained recurrent neural network to obtain the control parameter values of the interactive object corresponding to that feature information. Since a recurrent neural network is a time-recursive neural network, it can learn the historical information of the input feature information and output control parameters according to the sequence of voice units; the control parameters may be, for example, facial posture control parameters, or the control vector of at least one local area.
In the embodiments of the present disclosure, a pre-trained recurrent neural network is used to obtain the control parameters corresponding to the feature information of the voice data units, fusing correlated historical feature information with the current feature information, so that historical control parameters influence the changes of the current control parameters and the expression changes and limb actions of the interactive character become smoother and more natural.
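The disclosure does not fix a particular recurrent architecture. As one plausible sketch, a small GRU maps each unit's feature information to a control-parameter vector, with the recurrent state carrying the historical information discussed above; all dimensions here are hypothetical.

    import torch
    import torch.nn as nn

    class ControlParamRNN(nn.Module):
        def __init__(self, feat_dim=16, hidden=128, param_dim=24):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, param_dim)

        def forward(self, feats):
            # feats: (batch, seq_len, feat_dim) feature information per unit.
            out, _ = self.rnn(feats)   # history is carried in the GRU state
            return self.head(out)      # (batch, seq_len, param_dim)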
In some embodiments, the recurrent neural network can be trained in the following manner.
First, feature information samples are acquired. For example, the feature information samples can be obtained as follows.
Acquire a video segment of a character speaking, and extract the corresponding speech segment from the video segment; for example, a video segment in which a real person is speaking may be obtained. Sample the video segment to obtain multiple first image frames containing the character, and sample the speech segment to obtain multiple voice frames.
Acquire, according to the voice data units contained in the voice frames corresponding to the first image frames, the feature information corresponding to those voice frames.
Convert the first image frames into second image frames containing the interactive object, and acquire the control parameter values of the interactive object corresponding to the second image frames.
Annotate the feature information corresponding to the first image frames with the control parameter values to obtain the feature information samples, as sketched below.
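A compact sketch of this sample-construction loop; the three callables are hypothetical stand-ins for the feature extraction, real-person-to-interactive-object conversion, and control-parameter fitting steps described above.

    def build_samples(image_frames, voice_frames,
                      extract_feature_info, to_interactive_object,
                      fit_control_params):
        # Pair each first image frame with its voice frame, compute the
        # feature information, retarget the frame to the interactive object,
        # and annotate the feature with the fitted control parameter values.
        samples = []
        for img, voice in zip(image_frames, voice_frames):
            feat = extract_feature_info(voice)
            params = fit_control_params(to_interactive_object(img))
            samples.append((feat, params))
        return samples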
In some embodiments, the feature information includes the feature codes of phonemes, and the control parameters include facial muscle control coefficients. Following the above method for acquiring feature information samples, the feature codes of the phonemes corresponding to the first image frames are annotated with the obtained facial muscle control coefficients, yielding the feature information samples corresponding to the feature codes of the phonemes.
In some embodiments, the feature information includes the feature codes of phonemes, and the control parameters include the control vector of at least one local area of the interactive object. Following the above method, the feature codes of the phonemes corresponding to the first image frames are annotated with the obtained control vector(s) of at least one local area, yielding the feature information samples corresponding to the feature codes of the phonemes.
In some embodiments, the feature information includes the acoustic features of voice frames, and the control parameters include the control vector of at least one local area of the interactive object. Following the above method, the acoustic features of the voice frames corresponding to the first image frames are annotated with the obtained control vector(s) of at least one local area, yielding the feature information samples corresponding to the acoustic features of the voice frames.
Those skilled in the art should understand that the feature information samples are not limited to the above; corresponding feature information samples can be obtained for the various features of each type of voice data unit.
After the feature information samples are obtained, an initial recurrent neural network is trained on them, and the trained recurrent neural network is obtained once the change in the network loss satisfies a convergence condition, where the network loss includes the difference between the control parameter values predicted by the recurrent neural network and the annotated control parameter values.
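A minimal training-loop sketch consistent with this description, using mean-squared error between predicted and annotated control parameter values and a simple loss-change convergence test; the optimizer, learning rate, and tolerance are assumptions.

    import torch
    import torch.nn as nn

    def train(model, loader, epochs=10, lr=1e-4, tol=1e-5):
        # loader yields (features, annotated_params) batches built as above.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()           # predicted vs. annotated values
        prev = float("inf")
        for _ in range(epochs):
            total = 0.0
            for feats, target in loader:
                pred = model(feats)
                loss = loss_fn(pred, target)
                opt.zero_grad()
                loss.backward()
                opt.step()
                total += loss.item()
            if abs(prev - total) < tol:  # a simple convergence condition
                break
            prev = total
        return model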
In the embodiments of the present disclosure, a character's video segment is split into corresponding first image frames and voice frames, and the first image frames containing the real person are converted into second image frames containing the interactive object in order to obtain the control parameter values corresponding to the feature information of at least one voice frame. This yields a good correspondence between the feature information and the control parameter values, and thus high-quality feature information samples, so that the posture of the interactive object is closer to the real posture of the corresponding character.
FIG. 6 shows a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 6, the apparatus may include: a first acquiring unit 601, configured to acquire the driving data of the interactive object and determine the driving mode of the driving data; a second acquiring unit 602, configured to acquire, in response to the driving mode, the control parameter values of the interactive object according to the driving data; and a driving unit 603, configured to control the posture of the interactive object according to the control parameter values.
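Read as code, the three units could be composed as below; this is a hypothetical object-oriented rendering for illustration, not the disclosure's implementation.

    class InteractiveObjectDriver:
        # A sketch of units 601-603 as injected callables.
        def __init__(self, first_acquiring_unit, second_acquiring_unit,
                     driving_unit):
            self.first = first_acquiring_unit     # driving data + driving mode
            self.second = second_acquiring_unit   # mode -> control parameters
            self.drive = driving_unit             # parameters -> posture

        def step(self, raw_input):
            driving_data, mode = self.first(raw_input)
            params = self.second(mode, driving_data)
            self.drive(params)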
In some embodiments, the apparatus further includes an output unit configured to control the display device to output speech and/or display text according to the driving data.
In some embodiments, when determining the driving mode corresponding to the driving data, the first acquiring unit is specifically configured to: acquire, according to the type of the driving data, the voice data sequence corresponding to the driving data, the voice data sequence including multiple voice data units; and, if target data is detected in the voice data units, determine that the driving mode of the driving data is a first driving mode, the target data corresponding to preset control parameter values of the interactive object. Acquiring the control parameter values of the interactive object according to the driving data in response to the driving mode then includes: in response to the first driving mode, taking the preset control parameter values corresponding to the target data as the control parameter values of the interactive object.
In some embodiments, the target data includes a keyword or a key character corresponding to preset control parameter values of a set action of the interactive object; or, the target data includes a syllable corresponding to preset control parameter values of a set mouth-shape action of the interactive object.
In some embodiments, when identifying the driving mode of the driving data, the first acquiring unit is specifically configured to: acquire, according to the type of the driving data, the voice data sequence corresponding to the driving data, the voice data sequence including multiple voice data units; and, if no target data is detected in the voice data units, determine that the driving mode of the driving data is a second driving mode, the target data corresponding to preset control parameter values of the interactive object. Acquiring the control parameter values of the interactive object according to the driving data in response to the driving mode then includes: in response to the second driving mode, acquiring the feature information of at least one voice data unit in the voice data sequence, and acquiring the control parameter values of the interactive object corresponding to that feature information.
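The first/second driving-mode decision can be sketched as a lookup over the voice data units; the table mapping target data (keywords or syllables) to preset control parameter values is a hypothetical name.

    def determine_driving_mode(voice_units, target_table):
        # target_table: target data -> preset control parameter values.
        for unit in voice_units:
            if unit in target_table:
                return "first", target_table[unit]  # use presets directly
        return "second", None  # fall back to feature extraction + prediction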
In some embodiments, the voice data sequence includes a phoneme sequence, and when acquiring the feature information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain a first encoding sequence corresponding to the phoneme sequence; acquire, according to the first encoding sequence, the feature code corresponding to at least one phoneme; and obtain, according to the feature code, the feature information of the at least one phoneme.
In some embodiments, the voice data sequence includes a voice frame sequence, and when acquiring the feature information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: acquire the first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquire, according to the first acoustic feature sequence, the acoustic feature vector corresponding to at least one voice frame; and obtain, according to the acoustic feature vector, the feature information corresponding to the at least one voice frame.
In some embodiments, the control parameters of the interactive object include facial posture parameters, the facial posture parameters including facial muscle control coefficients used to control the motion state of at least one facial muscle. When acquiring the control parameter values of the interactive object according to the driving data, the second acquiring unit is specifically configured to acquire the facial muscle control coefficients of the interactive object according to the driving data, and the driving unit is specifically configured to drive the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients. The apparatus further includes a limb driving unit configured to acquire the driving data of the body posture associated with the facial posture parameters, and to drive the interactive object to make limb actions according to the driving data of the body posture associated with the facial posture parameter values.
In some embodiments, the control parameters of the interactive object include the control vector of at least one local area of the interactive object. When acquiring the control parameter values of the interactive object according to the driving data, the second acquiring unit is specifically configured to acquire the control vector of at least one local area of the interactive object according to the driving data, and the driving unit is specifically configured to control the facial actions and/or limb actions of the interactive object according to the acquired control vector(s) of the at least one local area.
According to an aspect of the present disclosure, an electronic device is provided. The device includes a memory and a processor; the memory is used to store computer instructions executable on the processor, and the processor is used to implement the method for driving an interactive object described in any of the embodiments provided in the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method for driving an interactive object described in any of the embodiments provided in the present disclosure is implemented.
At least one embodiment of this specification further provides an electronic device. As shown in FIG. 7, the device includes a memory and a processor; the memory is used to store computer instructions executable on the processor, and the processor is used to implement the method for driving an interactive object described in any embodiment of the present disclosure when executing the computer instructions.
At least one embodiment of this specification further provides a computer-readable storage medium, on which a computer program is stored; when the program is executed by a processor, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the data processing device embodiment is described relatively simply because it is substantially similar to the method embodiment; for the relevant parts, reference may be made to the corresponding description of the method embodiment.
The foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this specification may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification may be performed by one or more programmable computers executing one or more computer programs, performing the corresponding functions by operating on input data and generating output. The processes and logic flows may also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as special-purpose logic circuitry.
Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or is operatively coupled to such mass storage devices to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown, or in sequential order, or that all illustrated operations be performed, to achieve desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments; it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments fall within the scope of the appended claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve desired results. In some implementations, multitasking and parallel processing may be advantageous.
The above descriptions are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit one or more embodiments of this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.

Claims (20)

1. A method for driving an interactive object, the interactive object being displayed in a display device, the method comprising:
    acquiring driving data of the interactive object, and determining a driving mode of the driving data;
    in response to the driving mode, acquiring control parameter values of the interactive object according to the driving data; and
    controlling a posture of the interactive object according to the control parameter values.
2. The method according to claim 1, further comprising: controlling the display device to output speech and/or display text according to the driving data.
3. The method according to claim 1 or 2, wherein determining the driving mode corresponding to the driving data comprises:
    acquiring, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting that the voice data units include target data, determining that the driving mode of the driving data is a first driving mode, the target data corresponding to preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the first driving mode, taking the preset control parameter values corresponding to the target data as the control parameter values of the interactive object.
4. The method according to claim 3, wherein the target data comprises a keyword or a key character, the keyword or the key character corresponding to preset control parameter values of a set action of the interactive object; or,
    the target data comprises a syllable, the syllable corresponding to preset control parameter values of a set mouth-shape action of the interactive object.
5. The method according to any one of claims 1 to 4, wherein determining the driving mode of the driving data comprises:
    acquiring, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting no target data in the voice data units, determining that the driving mode of the driving data is a second driving mode, the target data corresponding to the preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence; and
    acquiring the control parameter values of the interactive object corresponding to the feature information.
6. The method according to claim 5, wherein the voice data sequence comprises a phoneme sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence comprises:
    performing feature encoding on the phoneme sequence to obtain a first encoding sequence corresponding to the phoneme sequence;
    acquiring, according to the first encoding sequence, a feature code corresponding to at least one phoneme; and
    obtaining, according to the feature code, feature information of the at least one phoneme.
7. The method according to claim 5, wherein the voice data sequence comprises a voice frame sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence comprises:
    acquiring a first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence comprising an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
    acquiring, according to the first acoustic feature sequence, an acoustic feature vector corresponding to at least one voice frame; and
    obtaining, according to the acoustic feature vector, feature information corresponding to the at least one voice frame.
8. The method according to any one of claims 1 to 7, wherein the control parameters of the interactive object comprise facial posture parameters, the facial posture parameters comprising facial muscle control coefficients used to control a motion state of at least one facial muscle;
    acquiring the control parameter values of the interactive object according to the driving data comprises:
    acquiring facial muscle control coefficients of the interactive object according to the driving data; and
    controlling the posture of the interactive object according to the control parameter values comprises:
    driving the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients.
9. The method according to claim 8, further comprising:
    acquiring driving data of a body posture associated with the facial posture parameters; and
    driving the interactive object to make limb actions according to the driving data of the body posture associated with the facial posture parameter values.
10. The method according to any one of claims 1 to 9, wherein the control parameters of the interactive object comprise a control vector of at least one local area of the interactive object;
    acquiring the control parameter values of the interactive object according to the driving data comprises:
    acquiring the control vector of at least one local area of the interactive object according to the driving data; and
    controlling the posture of the interactive object according to the control parameter values comprises:
    controlling facial actions and/or limb actions of the interactive object according to the acquired control vector of the at least one local area.
11. The method according to claim 5, wherein acquiring the control parameter values of the interactive object corresponding to the feature information comprises:
    inputting the feature information into a pre-trained recurrent neural network to obtain the control parameter values of the interactive object corresponding to the feature information.
12. A driving apparatus for an interactive object, the interactive object being displayed in a display device, the apparatus comprising:
    a first acquiring unit, configured to acquire driving data of the interactive object and determine a driving mode of the driving data;
    a second acquiring unit, configured to acquire, in response to the driving mode, control parameter values of the interactive object according to the driving data; and
    a driving unit, configured to control a posture of the interactive object according to the control parameter values.
13. The apparatus according to claim 12, further comprising an output unit configured to control the display device to output speech and/or display text according to the driving data.
14. The apparatus according to claim 12 or 13, wherein, when determining the driving mode corresponding to the driving data, the first acquiring unit is configured to:
    acquire, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting that the voice data units include target data, determine that the driving mode of the driving data is a first driving mode, the target data corresponding to preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the first driving mode, taking the preset control parameter values corresponding to the target data as the control parameter values of the interactive object;
    wherein the target data comprises a keyword or a key character, the keyword or the key character corresponding to preset control parameter values of a set action of the interactive object; or,
    the target data comprises a syllable, the syllable corresponding to preset control parameter values of a set mouth-shape action of the interactive object.
15. The apparatus according to any one of claims 12 to 14, wherein, when determining the driving mode of the driving data, the first acquiring unit is configured to:
    acquire, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting no target data in the voice data units, determine that the driving mode of the driving data is a second driving mode, the target data corresponding to the preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence; and
    acquiring the control parameter values of the interactive object corresponding to the feature information.
  16. 根据权利要求15所述的装置,其中,所述语音数据序列包括音素序列,在获取所述语音数据序列中的至少一个语音数据单元的特征信息时,所述第二获取单元用于:The device according to claim 15, wherein the voice data sequence comprises a phoneme sequence, and when acquiring characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is configured to:
    对所述音素序列进行特征编码,获得所述音素序列对应的第一编码序列;Performing feature encoding on the phoneme sequence to obtain a first encoding sequence corresponding to the phoneme sequence;
    根据所述第一编码序列,获取至少一个音素对应的特征编码;Obtaining a feature code corresponding to at least one phoneme according to the first coding sequence;
    根据所述特征编码,获得所述至少一个音素的特征信息;Obtaining the characteristic information of the at least one phoneme according to the characteristic encoding;
    或者,所述语音数据序列包括语音帧序列,在获取所述语音数据序列中的至少一个语音数据单元的特征信息时,所述第二获取单元用于:Alternatively, the voice data sequence includes a voice frame sequence, and when acquiring characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is configured to:
    获取所述语音帧序列对应的第一声学特征序列,所述第一声学特征序列包括与所述语音帧序列中的每个语音帧对应的声学特征向量;Acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
    根据所述第一声学特征序列,获取至少一个语音帧对应的声学特征向量;Obtaining an acoustic feature vector corresponding to at least one speech frame according to the first acoustic feature sequence;
    根据所述声学特征向量,获得所述至少一个语音帧对应的特征信息。According to the acoustic feature vector, feature information corresponding to the at least one speech frame is obtained.
  17. 根据权利要求12至16任一项所述的装置,其中,所述交互对象的控制参数包括面部姿态参数,所述面部姿态参数包括面部肌肉控制系数,所述面部肌肉控制系数用于控制至少一个面部肌肉的运动状态;The device according to any one of claims 12 to 16, wherein the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and the facial muscle control coefficients are used to control at least one The movement state of facial muscles;
    在根据所述驱动数据获取所述交互对象的控制参数值时,所述第二获取单元用于:When acquiring the control parameter value of the interactive object according to the driving data, the second acquiring unit is configured to:
    根据所述驱动数据获取所述交互对象的所述面部肌肉控制系数;Acquiring the facial muscle control coefficient of the interactive object according to the driving data;
    所述驱动单元用于:The driving unit is used for:
    根据所获取的所述面部肌肉控制系数,驱动所述交互对象做出与所述驱动数据匹配的面部动作;According to the acquired facial muscle control coefficient, driving the interactive object to make a facial action matching the driving data;
    所述装置还包括肢体驱动单元,所述肢体驱动单元用于获取与所述面部姿态参数关联的身体姿态的驱动数据;根据与所述面部姿态参数值关联的身体姿态的驱动数据,驱动所述交互对象做出肢体动作。The device also includes a limb drive unit, which is used to obtain body posture drive data associated with the facial posture parameter; and drive the body posture drive data associated with the facial posture parameter value. The interactive object makes physical movements.
  18. The apparatus according to any one of claims 12 to 16, wherein the control parameters of the interactive object comprise a control vector of at least one local region of the interactive object;
    when acquiring the control parameter value of the interactive object according to the driving data, the second acquiring unit is configured to:
    acquire the control vector of the at least one local region of the interactive object according to the driving data;
    the driving unit is configured to:
    control a facial action and/or a limb action of the interactive object according to the acquired control vector of the at least one local region.
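As a rough illustration of claim 18's per-region control vectors, the sketch below dispatches each local region's vector to a facial or limb controller; the region names, the vector dimensions, and the facial/limb split are assumptions for illustration only.

```python
import numpy as np

def drive_regions(control_vectors):
    # Dispatch each local region's control vector to a facial or limb
    # controller; the region names and the split below are illustrative.
    facial = {"mouth", "eyes", "brows"}
    for region, vec in control_vectors.items():
        kind = "facial" if region in facial else "limb"
        print(f"{kind} action, region={region}, vector={np.round(vec, 2)}")

drive_regions({
    "mouth": np.array([0.7, 0.2]),     # e.g. open amount, stretch
    "left_arm": np.array([0.1, 0.9]),  # e.g. raise, bend
})
```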
  19. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method according to any one of claims 1 to 11 when executing the computer instructions.
  20. A computer-readable storage medium having a computer program stored thereon, wherein the method according to any one of claims 1 to 11 is implemented when the computer program is executed by a processor.
PCT/CN2020/129806 2020-03-31 2020-11-18 Method, apparatus and device for driving interactive object, and storage medium WO2021196645A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021556973A JP7227395B2 (en) 2020-03-31 2020-11-18 Interactive object driving method, apparatus, device, and storage medium
KR1020217031139A KR102707613B1 (en) 2020-03-31 2020-11-18 Methods, apparatus, devices and storage media for driving interactive objects

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010246112.0A CN111459452B (en) 2020-03-31 2020-03-31 Driving method, device and equipment of interaction object and storage medium
CN202010246112.0 2020-03-31

Publications (1)

Publication Number Publication Date
WO2021196645A1 true WO2021196645A1 (en) 2021-10-07

Family

ID=71683479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129806 WO2021196645A1 (en) 2020-03-31 2020-11-18 Method, apparatus and device for driving interactive object, and storage medium

Country Status (5)

Country Link
JP (1) JP7227395B2 (en)
KR (1) KR102707613B1 (en)
CN (1) CN111459452B (en)
TW (1) TWI760015B (en)
WO (1) WO2021196645A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460785B (en) * 2020-03-31 2023-02-28 Beijing SenseTime Technology Development Co., Ltd. Method, device and equipment for driving interactive object and storage medium
CN111459452B (en) * 2020-03-31 2023-07-18 Beijing SenseTime Technology Development Co., Ltd. Driving method, device and equipment of interaction object and storage medium
CN111459450A (en) * 2020-03-31 2020-07-28 Beijing SenseTime Technology Development Co., Ltd. Interactive object driving method, device, equipment and storage medium
CN113050859B (en) * 2021-04-19 2023-10-24 Beijing SenseTime Technology Development Co., Ltd. Driving method, device and equipment of interaction object and storage medium
CN114283227B (en) * 2021-11-26 2023-04-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Virtual character driving method and device, electronic equipment and readable storage medium
CN116977499B (en) * 2023-09-21 2024-01-16 Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (Futian) Combined generation method of facial and body movement parameters and related equipment

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6570555B1 (en) * 1998-12-30 2003-05-27 Fuji Xerox Co., Ltd. Method and apparatus for embodied conversational characters with multimodal input/output in an interface device
JP4661074B2 (en) * 2004-04-07 2011-03-30 Sony Corporation Information processing system, information processing method, and robot apparatus
KR101370897B1 (en) * 2007-03-19 2014-03-11 LG Electronics Inc. Method for controlling image, and terminal therefor
AU2012253367B2 (en) * 2011-05-11 2015-04-30 The Cleveland Clinic Foundation Interactive graphical map visualization for healthcare
CN102609969B (en) * 2012-02-17 2013-08-07 Shanghai Jiao Tong University Method for processing face and speech synchronous animation based on Chinese text drive
JP2015166890A (en) * 2014-03-03 2015-09-24 Sony Corporation Information processing apparatus, information processing system, information processing method, and program
JP2016038601A (en) * 2014-08-05 2016-03-22 Japan Broadcasting Corporation CG character interaction device and CG character interaction program
CN106056989B (en) * 2016-06-23 2018-10-16 Guangdong Genius Technology Co., Ltd. Language learning method and device and terminal equipment
CN107329990A (en) * 2017-06-06 2017-11-07 Beijing Guangnian Wuxian Technology Co., Ltd. Emotion output method and dialogue interaction system for a virtual robot
CN107704169B (en) * 2017-09-26 2020-11-17 Beijing Guangnian Wuxian Technology Co., Ltd. Virtual human state management method and system
CN107861626A (en) * 2017-12-06 2018-03-30 Beijing Guangnian Wuxian Technology Co., Ltd. Method and system for waking up a virtual image
KR101992424B1 (en) * 2018-02-06 2019-06-24 Persona System Co., Ltd. Apparatus for making artificial intelligence character for augmented reality and service system using the same
CN108942919B (en) * 2018-05-28 2021-03-30 Beijing Guangnian Wuxian Technology Co., Ltd. Interaction method and system based on virtual human
CN109739350A (en) * 2018-12-24 2019-05-10 Wuhan Xishan Yichuang Culture Co., Ltd. AI intelligent assistant device based on a transparent liquid crystal display and interaction method therefor
CN110176284A (en) * 2019-05-21 2019-08-27 Hangzhou Normal University Speech apraxia rehabilitation training method based on virtual reality
CN110288682B (en) * 2019-06-28 2023-09-26 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for controlling changes in a three-dimensional virtual portrait mouth shape
CN110716634A (en) * 2019-08-28 2020-01-21 Beijing SenseTime Technology Development Co., Ltd. Interaction method, device, equipment and display equipment
CN110815258B (en) * 2019-10-30 2023-03-31 South China University of Technology Robot teleoperation system and method based on electromagnetic force feedback and augmented reality

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180160077A1 (en) * 2016-04-08 2018-06-07 Maxx Media Group, LLC System, Method and Software for Producing Virtual Three Dimensional Avatars that Actively Respond to Audio Signals While Appearing to Project Forward of or Above an Electronic Display
CN110876024A (en) * 2018-08-31 2020-03-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for determining lip action of avatar
CN109377539A (en) * 2018-11-06 2019-02-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for generating animation
CN110009716A (en) * 2019-03-28 2019-07-12 NetEase (Hangzhou) Network Co., Ltd. Facial expression generation method and device, electronic equipment and storage medium
CN110688008A (en) * 2019-09-27 2020-01-14 Guizhou Xiaoai Robot Technology Co., Ltd. Virtual image interaction method and device
CN111459452A (en) * 2020-03-31 2020-07-28 Beijing SenseTime Technology Development Co., Ltd. Interactive object driving method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932706A (en) * 2022-04-15 2023-10-24 Huawei Technologies Co., Ltd. Chinese translation method and electronic equipment

Also Published As

Publication number Publication date
JP2022531072A (en) 2022-07-06
KR20210129713A (en) 2021-10-28
CN111459452A (en) 2020-07-28
CN111459452B (en) 2023-07-18
TWI760015B (en) 2022-04-01
KR102707613B1 (en) 2024-09-19
JP7227395B2 (en) 2023-02-21
TW202138970A (en) 2021-10-16

Similar Documents

Publication Publication Date Title
WO2021196645A1 (en) Method, apparatus and device for driving interactive object, and storage medium
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
WO2021196646A1 (en) Interactive object driving method and apparatus, device, and storage medium
US20200279553A1 (en) Linguistic style matching agent
WO2021196644A1 (en) Method, apparatus and device for driving interactive object, and storage medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
WO2021232876A1 (en) Method and apparatus for driving virtual human in real time, and electronic device and medium
CN110148406B (en) Data processing method and device for data processing
CN110162598B (en) Data processing method and device for data processing
WO2022252890A1 (en) Interaction object driving and phoneme processing methods and apparatus, device and storage medium
WO2021232877A1 (en) Method and apparatus for driving virtual human in real time, and electronic device, and medium
WO2021196647A1 (en) Method and apparatus for driving interactive object, device, and storage medium
CN110166844B (en) Data processing method and device for data processing
CN112632262A (en) Conversation method, conversation device, computer equipment and storage medium

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021556973

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20217031139

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20929260

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20929260

Country of ref document: EP

Kind code of ref document: A1