WO2021196645A1 - Method, apparatus and device for driving interactive object, and storage medium - Google Patents

Method, apparatus and device for driving interactive object, and storage medium

Info

Publication number
WO2021196645A1
Authority
WO
WIPO (PCT)
Prior art keywords
interactive object
data
driving
sequence
control parameter
Prior art date
Application number
PCT/CN2020/129806
Other languages
French (fr)
Chinese (zh)
Inventor
张子隆 (Zhang Zilong)
吴文岩 (Wu Wenyan)
吴潜溢 (Wu Qianyi)
许亲亲 (Xu Qinqin)
Original Assignee
Beijing SenseTime Technology Development Co., Ltd. (北京市商汤科技开发有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SenseTime Technology Development Co., Ltd.
Priority to JP2021556973A (patent JP7227395B2)
Priority to KR1020217031139A (patent KR102707613B1)
Publication of WO2021196645A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a method, apparatus, and device for driving interactive objects, and a storage medium.
  • the embodiments of the present disclosure provide a driving solution for interactive objects.
  • a method for driving an interactive object is provided, comprising: obtaining driving data of the interactive object and determining a driving mode of the driving data; in response to the driving mode, obtaining the control parameter value of the interactive object according to the driving data; and controlling the posture of the interactive object according to the control parameter value.
  • the method further includes: controlling the display device to output voice and/or display text according to the driving data.
  • the determining the driving mode corresponding to the driving data includes: obtaining a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, in response to detecting that a voice data unit includes target data, determining that the driving mode of the driving data is the first driving mode, where the target data corresponds to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, using the preset control parameter value corresponding to the target data as the control parameter value of the interactive object.
  • the target data includes a keyword or key phrase, and the keyword or key phrase corresponds to preset control parameter values of a set action of the interactive object; or,
  • the target data includes a syllable, and the syllable corresponds to preset control parameter values for a set mouth movement of the interactive object.
  • the determining the driving mode corresponding to the driving data includes: obtaining a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, if it is not detected that a voice data unit includes target data, determining that the driving mode of the driving data is the second driving mode, where the target data corresponds to a preset control parameter value of the interactive object.
  • obtaining the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, obtaining characteristic information of at least one voice data unit in the voice data sequence; and obtaining the control parameter value of the interactive object corresponding to the characteristic information.
  • the voice data sequence includes a phoneme sequence
  • the acquiring feature information of at least one voice data unit in the voice data sequence includes: performing feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; obtaining a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining feature information of the at least one phoneme according to the feature code.
  • the voice data sequence includes a voice frame sequence
  • the acquiring characteristic information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquiring, according to the first acoustic feature sequence, an acoustic feature vector corresponding to at least one voice frame; and obtaining, according to the acoustic feature vector, feature information corresponding to the at least one voice frame.
  • the control parameter of the interactive object includes a facial posture parameter
  • the facial posture parameter includes a facial muscle control coefficient
  • the facial muscle control coefficient is used to control the motion state of at least one facial muscle
  • the obtaining the control parameter value of the interactive object according to the driving data includes: obtaining the facial muscle control coefficient of the interactive object according to the driving data; and the controlling the posture of the interactive object according to the control parameter value includes: driving the interactive object, according to the acquired facial muscle control coefficient, to make facial actions matching the driving data.
  • the method further includes: acquiring driving data of the body posture associated with the facial posture parameter value; and driving, according to the driving data of the body posture associated with the facial posture parameter value, the interactive object to make body movements.
  • the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object; the obtaining the control parameter value of the interactive object according to the driving data includes: obtaining a control vector of at least one local area of the interactive object according to the driving data; and the controlling the posture of the interactive object according to the control parameter value includes: controlling the facial movements and/or body movements of the interactive object according to the obtained control vector of the at least one local area.
  • the obtaining the control parameter value of the interactive object corresponding to the characteristic information includes: inputting the characteristic information into a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the characteristic information.
  • a device for driving an interactive object is provided.
  • the interactive object is displayed in a display device.
  • the device includes: a first acquiring unit configured to acquire driving data of the interactive object and determine the driving mode of the driving data; a second acquiring unit configured to acquire, in response to the driving mode, a control parameter value of the interactive object according to the driving data; and a driving unit configured to control the posture of the interactive object according to the control parameter value.
  • an electronic device includes a memory and a processor, where the memory is used to store computer instructions runnable on the processor, and the processor is configured to, when executing the computer instructions, implement the method for driving interactive objects described in any of the embodiments provided in the present disclosure.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the method for driving an interactive object according to any one of the embodiments provided in the present disclosure is realized.
  • the method, apparatus, device, and computer-readable storage medium for driving an interactive object according to the present disclosure obtain the control parameter value of the interactive object according to the driving mode of the driving data of the interactive object, thereby controlling the posture of the interactive object. For different driving modes, the control parameter value of the interactive object can be obtained in different ways, so that the interactive object displays a posture matching the content of the driving data and/or the corresponding voice; the target object thus feels that it is communicating with the interactive object, and the interactive experience between the target object and the interactive object is improved.
  • FIG. 1 is a schematic diagram of a display device in a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for driving interactive objects proposed by at least one embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a process of feature encoding for a phoneme sequence proposed by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a process of obtaining control parameter values according to a phoneme sequence according to at least one embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of a process of obtaining control parameter values according to a sequence of speech frames proposed by at least one embodiment of the present disclosure
  • FIG. 6 is a schematic structural diagram of a driving device for interactive objects proposed in at least one embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram of an electronic device provided by at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a method for driving interactive objects.
  • the driving method may be executed by electronic devices such as a terminal device or a server.
  • the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, or a game console.
  • the server includes a local server or a cloud server, etc., and the method can also be implemented by a processor calling computer-readable instructions stored in a memory.
  • the interaction object may be any virtual image capable of interacting with the target object.
  • the interactive object may be a virtual character, or may also be a virtual animal, virtual item, cartoon image, or other virtual images capable of implementing interactive functions.
  • the display form of the interactive object may be 2D or 3D, which is not limited in the present disclosure.
  • the target object may be a user, a robot, or other smart devices.
  • the interaction manner between the interaction object and the target object may be an active interaction manner or a passive interaction manner.
  • the target object can make a demand by making gestures or body movements, and trigger the interactive object to interact with it by means of active interaction.
  • the interactive object may actively greet the target object, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.
  • the interactive objects may be displayed through terminal devices, which may be televisions, all-in-one machines with display functions, projectors, virtual reality (VR) devices, augmented reality (AR) devices, and the like; the present disclosure does not limit the specific form of the terminal device.
  • Fig. 1 shows a display device proposed by at least one embodiment of the present disclosure.
  • the display device has a transparent display screen, and a stereoscopic picture can be displayed on the transparent display screen to present a virtual scene and interactive objects with a stereoscopic effect.
  • the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters.
  • the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen.
  • the display device is configured with a memory and a processor, and the memory is used to store computer instructions that can run on the processor.
  • the processor is used to implement the method for driving the interactive object provided in the present disclosure when the computer instruction is executed, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.
  • in response to driving data for driving the interactive object to output voice, the interactive object may emit a specified voice to the target object.
  • the terminal device can generate driving data according to the actions, expressions, identities, preferences, etc. of the target object around the terminal device to drive the interactive object to communicate or respond by issuing a specified voice, thereby providing anthropomorphic services for the target object.
  • the sound driving data can also be generated in other ways, for example, generated by the server and sent to the terminal device.
  • when the interactive object is driven to emit a specified voice according to the driving data, it may not be possible to drive the interactive object to make facial movements synchronized with the specified voice, making the interactive object dull, rigid, and unnatural when uttering the voice and affecting the interactive experience between the target object and the interactive object. Based on this, at least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the interaction experience between the target object and the interactive object.
  • FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure.
  • the interactive object is displayed on a display device.
  • the method includes steps 201 to 203.
  • step 201 the driving data of the interactive object is acquired, and the driving mode of the driving data is determined.
  • the driving data may include audio data (voice data), text, and so on.
  • the driving data may be generated by the server or terminal device according to the actions, expressions, identity, preferences, etc. of the target object interacting with the interactive object, or may be driving data directly obtained by the terminal device, such as driving data called from an internal memory.
  • the present disclosure does not limit the acquisition method of the driving data.
  • the driving mode of the driving data can be determined.
  • the voice data sequence corresponding to the drive data may be obtained according to the type of the drive data, where the voice data sequence includes multiple voice data units.
  • the voice data unit may be formed in units of characters or words, or may be formed in units of phonemes or syllables.
  • corresponding to a text type of driving data, the character sequence, word sequence, etc. corresponding to the driving data can be obtained;
  • corresponding to an audio type of driving data, the phoneme sequence, syllable sequence, speech frame sequence, and so on corresponding to the driving data can be obtained.
  • audio data and text data can be converted to each other. For example, the audio data is converted into text data and then the voice data unit is divided, or the text data is converted into audio data and then the voice data unit is divided, which is not limited in the present disclosure.
  • when it is detected that the voice data unit includes target data, it can be determined that the driving mode of the driving data is the first driving mode, where the target data corresponds to a preset control parameter value of the interactive object.
  • the target data may be a set keyword or key phrase, etc., and the keyword or key phrase corresponds to a preset control parameter value of a set action of the interactive object.
  • each piece of target data is matched with a set action in advance, and each set action is controlled by its corresponding control parameter values, so each piece of target data matches the control parameter values of a set action.
  • for example, if the voice data unit contains "wave" in text form and/or "wave" in voice form, it can be determined that the driving data contains target data.
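  • The Python sketch below (not from the patent; the keyword list, parameter values, and function name are invented for illustration) scans the voice data units for target data and falls back to the second driving mode when none is found:

```python
# Hypothetical sketch: choosing a driving mode by scanning voice data units
# for target data (keywords); keywords and parameter values are invented.
PRESET_CONTROL_PARAMS = {
    "wave": [0.0, 0.8, 0.3],  # assumed control parameter values for "wave"
    "nod":  [0.5, 0.1, 0.0],  # assumed control parameter values for "nod"
}

def select_driving_mode(voice_data_units):
    """Return ("first", params) if any unit contains target data, else ("second", None)."""
    for unit in voice_data_units:
        for keyword, params in PRESET_CONTROL_PARAMS.items():
            if keyword in unit:
                return "first", params
    return "second", None

mode, params = select_driving_mode(["hello there", "please wave goodbye"])
# mode == "first"; params are the preset control parameter values for "wave"
```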
  • the target data includes a syllable
  • the syllable corresponds to a preset control parameter value for setting a mouth movement of the interactive object.
  • the syllable in the target data belongs to a pre-divided syllable type, and one syllable type matches one set mouth shape.
  • a syllable is a phonetic unit formed by a combination of at least one phoneme, and the syllable includes a syllable of a pinyin language and a syllable of a non-pinyin language (for example, Chinese).
  • a syllable type refers to syllables whose pronunciation actions are the same or basically the same, and one syllable type can correspond to one action of the interactive object.
  • a syllable type may correspond to a set mouth shape when the interactive object speaks, that is, it corresponds to a pronunciation action.
  • different syllable types are matched with different control parameter values for setting the mouth shape.
  • for example, the pinyin syllables of the "ma", "man", and "mang" type can be regarded as the same type because their pronunciation actions are basically the same, and they can all correspond to the control parameter values of the "mouth open" mouth shape used when the interactive object speaks.
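  • As a sketch of this syllable-type matching (Python; the syllable table, type names, and parameter values are invented for illustration, not taken from the patent):

```python
# Hypothetical sketch: syllables with basically the same pronunciation action
# share a syllable type, and each type maps to mouth-shape control parameter
# values; all names and numbers here are invented.
SYLLABLE_TYPE = {"ma": "open", "man": "open", "mang": "open", "bo": "round"}
MOUTH_PARAMS = {"open": {"jaw_open": 0.7}, "round": {"lip_round": 0.6}}

def mouth_params_for(syllable):
    syllable_type = SYLLABLE_TYPE.get(syllable)
    return MOUTH_PARAMS.get(syllable_type)  # None if not target data

print(mouth_params_for("man"))  # {'jaw_open': 0.7}
```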
  • if it is not detected that the voice data unit includes target data, it may be determined that the driving mode of the driving data is the second driving mode, where the target data corresponds to the preset control parameter value of the interactive object.
  • the first driving mode and the second driving mode are only examples, and the embodiments of the present disclosure do not limit the specific driving modes.
  • step 202 in response to the driving mode, the control parameter value of the interactive object is obtained according to the driving data.
  • for different driving modes, the control parameter value of the interactive object can be obtained in a corresponding manner.
  • the preset control parameter value corresponding to the target data may be used as the control parameter value of the interaction object.
  • the preset control parameter value corresponding to the target data such as "wave" contained in the voice data sequence may be used as the control parameter value of the interactive object.
  • in response to the second driving mode, characteristic information of at least one voice data unit in the voice data sequence may be acquired, and the control parameter value of the interactive object corresponding to the characteristic information may be acquired. That is, when the target data is not detected in the voice data sequence, the corresponding control parameter value can be obtained according to the characteristic information of the voice data unit.
  • the feature information may include feature information of a voice data unit obtained by performing feature encoding on the voice data sequence, feature information of a voice data unit obtained according to the acoustic feature information of the voice data sequence, and so on.
  • step 203 the posture of the interactive object is controlled according to the control parameter value.
  • control parameters of the interactive object include facial posture parameters, and the facial posture parameters include facial muscle control coefficients, and the facial muscle control coefficients are used to control the motion state of at least one facial muscle.
  • for example, the facial muscle control coefficient of the interactive object may be acquired according to the driving data, and the interactive object may be driven, according to the acquired facial muscle control coefficient, to make facial actions matching the driving data.
  • control parameter value of the interactive object includes a control vector of at least one local area of the interactive object.
  • a control vector of at least one local area of the interactive object can be acquired according to the driving data, and the facial movements and/or body movements of the interactive object can be controlled according to the acquired control vector of the at least one local area.
  • by obtaining the control parameter value of the interactive object according to the driving mode of the driving data of the interactive object, the posture of the interactive object is controlled; for different driving modes, the control parameter value of the interactive object can be obtained in different ways, so that the interactive object shows a posture matching the content of the driving data and/or the corresponding voice. The target object thus has the feeling of communicating with the interactive object, and the interactive experience between the target object and the interactive object is improved.
  • the display device may also be controlled to output voice and/or display text according to the driving data. And while outputting voice and/or displaying text, the gesture of the interactive object can be controlled according to the control parameter value.
  • since the voice and/or text output according to the driving data is synchronized with the control of the posture according to the control parameter value, the gesture made by the interactive object is also synchronized with the output voice and/or displayed text, thereby giving the target object the feeling that the interactive object is communicating with it.
  • the speech data sequence includes a phoneme sequence.
  • the audio data may be split into multiple audio frames, and the audio frames may be combined according to their states to form phonemes; the phonemes formed from the audio data constitute a phoneme sequence.
  • the phoneme is the smallest phonetic unit divided according to the natural attributes of the speech, and a pronunciation action of a real person can form a phoneme.
  • for text-type driving data, the phonemes corresponding to the morphemes contained in the text can be obtained, so as to obtain the corresponding phoneme sequence.
  • the feature information of at least one voice data unit in the voice data sequence may be obtained by: performing feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence; obtaining, according to the first coding sequence, the feature code corresponding to at least one phoneme; and obtaining the feature information of the at least one phoneme according to the feature code.
  • Figure 3 shows a schematic diagram of the process of feature encoding on a phoneme sequence.
  • the phoneme sequence 310 contains phonemes j, i1, and ie4 (for brevity, only some phonemes are shown), and corresponding code sequences 321, 322, and 323 are obtained for phonemes j, i1, and ie4, respectively.
  • for each code sequence, the coding value at a time point where the corresponding phoneme is present is set to a first value (for example, 1), and the coding value at a time point where the phoneme is not present is set to a second value (for example, 0).
  • taking the code sequence 321 as an example, at time points where phoneme j is present, its value is set to the first value, 1; at time points where phoneme j is not present, its value is set to the second value, 0. All the code sequences 321, 322, and 323 together constitute the total coding sequence 320.
  • according to the coding values of the code sequences 321, 322, and 323 corresponding to phonemes j, i1, and ie4, and the durations of the corresponding phonemes in the three code sequences (that is, the duration of j in code sequence 321, the duration of i1 in code sequence 322, and the duration of ie4 in code sequence 323), the characteristic information of the code sequences 321, 322, and 323 can be obtained.
  • Gaussian filters may be used to perform Gaussian convolution operations on the temporally consecutive values of phonemes j, i1, and ie4 in the code sequences 321, 322, and 323, respectively, to obtain the feature information of the code sequences. That is, the Gaussian filter performs a Gaussian convolution operation on the temporally continuous values of each phoneme, so that the transitions of the coding values in each code sequence from the second value to the first value, or from the first value to the second value, become smooth. A Gaussian convolution operation is performed on each of the code sequences 321, 322, and 323 to obtain the feature values of each code sequence; these feature values constitute the parameters of the feature information, and the set of the feature information of the code sequences yields the feature information 330 corresponding to the phoneme sequence 310. Those skilled in the art should understand that other operations may also be performed on each code sequence to obtain its characteristic information, which is not limited in the present disclosure.
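  • A minimal sketch of this encoding and smoothing (Python with NumPy/SciPy; the phoneme timeline and the Gaussian sigma are assumptions for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Each phoneme gets a 0/1 code sequence over the timeline (1 while that
# phoneme sounds), and a Gaussian filter smooths each sequence so that the
# transitions pass through intermediate values such as 0.2 or 0.3.
timeline = ["j", "j", "i1", "i1", "i1", "j", "ie4", "ie4"]  # hypothetical frames
phonemes = ["j", "i1", "ie4"]

code_sequences = {p: np.array([1.0 if t == p else 0.0 for t in timeline])
                  for p in phonemes}
feature_info = {p: gaussian_filter1d(seq, sigma=1.0)
                for p, seq in code_sequences.items()}
# feature_info["j"] now rises and falls smoothly instead of jumping 0 <-> 1
```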
  • in the embodiments of the present disclosure, the characteristic information of the code sequence is obtained according to the duration of each phoneme in the phoneme sequence, so that the transitions of the code sequence are smooth; for example, besides 0 and 1, the code sequence also takes intermediate values such as 0.2, 0.3, and so on. The posture parameter values obtained from these intermediate values make the posture changes of the interactive character smoother and more natural, and in particular make the expression changes of the interactive character smoother and more natural, improving the interactive experience of the target object.
  • the facial posture parameters may include facial muscle control coefficients.
  • a facial muscle model is obtained by dividing the facial muscles of the interactive object; by controlling each muscle (region) obtained by the division through its corresponding facial muscle control coefficient, that is, performing contraction/expansion control on it, the face of the interactive character can be made to show various expressions.
  • the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the control coefficient ranges from 0 to 1, and different values within this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the mouth can be opened and closed vertically. For the left mouth corner muscle, the control coefficient likewise ranges from 0 to 1, and different values within this range correspond to contraction/expansion states of the left mouth corner muscle; by changing this value, a lateral change of the mouth can be achieved.
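  • A hypothetical sketch of applying such coefficients (the muscle names and the face_model API are placeholders, not from the patent):

```python
# Drive a facial muscle model with control coefficients in [0, 1]; the
# face_model object and its set_contraction method are assumed placeholders.
def apply_muscle_coefficients(face_model, coefficients):
    for muscle, value in coefficients.items():
        clamped = min(max(value, 0.0), 1.0)       # keep within the 0-1 range
        face_model.set_contraction(muscle, clamped)

# e.g. apply_muscle_coefficients(face_model,
#          {"upper_lip": 0.6, "left_mouth_corner": 0.2})
# would open the mouth vertically and stretch it slightly sideways
```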
  • while outputting sound according to the phoneme sequence, the interactive object is driven to make facial expressions according to the facial muscle control coefficients corresponding to the phoneme sequence, so that when the display device outputs the sound, the interactive object synchronously makes the expressions that utter the sound; the target object thus feels that the interactive object is speaking, and the interactive experience of the target object is improved.
  • the facial motion of the interactive object may be associated with the body posture, that is, the facial posture parameter value corresponding to the facial motion may be associated with the body posture.
  • the body posture may include body movements, gesture movements, walking postures, and so on.
  • the driving data of the body posture associated with the facial posture parameter value may be acquired, and while the sound is output according to the phoneme sequence, the interactive object may be driven to make body movements according to the driving data of the body posture associated with the facial posture parameter value. That is, while the interactive object is driven to make a facial action according to its driving data, the driving data of the associated body posture is also obtained according to the facial posture parameter value corresponding to the facial action, so that when the sound is output, the interactive object can be driven to make the corresponding facial and body movements synchronously, making the speaking state of the interactive object more vivid and natural and improving the interactive experience of the target object.
  • a time window is moved along the phoneme sequence, and the phonemes within the time window are output at each move, with a set duration used as the step size of each move of the time window. For example, the length of the time window can be set to 1 second and the set duration to 0.1 second.
  • while the phonemes are output, the posture parameter value corresponding to the phoneme, or to the feature information of the phoneme, at a set position of the time window is obtained, and this posture parameter value is used to control the posture of the interactive object; the set position is the position at a set distance from the start of the time window. For example, when the length of the time window is 1 s, the set position may be 0.5 s from the start of the time window.
  • with each move of the time window, the posture of the interactive object is controlled by the posture parameter value corresponding to the set position of the time window, so that the posture of the interactive object is synchronized with the output voice and the target object feels that the interactive object is speaking.
  • since the set duration is the step size, changing the set duration changes the time interval (frequency) at which the posture parameter value is obtained, thereby changing the frequency at which the interactive object changes its posture.
  • the set duration can be set according to the actual interactive scene, so that the posture of the interactive object changes more naturally.
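  • The timing described above can be sketched as follows (Python; the numbers come from the example in the text, while get_param_at is a hypothetical lookup of the posture parameter value at a given time):

```python
# 1 s window sliding in 0.1 s steps; the posture parameter value is read at a
# set position 0.5 s from the window start.
WINDOW_LEN, STEP, SET_POS = 1.0, 0.1, 0.5

def posture_value_stream(get_param_at, total_duration):
    start = 0.0
    while start + WINDOW_LEN <= total_duration:
        yield get_param_at(start + SET_POS)  # value used to control the pose now
        start += STEP                        # move the window by the step size
```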
  • the posture of the interactive object can be controlled by acquiring the control vector of at least one partial area of the interactive object.
  • the local area is obtained by dividing the whole (including face and/or body) of the interactive object.
  • the control of one or more local areas of the face may correspond to a series of facial expressions or actions of the interactive object.
  • the control of the eye area may correspond to the facial actions of the interactive object such as opening, closing, blinking, and changing the perspective;
  • the control of the mouth area can correspond to facial actions such as closing the mouth of the interactive object and opening the mouth to different degrees.
  • the control of one or more local areas of the body may correspond to a series of physical actions of the interactive object.
  • the control of the leg area may correspond to the actions of the interactive object such as walking, jumping, and kicking.
  • the control parameter of the local area of the interactive object includes the control vector of the local area.
  • the attitude control vector of each local area is used to drive the local area of the interactive object to perform actions.
  • Different control vector values correspond to different actions or action ranges. For example, for the control vector of the mouth area, one set of control vector values can make the mouth of the interactive object slightly open, while another set of control vector values can make the mouth of the interactive object open wider. By driving the interactive objects with different control vector values, the corresponding local areas can perform different actions or actions with different amplitudes.
  • the local area can be selected according to the actions of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the control vectors of all the local areas can be obtained; when the expression of the interactive object needs to be controlled, the control vectors of the local areas corresponding to the face can be obtained.
  • the feature code corresponding to at least one phoneme may be obtained by performing a sliding window on the first code sequence.
  • the first coding sequence may be a coding sequence after a Gaussian convolution operation.
  • a sliding window is performed on the coding sequence with a time window of a set length and a set step size, and the feature code in the time window is used as the feature code of the corresponding at least one phoneme.
  • from the multiple feature codes obtained in this way, a second coding sequence can be obtained.
  • FIG. 4 by sliding a time window of a set length on the first coding sequence 420 or the smoothed first coding sequence 430, feature code 1, feature code 2, feature code 3 are obtained respectively, and so on, After traversing the first coding sequence, feature codes 1, feature codes 2, feature codes 3,..., Feature codes M are obtained, and thus a second code sequence 440 is obtained.
  • M is a positive integer, and its value is determined according to the length of the first coding sequence, the length of the time window, and the sliding step of the time window.
  • according to feature code 1, feature code 2, feature code 3, ..., feature code M, the corresponding control vector 1, control vector 2, control vector 3, ..., control vector M can be obtained, thereby obtaining a sequence 450 of control vectors.
  • the sequence 450 of control vectors and the second coding sequence 440 are aligned in time; since each feature code in the second coding sequence is obtained according to at least one phoneme in the phoneme sequence, each control vector in the sequence 450 of control vectors is likewise obtained from at least one phoneme in the phoneme sequence.
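  • A sketch of this windowing (Python with NumPy; the window length, step, and the stand-in "model" are assumptions, with the real mapping provided by the pre-trained network described later):

```python
import numpy as np

# Slide a fixed-length window over the first coding sequence to obtain
# feature codes 1..M, then map each feature code to a control vector.
first_code_seq = np.random.rand(300, 3)          # hypothetical smoothed codes
win_len, step = 100, 10

feature_codes = [first_code_seq[i:i + win_len]   # feature code 1 .. M
                 for i in range(0, len(first_code_seq) - win_len + 1, step)]
model = lambda code: code.mean(axis=0)           # stand-in for the real model
control_vectors = [model(code) for code in feature_codes]
# control_vectors is aligned in time with the second coding sequence
```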
  • while playing the phoneme sequence corresponding to the text data, the interactive object is driven to make actions according to the sequence of control vectors; that is, the interactive object can be driven to emit the sound corresponding to the text content and to make actions synchronized with that sound, giving the target object the feeling that the interactive object is speaking and improving the interactive experience between the target object and the interactive object.
  • because the first feature code only becomes available after the time window has slid for some time, the control vector before a set time can be set to a default value; that is, when playback of the phoneme sequence has just started, the interactive object makes a default action, and after the set time the sequence of control vectors obtained according to the first coding sequence is used to drive the interactive object to make actions. Taking FIG. 4 as an example, feature code 1 is output at time t0, and before time t0 the default control vector is output.
  • the length of the time window is related to the amount of information contained in the feature code: the more information the time window contains, the more uniform the result output by the recurrent neural network. If the length of the time window is too large, the expression of the interactive object may fail to correspond to parts of the text; if it is too small, the interactive object may appear rigid when speaking. Therefore, the duration of the time window should be determined according to the minimum duration of the phonemes corresponding to the text data, so that the actions the interactive object is driven to make correlate more strongly with the sound.
  • the sliding step of the time window is related to the time interval (frequency) for obtaining the control vector, that is, it is related to the frequency with which the interactive object is driven to make an action.
  • the length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.
  • when the time interval between phonemes in the phoneme sequence is greater than a set threshold, the interactive object is driven to make actions according to a set control vector of the local area; that is, when the interactive character pauses for a long time, the interactive object is driven to make a set action. For example, when the output voice pauses for a long time, the interactive object can be made to show a smiling expression or to swing its body slightly, to avoid the interactive object standing upright without expression during a long pause; this makes the interactive object speak more naturally and smoothly and improves the interactive experience of the target object.
  • the voice data sequence includes a voice frame sequence
  • acquiring feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquiring an acoustic feature vector corresponding to at least one voice frame according to the first acoustic feature sequence; and obtaining the feature information corresponding to the at least one voice frame according to the acoustic feature vector.
  • control parameter of at least one local area of the interactive object may be determined according to the acoustic characteristics of the speech frame sequence, and the control parameter may also be determined according to other characteristics of the speech frame sequence.
  • the acoustic feature sequence corresponding to the speech frame sequence is referred to as the first acoustic feature sequence.
  • the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-frequency cepstral coefficients (MFCC), and so on.
  • the first acoustic feature sequence is obtained by processing the entire speech frame sequence. Taking MFCC as an example, the speech frame sequence can be subjected to windowing, fast Fourier transform, Mel filtering, logarithmic processing, and discrete cosine processing to obtain the MFCC coefficients corresponding to each speech frame.
  • the first acoustic feature sequence is obtained by processing the entire voice frame sequence, and reflects the overall acoustic feature of the voice data sequence.
  • the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence.
  • the first acoustic feature sequence includes the MFCC coefficients of each speech frame.
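  • A sketch of extracting such a first acoustic feature sequence (Python with librosa; the file name, sample rate, and frame settings are assumptions, not values from the patent):

```python
import librosa

# Load audio and compute MFCCs: windowing, FFT, Mel filtering, log, and DCT
# are performed internally by librosa.feature.mfcc.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
# mfcc.shape == (13, num_frames): one 13-dimensional acoustic feature vector
# per speech frame, i.e. the first acoustic feature sequence
```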
  • the first acoustic feature sequence obtained according to the speech frame sequence is shown in FIG. 5.
  • the acoustic feature corresponding to at least one speech frame is acquired.
  • the feature vectors corresponding to the at least one voice frame, equal in number to those voice frames, may be used as the acoustic feature of those voice frames; these feature vectors can form a feature matrix, and the feature matrix is the acoustic feature of the at least one speech frame.
  • the N feature vectors in the first acoustic feature sequence form the acoustic features of the corresponding N speech frames; where N is a positive integer.
  • the first acoustic feature sequence may include a plurality of acoustic features, and the speech frames corresponding to the respective acoustic features may partially overlap.
  • the control vector of at least one local area can be acquired.
  • the local area can be selected according to the actions of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to be controlled to perform actions at the same time, the control vectors of all the local areas can be obtained; when the expression of the interactive object needs to be controlled, the control vectors of the local areas corresponding to the face can be obtained.
  • while playing the voice data sequence, the interactive object is driven to act according to the control vectors corresponding to the acoustic features obtained from the first acoustic feature sequence, so that while the terminal device outputs sound, the interactive object performs actions matching the output sound, including facial actions, expressions, and body actions, and the target object feels that the interactive object is speaking.
  • since the control vector is related to the acoustic features of the output sound, driving according to the control vector gives the expressions and body movements of the interactive object an emotional character, making the speaking process of the interactive object more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
  • the acoustic feature corresponding to the at least one speech frame may be acquired by performing a sliding window on the first acoustic feature sequence.
  • the acoustic feature vectors within the time window are used as the acoustic feature of the corresponding, equal number of speech frames, so that the acoustic features corresponding to these speech frames are acquired.
  • the second acoustic feature sequence can be obtained according to the obtained multiple acoustic features.
  • the speech frame sequence includes 100 speech frames per second, the length of the time window is 1 s, and the step length is 0.04 s. Since each feature vector in the first acoustic feature sequence corresponds to a speech frame, correspondingly, the first acoustic feature sequence also includes 100 feature vectors per second. During the window sliding process on the first acoustic feature sequence, 100 feature vectors in the time window are obtained each time as the acoustic features of the corresponding 100 speech frames.
  • in this way, acoustic feature 1 corresponding to the 1st to 100th speech frames and acoustic feature 2 corresponding to the 5th to 104th speech frames are obtained, and so on.
  • after traversing the first acoustic feature sequence, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, thereby obtaining the second acoustic feature sequence, where M is a positive integer whose value is determined by the number of frames in the speech frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
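  • The example numbers above can be sketched as follows (Python with NumPy; the feature values are random placeholders):

```python
import numpy as np

# 100 feature vectors per second, a 1 s (100-vector) window, a 0.04 s
# (4-vector) step.
first_seq = np.random.rand(500, 13)        # 5 s of hypothetical MFCC vectors
win, step = 100, 4
M = (len(first_seq) - win) // step + 1     # number of acoustic features
second_seq = [first_seq[i * step : i * step + win] for i in range(M)]
# second_seq[0] covers frames 1-100, second_seq[1] covers frames 5-104, ...
```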
  • according to acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding control vector 1, control vector 2, ..., control vector M can be obtained respectively, so as to obtain the sequence of control vectors.
  • the sequence of the control vector and the second acoustic feature sequence are aligned in time.
  • since acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained based on N feature vectors in the first acoustic feature sequence, the interactive object can be driven to perform actions according to the sequence of control vectors while the voice frames are played.
  • similarly, the control vector before the set time can be set to a default value; that is, when playback of the speech frame sequence has just started, the interactive object makes a default action, and after the set time the interactive object is driven to make actions using the sequence of control vectors obtained according to the first acoustic feature sequence.
  • taking FIG. 5 as an example, acoustic feature 1 is output at time t0, and subsequent acoustic features are output at intervals of 0.04 s, corresponding to the step size: acoustic feature 2 is output at t1, acoustic feature 3 at t2, and so on, until acoustic feature M is output at t(M-1). Correspondingly, acoustic feature (i+1) corresponds to the time period t(i) to t(i+1), where i is an integer smaller than (M-1); before t0, the control vector is the default control vector.
  • thus, while the voice data sequence is played, the interactive object is driven to make actions according to the sequence of control vectors, so that the actions of the interactive object are synchronized with the output sound; the target object has the feeling that the interactive object is speaking, and the interactive experience between the target object and the interactive object is improved.
  • the length of the time window is related to the amount of information contained in the acoustic feature. The greater the length of the time window, the more information it contains, and the stronger the correlation between the actions and sounds that drive the interactive object.
  • the sliding step of the time window is related to the time interval (frequency) for obtaining the control vector, that is, it is related to the frequency with which the interactive object is driven to make an action.
  • the length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.
  • the acoustic feature includes Mel frequency cepstral coefficients MFCC in L dimensions, where L is a positive integer.
  • MFCC represents the distribution of the energy of the speech signal in different frequency ranges.
  • the MFCC of L dimensions can be obtained by converting multiple speech frame data in the speech frame sequence to the frequency domain and using a Mel filter including L subbands.
  • the control vector is obtained according to the MFCC of the voice data sequence, and the interactive object is driven to make facial and body actions according to the control vector, so that the expressions and body movements of the interactive object carry emotional character, making the speaking process of the interactive object more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
  • the characteristic information of the voice data unit may be input to a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the characteristic information.
  • since the recurrent neural network is a time-recursive neural network, it can learn the historical information of the input feature information and output control parameters according to the sequence of voice units; the control parameter may be, for example, a facial posture control parameter or a control vector of at least one local area.
  • in the embodiments of the present disclosure, a pre-trained recurrent neural network is used to obtain the control parameters corresponding to the feature information of the voice data unit; because it fuses the related historical feature information with the current feature information, the historical control parameters influence the change of the current control parameters, making the expression changes and body movements of the interactive character smoother and more natural.
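  • A minimal sketch of such a network (PyTorch; the patent only says "recurrent neural network", so the GRU, the dimensions, and the linear output head are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ControlParamNet(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=128, param_dim=10):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, param_dim)

    def forward(self, features):        # features: (batch, time, feat_dim)
        out, _ = self.rnn(features)     # the hidden state carries history
        return self.head(out)           # (batch, time, param_dim) control values

net = ControlParamNet()
params = net(torch.randn(1, 50, 13))    # control parameter values per time step
```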
  • the recurrent neural network can be trained in the following manner.
  • the characteristic information sample can be obtained in the following manner.
  • acquire a video segment of a character emitting voice, and extract the corresponding voice segment of the character from the video segment; for example, a video segment in which a real person is speaking can be obtained. Sample the video segment to obtain multiple first image frames, and sample the voice segment to obtain multiple voice frames.
  • the first image frame is converted into a second image frame containing the interactive object, and the control parameter value of the interactive object corresponding to the second image frame is obtained.
  • according to the control parameter value, annotate the feature information corresponding to the first image frame to obtain a feature information sample.
  • the feature information includes feature codes of phonemes
  • the control parameters include facial muscle control coefficients.
  • the feature information includes a feature code of a phoneme
  • the control parameter includes at least one partial control vector of the interactive object.
  • the characteristic information includes acoustic characteristics of the speech frame
  • the control parameter includes at least one partial control vector of the interactive object.
  • the feature information sample is not limited to the above, and corresponding to various features of various types of voice data units, corresponding feature information samples can be obtained.
  • the initial recurrent neural network is trained according to the feature information samples, and the trained recurrent neural network is obtained after the change of the network loss satisfies the convergence condition, where the network loss includes the difference between the control parameter values predicted by the recurrent neural network and the annotated control parameter values.
  • in the embodiments of the present disclosure, the video segment of a character is split into corresponding first image frames and voice frames, the first image frames containing the real person are converted into second image frames containing the interactive object, and the second image frames are used to obtain the control parameter values corresponding to the feature information of at least one voice frame; the feature information thus corresponds well to the control parameter values, yielding high-quality feature information samples and making the posture of the interactive object closer to the real posture of the corresponding character.
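  • A sketch of the training procedure implied above (PyTorch; the MSE loss, the Adam optimizer, and the data loader are assumptions; ControlParamNet is the sketch shown earlier):

```python
import torch

model = ControlParamNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

def train_epoch(dataloader):
    # each batch pairs feature-information samples with the control parameter
    # values annotated from the converted second image frames
    for features, target_params in dataloader:
        pred = model(features)
        loss = loss_fn(pred, target_params)   # prediction vs. annotation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# training stops once the change in the loss satisfies the convergence condition
```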
  • FIG. 6 shows a schematic structural diagram of a driving device for an interactive object according to at least one embodiment of the present disclosure.
  • the device may include: a first acquiring unit 601 configured to acquire driving data of the interactive object and determine the driving mode of the driving data; a second acquiring unit 602 configured to acquire, in response to the driving mode, the control parameter value of the interactive object according to the driving data; and a driving unit 603 configured to control the posture of the interactive object according to the control parameter value.
  • the device further includes an output unit for controlling the display device to output voice and/or display text according to the driving data.
  • in some embodiments, when determining the driving mode corresponding to the driving data, the first acquiring unit is specifically configured to: acquire a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, if it is detected that a voice data unit includes target data, determine that the driving mode of the driving data is the first driving mode, where the target data corresponds to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, using the preset control parameter value corresponding to the target data as the control parameter value of the interactive object.
  • the target data includes a keyword or key phrase, and the keyword or key phrase corresponds to preset control parameter values of a set action of the interactive object; or, the target data includes a syllable, and the syllable corresponds to preset control parameter values for a set mouth movement of the interactive object.
  • when identifying the driving mode of the driving data, the first acquiring unit is specifically configured to: acquire a voice data sequence corresponding to the driving data according to the type of the driving data, where the voice data sequence includes a plurality of voice data units; and, if it is not detected that a voice data unit includes target data, determine that the driving mode of the driving data is the second driving mode, where the target data corresponds to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring characteristic information of at least one voice data unit in the voice data sequence, and obtaining the control parameter value of the interactive object corresponding to the characteristic information.
  • the voice data sequence includes a phoneme sequence
  • when acquiring the characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence; obtain the feature code corresponding to at least one phoneme according to the first coding sequence; and obtain the feature information of the at least one phoneme according to the feature code.
  • the voice data sequence includes a voice frame sequence
  • when acquiring characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: acquire a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquire an acoustic feature vector corresponding to at least one voice frame according to the first acoustic feature sequence; and obtain the feature information corresponding to the at least one voice frame according to the acoustic feature vector.
  • the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and a facial muscle control coefficient is used to control the motion state of at least one facial muscle; the second acquiring unit is specifically configured to: acquire the facial muscle control coefficients of the interactive object according to the driving data; the driving unit is specifically configured to: drive the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients;
  • the apparatus further includes a limb driving unit, configured to: acquire driving data of a body posture associated with the facial posture parameter values; and drive the interactive object to make limb movements according to the driving data of the body posture associated with the facial posture parameter values.
  • the control parameter of the interactive object includes a control vector of at least one local area of the interactive object; when acquiring the control parameter value of the interactive object according to the driving data, the second acquiring unit is specifically configured to: acquire a control vector of at least one local area of the interactive object according to the driving data; and the driving unit is specifically configured to: control facial actions and/or limb actions of the interactive object according to the acquired control vector of the at least one local area.
  • an electronic device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to, when executing the computer instructions, implement the method for driving an interactive object described in any of the embodiments provided in the present disclosure.
  • a computer-readable storage medium has a computer program stored thereon, and when the program is executed by a processor, the method for driving an interactive object according to any one of the embodiments provided in the present disclosure is implemented.
  • At least one embodiment of this specification also provides an electronic device. The device includes a memory and a processor; the memory is configured to store computer instructions executable on the processor, and when the processor executes the computer instructions, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
  • At least one embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object according to any embodiment of the present disclosure is implemented.
  • one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
  • the embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them.
  • the embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. The program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by the data processing apparatus.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processing and logic flows described in this specification can be executed by one or more programmable computers executing one or more computer programs, which perform corresponding functions by operating on input data and generating output.
  • the processing and logic flows can also be executed by special-purpose logic circuitry, such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special-purpose logic circuitry.
  • Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • the computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such mass storage devices to receive data from them, transfer data to them, or both.
  • however, the computer does not have to have such devices.
  • the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method, apparatus and device for driving an interactive object, and a storage medium. The interactive object is displayed in a display device, and the method comprises: obtaining driving data of the interactive object, and determining the driving mode of the driving data (201); in response to the driving mode, obtaining a control parameter value of the interactive object according to the driving data (202); and controlling the posture of the interactive object according to the control parameter value (203).

Description

Method, apparatus and device for driving an interactive object, and storage medium
Cross-reference to related applications
This application is filed on the basis of the Chinese patent application with application number 2020102461120, filed on March 31, 2020, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a method, apparatus, and device for driving an interactive object, and a storage medium.
Background
Most human-computer interaction is based on key-press, touch, and voice input, with responses presented as images, text, or virtual characters on a display screen. At present, virtual characters are mostly built on top of voice assistants and only output the voice of the device.
Summary
The embodiments of the present disclosure provide a driving solution for interactive objects.
According to an aspect of the present disclosure, a method for driving an interactive object is provided, the interactive object being displayed in a display device. The method includes: acquiring driving data of the interactive object and determining a driving mode of the driving data; in response to the driving mode, acquiring a control parameter value of the interactive object according to the driving data; and controlling a posture of the interactive object according to the control parameter value.
With reference to any embodiment provided by the present disclosure, the method further includes: controlling the display device to output voice and/or display text according to the driving data.
With reference to any embodiment provided by the present disclosure, determining the driving mode corresponding to the driving data includes: acquiring a voice data sequence corresponding to the driving data according to the type of the driving data, the voice data sequence including a plurality of voice data units; and in response to detecting that a voice data unit includes target data, determining that the driving mode of the driving data is a first driving mode, the target data corresponding to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, taking the preset control parameter value corresponding to the target data as the control parameter value of the interactive object.
With reference to any embodiment provided by the present disclosure, the target data includes a keyword or key character, the keyword or key character corresponding to a preset control parameter value of a set action of the interactive object; or the target data includes a syllable, the syllable corresponding to a preset control parameter value of a set mouth-shape action of the interactive object.
With reference to any embodiment provided by the present disclosure, determining the driving mode corresponding to the driving data includes: acquiring a voice data sequence corresponding to the driving data according to the type of the driving data, the voice data sequence including a plurality of voice data units; and if it is not detected that any voice data unit includes target data, determining that the driving mode of the driving data is a second driving mode, the target data corresponding to a preset control parameter value of the interactive object. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence, and acquiring the control parameter value of the interactive object corresponding to the feature information.
With reference to any embodiment provided by the present disclosure, the voice data sequence includes a phoneme sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence includes: performing feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; acquiring a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining feature information of the at least one phoneme according to the feature code.
With reference to any embodiment provided by the present disclosure, the voice data sequence includes a speech frame sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the speech frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each speech frame in the speech frame sequence; acquiring an acoustic feature vector corresponding to at least one speech frame according to the first acoustic feature sequence; and obtaining feature information corresponding to the at least one speech frame according to the acoustic feature vector.
With reference to any embodiment provided by the present disclosure, the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and a facial muscle control coefficient is used to control the motion state of at least one facial muscle. Acquiring the control parameter value of the interactive object according to the driving data includes: acquiring the facial muscle control coefficients of the interactive object according to the driving data. Controlling the posture of the interactive object according to the control parameter value includes: driving the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients.
With reference to any embodiment provided by the present disclosure, the method further includes: acquiring driving data of a body posture associated with the facial posture parameters; and driving the interactive object to make limb movements according to the driving data of the body posture associated with the facial posture parameter values.
With reference to any embodiment provided by the present disclosure, the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object. Acquiring the control parameter value of the interactive object according to the driving data includes: acquiring a control vector of at least one local area of the interactive object according to the driving data. Controlling the posture of the interactive object according to the control parameter value includes: controlling facial actions and/or limb actions of the interactive object according to the acquired control vector of the at least one local area.
With reference to any embodiment provided by the present disclosure, acquiring the control parameter value of the interactive object corresponding to the feature information includes: inputting the feature information into a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the feature information.
According to an aspect of the present disclosure, an apparatus for driving an interactive object is provided, the interactive object being displayed in a display device. The apparatus includes: a first acquiring unit, configured to acquire driving data of the interactive object and determine a driving mode of the driving data; a second acquiring unit, configured to acquire a control parameter value of the interactive object according to the driving data in response to the driving mode; and a driving unit, configured to control a posture of the interactive object according to the control parameter value.
According to an aspect of the present disclosure, an electronic device is provided. The device includes a memory and a processor; the memory is configured to store computer instructions executable on the processor, and the processor is configured to, when executing the computer instructions, implement the method for driving an interactive object described in any of the embodiments provided in the present disclosure.
According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the method for driving an interactive object according to any of the embodiments provided in the present disclosure is implemented.
With the method, apparatus, device, and computer-readable storage medium for driving an interactive object according to one or more embodiments of the present disclosure, the control parameter value of the interactive object is acquired according to the driving mode of the driving data of the interactive object, so as to control the posture of the interactive object. For different driving modes, the control parameter value of the interactive object can be acquired in different ways, so that the interactive object displays a posture matching the content of the driving data and/or the corresponding voice, which gives the target object the feeling of communicating with the interactive object and improves the interactive experience between the target object and the interactive object.
Description of the drawings
In order to explain one or more embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of this specification, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a schematic diagram of a display device in a method for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a process of feature encoding a phoneme sequence proposed by at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a process of obtaining control parameter values from a phoneme sequence proposed by at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a process of obtaining control parameter values from a speech frame sequence proposed by at least one embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an apparatus for driving an interactive object proposed by at least one embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. On the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The term "and/or" in this document merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set formed by A, B, and C.
At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertising machine, an all-in-one machine, or a vehicle-mounted terminal; the server includes a local server, a cloud server, and the like. The method may also be implemented by a processor calling computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the interactive object may be any virtual image capable of interacting with a target object. In an embodiment, the interactive object may be a virtual character, or may be a virtual animal, a virtual item, a cartoon image, or any other virtual image capable of implementing interactive functions. The display form of the interactive object may be 2D or 3D, which is not limited by the present disclosure. The target object may be a user, a robot, or another smart device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object may express a demand by making gestures or body movements and trigger the interactive object to interact with it by way of active interaction. In another example, the interactive object may actively greet the target object or prompt the target object to make an action, so that the target object interacts with the interactive object in a passive manner.
The interactive object may be displayed through a terminal device, and the terminal device may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, and the like; the present disclosure does not limit the specific form of the terminal device.
FIG. 1 shows a display device proposed by at least one embodiment of the present disclosure. As shown in FIG. 1, the display device has a transparent display screen on which a stereoscopic picture can be displayed to present a virtual scene and an interactive object with a stereoscopic effect. For example, the interactive objects displayed on the transparent display screen in FIG. 1 include a virtual cartoon character. In some embodiments, the terminal device described in the present disclosure may also be the above display device with a transparent display screen. The display device is configured with a memory and a processor; the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method for driving an interactive object provided in the present disclosure when executing the computer instructions, so as to drive the interactive object displayed on the transparent display screen to communicate with or respond to the target object.
In some embodiments, in response to driving data for driving the interactive object to output voice, the interactive object may emit a specified voice to the target object. The terminal device may generate driving data according to the actions, expressions, identity, preferences, and the like of the target object around the terminal device, so as to drive the interactive object to communicate or respond by emitting the specified voice, thereby providing an anthropomorphic service for the target object. It should be noted that the sound driving data may also be generated in other ways, for example, generated by a server and sent to the terminal device.
During the interaction between the interactive object and the target object, when the interactive object is driven to emit a specified voice according to the driving data, it may not be possible to drive the interactive object to make facial actions synchronized with the specified voice, so that the interactive object appears dull and unnatural when uttering the voice, which affects the interactive experience between the target object and the interactive object. Based on this, at least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the experience of the target object interacting with the interactive object.
FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. The interactive object is displayed in a display device. As shown in FIG. 2, the method includes steps 201 to 203.
In step 201, driving data of the interactive object is acquired, and a driving mode of the driving data is determined.
In the embodiments of the present disclosure, the driving data may include audio data (voice data), text, and so on. The driving data may be generated by a server or the terminal device according to the actions, expressions, identity, preferences, and the like of the target object interacting with the interactive object, or may be directly acquired by the terminal device, for example, driving data called from an internal memory. The present disclosure does not limit the way in which the driving data is acquired.
The driving mode of the driving data can be determined according to the type of the driving data and the information contained in the driving data.
In one example, a voice data sequence corresponding to the driving data may be acquired according to the type of the driving data, where the voice data sequence includes a plurality of voice data units. A voice data unit may be formed in units of characters or words, or in units of phonemes or syllables. For driving data of the text type, a character sequence, a word sequence, and the like corresponding to the driving data can be obtained; for driving data of the audio type, a phoneme sequence, a syllable sequence, a speech frame sequence, and the like corresponding to the driving data can be obtained. In an embodiment, audio data and text data can be converted into each other; for example, the audio data is converted into text data before the voice data units are divided, or the text data is converted into audio data before the voice data units are divided, which is not limited by the present disclosure.
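The split into voice data units can be pictured as follows in Python. This is a minimal sketch: `text_to_phonemes` and `audio_to_frames` are illustrative placeholders, since the disclosure does not fix a particular front end for either data type.

```python
def text_to_phonemes(text):
    # placeholder grapheme-to-phoneme front end; a real system would use a
    # pronunciation dictionary or a G2P model
    return list(text)

def audio_to_frames(samples, frame_len=320):
    # split raw audio samples into fixed-length speech frames (20 ms at 16 kHz)
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

def to_voice_data_sequence(driving_data, data_type):
    if data_type == "text":
        return text_to_phonemes(driving_data)
    if data_type == "audio":
        return audio_to_frames(driving_data)
    raise ValueError(f"unsupported driving data type: {data_type}")
```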
When it is detected that a voice data unit includes target data, it can be determined that the driving mode of the driving data is a first driving mode, the target data corresponding to a preset control parameter value of the interactive object.
The target data may be a set keyword or key character, and the keyword or key character corresponds to a preset control parameter value of a set action of the interactive object.
In the embodiments of the present disclosure, each piece of target data is matched with a set action in advance, and each set action is realized by being controlled with corresponding control parameter values; therefore, each piece of target data is matched with the control parameter values of a set action. Taking the keyword "wave" as an example, if the voice data units contain "wave" in text form and/or "wave" in voice form, it can be determined that the driving data contains the target data, as shown in the sketch below.
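A minimal sketch of this first-driving-mode lookup follows; the keyword "wave" and the parameter names are illustrative assumptions, not values fixed by the disclosure.

```python
PRESET_CONTROL_PARAMS = {
    # target data -> preset control parameter values of a set action
    "wave": {"right_arm_raise": 0.8, "wrist_swing": 0.6},
}

def detect_driving_mode(voice_data_units):
    """Return ("first", params) if any unit contains target data, else ("second", None)."""
    for unit in voice_data_units:
        for target, params in PRESET_CONTROL_PARAMS.items():
            if target in unit:
                return "first", params
    return "second", None

mode, params = detect_driving_mode(["hello", "please wave", "goodbye"])
# mode == "first"; params are used directly as the interactive object's control values
```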
Exemplarily, the target data includes a syllable, and the syllable corresponds to a preset control parameter value of a set mouth-shape action of the interactive object.
The syllable corresponding to the target data belongs to a pre-divided syllable type, and one syllable type matches one set mouth shape. A syllable is a phonetic unit formed by combining at least one phoneme, and syllables include syllables of pinyin languages and syllables of non-pinyin languages (for example, Chinese). One syllable type refers to syllables whose pronunciation actions are consistent or basically consistent, and one syllable type may correspond to one action of the interactive object. In an embodiment, one syllable type may correspond to one set mouth shape of the interactive object when speaking, that is, to one pronunciation action. In this way, different syllable types are matched with the control parameter values of different set mouth shapes. For example, the pinyin syllables "ma", "man", and "mang" can be regarded as the same type because their pronunciation actions are basically consistent, and they can all correspond to the control parameter values of the "open mouth" shape of the interactive object when speaking.
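The syllable-type matching then amounts to a two-step table lookup, as in the hedged sketch below; the groupings and the parameter name `jaw_open` are assumptions for illustration only.

```python
# syllables with basically the same pronunciation action share one syllable type
SYLLABLE_TYPE = {"ma": "open", "man": "open", "mang": "open", "yi": "spread"}
# each syllable type maps to preset control parameter values of one set mouth shape
MOUTH_PARAMS = {"open": {"jaw_open": 0.7}, "spread": {"jaw_open": 0.2}}

def mouth_params_for(syllable):
    syllable_type = SYLLABLE_TYPE.get(syllable)
    return MOUTH_PARAMS.get(syllable_type) if syllable_type else None

print(mouth_params_for("mang"))  # {'jaw_open': 0.7}
```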
If it is not detected that any voice data unit includes target data, it can be determined that the driving mode of the driving data is a second driving mode, the target data corresponding to a preset control parameter value of the interactive object.
Those skilled in the art should understand that the above first driving mode and second driving mode are only examples, and the embodiments of the present disclosure do not limit the specific driving modes.
In step 202, in response to the driving mode, the control parameter value of the interactive object is acquired according to the driving data.
For each driving mode of the driving data, the control parameter value of the interactive object can be acquired in a corresponding manner.
In one example, in response to the first driving mode determined in step 201, the preset control parameter value corresponding to the target data may be taken as the control parameter value of the interactive object. For example, for the first driving mode, the preset control parameter value corresponding to the target data (such as "wave") contained in the voice data sequence may be taken as the control parameter value of the interactive object.
In one example, in response to the second driving mode determined in step 201, feature information of at least one voice data unit in the voice data sequence may be acquired, and the control parameter value of the interactive object corresponding to the feature information may be acquired. That is, if no target data is detected in the voice data sequence, the corresponding control parameter value can be acquired according to the feature information of the voice data units. The feature information may include feature information of a voice data unit obtained by performing feature encoding on the voice data sequence, feature information of a voice data unit obtained according to acoustic feature information of the voice data sequence, and so on.
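In code, the second driving mode reduces to a feature-extraction step followed by a learned mapping. Both callables below are placeholders, under the assumption that a feature extractor and a pretrained control network (the disclosure later mentions a recurrent neural network for this mapping) are supplied by the caller.

```python
import numpy as np

def control_params_from_features(voice_data_units, extract_features, control_net):
    # one feature vector per voice data unit, then one control parameter
    # vector per unit from the pretrained mapping
    features = np.stack([extract_features(unit) for unit in voice_data_units])
    return control_net(features)
```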
In step 203, the posture of the interactive object is controlled according to the control parameter value.
In some embodiments, the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and a facial muscle control coefficient is used to control the motion state of at least one facial muscle. In an embodiment, the facial muscle control coefficients of the interactive object may be acquired according to the driving data, and the interactive object may be driven to make facial actions matching the driving data according to the acquired facial muscle control coefficients.
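A toy driving hook might look like the following; `FaceModel.set_contraction` is a hypothetical rendering API and the muscle names are illustrative.

```python
class FaceModel:
    """Stand-in for a renderable face model; set_contraction is hypothetical."""
    def set_contraction(self, muscle, value):
        print(f"{muscle} -> {value:.2f}")

def apply_facial_muscle_coefficients(face_model, coefficients):
    # clamp each coefficient to [0, 1] and apply it to the named muscle region
    for muscle, value in coefficients.items():
        face_model.set_contraction(muscle, max(0.0, min(1.0, value)))

apply_facial_muscle_coefficients(FaceModel(), {"upper_lip": 0.4, "left_mouth_corner": 0.1})
```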
In some embodiments, the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object. In an embodiment, a control vector of at least one local area of the interactive object can be acquired according to the driving data, and facial actions and/or limb actions of the interactive object can be controlled according to the acquired control vector of the at least one local area.
The control parameter value of the interactive object is acquired according to the driving mode of the driving data of the interactive object, so as to control the posture of the interactive object. For different driving modes, the control parameter value of the interactive object can be acquired in different ways, so that the interactive object displays a posture matching the content of the driving data and/or the corresponding voice, which gives the target object the feeling of communicating with the interactive object and improves the interactive experience between the target object and the interactive object.
In some embodiments, the display device may also be controlled to output voice and/or display text according to the driving data, and the posture of the interactive object may be controlled according to the control parameter value while the voice is output and/or the text is displayed.
In the embodiments of the present disclosure, since the control parameter values match the driving data, when outputting voice and/or displaying text according to the driving data is synchronized with controlling the posture of the interactive object according to the control parameter values, the posture made by the interactive object is also synchronized with the output voice and/or the displayed text, giving the target object the feeling that the interactive object is communicating with it.
In some embodiments, the voice data sequence includes a phoneme sequence. In response to the driving data including audio data, phonemes can be formed by splitting the audio data into multiple audio frames and combining the audio frames according to their states; the phonemes formed from the audio data then form a phoneme sequence. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech, and one pronunciation action of a real person can form one phoneme. In response to the driving data being text, the phonemes corresponding to the morphemes contained in the text can be obtained, so as to obtain the corresponding phoneme sequence.
In some embodiments, the feature information of at least one voice data unit in the voice data sequence can be acquired in the following way: performing feature encoding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; acquiring a feature code corresponding to at least one phoneme according to the first coding sequence; and obtaining feature information of the at least one phoneme according to the feature code.
FIG. 3 shows a schematic diagram of the process of feature encoding a phoneme sequence. As shown in FIG. 3, the phoneme sequence 310 contains the phonemes j, i1, j, ie4 (for brevity, only some phonemes are shown), and for each kind of phoneme j, i1, ie4, a corresponding coding sequence 321, 322, 323 is obtained. In each coding sequence, the coding value at a time point where the phoneme occurs is set to a first value (for example, 1), and the coding value at a time point where the phoneme does not occur is set to a second value (for example, 0). Taking the coding sequence 321 as an example, at time points in the phoneme sequence 310 where the phoneme j occurs, the value of the coding sequence 321 is set to the first value 1; at time points where the phoneme j does not occur, the value of the coding sequence 321 is set to the second value 0. All the coding sequences 321, 322, 323 constitute the total coding sequence 320.
According to the coding values of the coding sequences 321, 322, 323 corresponding to the phonemes j, i1, ie4, and the durations of the corresponding phonemes in these three coding sequences, that is, the duration of j in the coding sequence 321, the duration of i1 in the coding sequence 322, and the duration of ie4 in the coding sequence 323, the feature information of the coding sequences 321, 322, 323 can be obtained.
For example, a Gaussian filter may be used to perform a Gaussian convolution operation on the temporally continuous values of the phonemes j, i1, ie4 in the coding sequences 321, 322, 323, respectively, to obtain the feature information of the coding sequences. That is, a Gaussian convolution operation is performed on the temporally continuous values of each phoneme through a Gaussian filter, so that the transition stages in each coding sequence, from the second value to the first value or from the first value to the second value, become smooth. A Gaussian convolution operation is performed on each of the coding sequences 321, 322, 323 to obtain the feature values of each coding sequence, where the feature values are the parameters constituting the feature information, and the feature information 330 corresponding to the phoneme sequence 310 is obtained from the set of feature information of the coding sequences. Those skilled in the art should understand that other operations may also be performed on the coding sequences to obtain their feature information, which is not limited by the present disclosure.
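A minimal sketch of this encoding-and-smoothing step follows, assuming the phoneme sequence is sampled on a uniform time grid and using SciPy's 1-D Gaussian filter for the convolution; the timings and sigma are free choices, not values from the disclosure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def encode_phoneme_sequence(timed_phonemes, num_steps, sigma=2.0):
    """timed_phonemes: list of (phoneme, start_step, end_step) tuples."""
    phonemes = sorted({p for p, _, _ in timed_phonemes})
    codes = {p: np.zeros(num_steps) for p in phonemes}
    for p, start, end in timed_phonemes:
        codes[p][start:end] = 1.0          # first value where the phoneme sounds
    # the Gaussian convolution smooths the 0 -> 1 and 1 -> 0 transitions
    smoothed = [gaussian_filter1d(codes[p], sigma) for p in phonemes]
    return np.stack(smoothed)              # feature information of the sequence

features = encode_phoneme_sequence([("j", 0, 3), ("i1", 3, 8), ("ie4", 11, 16)], 20)
```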
In the embodiments of the present disclosure, the feature information of the coding sequences is obtained according to the duration of each kind of phoneme in the phoneme sequence, so that the transition stages of the coding sequences are smooth; for example, besides 0 and 1, the values of the coding sequences also take intermediate values such as 0.2, 0.3, and so on. The posture parameter values acquired according to these intermediate values make the posture changes of the interactive character, and in particular its expression changes, more gentle and natural, which improves the interactive experience of the target object.
In some embodiments, the facial posture parameters may include facial muscle control coefficients.
From an anatomical point of view, the movement of the human face is the result of the coordinated deformation of the various facial muscles. Therefore, a facial muscle model is obtained by dividing the facial muscles of the interactive object, and the movement of each muscle (area) obtained by the division is controlled through a corresponding facial muscle control coefficient, that is, contraction/expansion control is performed on it, so that the face of the interactive character can make various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the value range of its control coefficient is 0 to 1, and different values within this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the mouth can be opened and closed vertically. For the left mouth corner muscle, the value range of its control coefficient is 0 to 1, and different values within this range correspond to the contraction/expansion states of the left mouth corner muscle; by changing this value, the mouth can be changed laterally.
While sound is output according to the phoneme sequence, the interactive object is driven to make facial expressions according to the facial muscle control coefficients corresponding to the phoneme sequence, so that while the display device outputs the sound, the interactive object synchronously makes the expressions of emitting that sound, giving the target object the feeling that the interactive object is speaking and improving the interactive experience of the target object.
In some embodiments, the facial actions of the interactive object may be associated with a body posture, that is, the facial posture parameter values corresponding to a facial action may be associated with the body posture, and the body posture may include limb actions, gesture actions, walking postures, and so on.
During the driving of the interactive object, driving data of the body posture associated with the facial posture parameter values is acquired; while the sound is output according to the phoneme sequence, the interactive object is driven to make limb movements according to the driving data of the body posture associated with the facial posture parameter values. That is, while the interactive object is driven to make facial actions according to the driving data of the interactive object, the driving data of the associated body posture is also acquired according to the facial posture parameter values corresponding to those facial actions, so that when the sound is output, the interactive object can be driven to make the corresponding facial actions and limb actions synchronously, making the speaking state of the interactive object more vivid and natural and improving the interactive experience of the target object.
Since the output of sound needs to remain continuous, in an embodiment, a time window is moved over the phoneme sequence, and the phonemes within the time window during each movement are output, with a set duration as the step size of each movement of the time window. For example, the length of the time window may be set to 1 second, and the set duration to 0.1 second. While the phonemes within the time window are output, the posture parameter value corresponding to the phoneme, or to the feature information of the phoneme, at a set position of the time window is acquired, and the posture of the interactive object is controlled using that posture parameter value. The set position is at a set duration from the starting position of the time window; for example, when the length of the time window is set to 1 s, the set position may be 0.5 s from the starting position of the time window. With each movement of the time window, while the phonemes within the time window are output, the posture of the interactive object is controlled by the posture parameter value corresponding to the set position of the time window, so that the posture of the interactive object is synchronized with the output voice, giving the target object the feeling that the interactive object is speaking.
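The windowed playback can be sketched as a scheduling loop; the 1 s window, 0.1 s step, and 0.5 s in-window offset are the example values from the text, and `params_at` stands in for the lookup of posture parameter values at a time point.

```python
def drive_with_time_window(duration_s, params_at, window_s=1.0, step_s=0.1, offset_s=0.5):
    t = 0.0
    schedule = []
    while t + window_s <= duration_s:
        # phonemes in [t, t + window_s) are output; the posture uses the value
        # at the window's set position t + offset_s
        schedule.append((t, params_at(t + offset_s)))
        t += step_s
    return schedule

# e.g. drive_with_time_window(3.0, params_at=lambda t: {"mouth_open": 0.5})
```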
By changing the set duration, the time interval (frequency) of acquiring the posture parameter values can be changed, thereby changing the frequency at which the interactive object makes postures. The set duration can be set according to the actual interaction scenario, so that the posture changes of the interactive object are more natural.
In some embodiments, the posture of the interactive object can be controlled by acquiring a control vector of at least one local area of the interactive object.
A local area is obtained by dividing the whole of the interactive object (including face and/or body). The control of one or more local areas of the face may correspond to a series of facial expressions or actions of the interactive object; for example, the control of the eye area may correspond to facial actions of the interactive object such as opening the eyes, closing the eyes, blinking, and changing the viewing angle, and the control of the mouth area may correspond to facial actions of the interactive object such as closing the mouth and opening the mouth to different degrees. The control of one or more local areas of the body may correspond to a series of limb actions of the interactive object; for example, the control of the leg area may correspond to actions of the interactive object such as walking, jumping, and kicking.
The control parameters of a local area of the interactive object include the control vector of the local area. The posture control vector of each local area is used to drive the local area of the interactive object to perform actions. Different control vector values correspond to different actions or action amplitudes. For example, for the control vector of the mouth area, one set of control vector values may make the mouth of the interactive object open slightly, while another set of control vector values may make the mouth of the interactive object open wide. By driving the interactive object with different control vector values, the corresponding local areas can be made to perform different actions or actions with different amplitudes.
The local areas can be selected according to the actions of the interactive object to be controlled. For example, when the face and limbs of the interactive object need to be controlled to act at the same time, the control vectors of all local areas can be acquired; when the expression of the interactive object needs to be controlled, the control vectors of the local areas corresponding to the face can be acquired.
In some embodiments, the feature code corresponding to at least one phoneme can be acquired by sliding a window over the first coding sequence, where the first coding sequence may be a coding sequence after the Gaussian convolution operation.
A window is slid over the coding sequence with a time window of a set length and a set step size, and the feature code within the time window is taken as the feature code of the corresponding at least one phoneme; after the sliding is completed, a second coding sequence can be obtained from the multiple feature codes thus acquired. As shown in FIG. 4, by sliding a time window of a set length over the first coding sequence 420, or over the smoothed first coding sequence 430, feature code 1, feature code 2, feature code 3, and so on are obtained; after the first coding sequence is traversed, feature code 1, feature code 2, feature code 3, ..., feature code M are obtained, and the second coding sequence 440 is thereby obtained, where M is a positive integer whose value is determined according to the length of the first coding sequence, the length of the time window, and the sliding step size of the time window.
According to feature code 1, feature code 2, feature code 3, ..., feature code M, the corresponding control vector 1, control vector 2, control vector 3, ..., control vector M can be obtained, thereby obtaining the sequence of control vectors 450.
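A hedged sketch of this windowing over the first coding sequence follows; `control_model` is a placeholder for the pretrained mapping from a feature code to a control vector, and the window and step sizes are arbitrary example values.

```python
import numpy as np

def control_vector_sequence(first_coding_seq, control_model, win=10, step=5):
    # first_coding_seq: array of shape (num_phoneme_types, num_steps), e.g. the
    # smoothed first coding sequence; each window is one feature code
    feature_codes = [
        first_coding_seq[:, i:i + win]
        for i in range(0, first_coding_seq.shape[1] - win + 1, step)
    ]
    # control vector 1 .. control vector M, aligned with the second coding sequence
    return [control_model(code) for code in feature_codes]

# e.g. vectors = control_vector_sequence(features, lambda code: code.mean(axis=1))
```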
控制向量的序列450与第二编码序列440在时间上是对齐的,由于所述第二编码序列中的每个特征编码是根据音素序列中的至少一个音素获得的,因此控制向量的序列450中的每个特征向量同样是根据音素序列中的至少一个音素获得的。在播放文本数据所对应的音素序列的同时,根据所述控制向量的序列驱动所述交互对象做出动作,即能够实现驱动交互对象发出文本内容所对应的声音的同时,做出与声音同步的动作,给目标对象以所述交互对象正在说话的感觉,提升了目标对象与交互对象的交互体验。The sequence 450 of the control vector and the second coding sequence 440 are aligned in time. Since each feature code in the second coding sequence is obtained according to at least one phoneme in the phoneme sequence, the sequence 450 of the control vector Each feature vector of is also obtained from at least one phoneme in the phoneme sequence. While playing the phoneme sequence corresponding to the text data, the interactive object is driven to make an action according to the sequence of the control vector, that is, it can drive the interactive object to emit the sound corresponding to the text content, and make the sound synchronized with the sound. The action gives the target object the feeling that the interactive object is speaking, and improves the interactive experience between the target object and the interactive object.
Assuming the feature codes begin to be output at a set time within the first time window, the control vectors before that set time may be set to a default value; that is, when the phoneme sequence first starts playing, the interactive object performs a default action, and only after the set time is it driven to act by the sequence of control vectors derived from the first encoding sequence. Taking FIG. 4 as an example, feature code 1 begins to be output at time t0, and before t0 the default control vector is output.
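A small sketch of this timeline logic; the output start time t0, the step size, and the default vector are assumed inputs rather than values fixed by the disclosure.

    def control_vector_at(t, t0, step, control_vectors, default_vector):
        # Before the first window's output time t0, fall back to the default
        # vector so the object holds a neutral action; afterwards pick the
        # control vector whose output interval covers time t.
        if t < t0:
            return default_vector
        i = min(int((t - t0) / step), len(control_vectors) - 1)
        return control_vectors[i]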
The length of the time window is related to the amount of information contained in the feature codes. When the time window contains a large amount of information, processing by the recurrent neural network yields a relatively smooth result. If the time window is too long, the interactive object's expression while speaking may fail to correspond to parts of the text; if it is too short, the expression may appear stiff. The duration of the time window therefore needs to be determined according to the minimum duration of the phonemes corresponding to the text data, so that the actions driven on the interactive object correlate more strongly with the sound.
The step size of the sliding window is related to the time interval (frequency) at which control vectors are obtained, i.e., the frequency at which the interactive object is driven to act. The length and step size of the time window can be set according to the actual interaction scene, so that the expressions and actions of the interactive object are more closely related to the sound and appear more vivid and natural.
In some embodiments, when the time interval between phonemes in the phoneme sequence is greater than a set threshold, the interactive object is driven to act according to a set control vector for the local area. That is, when the interactive character pauses for a relatively long time while speaking, the interactive object is driven to perform a set action. For example, during a long pause in the output speech, the interactive object can be made to smile or sway its body slightly, avoiding an expressionless, rigid stance during long pauses; this makes the interactive object's speaking process more natural and fluid and improves the target object's interactive experience.
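One way this pause handling could look, assuming per-phoneme start times and a preset "idle" control vector; both names are hypothetical stand-ins.

    def insert_idle_actions(phoneme_times, threshold, idle_vector, vectors):
        # phoneme_times: start time of each phoneme. Where the gap between
        # consecutive phonemes exceeds the threshold, substitute the preset
        # idle control vector (e.g. a slight smile or gentle body sway).
        out = list(vectors)
        for i in range(1, len(phoneme_times)):
            if phoneme_times[i] - phoneme_times[i - 1] > threshold:
                out[i] = idle_vector
        return out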
In some embodiments, the voice data sequence includes a voice frame sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquiring, according to the first acoustic feature sequence, the acoustic feature vector corresponding to at least one voice frame; and obtaining, according to the acoustic feature vector, the feature information corresponding to the at least one voice frame.
In the embodiments of the present disclosure, the control parameters of at least one local area of the interactive object may be determined according to the acoustic features of the voice frame sequence, or according to other features of the voice frame sequence.
First, the acoustic feature sequence corresponding to the voice frame sequence is acquired. To distinguish it from acoustic feature sequences mentioned later, the acoustic feature sequence corresponding to the voice frame sequence is referred to here as the first acoustic feature sequence.
In the embodiments of the present disclosure, the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-frequency cepstral coefficients (MFCCs), and so on.
The first acoustic feature sequence is obtained by processing the voice frame sequence as a whole. Taking MFCC features as an example, the MFCC coefficients corresponding to each voice frame can be obtained by windowing each voice frame in the sequence, applying a fast Fourier transform, mel filtering, taking logarithms, and applying a discrete cosine transform.
Because the first acoustic feature sequence is obtained by processing the voice frame sequence as a whole, it reflects the overall acoustic characteristics of the voice data sequence.
In the embodiments of the present disclosure, the first acoustic feature sequence contains an acoustic feature vector corresponding to each voice frame in the voice frame sequence. Taking MFCCs as an example, the first acoustic feature sequence contains the MFCC coefficients of each voice frame. The first acoustic feature sequence obtained from the voice frame sequence is shown in FIG. 5.
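The per-frame MFCC pipeline described above (windowing, FFT, mel filtering, log, DCT) is what standard audio libraries implement. The sketch below uses librosa; the sample rate, frame length, hop, and number of coefficients are assumptions for illustration, not values specified in the disclosure.

    import librosa

    def first_acoustic_feature_sequence(wav_path, n_mfcc=13):
        # Load the speech, then compute one MFCC vector per frame:
        # windowing -> FFT -> mel filtering -> log -> DCT, as described above.
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)  # 25 ms / 10 ms
        return mfcc.T  # shape (num_frames, n_mfcc): one vector per voice frame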
Next, the acoustic feature corresponding to at least one voice frame is acquired according to the first acoustic feature sequence.
Where the first acoustic feature sequence includes an acoustic feature vector for each voice frame in the voice frame sequence, the same number of feature vectors as the at least one voice frame may be taken as the acoustic feature of those voice frames. These feature vectors can form a feature matrix, and that feature matrix constitutes the acoustic feature of the at least one voice frame.
Taking FIG. 5 as an example, N feature vectors in the first acoustic feature sequence form the acoustic feature of the corresponding N voice frames, where N is a positive integer. The first acoustic feature sequence may include multiple such acoustic features, and the voice frames corresponding to the respective acoustic features may partially overlap.
Finally, the control vector of at least one local area of the interactive object corresponding to the acoustic feature is acquired.
For the acquired acoustic feature corresponding to at least one voice frame, the control vector of at least one local area can be obtained. The local areas can be selected according to the actions of the interactive object that need to be controlled; for example, when the face and limbs of the interactive object need to act simultaneously, the control vectors of all local areas can be obtained, and when only the expression needs to be controlled, the control vectors of the local areas corresponding to the face can be obtained.
By driving the interactive object to act according to the control vectors corresponding to the acoustic features obtained from the first acoustic feature sequence while the voice data sequence is played, the terminal device can output sound while the interactive object performs actions that match that sound, including facial actions, expressions, and limb actions, giving the target object the feeling that the interactive object is speaking. Moreover, since the control vectors are related to the acoustic features of the output sound, driving the interactive object according to them lends emotional nuance to its expressions and limb actions, making its speech appear more natural and vivid and thereby improving the interactive experience between the target object and the interactive object.
In some embodiments, the acoustic feature corresponding to the at least one voice frame may be obtained by sliding a window over the first acoustic feature sequence.
By sliding a time window of a set length over the first acoustic feature sequence with a set step size, the acoustic feature vectors within the time window are taken as the acoustic feature of the corresponding (same number of) voice frames, yielding the acoustic feature jointly corresponding to those frames. After the sliding is complete, a second acoustic feature sequence is obtained from the resulting acoustic features.
Taking the driving method of the interactive object shown in FIG. 5 as an example, the voice frame sequence contains 100 voice frames per second, the time window is 1 s long, and the step size is 0.04 s. Since each feature vector in the first acoustic feature sequence corresponds to a voice frame, the first acoustic feature sequence likewise contains 100 feature vectors per second. During the sliding, the 100 feature vectors within the time window are obtained each time as the acoustic feature of the corresponding 100 voice frames. By moving the time window over the first acoustic feature sequence in steps of 0.04 s, acoustic feature 1 corresponding to voice frames 1 to 100 and acoustic feature 2 corresponding to voice frames 5 to 104 are obtained, and so on; after traversing the first acoustic feature sequence, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, yielding the second acoustic feature sequence. Here M is a positive integer whose value is determined by the number of frames in the voice frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
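The numbers in this example can be reproduced directly; the sketch below assumes the first acoustic feature sequence is an array of per-frame feature vectors, with window and step converted from seconds to frames as above.

    frames_per_sec = 100
    win_len = int(1.0 * frames_per_sec)    # 1 s window  -> 100 feature vectors
    step = int(0.04 * frames_per_sec)      # 0.04 s step -> 4 feature vectors

    def second_acoustic_feature_sequence(first_seq):
        # first_seq: (T, L) per-frame acoustic feature vectors (e.g. MFCCs).
        return [first_seq[s:s + win_len]
                for s in range(0, len(first_seq) - win_len + 1, step)]

    # For a 10 s clip (T = 1000): M = (1000 - 100) // 4 + 1 = 226 windows.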
From acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding control vector 1, control vector 2, ..., control vector M can be obtained respectively, yielding the sequence of control vectors.
As shown in FIG. 5, the sequence of control vectors is aligned in time with the second acoustic feature sequence. Acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained from N feature vectors in the first acoustic feature sequence; therefore, the interactive object can be driven to act according to the sequence of control vectors while the voice frames are played.
Assuming the acoustic features begin to be output at a set time within the first time window, the control vectors before that set time may be set to a default value; that is, when the voice frame sequence first starts playing, the interactive object performs a default action, and only after the set time is it driven to act by the sequence of control vectors derived from the first acoustic feature sequence.
Taking FIG. 5 as an example, acoustic feature 1 begins to be output at time t0, with subsequent acoustic features output at intervals of 0.04 s (the time corresponding to the step size): acoustic feature 2 from time t1, acoustic feature 3 from time t2, and so on, until acoustic feature M is output at time t(M-1). Correspondingly, the period from ti to t(i+1) corresponds to acoustic feature (i+1), where i is an integer less than (M-1); before time t0, the default control vector is used.
In the embodiments of the present disclosure, by driving the interactive object to act according to the sequence of control vectors while the voice data sequence is played, the actions of the interactive object are synchronized with the output sound, giving the target object the feeling that the interactive object is speaking and improving the interactive experience between the target object and the interactive object.
The length of the time window is related to the amount of information contained in the acoustic feature: the longer the window, the more information it contains, and the stronger the correlation between the driven actions and the sound. The step size of the sliding window is related to the time interval (frequency) at which control vectors are obtained, i.e., the frequency at which the interactive object is driven to act. The length and step size of the time window can be set according to the actual interaction scene, so that the expressions and actions of the interactive object are more closely related to the sound and appear more vivid and natural.
In some embodiments, the acoustic feature includes L-dimensional Mel-frequency cepstral coefficients (MFCCs), where L is a positive integer. MFCCs describe the distribution of the energy of a speech signal across frequency ranges; the L-dimensional MFCCs can be obtained by converting the voice frame data in the voice frame sequence to the frequency domain and applying a mel filter bank comprising L sub-bands. By obtaining control vectors from the MFCCs of the voice data sequence and driving the facial and limb actions of the interactive object according to those vectors, the expressions and limb actions of the interactive object acquire emotional nuance, making its speech more natural and vivid and improving the interactive experience between the target object and the interactive object.
In some embodiments, the feature information of the voice data unit may be input into a pre-trained recurrent neural network to obtain the control parameter values of the interactive object corresponding to that feature information. Since a recurrent neural network is a time-recursive neural network, it can learn the historical information of the input feature information and output control parameters according to the sequence of voice units; the control parameters may be, for example, facial posture control parameters, or the control vector of at least one local area.
In the embodiments of the present disclosure, a pre-trained recurrent neural network is used to obtain the control parameters corresponding to the feature information of the voice data units, fusing correlated historical feature information with the current feature information, so that historical control parameters influence the changes of the current control parameters and the expression changes and limb actions of the interactive character become smoother and more natural.
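The disclosure does not fix a particular recurrent architecture. As one plausible sketch, a small GRU maps each unit's feature information to a control-parameter vector, with the recurrent state carrying the historical information discussed above; all dimensions here are hypothetical.

    import torch
    import torch.nn as nn

    class ControlParamRNN(nn.Module):
        def __init__(self, feat_dim=16, hidden=128, param_dim=24):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, param_dim)

        def forward(self, feats):
            # feats: (batch, seq_len, feat_dim) feature information per unit.
            out, _ = self.rnn(feats)   # history is carried in the GRU state
            return self.head(out)      # (batch, seq_len, param_dim)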
In some embodiments, the recurrent neural network can be trained in the following manner.
First, feature information samples are acquired. For example, the feature information samples can be obtained as follows.
Acquire a video segment of a character speaking, and extract the corresponding speech segment from the video segment; for example, a video segment in which a real person is speaking may be obtained. Sample the video segment to obtain multiple first image frames containing the character, and sample the speech segment to obtain multiple voice frames.
Acquire, according to the voice data units contained in the voice frames corresponding to the first image frames, the feature information corresponding to those voice frames.
Convert the first image frames into second image frames containing the interactive object, and acquire the control parameter values of the interactive object corresponding to the second image frames.
Annotate the feature information corresponding to the first image frames with the control parameter values to obtain the feature information samples, as sketched below.
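A compact sketch of this sample-construction loop; the three callables are hypothetical stand-ins for the feature extraction, real-person-to-interactive-object conversion, and control-parameter fitting steps described above.

    def build_samples(image_frames, voice_frames,
                      extract_feature_info, to_interactive_object,
                      fit_control_params):
        # Pair each first image frame with its voice frame, compute the
        # feature information, retarget the frame to the interactive object,
        # and annotate the feature with the fitted control parameter values.
        samples = []
        for img, voice in zip(image_frames, voice_frames):
            feat = extract_feature_info(voice)
            params = fit_control_params(to_interactive_object(img))
            samples.append((feat, params))
        return samples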
In some embodiments, the feature information includes the feature codes of phonemes, and the control parameters include facial muscle control coefficients. Following the above method for acquiring feature information samples, the feature codes of the phonemes corresponding to the first image frames are annotated with the obtained facial muscle control coefficients, yielding the feature information samples corresponding to the feature codes of the phonemes.
In some embodiments, the feature information includes the feature codes of phonemes, and the control parameters include the control vector of at least one local area of the interactive object. Following the above method, the feature codes of the phonemes corresponding to the first image frames are annotated with the obtained control vector(s) of at least one local area, yielding the feature information samples corresponding to the feature codes of the phonemes.
In some embodiments, the feature information includes the acoustic features of voice frames, and the control parameters include the control vector of at least one local area of the interactive object. Following the above method, the acoustic features of the voice frames corresponding to the first image frames are annotated with the obtained control vector(s) of at least one local area, yielding the feature information samples corresponding to the acoustic features of the voice frames.
Those skilled in the art should understand that the feature information samples are not limited to the above; corresponding feature information samples can be obtained for the various features of each type of voice data unit.
After the feature information samples are obtained, an initial recurrent neural network is trained on them, and the trained recurrent neural network is obtained once the change in the network loss satisfies a convergence condition, where the network loss includes the difference between the control parameter values predicted by the recurrent neural network and the annotated control parameter values.
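A minimal training-loop sketch consistent with this description, using mean-squared error between predicted and annotated control parameter values and a simple loss-change convergence test; the optimizer, learning rate, and tolerance are assumptions.

    import torch
    import torch.nn as nn

    def train(model, loader, epochs=10, lr=1e-4, tol=1e-5):
        # loader yields (features, annotated_params) batches built as above.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()           # predicted vs. annotated values
        prev = float("inf")
        for _ in range(epochs):
            total = 0.0
            for feats, target in loader:
                pred = model(feats)
                loss = loss_fn(pred, target)
                opt.zero_grad()
                loss.backward()
                opt.step()
                total += loss.item()
            if abs(prev - total) < tol:  # a simple convergence condition
                break
            prev = total
        return model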
In the embodiments of the present disclosure, a character's video segment is split into corresponding first image frames and voice frames, and the first image frames containing the real person are converted into second image frames containing the interactive object in order to obtain the control parameter values corresponding to the feature information of at least one voice frame. This yields a good correspondence between the feature information and the control parameter values, and thus high-quality feature information samples, so that the posture of the interactive object is closer to the real posture of the corresponding character.
FIG. 6 shows a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 6, the apparatus may include: a first acquiring unit 601, configured to acquire the driving data of the interactive object and determine the driving mode of the driving data; a second acquiring unit 602, configured to acquire, in response to the driving mode, the control parameter values of the interactive object according to the driving data; and a driving unit 603, configured to control the posture of the interactive object according to the control parameter values.
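Read as code, the three units could be composed as below; this is a hypothetical object-oriented rendering for illustration, not the disclosure's implementation.

    class InteractiveObjectDriver:
        # A sketch of units 601-603 as injected callables.
        def __init__(self, first_acquiring_unit, second_acquiring_unit,
                     driving_unit):
            self.first = first_acquiring_unit     # driving data + driving mode
            self.second = second_acquiring_unit   # mode -> control parameters
            self.drive = driving_unit             # parameters -> posture

        def step(self, raw_input):
            driving_data, mode = self.first(raw_input)
            params = self.second(mode, driving_data)
            self.drive(params)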
In some embodiments, the apparatus further includes an output unit configured to control the display device to output speech and/or display text according to the driving data.
In some embodiments, when determining the driving mode corresponding to the driving data, the first acquiring unit is specifically configured to: acquire, according to the type of the driving data, the voice data sequence corresponding to the driving data, the voice data sequence including multiple voice data units; and, if target data is detected in the voice data units, determine that the driving mode of the driving data is a first driving mode, the target data corresponding to preset control parameter values of the interactive object. Acquiring the control parameter values of the interactive object according to the driving data in response to the driving mode then includes: in response to the first driving mode, taking the preset control parameter values corresponding to the target data as the control parameter values of the interactive object.
In some embodiments, the target data includes a keyword or a key character corresponding to preset control parameter values of a set action of the interactive object; or, the target data includes a syllable corresponding to preset control parameter values of a set mouth-shape action of the interactive object.
In some embodiments, when identifying the driving mode of the driving data, the first acquiring unit is specifically configured to: acquire, according to the type of the driving data, the voice data sequence corresponding to the driving data, the voice data sequence including multiple voice data units; and, if no target data is detected in the voice data units, determine that the driving mode of the driving data is a second driving mode, the target data corresponding to preset control parameter values of the interactive object. Acquiring the control parameter values of the interactive object according to the driving data in response to the driving mode then includes: in response to the second driving mode, acquiring the feature information of at least one voice data unit in the voice data sequence, and acquiring the control parameter values of the interactive object corresponding to that feature information.
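The first/second driving-mode decision can be sketched as a lookup over the voice data units; the table mapping target data (keywords or syllables) to preset control parameter values is a hypothetical name.

    def determine_driving_mode(voice_units, target_table):
        # target_table: target data -> preset control parameter values.
        for unit in voice_units:
            if unit in target_table:
                return "first", target_table[unit]  # use presets directly
        return "second", None  # fall back to feature extraction + prediction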
In some embodiments, the voice data sequence includes a phoneme sequence, and when acquiring the feature information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain a first encoding sequence corresponding to the phoneme sequence; acquire, according to the first encoding sequence, the feature code corresponding to at least one phoneme; and obtain, according to the feature code, the feature information of the at least one phoneme.
In some embodiments, the voice data sequence includes a voice frame sequence, and when acquiring the feature information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: acquire the first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquire, according to the first acoustic feature sequence, the acoustic feature vector corresponding to at least one voice frame; and obtain, according to the acoustic feature vector, the feature information corresponding to the at least one voice frame.
In some embodiments, the control parameters of the interactive object include facial posture parameters, the facial posture parameters including facial muscle control coefficients used to control the motion state of at least one facial muscle. When acquiring the control parameter values of the interactive object according to the driving data, the second acquiring unit is specifically configured to acquire the facial muscle control coefficients of the interactive object according to the driving data, and the driving unit is specifically configured to drive the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients. The apparatus further includes a limb driving unit configured to acquire the driving data of the body posture associated with the facial posture parameters, and to drive the interactive object to make limb actions according to the driving data of the body posture associated with the facial posture parameter values.
In some embodiments, the control parameters of the interactive object include the control vector of at least one local area of the interactive object. When acquiring the control parameter values of the interactive object according to the driving data, the second acquiring unit is specifically configured to acquire the control vector of at least one local area of the interactive object according to the driving data, and the driving unit is specifically configured to control the facial actions and/or limb actions of the interactive object according to the acquired control vector(s) of the at least one local area.
According to an aspect of the present disclosure, an electronic device is provided. The device includes a memory and a processor; the memory is used to store computer instructions executable on the processor, and the processor is used to implement the method for driving an interactive object described in any of the embodiments provided in the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method for driving an interactive object described in any of the embodiments provided in the present disclosure is implemented.
At least one embodiment of this specification further provides an electronic device. As shown in FIG. 7, the device includes a memory and a processor; the memory is used to store computer instructions executable on the processor, and the processor is used to implement the method for driving an interactive object described in any embodiment of the present disclosure when executing the computer instructions.
At least one embodiment of this specification further provides a computer-readable storage medium, on which a computer program is stored; when the program is executed by a processor, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the data processing device embodiment is described relatively simply because it is substantially similar to the method embodiment; for the relevant parts, reference may be made to the corresponding description of the method embodiment.
The foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this specification may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification may be performed by one or more programmable computers executing one or more computer programs, performing the corresponding functions by operating on input data and generating output. The processes and logic flows may also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as special-purpose logic circuitry.
Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or is operatively coupled to such mass storage devices to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown, or in sequential order, or that all illustrated operations be performed, to achieve desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments; it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments fall within the scope of the appended claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve desired results. In some implementations, multitasking and parallel processing may be advantageous.
The above descriptions are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit one or more embodiments of this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.

Claims (20)

1. A method for driving an interactive object, the interactive object being displayed in a display device, the method comprising:
    acquiring driving data of the interactive object, and determining a driving mode of the driving data;
    in response to the driving mode, acquiring control parameter values of the interactive object according to the driving data; and
    controlling a posture of the interactive object according to the control parameter values.
2. The method according to claim 1, further comprising: controlling the display device to output speech and/or display text according to the driving data.
3. The method according to claim 1 or 2, wherein determining the driving mode corresponding to the driving data comprises:
    acquiring, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting that the voice data units include target data, determining that the driving mode of the driving data is a first driving mode, the target data corresponding to preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the first driving mode, taking the preset control parameter values corresponding to the target data as the control parameter values of the interactive object.
4. The method according to claim 3, wherein the target data comprises a keyword or a key character, the keyword or the key character corresponding to preset control parameter values of a set action of the interactive object; or,
    the target data comprises a syllable, the syllable corresponding to preset control parameter values of a set mouth-shape action of the interactive object.
5. The method according to any one of claims 1 to 4, wherein determining the driving mode of the driving data comprises:
    acquiring, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting no target data in the voice data units, determining that the driving mode of the driving data is a second driving mode, the target data corresponding to the preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence; and
    acquiring the control parameter values of the interactive object corresponding to the feature information.
6. The method according to claim 5, wherein the voice data sequence comprises a phoneme sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence comprises:
    performing feature encoding on the phoneme sequence to obtain a first encoding sequence corresponding to the phoneme sequence;
    acquiring, according to the first encoding sequence, a feature code corresponding to at least one phoneme; and
    obtaining, according to the feature code, feature information of the at least one phoneme.
7. The method according to claim 5, wherein the voice data sequence comprises a voice frame sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence comprises:
    acquiring a first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence comprising an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
    acquiring, according to the first acoustic feature sequence, an acoustic feature vector corresponding to at least one voice frame; and
    obtaining, according to the acoustic feature vector, feature information corresponding to the at least one voice frame.
8. The method according to any one of claims 1 to 7, wherein the control parameters of the interactive object comprise facial posture parameters, the facial posture parameters comprising facial muscle control coefficients used to control a motion state of at least one facial muscle;
    acquiring the control parameter values of the interactive object according to the driving data comprises:
    acquiring facial muscle control coefficients of the interactive object according to the driving data; and
    controlling the posture of the interactive object according to the control parameter values comprises:
    driving the interactive object to make facial actions matching the driving data according to the acquired facial muscle control coefficients.
9. The method according to claim 8, further comprising:
    acquiring driving data of a body posture associated with the facial posture parameters; and
    driving the interactive object to make limb actions according to the driving data of the body posture associated with the facial posture parameter values.
10. The method according to any one of claims 1 to 9, wherein the control parameters of the interactive object comprise a control vector of at least one local area of the interactive object;
    acquiring the control parameter values of the interactive object according to the driving data comprises:
    acquiring the control vector of at least one local area of the interactive object according to the driving data; and
    controlling the posture of the interactive object according to the control parameter values comprises:
    controlling facial actions and/or limb actions of the interactive object according to the acquired control vector of the at least one local area.
11. The method according to claim 5, wherein acquiring the control parameter values of the interactive object corresponding to the feature information comprises:
    inputting the feature information into a pre-trained recurrent neural network to obtain the control parameter values of the interactive object corresponding to the feature information.
12. A driving apparatus for an interactive object, the interactive object being displayed in a display device, the apparatus comprising:
    a first acquiring unit, configured to acquire driving data of the interactive object and determine a driving mode of the driving data;
    a second acquiring unit, configured to acquire, in response to the driving mode, control parameter values of the interactive object according to the driving data; and
    a driving unit, configured to control a posture of the interactive object according to the control parameter values.
13. The apparatus according to claim 12, further comprising an output unit configured to control the display device to output speech and/or display text according to the driving data.
14. The apparatus according to claim 12 or 13, wherein, when determining the driving mode corresponding to the driving data, the first acquiring unit is configured to:
    acquire, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting that the voice data units include target data, determine that the driving mode of the driving data is a first driving mode, the target data corresponding to preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the first driving mode, taking the preset control parameter values corresponding to the target data as the control parameter values of the interactive object;
    wherein the target data comprises a keyword or a key character, the keyword or the key character corresponding to preset control parameter values of a set action of the interactive object; or,
    the target data comprises a syllable, the syllable corresponding to preset control parameter values of a set mouth-shape action of the interactive object.
15. The apparatus according to any one of claims 12 to 14, wherein, when determining the driving mode of the driving data, the first acquiring unit is configured to:
    acquire, according to a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence comprising a plurality of voice data units; and
    in response to detecting no target data in the voice data units, determine that the driving mode of the driving data is a second driving mode, the target data corresponding to the preset control parameter values of the interactive object;
    wherein, in response to the driving mode, acquiring the control parameter values of the interactive object according to the driving data comprises:
    in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence; and
    acquiring the control parameter values of the interactive object corresponding to the feature information.
  16. 根据权利要求15所述的装置,其中,所述语音数据序列包括音素序列,在获取所述语音数据序列中的至少一个语音数据单元的特征信息时,所述第二获取单元用于:The device according to claim 15, wherein the voice data sequence comprises a phoneme sequence, and when acquiring characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is configured to:
    对所述音素序列进行特征编码,获得所述音素序列对应的第一编码序列;Performing feature encoding on the phoneme sequence to obtain a first encoding sequence corresponding to the phoneme sequence;
    根据所述第一编码序列,获取至少一个音素对应的特征编码;Obtaining a feature code corresponding to at least one phoneme according to the first coding sequence;
    根据所述特征编码,获得所述至少一个音素的特征信息;Obtaining the characteristic information of the at least one phoneme according to the characteristic encoding;
    或者,所述语音数据序列包括语音帧序列,在获取所述语音数据序列中的至少一个语音数据单元的特征信息时,所述第二获取单元用于:Alternatively, the voice data sequence includes a voice frame sequence, and when acquiring characteristic information of at least one voice data unit in the voice data sequence, the second acquiring unit is configured to:
    获取所述语音帧序列对应的第一声学特征序列,所述第一声学特征序列包括与所述语音帧序列中的每个语音帧对应的声学特征向量;Acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
    根据所述第一声学特征序列,获取至少一个语音帧对应的声学特征向量;Obtaining an acoustic feature vector corresponding to at least one speech frame according to the first acoustic feature sequence;
    根据所述声学特征向量,获得所述至少一个语音帧对应的特征信息。According to the acoustic feature vector, feature information corresponding to the at least one speech frame is obtained.
  17. 根据权利要求12至16任一项所述的装置,其中,所述交互对象的控制参数包括面部姿态参数,所述面部姿态参数包括面部肌肉控制系数,所述面部肌肉控制系数用于控制至少一个面部肌肉的运动状态;The device according to any one of claims 12 to 16, wherein the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and the facial muscle control coefficients are used to control at least one The movement state of facial muscles;
    在根据所述驱动数据获取所述交互对象的控制参数值时,所述第二获取单元用于:When acquiring the control parameter value of the interactive object according to the driving data, the second acquiring unit is configured to:
    根据所述驱动数据获取所述交互对象的所述面部肌肉控制系数;Acquiring the facial muscle control coefficient of the interactive object according to the driving data;
    所述驱动单元用于:The driving unit is used for:
    根据所获取的所述面部肌肉控制系数,驱动所述交互对象做出与所述驱动数据匹配的面部动作;According to the acquired facial muscle control coefficient, driving the interactive object to make a facial action matching the driving data;
    所述装置还包括肢体驱动单元,所述肢体驱动单元用于获取与所述面部姿态参数关联的身体姿态的驱动数据;根据与所述面部姿态参数值关联的身体姿态的驱动数据,驱动所述交互对象做出肢体动作。The device also includes a limb drive unit, which is used to obtain body posture drive data associated with the facial posture parameter; and drive the body posture drive data associated with the facial posture parameter value. The interactive object makes physical movements.
  18. The apparatus according to any one of claims 12 to 16, wherein the control parameters of the interactive object comprise a control vector of at least one local region of the interactive object;
    when acquiring the control parameter value of the interactive object according to the driving data, the second acquiring unit is configured to:
    acquire the control vector of the at least one local region of the interactive object according to the driving data;
    the driving unit is configured to:
    control a facial action and/or a limb action of the interactive object according to the acquired control vector of the at least one local region.
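As a rough illustration of claim 18's per-region control vectors, the sketch below dispatches each local region's vector to a facial or limb controller; the region names, the vector dimensions, and the facial/limb split are assumptions for illustration only.

```python
import numpy as np

def drive_regions(control_vectors):
    # Dispatch each local region's control vector to a facial or limb
    # controller; the region names and the split below are illustrative.
    facial = {"mouth", "eyes", "brows"}
    for region, vec in control_vectors.items():
        kind = "facial" if region in facial else "limb"
        print(f"{kind} action, region={region}, vector={np.round(vec, 2)}")

drive_regions({
    "mouth": np.array([0.7, 0.2]),     # e.g. open amount, stretch
    "left_arm": np.array([0.1, 0.9]),  # e.g. raise, bend
})
```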
  19. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method according to any one of claims 1 to 11 when executing the computer instructions.
  20. A computer-readable storage medium having a computer program stored thereon, wherein the method according to any one of claims 1 to 11 is implemented when the computer program is executed by a processor.
PCT/CN2020/129806 2020-03-31 2020-11-18 Method, apparatus and device for driving interactive object, and storage medium WO2021196645A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021556973A JP7227395B2 (en) 2020-03-31 2020-11-18 Interactive object driving method, apparatus, device, and storage medium
KR1020217031139A KR102707613B1 (en) 2020-03-31 2020-11-18 Methods, apparatus, devices and storage media for driving interactive objects

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010246112.0A CN111459452B (en) 2020-03-31 2020-03-31 Driving method, device and equipment of interaction object and storage medium
CN202010246112.0 2020-03-31

Publications (1)

Publication Number Publication Date
WO2021196645A1 true WO2021196645A1 (en) 2021-10-07

Family

ID=71683479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129806 WO2021196645A1 (en) 2020-03-31 2020-11-18 Method, apparatus and device for driving interactive object, and storage medium

Country Status (5)

Country Link
JP (1) JP7227395B2 (en)
KR (1) KR102707613B1 (en)
CN (1) CN111459452B (en)
TW (1) TWI760015B (en)
WO (1) WO2021196645A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460785B (en) * 2020-03-31 2023-02-28 Beijing SenseTime Technology Development Co., Ltd. Method, device and equipment for driving interactive object and storage medium
CN111459452B (en) * 2020-03-31 2023-07-18 Beijing SenseTime Technology Development Co., Ltd. Driving method, device and equipment of interaction object and storage medium
CN111459450A (en) * 2020-03-31 2020-07-28 Beijing SenseTime Technology Development Co., Ltd. Interactive object driving method, device, equipment and storage medium
CN113050859B (en) * 2021-04-19 2023-10-24 Beijing SenseTime Technology Development Co., Ltd. Driving method, device and equipment of interaction object and storage medium
CN114283227B (en) * 2021-11-26 2023-04-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Virtual character driving method and device, electronic equipment and readable storage medium
CN116977499B (en) * 2023-09-21 2024-01-16 Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (Futian) Combined generation method of facial and body movement parameters and related equipment

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6570555B1 (en) * 1998-12-30 2003-05-27 Fuji Xerox Co., Ltd. Method and apparatus for embodied conversational characters with multimodal input/output in an interface device
JP4661074B2 (en) * 2004-04-07 2011-03-30 Sony Corporation Information processing system, information processing method, and robot apparatus
KR101370897B1 (en) * 2007-03-19 2014-03-11 LG Electronics Inc. Method for controlling image, and terminal therefor
AU2012253367B2 (en) * 2011-05-11 2015-04-30 The Cleveland Clinic Foundation Interactive graphical map visualization for healthcare
CN102609969B (en) * 2012-02-17 2013-08-07 Shanghai Jiao Tong University Method for processing face and speech synchronous animation based on Chinese text drive
JP2015166890A (en) * 2014-03-03 2015-09-24 Sony Corporation Information processing apparatus, information processing system, information processing method, and program
JP2016038601A (en) * 2014-08-05 2016-03-22 Japan Broadcasting Corporation CG character interaction device and CG character interaction program
CN106056989B (en) * 2016-06-23 2018-10-16 Guangdong Genius Technology Co., Ltd. Language learning method and device and terminal equipment
CN107329990A (en) * 2017-06-06 2017-11-07 Beijing Guangnian Wuxian Technology Co., Ltd. Emotion output method and dialogue interaction system for a virtual robot
CN107704169B (en) * 2017-09-26 2020-11-17 Beijing Guangnian Wuxian Technology Co., Ltd. Virtual human state management method and system
CN107861626A (en) * 2017-12-06 2018-03-30 Beijing Guangnian Wuxian Technology Co., Ltd. Method and system for waking up a virtual image
KR101992424B1 (en) * 2018-02-06 2019-06-24 Persona System Co., Ltd. Apparatus for making artificial intelligence character for augmented reality and service system using the same
CN108942919B (en) * 2018-05-28 2021-03-30 Beijing Guangnian Wuxian Technology Co., Ltd. Interaction method and system based on virtual human
CN109739350A (en) * 2018-12-24 2019-05-10 Wuhan Xishan Yichuang Culture Co., Ltd. AI intelligent assistant device based on a transparent liquid crystal display and interaction method therefor
CN110176284A (en) * 2019-05-21 2019-08-27 Hangzhou Normal University Speech apraxia rehabilitation training method based on virtual reality
CN110288682B (en) * 2019-06-28 2023-09-26 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for controlling changes in a three-dimensional virtual portrait mouth shape
CN110716634A (en) * 2019-08-28 2020-01-21 Beijing SenseTime Technology Development Co., Ltd. Interaction method, device, equipment and display equipment
CN110815258B (en) * 2019-10-30 2023-03-31 South China University of Technology Robot teleoperation system and method based on electromagnetic force feedback and augmented reality

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180160077A1 (en) * 2016-04-08 2018-06-07 Maxx Media Group, LLC System, Method and Software for Producing Virtual Three Dimensional Avatars that Actively Respond to Audio Signals While Appearing to Project Forward of or Above an Electronic Display
CN110876024A (en) * 2018-08-31 2020-03-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for determining lip action of avatar
CN109377539A (en) * 2018-11-06 2019-02-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for generating animation
CN110009716A (en) * 2019-03-28 2019-07-12 NetEase (Hangzhou) Network Co., Ltd. Facial expression generation method and device, electronic equipment and storage medium
CN110688008A (en) * 2019-09-27 2020-01-14 Guizhou Xiaoai Robot Technology Co., Ltd. Virtual image interaction method and device
CN111459452A (en) * 2020-03-31 2020-07-28 Beijing SenseTime Technology Development Co., Ltd. Interactive object driving method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932706A (en) * 2022-04-15 2023-10-24 Huawei Technologies Co., Ltd. Chinese translation method and electronic equipment

Also Published As

Publication number Publication date
JP2022531072A (en) 2022-07-06
KR20210129713A (en) 2021-10-28
CN111459452A (en) 2020-07-28
CN111459452B (en) 2023-07-18
TWI760015B (en) 2022-04-01
KR102707613B1 (en) 2024-09-19
JP7227395B2 (en) 2023-02-21
TW202138970A (en) 2021-10-16

Similar Documents

Publication Publication Date Title
WO2021196645A1 (en) Method, apparatus and device for driving interactive object, and storage medium
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
WO2021196646A1 (en) Interactive object driving method and apparatus, device, and storage medium
US20200279553A1 (en) Linguistic style matching agent
WO2021196644A1 (en) Method, apparatus and device for driving interactive object, and storage medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
WO2021232876A1 (en) Method and apparatus for driving virtual human in real time, and electronic device and medium
CN110148406B (en) Data processing method and device for data processing
CN110162598B (en) Data processing method and device for data processing
WO2022252890A1 (en) Interaction object driving and phoneme processing methods and apparatus, device and storage medium
WO2021232877A1 (en) Method and apparatus for driving virtual human in real time, and electronic device, and medium
WO2021196647A1 (en) Method and apparatus for driving interactive object, device, and storage medium
CN110166844B (en) Data processing method and device for data processing
CN112632262A (en) Conversation method, conversation device, computer equipment and storage medium

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021556973

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20217031139

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20929260

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20929260

Country of ref document: EP

Kind code of ref document: A1