CN111460785A - Interactive object driving method, device, equipment and storage medium - Google Patents

Interactive object driving method, device, equipment and storage medium Download PDF

Info

Publication number
CN111460785A
Authority
CN
China
Prior art keywords
phoneme
sequence
interactive object
coding
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010245802.4A
Other languages
Chinese (zh)
Other versions
CN111460785B (en)
Inventor
吴文岩
吴潜溢
钱晨
白晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202010245802.4A priority Critical patent/CN111460785B/en
Publication of CN111460785A publication Critical patent/CN111460785A/en
Priority to PCT/CN2020/129793 priority patent/WO2021196644A1/en
Priority to SG11202111909QA priority patent/SG11202111909QA/en
Priority to KR1020217027692A priority patent/KR20210124307A/en
Priority to JP2021549562A priority patent/JP2022530935A/en
Priority to TW109144447A priority patent/TW202138992A/en
Application granted granted Critical
Publication of CN111460785B publication Critical patent/CN111460785B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

A driving method, a device, equipment and a storage medium of an interactive object are disclosed, wherein the method comprises the following steps: acquiring a phoneme sequence corresponding to the text data; acquiring a control parameter value of at least one local area of an interactive object matched with the phoneme sequence; and controlling the posture of the interactive object according to the acquired control parameter value.

Description

Interactive object driving method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for driving an interactive object.
Background
Most human-computer interaction is based on key presses, touch, and voice input, with responses presented as images, text, or virtual characters on a display screen. Current virtual characters are an improvement over voice assistants, but interaction between the user and the virtual character remains superficial.
Disclosure of Invention
The embodiment of the disclosure provides a driving scheme for an interactive object.
According to an aspect of the present disclosure, there is provided a driving method of an interactive object, the method including: acquiring a phoneme sequence corresponding to the text data; acquiring a control parameter value of at least one local area of an interactive object matched with the phoneme sequence; and controlling the posture of the interactive object according to the acquired control parameter value.
In combination with any embodiment provided by the present disclosure, the method further comprises: and controlling display equipment for displaying the interactive object to display a text according to the text data, and/or controlling the display equipment to output voice according to a phoneme sequence corresponding to the text data.
In combination with any one of the embodiments provided by the present disclosure, the obtaining a control parameter value of at least one local region of the interactive object matching the phoneme sequence includes: performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; acquiring a feature code corresponding to at least one phoneme according to the first coding sequence; and acquiring the attitude control vector of at least one local area of the interactive object corresponding to the feature code.
In combination with any embodiment provided by the present disclosure, the performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence includes: for a plurality of phonemes contained in the phoneme sequence, generating a sub-coding sequence corresponding to each phoneme; and obtaining the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the plurality of phonemes.
In combination with any embodiment provided by the present disclosure, the generating, for a plurality of phonemes contained in the phoneme sequence, a sub-coding sequence corresponding to each phoneme includes: detecting whether a first phoneme corresponds to each time point, wherein the first phoneme is any one of the plurality of phonemes; and setting the coding value at each time point with the first phoneme to a first numerical value and the coding value at each time point without the first phoneme to a second numerical value, to obtain the sub-coding sequence corresponding to the first phoneme.
In combination with any embodiment provided by the present disclosure, the method further comprises: performing, for the sub-coding sequence corresponding to a first phoneme, a Gaussian convolution operation on the temporally continuous values of the first phoneme with a Gaussian filter, wherein the first phoneme is any one of the plurality of phonemes.
In combination with any one of the embodiments provided by the present disclosure, the controlling the posture of the interactive object according to the obtained control parameter value includes: acquiring a sequence of attitude control vectors corresponding to the second coding sequence; and controlling the posture of the interactive object according to the sequence of the posture control vector.
In combination with any embodiment provided by the present disclosure, the method further comprises: and under the condition that the time interval between the phonemes in the phoneme sequence is greater than a set threshold value, controlling the posture of the interactive object according to the set control parameter value of the local area.
In combination with any one of the embodiments provided by the present disclosure, the acquiring an attitude control vector of at least one local area of the interactive object corresponding to the feature code includes: and inputting the feature codes into a recurrent neural network, and obtaining the attitude control vector of at least one local area of the interactive object corresponding to the feature codes.
In combination with any embodiment provided by the present disclosure, the recurrent neural network is trained using feature coding samples; the method further comprises: acquiring a video segment of a character uttering speech, and acquiring a plurality of first image frames containing the character from the video segment; extracting the corresponding speech segment from the video segment, obtaining a sample phoneme sequence from the speech segment, and performing feature coding on the sample phoneme sequence; acquiring the feature code of at least one phoneme corresponding to the first image frame; converting the first image frame into a second image frame containing the interactive object, and acquiring the attitude control vector value of at least one local area corresponding to the second image frame; and labeling the feature code corresponding to the first image frame according to the attitude control vector value to obtain a feature coding sample.
In combination with any embodiment provided by the present disclosure, the method further comprises: training an initial cyclic neural network according to the feature coding sample, and obtaining the cyclic neural network after the change of network loss meets a convergence condition, wherein the network loss comprises the difference between the attitude control vector value of the at least one local area predicted by the cyclic neural network and the labeled attitude control vector value.
According to an aspect of the present disclosure, there is provided an apparatus for driving an interactive object, the apparatus including: a first obtaining unit, configured to obtain a phoneme sequence corresponding to text data; a second obtaining unit, configured to obtain a control parameter value of at least one local region of the interactive object matching the phoneme sequence; and a driving unit, configured to control the posture of the interactive object according to the obtained control parameter value.
In combination with any embodiment provided by the present disclosure, the apparatus further includes an output unit, configured to control a display device displaying the interactive object to display a text according to the text data, and/or control the display device to output a voice according to a phoneme sequence corresponding to the text data.
In combination with any one of the embodiments provided by the present disclosure, the second obtaining unit is specifically configured to: performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; acquiring a feature code corresponding to at least one phoneme according to the first coding sequence; and acquiring the attitude control vector of at least one local area of the interactive object corresponding to the feature code.
In combination with any embodiment provided by the present disclosure, when the second obtaining unit is configured to perform feature coding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence, the second obtaining unit is specifically configured to: for a plurality of phonemes contained in the phoneme sequence, generate a sub-coding sequence corresponding to each phoneme; and obtain the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the plurality of phonemes.
In combination with any embodiment provided by the present disclosure, when the second obtaining unit is configured to generate a sub-coding sequence corresponding to each phoneme for a plurality of phonemes contained in the phoneme sequence, the second obtaining unit is specifically configured to: detect whether a first phoneme corresponds to each time point, wherein the first phoneme is any one of the plurality of phonemes; and set the coding value at each time point with the first phoneme to a first numerical value and the coding value at each time point without the first phoneme to a second numerical value, to obtain the sub-coding sequence corresponding to the first phoneme.
In combination with any embodiment provided by the present disclosure, the apparatus further includes a filtering unit, configured to perform, on a sub-coding sequence corresponding to a first phoneme, a gaussian convolution operation on a continuous value of the first phoneme in time by using a gaussian filter, where the first phoneme is any one of the multiple phonemes.
In combination with any embodiment provided by the present disclosure, when the second obtaining unit is configured to obtain, according to the first coding sequence, a feature code corresponding to at least one phoneme, the second obtaining unit is specifically configured to: and sliding the coding sequence by a time window with a set length and a set step length, taking the feature codes in the time window as the feature codes of the corresponding at least one phoneme, and obtaining a second coding sequence according to a plurality of feature codes obtained by completing the sliding window.
In combination with any one of the embodiments provided by the present disclosure, the driving unit is specifically configured to: acquiring a sequence of attitude control vectors corresponding to the second coding sequence; and controlling the posture of the interactive object according to the sequence of the posture control vector.
In combination with any embodiment provided by the present disclosure, the apparatus further includes a pause driving unit, configured to control the posture of the interactive object according to the set control parameter value of the local region when the time interval between the phonemes in the phoneme sequence is greater than a set threshold.
In combination with any embodiment provided by the present disclosure, when the second obtaining unit is configured to obtain the attitude control vector of at least one local area of the interactive object corresponding to the feature code, the second obtaining unit is specifically configured to: and inputting the feature codes into a recurrent neural network, and obtaining the attitude control vector of at least one local area of the interactive object corresponding to the feature codes.
In combination with any embodiment provided by the present disclosure, the recurrent neural network is trained using feature coding samples; the apparatus further comprises a sample acquisition unit configured to: acquire a video segment of a character uttering speech, and acquire a plurality of first image frames containing the character from the video segment; extract the corresponding speech segment from the video segment, obtain a sample phoneme sequence from the speech segment, and perform feature coding on the sample phoneme sequence; acquire the feature code of at least one phoneme corresponding to the first image frame; convert the first image frame into a second image frame containing the interactive object, and acquire the attitude control vector value of at least one local area corresponding to the second image frame; and label the feature code corresponding to the first image frame according to the attitude control vector value to obtain a feature coding sample.
In combination with any embodiment provided by the present disclosure, the apparatus further includes a training unit, configured to train an initial recurrent neural network according to the feature coding samples, and train to obtain the recurrent neural network after a change of a network loss satisfies a convergence condition, where the network loss includes a difference between an attitude control vector value of the at least one local region predicted by the recurrent neural network and an annotated attitude control vector value.
According to an aspect of the present disclosure, there is provided an electronic device, the device including a memory for storing computer instructions executable on a processor, and the processor being configured to implement a driving method of an interactive object according to any one of the embodiments provided in the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the driving method of an interactive object according to any one of the embodiments provided in the present disclosure.
The driving method, apparatus, device, and computer-readable storage medium for an interactive object according to one or more embodiments of the present disclosure control the posture of the interactive object by acquiring the phoneme sequence corresponding to text data and acquiring the control parameter value of at least one local region of the interactive object matching the phoneme sequence, so that the interactive object makes postures matching the phonemes contained in the text data, including facial postures and limb postures. This gives the target object the impression that the interactive object is speaking the text content, improving the interactive experience of the target object.
Drawings
In order to more clearly illustrate one or more embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of the present specification; other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a display device in a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 2 is a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 3 is a schematic diagram of a process for feature coding a phoneme sequence according to at least one embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the group consisting of A, B, and C.
At least one embodiment of the present disclosure provides a driving method for an interactive object, where the driving method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertisement machine, a kiosk, a vehicle-mounted terminal, and the like, and the server includes a local server or a cloud server, and the method may also be implemented by a way that a processor calls a computer-readable instruction stored in a memory.
In the embodiments of the present disclosure, the interactive object may be any virtual image capable of interacting with a target object, such as a virtual character, a virtual animal, a virtual article, a cartoon figure, or another virtual image capable of implementing an interactive function; the presentation form of the virtual image may be 2D or 3D, which is not limited in the present disclosure. The target object may be a user, a robot, or other intelligent device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object may issue a demand by making a gesture or a body movement, triggering the interactive object to interact with it, which is active interaction. In another example, the interactive object may actively greet the target object or prompt the target object to make an action, so that the target object interacts with the interactive object in a passive manner.
The interactive object may be displayed through a terminal device, and the terminal device may be a television, an all-in-one machine with a display function, a projector, a Virtual Reality (VR) device, an Augmented Reality (AR) device, or the like.
Fig. 1 illustrates a display device proposed by at least one embodiment of the present disclosure. As shown in fig. 1, the display device has a transparent display screen on which a stereoscopic picture can be displayed, presenting a virtual scene with a stereoscopic effect and an interactive object. For example, the interactive object displayed on the transparent display screen in fig. 1 is a virtual cartoon character. In some embodiments, the terminal device described in the present disclosure may also be the above display device with a transparent display screen; the display device is configured with a memory and a processor, the memory is used to store computer instructions executable on the processor, and the processor is used to implement the driving method for the interactive object provided in the present disclosure when executing the computer instructions, so as to drive the interactive object displayed on the transparent display screen to respond to the target object.
In some embodiments, the interactive object may emit a specified voice to the target object in response to the terminal device receiving sound driving data for driving the interactive object to output speech. The sound driving data may be generated according to the action, expression, identity, preference, and the like of the target object around the terminal device, so that the interactive object responds by emitting the specified voice, thereby providing an anthropomorphic service for the target object. However, while the interactive object is driven to emit the specified voice according to the sound driving data, it may not be driven to make facial actions synchronized with that voice, so the interactive object appears stiff and unnatural when speaking, which affects the target object's interactive experience. Based on this, at least one embodiment of the present disclosure provides a driving method for an interactive object, to improve the experience of interaction between the target object and the interactive object.
Fig. 2 shows a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure, and as shown in fig. 2, the method includes steps 201 to 203.
In step 201, a phoneme sequence corresponding to the text data is obtained.
The text data may be driving data for driving the interactive object. The driving data may be driving data generated according to an action, an expression, an identity, a preference, and the like of a target object interacting with the interaction object, or driving data called by the terminal device from an internal memory. The present disclosure does not limit the manner of acquiring the text data.
In the embodiments of the present disclosure, the phonemes contained in the morphemes of a text may be obtained according to the morphemes contained in the text, so as to obtain the phoneme sequence corresponding to the text. A phoneme is the smallest speech unit divided according to the natural attributes of speech; one pronunciation action of a real person forms one phoneme.
In response to the text being a Chinese text, the phoneme sequence may be generated using pinyin: the Chinese characters of the text are converted to pinyin, and a time stamp is generated for each phoneme.
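As an illustration of this step, the following is a minimal sketch (not part of this disclosure) of converting Chinese text into a timestamped phoneme sequence. It assumes the third-party pypinyin library for the pinyin conversion and assumes per-syllable durations are supplied externally (e.g. by a TTS front end or forced alignment); the data class and the even split of a syllable's duration between its phonemes are illustrative choices.

```python
from dataclasses import dataclass
from pypinyin import lazy_pinyin, Style  # third-party pinyin converter (assumed available)

@dataclass
class TimedPhoneme:
    phoneme: str   # e.g. "j" or "ie4"
    start: float   # start time in seconds
    end: float     # end time in seconds

def text_to_phonemes(text: str, syllable_durations: list[float]) -> list[TimedPhoneme]:
    """Convert Chinese text into a timestamped phoneme sequence.

    Each character's pinyin is split into an initial and a tone-marked final;
    the per-syllable durations (assumed to come from a TTS front end or forced
    alignment) are split evenly between the phonemes of a syllable.
    """
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)
    phonemes, t = [], 0.0
    for ini, fin, dur in zip(initials, finals, syllable_durations):
        parts = [p for p in (ini, fin) if p]      # some syllables have no initial
        step = dur / len(parts)
        for p in parts:
            phonemes.append(TimedPhoneme(p, t, t + step))
            t += step
    return phonemes
```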
In step 202, control parameter values for at least one local region of the interactive object matching the phoneme sequence are obtained.
The local regions are obtained by dividing the whole body (including the face and/or the body) of the interactive object. A local region of the face may correspond to a series of facial expressions or actions of the interactive object; for example, the eye region may correspond to facial actions such as opening the eyes, closing the eyes, blinking, or changing the viewing angle; for another example, the mouth region may correspond to facial actions such as closing the mouth or opening the mouth to different degrees. A local region of the body may correspond to a series of limb actions of the interactive object; for example, the leg region may correspond to actions such as walking, jumping, or kicking.
The control parameters of a local region of the interactive object include the attitude control vector of that local region. The attitude control vector of each local region is used to drive the local region of the interactive object to perform an action. Different attitude control vector values correspond to different actions or different action amplitudes. For example, for the attitude control vectors of the mouth region, one set of attitude control vector values may make the mouth of the interactive object open slightly, while another set may make the mouth open wide. By driving the interactive object with different attitude control vector values, the corresponding local regions can be made to perform different actions or actions of different amplitudes.
The local area can be selected according to the action of the interactive object to be controlled, for example, when the face and the limbs of the interactive object need to be controlled to simultaneously act, the posture control vectors of all the local areas can be obtained; when the expression of the interactive object needs to be controlled, the gesture control vector of the local area corresponding to the face can be acquired.
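To make the notion of per-region control parameters concrete, the following sketch shows one possible way to organize attitude control vectors by local region; the region names and vector dimensions are assumptions for illustration, not values given in this disclosure.

```python
import numpy as np

# Illustrative local regions and control-vector dimensions (assumed, not specified here).
POSE_VECTOR_DIMS = {
    "mouth": 6,   # e.g. degree of opening, lip shape
    "eyes": 4,    # e.g. openness, gaze direction
    "legs": 8,    # e.g. joint angles for walking, jumping, kicking
}

def default_pose() -> dict[str, np.ndarray]:
    """A neutral pose: one attitude control vector per local region, all zeros."""
    return {region: np.zeros(dim, dtype=np.float32) for region, dim in POSE_VECTOR_DIMS.items()}

# When only the expression is to be controlled, select just the facial regions:
face_pose = {k: v for k, v in default_pose().items() if k in ("mouth", "eyes")}
```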
In the embodiment of the present disclosure, the phoneme sequence may be subjected to feature coding, and a pose parameter value corresponding to the feature coding may be determined, so as to determine a pose parameter value corresponding to the phoneme sequence. Different coding modes may embody different characteristics of the phoneme sequence. This disclosure is not limited to a particular encoding scheme.
In the embodiments of the present disclosure, a correspondence between the feature codes of the phoneme sequence corresponding to text data and the control parameter values of the interactive object may be established in advance, so that when the text driving data is obtained, the corresponding control parameter values can be obtained. A specific method of acquiring the control parameter values matching the feature codes of the phoneme sequence of the text data is described in detail later.
In step 203, the posture of the interactive object is controlled according to the acquired control parameter value.
Wherein the control parameter value, such as a pose control vector value, is matched to a phoneme sequence contained in the text data. For example, when a display device showing the interactive object is controlled to show text according to the text data and/or the display device is controlled to output voice according to a phoneme sequence corresponding to the text data, the gesture made by the interactive object is synchronous with the output voice and/or the shown text, so that the target object can be provided with the feeling that the interactive object is speaking.
In the embodiments of the present disclosure, the posture of the interactive object is controlled by acquiring the phoneme sequence corresponding to the text data and acquiring the control parameter value of at least one local region of the interactive object matching the phoneme sequence, so that the interactive object makes postures matching the phonemes contained in the text data, including facial postures and limb postures. This gives the target object the impression that the interactive object is speaking the text content, improving the interactive experience of the target object.
In some embodiments, the method is applied to a server, including a local server or a cloud server. The server processes the text data, generates the posture parameter values of the interactive object, and performs rendering with a three-dimensional rendering engine according to the posture parameter values to obtain a response animation of the interactive object. The server may send the response animation to the terminal for display in response to the target object, or send it to the cloud so that the terminal can obtain the response animation from the cloud to respond to the target object. Alternatively, after the server generates the posture parameter values of the interactive object, it may send them to the terminal so that the terminal completes rendering, generating the response animation, and displaying.
In some embodiments, the method is applied to a terminal, the terminal processes text data, generates a posture parameter value of the interactive object, and performs rendering by using a three-dimensional rendering engine according to the posture parameter value to obtain a response animation of the interactive object, and the terminal can display the response animation to respond to a target object.
In some embodiments, a display device displaying the interactive object may be controlled to display text according to the text data, and/or the display device may be controlled to output voice according to a phoneme sequence corresponding to the text data.
In the embodiment of the present disclosure, since the value of the posture parameter is matched to the phoneme sequence of the text data, in the case where the output of the voice and/or text from the text data is synchronized with the control of the posture of the interactive object according to the value of the posture parameter, the posture made by the interactive object is synchronized with the output of the voice and/or text, giving the target object a feeling that the interactive object is speaking.
In some embodiments, in case the control parameter of the at least one local region of the interaction object comprises an attitude control vector, the attitude control vector may be obtained in the following way.
Firstly, carrying out feature coding on the phoneme sequence to obtain a coding sequence corresponding to the phoneme sequence. Here, in order to distinguish from the coding sequence mentioned later, the coding sequence corresponding to the phoneme sequence of the text data is referred to as a first coding sequence.
A sub-coding sequence corresponding to each phoneme is generated for the phonemes contained in the phoneme sequence.
In one example, it is detected whether the first phoneme corresponds to each time point, wherein the first phoneme is any one of the phonemes; the coding value at each time point with the first phoneme is set to a first numerical value, and the coding value at each time point without the first phoneme is set to a second numerical value; after the coding values at the respective time points are assigned, the sub-coding sequence corresponding to the first phoneme is obtained. For example, the coding value at a time point with the first phoneme may be set to 1, and the coding value at a time point without the first phoneme may be set to 0. It will be understood by those skilled in the art that the above setting of the coding values is only an example; other values may be set, which is not limited in the present disclosure.
And then, obtaining a first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences corresponding to the multiple phonemes respectively.
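The binary sub-coding described above can be sketched as follows; this is an illustrative implementation that assumes the timestamped phonemes from the earlier sketch and a sampling rate of 100 time points per second (an assumption, not a value from this disclosure).

```python
import numpy as np

def build_first_coding_sequence(phonemes, fps: float = 100.0):
    """Build one binary sub-coding sequence per distinct phoneme.

    At each time point, a phoneme's row takes the first numerical value (1.0)
    if that phoneme is being uttered and the second numerical value (0.0)
    otherwise; the stacked rows form the first coding sequence.
    """
    total = max(p.end for p in phonemes)
    n_steps = int(np.ceil(total * fps))
    labels = sorted({p.phoneme for p in phonemes})
    codes = np.zeros((len(labels), n_steps), dtype=np.float32)   # second value everywhere
    for p in phonemes:
        row = labels.index(p.phoneme)
        codes[row, int(p.start * fps):int(p.end * fps)] = 1.0    # first value while uttered
    return labels, codes   # codes: the first coding sequence, one sub-sequence per row
```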
In one example, for the sub-coding sequence corresponding to the first phoneme, a Gaussian filter may be used to perform a Gaussian convolution operation on the temporally continuous values of the first phoneme, filtering the matrix corresponding to the feature codes so as to smooth the transition actions of the mouth region at each phoneme changeover.
Fig. 3 illustrates a schematic diagram of a driving method of an interactive object according to at least one embodiment of the present disclosure. As shown in fig. 3, the phoneme sequence 310 includes phonemes j, i1, j, and ie4 (for simplicity, only a part of the phonemes are shown), and sub-coding sequences 321, 322, and 323 corresponding to the phonemes j, i1, and ie4 are obtained for each phoneme. In each sub-coding sequence, the coding value corresponding to the time with the phoneme is a first numerical value (for example, 1), and the coding value corresponding to the time without the phoneme is a second numerical value (for example, 0). Taking the sub-coding sequence 321 as an example, at the time of phoneme j in the phoneme sequence 310, the value of the sub-coding sequence 321 is the first numerical value, and at the time of no phoneme j, the value of the sub-coding sequence 321 is the second numerical value. All the sub-coding sequences constitute a first coding sequence 320.
And then, acquiring a feature code corresponding to at least one phoneme according to the first coding sequence.
The feature information of the sub-coding sequences 321, 322, 323 can be obtained according to the coding values of the sub-coding sequences 321, 322, 323 corresponding to the phonemes j, i1, ie4, respectively, and the durations of the phonemes corresponding to the three sub-coding sequences, i.e., the duration of j in the sub-coding sequence 321, the duration of i1 in the sub-coding sequence 322, and the duration of ie4 in the sub-coding sequence 323.
In one example, a Gaussian filter may be used to perform a Gaussian convolution operation on the temporally continuous values of the phonemes j, i1, ie4 in the respective sub-coding sequences 321, 322, 323 to smooth the feature codes, resulting in the smoothed first coding sequence 330. That is, the Gaussian filter performs a Gaussian convolution on the temporally continuous 0-to-1 values of a phoneme, so that the stages in each sub-coding sequence where the coding value changes from the second value to the first value, or from the first value to the second value, become smooth. The values of the coding sequence then also take intermediate values besides 0 and 1, such as 0.2 or 0.3; the attitude control vectors obtained from these intermediate values make the transition actions and expression changes of the interactive character more gradual and natural, improving the interactive experience of the target object.
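This smoothing step can be sketched with a standard 1-D Gaussian filter along the time axis, as below; the sigma value is an illustrative assumption.

```python
from scipy.ndimage import gaussian_filter1d

def smooth_coding_sequence(codes, sigma: float = 3.0):
    """Gaussian convolution along time for each sub-coding sequence, so that
    0-to-1 and 1-to-0 transitions become gradual intermediate values
    (e.g. 0.2, 0.3) rather than hard steps."""
    return gaussian_filter1d(codes, sigma=sigma, axis=-1)
```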
In some embodiments, the feature codes corresponding to at least one phoneme may be obtained by performing a sliding window on the first coding sequence. Wherein the first coding sequence may be a coding sequence after a gaussian convolution operation.
The first coding sequence is slid over with a time window of a set length at a set step; the feature codes within the time window are taken as the feature codes of the corresponding at least one phoneme, and after the sliding is completed, a second coding sequence is obtained from the plurality of feature codes thus obtained. As shown in fig. 3, by sliding a time window of a set length over the first coding sequence 320 or the smoothed first coding sequence 330, feature code 1, feature code 2, feature code 3, and so on are obtained; after traversing the first coding sequence, feature codes 1, 2, 3, …, M are obtained, and thus the second coding sequence 340 is obtained, where M is a positive integer whose value is determined by the length of the first coding sequence, the length of the time window, and the step of sliding the time window.
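The sliding window can be sketched as follows; the window length and step are expressed in time points, and both values are illustrative assumptions to be tuned as discussed later.

```python
import numpy as np

def sliding_window_features(codes, window: int = 10, step: int = 2):
    """Slide a fixed-length time window over the (smoothed) first coding sequence.

    Each window position yields one feature code; the M feature codes together
    form the second coding sequence, where M depends on the sequence length,
    the window length, and the step, as described above.
    """
    n_steps = codes.shape[-1]
    features = [codes[:, start:start + window]
                for start in range(0, n_steps - window + 1, step)]
    return np.stack(features)   # shape: (M, number of phonemes, window)
```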
From the feature encodings 1, 2, 3, …, M, the corresponding attitude control vectors 1, 2, 3, …, M, respectively, may be obtained, thereby obtaining a sequence of attitude control vectors 350.
The sequence of attitude control vectors 350 is temporally aligned with the second coding sequence 340; since each feature code in the second coding sequence is obtained from at least one phoneme in the phoneme sequence, each control vector in the sequence 350 is likewise associated with at least one phoneme. When the phoneme sequence corresponding to the text data is played, the interactive object is driven to act according to the sequence of attitude control vectors; that is, while the interactive object is driven to emit the sound corresponding to the text content, it simultaneously makes actions synchronized with the sound, giving the target object the feeling that the interactive object is speaking and improving the interactive experience of the target object.
Assuming that the feature codes start to be output at a set time within the first time window, the attitude control vector before that set time may be set to a default value; that is, the interactive object is made to perform a default action when the phoneme sequence starts to be played, and after the set time the interactive object is driven using the sequence of attitude control vectors obtained from the first coding sequence. Taking fig. 3 as an example, the output of feature code 1 starts at time t0, and the default attitude control vector is used before time t0.
The duration of the time window is related to the amount of information contained in each feature code. The more information the time window contains, the smoother the result output by the recurrent neural network. However, if the time window is too long, the expression of the interactive object may fail to correspond to some of the characters being spoken; if it is too short, the interactive object may appear stiff when speaking. Therefore, the duration of the time window should be determined according to the minimum duration of the phonemes corresponding to the text data, so that the actions driving the interactive object are strongly correlated with the sound.
The step size for performing the sliding window is related to the time interval (frequency) for obtaining the attitude control vector, i.e. the frequency for driving the interactive object to make the motion. The time length and the step length of the time window can be set according to the actual interactive scene, so that the relevance between the expression and the action of the interactive object and the sound is stronger, and the interactive object is more vivid and natural.
In some embodiments, in the case that the time interval between phonemes in the phoneme sequence is greater than a set threshold, the interactive object is driven to perform an action according to a set attitude control vector of the local region. That is, when there is a long pause, the interactive object is driven to make a preset action. For example, during a long pause in the output sound, the interactive character can be made to show a slight smile or sway its body gently, so that it does not stand expressionless and motionless throughout a long pause; this makes the speaking process of the interactive object natural and smooth and improves the target object's sense of interaction.
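The pause handling can be sketched as follows: when the gap between consecutive phonemes exceeds a threshold, a preset idle pose is used instead of the predicted one. The threshold value and the idle pose itself are assumptions.

```python
def pose_for_time(t, phonemes, predicted_pose, idle_pose, gap_threshold: float = 1.0):
    """Return the predicted pose during speech, or the preset idle pose when
    time t falls inside a pause longer than gap_threshold seconds."""
    for prev, nxt in zip(phonemes, phonemes[1:]):
        if prev.end <= t < nxt.start and (nxt.start - prev.end) > gap_threshold:
            return idle_pose   # e.g. slight smile or gentle body sway
    return predicted_pose
```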
The recurrent neural network may be, for example, a Long Short-Term Memory (LSTM) network. Because a recurrent neural network is a time-recursive neural network that can learn the historical information of the input feature codes, the attitude control vector of the at least one local region can be output from the sequence of feature codes.
In the embodiment of the disclosure, a pre-trained recurrent neural network is used to obtain the attitude control vector of at least one local area of the interactive object corresponding to the coding features, and the historical feature information and the current feature information of the coding features are fused, so that the historical attitude control vector influences the change of the current attitude control vector, and the expression change and the limb action of the interactive character are more gradual and natural.
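A minimal PyTorch sketch of such a recurrent mapping is shown below; the hidden size and the output dimension of the attitude control vector are illustrative assumptions, not values from this disclosure.

```python
import torch
import torch.nn as nn

class PoseLSTM(nn.Module):
    """Maps a sequence of feature codes to a sequence of attitude control vectors."""

    def __init__(self, feature_dim: int, hidden_dim: int = 256, pose_dim: int = 18):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, pose_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, M, feature_dim) -- flattened feature codes from the sliding window
        hidden, _ = self.lstm(features)   # fuses historical and current feature information
        return self.head(hidden)          # (batch, M, pose_dim) attitude control vectors
```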
In some embodiments, the recurrent neural network may be trained in the following manner.
First, feature coding samples are obtained; each sample is labeled with a true value, which is the attitude control vector value of at least one local region of the interactive object.
After the feature coding samples are obtained, an initial recurrent neural network is trained according to the feature coding samples, and the recurrent neural network is obtained after the change of the network loss satisfies a convergence condition, wherein the network loss includes the difference between the attitude control vector value of the at least one local region predicted by the initial recurrent neural network and the true value.
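A sketch of the corresponding training loop is given below, using mean-squared error as the difference between the predicted and labeled attitude control vector values; the choice of optimizer, learning rate, and epoch count are assumptions.

```python
import torch

def train(model, loader, epochs: int = 50, lr: float = 1e-3):
    """Train the recurrent network on (feature code, labeled pose vector) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()   # difference between predicted and labeled pose vectors
    for _ in range(epochs):
        for features, target_poses in loader:
            pred = model(features)
            loss = loss_fn(pred, target_poses)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```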
In some embodiments, the feature encoding samples may be obtained by the following method.
First, a video segment of a character uttering speech is acquired, and a plurality of first image frames containing the character are obtained from the video segment. For example, a video segment in which a real person is speaking may be acquired.
Then, extracting a corresponding voice segment from the video segment, obtaining a sample phoneme sequence according to the voice segment, and performing feature coding on the sample phoneme sequence. The sample phoneme sequence is encoded in the same manner as the phoneme sequence corresponding to the text data.
And acquiring the characteristic code of at least one phoneme corresponding to the first image frame according to the sample coding sequence obtained by characteristic coding the sample phoneme sequence. Wherein the at least one phoneme may be a phoneme within a set range of the first image frame occurrence time.
Then, the first image frame is converted into a second image frame containing the interactive object, and the attitude control vector value of at least one local region corresponding to the second image frame is acquired. The attitude control vector values may be obtained for all of the local regions, or for only some of them.
Taking the first image frame as an image frame containing a real person as an example, the real person may be converted into a second image frame containing the avatar represented by the interactive object, so that the attitude control vectors of the respective local regions of the real person correspond to those of the interactive object.
And finally, labeling the obtained feature code of at least one phoneme corresponding to the first image frame according to the attitude control vector value to obtain a feature code sample.
In the embodiments of the present disclosure, a video segment of a character is split into corresponding first image frames and a speech segment, and the first image frames containing the real character are converted into second image frames containing the interactive object to obtain the attitude control vectors corresponding to the feature codes of the phonemes. In this way, the feature codes correspond well to the attitude control vectors, yielding high-quality feature coding samples, so that the actions of the interactive object are closer to the real actions of the corresponding character.
Fig. 4 illustrates a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure, and as shown in fig. 4, the apparatus may include: a first obtaining unit 401, configured to obtain a phoneme sequence corresponding to text data; a second obtaining unit 402, configured to obtain a control parameter value of at least one local region of an interactive object matching the phoneme sequence; a driving unit 403, configured to control the posture of the interactive object according to the obtained control parameter value.
In some embodiments, the apparatus further includes an output unit, configured to control a display device displaying the interactive object to display text according to the text data, and/or control the display device to output speech according to a phoneme sequence corresponding to the text data.
In some embodiments, the second obtaining unit is specifically configured to: performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; acquiring a feature code corresponding to at least one phoneme according to the first coding sequence; and acquiring the attitude control vector of at least one local area of the interactive object corresponding to the feature code.
In some embodiments, when the second obtaining unit is configured to perform feature coding on the phoneme sequence to obtain the first coding sequence corresponding to the phoneme sequence, the second obtaining unit is specifically configured to: aiming at a plurality of phonemes contained in the phoneme sequence, generating a sub-coding sequence corresponding to each phoneme; and obtaining a first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences corresponding to the multiple phonemes respectively.
In some embodiments, the second obtaining unit, when configured to generate a sub-coding sequence corresponding to each of the phonemes for a plurality of phonemes included in the phoneme sequence, is specifically configured to: detecting whether a first phoneme corresponds to each time point, wherein the first phoneme is any one of the multiple phonemes; and setting the coding value at the time point with the first phoneme as a first numerical value, and setting the coding value at the time without the first phoneme as a second numerical value to obtain a sub-coding sequence corresponding to the first phoneme.
In some embodiments, the apparatus further includes a filtering unit configured to perform a gaussian convolution operation on temporally consecutive values of a first phoneme with a gaussian filter for a sub-coding sequence corresponding to the first phoneme, where the first phoneme is any one of the plurality of phonemes.
In some embodiments, when the second obtaining unit is configured to obtain the feature code corresponding to the at least one phoneme according to the first coding sequence, the second obtaining unit is specifically configured to: and sliding the coding sequence by a time window with a set length and a set step length, taking the feature codes in the time window as the feature codes of the corresponding at least one phoneme, and obtaining a second coding sequence according to a plurality of feature codes obtained by completing the sliding window.
In some embodiments, the drive unit is specifically configured to: acquiring a sequence of attitude control vectors corresponding to the second coding sequence; and controlling the posture of the interactive object according to the sequence of the posture control vector.
In some embodiments, the apparatus further comprises a pause driving unit for controlling the posture of the interactive object according to the set control parameter value of the local region in case the time interval between the phonemes in the phoneme sequence is larger than a set threshold.
In some embodiments, the second obtaining unit, when configured to obtain the attitude control vector of the at least one local region of the interactive object corresponding to the feature code, is specifically configured to: and inputting the feature codes into a recurrent neural network, and obtaining the attitude control vector of at least one local area of the interactive object corresponding to the feature codes.
In some embodiments, the neural network is trained from a phoneme sequence sample; the apparatus further comprises a sample acquisition unit for: acquiring a video segment of a character sending voice, and acquiring a plurality of first image frames containing the character according to the video segment; extracting a corresponding voice segment from the video segment, acquiring a sample phoneme sequence according to the voice segment, and performing feature coding on the sample phoneme sequence; acquiring feature codes of at least one phoneme corresponding to the first image frame; converting the first image frame into a second image frame containing the interactive object, and acquiring a posture control vector value of at least one local area corresponding to the second image frame; and marking the feature code corresponding to the first image frame according to the attitude control vector value to obtain a feature code sample.
In some embodiments, the apparatus further includes a training unit configured to train an initial recurrent neural network according to the feature coding samples, and train the initial recurrent neural network after a change in a network loss satisfies a convergence condition, where the network loss includes a difference between the attitude control vector value of the at least one local region predicted by the recurrent neural network and the labeled attitude control vector value.
At least one embodiment of the present specification further provides an electronic device, as shown in fig. 5, where the device includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement the driving method of the interactive object according to any embodiment of the present disclosure when executing the computer instructions.
At least one embodiment of the present specification also provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the driving method of the interactive object according to any one of the embodiments of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, features described in the context of a single embodiment can also be implemented separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only of preferred embodiments of one or more embodiments of the present disclosure and is not intended to limit the scope of the present disclosure; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of one or more embodiments of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (20)

1. A method of driving an interactive object, the method comprising:
acquiring a phoneme sequence corresponding to the text data;
acquiring a control parameter value of at least one local area of an interactive object matched with the phoneme sequence;
and controlling the posture of the interactive object according to the acquired control parameter value.
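(For illustration only, not part of the claims: a minimal Python sketch of the method of claim 1. The toy lexicon PHONEME_DICT, the lookup table CONTROL_TABLE, and the printed "mouth-area" parameter are hypothetical stand-ins; a real system would use a grapheme-to-phoneme converter and learned control parameter values.)

```python
# A minimal sketch of the method of claim 1. PHONEME_DICT and CONTROL_TABLE
# are hypothetical stand-ins; a real system would use a grapheme-to-phoneme
# converter and learned control parameter values.
PHONEME_DICT = {"hello": ["HH", "AH", "L", "OW"]}            # assumed lexicon
CONTROL_TABLE = {"HH": 0.1, "AH": 0.6, "L": 0.3, "OW": 0.8}  # assumed mouth-openness values


def drive_interactive_object(text_data: str) -> None:
    # 1. acquire the phoneme sequence corresponding to the text data
    phonemes = [p for word in text_data.lower().split()
                for p in PHONEME_DICT.get(word, [])]
    # 2. acquire control parameter values of at least one local area
    #    of the interactive object matching the phoneme sequence
    control_values = [CONTROL_TABLE.get(p, 0.0) for p in phonemes]
    # 3. control the posture of the interactive object according to the values
    for value in control_values:
        print(f"set mouth-area control parameter to {value}")


drive_interactive_object("hello")
```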
2. The method of claim 1, further comprising: controlling a display device that displays the interactive object to display text according to the text data, and/or controlling the display device to output speech according to the phoneme sequence corresponding to the text data.
3. The method according to claim 1 or 2, wherein the control parameters of the local area of the interactive object comprise a pose control vector of the local area, and the acquiring a control parameter value of at least one local area of the interactive object matching the phoneme sequence comprises:
performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence;
acquiring a feature code corresponding to at least one phoneme according to the first coding sequence;
and acquiring a pose control vector of at least one local area of the interactive object corresponding to the feature code.
4. The method according to claim 3, wherein said feature coding the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence comprises:
for a plurality of phonemes contained in the phoneme sequence, generating a sub-coding sequence corresponding to each phoneme;
and obtaining the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the plurality of phonemes.
5. The method according to claim 4, wherein the generating a sub-coding sequence corresponding to each phoneme, for a plurality of phonemes contained in the phoneme sequence, comprises:
detecting, for each time point, whether a first phoneme corresponds to that time point, wherein the first phoneme is any one of the plurality of phonemes;
and setting the coding value at time points where the first phoneme is present to a first numerical value, and setting the coding value at time points where the first phoneme is not present to a second numerical value, to obtain the sub-coding sequence corresponding to the first phoneme.
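(For illustration only: a rough sketch of the sub-coding described in claim 5. The per-time-point phoneme timeline and the choice of 1.0/0.0 for the first and second numerical values are assumptions.)

```python
import numpy as np


def phoneme_sub_codes(timeline, first_value=1.0, second_value=0.0):
    """For each distinct phoneme in `timeline` (one label per time point),
    build a sub-coding sequence that equals `first_value` at time points
    where the phoneme is present and `second_value` elsewhere."""
    return {
        phoneme: np.array([first_value if p == phoneme else second_value
                           for p in timeline])
        for phoneme in set(timeline)
    }


# Toy timeline: one phoneme label per time point (e.g. one per audio frame)
timeline = ["sil", "n", "n", "i", "i", "i", "h", "h", "ao", "ao", "sil"]
sub_codes = phoneme_sub_codes(timeline)
print(sub_codes["i"])   # [0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
```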
6. The method of claim 5, further comprising:
and performing, with a Gaussian filter, a Gaussian convolution operation on the temporally consecutive values of the first phoneme in the sub-coding sequence corresponding to the first phoneme, wherein the first phoneme is any one of the plurality of phonemes.
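(For illustration only: one way to realise the Gaussian convolution of claim 6 is a 1-D Gaussian filter along the time axis of a sub-coding sequence, which softens the hard 0/1 transitions. The sigma value below is an arbitrary assumption.)

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Sub-coding sequence of one phoneme (hard 0/1 values over time points)
sub_code = np.array([0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.])

# Gaussian convolution along time smooths the transitions between values;
# sigma=1.0 is an arbitrary illustrative choice.
smoothed = gaussian_filter1d(sub_code, sigma=1.0)
print(smoothed.round(2))
```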
7. The method according to any one of claims 3 to 6, wherein the obtaining of the feature code corresponding to at least one phoneme according to the first coding sequence comprises:
sliding a time window of a set length over the first coding sequence at a set step size, taking the feature codes within the time window as the feature codes of the at least one corresponding phoneme, and obtaining a second coding sequence from the plurality of feature codes obtained after the sliding is completed;
the controlling the posture of the interactive object according to the acquired control parameter value comprises:
acquiring a sequence of pose control vectors corresponding to the second coding sequence;
and controlling the posture of the interactive object according to the sequence of pose control vectors.
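(For illustration only: a sketch of the sliding-window step of claim 7. The first coding sequence is assumed to be a matrix with one row per phoneme and one column per time point; the window length and step size below are arbitrary assumptions.)

```python
import numpy as np


def second_coding_sequence(first_coding_sequence, window_length, step):
    """Slide a time window of `window_length` frames over the first coding
    sequence (shape: num_phonemes x num_time_points) at stride `step`; each
    window is taken as the feature code of the phoneme(s) it covers, and the
    collected windows form the second coding sequence."""
    num_time_points = first_coding_sequence.shape[1]
    return [first_coding_sequence[:, start:start + window_length]
            for start in range(0, num_time_points - window_length + 1, step)]


# Toy first coding sequence: 3 phonemes over 10 time points
first_seq = np.random.rand(3, 10)
feature_codes = second_coding_sequence(first_seq, window_length=4, step=2)
print(len(feature_codes), feature_codes[0].shape)   # 4 windows, each of shape (3, 4)
```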
8. The method according to any one of claims 1 to 7, further comprising:
and in a case where a time interval between adjacent phonemes in the phoneme sequence is greater than a set threshold, controlling the posture of the interactive object according to preset control parameter values of the local area.
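(For illustration only: claim 8 amounts to a simple fallback rule; the 0.5 s threshold and the preset values in the sketch are assumptions.)

```python
def choose_control_values(gap_seconds, matched_values, preset_values, threshold=0.5):
    """If the time interval between adjacent phonemes exceeds the threshold,
    drive the local area with preset (idle) control parameter values;
    otherwise use the values matched to the phoneme sequence."""
    return preset_values if gap_seconds > threshold else matched_values


print(choose_control_values(0.8, matched_values=[0.6], preset_values=[0.0]))  # -> [0.0]
```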
9. The method of claim 3, wherein the acquiring the pose control vector of the at least one local area of the interactive object corresponding to the feature code comprises:
inputting the feature code into a recurrent neural network, and obtaining the pose control vector of at least one local area of the interactive object corresponding to the feature code.
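(For illustration only: one possible realisation of the recurrent network of claim 9, sketched in PyTorch. The GRU layer, the 64-dimensional feature codes and the 10-dimensional pose control vectors are assumptions, not the claimed architecture.)

```python
import torch
from torch import nn


class PoseControlRNN(nn.Module):
    """Maps a sequence of feature codes to pose control vectors of at least
    one local area; layer types and sizes are illustrative assumptions."""

    def __init__(self, feature_dim=64, hidden_dim=128, control_dim=10):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, control_dim)

    def forward(self, feature_codes):             # (batch, time, feature_dim)
        hidden, _ = self.rnn(feature_codes)
        return self.head(hidden)                  # (batch, time, control_dim)


model = PoseControlRNN()
pose_vectors = model(torch.randn(1, 20, 64))      # 20 feature codes in, 20 pose vectors out
print(pose_vectors.shape)                         # torch.Size([1, 20, 10])
```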
10. The method of claim 9, wherein the recurrent neural network is obtained by training with feature code samples;
the method further comprises:
acquiring a video segment of a character uttering speech, and acquiring a plurality of first image frames containing the character from the video segment;
extracting a corresponding speech segment from the video segment, acquiring a sample phoneme sequence according to the speech segment, and performing feature coding on the sample phoneme sequence;
acquiring feature codes of at least one phoneme corresponding to the first image frame;
converting the first image frame into a second image frame containing the interactive object, and acquiring a pose control vector value of at least one local area corresponding to the second image frame;
and labeling the feature code corresponding to the first image frame according to the pose control vector value, to obtain a feature code sample.
11. The method of claim 10, further comprising:
training an initial recurrent neural network according to the feature code samples, and obtaining the recurrent neural network after the change of the network loss satisfies a convergence condition, wherein the network loss comprises a difference between the pose control vector value of the at least one local area predicted by the recurrent neural network and the labeled pose control vector value.
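(For illustration only: a rough training loop in the spirit of claim 11, reusing the PoseControlRNN sketch above. The mean-squared-error loss, the Adam optimizer, the fixed epoch count and the random toy samples are assumptions; in practice training would stop when the change in the network loss satisfies a convergence condition.)

```python
import torch
from torch import nn

# Reuses the PoseControlRNN sketch above; random tensors stand in for
# feature code samples labelled with pose control vector values.
model = PoseControlRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # difference between predicted and labelled vectors

feature_code_samples = [(torch.randn(4, 20, 64), torch.randn(4, 20, 10))
                        for _ in range(3)]        # (feature codes, labelled pose vectors)

for epoch in range(5):                            # fixed count for illustration only
    for feature_codes, labelled_vectors in feature_code_samples:
        predicted = model(feature_codes)
        loss = loss_fn(predicted, labelled_vectors)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```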
12. An apparatus for driving an interactive object, the apparatus comprising:
a first obtaining unit, configured to obtain a phoneme sequence corresponding to text data;
a second obtaining unit, configured to obtain a control parameter value of at least one local area of the interactive object matching the phoneme sequence;
and a driving unit, configured to control the posture of the interactive object according to the acquired control parameter value.
13. The apparatus according to claim 12, further comprising an output unit configured to control a display device displaying the interactive object to display text according to the text data and/or control the display device to output speech according to a phoneme sequence corresponding to the text data.
14. The apparatus according to claim 12 or 13, wherein the second obtaining unit is specifically configured to:
performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence;
acquiring a feature code corresponding to at least one phoneme according to the first coding sequence;
acquiring a pose control vector of at least one local area of the interactive object corresponding to the feature code;
wherein the performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence comprises:
for a plurality of phonemes contained in the phoneme sequence, generating a sub-coding sequence corresponding to each phoneme;
and obtaining the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences respectively corresponding to the plurality of phonemes.
15. The apparatus according to claim 14, wherein the second obtaining unit, when configured to obtain the feature code corresponding to the at least one phoneme according to the first coding sequence, is specifically configured to:
sliding a time window of a set length over the first coding sequence at a set step size, taking the feature codes within the time window as the feature codes of the at least one corresponding phoneme, and obtaining a second coding sequence from the plurality of feature codes obtained after the sliding is completed;
the drive unit is specifically configured to:
acquiring a sequence of pose control vectors corresponding to the second coding sequence;
and controlling the posture of the interactive object according to the sequence of pose control vectors.
16. The apparatus according to any one of claims 12 to 15, further comprising a pause driving unit, configured to control the posture of the interactive object according to preset control parameter values of the local area in a case where a time interval between adjacent phonemes in the phoneme sequence is greater than a set threshold.
17. The apparatus according to claim 14, wherein the second obtaining unit, when configured to obtain the pose control vector of the at least one local area of the interactive object corresponding to the feature code, is specifically configured to: input the feature code into a recurrent neural network, and obtain the pose control vector of at least one local area of the interactive object corresponding to the feature code.
18. The apparatus of claim 17, further comprising a sample acquisition unit for:
acquiring a video segment of a character uttering speech, and acquiring a plurality of first image frames containing the character from the video segment;
extracting a corresponding speech segment from the video segment, acquiring a sample phoneme sequence according to the speech segment, and performing feature coding on the sample phoneme sequence;
acquiring feature codes of at least one phoneme corresponding to the first image frame;
converting the first image frame into a second image frame containing the interactive object, and acquiring a pose control vector value of at least one local area corresponding to the second image frame;
labeling the feature code corresponding to the first image frame according to the pose control vector value, to obtain a feature code sample;
the apparatus further comprises a training unit, configured to train an initial recurrent neural network according to the feature code samples, and obtain the recurrent neural network after the change of the network loss satisfies a convergence condition, wherein the network loss comprises a difference between the pose control vector value of the at least one local area predicted by the recurrent neural network and the labeled pose control vector value.
19. An electronic device, comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of any one of claims 1 to 11 when executing the computer instructions.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 11.
CN202010245802.4A 2020-03-31 2020-03-31 Method, device and equipment for driving interactive object and storage medium Active CN111460785B (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
CN202010245802.4A CN111460785B (en) 2020-03-31 2020-03-31 Method, device and equipment for driving interactive object and storage medium
PCT/CN2020/129793 WO2021196644A1 (en) 2020-03-31 2020-11-18 Method, apparatus and device for driving interactive object, and storage medium
SG11202111909QA SG11202111909QA (en) 2020-03-31 2020-11-18 Methods and apparatuses for driving an interactive object, devices and storage media
KR1020217027692A KR20210124307A (en) 2020-03-31 2020-11-18 Interactive object driving method, apparatus, device and recording medium
JP2021549562A JP2022530935A (en) 2020-03-31 2020-11-18 Interactive target drive methods, devices, devices, and recording media
TW109144447A TW202138992A (en) 2020-03-31 2020-12-16 Method and apparatus for driving interactive object, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010245802.4A CN111460785B (en) 2020-03-31 2020-03-31 Method, device and equipment for driving interactive object and storage medium

Publications (2)

Publication Number Publication Date
CN111460785A true CN111460785A (en) 2020-07-28
CN111460785B CN111460785B (en) 2023-02-28

Family

ID=71683475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010245802.4A Active CN111460785B (en) 2020-03-31 2020-03-31 Method, device and equipment for driving interactive object and storage medium

Country Status (6)

Country Link
JP (1) JP2022530935A (en)
KR (1) KR20210124307A (en)
CN (1) CN111460785B (en)
SG (1) SG11202111909QA (en)
TW (1) TW202138992A (en)
WO (1) WO2021196644A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021196643A1 (en) * 2020-03-31 2021-10-07 北京市商汤科技开发有限公司 Method and apparatus for driving interactive object, device, and storage medium
WO2021196644A1 (en) * 2020-03-31 2021-10-07 北京市商汤科技开发有限公司 Method, apparatus and device for driving interactive object, and storage medium
CN115662388A (en) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Avatar face driving method, apparatus, electronic device and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115409920A (en) * 2022-08-30 2022-11-29 重庆爱车天下科技有限公司 Virtual object lip driving system
KR102601159B1 (en) * 2022-09-30 2023-11-13 주식회사 아리아스튜디오 Virtual human interaction generating device and method therof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609969A (en) * 2012-02-17 2012-07-25 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
US20150248161A1 (en) * 2014-03-03 2015-09-03 Sony Corporation Information processing apparatus, information processing system, information processing method, and program
CN106056989A (en) * 2016-06-23 2016-10-26 广东小天才科技有限公司 Language learning method and apparatus, and terminal device
CN107704169A (en) * 2017-09-26 2018-02-16 北京光年无限科技有限公司 The method of state management and system of visual human
CN107891626A (en) * 2017-11-07 2018-04-10 嘉善中奥复合材料有限公司 Urea-formaldehyde moulding powder compression molding system
CN110176284A (en) * 2019-05-21 2019-08-27 杭州师范大学 A kind of speech apraxia recovery training method based on virtual reality

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003058908A (en) * 2001-08-10 2003-02-28 Minolta Co Ltd Method and device for controlling face image, computer program and recording medium
JP2015038725A (en) * 2013-07-18 2015-02-26 国立大学法人北陸先端科学技術大学院大学 Utterance animation generation device, method, and program
JP5913394B2 (en) * 2014-02-06 2016-04-27 Psソリューションズ株式会社 Audio synchronization processing apparatus, audio synchronization processing program, audio synchronization processing method, and audio synchronization system
CN110876024B (en) * 2018-08-31 2021-02-12 百度在线网络技术(北京)有限公司 Method and device for determining lip action of avatar
CN109377540B (en) * 2018-09-30 2023-12-19 网易(杭州)网络有限公司 Method and device for synthesizing facial animation, storage medium, processor and terminal
CN110136698B (en) * 2019-04-11 2021-09-24 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for determining mouth shape
CN111145322B (en) * 2019-12-26 2024-01-19 上海浦东发展银行股份有限公司 Method, apparatus, and computer-readable storage medium for driving avatar
CN111459452B (en) * 2020-03-31 2023-07-18 北京市商汤科技开发有限公司 Driving method, device and equipment of interaction object and storage medium
CN111460785B (en) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 Method, device and equipment for driving interactive object and storage medium
CN111459454B (en) * 2020-03-31 2021-08-20 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium

Also Published As

Publication number Publication date
TW202138992A (en) 2021-10-16
JP2022530935A (en) 2022-07-05
CN111460785B (en) 2023-02-28
SG11202111909QA (en) 2021-11-29
WO2021196644A1 (en) 2021-10-07
KR20210124307A (en) 2021-10-14

Similar Documents

Publication Publication Date Title
CN111460785B (en) Method, device and equipment for driving interactive object and storage medium
CN111459454B (en) Interactive object driving method, device, equipment and storage medium
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
CN111459452B (en) Driving method, device and equipment of interaction object and storage medium
US20160110922A1 (en) Method and system for enhancing communication by using augmented reality
CN112528936B (en) Video sequence arrangement method, device, electronic equipment and storage medium
WO2022252890A1 (en) Interaction object driving and phoneme processing methods and apparatus, device and storage medium
RU2721180C1 (en) Method for generating an animation model of a head based on a speech signal and an electronic computing device which implements it
CN113689879A (en) Method, device, electronic equipment and medium for driving virtual human in real time
CN113689880A (en) Method, device, electronic equipment and medium for driving virtual human in real time
KR20210124306A (en) Interactive object driving method, apparatus, device and recording medium
CN112632262A (en) Conversation method, conversation device, computer equipment and storage medium
CN115550744B (en) Method and device for generating video by voice
JP2024013280A (en) Emotion determination device, model learning device, and their programs
CN116958328A (en) Method, device, equipment and storage medium for synthesizing mouth shape
CN117354584A (en) Virtual object driving method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40026468
Country of ref document: HK
GR01 Patent grant