CN110930481A - Method and system for predicting mouth shape control parameters


Info

Publication number
CN110930481A
Authority
CN
China
Prior art keywords
mouth shape
audio
shape control
control parameter
control parameters
Prior art date
Legal status
Granted
Application number
CN201911266594.XA
Other languages
Chinese (zh)
Other versions
CN110930481B (en)
Inventor
赵永驰
李步宇
渠思源
Current Assignee
Beijing Huiye Technology Co Ltd
Original Assignee
Beijing Huiye Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Huiye Technology Co Ltd
Priority to CN201911266594.XA
Publication of CN110930481A
Application granted
Publication of CN110930481B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiments of the present application disclose a method and a system for predicting mouth shape control parameters. The method includes: acquiring audio data; and determining a mouth shape control parameter based at least on the audio data and a machine learning model, wherein the mouth shape control parameter can reflect at least the mouth shape of an animated character corresponding to the audio data.

Description

Method and system for predicting mouth shape control parameters
Technical Field
The application relates to the field of intelligent voice analysis, in particular to a method and a system for predicting a mouth shape control parameter.
Background
In fields such as animation, film, and television, the mouth shape and/or facial expression of an animated character while speaking is an important aspect of presenting character behavior. To make an animated character more vivid when speaking, the mouth shape and/or facial expression is often determined by adjusting the character's mouth shape controllers based on the audio being spoken.
At present, animated character mouth shapes and/or facial expressions are generally produced by having animators adjust mouth shape controller parameters frame by frame according to information such as the character's audio content and tone, which is extremely inefficient and places high demands on the animator. Another approach is to capture the mouth shape and/or facial expression of a voice actor while dubbing the animated character with an expression capture device and then import the data into animation software for correction and adjustment by the animator; this improves production efficiency but increases production cost.
Disclosure of Invention
One embodiment of the present application provides a method for predicting a mouth shape control parameter, where the method includes: acquiring audio data; determining a mouth shape control parameter based on at least audio data and a machine learning model, wherein the mouth shape control parameter is capable of reflecting at least a mouth shape of an animated character corresponding to the audio data.
One of the embodiments of the present application provides a system for predicting a mouth shape control parameter, where the system includes: the audio data module is used for acquiring audio data; a mouth shape control parameter determination module for determining mouth shape control parameters based on at least the audio data and the machine learning model; wherein the mouth shape control parameter is capable of reflecting at least a mouth shape of an animated character corresponding to the audio data.
One of the embodiments of the present application provides a device for predicting a mouth shape control parameter, which includes a processor, where the processor is configured to execute the above method for predicting a mouth shape control parameter.
One of the embodiments of the present application provides a computer-readable storage medium, where the storage medium stores computer instructions, and after the computer reads the computer instructions in the storage medium, the computer executes the method for predicting the mouth shape control parameter.
Drawings
The present application will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:
FIG. 1 is a schematic diagram of an application scenario of a prediction system for a mouth shape control parameter according to some embodiments of the present application;
FIG. 2 is a block diagram of a prediction system for a mouth shape control parameter according to some embodiments of the present application;
FIG. 3 is an exemplary flow chart of a method of predicting a mouth shape control parameter according to some embodiments of the present application;
FIG. 4 is an exemplary sub-flowchart illustrating the determination of a mouth shape control parameter according to some embodiments of the present application;
FIG. 5 is an exemplary diagram of a machine learning model usage process, shown in accordance with some embodiments of the present application;
FIG. 6 is an exemplary diagram of a machine learning model usage process, shown in accordance with some embodiments of the present application;
FIG. 7 is an exemplary diagram of a machine learning model usage process, shown in accordance with some embodiments of the present application.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only examples or embodiments of the application, based on which a person of ordinary skill in the art can also apply the application to other similar scenarios without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "device", "unit", and/or "module" as used herein is one way of distinguishing different components, elements, parts, portions, or assemblies at different levels. However, these terms may be replaced by other expressions if the other expressions accomplish the same purpose.
As used in this application and the appended claims, the singular forms "a," "an," and "the" may also refer to the plural unless the context clearly dictates otherwise. In general, the terms "comprise" and "include" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flow charts are used in this application to illustrate the operations performed by systems according to embodiments of the present application. It should be understood that the operations are not necessarily performed exactly in the order shown. Rather, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
In general, animated characters appear in games, movies, animation, and similar fields. To make these animated characters appear vivid, different animated characters are often given corresponding facial expressions and/or mouth shapes when speaking, where the facial expressions and/or mouth shapes are associated with the speech content, tone, character personality, and the like. In some embodiments, the facial expression and/or mouth shape of an animated character may be determined by changes in one or more facial positions such as the mouth corners, lips, chin, and jaw of the animated character.
In some embodiments, an animator can adjust the parameters of the mouth shape controllers frame by frame according to the speech content, tone, and other information in the character audio, combined with the animator's own experience, and finally generate the facial expression and/or mouth shape of the animated character while speaking. This approach imposes a heavy workload on animators and yields low production efficiency; often only a few seconds of fine animation can be produced in a day.
In some embodiments, the animated character may be dubbed by a voice actor, the mouth movements of the voice actor may be captured with an expression capture device, and the mouth movement data may then be imported into animation software to be corrected and adjusted by an animator, thereby generating the facial expression and/or mouth shape of the animated character while speaking. This approach greatly improves animation efficiency over the previous one, but adds the economic cost of voice actors and expression capture equipment, places higher demands on the voice actors' performances, and thereby introduces additional time costs.
It should be noted that the technical solution disclosed in the present application uses a machine learning model to predict the mouth shape control parameters corresponding to a character's speech, and finally automatically generates the facial expression and/or mouth shape of the animated character while speaking according to the mouth shape control parameters. The technical solution of the present application can be executed by a computer, which greatly reduces labor costs, increases animation production efficiency, and improves the accuracy of the resulting facial expression and/or mouth shape when the animated character speaks.
Fig. 1 is a schematic diagram of an application scenario of a prediction system for a mouth shape control parameter according to some embodiments of the present application. As shown in fig. 1, the prediction system 100 may include a processing device 110, a network 120, a terminal 130, a storage device 140, and a player 150.
In some embodiments, the processing device 110 may be used to perform one or more of the functions disclosed in this specification. For example, the processing device 110 may determine mouth shape control parameters based on audio data; as another example, the processing device 110 may output one or more mouth shape control parameters to the terminal. In some embodiments, the processing device 110 may be located on the server side; in some embodiments, the processing device 110 may be located on the terminal 130 side. In some embodiments, the processing device may include one or more processing engines (e.g., a single-core or multi-core processing engine).
In some embodiments, the network 120 may facilitate the exchange of data and/or information. In some embodiments, one or more components in the prediction system 100 (e.g., the processing device 110, the terminal 130, the storage device 140, the player 150) may send data and/or information to other components in the prediction system 100 via the network 120. For example, the player 150 may read and play audio data stored in the storage device 140 via the network 120. As another example, the processing device 110 may retrieve audio data from the storage device 140 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network. One or more components of the prediction system 100 may be connected to the network 120 to exchange data and/or information.
In some embodiments, the terminal 130 may be a device with data acquisition, storage, and/or display capabilities. In some embodiments, the terminal 130 may be used to obtain the mouth shape control parameters output by the processing device 110. In some embodiments, the terminal 130 may include at least one animation processor that can be used to process the mouth shape control parameters to generate the facial expressions and/or mouth shapes of the corresponding animated character while speaking. In some embodiments, the terminal 130 may include a display device that can be used to display those facial expressions and/or mouth shapes. In some embodiments, the terminal 130 may include, but is not limited to, a smartphone, a tablet, a laptop, a desktop, etc., or any combination thereof.
In some embodiments, the storage device 140 may store data and/or instructions. In some embodiments, the storage device 140 may store data and/or instructions that the processing device 110 may execute and/or use to implement the example methods described in this application. In some embodiments, the storage device 140 may store audio data being played or to be played by the player 150. In some embodiments, the storage device may store historical audio data and/or historical animation frame numbers. In some embodiments, a portion of the storage device 140 may be disposed on the processing device 110; for example, the portion used to store operation instructions and/or audio data and/or historical animation frame numbers may be provided on the processing device. In some embodiments, a portion of the storage device 140 may also be disposed on the player 150; for example, the portion used to store the audio data to be played may be provided on the player. In some embodiments, the storage device 140 may include mass storage, removable storage, volatile read-and-write memory (e.g., random access memory, RAM), read-only memory (ROM), and the like, or any combination thereof.
In some embodiments, the player 150 includes at least an audio player. In some embodiments, the player 150 may play only audio data, or may play animated video with sound. In some embodiments, the audio data may be played by a player, and the processing device 110 obtains the audio data played by the player; in some embodiments, audio data stored in the player 150 may also be transmitted to the processing device 110 over a network.
FIG. 2 is a block diagram of a prediction system for a mouth shape control parameter according to some embodiments of the present application. As shown in fig. 2, the prediction system 200 may include an audio data acquisition module 210, a mouth shape control parameter determination module 220, an output module 230, and a training module 240.
In some embodiments, the audio data acquisition module 210 may be used to acquire audio data.
In some embodiments, the mouth shape control parameter determination module 220 may be used to determine mouth shape control parameters based at least on the audio data and a machine learning model, wherein the mouth shape control parameters can reflect at least the mouth shape of an animated character corresponding to the audio data. In some embodiments, the mouth shape control parameters include one or more of mouth angle control parameters, lip control parameters, chin control parameters, and jaw control parameters. In some embodiments, the mouth shape control parameter determination module 220 may be further configured to perform feature coding on the audio data based on a preset algorithm to determine audio features.
In some embodiments, the mouth shape control parameter determination module 220 may be further configured to determine a target audio unit based on the audio data, determine a group of audio units based on the target audio unit and one or more neighboring audio units neighboring thereto; determining a plurality of mouth shape control parameters corresponding to each audio unit in the audio unit group based on the audio unit group and the machine learning model; and carrying out weighted average processing on the plurality of mouth shape control parameters, wherein the processing result is used as a target mouth shape control parameter corresponding to the target audio unit.
In some embodiments, the mouth shape control parameter determination module 220 may be further configured to process the audio data through a machine learning model to determine classification intervals and deviation values corresponding to the mouth shape control parameters.
In some embodiments, the output module 230 may be used to output the mouth shape control parameters.
In some embodiments, the training module 240 may be used to obtain a machine learning model that determines the mouth shape control parameters. In some embodiments, the training module 240 may be configured to obtain a training sample set including historical audio data and historical mouth shape control parameters corresponding to the audio data; determining historical audio characteristics corresponding to the historical mouth shape control parameters based on the historical audio data, and taking the historical audio characteristics as input data; taking historical mouth shape control parameters as output data or reference standards; and training the initial machine learning model by using the input data and the corresponding output data or the reference standard to obtain the trained machine learning model.
In some embodiments, the prediction system 200 further comprises an animation frame number obtaining module, configured to obtain an animation frame number corresponding to the audio data; the number of the audio features corresponds to the number of the animation frames.
It should be understood that the system and its modules shown in FIG. 2 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules of the present application may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of the above hardware circuits and software (e.g., firmware).
In some embodiments, the training module 240 may be disposed in a server-side processing device, or may be disposed in a client-side processing device, or a portion of the training module may be disposed in a server-side processing device and another portion of the training module may be disposed in a client-side processing device, and thus is represented by a dashed line.
It should be noted that the above description of the prediction system and its modules is only for convenience of description and is not intended to limit the present application to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given an understanding of the principle of the system, any combination of modules may be made, or a subsystem may be connected to other modules, without departing from this principle. For example, in some embodiments, the audio data acquisition module 210, the mouth shape control parameter determination module 220, the output module 230, and the training module 240 disclosed in fig. 2 may be different modules in one system, or one module may implement the functions of two or more of the above modules. As another example, the audio data acquisition module 210 and the training module 240 may be two separate modules, or one module may have both the acquisition and training functions. As a further example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of the present application.
FIG. 3 is an exemplary flow chart of a method of predicting a mouth shape control parameter according to some embodiments of the present application.
At step 310, audio data is obtained.
In some embodiments, step 310 may be performed by audio data acquisition module 210.
In some embodiments, audio data may be understood as data information capable of reflecting the content of speech. In some embodiments, the audio data may be a piece of sound information. For example, the sound played by the player or dubbed by a dubber at a recording site. In some embodiments, the audio data may include audio files stored in the storage device corresponding to a certain piece of speech content. In some embodiments, the audio data may also include audio features corresponding to a certain piece of speech content, which may be directly input into the machine learning model for use. In some embodiments, if the acquired audio data is sound information, it needs to be converted into corresponding audio features when being input into a machine learning model or used in combination with some algorithms.
In some embodiments, the manner in which the audio data is obtained may include the processing device directly obtaining the audio data in the storage device, which may include sound information or audio features. In some embodiments, the method may also include obtaining sound information of a dubbing by a dubber in a recording scene, and subsequently converting the obtained sound information into corresponding audio features. In some embodiments, it may also include obtaining sound information played from the player, and then converting the sound information into corresponding audio features through processing.
At step 320, a mouth shape control parameter is determined based at least on the audio data and the machine learning model.
In some embodiments, step 320 may be performed by the mouth shape control parameter determination module 220.
In some embodiments, the mouth shape control parameter may be understood as parameter information capable of reflecting the expression of a character in an animation, for example, the value of a mouth shape or expression controller in animation software; in some embodiments, different values of the controller can represent different expressions of the character. In some embodiments, the value of the controller may be spatial position information (e.g., on the X, Y, Z axes), which in some embodiments may be coordinate values. In some embodiments, the controllers include at least a mouth angle controller for reflecting the mouth shape of the character. In some embodiments, the controllers further include one or more of a lip controller, a chin controller, etc., for enriching the character's expression to show the animated character's speech content, tone, character personality, and the like. Correspondingly, in some embodiments, the mouth shape control parameters may include one or any combination of mouth angle control parameters, lip control parameters, chin control parameters, jaw control parameters, and the like. In some embodiments, the form of the mouth shape control parameter may include one or a combination of letters, numbers, and symbols, which is not limited by one or more embodiments of this specification as long as it can reflect the expression of the animated character. In some embodiments, the animation software may include 3ds Max, Maya, LightWave, etc.; the present application does not limit the animation software employed.
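As an illustration only, the following is a hedged sketch of how one set of mouth shape control parameters could be represented as data; the controller names and the use of (x, y, z) coordinate triples are assumptions for illustration, since the embodiments above only require values that can drive the controllers in the animation software.

```python
# Illustrative only: one possible data layout for mouth shape control parameters.
# The controller names and (x, y, z) coordinate values are assumptions, not the
# patent's prescribed format.
from dataclasses import dataclass

@dataclass
class MouthShapeControlParameters:
    mouth_angle: tuple[float, float, float]  # mouth angle controller position
    lips: tuple[float, float, float]         # lip controller position
    chin: tuple[float, float, float]         # chin controller position
    jaw: tuple[float, float, float]          # jaw controller position
```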
In some embodiments, the mouth shape control parameters may determine, through the values of the respective controllers, the positions of one or more of the mouth corners, lips, and chin, thereby reflecting the mouth shape or expression of the animated character as it speaks. In some embodiments, the mouth shape control parameters may be determined based on the audio data and a machine learning model, or may be directly input or modified in the animation software by an animator.
In some embodiments, the obtained audio data is sound information without feature coding, and feature coding needs to be performed on the audio data based on a preset algorithm to determine audio features; the audio features are then input into a machine learning model to determine the mouth shape control parameters corresponding to the audio features. In some embodiments, the audio data feature encoding may include one or a combination of waveform encoding, parametric encoding, and hybrid encoding. In some embodiments, the format of the audio data feature encoding may include PCM encoding, WAV format, MPC encoding, WMA format, and the like, or combinations thereof. In some embodiments, audio features refer to digital feature vectors that are obtained by feature encoding the audio data and can be processed by a computer. In some embodiments, the audio features may include, but are not limited to, feature parameters such as perceptual linear prediction coefficients (PLP), linear prediction cepstral coefficients (LPCC), and mel-frequency cepstral coefficients (MFCC). In some embodiments, the preset algorithm used for audio data feature coding may include, but is not limited to, signal processing algorithms such as the short-time Fourier transform (STFT), the wavelet transform, the W-V transform, etc.; the present application does not limit the algorithm used for audio data feature coding.
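As an illustration of the feature coding step, the following is a minimal sketch assuming MFCC features extracted with the librosa library; the library choice, the 16 kHz sample rate, and the 13 coefficients are assumptions not specified in the embodiments above.

```python
# A minimal sketch of audio feature coding (MFCCs via short-time analysis).
# librosa, the 16 kHz sample rate, and n_mfcc=13 are illustrative assumptions.
import librosa
import numpy as np

def encode_audio_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load an audio file and return a (num_audio_frames, n_mfcc) feature matrix."""
    waveform, sample_rate = librosa.load(wav_path, sr=16000)
    # Each column of `mfcc` is the feature vector of one short-time analysis frame.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T
```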
In some embodiments, determining the corresponding mouth shape control parameters based at least on the audio features and the machine learning model may include: inputting one or more audio features into a trained machine learning model for prediction to obtain the corresponding mouth shape control parameters. For example, a plurality of consecutive audio features obtained by encoding a segment of audio data are input into the machine learning model, and one or more sets of mouth shape control parameters corresponding to that segment of audio data can be obtained. In some embodiments, the machine learning model may include one or a combination of a supervised learning model, an unsupervised learning model, and a reinforcement learning model. In some embodiments, the supervised learning model may include decision trees, deep neural networks, SVMs, and the like, or combinations thereof. In some embodiments, the machine learning model may include a deep neural network model, such as a MobileNet-V2 network model.
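A hedged sketch of the prediction step is given below: per-frame audio features in, mouth shape control parameters out. The embodiments above mention a MobileNet-V2-style deep neural network; the small fully connected PyTorch regressor here is only a stand-in to show the input and output shapes, not the actual architecture.

```python
# Stand-in for "machine learning model one": maps per-frame audio features to
# mouth shape control parameters. PyTorch and the layer sizes are assumptions;
# the description above cites a MobileNet-V2-style network.
import torch
import torch.nn as nn

class MouthShapeRegressor(nn.Module):
    def __init__(self, feature_dim: int = 13, num_params: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_params),  # e.g. mouth angle / lip / chin / jaw values
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (num_frames, feature_dim) -> (num_frames, num_params)
        return self.net(audio_features)
```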
In some scenarios where the number of animation frames is already determined, in order to match each frame of the animation, mouth shape control parameters whose number corresponds to the number of animation frames need to be obtained; that is, the audio data corresponding to the animation needs to be processed in advance into audio features whose number corresponds to the number of animation frames. Correspondingly, in some embodiments, the processing device needs to acquire the number of animation frames corresponding to the audio data and then perform feature coding on the audio data based on a preset algorithm to determine the audio features, where the number of audio features corresponds to the number of animation frames; in some embodiments, the number of audio features may be the same as the number of animation frames.
In some embodiments, the number of animation frames may be retrieved from a storage device. In some embodiments, the number of animation frames is the number of frames included in a segment of continuous animation, where each frame is a still image. In some embodiments, once the animation corresponding to a piece of audio data is determined, the number of frames of that animation is also determined. In order to predict the character's facial expression in each frame of the animation, the audio data is processed during feature coding into the same number of audio features as the number of animation frames, so as to obtain the mouth shape control parameters corresponding to each frame of the animation.
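The following sketch illustrates one way to make the number of audio features correspond to the animation frame count; linear interpolation along the time axis is an assumption, since the description above only requires that the two counts correspond.

```python
# Resample a (num_audio_frames, dim) feature matrix to one feature vector per
# animation frame. Linear interpolation is an illustrative assumption.
import numpy as np

def align_features_to_frames(features: np.ndarray, num_animation_frames: int) -> np.ndarray:
    src = np.linspace(0.0, 1.0, num=features.shape[0])
    dst = np.linspace(0.0, 1.0, num=num_animation_frames)
    columns = [np.interp(dst, src, features[:, d]) for d in range(features.shape[1])]
    return np.stack(columns, axis=1)  # shape: (num_animation_frames, dim)
```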
In some embodiments, a segment of audio data may also be obtained directly and feature-encoded to obtain at least one audio feature, the corresponding mouth shape control parameters are then determined based on the audio features, and finally the number of animation frames, equal to the number of audio features, is determined according to the mouth shape control parameters.
In some embodiments, determining the mouth shape control parameters based on the audio data and the machine learning model may directly predict the corresponding mouth shape control parameters based on the audio data and the machine learning model; in some embodiments, the classification interval and the deviation value of the mouth shape control parameter corresponding to the audio data can be indirectly predicted based on the audio data and the machine learning model, and then the mouth shape control parameter is further determined through the two values.
In some embodiments, as shown in fig. 5, the audio data and/or audio features corresponding to one or more frames are input into the trained machine learning model one for prediction, so as to directly output the corresponding mouth shape control parameters.
In some embodiments, the same machine learning model (e.g., machine learning model three) may be used to determine both the classification interval and the deviation value corresponding to the mouth shape control parameter to be predicted. In some embodiments, the classification interval corresponding to the mouth shape control parameter may refer to a value range within which the mouth shape control parameter falls; for example, the classification intervals may be set to [0,1]; [1,2]; [2,3]; [3,4]; [4,5] according to the general numerical range of the mouth shape control parameter. In some embodiments, the deviation value corresponding to the mouth shape control parameter may be the difference between the true value of the mouth shape control parameter and the central value of the classification interval. In some embodiments, the deviation value corresponding to the mouth shape control parameter may be positive or negative. In some embodiments, the central value of the classification interval corresponding to the mouth shape control parameter may be the value at the center of the interval; for example, the central values of the classification intervals [0,1], [1,2], [2,3] are 0.5, 1.5, 2.5, respectively. In some embodiments, after the predicted classification interval and deviation value are determined, the sum of the central value of the classification interval and the deviation value may be used as the final predicted value of the mouth shape control parameter. For example, if, after an audio feature is input into the machine learning model, the predicted classification interval is [1,2] and the deviation value is 0.8, the machine learning model predicts that the value of the mouth shape control parameter corresponding to the audio feature falls within the [1,2] interval and deviates from the interval's central value of 1.5 by 0.8, so the true value of the mouth shape control parameter is the sum of 1.5 and 0.8, i.e., 2.3.
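A minimal sketch of this decoding step follows, reproducing the [1,2] interval and 0.8 deviation example above; representing the interval as a (low, high) pair is an illustrative assumption.

```python
# Recover the mouth shape control parameter from a predicted classification
# interval and deviation value: center of the interval plus the deviation.
def decode_parameter(interval: tuple[float, float], deviation: float) -> float:
    low, high = interval
    return (low + high) / 2.0 + deviation

print(decode_parameter((1.0, 2.0), 0.8))  # 2.3: interval [1,2], center 1.5, deviation 0.8
```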
Specifically, as shown in fig. 7, audio data or audio features corresponding to one or more frames of animation are input into machine learning model three to obtain the corresponding classification interval and deviation value of the mouth shape control parameter, and the deviation value is then added to the central value of the classification interval to output the true value of the mouth shape control parameter. By first predicting the classification interval in which the mouth shape control parameter falls and then predicting the deviation from the central value of that interval, the obtained mouth shape control parameter is more accurate and better matched to the corresponding audio data or audio features, which can improve the accuracy of the generated facial expression and/or mouth shape results.
In some embodiments, the classification interval and the deviation value may also be predicted by two machine learning models, respectively. In some embodiments, the two machine learning models may predict the classification interval and the deviation value simultaneously, or one of them may first predict the classification interval and the other then predict the deviation value. Specifically, the same audio data or audio features are input into the two trained machine learning models, one machine learning model predicts the classification interval corresponding to the mouth shape control parameter, the other predicts the deviation value corresponding to the mouth shape control parameter, and the deviation value is then added to the central value of the classification interval to output the true value of the mouth shape control parameter. In some embodiments, the two machine learning models may be of the same or different types. The types of machine learning models are described elsewhere in this specification and will not be repeated here.
At step 330, the mouth shape control parameters are output.
In some embodiments, step 330 may be performed by output module 230.
In some embodiments, after determining the mouth shape control parameter, the processing device may output the mouth shape control parameter to the terminal for use, or output the mouth shape control parameter to the memory for storage, and the terminal may obtain the mouth shape control parameter from the memory through the network when it needs to be used. In some embodiments, the terminal comprises at least one data interface, and the data interface can be used for receiving the mouth shape control parameters sent to the terminal by the processing device; in some embodiments, the terminal may also retrieve the mouth shape control parameters from the memory via the data interface.
In some embodiments, the animation software provided on the terminal includes one or more expression controllers that can determine the character expressions in the animation images of the corresponding frames according to the mouth shape control parameters acquired by the terminal. In some embodiments, the processing device may output the mouth shape control parameters directly through the data interface into a controller of the animation software; the mouth shape control parameters can also be output to the terminal equipment through the data interface and then transmitted to the animation software by the terminal equipment. In some embodiments, the display device on the terminal may be configured to display the mouth shape or the expression corresponding to each frame when the corresponding animated character speaks, and may also be configured to display the change process of the mouth shape and the expression when the corresponding animated character speaks the corresponding voice content.
In some embodiments, when the processing device is provided on the server side, the mouth shape control parameters determined by the processing device may be transmitted to the terminal in the form of binary codes through the network. In some embodiments, when the processing device is provided on the terminal side, the processing device may directly send the mouth shape control parameters to the respective expression controllers of the animation software on the terminal.
It should be noted that the above description of flowchart 300 is for purposes of example and illustration only and is not intended to limit the applicability of one or more embodiments of the present disclosure. Various modifications and alterations to flow 300 may occur to those skilled in the art, as guided by one or more of the embodiments described herein. However, such modifications and variations are intended to be within the scope of the present description. For example, step 320 may be split into multiple steps; as another example, all of the steps in flow 300 may be embodied in a computer readable medium comprising a set of instructions, which may be transmitted in an electronic stream or an electronic signal.
FIG. 4 is an exemplary flow chart of a method of determining a mouth shape control parameter according to some embodiments of the present application.
In some embodiments, in order to obtain more stable and accurate prediction results, several consecutive audio features may be predicted simultaneously, and a weighted average of the multiple prediction results is then taken as the predicted value corresponding to one of the audio features. In some embodiments, a segment of audio data may be feature-encoded, several consecutive audio features of the segment may be selected, and the selected audio features may be input simultaneously into a trained machine learning model for prediction, so as to obtain the mouth shape control parameter corresponding to each of the consecutive audio features. The plurality of mouth shape control parameters corresponding to these consecutive audio features are then subjected to weighted average processing, and the result is taken as the mouth shape control parameter corresponding to one of the audio features. The mouth shape control parameters obtained in this way are more stable and accurate, so the stability and accuracy of the generated facial expression or mouth shape of the animated character can be improved.
Step 410, a target audio unit is determined based on the audio data.
In some embodiments, the target audio unit may be determined based on the audio data. In some embodiments, a target audio unit may be understood as the audio unit for which the corresponding mouth shape control parameter is to be predicted. The target audio unit may be any audio unit in the entire piece of audio data. In some embodiments, a segment of audio data may be divided in time into a plurality of unit audio segments, and the audio feature obtained by feature coding each unit audio segment is an audio unit. For example, 1 second of audio data may be divided into 10 unit audio segments of 0.1 second each, corresponding to 10 audio units. In some embodiments, a unit audio segment may correspond to one frame of the picture. For example, a one-second animation may include 30 frames, and 30 consecutive audio units may be obtained after the audio data corresponding to this animation is feature-coded.
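As an illustration of dividing audio data into unit audio segments and feature-coding each one into an audio unit, here is a sketch following the 0.1-second example above; librosa, mean-pooled MFCCs, and the FFT settings are assumptions not specified in the embodiments.

```python
# Split a waveform into unit audio segments and encode each segment into one
# audio unit (here a mean-pooled MFCC vector). All encoding choices are
# illustrative assumptions.
import librosa
import numpy as np

def audio_units(waveform: np.ndarray, sample_rate: int, unit_seconds: float = 0.1) -> list[np.ndarray]:
    samples_per_unit = int(sample_rate * unit_seconds)
    units = []
    for start in range(0, len(waveform) - samples_per_unit + 1, samples_per_unit):
        segment = waveform[start:start + samples_per_unit]
        mfcc = librosa.feature.mfcc(y=segment, sr=sample_rate, n_mfcc=13, n_fft=512)
        units.append(mfcc.mean(axis=1))  # one feature vector per unit audio segment
    return units
```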
Step 420, an audio unit group is determined based on the target audio unit and one or more neighboring audio units neighboring thereto.
In some embodiments, the neighboring audio units of the target audio unit may be determined based on the target audio unit. In some embodiments, a neighboring audio unit may be understood as an audio unit that is temporally adjacent to the target audio unit. For example, a piece of audio data includes a plurality of audio units arranged in the time order of the audio, where one or more audio units adjacent to the target audio unit may be regarded as neighboring audio units. In some embodiments, the neighboring audio units may include one or more audio units immediately before the target audio unit and/or one or more audio units immediately after it.
In some embodiments, the target audio unit and its neighboring audio units, taken as consecutive audio units, may be treated as an audio unit group. In some embodiments, the audio unit group may include the target audio unit and one or more neighboring audio units before it, the target audio unit and one or more neighboring audio units after it, or the target audio unit together with neighboring audio units both before and after it. For example, if the target audio unit is audio unit i, the audio unit group may be audio unit i-1 and audio unit i; or audio unit i-2, audio unit i-1, and audio unit i; or audio unit i and audio unit i+1; or audio unit i, audio unit i+1, and audio unit i+2; or audio unit i-1, audio unit i, and audio unit i+1; or audio unit i-2, audio unit i-1, audio unit i, audio unit i+1, and audio unit i+2 (i = 1, 2, 3, ...).
Step 430, determining a plurality of mouth shape control parameters corresponding to each audio unit in the audio unit group based on the audio unit group and the machine learning model.
In some embodiments, the mouth shape control parameter corresponding to each audio unit in the audio unit group may be predicted by a machine learning model, so as to determine the plurality of mouth shape control parameters corresponding to the audio units in the audio unit group. Specifically, in some embodiments, the determined audio unit group may be input into a machine learning model, and a plurality of mouth shape control parameters may be output through the processing of the machine learning model, where the mouth shape control parameters respectively correspond to the audio units in the input audio unit group. For example, referring to fig. 6, suppose a certain audio unit group contains, in order: audio unit i-1, audio unit i, and audio unit i+1. After the audio unit group is input into machine learning model two, mouth shape control parameter j-1, mouth shape control parameter j, and mouth shape control parameter j+1 can be output, where mouth shape control parameter j-1 corresponds to audio unit i-1, mouth shape control parameter j corresponds to audio unit i, and mouth shape control parameter j+1 corresponds to audio unit i+1.
Step 440, performing weighted average processing on the plurality of mouth shape control parameters, and taking the processing result as a target mouth shape control parameter corresponding to the target audio unit.
In some embodiments, a plurality of mouth shape control parameters corresponding to each audio unit in the audio unit group are subjected to weighted average processing, and the processing result is used as a final predicted value of the mouth shape control parameter corresponding to a target audio unit in the audio unit group, that is, a target mouth shape control parameter. Therefore, the deviation of the mouth shape control parameters of each frame of picture can be reduced, the mouth shape control parameters are more accurate, and the stability of the generation of the facial expression and/or the mouth shape of the animated character is improved.
In some embodiments, when performing weighted average processing on a plurality of mouth shape control parameters corresponding to each audio unit in the audio unit group, a larger weight may be set for a target audio unit, and a relatively smaller weight may be set for an adjacent audio unit, and in addition, the weight of the adjacent audio unit farther away from the target audio unit is smaller. In some embodiments, the weights of the target audio unit and the adjacent audio unit in the audio unit group may also be the same, that is, the mouth shape control parameters corresponding to the audio unit group are subjected to arithmetic average processing, and the obtained average value is the target mouth shape control parameter corresponding to the target audio unit in the audio unit group.
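A sketch of the weighted average step is given below; the specific weights are assumptions, since the embodiments above only require that the target audio unit carry the largest weight and that more distant neighbors carry smaller ones (or that all weights be equal).

```python
# Combine the mouth shape control parameters predicted for every audio unit in
# the group into the target mouth shape control parameter by weighted average.
import numpy as np

def target_parameters(group_predictions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """group_predictions: (group_size, num_params); weights: (group_size,)."""
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return (weights[:, None] * group_predictions).sum(axis=0)

# Example for a group {i-1, i, i+1}: the middle (target) unit is weighted most heavily.
preds = np.array([[1.2, 0.4], [1.6, 0.5], [1.9, 0.7]])  # hypothetical parameter vectors
result = target_parameters(preds, np.array([0.25, 0.5, 0.25]))
```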
In some embodiments, the weighted average processing of the plurality of mouth shape control parameters corresponding to the audio units in the audio unit group and the prediction of the mouth shape control parameters may be performed by the same machine learning model. In some embodiments, the plurality of audio units in the audio unit group may each be predicted by the machine learning model, the predicted values of the plurality of mouth shape control parameters may then be subjected to weighted average processing by a preset algorithm, and the processing result is used as the target mouth shape control parameter of the target audio unit. The training process of the machine learning model is described below. In one or more of the embodiments described above, the machine learning model may be obtained by:
acquiring a training sample set; and training an initial machine learning model using the training sample set to obtain the trained machine learning model. The training sample set may include historical audio data and the historical mouth shape control parameters corresponding to the historical audio data. In some embodiments, the historical audio data is feature-encoded to obtain historical audio features, and the historical audio features are used as input data. For different machine learning models, the obtained historical mouth shape control parameters may be adjusted according to the type of machine learning model to obtain the output data or reference standard corresponding to that model.
In some embodiments, when the trained machine learning model includes machine learning model one, the historical mouth shape control parameters corresponding to the historical audio features may be used directly as output data, the historical audio features used as input data, and the initial machine learning model one trained using the input data and the output data. In some embodiments, a trained machine learning model may be obtained after training on a certain amount of sample data. In some embodiments, the initial machine learning model may be a deep neural network model; for example, the initial machine learning model may be a MobileNet-V2 network model.
In some embodiments, when the trained machine learning model includes machine learning model two, three consecutive historical audio features may be used as a historical audio feature group, and the historical audio feature group used as input data. The historical mouth shape control parameters corresponding to the three consecutive historical audio features are determined as a historical mouth shape control parameter group and used as output data. The initial machine learning model two is then trained using the input data and the output data.
In some embodiments, when the trained machine learning model includes machine learning model three, the historical audio features may be used as input data; the classification interval and the deviation value are determined based on the historical mouth shape control parameters corresponding to the historical audio features, with the classification interval used as the reference standard and the deviation value used as the output data; the initial machine learning model three is then trained using the input data, the output data, and the reference standard. For example, in some embodiments, if the historical mouth shape control parameter is 1.8, it may be determined that the classification interval in which the historical mouth shape control parameter falls is [1,2], the central value of the classification interval is 1.5, and the deviation value between the historical mouth shape control parameter and the central value 1.5 of the classification interval is 0.3.
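A sketch of constructing the classification interval and deviation labels for training machine learning model three is shown below, following the 1.8 example above; unit-width intervals are an assumption.

```python
# Turn a historical mouth shape control parameter into (classification interval,
# deviation from the interval's central value) training labels. Unit-width
# intervals [k, k+1] are an illustrative assumption.
import math

def make_labels(historical_parameter: float) -> tuple[int, float]:
    interval_low = math.floor(historical_parameter)  # 1.8 -> interval [1, 2]
    center = interval_low + 0.5                      # central value 1.5
    deviation = historical_parameter - center        # 1.8 - 1.5 = 0.3
    return interval_low, deviation

labels = make_labels(1.8)  # (1, ~0.3): interval [1,2], deviation about 0.3
```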
The beneficial effects that may be brought by the embodiments of the present application include, but are not limited to: (1) the animation production cost is reduced, and the animation production efficiency is increased; (2) the accuracy and the stability of the mouth shape control parameters are improved, and the mouth shape and/or the facial expression are more vivid and natural when the animation characters speak. It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be considered merely illustrative and not restrictive of the broad application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present application and thus fall within the spirit and scope of the exemplary embodiments of the present application.
Also, this application uses specific language to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present application may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereon. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of the present application may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, a dynamic programming language such as Python, Ruby, and Groovy, or other programming languages, and the like. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any network format, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as a software as a service (SaaS).
Additionally, the order in which elements and sequences of the processes described herein are processed, the use of alphanumeric characters, or the use of other designations, is not intended to limit the order of the processes and methods described herein, unless explicitly claimed. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as requiring more features than are expressly recited in the claims. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.
Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.
The entire contents of each patent, patent application, patent application publication, and other material cited in this application, such as articles, books, specifications, publications, documents, and the like, are hereby incorporated by reference into this application, except for any application history document that is inconsistent with or conflicts with the content of this application, and except for any document that limits the broadest scope of the claims of this application (whether currently or later appended to this application). It should be noted that if there is any inconsistency or conflict between the descriptions, definitions, and/or use of terms in the materials attached to this application and those stated in this application, the descriptions, definitions, and/or use of terms in this application shall control.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present application. Other variations are also possible within the scope of the present application. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the present application can be viewed as being consistent with the teachings of the present application. Accordingly, the embodiments of the present application are not limited to only those embodiments explicitly described and depicted herein.

Claims (18)

1. A method of predicting a mouth shape control parameter, the method being performed by at least one processor, the method comprising:
acquiring audio data;
determining a mouth shape control parameter based at least on the audio data and a machine learning model; wherein the mouth shape control parameter is capable of reflecting at least a mouth shape of an animated character corresponding to the audio data.
2. The method of claim 1, wherein the mouth shape control parameters comprise one or more of mouth angle control parameters, lip control parameters, chin control parameters, and jaw control parameters.
3. The method of claim 1, wherein determining the mouth shape control parameter based on at least the audio data and a machine learning model comprises:
performing feature coding on the audio data based on a preset algorithm to determine audio features;
determining mouth shape control parameters corresponding to the audio features based on the audio features and a machine learning model.
4. The method of claim 3, further comprising: acquiring the number of animation frames corresponding to the audio data;
wherein the number of the audio features is equal to the number of the animation frames.
5. The method of claim 1, wherein determining the mouth shape control parameter based on at least the audio data and a machine learning model comprises:
determining a target audio unit based on the audio data;
determining an audio unit group based on the target audio unit and one or more audio units neighboring thereto;
determining a plurality of mouth shape control parameters corresponding to each audio unit in the audio unit group based on the audio unit group and the machine learning model;
performing weighted average processing on the plurality of mouth shape control parameters, and using the processing result as a target mouth shape control parameter corresponding to the target audio unit.
6. The method of claim 1, wherein determining the mouth shape control parameter based on at least the audio data and a machine learning model comprises:
processing the audio data through a machine learning model to determine a classification interval and a deviation value corresponding to the mouth shape control parameter;
determining the mouth shape control parameter based on the classification interval and the deviation value.
7. The method of claim 1, wherein the machine learning model is obtained by:
acquiring a training sample set, wherein the training sample set comprises historical audio data and historical mouth shape control parameters corresponding to the historical audio data;
determining historical audio features corresponding to the historical mouth shape control parameters based on the historical audio data, and using the historical audio features as input data; determining output data or reference criteria based on the historical mouth shape control parameters;
training an initial machine learning model using the input data and the corresponding output data and/or reference criteria.
8. The method of claim 1, further comprising:
outputting the mouth shape control parameters.
9. A system for predicting a mouth shape control parameter, the system comprising:
the audio data module is used for acquiring audio data;
a mouth shape control parameter determination module for determining mouth shape control parameters based on at least the audio data and a machine learning model;
wherein the mouth shape control parameter is capable of reflecting at least a mouth shape of an animated character corresponding to the audio data.
10. The system of claim 9, wherein the mouth shape control parameters comprise one or more of mouth angle control parameters, lip control parameters, chin control parameters, and jaw control parameters.
11. The system of claim 9, wherein the mouth shape control parameter determination module is further configured to:
performing feature coding on the audio data based on a preset algorithm to determine audio features;
determining mouth shape control parameters corresponding to the audio features based on the audio features and a machine learning model.
12. The system of claim 11, further comprising an animation frame number acquisition module for acquiring the number of animation frames corresponding to the audio data;
wherein the number of the audio features is equal to the number of the animation frames.
13. The system of claim 9, wherein the mouth shape control parameter determination module is further configured to:
determining a target audio unit based on the audio data;
determining an audio unit group based on the target audio unit and one or more audio units neighboring thereto;
determining a plurality of mouth shape control parameters corresponding to each audio unit in the audio unit group based on the audio unit group and the machine learning model;
performing weighted average processing on the plurality of mouth shape control parameters, and using the processing result as a target mouth shape control parameter corresponding to the target audio unit.
14. The system of claim 9, wherein the mouth shape control parameter determination module is further configured to:
processing the audio data through a machine learning model to determine a classification interval and a deviation value corresponding to the mouth shape control parameter;
determining the mouth shape control parameter based on the classification interval and the deviation value.
15. The system of claim 9, further comprising a training module configured to:
acquiring a training sample set, wherein the training sample set comprises historical audio data and historical mouth shape control parameters corresponding to the historical audio data;
determining historical audio features corresponding to the historical mouth shape control parameters based on the historical audio data, and using the historical audio features as input data; determining output data or reference criteria based on the historical mouth shape control parameters;
training an initial machine learning model using the input data and the corresponding output data or reference criteria.
16. The system of claim 9, further comprising an output module for outputting the mouth shape control parameter.
17. An apparatus for predicting a mouth shape control parameter, comprising a processor, wherein the processor is configured to perform the method for predicting a mouth shape control parameter according to any one of claims 1 to 8.
18. A computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the method of predicting a mouth shape control parameter according to any one of claims 1 to 8.
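By way of non-limiting illustration only, the following Python sketch shows one possible realization of the feature coding and frame alignment steps recited in claims 3, 4, 11, and 12. The frame length, hop size, log-magnitude spectral feature, and linear interpolation used for alignment are assumptions made purely for illustration and are not prescribed by the claims.

    import numpy as np

    def encode_audio_features(samples, sample_rate, num_animation_frames,
                              frame_len=0.025, hop_len=0.010):
        """Frame the waveform, compute a log-magnitude spectrum per frame,
        and resample the feature sequence to one vector per animation frame."""
        frame = int(frame_len * sample_rate)
        hop = int(hop_len * sample_rate)
        window = np.hanning(frame)
        chunks = [samples[i:i + frame]
                  for i in range(0, len(samples) - frame + 1, hop)]
        feats = np.stack([np.log1p(np.abs(np.fft.rfft(c * window)))
                          for c in chunks])                          # (T_audio, F)

        # Align the audio features to the animation so that the number of audio
        # features equals the number of animation frames.
        src = np.linspace(0.0, 1.0, num=len(feats))
        dst = np.linspace(0.0, 1.0, num=num_animation_frames)
        return np.stack([np.interp(dst, src, feats[:, k])
                         for k in range(feats.shape[1])], axis=1)    # (T_anim, F)

Any other feature coding (for example, MFCC or filter-bank features) and any resampling scheme that yields one audio feature per animation frame could be substituted.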
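As a non-limiting illustration of the neighbor-based smoothing recited in claims 5 and 13, the sketch below assumes a per-unit predictor exposing a hypothetical model.predict interface and uses an illustrative triangular weighting; the claims prescribe neither the window size nor the weights.

    import numpy as np

    def smoothed_parameters(audio_units, model, radius=2):
        """Predict mouth shape control parameters per audio unit, then replace each
        prediction by a weighted average over the unit and its neighbors."""
        raw = np.stack([model.predict(unit) for unit in audio_units])    # (N, P)
        out = np.empty_like(raw)
        for t in range(len(raw)):
            lo, hi = max(0, t - radius), min(len(raw), t + radius + 1)
            # Triangular weights: the target unit receives the largest weight.
            w = np.array([radius + 1 - abs(t - i) for i in range(lo, hi)], dtype=float)
            out[t] = (raw[lo:hi] * w[:, None]).sum(axis=0) / w.sum()
        return out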
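Claims 6 and 14 recite determining a classification interval and a deviation value and deriving the mouth shape control parameter from them. One way such a scheme is commonly realized, shown here only as an assumed example, is to discretize the parameter range into bins and reconstruct the continuous value from a bin index and an in-bin offset.

    def decode_parameter(bin_index, offset, param_min=0.0, param_max=1.0, num_bins=10):
        """Reconstruct a continuous mouth shape control parameter from a
        classification interval (bin index) and a deviation value (in-bin offset).

        bin_index : integer in [0, num_bins)
        offset    : value in [0, 1) giving the position inside the chosen bin
        """
        bin_width = (param_max - param_min) / num_bins
        return param_min + (bin_index + offset) * bin_width

    # Example: interval 3 of 10 on the range [0, 1] with a deviation of 0.25 -> 0.325
    value = decode_parameter(3, 0.25)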
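For the training procedure of claims 7 and 15, a compact, non-limiting sketch using scikit-learn's multi-output MLP regressor is given below; the library choice, network size, file names, and data shapes are illustrative assumptions only and are not part of the claimed method.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    # Hypothetical files: one audio feature vector per animation frame (X) and the
    # corresponding historical mouth shape control parameters for that frame (Y).
    X = np.load("historical_audio_features.npy")    # shape (num_frames, num_features)
    Y = np.load("historical_mouth_params.npy")      # shape (num_frames, num_parameters)

    X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.1, random_state=0)

    model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500, random_state=0)
    model.fit(X_train, Y_train)   # fits by minimizing squared error against the reference parameters
    print("validation R^2:", model.score(X_val, Y_val))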
CN201911266594.XA 2019-12-11 2019-12-11 Prediction method and system for mouth shape control parameters Active CN110930481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911266594.XA CN110930481B (en) 2019-12-11 2019-12-11 Prediction method and system for mouth shape control parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911266594.XA CN110930481B (en) 2019-12-11 2019-12-11 Prediction method and system for mouth shape control parameters

Publications (2)

Publication Number Publication Date
CN110930481A true CN110930481A (en) 2020-03-27
CN110930481B CN110930481B (en) 2024-06-04

Family

ID=69858959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911266594.XA Active CN110930481B (en) 2019-12-11 2019-12-11 Prediction method and system for mouth shape control parameters

Country Status (1)

Country Link
CN (1) CN110930481B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108538308A (en) * 2018-01-09 2018-09-14 网易(杭州)网络有限公司 The voice-based shape of the mouth as one speaks and/or expression analogy method and device
CN109064532A (en) * 2018-06-11 2018-12-21 上海咔咖文化传播有限公司 The automatic shape of the mouth as one speaks generation method of cartoon role and device
CN109523616A (en) * 2018-12-04 2019-03-26 科大讯飞股份有限公司 A kind of FA Facial Animation generation method, device, equipment and readable storage medium storing program for executing
CN110189394A (en) * 2019-05-14 2019-08-30 北京字节跳动网络技术有限公司 Shape of the mouth as one speaks generation method, device and electronic equipment
CN110288682A (en) * 2019-06-28 2019-09-27 北京百度网讯科技有限公司 Method and apparatus for controlling the variation of the three-dimensional portrait shape of the mouth as one speaks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIN Aihua; ZHANG Wenjun; WANG Yimin; ZHAO Guangjun: "Implementation of Speech-Driven Lip Animation for Human Faces", Computer Engineering *
CHEN Yiqiang, GAO Wen, WANG Zhaoqi, JIANG Dalong: "Speech-Driven Face Animation Method Based on Machine Learning", Journal of Software *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113849626A (en) * 2021-10-18 2021-12-28 深圳追一科技有限公司 H5-based intelligent question answering method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110930481B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN110136698B (en) Method, apparatus, device and storage medium for determining mouth shape
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
CN111476871B (en) Method and device for generating video
CN110876024B (en) Method and device for determining lip action of avatar
WO2021082823A1 (en) Audio processing method, apparatus, computer device, and storage medium
CN110505498A (en) Processing, playback method, device and the computer-readable medium of video
CN110072047B (en) Image deformation control method and device and hardware device
CN112330779A (en) Method and system for generating dance animation of character model
CN110910479B (en) Video processing method, device, electronic equipment and readable storage medium
CN109271929B (en) Detection method and device
WO2019127940A1 (en) Video classification model training method, device, storage medium, and electronic device
CN116721191A (en) Method, device and storage medium for processing mouth-shaped animation
CN110930481B (en) Prediction method and system for mouth shape control parameters
CN113269066B (en) Speaking video generation method and device and electronic equipment
JP2015038725A (en) Utterance animation generation device, method, and program
CN112929743B (en) Method and device for adding video special effect to specified object in video and mobile terminal
CN117036555B (en) Digital person generation method and device and digital person generation system
US11461948B2 (en) System and method for voice driven lip syncing and head reenactment
CN112995530A (en) Video generation method, device and equipment
CN113990295A (en) Video generation method and device
CN112330780A (en) Method and system for generating animation expression of target character
KR102663654B1 (en) Adaptive visual speech recognition
CN116912376B (en) Method, device, computer equipment and storage medium for generating mouth-shape cartoon
US20220248107A1 (en) Method, apparatus, electronic device, and storage medium for sound effect processing during live streaming
EP4345814A1 (en) Video-generation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant