CN113379875A - Cartoon character animation generation method, device, equipment and storage medium


Info

Publication number: CN113379875A
Authority: CN (China)
Prior art keywords: music, data, vector, cartoon character, cartoon
Legal status: Granted; Active
Application number: CN202110301883.XA
Other languages: Chinese (zh)
Other versions: CN113379875B (en)
Inventors: Chen Cong (陈聪), Hou Cuiqin (侯翠琴), Li Jianfeng (李剑锋)
Original and current assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110301883.XA
Publication of CN113379875A; application granted; publication of CN113379875B

Classifications

    • G06T 13/00: Animation
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods (neural networks)
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers

Abstract

The invention relates to the field of artificial intelligence and discloses a cartoon character animation generation method, device, equipment and storage medium for improving the correlation between a music cartoon character animation and its music scene. The cartoon character animation generation method comprises the following steps: encoding the music text data in the music parameter data to obtain music content data, and converting the music content data into music voice data by means of a voice generation model; weighting the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features of the music character image data through a neural network self-attention mechanism to generate a basic cartoon character image; generating a target cartoon character image and a target music voice respectively based on a preset time-series neural network; and combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation. The invention also relates to blockchain technology: the music parameter data can be stored in a blockchain.

Description

Cartoon character animation generation method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a cartoon character animation generation method, device, equipment and storage medium.
Background
As material needs are increasingly met, more and more people pursue mental fulfilment, and music culture, with its long history, fills that need. From the earliest sung poetry to today's popular music, music as a form of expression directly conveys a musician's thoughts and emotions. With the progress of science and technology and the development of the times, the ways of popularizing and spreading music culture have become more technological, and one of the most important ways is to spread music culture through music cartoon character animation.
When producing a music cartoon animation, key frames of a specified action are usually drawn directly from the original pictures of an existing music cartoon character, and transition frames of the action are then inserted by hand according to the differences between adjacent key frames to generate the corresponding music cartoon animation. However, the correlation between a music cartoon character animation generated in this way and the music scene is low.
Disclosure of Invention
The invention provides a cartoon character animation generation method, device, equipment and storage medium for improving the correlation between cartoon character animation and music scenes.
A first aspect of the invention provides a cartoon character animation generation method, which comprises the following steps: acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model; extracting, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features; inputting the basic cartoon character image and the music voice data respectively into a preset time-series neural network, and generating a target cartoon character image and a target music voice respectively based on the preset time-series neural network; and combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
Optionally, in a first implementation manner of the first aspect of the present invention, the acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model includes: acquiring the music text data in the music parameter data, and extracting the text characters in the music text data; searching the preset Unicode character table for the standard character identical to each text character, taking the byte code corresponding to the standard character as the coded data corresponding to the text character, and determining the coded data corresponding to the text characters in the music text data as the music content data, wherein each standard character corresponds to one byte code; and converting the music content data into music voice data by adopting a voice generation model.
Optionally, in a second implementation manner of the first aspect of the present invention, the converting the music content data into music voice data by using a voice generation model includes: converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model; segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by utilizing an alignment function in the voice generation model to obtain aligned phonemes; inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations; and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing a plurality of sound waveforms to obtain music speech data.
Optionally, in a third implementation manner of the first aspect of the present invention, the extracting, in the preset cartoon character generation model, basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features includes: inputting the music character image data in the music parameter data into the preset cartoon character generation model, and extracting the basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; calculating the attention distribution of the basic vector features through a neural network self-attention mechanism in the preset cartoon character generation model; under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, summarizing the attention distribution of the basic vector features by using a summarizing formula to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features; and calculating a loss function value of the summary vector features by adopting a cross-entropy loss function, adjusting the summary vector features by using the loss function value, and generating a corresponding basic cartoon character image by using the adjusted summary vector features.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the calculating, by a neural network self-attention mechanism in the preset cartoon character generation model, an attention distribution of the basis vector features includes: acquiring query vector features in the music character image data, wherein the query vector features are used for expressing basic vector features related to cartoon characters in the music character image; calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
αₘ = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the m-th basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, yₙ represents the n-th basic vector feature, q represents the query vector, and m, n and M are positive integers.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the respectively inputting the basic cartoon character image and the music voice data into a preset time-series neural network, and respectively generating a target cartoon character image and a target music voice based on the preset time-series neural network includes: sorting the basic cartoon character images and the music voice data respectively according to a preset input time sequence, and integrating the sorted basic cartoon character images and music voice data into data to be predicted; acquiring the data to be predicted at the previous moment and the data to be predicted at the current moment, inputting them into a hidden layer of the preset time-series neural network, and performing convolution iterative calculation through the hidden layer on the data to be predicted at the previous moment and the data to be predicted at the current moment to generate the data to be predicted at the next moment; and merging a plurality of pieces of data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises the target cartoon character image and the target music voice.
Optionally, in a sixth implementation manner of the first aspect of the present invention, before the acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model, the cartoon character animation generation method further includes: acquiring music character animation data, training the music character animation data by utilizing a neural network self-attention mechanism, and generating the preset cartoon character generation model.
A second aspect of the present invention provides a cartoon character animation generation apparatus, which includes: an acquisition module, configured to acquire music parameter data, encode music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and convert the music content data into music voice data by adopting a voice generation model; a calculation module, configured to extract, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, perform weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculate summary vector features of the basic vector features, and generate a basic cartoon character image according to the summary vector features; a prediction module, configured to input the basic cartoon character image and the music voice data respectively into a preset time-series neural network, and generate a target cartoon character image and a target music voice respectively based on the preset time-series neural network; and a combination module, configured to combine the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
Optionally, in a first implementation manner of the second aspect of the present invention, the obtaining module includes: the extraction unit is used for acquiring music text data in the music parameter data and extracting text characters in the music text data; the determining unit is used for searching a standard character which is the same as the text character in a preset unicode character table, using a byte code corresponding to the standard character as coded data corresponding to the text character, and determining the coded data corresponding to the text character in the music text data as music content data, wherein each standard character corresponds to one byte code; and the conversion unit is used for converting the music content data into music voice data by adopting a voice generation model.
Optionally, in a second implementation manner of the second aspect of the present invention, the conversion unit is specifically configured to: converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model; segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by utilizing an alignment function in the voice generation model to obtain aligned phonemes; inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations; and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing a plurality of sound waveforms to obtain music speech data.
Optionally, in a third implementation manner of the second aspect of the present invention, the calculation module includes: an input unit, configured to input the music character image data in the music parameter data into the preset cartoon character generation model, and extract the basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; a calculation unit, configured to calculate the attention distribution of the basic vector features through a neural network self-attention mechanism in the preset cartoon character generation model; a summarizing unit, configured to summarize the attention distribution of the basic vector features by using a summarizing formula under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features; and an adjusting unit, configured to calculate a loss function value of the summary vector features by adopting a cross-entropy loss function, adjust the summary vector features by the loss function value, and generate a corresponding basic cartoon character image by using the adjusted summary vector features.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the calculating unit is specifically configured to: acquiring query vector features in the music character image data, wherein the query vector features are used for expressing basic vector features related to cartoon characters in the music character image; calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
αₘ = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the m-th basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, yₙ represents the n-th basic vector feature, q represents the query vector, and m, n and M are positive integers.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the prediction module is specifically configured to: sort the basic cartoon character images and the music voice data respectively according to a preset input time sequence, and integrate the sorted basic cartoon character images and music voice data into data to be predicted; acquire the data to be predicted at the previous moment and the data to be predicted at the current moment, input them into a hidden layer of the preset time-series neural network, and perform convolution iterative calculation through the hidden layer on the data to be predicted at the previous moment and the data to be predicted at the current moment to generate the data to be predicted at the next moment; and merge a plurality of pieces of data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises the target cartoon character image and the target music voice.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the cartoon character animation generation apparatus further includes: a generating module, configured to acquire music character animation data, train the music character animation data by utilizing a neural network self-attention mechanism, and generate the preset cartoon character generation model.
The third aspect of the present invention provides a cartoon character animation generation device, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the cartoon character animation generation device to execute the cartoon character animation generation method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the method for generating a cartoon character animation described above.
In the technical scheme provided by the invention, music parameter data is acquired, the music text data in the music parameter data is encoded by using a preset Unicode character table to obtain music content data, and the music content data is converted into music voice data by adopting a voice generation model; the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data are extracted in a preset cartoon character generation model, the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features are weighted through a neural network self-attention mechanism, summary vector features of the basic vector features are calculated, and a basic cartoon character image is generated according to the summary vector features; the basic cartoon character image and the music voice data are respectively input into a preset time-series neural network, and a target cartoon character image and a target music voice are respectively generated based on the preset time-series neural network; and the music content data, the target cartoon character image and the target music voice are combined to obtain the music cartoon character animation. In the embodiment of the invention, the music parameter data is encoded and converted to generate the music content data and music voice data, the micro-expression vector features, gesture vector features and limb motion vector features in the music parameter data are weighted by a neural network self-attention mechanism to generate the basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain the music cartoon character animation, so that the correlation between the music cartoon character animation and the music scene is improved.
Drawings
FIG. 1 is a diagram of an embodiment of a method for generating cartoon character animation according to the embodiment of the invention;
FIG. 2 is a diagram of another embodiment of a method for generating cartoon character animations according to the present invention;
FIG. 3 is a diagram of an embodiment of a cartoon character animation generation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of the cartoon character animation generation device in the embodiment of the invention;
FIG. 5 is a schematic diagram of an embodiment of the cartoon character animation generation device in the embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a cartoon character animation generation method, device, equipment and storage medium for improving the correlation between cartoon character animation and music scenes.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of the embodiment of the present invention is described below, and referring to fig. 1, an embodiment of the method for generating cartoon character animation according to the embodiment of the present invention includes:
101. acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
it is to be understood that the executing entity of the present invention may be a generating device of cartoon character animation, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
The music parameter data acquired by the server specifically includes two types of data:
1. music text data: in particular data related to music and having a content type in text form.
2. Music character image data: specifically, the data is data related to music and has a content type in an image form, where the format of the music character image may be JPEG, TIFF, RAW, or the like, and the format of the music character image is not limited in this application.
After obtaining the music parameter data, the server needs to encode the music text data in the music parameter data by using the preset Unicode character table, converting the text into characters that a computer can identify. The preset Unicode character table is the character encoding table corresponding to Unicode, which assigns a uniform and unique binary code to each character in every language, so as to meet the requirement of converting and processing text across languages and platforms.
It should be noted that, after the server obtains the music content data, it converts the music content data into music voice data by using a voice generation model, where the voice generation model refers to text-to-speech (TTS), a technology capable of converting arbitrary input text into corresponding speech. The voice generation model mainly comprises a front-end part and a back-end part. The front-end part analyzes the input music text data and extracts the information required for back-end modelling, such as word segmentation, part-of-speech tagging, prosodic structure prediction and polyphonic-character disambiguation of the music text data. The back-end part reads the analysis result produced by the front-end part, models the voice part by combining the analysis result, and during synthesis generates the output speech signal from the music text data and a pre-trained acoustic model.
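To make the front-end/back-end split concrete, here is a minimal Python sketch; the field names, the toy word segmentation, and the placeholder waveform are illustrative assumptions, not the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class FrontEndAnalysis:
    words: list[str]       # word segmentation result
    pos_tags: list[str]    # part-of-speech tags
    prosody: list[int]     # prosodic break indices

def front_end(music_text: str) -> FrontEndAnalysis:
    # Toy analysis: a real front end runs segmenters, taggers and
    # prosody/polyphone predictors over the music text data.
    words = music_text.split()
    return FrontEndAnalysis(words, ["n"] * len(words), [0] * len(words))

def back_end(analysis: FrontEndAnalysis) -> list[float]:
    # Stand-in for a pre-trained acoustic model producing a speech signal.
    return [0.0] * (160 * len(analysis.words))  # placeholder samples

speech_signal = back_end(front_end("music content data"))
```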
It should be emphasized that, to further ensure the privacy and security of the music parameter data, the music parameter data may also be stored in a node of a blockchain.
102. Extracting the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data in a preset cartoon character generation model, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
after the server obtains music content data and music voice data, the server needs to process music role image data in the music parameter data, a preset cartoon role generation model is utilized, basic vector features in the music role image data are extracted from the preset cartoon role generation model, attention distribution of the basic vector features is calculated through a neural network self-attention mechanism, micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features are weighted in the calculation process, summary vector features of the basic vector features are calculated, and finally the server generates basic cartoon role images according to the summary vector features.
It should be noted that a basic vector feature here refers to a pixel vector feature in the music character image data; one piece of music character image data has a plurality of basic vector features. When the server calculates the attention distribution by using the neural network self-attention mechanism, the purpose of weighting the micro-expression vector features, gesture vector features and limb motion vector features is to analyze the cartoon character specifically, so that the basic cartoon character image obtained through calculation correlates more closely with the music scene.
103. Inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and a target music voice respectively based on the preset time sequence neural network;
the basic cartoon character image and the music voice data obtained by the server at the moment are non-time sequence, so the server needs to generate a target cartoon character image and a target music voice which are arranged according to a certain time sequence by using a preset time sequence neural network. The preset time sequence neural network refers to a Recurrent Neural Network (RNN), which is a neural network for processing time sequence input, the lengths of time sequence data input into the RNN are different, and the contexts of the input time sequence data are related, the input data is subjected to convolution calculation through a plurality of hidden layers in the RNN, and finally, the convolved data is output through an output layer, so that data arranged according to a certain time sequence can be generated.
104. And combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
And after the server acquires the target cartoon character image and the target music voice which are arranged according to the time sequence, the music content data, the target cartoon character image and the target music voice are combined together to obtain the music cartoon character animation.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data, a neural network attention mechanism is utilized to perform weighting processing on micro-expression vector characteristics, gesture vector characteristics and limb motion vector characteristics in the music parameter data to generate a basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain a music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
Referring to fig. 2, another embodiment of the method for generating cartoon character animation according to the embodiment of the present invention includes:
201. Acquiring music character animation data, training the music character animation data by utilizing a neural network self-attention mechanism, and generating a preset cartoon character generation model;

Before processing the music parameter data, the server needs to collect a large amount of music character animation data and train on it to generate the preset cartoon character generation model. The music character animation data at least includes music animations such as Symphony Orchestra, Fantasia 2000 and Golden String.

The training of the large amount of music character animation data adopts a neural network self-attention mechanism; the preset cartoon character generation model obtained by training can generate a corresponding cartoon character image from the animation or images input into the model. The training process for the music character animation data is the same as that in step 203, so it is not repeated here.
202. Acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
it is emphasized that, in order to further ensure the privacy and security of the music parameter data, the music parameter data may also be stored in a node of a block chain.
It should be noted that the preset Unicode character table here records the byte code corresponding to each standard character, for example: the byte code corresponding to the standard character "A" is "&#x0041;" and the byte code corresponding to the standard character "叶" (leaf) is "&#x53F6;". A standard character identical to a text character in the music text data can therefore be searched for in the preset Unicode character table, and once the server finds the standard character, the coded data corresponding to the text character is determined from the preset Unicode character table, thereby converting the text characters in the music text data into a computer-readable and writable form.
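As a small illustration of this table lookup, the Python sketch below builds a hypothetical preset table in the "&#xXXXX;" byte-code style quoted above and encodes music text characters with it; the table contents and helper names are assumptions for demonstration only:

```python
def byte_code(ch: str) -> str:
    # '&#x0041;' for 'A', '&#x53F6;' for '叶', matching the examples above.
    return f"&#x{ord(ch):04X};"

# Hypothetical preset Unicode character table: one byte code per standard character.
unicode_table = {ch: byte_code(ch) for ch in "A叶音乐"}

def encode_music_text(text: str) -> list[str]:
    # Search the table for the standard character equal to each text character
    # and return its coded data, forming the music content data.
    return [unicode_table[ch] for ch in text if ch in unicode_table]

print(encode_music_text("A叶"))  # ['&#x0041;', '&#x53F6;']
```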
Here, speech synthesis technology is adopted to convert the music content data into music voice data; it processes the music text data in four stages, as follows:
1. Text-to-phoneme

The server inputs the music content data into the voice generation model; however, because different languages contain characters that are written identically but pronounced differently, each text character in the music content data must be converted into corresponding phoneme information by the phonetic notation algorithm. Chinese text characters, for example, are converted into pinyin.
2. Audio segmentation
After the server obtains the phoneme information, a segmentation function is needed to segment it and determine where each phoneme sequence starts, yielding segmented phonemes, that is, determining which phonemes form a complete character's phonetic transcription. Once the starting points are determined, the segmented phonemes are processed with an alignment function to obtain aligned phonemes, which makes the subsequent phoneme duration prediction easier.
3. Phoneme duration prediction
The server inputs the aligned phonemes into the duration prediction model, which outputs the predicted duration corresponding to each aligned phoneme; computing these predicted durations facilitates the subsequent generation of the sound waveforms.
4. Acoustic model
The server inputs the phoneme information and the predicted durations into the acoustic model. The acoustic model is equivalent to a vocoder and converts the input phoneme information into corresponding sound waveforms; the sound waveform corresponding to each text character is thus obtained, and the music voice data is obtained by splicing the sound waveforms together. It should be noted that the acoustic model can be further improved, for example by increasing the number of network layers, increasing the number of residual channels, replacing upsampling convolutions with matrix multiplication, and optimizing for the CPU or GPU.
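Putting the four stages together, the following Python sketch shows the data flow only; every model (phonetic notation, segmentation/alignment, duration prediction, and the vocoder-style acoustic model) is replaced by a trivial stub, so all names, durations and frequencies are illustrative assumptions:

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, assumed

def text_to_phonemes(chars):
    # Stage 1: phonetic notation algorithm (stub): one phoneme label per character.
    return [f"ph_{c}" for c in chars]

def segment_and_align(phonemes):
    # Stage 2: segmentation + alignment (stub): one phoneme per segment, already aligned.
    return [[p] for p in phonemes]

def predict_durations(aligned):
    # Stage 3: duration prediction model (stub): fixed 0.2 s per aligned segment.
    return [0.2] * len(aligned)

def acoustic_model(phonemes, durations):
    # Stage 4: vocoder stand-in: one sine burst per phoneme, spliced into one waveform.
    waves = []
    for i, (_, d) in enumerate(zip(phonemes, durations)):
        t = np.linspace(0.0, d, int(SAMPLE_RATE * d), endpoint=False)
        waves.append(np.sin(2 * np.pi * (220.0 + 40.0 * i) * t))
    return np.concatenate(waves)

phonemes = text_to_phonemes(["音", "乐"])
aligned = segment_and_align(phonemes)
durations = predict_durations(aligned)
music_voice_data = acoustic_model(phonemes, durations)  # shape: (6400,)
```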
203. Extracting the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data in the preset cartoon character generation model, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
Specifically, the music character image data in the music parameter data is input into the preset cartoon character generation model, and the basic vector features in the music character image data are extracted in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; the attention distribution of the basic vector features is calculated through a neural network self-attention mechanism in the preset cartoon character generation model; and under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, the attention distribution of the basic vector features is summarized by using a summarizing formula to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features. A loss function value of the summary vector features is then calculated by adopting a cross-entropy loss function, the summary vector features are adjusted by using the loss function value, and the adjusted summary vector features are used to generate the corresponding basic cartoon character image.
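A numeric sketch of this weighted summarizing step may help; here a softmax stands in for the attention distribution, and the β values are made-up extra weights (the patent gives no concrete numbers):

```python
import numpy as np

def summarize(features, alphas, betas):
    # att(X, q) = sum_k beta_k * alpha_k * x_k, with larger beta on the
    # micro-expression, gesture and limb-motion vector features.
    return sum(b * a * x for x, a, b in zip(features, alphas, betas))

rng = np.random.default_rng(0)
features = [rng.normal(size=8) for _ in range(5)]  # 3 key + 2 residual vector features
scores = rng.normal(size=5)                        # attention scores s(y_m, q)
alphas = np.exp(scores) / np.exp(scores).sum()     # attention distribution
betas = np.array([2.0, 2.0, 2.0, 1.0, 1.0])        # hypothetical weighting
summary_vector = summarize(features, alphas, betas)  # the summary vector feature
```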
The server calculates the attention distribution of the basic vector features through the neural network self-attention mechanism in the preset cartoon character generation model as follows: the server acquires the query vector feature in the music character image data, wherein the query vector feature is used to express the basic vector features related to the cartoon character in the music character image; the server then calculates the attention distribution of each basic vector feature under the condition of the given query vector feature, using the calculation formula of the neural network self-attention mechanism in the preset cartoon character generation model:

αₘ = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the m-th basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, yₙ represents the n-th basic vector feature, q represents the query vector, and m, n and M are positive integers.
Here, the query vector feature in the music character image data is used to indicate information related to the query task, for example, in the present application, the query task refers to generating a cartoon character from the music character image data, that is, the query vector feature should be a vector feature related to the cartoon character in the music character image data.
It should be further explained that the attention scoring function adopted in the present application is the dot product model, s(yₘ, q) = yₘᵀq. The attention scoring function may alternatively be:

1. Bilinear model:

s(yₘ, q) = yₘᵀWq

wherein s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, q represents the query vector, W is a learnable parameter matrix, and m is a positive integer.

2. Scaled dot product model:

s(yₘ, q) = yₘᵀq / √d

wherein d represents the dimension of the basic vector features.
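The three scoring functions and the attention distribution can be sketched directly in NumPy; the data below are random placeholders, and W would be an untrained parameter matrix:

```python
import numpy as np

def dot_product_score(y, q):
    return y @ q                          # the model used in this application

def bilinear_score(y, q, W):
    return y @ W @ q                      # W is a learnable parameter matrix

def scaled_dot_product_score(y, q):
    return (y @ q) / np.sqrt(y.shape[0])  # d = dimension of the basic vector feature

def attention_distribution(Y, q, score=dot_product_score):
    # alpha_m = exp(s(y_m, q)) / sum_n exp(s(y_n, q))
    s = np.array([score(y, q) for y in Y])
    e = np.exp(s - s.max())               # numerically stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(1)
Y = rng.normal(size=(4, 8))               # M = 4 basic vector features, d = 8
q = Y.mean(axis=0)                        # hypothetical query vector feature
alphas = attention_distribution(Y, q)     # non-negative, sums to 1
```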
204. Inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and a target music voice respectively based on the preset time sequence neural network;
because the generated basic cartoon character image and the music voice data take one frame as a generating unit and have no corresponding time sequence order, the server can not generate coherent animation, and therefore the server utilizes a preset time sequence neural network to perform time sequence processing on the basic cartoon character image and the music voice data. The specific process of the preset time sequence neural network for time sequence processing is as follows:
Input layer: performs convolution calculation on the data to be predicted at the previous moment and the current data to be predicted, and inputs the resulting first convolution results into the first hidden layer;

First hidden layer: performs convolution calculation on pairs of first convolution results separated by one intermediate result, and inputs the resulting second convolution results into the second hidden layer;

Second hidden layer: performs convolution calculation on pairs of second convolution results separated by three intermediate results, and inputs the resulting third convolution results into the third hidden layer;

Third hidden layer: performs convolution calculation on pairs of third convolution results separated by seven intermediate results, and inputs the obtained target prediction data into the output layer;

Output layer: outputs the target prediction data.
Further, the basic cartoon character image and the music voice data are each subjected to this time-series processing, and the resulting target cartoon character image and target music voice are combined to obtain the target prediction data.
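The layer-by-layer pairing described above (adjacent results, then results one apart, three apart, and seven apart) resembles a stack of two-tap dilated causal convolutions with dilations 1, 2, 4 and 8. The sketch below implements that reading with fixed placeholder weights; it is an interpretation of the passage, not the patent's trained network:

```python
import numpy as np

def dilated_pair_conv(x, dilation, w=(0.5, 0.5)):
    # Two-tap causal convolution: combine each value with the one `dilation`
    # steps earlier; the first `dilation` values pass through unchanged.
    out = x.copy()
    out[dilation:] = w[0] * x[:-dilation] + w[1] * x[dilation:]
    return out

def temporal_network(sequence):
    h = np.asarray(sequence, dtype=float)
    h = dilated_pair_conv(h, 1)  # input layer: previous + current data to be predicted
    h = dilated_pair_conv(h, 2)  # first hidden layer: results one position apart
    h = dilated_pair_conv(h, 4)  # second hidden layer: results three positions apart
    h = dilated_pair_conv(h, 8)  # third hidden layer: results seven positions apart
    return h                     # output layer: target prediction data

frames = np.arange(16.0)         # ordered frames of image/voice data to be predicted
target_prediction_data = temporal_network(frames)
```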
205. And combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
And after the server acquires the target cartoon character image and the target music voice which are arranged according to the time sequence, the music content data, the target cartoon character image and the target music voice are combined together to obtain the music cartoon character animation.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data, a neural network attention mechanism is utilized to perform weighting processing on micro-expression vector characteristics, gesture vector characteristics and limb motion vector characteristics in the music parameter data to generate a basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain a music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
The method for generating a cartoon character animation in the embodiment of the present invention has been described above; a device for generating a cartoon character animation in the embodiment of the present invention is described below with reference to FIG. 3. An embodiment of the device for generating a cartoon character animation in the embodiment of the present invention includes:
the acquiring module 301 is configured to acquire music parameter data, encode music text data in the music parameter data by using a preset unicode character table to obtain music content data, and convert the music content data into music voice data by using a voice generation model; the calculation module 302 is configured to extract basic vector features of the cartoon roles corresponding to the music role image data in the music parameter data from a preset cartoon role generation model, perform weighting processing on micro-expression vector features, gesture vector features and limb movement vector features in the basic vector features through a neural network self-attention mechanism, calculate summary vector features of the basic vector features, and generate a basic cartoon role image according to the summary vector features; the prediction module 303 is configured to input the basic cartoon character image and the music voice data into a preset time-series neural network, and generate a target cartoon character image and a target music voice based on the preset time-series neural network; and the combination module 304 is used for combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data, a neural network attention mechanism is utilized to perform weighting processing on micro-expression vector characteristics, gesture vector characteristics and limb motion vector characteristics in the music parameter data to generate a basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain a music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
Referring to fig. 4, another embodiment of the apparatus for generating cartoon character animation according to the embodiment of the present invention includes:
the acquiring module 301 is configured to acquire music parameter data, encode music text data in the music parameter data by using a preset unicode character table to obtain music content data, and convert the music content data into music voice data by using a voice generation model; the calculation module 302 is configured to extract basic vector features of the cartoon roles corresponding to the music role image data in the music parameter data from a preset cartoon role generation model, perform weighting processing on micro-expression vector features, gesture vector features and limb movement vector features in the basic vector features through a neural network self-attention mechanism, calculate summary vector features of the basic vector features, and generate a basic cartoon role image according to the summary vector features; the prediction module 303 is configured to input the basic cartoon character image and the music voice data into a preset time-series neural network, and generate a target cartoon character image and a target music voice based on the preset time-series neural network; and the combination module 304 is used for combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
Optionally, the obtaining module 301 includes: an extracting unit 3011, configured to obtain music text data in the music parameter data, and extract text characters in the music text data; a determining unit 3012, configured to search a preset unicode character table for a standard character that is the same as the text character, use a byte code corresponding to the standard character as a coded data corresponding to the text character, and determine a coded data corresponding to the text character in the music text data as music content data, where each standard character corresponds to one byte code; a conversion unit 3013, configured to convert the music content data into music voice data using a voice generation model.
Optionally, the transformation unit 3013 is specifically configured to: converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model; segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by adopting an alignment function in the voice generation model to obtain aligned phonemes; inputting the aligned phonemes into a duration prediction model in a speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations; and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing the sound waveforms to obtain music speech data.
Optionally, the calculation module 302 includes: an input unit 3021, configured to input the music character image data in the music parameter data into the preset cartoon character generation model, and extract the basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; a calculation unit 3022, configured to calculate the attention distribution of the basic vector features through the neural network self-attention mechanism in the preset cartoon character generation model; a summarizing unit 3023, configured to summarize the attention distribution of the basic vector features by using a summarizing formula under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features; and an adjusting unit 3024, configured to calculate a loss function value of the summary vector features by adopting a cross-entropy loss function, adjust the summary vector features by the loss function value, and generate a corresponding basic cartoon character image by using the adjusted summary vector features.
Optionally, the computing unit 3022 is specifically configured to: acquire query vector features in the music character image data, where the query vector features represent the basic vector features related to the cartoon character in the music character image; and calculate the attention distribution of each basic vector feature given the query vector feature, by using a calculation formula of the neural network self-attention mechanism in the preset cartoon character generation model, where the calculation formula is:
αₘ = softmax(s(yₘ, q)) = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the mth basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the mth basic vector feature, yₙ represents the nth basic vector feature, q represents the query vector, and m, n, and M are positive integers.
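A minimal sketch of this attention distribution follows, assuming a scaled dot product as the attention scoring function s(yₘ, q); the patent does not fix a particular scoring function.

```python
# Sketch of the attention distribution: a softmax over the scores of each
# basic vector feature against the query vector; the scaled dot product
# used for s is an assumption, not the patented scoring function.
import numpy as np

def attention_distribution(Y, q):
    scores = Y @ q / np.sqrt(q.size)   # s(y_m, q) for every basic vector feature
    scores -= scores.max()             # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()                 # non-negative, sums to 1

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 8))        # M = 5 basic vector features
q = rng.standard_normal(8)             # query vector feature
alpha = attention_distribution(Y, q)
```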
Optionally, the prediction module 303 is specifically configured to: sort the basic cartoon character images and the music voice data according to a preset input time sequence, and integrate the sorted basic cartoon character images and music voice data into data to be predicted; acquire the data to be predicted at the previous moment and the data to be predicted at the current moment, input them into the hidden layer of the preset time sequence neural network, and perform convolution iterative calculation on them through the hidden layer to generate the data to be predicted at the next moment; and merge the data to be predicted at the next moment to obtain target prediction data, where the target prediction data includes a target cartoon character image and target music voice.
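For illustration, the following sketch shows one such prediction step under simplifying assumptions: a dense recurrent update with a tanh nonlinearity stands in for the patent's convolution iterative calculation, and the merged image and voice features are modeled as plain 16-dimensional vectors.

```python
# Sketch only: a dense recurrent update stands in for the patent's
# convolution iterative calculation; dimensions and weights are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
dim = 16                                            # assumed feature dimension
W_prev = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_curr = rng.standard_normal((dim, dim)) / np.sqrt(dim)

def next_step(x_prev, x_curr):
    """Hidden layer combining the previous-moment and current-moment data
    to generate the data to be predicted at the next moment."""
    return np.tanh(W_prev @ x_prev + W_curr @ x_curr)

# Data to be predicted: cartoon-image and music-voice features, ordered by
# the preset input time sequence and merged into one vector per moment.
sequence = [rng.standard_normal(dim) for _ in range(4)]
predicted = [next_step(sequence[t - 1], sequence[t])
             for t in range(1, len(sequence))]      # merged into target data
```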
Optionally, the apparatus for generating cartoon character animation further includes: the generating module 305, configured to acquire music character animation data, train on the music character animation data by using a neural network self-attention mechanism, and generate the preset cartoon character generation model.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data; a neural network self-attention mechanism weights the micro-expression vector features, gesture vector features, and limb motion vector features in the music parameter data to generate a basic cartoon character image; finally, the music content data, the music voice data, and the basic cartoon character image are integrated into a music cartoon character animation, improving the correlation between the music cartoon character animation and the music scene.
Fig. 3 and fig. 4 describe the cartoon character animation generation apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the following describes the cartoon character animation generation device in the embodiment of the present invention in detail from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a cartoon character animation generation device according to an embodiment of the present invention. The cartoon character animation generation device 500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the cartoon character animation generation device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the cartoon character animation generation device 500.
The cartoon character animation generation device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration of the cartoon character animation generation device shown in fig. 5 does not constitute a limitation of the cartoon character animation generation device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
The invention further provides a cartoon character animation generation device. The computer device comprises a memory and a processor; the memory stores computer readable instructions which, when executed by the processor, cause the processor to execute the steps of the cartoon character animation generation method in the above embodiments.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the method for generating a cartoon character animation.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, each block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A cartoon character animation generation method is characterized by comprising the following steps:
acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
extracting, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, the gesture vector features, and the limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and a target music voice respectively based on the preset time sequence neural network;
and combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
2. The method for generating cartoon character animation of claim 1, wherein the obtaining music parameter data, encoding music text data in the music parameter data by using a preset unicode character table to obtain music content data, and converting the music content data into music voice data by using a voice generation model comprises:
acquiring music text data in the music parameter data, and extracting text characters in the music text data;
searching, in a preset Unicode character table, for standard characters identical to the text characters, taking the byte codes corresponding to the standard characters as the coded data corresponding to the text characters, and determining the coded data corresponding to the text characters in the music text data as the music content data, wherein each standard character corresponds to one byte code;
and converting the music content data into music voice data by adopting a voice generation model.
3. The method for generating cartoon character animation of claim 2, wherein said converting the music content data into music voice data using the voice generation model comprises:
converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model;
segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by utilizing an alignment function in the voice generation model to obtain aligned phonemes;
inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations;
and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing a plurality of sound waveforms to obtain music speech data.
4. The method for generating cartoon character animation according to claim 1, wherein the extracting, in the preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, the gesture vector features, and the limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features comprises:
inputting the music character image data in the music parameter data into a preset cartoon character generation model, and extracting basic vector features in the music character image data from the preset cartoon character generation model, wherein the basic vector features at least comprise micro expression vector features, gesture vector features and limb action vector features of the cartoon character;
calculating the attention distribution of the basic vector characteristics through a neural network self-attention mechanism in the preset cartoon role generation model;
under the condition of increasing the weight occupied by the attention distribution of the micro expression vector features, the gesture vector features and the limb action vector features, summarizing the attention distribution of the basic vector features by using a summarizing formula to obtain summarizing vector features, wherein the summarizing formula is as follows:
att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=4}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁, and x₁ represent the attention distribution value, the weighted attention distribution value, and the vector of the micro-expression feature; α₂, β₂, and x₂ the same quantities for the gesture feature; α₃, β₃, and x₃ the same quantities for the limb motion feature; αᵢ, βᵢ, and xᵢ represent the attention distribution value, the weighted attention distribution value, and the vector of the ith residual feature; i and N are positive integers; and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features, and the limb motion vector features;
and calculating a loss function value of the summary vector characteristic by adopting a cross entropy loss function, adjusting the summary vector characteristic by using the loss function value, and generating a corresponding basic cartoon character image by using the adjusted summary vector characteristic.
5. The method for generating cartoon character animation of claim 4, wherein said calculating the attention distribution of the basis vector features by a neural network self-attention mechanism in the preset cartoon character generation model comprises:
acquiring query vector features in the music character image data, wherein the query vector features represent the basic vector features related to the cartoon character in the music character image;
calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
αₘ = softmax(s(yₘ, q)) = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the mth basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the mth basic vector feature, yₙ represents the nth basic vector feature, q represents the query vector, and m, n, and M are positive integers.
6. The method for generating cartoon character animation of claim 1, wherein said inputting the base cartoon character image and the music voice data into a preset time-series neural network respectively, and generating the target cartoon character image and the target music voice based on the preset time-series neural network respectively comprises:
respectively sequencing the basic cartoon character images and the music voice data according to a preset input time sequence, and integrating the sequenced basic cartoon character images and music voice data into data to be predicted;
acquiring the data to be predicted at the previous moment and the data to be predicted at the current moment, inputting the data to be predicted at the previous moment and the data to be predicted at the current moment into the hidden layer of the preset time sequence neural network, and performing convolution iterative calculation on the data to be predicted at the previous moment and the data to be predicted at the current moment through the hidden layer to generate data to be predicted at the next moment;
and merging a plurality of data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises a target cartoon character image and target music voice.
7. The method for generating cartoon character animation of any one of claims 1-6, wherein before the obtaining music parameter data, encoding the music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by using a voice generation model, the method for generating cartoon character animation further comprises:
and acquiring music role animation data, training the music role animation data by utilizing a neural network self-attention mechanism, and generating a preset cartoon role generation model.
8. A cartoon character animation generation device is characterized by comprising:
the acquisition module is used for acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
the calculation module is used for extracting, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, the gesture vector features, and the limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
the prediction module is used for respectively inputting the basic cartoon character image and the music voice data into a preset time sequence neural network and respectively generating a target cartoon character image and a target music voice based on the preset time sequence neural network;
and the combination module is used for combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
9. A cartoon character animation generation device, comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the cartoon character animation generation device to perform the cartoon character animation generation method of any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement a method for generating a cartoon character animation according to any one of claims 1-7.
CN202110301883.XA 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium Active CN113379875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110301883.XA CN113379875B (en) 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113379875A true CN113379875A (en) 2021-09-10
CN113379875B CN113379875B (en) 2023-09-29

Family

ID=77569751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110301883.XA Active CN113379875B (en) 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113379875B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609969A (en) * 2012-02-17 2012-07-25 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
US20210019479A1 (en) * 2018-09-05 2021-01-21 Tencent Technology (Shenzhen) Company Limited Text translation method and apparatus, storage medium, and computer device
CN111383307A (en) * 2018-12-29 2020-07-07 上海智臻智能网络科技股份有限公司 Video generation method and device based on portrait and storage medium
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112420014A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Virtual face construction method and device, computer equipment and computer readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RYUHEI SAKURAI ET AL: "Synthesis of Expressive Talking Heads from Speech with Recurrent Neural Network", JOURNAL OF KOREA ROBOTICS SOCIETY, pages 16-25 *
YANG SHAN ET AL: "Speech-driven realistic facial animation synthesis based on BLSTM-RNN", JOURNAL OF TSINGHUA UNIVERSITY (SCIENCE AND TECHNOLOGY), pages 250-256 *

Also Published As

Publication number Publication date
CN113379875B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN112687259B (en) Speech synthesis method, device and readable storage medium
EP2958105B1 (en) Method and apparatus for speech synthesis based on large corpus
CN112086086A (en) Speech synthesis method, device, equipment and computer readable storage medium
JP2007108749A (en) Method and device for training in statistical model of prosody, method and device for analyzing prosody, and method and system for synthesizing text speech
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
CN112735371B (en) Method and device for generating speaker video based on text information
CN112069809B (en) Missing text generation method and system
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN112802446A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
CN116958343A (en) Facial animation generation method, device, equipment, medium and program product
CN116564270A (en) Singing synthesis method, device and medium based on denoising diffusion probability model
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113379875B (en) Cartoon character animation generation method, device, equipment and storage medium
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis
CN112634861A (en) Data processing method and device, electronic equipment and readable storage medium
KR102639322B1 (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN113838445B (en) Song creation method and related equipment
CN114783402B (en) Variation method and device for synthetic voice, electronic equipment and storage medium
Choi et al. Label Embedding for Chinese Grapheme-to-Phoneme Conversion.
CN116524074A (en) Method, device, equipment and storage medium for generating digital human gestures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant