CN113379875A - Cartoon character animation generation method, device, equipment and storage medium


Info

Publication number: CN113379875A
Authority: CN (China)
Prior art keywords: music, data, vector, cartoon character, cartoon
Legal status: Granted; Active
Application number: CN202110301883.XA
Other languages: Chinese (zh)
Other versions: CN113379875B (en)
Inventors: Chen Cong (陈聪), Hou Cuiqin (侯翠琴), Li Jianfeng (李剑锋)
Original and current assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110301883.XA
Publication of CN113379875A; application granted; publication of CN113379875B

Classifications

    • G06T 13/00: Animation
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods (neural networks)
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers

Abstract

The invention relates to the field of artificial intelligence and discloses a cartoon character animation generation method, device, equipment and storage medium for improving the correlation between a music cartoon character animation and its music scene. The cartoon character animation generation method comprises the following steps: encoding the music text data in the music parameter data to obtain music content data, and converting the music content data into music voice data by means of a voice generation model; weighting the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features of the music character image data through a neural network self-attention mechanism to generate a basic cartoon character image; generating a target cartoon character image and a target music voice respectively based on a preset time-series neural network; and combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation. The invention also relates to blockchain technology: the music parameter data can be stored in a blockchain.

Description

Cartoon character animation generation method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a cartoon character animation generation method, device, equipment and storage medium.
Background
As material needs are increasingly met, more and more people pursue mental fulfilment, and music culture, with its long history, fills that need. From the earliest sung poetry to today's popular music, music as a form of expression directly conveys a musician's thoughts and emotions. With the progress of science and technology and the development of the times, the ways of popularizing and spreading music culture have become more technological, and one of the most important ways is to spread music culture through music cartoon character animation.
When producing a music cartoon animation, key frames of a specified action are usually drawn directly from the original pictures of an existing music cartoon character, and transition frames of the action are then inserted by hand according to the differences between adjacent key frames to generate the corresponding music cartoon animation. However, the correlation between a music cartoon character animation generated in this way and the music scene is low.
Disclosure of Invention
The invention provides a cartoon character animation generation method, device, equipment and storage medium for improving the correlation between cartoon character animation and music scenes.
A first aspect of the invention provides a cartoon character animation generation method, which comprises the following steps: acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model; extracting, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features; inputting the basic cartoon character image and the music voice data respectively into a preset time-series neural network, and generating a target cartoon character image and a target music voice respectively based on the preset time-series neural network; and combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
Optionally, in a first implementation manner of the first aspect of the present invention, the acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model includes: acquiring the music text data in the music parameter data, and extracting the text characters in the music text data; searching the preset Unicode character table for the standard character identical to each text character, taking the byte code corresponding to the standard character as the coded data corresponding to the text character, and determining the coded data corresponding to the text characters in the music text data as the music content data, wherein each standard character corresponds to one byte code; and converting the music content data into music voice data by adopting a voice generation model.
Optionally, in a second implementation manner of the first aspect of the present invention, the converting the music content data into music voice data by using a voice generation model includes: converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model; segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by utilizing an alignment function in the voice generation model to obtain aligned phonemes; inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations; and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing a plurality of sound waveforms to obtain music speech data.
Optionally, in a third implementation manner of the first aspect of the present invention, the extracting, in the preset cartoon character generation model, basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features includes: inputting the music character image data in the music parameter data into the preset cartoon character generation model, and extracting the basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; calculating the attention distribution of the basic vector features through a neural network self-attention mechanism in the preset cartoon character generation model; under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, summarizing the attention distribution of the basic vector features by using a summarizing formula to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features; and calculating a loss function value of the summary vector features by adopting a cross-entropy loss function, adjusting the summary vector features by using the loss function value, and generating a corresponding basic cartoon character image by using the adjusted summary vector features.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the calculating, by a neural network self-attention mechanism in the preset cartoon character generation model, an attention distribution of the basis vector features includes: acquiring query vector features in the music character image data, wherein the query vector features are used for expressing basic vector features related to cartoon characters in the music character image; calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
αₘ = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the m-th basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, yₙ represents the n-th basic vector feature, q represents the query vector, and m, n and M are positive integers.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the respectively inputting the basic cartoon character image and the music voice data into a preset time-series neural network, and respectively generating a target cartoon character image and a target music voice based on the preset time-series neural network includes: sorting the basic cartoon character images and the music voice data respectively according to a preset input time sequence, and integrating the sorted basic cartoon character images and music voice data into data to be predicted; acquiring the data to be predicted at the previous moment and the data to be predicted at the current moment, inputting them into a hidden layer of the preset time-series neural network, and performing convolution iterative calculation through the hidden layer on the data to be predicted at the previous moment and the data to be predicted at the current moment to generate the data to be predicted at the next moment; and merging a plurality of pieces of data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises the target cartoon character image and the target music voice.
Optionally, in a sixth implementation manner of the first aspect of the present invention, before the acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model, the cartoon character animation generation method further includes: acquiring music character animation data, training the music character animation data by utilizing a neural network self-attention mechanism, and generating the preset cartoon character generation model.
A second aspect of the present invention provides a cartoon character animation generation apparatus, which includes: an acquisition module, configured to acquire music parameter data, encode music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and convert the music content data into music voice data by adopting a voice generation model; a calculation module, configured to extract, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, perform weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculate summary vector features of the basic vector features, and generate a basic cartoon character image according to the summary vector features; a prediction module, configured to input the basic cartoon character image and the music voice data respectively into a preset time-series neural network, and generate a target cartoon character image and a target music voice respectively based on the preset time-series neural network; and a combination module, configured to combine the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
Optionally, in a first implementation manner of the second aspect of the present invention, the obtaining module includes: the extraction unit is used for acquiring music text data in the music parameter data and extracting text characters in the music text data; the determining unit is used for searching a standard character which is the same as the text character in a preset unicode character table, using a byte code corresponding to the standard character as coded data corresponding to the text character, and determining the coded data corresponding to the text character in the music text data as music content data, wherein each standard character corresponds to one byte code; and the conversion unit is used for converting the music content data into music voice data by adopting a voice generation model.
Optionally, in a second implementation manner of the second aspect of the present invention, the conversion unit is specifically configured to: converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model; segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by utilizing an alignment function in the voice generation model to obtain aligned phonemes; inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations; and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing a plurality of sound waveforms to obtain music speech data.
Optionally, in a third implementation manner of the second aspect of the present invention, the calculation module includes: an input unit, configured to input the music character image data in the music parameter data into the preset cartoon character generation model, and extract the basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; a calculation unit, configured to calculate the attention distribution of the basic vector features through a neural network self-attention mechanism in the preset cartoon character generation model; a summarizing unit, configured to summarize the attention distribution of the basic vector features by using a summarizing formula under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features; and an adjusting unit, configured to calculate a loss function value of the summary vector features by adopting a cross-entropy loss function, adjust the summary vector features by the loss function value, and generate a corresponding basic cartoon character image by using the adjusted summary vector features.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the calculating unit is specifically configured to: acquiring query vector features in the music character image data, wherein the query vector features are used for expressing basic vector features related to cartoon characters in the music character image; calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
αₘ = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the m-th basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, yₙ represents the n-th basic vector feature, q represents the query vector, and m, n and M are positive integers.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the prediction module is specifically configured to: sort the basic cartoon character images and the music voice data respectively according to a preset input time sequence, and integrate the sorted basic cartoon character images and music voice data into data to be predicted; acquire the data to be predicted at the previous moment and the data to be predicted at the current moment, input them into a hidden layer of the preset time-series neural network, and perform convolution iterative calculation through the hidden layer on the data to be predicted at the previous moment and the data to be predicted at the current moment to generate the data to be predicted at the next moment; and merge a plurality of pieces of data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises the target cartoon character image and the target music voice.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the cartoon character animation generation apparatus further includes: a generating module, configured to acquire music character animation data, train the music character animation data by utilizing a neural network self-attention mechanism, and generate the preset cartoon character generation model.
The third aspect of the present invention provides a cartoon character animation generation device, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the cartoon character animation generation device to execute the cartoon character animation generation method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the method for generating a cartoon character animation described above.
In the technical scheme provided by the invention, music parameter data is acquired, the music text data in the music parameter data is encoded by using a preset Unicode character table to obtain music content data, and the music content data is converted into music voice data by adopting a voice generation model; the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data are extracted in a preset cartoon character generation model, the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features are weighted through a neural network self-attention mechanism, summary vector features of the basic vector features are calculated, and a basic cartoon character image is generated according to the summary vector features; the basic cartoon character image and the music voice data are respectively input into a preset time-series neural network, and a target cartoon character image and a target music voice are respectively generated based on the preset time-series neural network; and the music content data, the target cartoon character image and the target music voice are combined to obtain the music cartoon character animation. In the embodiment of the invention, the music parameter data is encoded and converted to generate the music content data and music voice data, the micro-expression vector features, gesture vector features and limb motion vector features in the music parameter data are weighted by a neural network self-attention mechanism to generate the basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain the music cartoon character animation, so that the correlation between the music cartoon character animation and the music scene is improved.
Drawings
FIG. 1 is a diagram of an embodiment of a method for generating cartoon character animation according to the embodiment of the invention;
FIG. 2 is a diagram of another embodiment of a method for generating cartoon character animations according to the present invention;
FIG. 3 is a diagram of an embodiment of a cartoon character animation generation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of the cartoon character animation generation device in the embodiment of the invention;
FIG. 5 is a schematic diagram of an embodiment of the cartoon character animation generation device in the embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a cartoon character animation generation method, device, equipment and storage medium for improving the correlation between cartoon character animation and music scenes.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of the embodiment of the present invention is described below, and referring to fig. 1, an embodiment of the method for generating cartoon character animation according to the embodiment of the present invention includes:
101. acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
it is to be understood that the executing entity of the present invention may be a generating device of cartoon character animation, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
The music parameter data acquired by the server specifically includes two types of data:
1. music text data: in particular data related to music and having a content type in text form.
2. Music character image data: specifically, the data is data related to music and has a content type in an image form, where the format of the music character image may be JPEG, TIFF, RAW, or the like, and the format of the music character image is not limited in this application.
After obtaining the music parameter data, the server needs to encode the music text data in the music parameter data by using the preset Unicode character table, converting the text into characters that a computer can identify. The preset Unicode character table is the character encoding table corresponding to Unicode, which assigns a uniform and unique binary code to each character in every language, so as to meet the requirement of converting and processing text across languages and platforms.
It should be noted that, after the server obtains the music content data, it converts the music content data into music voice data by using a voice generation model, where the voice generation model refers to text-to-speech (TTS), a technology capable of converting arbitrary input text into corresponding speech. The voice generation model mainly comprises a front-end part and a back-end part. The front-end part analyzes the input music text data and extracts the information required for back-end modelling, such as word segmentation, part-of-speech tagging, prosodic structure prediction and polyphonic-character disambiguation of the music text data. The back-end part reads the analysis result produced by the front-end part, models the voice part by combining the analysis result, and during synthesis generates the output speech signal from the music text data and a pre-trained acoustic model.
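To make the front-end/back-end split concrete, here is a minimal Python sketch; the field names, the toy word segmentation, and the placeholder waveform are illustrative assumptions, not the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class FrontEndAnalysis:
    words: list[str]       # word segmentation result
    pos_tags: list[str]    # part-of-speech tags
    prosody: list[int]     # prosodic break indices

def front_end(music_text: str) -> FrontEndAnalysis:
    # Toy analysis: a real front end runs segmenters, taggers and
    # prosody/polyphone predictors over the music text data.
    words = music_text.split()
    return FrontEndAnalysis(words, ["n"] * len(words), [0] * len(words))

def back_end(analysis: FrontEndAnalysis) -> list[float]:
    # Stand-in for a pre-trained acoustic model producing a speech signal.
    return [0.0] * (160 * len(analysis.words))  # placeholder samples

speech_signal = back_end(front_end("music content data"))
```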
It should be emphasized that, to further ensure the privacy and security of the music parameter data, the music parameter data may also be stored in a node of a blockchain.
102. Extracting the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data in a preset cartoon character generation model, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
after the server obtains music content data and music voice data, the server needs to process music role image data in the music parameter data, a preset cartoon role generation model is utilized, basic vector features in the music role image data are extracted from the preset cartoon role generation model, attention distribution of the basic vector features is calculated through a neural network self-attention mechanism, micro-expression vector features, gesture vector features and limb motion vector features in the basic vector features are weighted in the calculation process, summary vector features of the basic vector features are calculated, and finally the server generates basic cartoon role images according to the summary vector features.
It should be noted that a basic vector feature here refers to a pixel vector feature in the music character image data; one piece of music character image data has a plurality of basic vector features. When the server calculates the attention distribution by using the neural network self-attention mechanism, the purpose of weighting the micro-expression vector features, gesture vector features and limb motion vector features is to analyze the cartoon character specifically, so that the basic cartoon character image obtained through calculation correlates more closely with the music scene.
103. Inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and a target music voice respectively based on the preset time sequence neural network;
the basic cartoon character image and the music voice data obtained by the server at the moment are non-time sequence, so the server needs to generate a target cartoon character image and a target music voice which are arranged according to a certain time sequence by using a preset time sequence neural network. The preset time sequence neural network refers to a Recurrent Neural Network (RNN), which is a neural network for processing time sequence input, the lengths of time sequence data input into the RNN are different, and the contexts of the input time sequence data are related, the input data is subjected to convolution calculation through a plurality of hidden layers in the RNN, and finally, the convolved data is output through an output layer, so that data arranged according to a certain time sequence can be generated.
104. And combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
And after the server acquires the target cartoon character image and the target music voice which are arranged according to the time sequence, the music content data, the target cartoon character image and the target music voice are combined together to obtain the music cartoon character animation.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data, a neural network attention mechanism is utilized to perform weighting processing on micro-expression vector characteristics, gesture vector characteristics and limb motion vector characteristics in the music parameter data to generate a basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain a music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
Referring to fig. 2, another embodiment of the method for generating cartoon character animation according to the embodiment of the present invention includes:
201. Acquiring music character animation data, training the music character animation data by utilizing a neural network self-attention mechanism, and generating a preset cartoon character generation model;

Before processing the music parameter data, the server needs to collect a large amount of music character animation data and train on it to generate the preset cartoon character generation model. The music character animation data at least includes music animations such as Symphony Orchestra, Fantasia 2000 and Golden String.

The training of the large amount of music character animation data adopts a neural network self-attention mechanism; the preset cartoon character generation model obtained by training can generate a corresponding cartoon character image from the animation or images input into the model. The training process for the music character animation data is the same as that in step 203, so it is not repeated here.
202. Acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
it is emphasized that, in order to further ensure the privacy and security of the music parameter data, the music parameter data may also be stored in a node of a block chain.
It should be noted that the preset Unicode character table here records the byte code corresponding to each standard character, for example: the byte code corresponding to the standard character "A" is "&#x0041;" and the byte code corresponding to the standard character "叶" (leaf) is "&#x53F6;". A standard character identical to a text character in the music text data can therefore be searched for in the preset Unicode character table, and once the server finds the standard character, the coded data corresponding to the text character is determined from the preset Unicode character table, thereby converting the text characters in the music text data into a computer-readable and writable form.
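As a small illustration of this table lookup, the Python sketch below builds a hypothetical preset table in the "&#xXXXX;" byte-code style quoted above and encodes music text characters with it; the table contents and helper names are assumptions for demonstration only:

```python
def byte_code(ch: str) -> str:
    # '&#x0041;' for 'A', '&#x53F6;' for '叶', matching the examples above.
    return f"&#x{ord(ch):04X};"

# Hypothetical preset Unicode character table: one byte code per standard character.
unicode_table = {ch: byte_code(ch) for ch in "A叶音乐"}

def encode_music_text(text: str) -> list[str]:
    # Search the table for the standard character equal to each text character
    # and return its coded data, forming the music content data.
    return [unicode_table[ch] for ch in text if ch in unicode_table]

print(encode_music_text("A叶"))  # ['&#x0041;', '&#x53F6;']
```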
Here, speech synthesis technology is adopted to convert the music content data into music voice data; it processes the music text data in four stages, as follows:
1. Text-to-phoneme

The server inputs the music content data into the voice generation model; however, because different languages contain characters that are written identically but pronounced differently, each text character in the music content data must be converted into corresponding phoneme information by the phonetic notation algorithm. Chinese text characters, for example, are converted into pinyin.
2. Audio segmentation
After the server obtains the phoneme information, a segmentation function is needed to segment it and determine where each phoneme sequence starts, yielding segmented phonemes, that is, determining which phonemes form a complete character's phonetic transcription. Once the starting points are determined, the segmented phonemes are processed with an alignment function to obtain aligned phonemes, which makes the subsequent phoneme duration prediction easier.
3. Phoneme duration prediction
The server inputs the aligned phonemes into the duration prediction model, which outputs the predicted duration corresponding to each aligned phoneme; computing these predicted durations facilitates the subsequent generation of the sound waveforms.
4. Acoustic model
The server inputs the phoneme information and the predicted durations into the acoustic model. The acoustic model is equivalent to a vocoder and converts the input phoneme information into corresponding sound waveforms; the sound waveform corresponding to each text character is thus obtained, and the music voice data is obtained by splicing the sound waveforms together. It should be noted that the acoustic model can be further improved, for example by increasing the number of network layers, increasing the number of residual channels, replacing upsampling convolutions with matrix multiplication, and optimizing for the CPU or GPU.
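Putting the four stages together, the following Python sketch shows the data flow only; every model (phonetic notation, segmentation/alignment, duration prediction, and the vocoder-style acoustic model) is replaced by a trivial stub, so all names, durations and frequencies are illustrative assumptions:

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, assumed

def text_to_phonemes(chars):
    # Stage 1: phonetic notation algorithm (stub): one phoneme label per character.
    return [f"ph_{c}" for c in chars]

def segment_and_align(phonemes):
    # Stage 2: segmentation + alignment (stub): one phoneme per segment, already aligned.
    return [[p] for p in phonemes]

def predict_durations(aligned):
    # Stage 3: duration prediction model (stub): fixed 0.2 s per aligned segment.
    return [0.2] * len(aligned)

def acoustic_model(phonemes, durations):
    # Stage 4: vocoder stand-in: one sine burst per phoneme, spliced into one waveform.
    waves = []
    for i, (_, d) in enumerate(zip(phonemes, durations)):
        t = np.linspace(0.0, d, int(SAMPLE_RATE * d), endpoint=False)
        waves.append(np.sin(2 * np.pi * (220.0 + 40.0 * i) * t))
    return np.concatenate(waves)

phonemes = text_to_phonemes(["音", "乐"])
aligned = segment_and_align(phonemes)
durations = predict_durations(aligned)
music_voice_data = acoustic_model(phonemes, durations)  # shape: (6400,)
```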
203. Extracting the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data in the preset cartoon character generation model, performing weighting processing on the micro-expression vector features, gesture vector features and limb motion vector features among the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
Specifically, the music character image data in the music parameter data is input into the preset cartoon character generation model, and the basic vector features in the music character image data are extracted in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; the attention distribution of the basic vector features is calculated through a neural network self-attention mechanism in the preset cartoon character generation model; and under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, the attention distribution of the basic vector features is summarized by using a summarizing formula to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features. A loss function value of the summary vector features is then calculated by adopting a cross-entropy loss function, the summary vector features are adjusted by using the loss function value, and the adjusted summary vector features are used to generate the corresponding basic cartoon character image.
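A numeric sketch of this weighted summarizing step may help; here a softmax stands in for the attention distribution, and the β values are made-up extra weights (the patent gives no concrete numbers):

```python
import numpy as np

def summarize(features, alphas, betas):
    # att(X, q) = sum_k beta_k * alpha_k * x_k, with larger beta on the
    # micro-expression, gesture and limb-motion vector features.
    return sum(b * a * x for x, a, b in zip(features, alphas, betas))

rng = np.random.default_rng(0)
features = [rng.normal(size=8) for _ in range(5)]  # 3 key + 2 residual vector features
scores = rng.normal(size=5)                        # attention scores s(y_m, q)
alphas = np.exp(scores) / np.exp(scores).sum()     # attention distribution
betas = np.array([2.0, 2.0, 2.0, 1.0, 1.0])        # hypothetical weighting
summary_vector = summarize(features, alphas, betas)  # the summary vector feature
```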
The server calculates the attention distribution of the basic vector features through the neural network self-attention mechanism in the preset cartoon character generation model as follows: the server acquires the query vector feature in the music character image data, wherein the query vector feature is used to express the basic vector features related to the cartoon character in the music character image; the server then calculates the attention distribution of each basic vector feature under the condition of the given query vector feature, using the calculation formula of the neural network self-attention mechanism in the preset cartoon character generation model:

αₘ = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the m-th basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, yₙ represents the n-th basic vector feature, q represents the query vector, and m, n and M are positive integers.
Here, the query vector feature in the music character image data is used to indicate information related to the query task, for example, in the present application, the query task refers to generating a cartoon character from the music character image data, that is, the query vector feature should be a vector feature related to the cartoon character in the music character image data.
It should be further explained that the attention scoring function adopted in the present application is the dot product model, s(yₘ, q) = yₘᵀq. The attention scoring function may alternatively be:

1. Bilinear model:

s(yₘ, q) = yₘᵀWq

wherein s(yₘ, q) represents the attention scoring function, yₘ represents the m-th basic vector feature, q represents the query vector, W is a learnable parameter matrix, and m is a positive integer.

2. Scaled dot product model:

s(yₘ, q) = yₘᵀq / √d

wherein d represents the dimension of the basic vector features.
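The three scoring functions and the attention distribution can be sketched directly in NumPy; the data below are random placeholders, and W would be an untrained parameter matrix:

```python
import numpy as np

def dot_product_score(y, q):
    return y @ q                          # the model used in this application

def bilinear_score(y, q, W):
    return y @ W @ q                      # W is a learnable parameter matrix

def scaled_dot_product_score(y, q):
    return (y @ q) / np.sqrt(y.shape[0])  # d = dimension of the basic vector feature

def attention_distribution(Y, q, score=dot_product_score):
    # alpha_m = exp(s(y_m, q)) / sum_n exp(s(y_n, q))
    s = np.array([score(y, q) for y in Y])
    e = np.exp(s - s.max())               # numerically stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(1)
Y = rng.normal(size=(4, 8))               # M = 4 basic vector features, d = 8
q = Y.mean(axis=0)                        # hypothetical query vector feature
alphas = attention_distribution(Y, q)     # non-negative, sums to 1
```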
204. Inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and a target music voice respectively based on the preset time sequence neural network;
because the generated basic cartoon character image and the music voice data take one frame as a generating unit and have no corresponding time sequence order, the server can not generate coherent animation, and therefore the server utilizes a preset time sequence neural network to perform time sequence processing on the basic cartoon character image and the music voice data. The specific process of the preset time sequence neural network for time sequence processing is as follows:
Input layer: performs convolution calculation on the data to be predicted at the previous moment and the current data to be predicted, and inputs the resulting first convolution results into the first hidden layer;

First hidden layer: performs convolution calculation on pairs of first convolution results separated by one intermediate result, and inputs the resulting second convolution results into the second hidden layer;

Second hidden layer: performs convolution calculation on pairs of second convolution results separated by three intermediate results, and inputs the resulting third convolution results into the third hidden layer;

Third hidden layer: performs convolution calculation on pairs of third convolution results separated by seven intermediate results, and inputs the obtained target prediction data into the output layer;

Output layer: outputs the target prediction data.
Further, the basic cartoon character image and the music voice data are each subjected to this time-series processing, and the resulting target cartoon character image and target music voice are combined to obtain the target prediction data.
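The layer-by-layer pairing described above (adjacent results, then results one apart, three apart, and seven apart) resembles a stack of two-tap dilated causal convolutions with dilations 1, 2, 4 and 8. The sketch below implements that reading with fixed placeholder weights; it is an interpretation of the passage, not the patent's trained network:

```python
import numpy as np

def dilated_pair_conv(x, dilation, w=(0.5, 0.5)):
    # Two-tap causal convolution: combine each value with the one `dilation`
    # steps earlier; the first `dilation` values pass through unchanged.
    out = x.copy()
    out[dilation:] = w[0] * x[:-dilation] + w[1] * x[dilation:]
    return out

def temporal_network(sequence):
    h = np.asarray(sequence, dtype=float)
    h = dilated_pair_conv(h, 1)  # input layer: previous + current data to be predicted
    h = dilated_pair_conv(h, 2)  # first hidden layer: results one position apart
    h = dilated_pair_conv(h, 4)  # second hidden layer: results three positions apart
    h = dilated_pair_conv(h, 8)  # third hidden layer: results seven positions apart
    return h                     # output layer: target prediction data

frames = np.arange(16.0)         # ordered frames of image/voice data to be predicted
target_prediction_data = temporal_network(frames)
```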
205. And combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
And after the server acquires the target cartoon character image and the target music voice which are arranged according to the time sequence, the music content data, the target cartoon character image and the target music voice are combined together to obtain the music cartoon character animation.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data, a neural network attention mechanism is utilized to perform weighting processing on micro-expression vector characteristics, gesture vector characteristics and limb motion vector characteristics in the music parameter data to generate a basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain a music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
The method for generating a cartoon character animation in the embodiment of the present invention has been described above; a device for generating a cartoon character animation in the embodiment of the present invention is described below with reference to FIG. 3. An embodiment of the device for generating a cartoon character animation in the embodiment of the present invention includes:
the acquiring module 301 is configured to acquire music parameter data, encode music text data in the music parameter data by using a preset unicode character table to obtain music content data, and convert the music content data into music voice data by using a voice generation model; the calculation module 302 is configured to extract basic vector features of the cartoon roles corresponding to the music role image data in the music parameter data from a preset cartoon role generation model, perform weighting processing on micro-expression vector features, gesture vector features and limb movement vector features in the basic vector features through a neural network self-attention mechanism, calculate summary vector features of the basic vector features, and generate a basic cartoon role image according to the summary vector features; the prediction module 303 is configured to input the basic cartoon character image and the music voice data into a preset time-series neural network, and generate a target cartoon character image and a target music voice based on the preset time-series neural network; and the combination module 304 is used for combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data, a neural network attention mechanism is utilized to perform weighting processing on micro-expression vector characteristics, gesture vector characteristics and limb motion vector characteristics in the music parameter data to generate a basic cartoon character image, and finally the music content data, the music voice data and the basic cartoon character image are integrated to obtain a music cartoon character animation, so that the correlation between the music cartoon character animation and a music scene is improved.
Referring to fig. 4, another embodiment of the apparatus for generating cartoon character animation according to the embodiment of the present invention includes:
the acquiring module 301 is configured to acquire music parameter data, encode music text data in the music parameter data by using a preset unicode character table to obtain music content data, and convert the music content data into music voice data by using a voice generation model; the calculation module 302 is configured to extract basic vector features of the cartoon roles corresponding to the music role image data in the music parameter data from a preset cartoon role generation model, perform weighting processing on micro-expression vector features, gesture vector features and limb movement vector features in the basic vector features through a neural network self-attention mechanism, calculate summary vector features of the basic vector features, and generate a basic cartoon role image according to the summary vector features; the prediction module 303 is configured to input the basic cartoon character image and the music voice data into a preset time-series neural network, and generate a target cartoon character image and a target music voice based on the preset time-series neural network; and the combination module 304 is used for combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
Optionally, the obtaining module 301 includes: an extracting unit 3011, configured to obtain music text data in the music parameter data, and extract text characters in the music text data; a determining unit 3012, configured to search a preset unicode character table for a standard character that is the same as the text character, use a byte code corresponding to the standard character as a coded data corresponding to the text character, and determine a coded data corresponding to the text character in the music text data as music content data, where each standard character corresponds to one byte code; a conversion unit 3013, configured to convert the music content data into music voice data using a voice generation model.
Optionally, the transformation unit 3013 is specifically configured to: converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model; segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by adopting an alignment function in the voice generation model to obtain aligned phonemes; inputting the aligned phonemes into a duration prediction model in a speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations; and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing the sound waveforms to obtain music speech data.
Optionally, the calculation module 302 includes: an input unit 3021, configured to input the music character image data in the music parameter data into the preset cartoon character generation model, and extract the basic vector features in the music character image data in the preset cartoon character generation model, wherein the basic vector features at least comprise micro-expression vector features, gesture vector features and limb motion vector features of the cartoon character; a calculation unit 3022, configured to calculate the attention distribution of the basic vector features through the neural network self-attention mechanism in the preset cartoon character generation model; a summarizing unit 3023, configured to summarize the attention distribution of the basic vector features by using a summarizing formula under the condition of increasing the weight occupied by the attention distribution of the micro-expression vector features, the gesture vector features and the limb motion vector features, to obtain the summary vector features, wherein the summarizing formula is as follows:

att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=1}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁ and x₁ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the micro-expression vector feature; α₂, β₂ and x₂ represent the same quantities for the gesture vector feature; α₃, β₃ and x₃ represent the same quantities for the limb motion vector feature; αᵢ, βᵢ and xᵢ respectively represent the attention distribution value, the weighted attention distribution value and the feature vector of the i-th residual vector feature; i and N are positive integers, and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features and the limb motion vector features; and an adjusting unit 3024, configured to calculate a loss function value of the summary vector features by adopting a cross-entropy loss function, adjust the summary vector features by the loss function value, and generate a corresponding basic cartoon character image by using the adjusted summary vector features.
Optionally, the computing unit 3022 is specifically configured to: acquire query vector features in the music character image data, where the query vector features represent the basic vector features related to the cartoon character in the music character image; and calculate the attention distribution of each basic vector feature given the query vector feature, by using a calculation formula of the neural network self-attention mechanism in the preset cartoon character generation model, where the calculation formula is:
αₘ = softmax(s(yₘ, q)) = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the mth basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the mth basic vector feature, yₙ represents the nth basic vector feature, q represents the query vector, and m, n, and M are positive integers.
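A minimal sketch of this attention distribution follows, assuming a scaled dot product as the attention scoring function s(yₘ, q); the patent does not fix a particular scoring function.

```python
# Sketch of the attention distribution: a softmax over the scores of each
# basic vector feature against the query vector; the scaled dot product
# used for s is an assumption, not the patented scoring function.
import numpy as np

def attention_distribution(Y, q):
    scores = Y @ q / np.sqrt(q.size)   # s(y_m, q) for every basic vector feature
    scores -= scores.max()             # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()                 # non-negative, sums to 1

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 8))        # M = 5 basic vector features
q = rng.standard_normal(8)             # query vector feature
alpha = attention_distribution(Y, q)
```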
Optionally, the prediction module 303 is specifically configured to: sort the basic cartoon character images and the music voice data according to a preset input time sequence, and integrate the sorted basic cartoon character images and music voice data into data to be predicted; acquire the data to be predicted at the previous moment and the data to be predicted at the current moment, input them into the hidden layer of the preset time sequence neural network, and perform convolution iterative calculation on them through the hidden layer to generate the data to be predicted at the next moment; and merge the data to be predicted at the next moment to obtain target prediction data, where the target prediction data includes a target cartoon character image and target music voice.
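For illustration, the following sketch shows one such prediction step under simplifying assumptions: a dense recurrent update with a tanh nonlinearity stands in for the patent's convolution iterative calculation, and the merged image and voice features are modeled as plain 16-dimensional vectors.

```python
# Sketch only: a dense recurrent update stands in for the patent's
# convolution iterative calculation; dimensions and weights are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
dim = 16                                            # assumed feature dimension
W_prev = rng.standard_normal((dim, dim)) / np.sqrt(dim)
W_curr = rng.standard_normal((dim, dim)) / np.sqrt(dim)

def next_step(x_prev, x_curr):
    """Hidden layer combining the previous-moment and current-moment data
    to generate the data to be predicted at the next moment."""
    return np.tanh(W_prev @ x_prev + W_curr @ x_curr)

# Data to be predicted: cartoon-image and music-voice features, ordered by
# the preset input time sequence and merged into one vector per moment.
sequence = [rng.standard_normal(dim) for _ in range(4)]
predicted = [next_step(sequence[t - 1], sequence[t])
             for t in range(1, len(sequence))]      # merged into target data
```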
Optionally, the apparatus for generating cartoon character animation further includes: the generating module 305, configured to acquire music character animation data, train on the music character animation data by using a neural network self-attention mechanism, and generate the preset cartoon character generation model.
In the embodiment of the invention, music parameter data are encoded and converted to generate music content data and music voice data; a neural network self-attention mechanism weights the micro-expression vector features, gesture vector features, and limb motion vector features in the music parameter data to generate a basic cartoon character image; finally, the music content data, the music voice data, and the basic cartoon character image are integrated into a music cartoon character animation, improving the correlation between the music cartoon character animation and the music scene.
Fig. 3 and fig. 4 describe the cartoon character animation generation apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the following describes the cartoon character animation generation device in the embodiment of the present invention in detail from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a cartoon character animation generation device according to an embodiment of the present invention. The cartoon character animation generation device 500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the cartoon character animation generation device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the cartoon character animation generation device 500.
The cartoon character animation generation device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration of the cartoon character animation generation device shown in fig. 5 does not constitute a limitation of the cartoon character animation generation device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
The invention further provides a cartoon character animation generation device. The computer device comprises a memory and a processor; the memory stores computer readable instructions which, when executed by the processor, cause the processor to execute the steps of the cartoon character animation generation method in the above embodiments.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the method for generating a cartoon character animation.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, each block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A cartoon character animation generation method is characterized by comprising the following steps:
acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
extracting, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, the gesture vector features, and the limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
inputting the basic cartoon character image and the music voice data into a preset time sequence neural network respectively, and generating a target cartoon character image and a target music voice respectively based on the preset time sequence neural network;
and combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
2. The method for generating cartoon character animation of claim 1, wherein the obtaining music parameter data, encoding music text data in the music parameter data by using a preset unicode character table to obtain music content data, and converting the music content data into music voice data by using a voice generation model comprises:
acquiring music text data in the music parameter data, and extracting text characters in the music text data;
searching, in a preset Unicode character table, for standard characters identical to the text characters, taking the byte codes corresponding to the standard characters as the coded data corresponding to the text characters, and determining the coded data corresponding to the text characters in the music text data as the music content data, wherein each standard character corresponds to one byte code;
and converting the music content data into music voice data by adopting a voice generation model.
3. The method for generating cartoon character animation of claim 2, wherein said converting the music content data into music voice data using the voice generation model comprises:
converting each text character in the music content data into corresponding phoneme information by adopting a phonetic notation algorithm in a speech generation model;
segmenting the phoneme information by utilizing a segmentation function in the voice generation model to obtain segmented phonemes, and aligning the segmented phonemes by utilizing an alignment function in the voice generation model to obtain aligned phonemes;
inputting the aligned phonemes into a duration prediction model in the speech generation model, and predicting phoneme durations of the aligned phonemes through the duration prediction model to obtain predicted durations;
and inputting the phoneme information and the predicted duration into an acoustic model in the speech generation model, generating a sound waveform corresponding to each text character, and splicing a plurality of sound waveforms to obtain music speech data.
4. The method for generating cartoon character animation according to claim 1, wherein the extracting, in the preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, the gesture vector features, and the limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features comprises:
inputting the music character image data in the music parameter data into a preset cartoon character generation model, and extracting basic vector features in the music character image data from the preset cartoon character generation model, wherein the basic vector features at least comprise micro expression vector features, gesture vector features and limb action vector features of the cartoon character;
calculating the attention distribution of the basic vector characteristics through a neural network self-attention mechanism in the preset cartoon role generation model;
under the condition of increasing the weight occupied by the attention distribution of the micro expression vector features, the gesture vector features and the limb action vector features, summarizing the attention distribution of the basic vector features by using a summarizing formula to obtain summarizing vector features, wherein the summarizing formula is as follows:
att(X, q) = β₁α₁x₁ + β₂α₂x₂ + β₃α₃x₃ + Σ_{i=4}^{N} βᵢαᵢxᵢ

wherein att(X, q) represents the summary vector feature; α₁, β₁, and x₁ represent the attention distribution value, the weighted attention distribution value, and the vector of the micro-expression feature; α₂, β₂, and x₂ the same quantities for the gesture feature; α₃, β₃, and x₃ the same quantities for the limb motion feature; αᵢ, βᵢ, and xᵢ represent the attention distribution value, the weighted attention distribution value, and the vector of the ith residual feature; i and N are positive integers; and the residual vector features are the basic vector features other than the micro-expression vector features, the gesture vector features, and the limb motion vector features;
and calculating a loss function value of the summary vector characteristic by adopting a cross entropy loss function, adjusting the summary vector characteristic by using the loss function value, and generating a corresponding basic cartoon character image by using the adjusted summary vector characteristic.
5. The method for generating cartoon character animation of claim 4, wherein said calculating the attention distribution of the basis vector features by a neural network self-attention mechanism in the preset cartoon character generation model comprises:
acquiring query vector features in the music character image data, wherein the query vector features represent the basic vector features related to the cartoon character in the music character image;
calculating the attention distribution of each basic vector feature under the condition of setting the query vector feature by using a calculation formula of a neural network self-attention mechanism in the preset cartoon character generation model, wherein the calculation formula is as follows:
αₘ = softmax(s(yₘ, q)) = exp(s(yₘ, q)) / Σ_{n=1}^{M} exp(s(yₙ, q))

wherein αₘ represents the attention distribution value corresponding to the mth basic vector feature, s(yₘ, q) represents the attention scoring function, yₘ represents the mth basic vector feature, yₙ represents the nth basic vector feature, q represents the query vector, and m, n, and M are positive integers.
6. The method for generating cartoon character animation of claim 1, wherein said inputting the base cartoon character image and the music voice data into a preset time-series neural network respectively, and generating the target cartoon character image and the target music voice based on the preset time-series neural network respectively comprises:
respectively sequencing the basic cartoon character images and the music voice data according to a preset input time sequence, and integrating the sequenced basic cartoon character images and music voice data into data to be predicted;
acquiring the data to be predicted at the previous moment and the data to be predicted at the current moment, inputting the data to be predicted at the previous moment and the data to be predicted at the current moment into the hidden layer of the preset time sequence neural network, and performing convolution iterative calculation on the data to be predicted at the previous moment and the data to be predicted at the current moment through the hidden layer to generate data to be predicted at the next moment;
and merging a plurality of data to be predicted at the next moment to obtain target prediction data, wherein the target prediction data comprises a target cartoon character image and target music voice.
7. The method for generating cartoon character animation of any one of claims 1-6, wherein before the obtaining music parameter data, encoding the music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by using a voice generation model, the method for generating cartoon character animation further comprises:
and acquiring music role animation data, training the music role animation data by utilizing a neural network self-attention mechanism, and generating a preset cartoon role generation model.
8. A cartoon character animation generation device is characterized by comprising:
the acquisition module is used for acquiring music parameter data, encoding music text data in the music parameter data by using a preset Unicode character table to obtain music content data, and converting the music content data into music voice data by adopting a voice generation model;
the calculation module is used for extracting, in a preset cartoon character generation model, the basic vector features of the cartoon character corresponding to the music character image data in the music parameter data, performing weighting processing on the micro-expression vector features, the gesture vector features, and the limb motion vector features in the basic vector features through a neural network self-attention mechanism, calculating summary vector features of the basic vector features, and generating a basic cartoon character image according to the summary vector features;
the prediction module is used for respectively inputting the basic cartoon character image and the music voice data into a preset time sequence neural network and respectively generating a target cartoon character image and a target music voice based on the preset time sequence neural network;
and the combination module is used for combining the music content data, the target cartoon character image and the target music voice to obtain the music cartoon character animation.
9. A cartoon character animation generation device, comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the cartoon character animation generation device to perform the cartoon character animation generation method of any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement a method for generating a cartoon character animation according to any one of claims 1-7.
CN202110301883.XA 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium Active CN113379875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110301883.XA CN113379875B (en) 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113379875A true CN113379875A (en) 2021-09-10
CN113379875B CN113379875B (en) 2023-09-29

Family

ID=77569751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110301883.XA Active CN113379875B (en) 2021-03-22 2021-03-22 Cartoon character animation generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113379875B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609969A (en) * 2012-02-17 2012-07-25 上海交通大学 Method for processing face and speech synchronous animation based on Chinese text drive
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
US20210019479A1 (en) * 2018-09-05 2021-01-21 Tencent Technology (Shenzhen) Company Limited Text translation method and apparatus, storage medium, and computer device
CN111383307A (en) * 2018-12-29 2020-07-07 上海智臻智能网络科技股份有限公司 Video generation method and device based on portrait and storage medium
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112420014A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Virtual face construction method and device, computer equipment and computer readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RYUHEI SAKURAI ET AL: "Synthesis of Expressive Talking Heads from Speech with Recurrent Neural Network", JOURNAL OF KOREA ROBOTICS SOCIETY, pages 16-25 *
YANG SHAN ET AL: "Speech-driven realistic facial animation synthesis based on BLSTM-RNN", JOURNAL OF TSINGHUA UNIVERSITY (SCIENCE AND TECHNOLOGY), pages 250-256 *

Also Published As

Publication number Publication date
CN113379875B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN112687259B (en) Speech synthesis method, device and readable storage medium
EP2958105B1 (en) Method and apparatus for speech synthesis based on large corpus
CN112086086A (en) Speech synthesis method, device, equipment and computer readable storage medium
JP2007108749A (en) Method and device for training in statistical model of prosody, method and device for analyzing prosody, and method and system for synthesizing text speech
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
CN112735371B (en) Method and device for generating speaker video based on text information
CN112069809B (en) Missing text generation method and system
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN112802446A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
CN116958343A (en) Facial animation generation method, device, equipment, medium and program product
CN116564270A (en) Singing synthesis method, device and medium based on denoising diffusion probability model
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113379875B (en) Cartoon character animation generation method, device, equipment and storage medium
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
Ghorpade et al. ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis
CN112634861A (en) Data processing method and device, electronic equipment and readable storage medium
KR102639322B1 (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN113838445B (en) Song creation method and related equipment
CN114783402B (en) Variation method and device for synthetic voice, electronic equipment and storage medium
Choi et al. Label Embedding for Chinese Grapheme-to-Phoneme Conversion.
CN116524074A (en) Method, device, equipment and storage medium for generating digital human gestures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant