CN114049880A - Voice-driven motion generation method, device, computer device and storage medium - Google Patents
- Publication number
- CN114049880A (application CN202111331817.3A)
- Authority
- CN
- China
- Prior art keywords
- action
- voice
- motion
- human body
- generated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Processing Or Creating Images (AREA)
Abstract
The embodiment of the invention discloses a voice-driven action generation method and device, a computer device and a storage medium. The method comprises the following steps: acquiring a voice and an action style; judging, according to the voice, whether a corresponding human body action needs to be generated; if the corresponding human body action needs to be generated, calculating action parameters of the human body action according to the voice and the action style; and generating the digital human action according to the action parameters. By implementing the method provided by the embodiment of the invention, voice-driven human body action generation can be realized with short time consumption and high action stability, and the generated human body actions are more natural.
Description
Technical Field
The present invention relates to a motion simulation method, and more particularly, to a method and apparatus for generating voice-driven motion, a computer device, and a storage medium.
Background
Action driving of current digital humans has been achieved, but the actions are driven according to preset settings and are therefore neither flexible nor natural. In reality, people speaking are often accompanied by unconscious hand and head movements, and digital humans in the prior art lack this kind of presentation. How to make a digital human produce coherent, natural actions, and generate actions and postures similar to those a real person makes unconsciously, is one of the current research directions.
Chinese patent CN202010836241.5 discloses a method for driving virtual character actions by real-time voice, which attaches corresponding variable conditions to the actions of a virtual character, performs voice recognition, matches the result to the variable conditions, and thereby drives the actions of the virtual character. The method is limited by the variable conditions during implementation: improper voice recognition can cause action errors or repetition, the pre-designed actions are very limited, and the virtual character is quite likely to appear rigid. Chinese patent CN202011219858.9 discloses a method for driving human gestures by voice, which extracts voice features as the input of an autoregressive model and predicts a joint rotation sequence through the model to generate gestures. The method can generate two gestures simultaneously, and continuous gestures are obtained through the structure of the autoregressive model. However, model prediction may cause phenomena such as jitter and floating, and because the gestures are generated according to human-set rules they also tend to appear stiff and rigid.
In addition, the action amplitude, action time and the like of the digital human are specified manually, which is not only time-consuming but also easily affects the stability of the actions and greatly affects their naturalness.
Therefore, it is necessary to design a new method in which voice-driven human body actions are generated with short time consumption and high action stability, and the generated human body actions are more natural.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a voice-driven action generation method and device, a computer device and a storage medium.
In order to achieve the purpose, the invention adopts the following technical scheme: a voice-driven motion generation method, comprising:
acquiring voice and action style;
judging whether corresponding human body actions need to be generated or not according to the voice;
if the corresponding human body action needs to be generated, calculating action parameters of the human body action according to the voice and the action style;
and generating the digital human action according to the action parameters.
The further technical scheme is as follows: the motion style includes motion amplitude, motion speed, and motion frequency.
The further technical scheme is as follows: the judging whether corresponding human body actions need to be generated according to the voice comprises the following steps:
and judging whether corresponding human body actions need to be generated or not by adopting a classification model and combining the voice.
The further technical scheme is as follows: the method for judging whether corresponding human body actions need to be generated or not by adopting the classification model and combining the voice comprises the following steps:
converting the voice into pinyin with prosody;
converting the pinyin with prosody into word tokens;
calculating a Mel frequency spectrum of the voice;
merging the word tokens and the Mel frequency spectrum of the voice to form features;
inputting the features into the classification model to obtain a classification result;
judging whether the classification result is the category in which the corresponding human body action needs to be generated;
if the classification result is the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action currently needs to be generated;
and if the classification result is not the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action does not currently need to be generated.
The further technical scheme is as follows: the classification model comprises at least one of a recurrent neural network, a convolutional neural network and an attention mechanism.
The further technical scheme is as follows: the calculating of the motion parameters of the human body motion according to the voice and the motion style comprises the following steps:
merging the features with the action style to form feature quantities;
randomly discarding part of the content of the feature quantity to form a new feature set;
and inputting the new feature set and the white noise into a neural network LSTM to generate an action sequence, so as to obtain the action parameters of the human body action.
The further technical scheme is as follows: the motion parameters of the human body motion comprise past motion, current motion, and coordinates, angles and directions corresponding to future motion.
The present invention also provides a voice-driven motion generating apparatus including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice and action styles;
the judging unit is used for judging whether corresponding human body actions need to be generated or not according to the voice;
the parameter calculation unit is used for calculating motion parameters of the human body motion according to the voice and the motion style if the corresponding human body motion needs to be generated;
and the action generating unit is used for generating the digital human action according to the action parameters.
The invention also provides a computer device, which comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the method described above when executing the computer program.
The invention also provides a storage medium storing a computer program which, when executed by a processor, is operable to carry out the method as described above.
Compared with the prior art, the invention has the following beneficial effects: a voice and an action style are acquired, a supervised classification task is used to determine whether a corresponding human body action needs to be generated, and when it is needed, a neural network LSTM combines the voice and the action style to calculate the action parameters and generate the digital human action. Because the classification model quickly determines whether a human body action is needed and the neural network LSTM quickly calculates the action parameters, voice-driven human body actions are generated with short time consumption and high action stability, and the generated human body actions are more natural.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a voice-driven action generation method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a voice-driven action generating method according to an embodiment of the present invention;
fig. 3 is a schematic sub-flow diagram of a voice-driven action generating method according to an embodiment of the present invention;
fig. 4 is a schematic sub-flow diagram of a voice-driven action generating method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an action sequence provided by an embodiment of the present invention;
fig. 6 is a schematic block diagram of a voice-driven motion generating apparatus according to an embodiment of the present invention;
fig. 7 is a schematic block diagram of a determination unit of the voice-driven motion generation apparatus according to the embodiment of the present invention;
fig. 8 is a schematic block diagram of a parameter calculation unit of the voice-driven motion generation apparatus provided by the embodiment of the present invention;
FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a voice-driven action generation method according to an embodiment of the present invention. Fig. 2 is a schematic flow chart of a voice-driven action generation method according to an embodiment of the present invention. The voice-driven action generation method is applied to a server that exchanges data with a terminal: the voice and the action style are input through the terminal, the server first judges whether a corresponding action needs to be generated, calculates the corresponding action parameters if so, and finally generates the human body action.
Fig. 2 is a flowchart illustrating a method for generating a voice-driven action according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S140.
And S110, acquiring voice and action style.
In the present embodiment, the voice refers to the audio according to which it is judged whether a human body action is to be generated; the action style refers to the parameters that control the actions of the digital human, and specifically comprises the action amplitude, action speed and action frequency.
The voice and the action style are the two inputs: the former is audio, and the latter is an optional input for manually controlling the action style of the digital human. The action style comprises three parameters, namely action amplitude, action speed and action frequency, each with a default value of 1. When a value lies between 0 and 1, the corresponding amplitude, speed or frequency is reduced; when it is greater than 1, the corresponding amplitude, speed or frequency is increased. The value range of each of the three parameters is [0, 10], i.e. greater than or equal to 0 and less than or equal to 10, and each parameter is controlled independently.
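Purely as an illustration, these constraints could be represented as follows (Python); the class name ActionStyle, the field names and the as_vector ordering are hypothetical and are not prescribed by this embodiment:

```python
from dataclasses import dataclass

@dataclass
class ActionStyle:
    """Three independently controlled style parameters; 1.0 is the neutral default.
    Values between 0 and 1 reduce the corresponding quantity, values above 1
    increase it, and every value must lie in the stated range [0, 10]."""
    amplitude: float = 1.0
    speed: float = 1.0
    frequency: float = 1.0

    def __post_init__(self):
        for name in ("amplitude", "speed", "frequency"):
            value = getattr(self, name)
            if not 0.0 <= value <= 10.0:
                raise ValueError(f"{name} must lie in [0, 10], got {value}")

    def as_vector(self) -> list:
        # Ordering is an assumption; the embodiment only names the three parameters.
        return [self.amplitude, self.speed, self.frequency]
```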
And S120, judging whether corresponding human body actions need to be generated or not according to the voice.
Specifically, a classification model is combined with the voice to judge whether corresponding human body actions need to be generated.
For the input voice, it is first determined whether a human body action should be generated for this voice; this is essentially a supervised classification task. In this embodiment, a recurrent neural network is adopted: the prosody of the voice is input and a binary mark is output, whose two values represent two categories, namely the category in which the corresponding human body action needs to be generated and the category in which it does not. If the output is 1, a human body action needs to be generated; if the output is 0, no human body action needs to be generated.
In an embodiment, referring to fig. 3, the step S120 may include steps S121 to S128.
And S121, converting the voice into pinyin with prosody.
In this embodiment, an existing natural language processing tool is used to convert the voice into pinyin with prosody; a sketch of this feature preparation (steps S121 to S124) is given after step S128 below.
S122, converting the pinyin with rhythm into word cards;
s123, calculating a Mel frequency spectrum of the voice;
s124, combining the word cards and the Mel frequency spectrums of the voice to form features;
s125, inputting the features into the classification model to obtain a classification result;
s126, judging whether the classification result is the type of the corresponding human body action needing to be generated or not;
s127, if the classification result is the category of the corresponding human body action needing to be generated, determining that the corresponding human body action needs to be generated at present;
and S128, if the classification result is not the category of the corresponding human body action needing to be generated, determining that the corresponding human body action does not need to be generated currently.
The word tokens and the Mel frequency spectrum are merged and used as features that are input into the recurrent neural network to make the judgment. The recurrent neural network can be designed with different structures and parameters, and can also be replaced by a convolutional neural network or an attention mechanism to achieve a similar effect. The structure of the network used in this embodiment is shown in Table 1.
TABLE 1 Structure of the recurrent neural network

Layer | Kernel | Stride | Activation |
---|---|---|---|
Convolutional layer | 1×3 | 1×2 | ReLU |
Convolutional layer | 1×3 | 1×2 | ReLU |
Convolutional layer | 1×3 | 1×2 | ReLU |
Convolutional layer | 1×3 | 1×2 | ReLU |
Convolutional layer | 1×3 | 1×2 | ReLU |
Convolutional layer | 1×3 | 1×2 | ReLU |
Convolutional layer | 1×2 | 1×2 | ReLU |
Output layer | - | - | softmax |
In this embodiment, the classification model includes at least one of a recurrent neural network, a convolutional neural network, and an attention mechanism.
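As an illustration only, a classifier following the layer sizes of Table 1 could be sketched in PyTorch as shown below. Mapping the 1×3 kernels and 1×2 strides to one-dimensional convolutions over the time axis, the channel width, the treatment of the merged features as input channels, and the temporal average pooling before the output layer are assumptions that Table 1 does not specify.

```python
import torch
import torch.nn as nn

class ActionGateClassifier(nn.Module):
    """Binary classifier: does this voice segment need an accompanying body action?"""

    def __init__(self, in_channels: int = 81, width: int = 64):
        super().__init__()
        layers = []
        channels = in_channels
        # Six convolutional layers with kernel 1x3 and stride 1x2 (Table 1, rows 1-6).
        for _ in range(6):
            layers += [nn.Conv1d(channels, width, kernel_size=3, stride=2, padding=1),
                       nn.ReLU()]
            channels = width
        # Seventh convolutional layer with kernel 1x2 and stride 1x2 (Table 1, row 7).
        layers += [nn.Conv1d(channels, width, kernel_size=2, stride=2), nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        # Output layer with softmax over the two categories (generate / do not generate).
        self.out = nn.Linear(width, 2)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, in_channels, frames), e.g. merged word tokens + Mel spectrum.
        h = self.conv(features).mean(dim=-1)       # average over the time axis
        return torch.softmax(self.out(h), dim=-1)  # (batch, 2) class probabilities
```

In use, taking the argmax of the two output probabilities and obtaining 1 corresponds to the category in which the corresponding human body action needs to be generated, as described for step S120.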
And S130, if the corresponding human body action needs to be generated, calculating the action parameters of the human body action according to the voice and the action style.
In this embodiment, the motion parameters of the human body motion refer to the motion sequence and the corresponding parameters, and can be used to generate the digital human motion.
In an embodiment, referring to fig. 4, the step S130 may include steps S131 to S133.
S131, combining the characteristics and the action style to form characteristic quantities.
In this embodiment, the feature quantity is the result of combining the feature formed by merging the word tokens and the Mel frequency spectrum of the voice with the action style.
S132, randomly discarding part of contents of the feature quantity to form a new feature set.
In this embodiment, the new feature set refers to the content left after part of the feature quantity is randomly discarded; the discard ratio ranges from 0 to 1 and is generally 0.5.
And S133, inputting the new feature set and the white noise into a neural network LSTM to generate an action sequence so as to obtain action parameters of the human body action.
The motion parameters of the human body motion comprise past motion, current motion, and coordinates, angles and directions corresponding to future motion.
For the voice for which a human body action needs to be generated, the action parameters are further calculated. The inputs are the voice, the action style and white noise, and the three input signals are synchronized in time. The word tokens and the Mel frequency spectrum of the voice are merged and then combined with the action style to form the total feature X. After part of the features is randomly discarded, X becomes a new feature set; the discard ratio ranges from 0 to 1 and is generally 0.5. The new feature set and white noise Z are input into a neural network LSTM (Long Short-Term Memory), which outputs an action sequence P. The action sequence here contains past, current and future actions, each expressed as a series of action parameters including coordinates, angles and directions.
As shown in fig. 5, the action of the digital human is a changing process, so the action output by the system is a sequence over time, and each current action is calculated from past action sequences, feature sets and white noise. The action sequence P_t at the current time t is generated by the LSTM, and the input of the LSTM contains four parameters: the action sequences P_{t-1} and P_{t-2} at the past times t-1 and t-2; the feature sets at times t-2, t-1, t+1 and t+2; and the white noise Z_t at the current time.
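A minimal PyTorch sketch of one such generation step follows. The hidden size, the pose dimension, the concatenation order of P_{t-1}, P_{t-2}, the four context feature frames and the white noise Z_t, and the use of dropout to realize the 0.5 discard ratio are illustrative assumptions that are consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionSequenceLSTM(nn.Module):
    """One autoregressive step: predict the action parameters P_t."""

    def __init__(self, feat_dim: int, pose_dim: int, hidden: int = 256):
        super().__init__()
        # Input = two past poses + four context feature frames + white noise.
        in_dim = 2 * pose_dim + 4 * feat_dim + pose_dim
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)  # coordinates, angles, directions

    def step(self, p_prev, p_prev2, feats_context, state=None, discard=0.5):
        # feats_context: (batch, 4, feat_dim), the features at t-2, t-1, t+1 and t+2
        # already combined with the action style; part of it is randomly discarded (S132).
        feats = F.dropout(feats_context, p=discard, training=True)
        z = torch.randn_like(p_prev)  # white noise Z_t
        x = torch.cat([p_prev, p_prev2, feats.flatten(1), z], dim=-1).unsqueeze(1)
        out, state = self.lstm(x, state)     # S133: the LSTM produces the sequence
        return self.head(out[:, -1]), state  # P_t and the recurrent state
```

Rolled forward over time, with each predicted P_t fed back as the next step's P_{t-1}, this yields the action sequence P illustrated in fig. 5.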
And S140, generating the digital human motion according to the motion parameters.
If the corresponding human body motion does not need to be generated, the step S110 is executed.
The digital human is automatically driven to act by existing software according to the action parameters, which belongs to the prior art and is not described herein again.
This embodiment gives the virtual digital human the unconscious actions that accompany speech, produces a result similar to real-person communication, and allows the user to specify particular actions on this basis.
According to the voice-driven action generation method, a voice and an action style are acquired, a supervised classification task determines whether a corresponding human body action needs to be generated, and when it is needed, a neural network LSTM combines the voice and the action style to calculate the action parameters and generate the digital human action. Because the classification model quickly determines whether a human body action is needed and the neural network LSTM quickly calculates the action parameters, the voice-driven human body actions are generated with short time consumption and high action stability, and the generated human body actions are natural.
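Putting the two stages together, the flow of steps S110 to S140 can be summarized in the following sketch; ActionGateClassifier and ActionSequenceLSTM are the hypothetical modules sketched above, the generator is assumed to have been built with feat_dim equal to the number of feature channels plus the three style parameters, the rest-pose initialization and frame alignment are simplifications, and the returned pose tensor would then be handed to the existing animation software of step S140.

```python
import torch

@torch.no_grad()
def generate_action_sequence(features: torch.Tensor, style_vec: torch.Tensor,
                             classifier, generator, steps: int, pose_dim: int):
    """features: (1, channels, frames) merged word-token/Mel input; style_vec: (1, 3).
    Returns a (T, pose_dim) tensor of action parameters, or None when the classifier
    decides that no body action should accompany this voice."""
    if classifier(features).argmax(dim=-1).item() != 1:               # S120
        return None
    # S131: tile the action style onto every frame and append it as extra channels.
    style = style_vec.unsqueeze(-1).expand(-1, -1, features.shape[-1])
    frame_feats = torch.cat([features, style], dim=1).squeeze(0).t()  # (frames, C + 3)
    p_prev = torch.zeros(1, pose_dim)   # rest pose before generation starts
    p_prev2 = torch.zeros(1, pose_dim)
    poses, state = [], None
    for t in range(2, min(steps, frame_feats.shape[0] - 2)):
        ctx = frame_feats[[t - 2, t - 1, t + 1, t + 2]].unsqueeze(0)  # (1, 4, C + 3)
        p_t, state = generator.step(p_prev, p_prev2, ctx, state)      # S132-S133
        poses.append(p_t)
        p_prev2, p_prev = p_prev, p_t
    return torch.cat(poses, dim=0) if poses else None                 # input to S140
```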
Fig. 6 is a schematic block diagram of a voice-driven motion generating apparatus 300 according to an embodiment of the present invention. As shown in fig. 6, the present invention also provides a voice-driven motion generating apparatus 300 corresponding to the above voice-driven motion generating method. The voice-driven motion generation apparatus 300 includes a unit for executing the above-described voice-driven motion generation method, and the apparatus may be configured in a server. Specifically, referring to fig. 6, the voice-driven motion generating apparatus 300 includes an acquiring unit 301, a determining unit 302, a parameter calculating unit 303, and a motion generating unit 304.
An acquisition unit 301 for acquiring a voice and an action style; a judging unit 302, configured to judge whether a corresponding human body action needs to be generated according to the voice; a parameter calculating unit 303, configured to calculate an action parameter of the human body action according to the voice and the action style if a corresponding human body action needs to be generated; and an action generating unit 304, configured to generate a digital human action according to the action parameter.
In an embodiment, the determining unit 302 is configured to determine whether generating a corresponding human body action is required by using a classification model in combination with the voice.
In one embodiment, as shown in fig. 7, the determining unit 302 includes a first conversion subunit 3021, a second conversion subunit 3022, a calculating subunit 3023, a merging subunit 3024, an input subunit 3025, a result judging subunit 3026, a first determining subunit 3027, and a second determining subunit 3028.
A first conversion subunit 3021, configured to convert the voice into pinyin with prosody; a second conversion subunit 3022, configured to convert the pinyin with prosody into word tokens; a calculating subunit 3023, configured to calculate the Mel frequency spectrum of the voice; a merging subunit 3024, configured to merge the word tokens and the Mel frequency spectrum of the voice to form features; an input subunit 3025, configured to input the features into the classification model to obtain a classification result; a result judging subunit 3026, configured to judge whether the classification result is the category in which the corresponding human body action needs to be generated; a first determining subunit 3027, configured to determine that the corresponding human body action currently needs to be generated if the classification result is that category; and a second determining subunit 3028, configured to determine that the corresponding human body action does not currently need to be generated if the classification result is not that category.
In one embodiment, as shown in fig. 8, the parameter calculation unit 303 includes a feature quantity generation sub-unit 3031, a discarding sub-unit 3032, and a sequence generation sub-unit 3033.
A feature quantity generation subunit 3031, configured to combine the features with the action style to form a feature quantity; a discarding subunit 3032, configured to randomly discard part of the content of the feature quantity to form a new feature set; and a sequence generating subunit 3033, configured to input the new feature set and white noise into the neural network LSTM to generate an action sequence, so as to obtain the action parameters of the human body action.
It should be noted that, as can be clearly understood by those skilled in the art, the specific implementation processes of the voice-driven motion generating apparatus 300 and each unit may refer to the corresponding descriptions in the foregoing method embodiments, and for convenience and brevity of description, no further description is provided herein.
The voice-driven motion generation apparatus 300 described above may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 9.
Referring to fig. 9, fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, wherein the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 9, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 comprises program instructions that, when executed, cause the processor 502 to perform a voice-driven action generation method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute a voice-driven action generation method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 9 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation of the computer device 500 to which the present application may be applied, and that a particular computer device 500 may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring voice and action style; judging whether corresponding human body actions need to be generated or not according to the voice; if the corresponding human body action needs to be generated, calculating action parameters of the human body action according to the voice and the action style; and generating the digital human action according to the action parameters.
Wherein the action style comprises an action amplitude, an action speed and an action frequency.
In an embodiment, when the step of determining whether to generate the corresponding human body action according to the voice is implemented by the processor 502, the following steps are specifically implemented:
and judging whether corresponding human body actions need to be generated or not by adopting a classification model and combining the voice.
Wherein the classification model comprises at least one of a recurrent neural network, a convolutional neural network, and an attention mechanism.
In an embodiment, when the step of determining whether to generate the corresponding human body action by using the classification model and the speech is implemented by the processor 502, the following steps are specifically implemented:
converting the voice into pinyin with prosody; converting the pinyin with prosody into word tokens; calculating a Mel frequency spectrum of the voice; merging the word tokens and the Mel frequency spectrum of the voice to form features; inputting the features into the classification model to obtain a classification result; judging whether the classification result is the category in which the corresponding human body action needs to be generated; if the classification result is the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action currently needs to be generated; and if the classification result is not the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action does not currently need to be generated.
In an embodiment, when the processor 502 implements the step of calculating the motion parameter of the human body motion according to the voice and the motion style, the following steps are specifically implemented:
merging the features with the action style to form feature quantities; randomly discarding part of the content of the feature quantity to form a new feature set; and inputting the new feature set and the white noise into a neural network LSTM to generate an action sequence so as to obtain the action parameters of the human body action.
The motion parameters of the human body motion comprise coordinates, angles and directions corresponding to the past motion, the current motion and the future motion.
It should be understood that, in the embodiments of the present application, the processor 502 may be a graphics processing unit (GPU) and/or a central processing unit (CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware. The computer program includes program instructions, and the computer program may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, wherein the computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring voice and action style; judging whether corresponding human body actions need to be generated or not according to the voice; if the corresponding human body action needs to be generated, calculating action parameters of the human body action according to the voice and the action style; and generating the digital human action according to the action parameters.
Wherein the action style comprises an action amplitude, an action speed and an action frequency.
In an embodiment, when the processor executes the computer program to implement the step of determining whether to generate the corresponding human body action according to the voice, the following steps are specifically implemented:
and judging whether corresponding human body actions need to be generated or not by adopting a classification model and combining the voice.
Wherein the classification model comprises at least one of a recurrent neural network, a convolutional neural network, and an attention mechanism.
In an embodiment, when the processor executes the computer program to implement the step of determining whether to generate the corresponding human body action by using the classification model in combination with the speech, the following steps are specifically implemented:
converting the voice into pinyin with prosody; converting the pinyin with prosody into word tokens; calculating a Mel frequency spectrum of the voice; merging the word tokens and the Mel frequency spectrum of the voice to form features; inputting the features into the classification model to obtain a classification result; judging whether the classification result is the category in which the corresponding human body action needs to be generated; if the classification result is the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action currently needs to be generated; and if the classification result is not the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action does not currently need to be generated.
In an embodiment, when the processor executes the computer program to implement the step of calculating the motion parameter of the human body motion according to the voice and the motion style, the following steps are specifically implemented:
merging the features with the action style to form feature quantities; randomly discarding part of the content of the feature quantity to form a new feature set; and inputting the new feature set and the white noise into a neural network LSTM to generate an action sequence so as to obtain the action parameters of the human body action.
The motion parameters of the human body motion comprise coordinates, angles and directions corresponding to the past motion, the current motion and the future motion.
The storage medium may be a USB flash disk, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium capable of storing the computer program.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above in functional terms in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be merged, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A voice-driven motion generation method is characterized by comprising:
acquiring voice and action style;
judging whether corresponding human body actions need to be generated or not according to the voice;
if the corresponding human body action needs to be generated, calculating action parameters of the human body action according to the voice and the action style;
and generating the digital human action according to the action parameters.
2. The voice-driven motion generation method according to claim 1, wherein the motion style includes a motion amplitude, a motion speed, and a motion frequency.
3. The voice-driven motion generation method according to claim 1, wherein the determining whether the corresponding human motion needs to be generated according to the voice includes:
and judging whether corresponding human body actions need to be generated or not by adopting a classification model and combining the voice.
4. The method according to claim 3, wherein the determining whether the corresponding human body motion needs to be generated by using the classification model in combination with the voice comprises:
converting the voice into pinyin with prosody;
converting the pinyin with prosody into word tokens;
calculating a Mel frequency spectrum of the voice;
merging the word tokens and the Mel frequency spectrum of the voice to form features;
inputting the features into the classification model to obtain a classification result;
judging whether the classification result is the category in which the corresponding human body action needs to be generated;
if the classification result is the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action currently needs to be generated;
and if the classification result is not the category in which the corresponding human body action needs to be generated, determining that the corresponding human body action does not currently need to be generated.
5. The speech-driven motion generation method of claim 4, wherein the classification model comprises at least one of a recurrent neural network, a convolutional neural network, and an attention mechanism.
6. The voice-driven motion generation method according to claim 4, wherein the calculating motion parameters of the human motion according to the voice and the motion style comprises:
merging the features with the action style to form feature quantities;
randomly discarding part of the content of the feature quantity to form a new feature set;
and inputting the new feature set and the white noise into a neural network LSTM to generate an action sequence so as to obtain action parameters of the human body action.
7. The voice-driven motion generation method according to claim 6, wherein the motion parameters of the human motion include coordinates, angles, and directions corresponding to a past motion, a current motion, and a future motion.
8. A voice-driven motion generating device is characterized by comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice and action styles;
the judging unit is used for judging whether corresponding human body actions need to be generated or not according to the voice;
the parameter calculation unit is used for calculating motion parameters of the human body motion according to the voice and the motion style if the corresponding human body motion needs to be generated;
and the action generating unit is used for generating the digital human action according to the action parameters.
9. A computer device, characterized in that the computer device comprises a memory, on which a computer program is stored, and a processor, which when executing the computer program implements the method according to any of claims 1 to 7.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111331817.3A CN114049880A (en) | 2021-11-11 | 2021-11-11 | Voice-driven motion generation method, device, computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111331817.3A CN114049880A (en) | 2021-11-11 | 2021-11-11 | Voice-driven motion generation method, device, computer device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114049880A true CN114049880A (en) | 2022-02-15 |
Family
ID=80208401
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111331817.3A Pending CN114049880A (en) | 2021-11-11 | 2021-11-11 | Voice-driven motion generation method, device, computer device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114049880A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116168686A (en) * | 2023-04-23 | 2023-05-26 | 碳丝路文化传播(成都)有限公司 | Digital human dynamic simulation method, device and storage medium |
- 2021-11-11: Application CN202111331817.3A filed in China; published as CN114049880A, status Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116168686A (en) * | 2023-04-23 | 2023-05-26 | 碳丝路文化传播(成都)有限公司 | Digital human dynamic simulation method, device and storage medium |
CN116168686B (en) * | 2023-04-23 | 2023-07-11 | 碳丝路文化传播(成都)有限公司 | Digital human dynamic simulation method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020135194A1 (en) | Emotion engine technology-based voice interaction method, smart terminal, and storage medium | |
US20220058848A1 (en) | Virtual avatar driving method and apparatus, device, and storage medium | |
CN110599359B (en) | Social contact method, device, system, terminal equipment and storage medium | |
US11839815B2 (en) | Adaptive audio mixing | |
US11735206B2 (en) | Emotionally responsive virtual personal assistant | |
CN112734889A (en) | Mouth shape animation real-time driving method and system for 2D character | |
US20200176019A1 (en) | Method and system for recognizing emotion during call and utilizing recognized emotion | |
CN114049880A (en) | Voice-driven motion generation method, device, computer device and storage medium | |
CN107463684A (en) | Voice replying method and device, computer installation and computer-readable recording medium | |
CN109658931A (en) | Voice interactive method, device, computer equipment and storage medium | |
CN113035198A (en) | Lip movement control method, device and medium for three-dimensional face | |
CN117203675A (en) | Artificial intelligence for capturing facial expressions and generating mesh data | |
JP2015038725A (en) | Utterance animation generation device, method, and program | |
CN117563227A (en) | Virtual character face control method, device, computer equipment and storage medium | |
JP7201984B2 (en) | Android gesture generator and computer program | |
CN112786047B (en) | Voice processing method, device, equipment, storage medium and intelligent sound box | |
US20030033149A1 (en) | Methods and systems of simulating movement accompanying speech | |
KR20220071522A (en) | A method and a TTS system for generating synthetic speech | |
CN111798862A (en) | Audio noise reduction method, system, device and storage medium | |
JP7348027B2 (en) | Dialogue system, dialogue program, and method of controlling the dialogue system | |
JP6993034B1 (en) | Content playback method and content playback system | |
CN118397985A (en) | Music generation method, device, electronic equipment and storage medium | |
US20240202984A1 (en) | Systems and methods for generating images to achieve a style | |
CN117496017A (en) | Virtual image generation method, device and related equipment | |
JP2023139731A (en) | Learning method and learning apparatus for gesture generation model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |