CN116189034A - Head posture driving method and device, equipment, medium and product thereof - Google Patents

Head posture driving method and device, equipment, medium and product thereof

Info

Publication number
CN116189034A
Authority
CN
China
Prior art keywords
head
scene type
target scene
voice
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211635506.0A
Other languages
Chinese (zh)
Inventor
冯进亨
戴长军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huanju Shidai Information Technology Co Ltd
Original Assignee
Guangzhou Huanju Shidai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huanju Shidai Information Technology Co Ltd filed Critical Guangzhou Huanju Shidai Information Technology Co Ltd
Priority to CN202211635506.0A priority Critical patent/CN116189034A/en
Publication of CN116189034A publication Critical patent/CN116189034A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a head pose driving method and a device, equipment, medium and product thereof, wherein the method comprises the following steps: acquiring a speech feature sequence of audio data, the speech feature sequence comprising feature coding information of a plurality of speech frames corresponding to successive time steps; synchronously determining, according to semantic feature information of the speech feature sequence, a target scene type corresponding to each time step and head pose parameters corresponding to the target scene type; and controlling, according to the target scene type, a three-dimensional model of a digital human to generate head images according to the head pose parameters. With the method and device, the target scene type can be accurately predicted from the audio data and the head pose parameters corresponding to that target scene type obtained, so that digital human head images matching the target scene are generated from those head pose parameters. The head movements shown in the head images are coordinated with the original sound and match the style of the target scene, making them more vivid and natural and suitable for various digital-human-related application scenarios.

Description

Head posture driving method and device, equipment, medium and product thereof
Technical Field
The present application relates to animation control technology, and in particular to a head pose driving method and apparatus, and a device, medium, and product thereof.
Background
In application scenarios such as digital avatars and virtual live rooms, conversational interaction is an indispensable social behavior, so the digital human needs to be driven by voice to move its mouth when speaking and to express different emotions.
At present, from a technical point of view, the facial expression of a digital human is generally obtained by linearly blending a number of basic deformation targets (blendshapes), covering basic expression actions such as opening the mouth, closing the mouth, smiling, and moving the eyebrows and eyes, to approximate the corresponding real expression. However, these techniques generally ignore the fact that when people actually talk, the head also moves along with the speech; typical voice-driven expression or lip synchronization algorithms address only the mouth and facial expression and do not involve the head pose, so the generated digital human animation is still not natural enough.
In addition, voice and head pose are not strongly correlated: even the same speaker, saying the same content, may move the head differently in different scenes. This weak correlation is one of the technical difficulties that make it hard to obtain good results when animating digital humans. Driving the head pose well with a model requires a large amount of head pose data with accompanying voice across different scenes, and if such data is annotated manually, the acquisition cost is very high.
As can be seen from the above, the main technical problems of voice-driven digital human animation generation are currently as follows:
1. applications that drive the head pose purely from voice are very rare; head pose is mainly driven from video, in which a person speaking naturally also drives the head, but pure video driving lacks technical flexibility;
2. voice data and head pose are only weakly correlated, yet they are produced synchronously with high probability, so a voice-driven head pose model requires a large amount of data and is costly;
3. different speaking scenes exhibit different head pose behavior, and if the head pose control model only produces an averaged result, it is difficult to fit most scenes.
In view of this, there is a need to further explore techniques for generating head animation of digital humans in order to make industrial progress in multiple respects.
Disclosure of Invention
It is an object of the present application to solve the above-mentioned problems and provide a head pose driving method and corresponding apparatus, device, non-volatile readable storage medium, and computer program product.
According to one aspect of the present application, there is provided a head pose driving method, comprising the steps of:
acquiring a speech feature sequence of audio data, the speech feature sequence comprising feature coding information of a plurality of speech frames corresponding to successive time steps;
synchronously determining, according to semantic feature information of the speech feature sequence, a target scene type corresponding to each time step and head pose parameters corresponding to the target scene type;
and controlling, according to the target scene type, a three-dimensional model of a digital human to generate head images according to the head pose parameters.
According to another aspect of the present application, there is provided a head pose driving device, comprising:
an audio acquisition module configured to acquire a speech feature sequence of audio data, the speech feature sequence comprising feature coding information of a plurality of speech frames corresponding to successive time steps;
a parameter generation module configured to synchronously determine, according to semantic feature information of the speech feature sequence, a target scene type corresponding to each time step and head pose parameters corresponding to the target scene type;
and an image generation module configured to control, according to the target scene type, a three-dimensional model of a digital human to generate head images according to the head pose parameters.
According to another aspect of the present application, there is provided a head pose driving apparatus, comprising a central processing unit and a memory, the central processing unit being configured to invoke a computer program stored in the memory to perform the steps of the head pose driving method described herein.
According to another aspect of the present application, there is provided a non-volatile readable storage medium storing, in the form of computer-readable instructions, a computer program implementing the head pose driving method, the computer program performing the steps of the method when invoked by a computer.
According to another aspect of the present application, there is provided a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the present application determines, for the audio data, the target scene type at each time step using the speech feature sequence of the audio data, and synchronously generates the head pose parameters of each time step corresponding to the target scene type. Because the head pose parameters are generated according to the corresponding target scene type, they match the target scene, and the cooperative relationship between the sound and the head images is finely controlled, so that the head pose in the generated digital human head images stays coordinated and synchronized with the actual vocal activity in the audio data, making the head movement at each time step more vivid and natural. This can help metaverse-related applications be deployed more quickly and improve the user experience of digital-human-based applications such as social communication, gaming, entertainment, live streaming and e-commerce, yielding more optimistic expected economic benefits.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of a network architecture corresponding to a live service applied in the present application;
FIG. 2 is a schematic diagram of a network architecture of a head pose control model illustratively employed herein;
FIG. 3 is a flow chart of one embodiment of a head pose driving method of the present application;
FIG. 4 is a flow chart of the method for acquiring audio data and a voice feature sequence thereof;
FIG. 5 is a flow chart of an exemplary head pose control model according to the present application for determining target scene types and head pose parameters for each time sequence based on a speech feature sequence;
FIG. 6 is a flow chart of an exemplary head pose control model training process of the present application;
FIG. 7 is a schematic diagram of a flow chart for generating sample data of an exemplary head pose control model according to the present application;
FIG. 8 is a flow chart of an inference process of an exemplary head pose control model of the present application;
FIG. 9 is a schematic diagram of a loss value calculation flow of an exemplary head pose control model of the present application;
FIG. 10 is a flow chart of generating a digital human head image in an embodiment of the present application;
FIG. 11 is a functional block diagram of a head pose drive device of the present application;
fig. 12 is a schematic structural view of a head posture driving device employed in the present application.
Detailed Description
Unless explicitly specified otherwise, the models referenced or potentially referenced in the present application, including traditional machine learning models and deep learning models, may be deployed on a remote server and invoked remotely from a client, or deployed on a client whose device capability is sufficient to invoke them directly. In some embodiments, when running on the client, the corresponding intelligence may be obtained through transfer learning, so as to reduce the requirements on the client's hardware operating resources and avoid excessive occupation of them.
Referring to fig. 1, the network architecture adopted in an exemplary live-streaming application scenario of the present application may be used to deploy a digital-human-based live-streaming service, in which a digital human performs the live broadcast for purposes such as entertainment, e-commerce sales, and explanation. The application server 81 shown in fig. 1 may be used to support the operation of the live-streaming service, while the media server 82 may be used to store or forward users' video streams; terminal devices such as a computer 83 and a mobile phone 84 are typically provided to end users as clients for uploading video streams or for downloading and playing them. The method or device of the present application may run in the application server 81, the media server 82 or the terminal device, generating the head pose parameters corresponding to each time step from the audio data and controlling the head pose of the digital human in the digital-human-based animation, so that when the terminal device plays the video stream, the digital human speaks naturally in accordance with the audio data, with its head movement coordinated and unified with the sound content. Similarly, the network architecture of the present application may also be used to deploy digital-human-based services for social communication, gaming, entertainment, and the like.
Another exemplary application scenario may be implemented on a terminal device independent of a public network: by running a computer program product in which the present application is installed, corresponding digital human head images are generated for audio data input by the user, thereby creating the corresponding animation. In the process of generating the digital human's head images, not only can the head pose parameters be referenced to control the digital human's head movement, but the corresponding facial expression parameters can also be referenced to control the digital human's facial expression movement.
The above application scenarios are merely exemplary. In fact, the technical solution implemented by the present application is a fundamental technology that can be applied wherever the requirements match, so it is generally applicable to any application scenario with matching requirements.
Referring to fig. 2, in accordance with the inventive spirit of the present application, an exemplary deep learning model, referred to as the head pose control model, is also provided. It is trained in advance to a converged state and then put into inference, and comprises a convolutional neural network (CNN), a first recurrent neural network, a plurality of second recurrent neural networks, and a classifier.
The convolutional neural network may be constructed as multi-stage convolutional layers for extracting initial feature information from the input speech feature sequence. In one embodiment, the convolutional neural network may include two convolutional layers: the first convolutional layer performs a convolution operation on the speech feature sequence of the audio data at its original scale to obtain feature information over a plurality of channels, and the second convolutional layer performs a convolution operation on that multi-channel feature information to compress the features and restore initial feature information at the original scale. In one embodiment, the convolutional layers may form a temporal convolutional network (TCN), which handles audio data with long time spans better.
The first and second recurrent neural networks may be built as recurrent neural networks (RNNs), or as other networks suitable for extracting feature information from time-series data, such as a Long Short-Term Memory (LSTM) network; in some embodiments, networks that add a self-attention layer on top of a recurrent structure, such as a Transformer encoder, may also be used. The characteristic common to these recurrent networks is that they can organize features with reference to the context information in the time-series data, so that the resulting feature representations are more accurate and effective. The first and second recurrent neural networks may be of the same or of different types: for example, the first recurrent neural network may adopt a Transformer encoder while the second adopts an LSTM, or both may adopt LSTMs.
The first recurrent neural network is mainly used to extract semantic features from the initial feature information output by the convolutional neural network to obtain shallow semantic information, while the second recurrent neural networks are mainly used to further extract deep semantic information from the shallow semantic information obtained by the first recurrent neural network. The shallow semantic information may have a higher dimensionality than the deep semantic information, so that the first recurrent neural network fully mines the semantic features while the second recurrent neural networks highly condense the shallow semantic information, making the semantics more concentrated, so that the deep semantic information becomes the head pose parameters of the corresponding time steps. The head pose parameters may be expressed in the form of pose vectors, one pose vector per time step, which control the pose of the head of the digital human's three-dimensional model at that time step. The pose vector may be format-converted to suit the specific way the head pose movement of the deformation targets is defined, for example converted from a multi-dimensional rotation matrix representation to an Euler angle representation. After conversion to Euler angles, a head pose can be represented by the yaw, pitch, and roll angles, so that for each time step the corresponding head image can use this Euler angle representation to define the pose of the digital human's head in model space.
There is a plurality of second recurrent neural networks, each corresponding to a specific scene type. The specific scenes may be, for example, interpersonal dialogue, lecturing, and singing performance, and the division of scenes is determined according to the characteristics of human head movement in real life. For example, in an interpersonal dialogue scene, the head moves naturally in all directions while people talk, and the degree of turning is generally small; in a lecturing scene, the head pose is largely fixed during most of the speaking time, with movement mostly left and right overall (the situation is similar when giving a speech or hosting), and larger head movements occur only when triggered by a specific emotion or specific wording; in a singing scene, fast-paced songs are accompanied by rapid head movement such as quick nodding, while slower songs are accompanied by larger head movements with relatively slow turning. Therefore, by analyzing the characteristics of human head movement in various scenes and grouping movements with the same kind of characteristics into the same scene type, the head movement characteristics of different scenes can be abstracted, and different recurrent neural networks can be set up for different scenes for deep processing, so that the relevant data can be captured more finely and accurately.
The classifier may be a multi-class classifier constructed with a Softmax function, and is mainly used to map the shallow semantic information output by the first recurrent neural network to a preset classification space to obtain the classification probabilities corresponding to the various scene types in that space, so that the predicted target scene type can be determined from these classification probabilities.
After the speech feature sequence is input into the convolutional neural network as time-series data, the convolutional neural network performs preliminary feature extraction to obtain the initial feature information corresponding to each time step of the speech feature sequence; this initial feature information is then input into the first recurrent neural network, which organizes the context to obtain relatively shallow feature information, which may therefore be called shallow semantic information.
The shallow semantic information obtained by the first recurrent neural network can, relatively speaking, better represent the original semantics of the speech feature sequence of the audio data, and the corresponding scene type can be analyzed from features such as the speech content, intonation, and rhythm. A first branch is therefore expanded, in which the classifier performs classification mapping on the shallow semantic information to obtain the classification probability of each class in the corresponding classification space, and the target scene type can be determined from these classification probabilities.
In order to determine the deformation parameters of the set of deformation targets used to control the generation of the head images, a plurality of second branches are expanded for the shallow semantic information of each time step; in each second branch, the second recurrent neural network compresses the context-associated shallow semantic information into deep semantic information, i.e., the pose vector of the corresponding time step.
Thus, with the head pose control model, the target scene type corresponding to each sampling time step of the audio data and the pose vectors acting on the set of deformation targets can both be determined from the speech feature sequence of the audio data; for each time step, according to the target scene type at that time step, the head pose parameters output by the second recurrent neural network in the second branch corresponding to that target scene type are taken to control the head pose of the deformation targets at that time step. Because the shallow semantic information is obtained from a network structure shared by the first branch and the second branches, the obtained target scene type and head pose parameters are synchronized in time, so that after the digital human animation is generated, coordination and synchronization between voice and image can be guaranteed.
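For ease of understanding, a minimal PyTorch sketch of such a head pose control model is given below. It only illustrates the structure described above (a two-stage convolutional front end, a first recurrent network producing shallow semantic information, a classifier branch, and one second recurrent network per scene type); the hidden size of the scene branches, the linear pose heads, the activation functions and the other layer hyper-parameters are assumptions of this sketch rather than details confirmed by the present disclosure.

import torch
import torch.nn as nn

class HeadPoseControlModel(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128, pose_dim=9, num_scenes=3):
        super().__init__()
        # Convolutional neural network: two convolution stages; the second stage
        # compresses the 16 intermediate channels back to a single channel.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1, stride=1), nn.ReLU(),
        )
        # First recurrent network: extracts shallow semantic information.
        self.first_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # First branch: classifier mapping shallow semantics to scene types.
        self.classifier = nn.Linear(hidden_dim, num_scenes)
        # Second branches: one recurrent network per scene type, regressing a
        # pose vector (e.g. a flattened 3x3 rotation matrix) per time step.
        self.scene_rnns = nn.ModuleList(
            [nn.LSTM(hidden_dim, hidden_dim, batch_first=True) for _ in range(num_scenes)]
        )
        self.pose_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, pose_dim) for _ in range(num_scenes)]
        )

    def forward(self, speech_feats):
        # speech_feats: (batch, T, feat_dim) speech feature sequence
        x = self.cnn(speech_feats.unsqueeze(1)).squeeze(1)   # (batch, T, feat_dim)
        shallow, _ = self.first_rnn(x)                       # (batch, T, hidden_dim)
        scene_logits = self.classifier(shallow)              # (batch, T, num_scenes)
        poses = []
        for rnn, head in zip(self.scene_rnns, self.pose_heads):
            deep, _ = rnn(shallow)                           # (batch, T, hidden_dim)
            poses.append(head(deep))                         # (batch, T, pose_dim)
        return scene_logits, torch.stack(poses, dim=1)       # (batch, num_scenes, T, pose_dim)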
Referring to fig. 3, in one embodiment, a head pose driving method of the present application includes the following steps:
Step S1100, acquiring a speech feature sequence of audio data, the speech feature sequence comprising feature coding information of a plurality of speech frames corresponding to successive time steps;
The audio data used to drive the head movement of the digital human is predetermined; it may be extracted from an original video stream, obtained as a real-time recording, or obtained by directly loading given audio material. The content presented when the audio data is played may be natural speech, sung songs, or speech-type content such as lectures, explanations, hosting, and so on.
To facilitate time-sequential processing of the audio data, a preset frequency may be given as the sampling frequency, and the audio data is sampled at this preset frequency to obtain a plurality of speech frames, which form the speech frame sequence of the audio data according to their temporal order.
To facilitate processing by the head pose control model, the feature representation of each speech frame in the speech frame sequence can further be realized by extracting speech features from each speech frame and encoding them to obtain feature coding information. The speech features may be extracted from any one of the following: time-spectrum information, Mel-spectrum information, pitch class profile information, CQT filtering information, or Chroma information.
The time-spectrum information is obtained by pre-emphasizing, framing, and windowing the audio data in the time domain and applying a short-time Fourier transform (STFT), yielding the data corresponding to a spectrogram.
The Mel-spectrum information can be obtained by filtering the time-spectrum information with a Mel-scale filter bank; likewise, taking the logarithm of the Mel spectrum and applying a discrete cosine transform (DCT) yields the corresponding Mel-cepstrum information. It will be appreciated that Mel-spectrum information and its Mel-cepstrum information can better describe style-stable characteristics of the audio data, such as pitch, intonation, and timbre.
The CQT filtering information is more suitable for processing audio data containing sung songs. In music, all tones are built from twelve-tone equal temperament, corresponding to the twelve semitones within one octave on a piano, and the frequency ratio between adjacent semitones is 2^(1/12). Clearly, for the same pitch class, the higher octave has twice the frequency of the lower one. Musical pitches are therefore distributed exponentially, whereas the audio spectrum obtained by the Fourier transform is distributed linearly, so the frequency bins of the two do not correspond one-to-one, which causes errors in the estimation of certain scale frequencies. The time-frequency transform algorithm CQT can therefore be used in place of the Fourier transform for speech analysis. CQT (Constant-Q Transform) refers to a filter bank whose center frequencies are exponentially distributed and whose filter bandwidths differ, but whose ratio Q of center frequency to bandwidth is constant. Unlike the Fourier transform, the frequency axis of its spectrum is not linear but based on log2, and the filter window length can vary with frequency to obtain better performance. Since the CQT bins coincide with the distribution of scale frequencies, computing the CQT spectrum of a music signal directly gives the amplitude of the signal at each note frequency, which is better suited to the signal processing of music.
The pitch class profile information, including PCP (Pitch Class Profile) and HPCP (Harmonic Pitch Class Profile), is likewise applicable to audio data containing sung songs. It aims to extract the corresponding pitch sequence from the audio data of a song segment, convert it into a melody contour sequence after normalization, merging, and segmentation, and then convert it into a corresponding feature representation using the pitch differences relative to standard pitch. Feature coding information constructed from pitch class profile information has good robustness to environmental noise.
The Chroma feature information is also well suited to audio data containing sung songs; it is the collective name for Chroma vectors and Chromagrams. A Chroma vector is a vector of 12 elements, each representing the energy of one of the 12 pitch classes over a period of time (e.g., one frame), with the energies of the same pitch class in different octaves accumulated; a Chromagram is a sequence of Chroma vectors. Specifically, after a speech frame of the audio data of a song segment is converted from the time domain to the frequency domain by a short-time Fourier transform, noise reduction and tuning are performed; the absolute time is converted into frames according to the selected window length, and the energy of each pitch in each frame is recorded to form a pitch spectrogram; on the basis of the pitch spectrogram, the energies (in loudness) of notes at the same time and of the same pitch class but different octaves are superimposed onto that pitch class's element of the Chroma vector to form the Chromagram. The data corresponding to the Chromagram is the Chroma feature information.
As can be seen, there are various ways of extracting speech features from the speech frames of the audio data, and the serialized feature coding information obtained by encoding the speech features at each sampling time step can be used as the speech feature sequence of the audio data.
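As an illustration only, the following sketch shows how such alternative speech features could be extracted with the open-source librosa library; the library choice, the file name and all parameter defaults are assumptions of this sketch and not part of the present disclosure.

import librosa

y, sr = librosa.load("speech.wav", sr=16000, mono=True)   # hypothetical input file
stft = librosa.stft(y)                                     # time-spectrum (spectrogram) information
mel = librosa.feature.melspectrogram(y=y, sr=sr)           # Mel-spectrum information
mfcc = librosa.feature.mfcc(y=y, sr=sr)                    # Mel-cepstrum (log Mel + DCT)
cqt = librosa.cqt(y, sr=sr)                                # constant-Q transform (CQT) filtering information
chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # Chroma information (12 pitch classes)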
Step 1200, synchronously determining a target scene type corresponding to each time sequence and a head gesture parameter corresponding to the target scene type according to semantic feature information of the voice feature sequence;
Semantic feature information can be extracted from the speech feature sequence; it may be shallow or deep, and may be used flexibly at different semantic depths. Through this flexible use of the semantic feature information, the mapped target scene type and the head pose parameters under the various scene types, and in particular under the target scene type, are determined synchronously.
Taking the head pose control model above as an example, the speech feature sequence is input into the head pose control model and processed successively by the convolutional neural network and the first recurrent neural network, yielding the shallow semantic information for each sampling time step of the speech feature sequence. In an embodiment, this shallow semantic information may also be reused in later steps. Because the shallow semantic information has undergone initial feature extraction by the convolutional neural network and initial context organization by the first recurrent neural network, the semantics it contains are relatively raw but already associated with context, and whether voice activity exists can be represented more accurately from the temporal relationships before and after; such raw, context-organized semantics are therefore well suited to determining the target scene type of each time step from features such as style, intonation, and rhythm in the speech.
For example, the plurality of scene types addressed by the speech feature sequence may be defined as three scene types: interpersonal dialogue, lecturing, and singing performance. In fact, any two of these scene types may be used, or more scene types may be defined by subdividing different specific scenes, as can be implemented by a person skilled in the art according to the principles disclosed herein. When a scene type is predicted for the speech feature sequence at a given time step, the scene type that is hit is the target scene type of that time step.
It is easy to understand that the shallow semantic information obtained by the first recurrent neural network is also serialized information output for each sampling time step. The shallow semantic information of each time step flows in turn to the classifier in the first branch of the head pose control model, which maps it to a preset classification space; the classification space has classes corresponding to the scene types, each class indicating one scene type, so for each time step the classification probability of the shallow semantic information mapping to each class is obtained, and the scene type represented by the class with the largest classification probability is the target scene type detected for that time step.
It will be understood that, according to the above principle, the target scene type of each time step can be determined from the shallow semantic information of the speech frame sequence at that time step. However, in an alternative embodiment, considering that the scene corresponding to a single piece of audio data is fixed, the whole of the audio data may instead be considered: an overall target scene type is predicted from the speech feature sequence of the audio data, and this target scene type covers all time steps of the speech feature sequence, which can achieve the same effect.
On the basis of the shallow semantic information, the deep semantic information can be further extracted. In one embodiment, in the plurality of second branches provided by the head pose control model of the embodiments of the present application, one for each specific scene type, the second recurrent neural network in each second branch further extracts deep semantics from the context-associated shallow semantic information of each time step to obtain the deep semantic information of that time step. As described above, the deep semantic information is essentially a head pose parameter expressed in the form of a pose vector, which can be converted as necessary to act on the three-dimensional model of the digital human and thereby control changes in the digital human's head pose. Thus, the deep semantic information of each time step can serve as the pose vector of that time step for controlling the generation of the digital human's head images.
It should be understood that the exemplary head pose control model may synchronously generate head control parameters for every scene type, but because the first branch has already determined the target scene type, in actual use only the head pose parameters output by the corresponding second branch need to be used according to the target scene type, and the head pose parameters of the other scene types at the same time step need not be used.
From the above process it can be seen that, under the action of the head pose control model, the target scene types of the speech feature sequence and the head pose parameters of the respective scene types are generated synchronously, and the target scene type decision and the head pose parameters of each scene type correspond in time, so the head pose of the digital human can be precisely controlled at specific time steps, with a very fine granularity of description.
Step S1300, controlling a three-dimensional model of the digital person according to the target scene type to generate a head image according to the head posture parameters.
According to the above process, for the voice feature sequence, the target scene type corresponding to the time sequence and the gesture vector of the deformation target can be obtained synchronously under each time sequence, so in one embodiment, the gesture vector corresponding to the current time sequence can be applied to the deformation target of the corresponding time sequence according to the target scene type of each time sequence, and the head image corresponding to the time sequence can be generated in a real-time control manner according to the target scene type of the audio data under the time sequence.
In other, further improved embodiments, the pose vectors of the current time step and of several subsequent time steps can be smoothed by weighting before being applied to the deformation targets, so that the pose changes in the generated head images are more natural and smooth. By analogy, a person skilled in the art can configure this flexibly according to the principles disclosed above.
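A minimal sketch of such smoothing is given below, assuming a short look-ahead window with fixed weights; the window length and weight values are illustrative assumptions only.

import numpy as np

def smooth_poses(poses, weights=(0.5, 0.3, 0.2)):
    # poses: (T, 9) array of per-time-step pose vectors.
    T = len(poses)
    smoothed = np.empty_like(poses)
    for t in range(T):
        w_sum, acc = 0.0, np.zeros_like(poses[0])
        for k, w in enumerate(weights):
            if t + k < T:               # look ahead k steps when available
                acc += w * poses[t + k]
                w_sum += w
        smoothed[t] = acc / w_sum
    return smoothed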
In some embodiments, when the pose vector of each time step is applied, facial expression parameters of the corresponding time step are also introduced, and the facial expression state of the digital human at that time step is controlled by those facial expression parameters; this can be implemented flexibly by a person skilled in the art and does not affect the implementation of the present application.
According to the above embodiments, the present application determines the target scene type of each time step from the speech feature sequence of the audio data and synchronously generates the head pose parameters of each time step corresponding to that target scene type. Because the head pose parameters are generated according to the corresponding target scene type, they match the target scene, and the cooperative relationship between the sound and the head images is finely controlled, so that the head pose in the generated digital human head images stays coordinated and synchronized with the actual vocal activity in the audio data, making the head movement at each time step more vivid and natural. This helps metaverse-related applications be deployed more quickly, improves the user experience of digital-human-based applications such as social communication, gaming, entertainment, live streaming and e-commerce, and yields more optimistic expected economic benefits.
On the basis of any embodiment of the present application, referring to fig. 4, acquiring the speech feature sequence of the audio data includes:
Step S1110, acquiring audio data generated by a recording device;
In one embodiment, the computer program product of the present application may run in a terminal device such as a mobile phone or a personal computer. When the host user starts a live broadcast, a live-streaming interface is displayed in which the digital human configured by the user may appear; the user can then speak in real time through a recording device such as a microphone, the sound card generates the corresponding audio data, and this audio data is read to drive the head animation of the digital human.
Step S1120, sampling the audio data at a preset frequency to obtain a speech frame sequence, the speech frame sequence comprising a plurality of speech frames corresponding to successive time steps;
To facilitate speech feature extraction, the audio data may be sampled. In general, the sampling frequency only needs to be the same preset frequency used to sample the audio data of the training samples when the head pose control model is trained, so that the input processed by the head pose control model is kept in a consistent format between the training stage and the inference stage.
The sampling of the audio data may follow conventional audio processing. In one embodiment, the audio data is pre-emphasized, framed, and windowed, and then analyzed in the time domain or frequency domain, thereby implementing the audio data analysis. The purpose of pre-emphasis is to boost the high-frequency part of the speech signal and flatten the spectrum; pre-emphasis is typically implemented with a first-order high-pass filter. Before the audio data is analyzed, it also needs to be framed: each frame of audio data is generally set to 20 ms long, and taking frame shift into account, two adjacent frames may overlap by 10 ms. Framing can be achieved by windowing the audio data; different window choices affect the results of the audio analysis, and it is common to implement windowing with the window function corresponding to a Hamming window. Through framing, the audio data is converted into a plurality of speech frames, yielding the speech frame corresponding to each time step determined by sampling at the preset frequency, and these frames form the speech frame sequence.
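The following sketch illustrates the conventional pre-emphasis, 20 ms framing with 10 ms overlap, and Hamming windowing described above for 16 kHz audio; the pre-emphasis coefficient of 0.97 is a common choice and an assumption of this sketch.

import numpy as np

def frame_audio(signal, sr=16000, frame_ms=20, shift_ms=10, pre_emphasis=0.97):
    # First-order high-pass pre-emphasis to boost the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)          # 320 samples per 20 ms frame
    frame_shift = int(sr * shift_ms / 1000)        # 160-sample hop, i.e. 10 ms overlap
    # Assumes the signal is at least one frame long.
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames                                   # (num_frames, frame_len)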
Step S1130, encoding the voice features in the voice frame to obtain feature encoding information, and forming the voice feature sequence from the feature encoding information of each time sequence.
In this embodiment, Mel-spectrum information may be extracted from the speech frame sequence and encoded to obtain the corresponding speech feature sequence. Specifically, a short-time Fourier transform (STFT) is applied to each speech frame of the speech frame sequence to transform it from the time domain to the frequency domain, giving the data corresponding to a spectrogram, i.e., the time-spectrum information of the audio data. Further, a Mel-scale filter bank may be used to filter the time-spectrum information to obtain the Mel-spectrum information, and a corresponding coding algorithm can then be applied to encode the Mel-spectrum information into the feature coding information of each speech frame; the feature coding information of the successive time steps forms the speech feature sequence corresponding to the audio data.
For example, let the hop length be 640 samples, the number of Mel filter banks be 64, the sampling frequency be 16 kHz, and the frame rate of the facial expression animation be 25 fps. Taking audio data of length n seconds as an example, the dimension of the resulting speech feature sequence is (25n, 64), where t = 25n is the time dimension, i.e., the total number of time steps, and 64 is the feature dimension of the feature coding information, i.e., the total number of speech features.
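A hedged librosa-based sketch matching the dimensions of this example is shown below: with a 16 kHz sampling rate, a hop length of 640 samples (25 frames per second) and 64 Mel bands, n seconds of audio yield a speech feature sequence of shape (25n, 64). The use of librosa and of logarithmic compression are illustrative assumptions.

import librosa
import numpy as np

def speech_feature_sequence(path, sr=16000, hop=640, n_mels=64):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=n_mels)
    feats = np.log(mel + 1e-6).T        # (T, 64), T is about 25 * duration_in_seconds
    return feats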
After the speech feature sequence of the audio data has been determined, it can be used as the input information of the head pose control model of the present application, and the head pose control model determines, from this input, the target scene type corresponding to each time step and the head pose parameters under the various scene types.
According to this embodiment, the audio data generated by the recording device is acquired and converted into the corresponding speech feature sequence by real-time sampling and transformation, completing fast encoding of the user's speech, so that the speaking activity can be converted into the digital human's speaking activity in real time, providing a key precondition for realizing various digital-human-based applications.
On the basis of any embodiment of the present application, referring to fig. 5, synchronously determining, according to the semantic feature information of the speech feature sequence, the target scene type corresponding to each time step and the head pose parameters corresponding to the target scene type includes:
Step S1210, extracting the initial feature information corresponding to each time step from the speech feature sequence by means of the convolutional neural network in the preset head pose control model;
The convolutional neural network in the head pose control model may comprise two convolutional layers, each of which activates its output after the convolution operation, and the initial feature information of each time step is extracted through the two stages of convolution.
The first convolutional layer is responsible for receiving the input information, which may be obtained from the speech feature sequence of the audio data by dimension expansion. For the foregoing example, the speech feature sequence (t = 25n, 64) may be expanded into the format (1, 1, t, 64), where the first dimension represents the batch of speech feature sequences, the second the number of channels, the third the time step, and the fourth the length of the speech features. In one embodiment, the number of input channels of the first convolutional layer is set to 1, the number of output channels to 16, the convolution kernel size to 3, and the convolution stride to 1, so that the dimension of the output information is (1, 16, t, 64), realizing simultaneous convolution over the time dimension and the feature dimension.
The second convolutional layer further processes the output of the first convolutional layer, so its number of input channels is correspondingly set to 16, its number of output channels to 1, its convolution kernel size to 1, and its convolution stride likewise to 1. After the second convolutional layer performs the convolution operation and activates its output, the corresponding initial feature information is obtained; it is easy to understand that, with the single channel dimension removed, its dimension is restored to (1, t, 64). The second convolutional layer thus performs channel compression on the output of the first convolutional layer, and through the cooperation of the two convolutional layers the convolutional neural network as a whole performs a preliminary semantic extraction of the speech feature sequence, completing a preliminary representation of its semantics.
Step S1220, performing feature extraction on the initial feature information by means of the first recurrent neural network in the head pose control model to obtain the shallow semantic information corresponding to each time step;
The first recurrent neural network in the head pose control model may, for example, be an LSTM model responsible for extracting deeper semantic features from the initial feature information by associating the context. Corresponding to the dimensions of the initial feature information in the foregoing example, its input length is set to 64 and its hidden layer size may be configured as 128, so the dimension of the input information of the first recurrent neural network is constrained to (1, t, 64), where the first dimension represents the batch of speech feature sequences, the second the sequence length in time, and the third the speech feature length. Correspondingly, after the initial feature information has been semantically mined by the first recurrent neural network, the dimension of the output result is (1, t, 128). It can be seen that the first recurrent neural network further associates the context on the basis of the initial feature information to mine out the shallow semantic information corresponding to each time step of the speech feature sequence. The shallow semantic information better reflects stable characteristics of the audio data such as style, intonation, and rhythm, and can therefore be used for scene type detection.
Step S1230, performing classification mapping on the shallow semantic information by means of the classifier in the head pose control model, and determining, among the plurality of scene types, the target scene type to which each time step belongs;
For the shallow semantic information, i.e., the output of the first recurrent neural network, classification mapping can further be performed by the classifier in the first branch of the head pose control model, which maps it to the preset classification space to obtain the classification probability corresponding to each scene type. Specifically, the classifier maps the shallow semantic information to the output layer through a fully connected layer, and in the output layer the classification probability of each class in the classification space is computed with a softmax function; the class with the highest classification probability indicates the target scene type.
For each time step, a classification probability is obtained for each class in the classification space, and the scene type represented by the class with the largest classification probability is the target scene type, so the target scene type of the whole speech feature sequence at each time step can be determined from the class with the largest classification probability at that time step.
Step 1240, predicting the head gesture parameters under the corresponding scene types by using the second recurrent neural network corresponding to the plurality of scene types in the head gesture control model.
In the head pose control model, each second branch corresponding to a scene type is provided with a corresponding second recurrent neural network, which may also be an LSTM model responsible for extracting deeper semantic features from the shallow semantic information by associating the context, obtaining the deep semantic information of each time step. The second recurrent neural network realizes deep semantic extraction of the shallow semantic information, so appropriate feature compression is required, and the finally obtained features are used directly as the head pose parameters. Thus, for the previous example, the input information of a second recurrent neural network, i.e., the shallow semantic information, has dimension (1, t, 128), while the dimension of its output information is configured as (1, t, sum), where sum denotes the dimension of the pose vector of each time step; for example, if the pose parameters are represented as the 9-dimensional rotation matrix corresponding to the Euler angles, the corresponding dimension is (1, t, 9), i.e., one pose vector is obtained per time step.
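Continuing the sketch of the head pose control model given earlier, the following usage example traces these dimensions for a hypothetical clip of 100 time steps; the values are illustrative only.

import torch

model = HeadPoseControlModel(feat_dim=64, hidden_dim=128, pose_dim=9, num_scenes=3)
speech_feats = torch.randn(1, 100, 64)          # (batch, t, feature): a 4-second clip at 25 fps
scene_logits, poses = model(speech_feats)
print(scene_logits.shape)                       # torch.Size([1, 100, 3]): per-step scene logits
print(poses.shape)                              # torch.Size([1, 3, 100, 9]): per-scene pose vectors
target_scene = scene_logits.softmax(dim=-1).argmax(dim=-1)   # per-step target scene type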
In one embodiment, according to the requirement that the head pose parameters used to control the digital human be expressed as Euler angles, the head pose parameters output by the second recurrent neural network may be format-converted, converting each pose vector from its rotation matrix representation to Euler angles. The details are as follows:
Let the pose vector be expressed as the rotation matrix
R = [[r11, r12, r13], [r21, r22, r23], [r31, r32, r33]],
and let the Euler angles be denoted by the yaw angle α, the pitch angle β and the roll angle γ, so that the Euler angle representation of the pose vector is [α, β, γ], computed from the entries of R; with a yaw-pitch-roll decomposition, for example:
α = arctan2(r21, r11),
β = arcsin(-r31),
γ = arctan2(r32, r33).
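A minimal sketch of this conversion is given below, assuming the pose vector is the row-major flattening of the rotation matrix and an intrinsic yaw-pitch-roll ("ZYX") convention; the exact convention is an assumption of this sketch and must match the one expected by the digital human's three-dimensional model.

import numpy as np
from scipy.spatial.transform import Rotation

def pose_vector_to_euler(pose_vec):
    # pose_vec: length-9 array holding the row-major entries of a rotation matrix.
    R = np.asarray(pose_vec, dtype=float).reshape(3, 3)
    yaw, pitch, roll = Rotation.from_matrix(R).as_euler("ZYX", degrees=True)
    return yaw, pitch, roll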
In practice, the second recurrent neural network obtains, by regression, the pose vector of the audio data at each sampling time step, which can be used to control the head pose of the digital human at that time step, thereby generating the head image corresponding to each time step.
According to the embodiment, the shallow semantic information of the voice feature sequence is skillfully multiplexed, so that the shallow semantic information is used for predicting the target scene type, on the one hand, regression processing is performed by using the second cyclic neural network corresponding to different scene types on the basis of the shallow semantic information, the head posture parameters of the corresponding scene are obtained, and the time sequence synchronization of scene type prediction and head posture parameter prediction is ensured, so that the generation of the head image of the digital person can be controlled more finely by using the corresponding relation.
The training data set of the head posture control model can be prepared on the basis of a high-quality video data set, and the head posture control model is trained in a convergence state through the training data set, so that the head posture control model is suitable for judging the corresponding target scene type and the head posture parameters of the digital person under the corresponding scene type for audio data. For this purpose, referring to fig. 6, before obtaining the voice feature sequence of the audio data according to any embodiment of the present application, the method includes:
step S2100, acquiring a video data set, where the video data set includes a plurality of video data, and each video data identifies a scene type to which the video data set belongs;
the plurality of video data included in the video data set is generally collected correspondingly according to a plurality of scene types aimed by the head gesture control model, for example, corresponding video data in three scenes including an interpersonal dialogue, a lecturer and a singing performance are collected, and each scene collects a large amount of video data, so that various features are enriched as much as possible. When selecting video data, the video data can be moderately screened, for example, the video data with the time length longer than 1 second is selected, the voice signal to noise ratio is required to be larger than 20dB, and the like.
In order to collect training data covering different types of speech as well as different scenes, video data spoken in multiple languages may also be collected, and it may be obtained from any public data source to save costs.
For the video data collected for the different scene types, the corresponding scene type is annotated at collection time, which improves the preparation efficiency of the training data set.
Step S2200, preprocessing the video data, determining a voice characteristic sequence corresponding to the audio data of each video data, a head posture parameter corresponding to the image data and a scene type label thereof as sample data, and forming a training data set by the sample data;
To generate the training data set from the video data set, each video data needs a corresponding preprocessing operation. The main task is to determine, for each video data, the voice feature sequence of standardized duration of its audio data (via audio preprocessing) and the head posture parameters corresponding to its image data, and to generate the corresponding scene type label from the scene type annotated on the video data, thereby constructing the corresponding sample data and storing it in the training data set.
Video data of longer duration can be divided into multiple segments, and sample data can be produced for each segment, yielding multiple sample data.
When the audio data in the video data is multi-channel, it can be converted into mono data; during this channel conversion, the sampling rate can be set, for example, to 16 kHz with 16-bit samples in WAV format.
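For illustration, a possible preprocessing sketch that performs this conversion, assuming the librosa and soundfile libraries are used; the helper name is hypothetical:

```python
import librosa
import soundfile as sf

def to_mono_16k_wav(src_path: str, dst_path: str) -> None:
    """Load any audio track as 16 kHz mono and store it as 16-bit PCM WAV."""
    audio, sr = librosa.load(src_path, sr=16000, mono=True)  # resample + downmix
    sf.write(dst_path, audio, sr, subtype="PCM_16")          # 16 kHz, 16-bit WAV
```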
The head pose parameters corresponding to the image data in the video data may be extracted by means of various third-party tools. When the output of the third-party tool is inconsistent with the specification of the input information of the head posture control model, an appropriate format conversion can be performed so that it matches the model's input, which makes the head posture parameters predicted by the model more robust.
The scene type tag may be represented in the form of a one-hot coded vector.
During preprocessing, the data in the voice feature sequence can be amplitude-normalized, for example so that the maximum and minimum values fall within the interval [-0.8, 0.8], to facilitate model processing.
Step S2300, performing iterative training on the head posture control model based on the sample data in the training data set, and training the head posture control model to a convergence state.
After the preparation of the training data set is completed, it can be used to train the head posture control model of the present application: sample data in the training data set is adopted iteratively to train the head posture control model until it reaches a convergence state.
In each training iteration, one sample data is used: the voice feature sequence in the sample data serves as the input information of the model; the scene type label serves as the supervision label for the classification result of the classifier in the first branch of the model; and the head gesture parameters serve as the supervision labels for the head gesture parameters predicted by the second recurrent neural network in the second branch corresponding to the target scene type predicted by the model. From these, the loss value of the training step is calculated, and the model is gradient-updated according to the loss value, driving it toward convergence.
In practical training, to improve the robustness of the model, gradient updates of the head gesture control model can be controlled in a batch-iterative manner: after a batch of several sample data has been fed through the model and the corresponding loss values determined, the gradient update is not performed per individual sample; instead the loss values of all sample data in the batch are combined, for example by taking their average, and the model is gradient-updated according to the result.
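For illustration, a minimal sketch of such a batch-averaged update loop, assuming a PyTorch-style optimizer; the compute_loss method and the data loader are hypothetical placeholders for the model's loss computation and the training data set described above:

```python
import torch

def train_one_epoch(model: torch.nn.Module, train_loader, optimizer) -> None:
    model.train()
    for speech_feats, scene_labels, pose_labels in train_loader:
        optimizer.zero_grad()
        # Hypothetical helper returning one comprehensive loss per sample.
        losses = model.compute_loss(speech_feats, scene_labels, pose_labels)
        loss = losses.mean()      # combine the per-sample losses of the batch
        loss.backward()           # one gradient update per batch, not per sample
        optimizer.step()
```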
According to the above embodiment, the present application distinguishes different scene types and uses video data from public data sources to produce the sample data required for training the head posture control model, so the sample production cost is low. After the head posture control model is trained to convergence with this sample data, it can serve the head posture control of a digital person; the overall implementation is inexpensive and efficient, and through this training process the resulting head posture control model outputs head posture parameters with better robustness.
Referring to fig. 7, on the basis of any embodiment of the present application, preprocessing the video data includes:
step S2210 of separating audio data and image data from the video data;
The video data is composed of audio data and image data, which can be separated to obtain the audio data and image data of the video data for further processing. When they are separated, the temporal correspondence between the audio data and the image data is established, so that time-sequence synchronization between the various pieces of information derived from them can be maintained later.
Step S2220, extracting head posture parameters of a head image in the image data according to the image data, and determining sampling frequency of audio data according to the frame rate of the image data;
The image data contains a number of image frames per second according to the frame rate inherent to the video data; for example, when the frame rate is 25, there are 25 image frames per second, which means each image frame corresponds to a duration of 1/25 s = 40 milliseconds. The sampling frequency for sampling the audio data can therefore be determined from this correspondence between frame rate and frame duration, so that each voice frame obtained by sampling the audio data likewise spans 40 milliseconds.
For uniform operation, the video data can, for example, be converted in advance to a 25 frames/second format, so that the input information of the whole head posture control model is normalized; other frame rates may, however, also be processed.
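As a worked example of this alignment (assuming a 25 frames/second video and a 16 kHz audio sampling rate):

```python
# Worked alignment check under the stated assumptions: 25 fps video, 16 kHz audio.
frame_rate = 25                                  # image frames per second
sample_rate = 16_000                             # audio samples per second
frame_duration_ms = 1000 / frame_rate            # 40.0 ms per image frame
samples_per_voice_frame = sample_rate // frame_rate  # 640 audio samples per 40 ms voice frame
print(frame_duration_ms, samples_per_voice_frame)    # 40.0 640
```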
On the basis of the above image data, a third-party tool can be used to extract the head pose parameters of the head image from the image data. For example, Openface software can be adopted to obtain, in batches, related information such as facial expression parameters and head posture parameters from the video data. The information obtained, in particular the head pose parameters, may be stored as a local file.
In one embodiment, in order to remove noise and jitter introduced during video data recording, a smoothing process is applied to the euler angles of each frame obtained through Openface. The smoothing formula is as follows, where input_n is the input value of the n-th frame, euler_{n-1} is the smoothed value of the previous frame, and α is an empirical value that can be chosen flexibly, for example 0.8:

euler_n = (1 − α) · input_n + α · euler_{n−1}

euler_0 = input_0
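For illustration, a minimal sketch of this smoothing applied to a whole sequence of per-frame euler angles, assuming NumPy arrays; the function name is hypothetical:

```python
import numpy as np

def smooth_euler(angles: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Exponential smoothing of per-frame euler angles, as in the formula above.
    angles: (num_frames, 3); alpha is the empirical smoothing weight."""
    out = np.empty_like(angles)
    out[0] = angles[0]                           # euler_0 = input_0
    for n in range(1, len(angles)):
        out[n] = (1 - alpha) * angles[n] + alpha * out[n - 1]
    return out
```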
The head posture parameters obtained from Openface are expressed in euler angles; for convenience of processing by the head posture control model of the present application, they can be converted from the euler angle representation to the rotation matrix representation. Under the same yaw-pitch-roll (Z-Y-X) convention as above, the conversion is, for example:

R = Rz(α) · Ry(β) · Rx(γ)
  = [ cosα·cosβ   cosα·sinβ·sinγ − sinα·cosγ   cosα·sinβ·cosγ + sinα·sinγ
      sinα·cosβ   sinα·sinβ·sinγ + cosα·cosγ   sinα·sinβ·cosγ − cosα·sinγ
      −sinβ       cosβ·sinγ                    cosβ·cosγ ]
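For illustration, a sketch of this euler-to-rotation-matrix conversion under the same assumed convention:

```python
import numpy as np

def euler_to_rotation_matrix(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """Z-Y-X composition R = Rz(yaw) @ Ry(pitch) @ Rx(roll); convention assumed."""
    ca, sa = np.cos(yaw), np.sin(yaw)
    cb, sb = np.cos(pitch), np.sin(pitch)
    cg, sg = np.cos(roll), np.sin(roll)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
    return Rz @ Ry @ Rx
```

This conversion is the inverse of the rotation-matrix-to-euler conversion sketched earlier, so round-tripping a pose through both functions should return the original angles (up to numerical precision).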
step S2230, performing audio preprocessing according to the audio data to obtain a voice characteristic sequence composed of characteristic coding information of voice frames of each time sequence obtained by encoding after sampling according to sampling frequency;
The process of extracting voice features from the audio data and encoding them to obtain the corresponding voice feature sequence may proceed as described earlier in this application, and is not repeated here.
It should be noted that, for the purpose of producing sample data, the audio data needs to be sampled according to the sampling frequency determined by adapting to the frame rate of the image data to generate corresponding voice frames, so that the duration of each voice frame is consistent with the duration of each image frame.
Step S2240, converts the scene type of the video data into a scene type tag expressed in a one-hot encoding vector.
Since the scene type of the video data has been annotated in advance, it only needs to be adapted to the form required for actual computation by the model, i.e. represented numerically; in the present application it is converted into a scene type label expressed as a one-hot encoded vector.
For example, let the interpersonal dialogue scene originally be labeled 0, the lecture scene be labeled 1, and the singing performance scene be labeled 2, and let the scene type of a given video data be the lecture scene, i.e. 1; the converted scene type label is then expressed as:
[0, 1, 0], where the first element corresponds to the interpersonal dialogue scene, the second element to the lecture scene, and the third element to the singing performance scene. A binary value of 0 indicates that the video does not belong to the corresponding scene and 1 that it does; each scene type label has exactly one element with value 1, all other elements being 0.
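For illustration, a trivial sketch of this label conversion; the scene names and helper are hypothetical:

```python
# Hypothetical helper that turns an integer scene label into the one-hot
# scene type tag used as the classification supervision signal.
SCENE_TYPES = ["interpersonal_dialogue", "lecture", "singing_performance"]

def scene_label_to_one_hot(label: int) -> list:
    vec = [0] * len(SCENE_TYPES)
    vec[label] = 1
    return vec

print(scene_label_to_one_hot(1))   # [0, 1, 0] -> lecture scene
```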
According to the above embodiment, the video data is preprocessed and converted into sample data, which establishes a time-sequence-based correspondence between the voice feature sequence of the audio data on the one hand and the head gesture parameters of the image data and the scene type label on the other. When such data is used to train the head gesture control model, this correspondence gives the model the ability to predict, from a voice feature sequence, the target scene type corresponding to each time sequence and the head gesture parameters corresponding to each scene type, so that the head gesture parameters of the appropriate scene can then be selected according to the target scene type and used to control the head movements of the digital person. The overall implementation cost is low, and because the sample data is carefully refined, the trained model also produces more robust outputs.
On the basis of any embodiment of the present application, referring to fig. 8, performing iterative training on the head posture control model based on sample data in the training data set, training the head posture control model to a convergence state, including:
step S2310, obtaining single sample data, and inputting a voice characteristic sequence into the head gesture control model to predict a target scene type corresponding to the voice characteristic sequence and head gesture parameters corresponding to each scene type;
According to the architecture and principle of the head gesture control model disclosed above, after the voice feature sequence in the sample data, organized by time-sequence sampling, is provided as input information to the head gesture control model, the classification probabilities corresponding to each scene type can be obtained from the first branch to determine the target scene type, while the head gesture parameters corresponding to each time sequence are obtained from each second branch.
Step S2320, calculating a comprehensive loss value between the target scene type and the head gesture parameter predicted corresponding to the target scene type based on the scene type tag and the head gesture parameter in the sample data;
For each sample data, the loss values of the target scene type predicted by the head pose control model and of the head pose parameters under that target scene type can be calculated correspondingly, and these loss values are aggregated into a comprehensive loss value for the sample data.
It should be noted that, since the head pose control model has already predicted the target scene type for the sample data, the loss values of the head pose parameters obtained from the second branches not belonging to the target scene type need not be considered; only the second branch belonging to the target scene type is taken into account.
For this purpose, since the scene type label of the sample data has already been converted into a one-hot encoded vector representation, the following formula can be used to uniformly calculate the comprehensive loss value loss corresponding to each sample data:
loss = y_cls(0)·loss_talking + y_cls(1)·loss_speech + y_cls(2)·loss_singing + loss_type
where loss_talking, loss_speech and loss_singing respectively denote the loss values of the head posture parameters for the interpersonal dialogue scene, the lecture scene and the singing performance scene, and y_cls(·) denotes the element values of the one-hot encoded vector of the scene type predicted by the first branch. For example, when the predicted target scene type is the lecture scene, y_cls(1) is 1 and y_cls(0) and y_cls(2) are 0, so in practice the whole formula only adds the head posture parameter loss of the lecture scene to the loss value loss_type corresponding to the target scene type in order to determine the comprehensive loss value loss, and the other two scenes are not used.
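For illustration, a minimal sketch of this gated combination, assuming PyTorch tensors; the per-branch scene losses and the classification loss are assumed to have been computed separately:

```python
import torch

def comprehensive_loss(scene_losses: torch.Tensor,   # shape (3,): dialogue, lecture, singing
                       y_cls: torch.Tensor,          # one-hot vector, shape (3,)
                       loss_type: torch.Tensor) -> torch.Tensor:
    # Only the branch matching the (one-hot) target scene type contributes.
    return (y_cls * scene_losses).sum() + loss_type
```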
Step S2330, gradient updating is carried out on the head posture control model according to the comprehensive loss value until the head posture control model is trained to be converged by iterating a plurality of sample data.
After the comprehensive loss value of the sample data has been determined, whether to terminate training of the head posture control model is decided according to whether the comprehensive loss value reaches a preset threshold or whether an iteration count threshold is reached. When neither threshold is reached, the head posture control model has not converged; the model is gradient-updated according to the comprehensive loss value, and the next sample data is fetched from the training data set to start a new iteration. Once one of the thresholds is reached, the head posture control model has reached the convergence state, the whole training process ends, and the model can be put online for inference.
Of course, as described above, a batch update method may also be adopted: the comprehensive loss values of several sample data are first aggregated and then compared with the preset threshold to determine whether the head posture control model has converged, after which the decision is made whether to perform a gradient update and whether to continue iterative training.
According to the above embodiment, during training of the head gesture control model, the comprehensive loss value required for model updating is determined from the head gesture parameters output for the scene indicated by the first branch's target scene type, so that both the target scene type of each time sequence and the head gesture parameters under that specific scene type are effectively supervised. The head gesture control model can thus synchronously and accurately predict, for each time sequence of the audio data, the corresponding target scene type and the head gesture parameters of that scene type, always keeping the two synchronized in time. The model can serve as a basic component, and when used for head motion control of a digital person it ensures that the generated head animation is more robust and expressive.
On the basis of any embodiment of the present application, referring to fig. 9, calculating, based on a scene type tag and a head pose parameter in the sample data, a comprehensive loss value between the target scene type and a head pose parameter predicted corresponding to the target scene type includes:
Step S2321, calculating a cross loss value corresponding to the target scene type based on the scene type label in the sample data;
After the head posture control model has produced the classification result for a sample data, the scene type label in the sample data can be used to calculate the corresponding cross entropy loss as the cross loss value loss_type. One exemplary form is:

loss_type = −(1/(M·N)) · Σ_{i=1..M} Σ_{t=1..N} y_i · log(ŷ_i(t))

where M is the batch size used in each batch, N is the time-sequence length, i.e. the number of time steps, y_i is the scene type label of the i-th sample data, ŷ_i(t) is the classification result predicted for time step t by the classifier in the first branch of the head posture control model, and t indexes the time sequence.
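For illustration, a sketch of this classification term, assuming a PyTorch implementation in which the first-branch classifier outputs per-time-step logits; names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def type_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy averaged over batch and time steps.
    logits: (M, N, num_scenes) raw classifier outputs per time step.
    labels: (M, N) integer scene indices (the per-sample label repeated over N)."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1))
```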
Step S2322, calculating a scene loss value between the predicted head posture parameters corresponding to the target scene type based on the head posture parameters in the sample data, wherein the scene loss value is a sum value of a similarity loss value corresponding to each time sequence and an inter-frame loss value between adjacent time sequences;
While the head gesture control model predicts the classification result of the scene type in the first branch, each second branch predicts the head gesture parameters of its corresponding scene type from the sample data. The loss value of these head gesture parameters, called the scene loss value, can be determined by calculating and summing a similarity loss value and an inter-frame loss value, as follows:

loss_scene = loss_L1 + loss_frame

where loss_scene denotes the loss of a particular scene type and may refer to any of loss_talking, loss_speech, loss_singing of the previous example; loss_L1 denotes the similarity loss value calculated with the L1 norm between the head pose parameters in the sample data and the head pose parameters predicted by the second branch of the corresponding scene; and loss_frame denotes the inter-frame loss value corresponding to the current time sequence.
The similarity loss value can be calculated, for example, as:

loss_L1 = (1/(M·N)) · Σ_{i=1..M} Σ_{t=1..N} Σ_j | hp_i(t, j) − ĥp_i(t, j) |

where, as before, M is the batch size and N is the time-sequence length, hp_i(t, j) denotes the j-th element of the rotation-matrix head pose label of the i-th sample at time step t, and ĥp_i(t, j) is the corresponding model prediction.
The inter-frame loss value can be calculated analogously over adjacent time sequences, for example as an L1 penalty on the deviation between the labelled and predicted frame-to-frame changes:

loss_frame = (1/(M·(N−1))) · Σ_{i=1..M} Σ_{t=2..N} Σ_j | (hp_i(t, j) − hp_i(t−1, j)) − (ĥp_i(t, j) − ĥp_i(t−1, j)) |

The parameter description is the same as for the similarity loss value above and is not repeated.
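For illustration, a sketch of the two scene-loss terms, assuming PyTorch tensors of shape (batch, time steps, 9); the inter-frame term follows the hedged reconstruction above and is therefore itself an assumption:

```python
import torch

def similarity_loss(hp: torch.Tensor, hp_hat: torch.Tensor) -> torch.Tensor:
    # L1 distance between labelled and predicted pose vectors, averaged.
    return (hp - hp_hat).abs().mean()

def inter_frame_loss(hp: torch.Tensor, hp_hat: torch.Tensor) -> torch.Tensor:
    d_true = hp[:, 1:] - hp[:, :-1]          # ground-truth frame-to-frame change
    d_pred = hp_hat[:, 1:] - hp_hat[:, :-1]  # predicted frame-to-frame change
    return (d_true - d_pred).abs().mean()

def scene_loss(hp: torch.Tensor, hp_hat: torch.Tensor) -> torch.Tensor:
    # loss_scene = loss_L1 + loss_frame
    return similarity_loss(hp, hp_hat) + inter_frame_loss(hp, hp_hat)
```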
It can be seen that the similarity loss value supervises the model so that the predicted head posture parameters become increasingly accurate, while the inter-frame loss value makes the predicted head posture parameters respect the smooth relationship between adjacent frames. After the resulting comprehensive loss value is used to update the model, the head posture parameters produced by the head posture control model are therefore more robust and stable, and better matched to the voice.
Step S2323, adding the cross loss value corresponding to the target scene type and the scene loss value to be used as the comprehensive loss value corresponding to the sample data.
After the cross loss value has been determined from the target scene type and the scene loss value from the head gesture parameters of the second branch corresponding to the target scene type, the two are summed to form the comprehensive loss value of the sample data, which can be used both to decide whether the head gesture control model has converged and to perform the gradient update.
According to the above embodiment, when measuring the loss value required for model updating during training of the head gesture control model, information from several aspects is considered comprehensively: the target scene type predicted by the model is used to exclude the head gesture parameters that do not belong to the target scene type, so that the scene loss value is determined only from the head gesture parameters belonging to the target scene type; the scene loss value itself accounts for losses in both space and time sequence, being determined from the similarity loss value and the inter-frame loss value; and finally the comprehensive loss value of each sample data is determined from the scene loss value and the cross loss value corresponding to the target scene type. The comprehensive loss value therefore measures the model's prediction loss effectively, so that the weight parameters of the model can be adjusted quickly and accurately and a stronger inference capability is obtained.
On the basis of any embodiment of the present application, referring to fig. 10, controlling, according to the target scene type, a three-dimensional model of a digital person to generate a head image according to the head pose parameter, including:
step S1310, obtaining a smoothing weight set corresponding to the target scene type, and performing time-sequence smoothing processing on the head gesture parameters to obtain optimized head gesture parameters;
In different scene types, the rhythm of human head movement differs, yet within the same scene the head movement should remain reasonably stable. The head gesture parameters generated by the head gesture control model can therefore be optimized by setting different smoothing weights for the different scene types. For example, the head pose parameters obtained by the head pose control model for one audio data may be smoothed using the following formula:
output_n = (1 − α) · input_n + α · output_{n−1}

output_0 = [0, 0, 0]

where input_n is the head pose parameter corresponding to the n-th frame, i.e. the n-th time sequence, output_n is the smoothed output for the n-th frame, and α is the smoothing weight corresponding to the scene type; it takes different values for different target scene types and is selected according to the predicted target scene type. For example, for the interpersonal dialogue scene, the lecture scene and the singing performance scene, the smoothing weight α may be set to 0.8, 0.85 and 0.7 respectively; in particular, when a voice switch occurs, 0.95 may be used.
After the head posture parameters output by the head posture control model have been optimized by applying this formula, the optimized head posture parameters are obtained.
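For illustration, a minimal sketch of this scene-dependent smoothing at inference time, assuming NumPy; the scene-to-weight mapping simply encodes the example values given above:

```python
import numpy as np

# Example smoothing weights per target scene type (values as stated above).
SCENE_ALPHA = {"interpersonal_dialogue": 0.8, "lecture": 0.85, "singing_performance": 0.7}

def smooth_pose_sequence(poses: np.ndarray, scene: str) -> np.ndarray:
    """poses: (num_frames, dim) head pose parameters; returns optimized poses."""
    alpha = SCENE_ALPHA[scene]
    out = np.zeros_like(poses)                   # output_0 = [0, 0, ..., 0]
    for n in range(1, len(poses)):
        out[n] = (1 - alpha) * poses[n] + alpha * out[n - 1]
    return out
```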
Step S1320, applying the facial expression parameters corresponding to the optimized head posture parameters and the time sequence to a three-dimensional model of the digital person, and controlling the digital person to generate corresponding head posture and facial expression;
Generally, when a digital person generates head animation under voice control, corresponding control of the digital person's facial expression is also desired. The facial expression parameters for each time sequence can be determined in various ways, for example generated by conventional deep-learning-based voice-driven expression models. Based on the determined facial expression parameters, the head posture parameters and facial expression parameters of the same time sequence are applied synchronously, according to their temporal correspondence, to the three-dimensional model of the digital person: the head posture parameters control the three-dimensional model to switch to the corresponding head activity state, and the facial expression parameters control it to switch to the corresponding facial expression. Applying such control for every time sequence determines the head activity state of the digital person's three-dimensional model at each time sequence.
Step S1330, rendering the three-dimensional model to generate head images of the digital person corresponding to each time sequence.
After the three-dimensional model of the digital person has been controlled to present its head activity state at each time sequence, the three-dimensional model can be rendered to generate the image frame of the digital person's head at the corresponding time sequence. This image frame presents the head image of the digital person, showing not only a head activity state but also a facial expression state. It will be appreciated that organizing the image frames in their time-sequence order forms an animation of the digital person's head activity.
According to the above embodiment, on the basis of voice-driven head activity of the digital person, the generation of the digital person's head animation is made more vivid: head gesture parameters are predicted separately for the different voice scene types, and the target scene type of the voice is supplied for head animation generation, so that when generating the digital person's head animation the head gesture parameters of the corresponding scene can be selected according to the target scene type of the voice and combined with the facial expression parameters to control the movement of the digital person's three-dimensional model. The resulting head images match the scene to which the voice belongs; the head images of successive time sequences can be assembled into a video stream which, when played, presents the head animation of the digital person. The quality of voice-driven digital person animation is thereby improved comprehensively, and the method can serve as a basic technical framework for related downstream applications, which can be expected to bring transformative effects to the industry.
Referring to fig. 11, a head pose driving apparatus according to an aspect of the present application includes an audio acquisition module 1100, a parameter generation module 1200, and an image generation module 1300, wherein: the audio obtaining module 1100 is configured to obtain a voice feature sequence of the audio data, where the voice feature sequence includes feature coding information of a plurality of voice frames corresponding to time sequences; the parameter generating module 1200 is configured to determine, according to semantic feature information of the voice feature sequence, a target scene type corresponding to each time sequence and a head gesture parameter corresponding to the target scene type synchronously; the image generation module 1300 is configured to control the three-dimensional model of the digital person to generate a head image according to the head pose parameters according to the target scene type.
On the basis of any embodiment of the present application, the parameter generating module 1200 includes: the convolution operation unit is used for extracting initial characteristic information corresponding to each time sequence from the voice characteristic sequence by adopting a convolution neural network in a preset head gesture control model; the shallow processing unit is arranged for extracting the characteristics of the initial characteristic information by adopting a first cyclic neural network in the head gesture control model to obtain shallow semantic information corresponding to each time sequence; the classification processing unit is arranged to adopt a classifier in the head gesture control model to carry out classification mapping on the shallow semantic information and determine the target scene type to which each time sequence belongs in a plurality of scene types; and the parameter prediction unit is used for predicting the head gesture parameters under the corresponding scene types by adopting a second cyclic neural network which is arranged in the head gesture control model and corresponds to the scene types.
On the basis of any embodiment of the present application, the head posture driving device of the present application includes a material acquisition module, a sample processing module and a training implementation module, wherein: the material acquisition module is configured to acquire a video data set, the video data set comprising a plurality of video data, each of which identifies the scene type to which it belongs; the sample processing module is configured to preprocess the video data, determine the voice feature sequence corresponding to the audio data of each video data, the head posture parameters corresponding to the image data and the scene type label thereof as sample data, and form a training data set from the sample data; and the training implementation module is configured to perform iterative training on the head posture control model based on the sample data in the training data set, training it to a convergence state.
On the basis of any embodiment of the present application, the sample processing module includes: a picture-sound dividing unit configured to separate audio data and image data from the video data; an image processing unit configured to extract a head pose parameter of a head image therein from the image data, and determine a sampling frequency of audio data from a frame rate of the image data; the voice processing unit is used for carrying out audio preprocessing according to the audio data to obtain a voice characteristic sequence formed by characteristic coding information of voice frames of each time sequence obtained by encoding after sampling according to sampling frequency; and a tag processing unit configured to convert the scene type of the video data into a scene type tag expressed in a one-hot coded vector.
On the basis of any embodiment of the present application, the training implementation module includes: the sample prediction unit is used for acquiring single sample data, inputting a voice characteristic sequence in the single sample data into the head gesture control model so as to predict a target scene type corresponding to the voice characteristic sequence and head gesture parameters corresponding to each scene type; a loss calculation unit configured to calculate a comprehensive loss value between the target scene type and a head pose parameter predicted corresponding to the target scene type based on a scene type tag and the head pose parameter in the sample data; and the model updating unit is used for carrying out gradient updating on the head posture control model according to the comprehensive loss value until the head posture control model is trained to be converged by iterating a plurality of sample data.
On the basis of any embodiment of the present application, the loss calculation unit includes: a type calculation unit configured to calculate a cross-loss value corresponding to the target scene type based on a scene type tag in the sample data; an image calculation unit configured to calculate a scene loss value between the predicted head pose parameters corresponding to the target scene type based on the head pose parameters in the sample data, the scene loss value being a sum of a similarity loss value corresponding to each timing sequence and an inter-frame loss value between adjacent timing sequences; and the loss fusion unit is used for adding the cross loss value corresponding to the target scene type and the scene loss value to be used as the comprehensive loss value corresponding to the sample data.
On the basis of any embodiment of the present application, the image generating module 1300 includes: the smoothing optimization unit is used for acquiring smoothing weights which are set corresponding to the target scene types, and performing time sequence smoothing on the head gesture parameters to obtain optimized head gesture parameters; a parameter application unit configured to apply the optimized head pose parameter and facial expression parameter corresponding to time sequence to a three-dimensional model of a digital person, and control the digital person to generate a corresponding head pose and facial expression; and the rendering generation unit is used for rendering the three-dimensional model and generating head images of the digital person corresponding to each time sequence.
Another embodiment of the present application also provides a head posture driving apparatus. As shown in fig. 12, the internal structure of the head posture driving device is schematically shown. The head pose drive device includes a processor, a computer readable storage medium, a memory, and a network interface connected by a system bus. The computer readable non-volatile storage medium of the head posture driving device stores an operating system, a database and computer readable instructions; the database can store information sequences, and when the computer readable instructions are executed by the processor, they cause the processor to implement a head posture driving method.
The processor of the head pose drive device is configured to provide computing and control capabilities to support the operation of the entire head pose drive device. The memory of the head pose drive device may store computer readable instructions that, when executed by the processor, cause the processor to perform the head pose driving method of the present application. The network interface of the head pose drive device is used for communicating with connected terminals.
Those skilled in the art will appreciate that the structure shown in fig. 12 is merely a block diagram of a portion of the structure associated with the present application and does not constitute a limitation of the head pose drive apparatus to which the present application is applied, and that a particular head pose drive apparatus may include more or less components than shown in the drawings, or may combine certain components, or have a different arrangement of components.
The processor in this embodiment is configured to perform the specific functions of each module in fig. 11, and the memory stores the program codes and various types of data required for executing the above modules or sub-modules. The network interface is used to realize data transmission between user terminals or servers. The nonvolatile readable storage medium in this embodiment stores the program codes and data necessary for executing all modules of the head posture driving device of the present application, and the server can call these program codes and data to execute the functions of all the modules.
The present application also provides a non-transitory readable storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the head pose drive method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which when executed by one or more processors implement the steps of the method described in any of the embodiments of the present application.
It will be appreciated by those skilled in the art that implementing all or part of the above-described methods according to the embodiments of the present application may be accomplished by way of a computer program stored in a non-transitory readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a computer readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
In summary, the present application can accurately predict the target scene type from audio data, obtain the head gesture parameters corresponding to the target scene type, and generate a digital-person head image of the corresponding target scene according to those head gesture parameters. The head movements shown in the head image are coordinated with the original sound and match the style of the target scene, appearing more vivid and natural, which suits various application scenarios involving digital persons.

Claims (10)

1. A head posture driving method, characterized by comprising:
acquiring a voice characteristic sequence of the audio data, wherein the voice characteristic sequence comprises characteristic coding information of a plurality of voice frames corresponding to time sequences;
synchronously determining a target scene type corresponding to each time sequence and a head posture parameter corresponding to the target scene type according to semantic feature information of the voice feature sequence;
and controlling a three-dimensional model of the digital person according to the target scene type to generate a head image according to the head posture parameter.
2. The head pose driving method according to claim 1, wherein determining a target scene type corresponding to each time sequence and a head pose parameter corresponding to the target scene type synchronously according to semantic feature information of the voice feature sequence comprises:
extracting initial characteristic information corresponding to each time sequence from the voice characteristic sequence by adopting a convolutional neural network in a preset head gesture control model;
extracting features of the initial feature information by adopting a first cyclic neural network in the head posture control model to obtain shallow semantic information corresponding to each time sequence;
Classifying and mapping the shallow semantic information by adopting a classifier in the head gesture control model, and determining a target scene type to which each time sequence belongs in a plurality of scene types;
and predicting the head posture parameters under the corresponding scene types by adopting a second cyclic neural network which is arranged in the head posture control model and corresponds to the scene types.
3. The head pose driving method according to claim 2, comprising, before acquiring a voice feature sequence of audio data:
obtaining a video data set, wherein the video data set comprises a plurality of video data, and each video data identifies the type of a scene to which the video data set belongs;
preprocessing the video data, determining a voice characteristic sequence corresponding to audio data of each video data, a head posture parameter corresponding to image data and a scene type label thereof as sample data, and forming a training data set by the sample data;
and performing iterative training on the head posture control model based on sample data in the training data set, and training the head posture control model to a convergence state.
4. A head pose driving method according to claim 3, wherein preprocessing the video data comprises:
Separating audio data and image data from the video data;
extracting head posture parameters of a head image in the image data according to the image data, and determining the sampling frequency of the audio data according to the frame rate of the image data;
performing audio preprocessing according to the audio data to obtain a voice characteristic sequence formed by characteristic coding information of voice frames of each time sequence obtained by encoding after sampling according to sampling frequency;
the scene type of the video data is converted into a scene type tag represented by a one-hot encoded vector.
5. A head pose driving method according to claim 3, wherein performing iterative training on the head pose control model based on sample data in the training data set, training it to a convergence state, comprises:
acquiring single sample data, and inputting a voice characteristic sequence in the single sample data into the head gesture control model so as to predict a target scene type corresponding to the voice characteristic sequence and head gesture parameters corresponding to each scene type;
calculating a comprehensive loss value between the target scene type and a head gesture parameter predicted corresponding to the target scene type based on a scene type tag and the head gesture parameter in the sample data;
And carrying out gradient update on the head posture control model according to the comprehensive loss value until the head posture control model is trained to be converged by iterating a plurality of sample data.
6. The head pose driving method according to claim 5, wherein calculating a comprehensive loss value between the target scene type and a head pose parameter predicted corresponding to the target scene type based on a scene type tag and a head pose parameter in the sample data comprises:
calculating a cross loss value corresponding to the target scene type based on the scene type label in the sample data;
calculating scene loss values among the predicted head posture parameters corresponding to the target scene type based on the head posture parameters in the sample data, wherein the scene loss values are sum values of similar loss values corresponding to each time sequence and interframe loss values between adjacent time sequences;
and adding the cross loss value corresponding to the target scene type and the scene loss value as the comprehensive loss value corresponding to the sample data.
7. The head pose driving method according to any one of claims 1 to 6, wherein controlling a three-dimensional model of a digital person according to the target scene type to generate a head image according to the head pose parameters comprises:
Acquiring a smoothing weight set corresponding to the target scene type, and performing time sequence smoothing on the head gesture parameters to obtain optimized head gesture parameters;
applying the facial expression parameters corresponding to the optimized head posture parameters and the time sequence to a three-dimensional model of the digital person, and controlling the digital person to generate corresponding head posture and facial expression;
rendering the three-dimensional model to generate head images of the digital person corresponding to each time sequence.
8. A head posture driving device, characterized by comprising:
the audio acquisition module is used for acquiring a voice characteristic sequence of the audio data, wherein the voice characteristic sequence comprises characteristic coding information of a plurality of voice frames corresponding to time sequences;
the parameter generation module is used for synchronously determining a target scene type corresponding to each time sequence and a head gesture parameter corresponding to the target scene type according to semantic feature information of the voice feature sequence;
and the image generation module is used for controlling a three-dimensional model of the digital person according to the target scene type to generate a head image according to the head posture parameters.
9. A head pose drive device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke a computer program stored in the memory to perform the steps of the method according to any of claims 1 to 7.
10. A non-transitory readable storage medium, characterized in that it stores in form of computer readable instructions a computer program implemented according to the method of any one of claims 1 to 7, which when invoked by a computer, performs the steps comprised by the corresponding method.
CN202211635506.0A 2022-12-19 2022-12-19 Head posture driving method and device, equipment, medium and product thereof Pending CN116189034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211635506.0A CN116189034A (en) 2022-12-19 2022-12-19 Head posture driving method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211635506.0A CN116189034A (en) 2022-12-19 2022-12-19 Head posture driving method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN116189034A true CN116189034A (en) 2023-05-30

Family

ID=86447983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211635506.0A Pending CN116189034A (en) 2022-12-19 2022-12-19 Head posture driving method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN116189034A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689783A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field
CN117689783B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field
CN117745597A (en) * 2024-02-21 2024-03-22 荣耀终端有限公司 Image processing method and related device
CN118138833A (en) * 2024-05-07 2024-06-04 深圳威尔视觉科技有限公司 Digital person construction method and device and computer equipment
CN118411452A (en) * 2024-07-01 2024-07-30 中邮消费金融有限公司 Digital person generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN116189034A (en) Head posture driving method and device, equipment, medium and product thereof
CN112099628A (en) VR interaction method and device based on artificial intelligence, computer equipment and medium
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
CN110517689A (en) A kind of voice data processing method, device and storage medium
Fu et al. Audio/visual mapping with cross-modal hidden Markov models
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN112837669B (en) Speech synthesis method, device and server
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
CN114360493A (en) Speech synthesis method, apparatus, medium, computer device and program product
CN114329041A (en) Multimedia data processing method and device and readable storage medium
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
JPH09507921A (en) Speech recognition system using neural network and method of using the same
CN118158462A (en) Digital man-driven video playing method under communication network
Barbulescu et al. Audio-visual speaker conversion using prosody features
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN112885318A (en) Multimedia data generation method and device, electronic equipment and computer storage medium
CN113744759B (en) Tone color template customizing method and device, equipment, medium and product thereof
CN113823300B (en) Voice processing method and device, storage medium and electronic equipment
CN116469369A (en) Virtual sound synthesis method and device and related equipment
CN116959447A (en) Training method, device, equipment and medium of voice conversion model
Gao Audio deepfake detection based on differences in human and machine generated speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination