CN116246649A - Head action simulation method in three-dimensional image pronunciation process - Google Patents

Head action simulation method in three-dimensional image pronunciation process

Info

Publication number
CN116246649A
CN116246649A (application CN202211671532.9A)
Authority
CN
China
Prior art keywords
mouth
head
image
lip
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211671532.9A
Other languages
Chinese (zh)
Inventor
周安斌
晏武志
李鑫
彭辰
潘见见
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jindong Digital Creative Co ltd
Original Assignee
Shandong Jindong Digital Creative Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jindong Digital Creative Co ltd filed Critical Shandong Jindong Digital Creative Co ltd
Priority to CN202211671532.9A
Publication of CN116246649A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/18Details of the transformation process
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads

Abstract

The invention provides a head action simulation method for the three-dimensional image pronunciation process, belonging to the technical field of three-dimensional virtual images. The method comprises: obtaining face videos and corresponding audio from a video library, aligning video frames with audio frames, and extracting multi-frame face images, head posture parameters and Mel spectra as training samples; preprocessing the face images to generate face images with the mouth erased; establishing a three-dimensional image head model and training it with the training samples, the model comprising an audio feature extraction module, a lip synchronization module, a mouth generation module, a head posture module and a fusion module; and generating, with the trained three-dimensional image head model, the head motion for a specific audio. The method greatly reduces the amount of computation while keeping the head posture well linked with the pronunciation, avoiding stiffness in the three-dimensional image pronunciation process.

Description

Head action simulation method in three-dimensional image pronunciation process
Technical Field
The invention belongs to the technical field of three-dimensional virtual figures, and particularly relates to a head action simulation method in a three-dimensional figure pronunciation process.
Background
Many people make small head movements while speaking without noticing it. When a camera is used to capture images of a speaking face, the mouth must be tracked as the head moves, which introduces a large amount of computation. On the other hand, these small head-posture changes differ from person to person and are not universal; ignoring them during acquisition allows speaking-face images to be captured more quickly and improves processing efficiency.
Chinese invention patent CN111081270B (application number CN201911314031.3) discloses a real-time audio-driven virtual character mouth-shape synchronization control method. The method comprises: identifying phoneme probabilities from a real-time speech stream; filtering the phoneme probabilities; converting the sampling rate of the phoneme probabilities to match the virtual-character rendering frame rate; and converting the phoneme probabilities into a standard mouth-shape configuration and rendering the mouth shape. The method avoids having to transmit a phoneme sequence or mouth-shape sequence synchronously with the audio stream, significantly reduces the complexity, coupling and implementation difficulty of the system, and is suitable for various application scenarios in which virtual characters are rendered on display devices.
However, in that invention and in many current three-dimensional figures, the pronunciation process involves only simple mouth-shape changes; the head posture and the pronunciation lack linkage, so the pronunciation process of the three-dimensional figure appears stiff.
Disclosure of Invention
In view of the above, the invention provides a head action simulation method for the three-dimensional image pronunciation process, which solves the technical problem that only simple mouth-shape changes occur during three-dimensional image pronunciation while the head posture lacks linkage with the pronunciation, making the pronunciation process stiff.
The invention is realized in the following way:
the invention provides a head action simulation method in a three-dimensional image pronunciation process, which comprises the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generating module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-stage mouth image features into the face image after the mouth is erased in the step S10, and calculating fusion loss, wherein the fusion loss uses the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
s30: and generating the three-dimensional image head model aiming at the specific audio frequency by using the trained three-dimensional image head model.
The mouth erasing network adopts a Unet network and is used for generating a mouth mask representing the position of the mouth, and the mouth position in the face image is erased according to the mouth mask.
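As an illustration of the erasing step, the sketch below applies a predicted mouth mask to blank out the mouth region of a face image; the U-Net segmentation network is assumed to be already trained, and the `unet` handle and threshold value are placeholders rather than the patent's implementation.

```python
# Minimal sketch: erase the mouth region using a predicted mouth mask.
# `unet` is an assumed, already-trained segmentation network returning a
# 1-channel mouth-probability map; the threshold is an illustrative value.
import torch

def erase_mouth(face, unet, threshold=0.5):
    """face: (batch, 3, H, W) tensor in [0, 1]."""
    with torch.no_grad():
        mask = (torch.sigmoid(unet(face)) > threshold).float()  # 1 inside the mouth region
    return face * (1.0 - mask)                                  # zero out the mouth pixels
```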
The audio feature extraction module is composed of a audio downsampling layers and an LSTM layer, firstly, the multi-frame Mel frequency spectrum is subjected to dimension reduction processing sequentially through the audio downsampling layers to generate multi-stage audio features, and then the LSTM layer is used for fusing the last-stage audio features of the multi-frame Mel frequency spectrum to generate final audio features.
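A minimal PyTorch sketch of this kind of structure is given below: a stack of down-sampling layers produces per-frame audio features from the Mel spectrum, and an LSTM fuses the last-stage features across frames into the final audio feature. The layer count, channel widths and tensor layout are illustrative assumptions, not the patent's parameters.

```python
# Illustrative sketch only: "a" down-sampling layers followed by an LSTM that fuses
# the last-stage features of several Mel-spectrum frames. All sizes are assumptions.
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    def __init__(self, hidden=256, num_down=4):   # num_down plays the role of "a"
        super().__init__()
        layers, ch = [], 1
        for i in range(num_down):
            out_ch = 32 * (2 ** i)
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.down = nn.Sequential(*layers)         # multi-stage dimension reduction
        self.pool = nn.AdaptiveAvgPool2d(1)        # collapse to one vector per frame
        self.lstm = nn.LSTM(ch, hidden, batch_first=True)

    def forward(self, mel):                        # mel: (batch, frames, n_mels, time_bins)
        b, t = mel.shape[:2]
        x = mel.reshape(b * t, 1, *mel.shape[2:])  # treat each frame independently
        x = self.pool(self.down(x)).flatten(1)     # last-stage audio feature per frame
        x = x.reshape(b, t, -1)
        out, _ = self.lstm(x)                      # fuse features across frames
        return out[:, -1]                          # final audio feature
```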
The lip synchronous module consists of b lip up-sampling layers which are connected in series, wherein b is more than or equal to 3; and taking the final audio feature obtained by the audio feature extraction module as input, sequentially generating multi-stage lip image features by utilizing a plurality of lip up-sampling layers, and converting the lip image features of the last stage into lip images.
The mouth generation module consists of c mouth up-sampling layers connected in series, where c is greater than or equal to 3. The first-stage lip image features generated by the lip synchronization module are concatenated with the head parameters and used as the input of the first mouth up-sampling layer; the first-stage mouth image features output by the first mouth up-sampling layer are concatenated with the second-stage lip image features as the input of the second mouth up-sampling layer; the second-stage mouth image features output by the second mouth up-sampling layer are concatenated with the third-stage lip image features as the input of the third mouth up-sampling layer; and the third-stage mouth image features output by the third mouth up-sampling layer are taken as the input of the next mouth up-sampling layer, and so on, until the last-stage mouth image features are generated and converted into a mouth image.
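The cascade described above (lip features plus head parameters feeding a chain of mouth up-sampling layers, with stage-wise concatenation) can be sketched as follows; the channel sizes, number of stages shown and head-parameter dimension are assumptions for illustration only.

```python
# Sketch of the cascaded up-sampling described for the mouth-generation module:
# each mouth up-sampling layer consumes the previous mouth feature concatenated with
# the lip feature of the matching stage. Shapes and channel counts are assumptions.
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.ReLU(inplace=True))

class MouthGenerator(nn.Module):
    def __init__(self, lip_chs=(256, 128, 64), head_dim=16):
        super().__init__()
        self.up1 = up_block(lip_chs[0] + head_dim, 128)   # lip stage 1 + head parameters
        self.up2 = up_block(128 + lip_chs[1], 64)         # + lip stage 2
        self.up3 = up_block(64 + lip_chs[2], 32)          # + lip stage 3
        self.to_img = nn.Conv2d(32, 3, 3, padding=1)      # last-stage feature -> mouth image

    def forward(self, lip_feats, head_params):
        l1, l2, l3 = lip_feats                            # coarsest to finest lip features
        h = head_params[..., None, None].expand(-1, -1, *l1.shape[2:])
        m1 = self.up1(torch.cat([l1, h], dim=1))
        m2 = self.up2(torch.cat([m1, l2], dim=1))
        m3 = self.up3(torch.cat([m2, l3], dim=1))
        return torch.sigmoid(self.to_img(m3)), (m1, m2, m3)
```

The stage-wise concatenation is what keeps the mouth features aligned with the lip features at each resolution before the final mouth image is produced.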
The fusion module adopts a Unet network, takes the face image after the mouth is erased as the input of an encoder in the Unet network, fuses the output of each layer of the encoder and the multi-level mouth image characteristics generated by the mouth generation module into the input of each layer of a decoder, and generates a fused complete face image.
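A compact sketch of the described fusion is shown below: the mouth-erased face is encoded, and at each decoder level the matching encoder feature and a mouth feature from the mouth generation module are concatenated in. Channel counts and feature resolutions are assumed for illustration.

```python
# U-Net-style fusion sketch: encoder features of the mouth-erased face are concatenated
# with mouth features at each decoder level. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

def down(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))

def up(in_ch, out_ch):
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))

class FusionUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1, self.e2, self.e3 = down(3, 32), down(32, 64), down(64, 128)
        self.d1 = up(128, 64)             # H/8 -> H/4
        self.d2 = up(64 + 64 + 64, 32)    # cat(d1, e2, mouth feature at H/4)
        self.d3 = up(32 + 32 + 32, 16)    # cat(d2, e1, mouth feature at H/2)
        self.out = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, erased_face, mouth_quarter, mouth_half):
        # mouth_quarter: (B, 64, H/4, W/4); mouth_half: (B, 32, H/2, W/2) -- assumed shapes
        e1 = self.e1(erased_face)         # H/2, 32 ch
        e2 = self.e2(e1)                  # H/4, 64 ch
        e3 = self.e3(e2)                  # H/8, 128 ch
        x = self.d1(e3)
        x = self.d2(torch.cat([x, e2, mouth_quarter], dim=1))
        x = self.d3(torch.cat([x, e1, mouth_half], dim=1))
        return torch.sigmoid(self.out(x))  # fused complete face image
```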
On the basis of the technical scheme, the head action simulation method in the three-dimensional image pronunciation process can be further improved as follows:
the method for establishing the video signal library comprises the following steps:
step one: plastic pellets with reflective outer walls are attached to the nose tips of the experimenters, and black small paper sheets are attached to the head posture key points of the experimenters;
step two: the method comprises the steps that a camera is arranged on the right opposite side of an experimenter, a signal transmitting end and a signal receiving end are arranged on two sides of the face of the experimenter, wherein the signal transmitting end and the signal receiving end form a straight line with the plastic pellets, and the distance between the signal transmitting end and the signal receiving end is 1m;
step three: the method comprises the steps of taking a center point of a camera as a center, establishing a three-dimensional coordinate system, starting a signal transmitting end to transmit a signal, starting the camera, and reading by experimenters;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
Further, the step S10 specifically includes:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter is not changed, extracting a face image set from all frames in the video, and intercepting a lip part in the face image as a sample lip image;
if the head posture of the experimenter is changed, extracting plastic pellet images from all frames in the video, establishing the three-dimensional coordinates of the plastic pellet in the three-dimensional coordinate system, and using the corresponding lip shapes in the phoneme mouth shape driving method as sample lip images;
constructing a mouth erasing network, randomly taking out part of face images from a face image set, marking the mouth positions, training the mouth erasing network, recognizing and erasing the mouth positions of the face images with the untagged mouth positions by using the trained mouth erasing network, and reserving the face images;
and converting the audio of the time domain into a Mel frequency spectrum of the frequency domain, wherein the sampling rate of the frequency domain is consistent with the sampling rate of the video frame.
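For example, the alignment between the Mel spectrum and the video frames can be obtained by choosing the hop length as the number of audio samples per video frame; the sampling rate, FFT size and frame rate below are assumed values.

```python
# Sketch: time-domain audio -> Mel spectrum whose frame rate matches the video frame rate
# (hop length = audio samples per video frame). Parameter values are assumptions.
import librosa
import numpy as np

def video_aligned_mel(wav_path, fps=25, n_mels=80):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps                       # one spectrum column per video frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, ~number of video frames)
```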
Further, the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiving end corresponding to the video specifically includes:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
Further, the step 1 specifically includes:
The first step: the signal transmitted by the transmitting end is represented in frequency-domain form as S = [S(0), S(1), …, S(K-1)]; after underwater propagation, the signal received by the receiving end is represented in frequency-domain form as the matrix X.
The second step: a binary hypothesis test is used to perform parameter estimation on the frequency-domain form of a specified number of received signals, specifically: under the two hypotheses H_0 and H_1 of the binary hypothesis test, parameter estimation is performed on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, …).
The third step: an EM time-delay estimation algorithm is used to compute the multipath delays of the direct wave and of the small-ball scattered wave, specifically: the EM time-delay estimation algorithm gives the direct-wave multipath delay estimates, abbreviated τ̂_d = [τ̂_d,1, …, τ̂_d,M], and the small-ball scattered-wave multipath delay estimates, abbreviated τ̂_s = [τ̂_s,1, …, τ̂_s,N], where M and N are the numbers of sound rays of the direct wave and of the small-ball scattered wave respectively, and each entry is the estimated delay of one sound ray.
Further, the step 2 specifically includes:
the first step: according to the method of the extended Kalman filtering, a state equation and an observation equation of the extended Kalman filtering are established, and the method specifically comprises the following steps:
According to the extended Kalman filtering method, the state quantity of the small-ball motion is set as x = [x, v_x, y, v_y]^T and the observation quantity as z (the vector of multipath delays), and the state equation and observation equation of the extended Kalman filter are established:
x_k = F x_{k-1} + w_k
z_k = h(x_k) + v_k
where F is the state transition matrix, determined by the form of motion of the small ball; h(·) is the observation function; w_k is the state noise matrix, obeying w_k ~ N(0, Q); and v_k is the observation noise matrix, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, an extended Kalman filtering method is used for obtaining a state prediction equation and a predicted covariance matrix of the next moment, and the method specifically comprises the following steps:
According to the known information at time k-1, the extended Kalman filtering method gives the state prediction x_{k|k-1} and the predicted covariance matrix P_{k|k-1}:
x_{k|k-1} = F x_{k-1|k-1}
P_{k|k-1} = F P_{k-1|k-1} F^T + Q_{k-1|k-1}
And a third step of: the functional relation between the ball motion state and the multipath time delay is calculated, and the method specifically comprises the following steps:
Because the functional relation h(x_k) is nonlinear, it is linearised with a first-order Taylor expansion according to the processing method of the extended Kalman filter, which requires the explicit expression of h(x_k); the virtual-source mirror-image method is used to express this relation and obtain the functional relationship between the small-ball motion state x_k and the multipath delays.
The propagation of the small-ball scattered wave is divided into two segments, transmitting end–small ball and small ball–receiving end, each described with the virtual-source mirror-image method.
For the transmitting end–small ball (st) segment, the number of sound rays is set to N_st, and the travel r_i^st of each sound ray (i = 1, …, N_st) is determined by the small-ball position as the distance from the corresponding mirror image of the transmitting end to the small ball.
For the small ball–receiving end (tr) segment, the number of sound rays is set to N_tr, and the travel r_j^tr of each sound ray (j = 1, …, N_tr) is likewise determined by the small-ball position.
(x_s, y_s, z_s), (x_t, y_t, z_t) and (x_r, y_r, z_r) denote the coordinates of the transmitting end, the small ball and the receiving end respectively. The numbers of sound rays are chosen so that N_st × N_tr matches N, which keeps the matrix dimensions consistent. In shallow sea the sound-speed gradient is small, so the sound speed can be taken as a constant value c in the delay calculation, and the multipath delay of each scattered path is then
τ_ij = (r_i^st + r_j^tr) / c,
which gives the functional relation z = h(x_k) between the small-ball motion state and the multipath delays.
The Jacobian matrix of the observation function h(x_k), i.e. the observation matrix H_k, is then obtained as H_k = ∂h(x)/∂x evaluated at x_{k|k-1}.
Fourth step: compute the prediction of the observation and the Kalman gain, specifically the predicted observation z_{k|k-1} = h(x_{k|k-1}) and the Kalman gain
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R)^{-1}.
fifth step: updating the observed value, and representing a multipath time delay joint estimated value obtained by combining the small ball motion information by the updated observed value, wherein the method specifically comprises the following steps of:
With the observation z_k at time k, the update step gives the state update value x_{k|k} and the error-covariance update matrix P_{k|k}:
x_{k|k} = x_{k|k-1} + K_k (z_k − h(x_{k|k-1}))
P_{k|k} = P_{k|k-1} − K_k H_k P_{k|k-1}
z_{k|k} = h(x_{k|k})
where the updated observation value z_{k|k} represents the multipath-delay joint estimate obtained by combining the small-ball motion information.
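The five steps above amount to one standard extended-Kalman-filter cycle. The NumPy sketch below illustrates that cycle with the observation function modelled as the delay of a single transmitting end – small ball – receiving end path at constant propagation speed, and a numerical Jacobian standing in for the first-order Taylor linearisation; the geometry, noise covariances and single-path simplification are assumptions, not the patent's virtual-source expansion.

```python
# Sketch of one EKF cycle for state x = [x, vx, y, vy] observed through a path delay.
# Geometry, Q, R and the single-path observation are illustrative assumptions.
import numpy as np

c = 343.0                                   # assumed constant propagation speed
tx = np.array([-0.5, 0.0, 0.0])             # transmitting-end position (assumed)
rx = np.array([0.5, 0.0, 0.0])              # receiving-end position (assumed)
dt = 0.04                                   # one video frame at 25 fps

F = np.array([[1, dt, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, dt],
              [0, 0, 0, 1]], dtype=float)   # constant-velocity state transition
Q = 1e-6 * np.eye(4)                        # state noise covariance (assumed)
R = 1e-9 * np.eye(1)                        # observation noise covariance (assumed)

def h(x):
    ball = np.array([x[0], x[2], 0.0])      # small-ball position from the state vector
    return np.array([(np.linalg.norm(ball - tx) + np.linalg.norm(rx - ball)) / c])

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian, standing in for the first-order Taylor linearisation."""
    J = np.zeros((len(f(x)), len(x)))
    for i in range(len(x)):
        dx = np.zeros_like(x); dx[i] = eps
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

def ekf_step(x_prev, P_prev, z_k):
    x_pred = F @ x_prev                                  # state prediction
    P_pred = F @ P_prev @ F.T + Q                        # predicted covariance
    H = jacobian(h, x_pred)                              # observation matrix H_k
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)                  # Kalman gain
    x_upd = x_pred + K @ (z_k - h(x_pred))               # state update
    P_upd = P_pred - K @ H @ P_pred                      # covariance update
    return x_upd, P_upd, h(x_upd)                        # h(x_upd): updated delay estimate
```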
Further, the step 3 specifically includes:
The first step: the multipath-delay joint estimate z_{k|k} is taken as the parameter estimate in the generalized likelihood-ratio detection.
The second step: the likelihood ratio is calculated by substituting the obtained parameter estimate into the likelihood functions under hypotheses H_0 and H_1, giving the likelihood ratio L_GLRT.
The third step: the likelihood ratio is compared with a detection threshold to decide whether the small-ball coordinates have changed beyond the threshold, specifically: the likelihood-ratio expression (1) is simplified to obtain a test statistic T(X_k), which is compared with the corresponding detection threshold η* to judge whether the small-ball coordinates have changed beyond the threshold. The matrices appearing in T(X_k) contain only the multipath delays of the direct wave and of the small-ball scattered wave respectively, together with the projection matrix onto the subspace spanned by the corresponding delay components.
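The patent's exact likelihood-ratio expressions are given only as equation images; as a generic illustration of the comparison step, the sketch below projects the frequency-domain received vector onto the subspace spanned by steering vectors built from the estimated delays and compares a normalised statistic with a threshold. All symbols and formulas here are assumptions about the general GLRT pattern, not the patent's formulas.

```python
# Generic subspace-projection detector sketch (assumed form, not the patent's equations).
# A(tau) stacks frequency-domain steering vectors for the estimated multipath delays;
# T(X) measures how much received energy lies in that subspace vs. the threshold eta_star.
import numpy as np

def steering_matrix(delays, freqs):
    # one column exp(-j*2*pi*f*tau_i) per estimated delay
    return np.exp(-2j * np.pi * np.outer(freqs, delays))

def detect_change(X, delays, freqs, eta_star):
    A = steering_matrix(delays, freqs)                         # (K, n_delays)
    P = A @ np.linalg.pinv(A)                                  # projector onto span(A)
    T = np.real(X.conj() @ P @ X) / np.real(X.conj() @ X)      # normalised test statistic
    return bool(T > eta_star), float(T)
```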
The step S30 specifically includes:
For the Mel spectrum of a given audio, multi-frame mouth-erased face images of the pellet-marked character and the corresponding head posture parameters are acquired according to the method of step S10, and the frequency-domain Mel spectrum is aligned in time with the multi-frame face images;
using the trained three-dimensional image head model, the audio feature extraction module first performs feature extraction on the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module then generates multi-level mouth image features from the multi-level lip image features and the head posture parameters; and finally the multi-level mouth image features are fused into the multi-frame mouth-erased face images of the character, generating synthesized face images with the mouth actions for the specific audio.
Further, the specific step of generating the head posture feature by the head posture module includes:
the plastic pellet is taken as a center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the change of the black paper sheet.
Further, the head posture key points at least include the points at the left eye corner, right eye corner, left mouth corner, right mouth corner, the top center of the head, the top of the left ear, the bottom of the left ear, the top of the right ear, and the bottom of the right ear of the face.
Compared with the prior art, the head action simulation method for the three-dimensional image pronunciation process has the following beneficial effects: head posture key points are used to describe head movements, and the small ball at the nose tip serves as a scatterer between the signal transmitting end and the receiving end, so that when the ball's coordinates change the signal received by the receiving end changes, allowing tiny changes of the experimenter's head to be detected; the detection threshold sets the threshold for the change of the ball's coordinates, and when the change exceeds the threshold the head posture of the face is judged to have changed and the lip shape from the phoneme mouth-shape driving method is used in place of the lip shape collected from the video images. This greatly reduces the amount of computation while keeping the head posture well linked with the pronunciation, avoiding stiffness in the three-dimensional image pronunciation process.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a three-dimensional character pronunciation process head motion simulation method provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
As shown in fig. 1, the present invention provides a flow chart of a three-dimensional image pronunciation process head motion simulation method, which comprises the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generation module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-level mouth image features into the face image after the mouth is erased in the S10, calculating fusion loss, and using the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
s30: and generating the three-dimensional image head model aiming at the specific audio frequency by using the trained three-dimensional image head model.
The mouth erasing network adopts a Unet network and is used for generating a mouth mask representing the mouth position, and the mouth position in the face image is erased according to the mouth mask.
The audio feature extraction module is composed of a audio downsampling layers and an LSTM layer, firstly, the multi-frame Mel frequency spectrum is subjected to dimension reduction processing sequentially through the audio downsampling layers to generate multi-stage audio features, and then the LSTM layer is used for fusing the last-stage audio features of the multi-frame Mel frequency spectrum to generate final audio features.
The lip synchronous module consists of b lip up-sampling layers which are connected in series, wherein b is more than or equal to 3; and taking the final audio feature obtained by the audio feature extraction module as input, sequentially generating multi-stage lip image features by utilizing a plurality of lip up-sampling layers, and converting the lip image features of the last stage into lip images.
The mouth generation module consists of c mouth up-sampling layers connected in series, where c is greater than or equal to 3. The first-stage lip image features generated by the lip synchronization module are concatenated with the head parameters and used as the input of the first mouth up-sampling layer; the first-stage mouth image features output by the first mouth up-sampling layer are concatenated with the second-stage lip image features as the input of the second mouth up-sampling layer; the second-stage mouth image features output by the second mouth up-sampling layer are concatenated with the third-stage lip image features as the input of the third mouth up-sampling layer; and the third-stage mouth image features output by the third mouth up-sampling layer are taken as the input of the next mouth up-sampling layer, and so on, until the last-stage mouth image features are generated and converted into a mouth image.
The fusion module adopts a Unet network, takes the face image after the mouth is erased as the input of an encoder in the Unet network, fuses the output of each layer of the encoder and the multi-level mouth image characteristics generated by the mouth generation module into the input of each layer of a decoder, and generates a fused complete face image.
In the above technical solution, the method for establishing the video signal library includes:
step one: plastic pellets with reflective outer walls are attached to the nose tips of the experimenters, and black small paper sheets are attached to the head posture key points of the experimenters;
step two: the method comprises the steps that a camera is arranged on the right opposite side of an experimenter, a signal transmitting end and a signal receiving end are arranged on two sides of the face of the experimenter, wherein the signal transmitting end and the signal receiving end are in a straight line with each other on a plastic pellet, and the distance between the signal transmitting end and the signal receiving end is 1m;
step three: the method comprises the steps of taking a center point of a camera as a center, establishing a three-dimensional coordinate system, starting a signal transmitting end to transmit a signal, starting the camera, and reading by experimenters;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
Further, in the above technical solution, S10 specifically includes:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter is not changed, extracting a face image set from all frames in the video, and intercepting a lip part in the face image as a sample lip image;
if the head posture of the experimenter is changed, extracting plastic pellet images from all frames in the video, establishing the three-dimensional coordinates of the plastic pellet in the three-dimensional coordinate system, and using the corresponding lip shapes in the phoneme mouth shape driving method as sample lip images;
constructing a mouth erasing network, randomly taking out part of face images from a face image set, marking the mouth positions, training the mouth erasing network, recognizing and erasing the mouth positions of the face images with the untagged mouth positions by using the trained mouth erasing network, and reserving the face images;
and converting the audio of the time domain into a Mel frequency spectrum of the frequency domain, wherein the sampling rate of the frequency domain is consistent with the sampling rate of the video frame.
The phoneme mouth-shape driving method first converts speech or text into a phoneme sequence, and each phoneme corresponds to a specific viseme (a specific mouth shape). To make the mouth shapes fit the real scene, the designed rules applied to the video sequence need to be smoothed in time. The algorithm comprises two stages:
In the first stage, which is independent of the specific speaker, three parallel networks respectively generate three groups of action parameters: mouth shape, eye-brow expression and head movement;
In the second stage, speaker-specific videos are synthesized, and speaking videos of different specific persons are generated based on an adaptive attention network supervised by three-dimensional face information.
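As an illustration of phoneme mouth-shape driving with temporal smoothing, the sketch below maps a per-frame phoneme sequence to viseme mouth-shape parameters and applies a moving-average filter; the phoneme inventory, the viseme table and the window size are toy assumptions, not the patent's data.

```python
# Illustrative phoneme-to-viseme driving with simple temporal smoothing.
# The phoneme set, viseme table and window size are toy assumptions.
import numpy as np

PHONEME_TO_VISEME = {"AA": 0, "IY": 1, "UW": 2, "M": 3, "F": 4, "SIL": 5}
VISEME_PARAMS = np.array([   # one row of mouth-shape parameters per viseme (toy values)
    [1.0, 0.2], [0.4, 0.8], [0.3, 0.3], [0.0, 0.0], [0.2, 0.5], [0.1, 0.1]])

def mouth_params(phoneme_per_frame, window=3):
    raw = np.stack([VISEME_PARAMS[PHONEME_TO_VISEME[p]] for p in phoneme_per_frame])
    kernel = np.ones(window) / window
    # moving-average smoothing along time so mouth shapes do not jump between frames
    return np.stack([np.convolve(raw[:, d], kernel, mode="same")
                     for d in range(raw.shape[1])], axis=1)

params = mouth_params(["SIL", "M", "AA", "AA", "IY", "SIL"])   # (frames, parameters)
```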
Further, in the above technical solution, the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiving end corresponding to the video specifically includes:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
Further, in the above technical solution, step 1 specifically includes:
The first step: the signal transmitted by the transmitting end is represented in frequency-domain form as S = [S(0), S(1), …, S(K-1)]; after underwater propagation, the signal received by the receiving end is represented in frequency-domain form as the matrix X.
The second step: a binary hypothesis test is used to perform parameter estimation on the frequency-domain form of a specified number of received signals, specifically: under the two hypotheses H_0 and H_1 of the binary hypothesis test, parameter estimation is performed on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, …).
The third step: an EM time-delay estimation algorithm is used to compute the multipath delays of the direct wave and of the small-ball scattered wave, specifically: the EM time-delay estimation algorithm gives the direct-wave multipath delay estimates, abbreviated τ̂_d = [τ̂_d,1, …, τ̂_d,M], and the small-ball scattered-wave multipath delay estimates, abbreviated τ̂_s = [τ̂_s,1, …, τ̂_s,N], where M and N are the numbers of sound rays of the direct wave and of the small-ball scattered wave respectively, and each entry is the estimated delay of one sound ray.
Further, in the above technical solution, step 2 specifically includes:
the first step: according to the method of the extended Kalman filtering, a state equation and an observation equation of the extended Kalman filtering are established, and the method specifically comprises the following steps:
According to the extended Kalman filtering method, the state quantity of the small-ball motion is set as x = [x, v_x, y, v_y]^T and the observation quantity as z (the vector of multipath delays), and the state equation and observation equation of the extended Kalman filter are established:
x_k = F x_{k-1} + w_k
z_k = h(x_k) + v_k
where F is the state transition matrix, determined by the form of motion of the small ball; h(·) is the observation function; w_k is the state noise matrix, obeying w_k ~ N(0, Q); and v_k is the observation noise matrix, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, an extended Kalman filtering method is used for obtaining a state prediction equation and a predicted covariance matrix of the next moment, and the method specifically comprises the following steps:
According to the known information at time k-1, the extended Kalman filtering method gives the state prediction x_{k|k-1} and the predicted covariance matrix P_{k|k-1}:
x_{k|k-1} = F x_{k-1|k-1}
P_{k|k-1} = F P_{k-1|k-1} F^T + Q_{k-1|k-1}
And a third step of: the functional relation between the ball motion state and the multipath time delay is calculated, and the method specifically comprises the following steps:
Because the functional relation h(x_k) is nonlinear, it is linearised with a first-order Taylor expansion according to the processing method of the extended Kalman filter, which requires the explicit expression of h(x_k); the virtual-source mirror-image method is used to express this relation and obtain the functional relationship between the small-ball motion state x_k and the multipath delays.
The propagation of the small-ball scattered wave is divided into two segments, transmitting end–small ball and small ball–receiving end, each described with the virtual-source mirror-image method.
For the transmitting end–small ball (st) segment, the number of sound rays is set to N_st, and the travel r_i^st of each sound ray (i = 1, …, N_st) is determined by the small-ball position as the distance from the corresponding mirror image of the transmitting end to the small ball.
For the small ball–receiving end (tr) segment, the number of sound rays is set to N_tr, and the travel r_j^tr of each sound ray (j = 1, …, N_tr) is likewise determined by the small-ball position.
(x_s, y_s, z_s), (x_t, y_t, z_t) and (x_r, y_r, z_r) denote the coordinates of the transmitting end, the small ball and the receiving end respectively. The numbers of sound rays are chosen so that N_st × N_tr matches N, which keeps the matrix dimensions consistent. In shallow sea the sound-speed gradient is small, so the sound speed can be taken as a constant value c in the delay calculation, and the multipath delay of each scattered path is then
τ_ij = (r_i^st + r_j^tr) / c,
which gives the functional relation z = h(x_k) between the small-ball motion state and the multipath delays.
The Jacobian matrix of the observation function h(x_k), i.e. the observation matrix H_k, is then obtained as H_k = ∂h(x)/∂x evaluated at x_{k|k-1}.
Fourth step: compute the prediction of the observation and the Kalman gain, specifically the predicted observation z_{k|k-1} = h(x_{k|k-1}) and the Kalman gain
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R)^{-1}.
fifth step: updating the observed value, and representing a multipath time delay joint estimated value obtained by combining the small ball motion information by the updated observed value, wherein the method specifically comprises the following steps of:
With the observation z_k at time k, the update step gives the state update value x_{k|k} and the error-covariance update matrix P_{k|k}:
x_{k|k} = x_{k|k-1} + K_k (z_k − h(x_{k|k-1}))
P_{k|k} = P_{k|k-1} − K_k H_k P_{k|k-1}
z_{k|k} = h(x_{k|k})
where the updated observation value z_{k|k} represents the multipath-delay joint estimate obtained by combining the small-ball motion information.
Further, in the above technical solution, step 3 specifically includes:
The first step: the multipath-delay joint estimate z_{k|k} is taken as the parameter estimate in the generalized likelihood-ratio detection.
The second step: the likelihood ratio is calculated by substituting the obtained parameter estimate into the likelihood functions under hypotheses H_0 and H_1, giving the likelihood ratio L_GLRT.
The third step: the likelihood ratio is compared with a detection threshold to decide whether the small-ball coordinates have changed beyond the threshold, specifically: the likelihood-ratio expression (1) is simplified to obtain a test statistic T(X_k), which is compared with the corresponding detection threshold η* to judge whether the small-ball coordinates have changed beyond the threshold, the detection threshold being 0.2–0.5 cm. The matrices appearing in T(X_k) contain only the multipath delays of the direct wave and of the small-ball scattered wave respectively, together with the projection matrix onto the subspace spanned by the corresponding delay components.
In the above technical solution, step S30 specifically includes:
For the Mel spectrum of a given audio, multi-frame mouth-erased face images of the pellet-marked character and the corresponding head posture parameters are acquired according to the method of step S10, and the frequency-domain Mel spectrum is aligned in time with the multi-frame face images;
using the trained three-dimensional image head model, the audio feature extraction module first performs feature extraction on the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module then generates multi-level mouth image features from the multi-level lip image features and the head posture parameters; and finally the multi-level mouth image features are fused into the multi-frame mouth-erased face images of the character, generating synthesized face images with the mouth actions for the specific audio.
Further, in the above technical solution, the specific step of generating the head posture feature by the head posture module includes:
the plastic pellet is taken as a center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the change of the black paper sheet.
Further, in the above technical solution, the head posture key points at least include points where the left eye corner, the right eye corner, the left mouth corner, the right mouth corner, the top center, the top of the head, the top of the left ear, the bottom of the left ear, the top of the right ear, and the bottom of the right ear of the human face are located.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A three-dimensional image pronunciation process head action simulation method is characterized by comprising the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generating module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-stage mouth image features into the face image after the mouth is erased in the step S10, and calculating fusion loss, wherein the fusion loss uses the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
S30: generating, by using the trained three-dimensional image head model, the head action for the specific audio.
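As a rough, non-authoritative illustration of the loss combination in S20 (the exact loss definitions, embeddings and weights are not specified numerically in the claim, so every function, margin and weight below is a placeholder assumption): the lip loss pairs a mean-squared error with a contrastive term, the mouth loss is a mean-squared error, the fusion loss follows a partial-convolution-style reconstruction split, and the model parameters are updated from their weighted sum.

import torch
import torch.nn.functional as F

def contrastive_term(lip_emb, audio_emb_pos, audio_emb_neg, margin=0.5):
    """Pull the embedding of the synced audio close, push a mismatched one away."""
    d_pos = F.pairwise_distance(lip_emb, audio_emb_pos)
    d_neg = F.pairwise_distance(lip_emb, audio_emb_neg)
    return (d_pos.pow(2) + torch.clamp(margin - d_neg, min=0).pow(2)).mean()

def lip_loss(pred_lip, true_lip, lip_emb, audio_pos, audio_neg):
    """Mean-squared error plus a contrastive term, as listed for the lip synchronization module."""
    return F.mse_loss(pred_lip, true_lip) + contrastive_term(lip_emb, audio_pos, audio_neg)

def mouth_loss(pred_mouth, true_mouth):
    """Mean-squared error, as listed for the mouth generating module."""
    return F.mse_loss(pred_mouth, true_mouth)

def fusion_loss(pred_face, true_face, mouth_mask):
    """Rough PConv-style split between the erased (hole) region and the kept region."""
    hole = F.l1_loss(pred_face * mouth_mask, true_face * mouth_mask)
    valid = F.l1_loss(pred_face * (1 - mouth_mask), true_face * (1 - mouth_mask))
    return 6.0 * hole + valid           # 6:1 hole/valid weighting, borrowed from PConv practice

def total_loss(l_lip, l_mouth, l_fusion, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses used to update the model parameters in S20."""
    return w[0] * l_lip + w[1] * l_mouth + w[2] * l_fusion

# Example with dummy tensors (batch size 2, small crops, 128-dim embeddings):
B = 2
loss = total_loss(
    lip_loss(torch.rand(B, 3, 32, 32), torch.rand(B, 3, 32, 32),
             torch.rand(B, 128), torch.rand(B, 128), torch.rand(B, 128)),
    mouth_loss(torch.rand(B, 3, 48, 48), torch.rand(B, 3, 48, 48)),
    fusion_loss(torch.rand(B, 3, 96, 96), torch.rand(B, 3, 96, 96),
                (torch.rand(B, 1, 96, 96) > 0.5).float()))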
2. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 1, wherein the method for creating the video signal library is as follows:
step one: a plastic pellet with a reflective outer wall is attached to the tip of the experimenter's nose, and small black paper sheets are attached to the head posture key points of the experimenter;
step two: a camera is arranged directly facing the experimenter, and a signal transmitting end and a signal receiving end are arranged on the two sides of the experimenter's face, wherein the signal transmitting end, the plastic pellet and the signal receiving end lie on a straight line, and the distance between the signal transmitting end and the signal receiving end is 1 m;
step three: a three-dimensional coordinate system is established with the center point of the camera as the origin, the signal transmitting end is started to transmit signals, the camera is started, and the experimenter reads aloud;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
3. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 2, wherein S10 specifically comprises:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter has not changed, extracting a face image set from all frames in the video, and cropping the lip region of the face images as the sample lip images;
if the head posture of the experimenter has changed, extracting the plastic pellet images from all frames in the video, establishing the three-dimensional coordinates of the plastic pellet in the three-dimensional coordinate system, and using the corresponding lip shape from a phoneme mouth-shape driving method as the sample lip image;
constructing a mouth erasing network, randomly taking out part of the face images from the face image set, labeling their mouth positions, and training the mouth erasing network; recognizing and erasing the mouth regions of the face images whose mouth positions are unlabeled by using the trained mouth erasing network, and retaining the resulting face images;
and converting the time-domain audio into a frequency-domain Mel spectrum, wherein the sampling rate of the Mel spectrum is consistent with the frame rate of the video.
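A minimal sketch of this time-domain-to-Mel conversion with the Mel frame rate matched to the video frame rate, assuming librosa is available; the sampling rate, FPS and Mel parameters below are illustrative values rather than values taken from this disclosure.

import librosa
import numpy as np

def mel_for_video(wav_path, fps=25, n_mels=80):
    y, sr = librosa.load(wav_path, sr=16000)          # time-domain audio
    hop = sr // fps                                   # one Mel frame per video frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)       # (n_mels, ~number of video frames)

# Each column of the returned spectrum then lines up with one face-image frame.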
4. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 3, wherein the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiver corresponding to the video specifically comprises:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
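Taken together, steps 1 to 4 amount to the following per-frame decision flow. The three callbacks are placeholders standing in for the delay estimation, extended-Kalman-filter tracking and likelihood-ratio detection detailed in claims 5 to 7; the toy usage at the end only demonstrates the control flow.

from typing import Callable, Iterable, List

def head_pose_changes(received: Iterable,
                      estimate_delays: Callable,
                      ekf_step: Callable,
                      glrt: Callable,
                      threshold: float) -> List[bool]:
    flags, track = [], None
    for X_k in received:
        delays = estimate_delays(X_k)                  # step 1: process the received signal
        track, z_joint = ekf_step(track, delays)       # step 2: track the ball, joint delay estimate
        flags.append(glrt(X_k, z_joint) > threshold)   # steps 3-4: ratio vs. threshold -> pose changed
    return flags

# Toy usage with stand-in callbacks (the real ones come from claims 5-7):
frames = [0.1, 0.9, 0.2]
print(head_pose_changes(frames,
                        estimate_delays=lambda x: [x],
                        ekf_step=lambda t, d: (t, d[0]),
                        glrt=lambda x, z: z,
                        threshold=0.5))                # -> [False, True, False]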
5. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 1 specifically comprises:
The first step: the transmitted signal of the transmitting end is expressed in frequency-domain form as S = [S(0), S(1), …, S(K−1)]; after underwater propagation, the signal received by the receiving end is expressed in frequency-domain form as the matrix X;
The second step: a binary hypothesis testing method is adopted to perform parameter estimation on the frequency-domain form of the received signal a specified number of times, which specifically comprises the following steps:
based on the two hypotheses H_0 and H_1 of the binary hypothesis test, parameter estimation is performed on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, …);
The third step: the EM time-delay estimation algorithm is adopted to calculate the direct-wave multipath delays and the small-ball scattered-wave multipath delays, which specifically comprises the following steps:
the EM time-delay estimation algorithm is used to obtain the direct-wave multipath delay vector τ^d = [τ^d_1, τ^d_2, …, τ^d_M] and the small-ball scattered-wave multipath delay vector τ^s = [τ^s_1, τ^s_2, …, τ^s_N], where M and N are the numbers of sound rays of the direct wave and of the small-ball scattered wave respectively, each component is the delay estimate of one sound ray, and the two vectors are abbreviated as τ^d and τ^s.
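The EM-based joint delay estimator itself is not reproduced here. As a greatly simplified stand-in, the sketch below picks multipath delays as the strongest peaks of the cross-correlation between the transmitted and received signals; this is not the EM algorithm of the claim, only an illustration of what estimating several path delays means in code, and all parameters are hypothetical.

import numpy as np

def correlate_delays(tx, rx, fs, n_paths):
    """Return the n_paths strongest cross-correlation lags (in seconds) between tx and rx."""
    corr = np.correlate(rx, tx, mode="full")
    lags = np.arange(-len(tx) + 1, len(rx))
    order = np.argsort(np.abs(corr))[::-1]             # strongest peaks first
    return np.array(sorted(lags[order[:n_paths]] / fs))

# Toy usage: a pulse received over two paths delayed by 10 and 25 samples.
fs = 1000.0
tx = np.zeros(100); tx[0] = 1.0
rx = np.zeros(200); rx[10] = 1.0; rx[25] = 0.6
print(correlate_delays(tx, rx, fs, n_paths=2))          # ~[0.010, 0.025] seconds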
6. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 2 specifically comprises:
The first step: according to the extended Kalman filtering method, the state equation and the observation equation of the extended Kalman filter are established, which specifically comprises the following steps:
according to the extended Kalman filtering method, the state vector of the small-ball motion is set as x = [x, v_x, y, v_y]^T and the observation vector is set as the multipath delay vector z; the state equation and the observation equation of the extended Kalman filter are established as
x_k = F x_{k-1} + w_k
z_k = h(x_k) + v_k
wherein F is the state transition matrix, determined by the form of motion of the small ball, h(·) is the observation function, w_k is the state noise, obeying w_k ~ N(0, Q), and v_k is the observation noise, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, an extended Kalman filtering method is used for obtaining a state prediction equation and a predicted covariance matrix of the next moment, and the method specifically comprises the following steps:
according to the known information of the k-1 moment, an extended Kalman filtering method is used for obtaining a state prediction equation of the k moment
Figure FDA0004016585280000041
And predicted covariance matrix P k|k-1
Figure FDA0004016585280000042
P k|k-1 =FP k-1|k-1 F T +Q k-1|k-1
And a third step of: the functional relation between the ball motion state and the multipath time delay is calculated, and the method specifically comprises the following steps:
due to the functional relationship h (x k ) Is nonlinear, and according to the processing method of the extended Kalman filter, a first-order Taylor formula is used for carrying out linearization approximation on the extended Kalman filter, and a functional relation h (x is needed to be obtained k ) The expression of the function relation is expressed by using a virtual source mirror image method to obtain the movement state x of the small ball k With multipath delay
Figure FDA0004016585280000043
A functional relationship between them.
Figure FDA0004016585280000044
The functional relationship h (x k ) The linearization approximation is carried out, and the propagation process of the small ball scattered wave is divided into two sections of a transmitting end, namely a small ball and a small ball, namely a receiving end, and the two sections are respectively described by using a virtual source mirroring method.
For the transmitting end-small ball (st) section, the number of sound rays is set as N st The relationship between the sound ray travel and the ball position is:
Figure FDA0004016585280000045
Figure FDA0004016585280000046
Figure FDA0004016585280000047
L
for the small ball-receiving end (tr) section, the number of sound rays is set to be N tr The relationship between the sound ray travel and the ball position is:
Figure FDA0004016585280000051
Figure FDA0004016585280000052
Figure FDA0004016585280000053
L
(x s ,y s ,z s )、(x t ,y t ,z t ) And (x) r ,y r ,z r ) Representing the coordinates of the transmitting end, the pellet, and the receiving end, respectively. For the selection of the number of these two sound rays, N is followed st ×N tr The principle of N, ensures that the dimensions of the matrix are consistent. In shallow sea, the gradient change of sound velocity is not large, the sound velocity can be set as a constant value c in the time delay calculation, and then the multipath time delay can be expressed as:
Figure FDA0004016585280000054
therefore, the relation between the ball motion state and the multipath time delay is as follows:
Figure FDA0004016585280000055
find the observation function h (x k ) Jacobian matrix of (a), i.e. observation matrix H k
Figure FDA0004016585280000056
The fourth step: the prediction of the observation and the Kalman gain are calculated; specifically, the predicted observation z_{k|k-1} and the Kalman gain K_k are calculated as
z_{k|k-1} = h(x_{k|k-1})
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R)^{-1}
The fifth step: the observation is updated, and the updated observation represents the multipath-delay joint estimate obtained by incorporating the small-ball motion information, which specifically comprises the following steps:
after the observation z_k at time k is obtained, the update process yields the state update value x_{k|k} and the error covariance update matrix P_{k|k}:
x_{k|k} = x_{k|k-1} + K_k (z_k − z_{k|k-1})
P_{k|k} = P_{k|k-1} − K_k H_k P_{k|k-1}
z_{k|k} = h(x_{k|k})
wherein the updated observation z_{k|k} represents the multipath-delay joint estimate obtained by incorporating the small-ball motion information.
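A compact extended Kalman filter following the predict/update cycle of the second to fifth steps is sketched below. The constant-velocity state transition, the noise covariances and the toy two-path observation function are illustrative assumptions; in the claim the observation function is the virtual-source mirror-image delay model of the third step.

import numpy as np

def ekf_step(x, P, z, F, Q, R, h, H_jac):
    # Prediction: x_{k|k-1} = F x_{k-1|k-1},  P_{k|k-1} = F P F^T + Q
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: Kalman gain, state and covariance update, updated observation
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new, h(x_new)               # h(x_new) plays the role of z_{k|k}

# Toy setup: state [x, vx, y, vy], constant-velocity motion, observation = two ranges / c.
dt, c = 0.04, 1500.0
F = np.array([[1, dt, 0, 0], [0, 1, 0, 0], [0, 0, 1, dt], [0, 0, 0, 1]], dtype=float)
Q = 1e-4 * np.eye(4)
R = 1e-6 * np.eye(2)

def h(x):                                       # two toy "path delays" seen from two sensors
    r1 = np.hypot(x[0], x[2])
    r2 = np.hypot(x[0] - 1.0, x[2])
    return np.array([r1, r2]) / c

def H_jac(x, eps=1e-6):                         # numerical Jacobian of h at x
    H = np.zeros((2, 4))
    for i in range(4):
        d = np.zeros(4); d[i] = eps
        H[:, i] = (h(x + d) - h(x - d)) / (2 * eps)
    return H

x, P = np.array([0.5, 0.0, 0.3, 0.0]), np.eye(4)
z = h(np.array([0.52, 0.0, 0.31, 0.0]))         # simulated measurement from a slightly moved ball
x, P, z_joint = ekf_step(x, P, z, F, Q, R, h, H_jac)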
7. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 3 specifically comprises:
The first step: the multipath-delay joint estimate is taken as the parameter estimate in the generalized likelihood ratio detection, the multipath-delay joint estimate being denoted z_{k|k};
And a second step of: the likelihood ratio is calculated by using likelihood function with the obtained parameter estimation value, specifically: using hypothesis H 0 And H 1 The following likelihood function:
Figure FDA0004016585280000064
calculating to obtain likelihood ratio L GLRT Wherein
Figure FDA0004016585280000065
And a third step of: comparing the likelihood ratio with a detection threshold value to obtain a detection result of whether the coordinates of the ball are changed beyond the threshold value, wherein the detection result specifically comprises:
the formula (1) is simplified to obtain a test statistic T (X) k ) And compares it with a corresponding detection threshold eta * And comparing to judge whether the coordinates of the ball are changed beyond the threshold value.
Figure FDA0004016585280000071
/>
Wherein the matrix
Figure FDA0004016585280000072
Only the multipath delays of the straight-through wave and the small-sphere scattered wave, respectively, and:
Figure FDA0004016585280000073
Figure FDA0004016585280000074
is->
Figure FDA0004016585280000075
Spatially projected matrix,/->
Figure FDA0004016585280000076
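Because the explicit likelihood functions and projection matrices are available only as equation images, the sketch below illustrates the final comparison step in generic form: a GLRT-style energy-ratio statistic built from a projection onto an assumed signal subspace, compared against a threshold η*. Both the statistic and the threshold value are illustrative stand-ins, not formula (1) of the claim.

import numpy as np

def subspace_projector(A):
    """Orthogonal projector onto the column space of A."""
    return A @ np.linalg.pinv(A)

def glrt_decision(x, A, eta):
    """Return (statistic, changed?) for received vector x and an assumed delay-steering matrix A."""
    P = subspace_projector(A)
    num = np.real(x.conj().T @ P @ x)                    # energy explained by the assumed delays
    den = np.real(x.conj().T @ (np.eye(len(x)) - P) @ x) + 1e-12
    T = num / den
    return T, T > eta

# Toy usage with a random steering matrix standing in for the delay structure.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 3))
x = A @ rng.standard_normal(3) + 0.1 * rng.standard_normal(16)
print(glrt_decision(x, A, eta=10.0))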
8. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 1, wherein the step S30 specifically comprises:
for the Mel spectrum of a given audio, acquiring the multi-frame mouth-erased face images of the small-ball character and the corresponding head posture parameters according to the method of step S10, and aligning the frequency-domain Mel spectrum with the multi-frame face images in time;
by using the trained face forgery generation model, the audio feature extraction module first performs feature extraction on the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module then generates multi-level mouth image features from the multi-level lip image features and the head posture parameters; and finally the multi-level mouth image features are fused into the multi-frame mouth-erased face images of the small-ball character, generating a forged face image of the mouth action under the specific audio.
9. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 3, wherein the specific step of generating the head posture feature by the head posture control module comprises:
the plastic pellet is taken as the center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the changes of the black paper sheets.
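As a simple, assumption-laden illustration of this step (not the procedure of the disclosure): with the reflective pellet as the centre point, a head-posture change can be flagged when the pellet's three-dimensional coordinate moves by more than a tolerance, and a rotation can be estimated from the displacement of the black paper markers around that centre with an orthogonal-Procrustes (Kabsch-style) fit. All thresholds and point sets below are hypothetical.

import numpy as np

def pose_change(pellet_prev, pellet_now, markers_prev, markers_now, tol=5e-3):
    moved = np.linalg.norm(pellet_now - pellet_prev) > tol          # centre-point displacement
    P = markers_prev - pellet_prev                                  # markers relative to the centre
    Q = markers_now - pellet_now
    U, _, Vt = np.linalg.svd(Q.T @ P)                               # Kabsch / orthogonal Procrustes
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt                             # estimated head rotation
    angle = np.degrees(np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1)))
    return moved or angle > 1.0, R

# Toy usage: three markers, head turned 5 degrees about the vertical axis.
prev = np.array([[0.03, 0.02, 0.0], [-0.03, 0.02, 0.0], [0.0, 0.06, 0.0]])
theta = np.radians(5.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0], [np.sin(theta), np.cos(theta), 0], [0, 0, 1]])
now = prev @ Rz.T
print(pose_change(np.zeros(3), np.zeros(3), prev, now))             # -> (True, rotation matrix)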
10. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 2, wherein the head posture key points at least comprise the points where the left eye corner, the right eye corner, the left mouth corner, the right mouth corner, the top center of the head, the top of the left ear, the bottom of the left ear, the top of the right ear and the bottom of the right ear of the human face are located.
CN202211671532.9A 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process Pending CN116246649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211671532.9A CN116246649A (en) 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211671532.9A CN116246649A (en) 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process

Publications (1)

Publication Number Publication Date
CN116246649A true CN116246649A (en) 2023-06-09

Family

ID=86630472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211671532.9A Pending CN116246649A (en) 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process

Country Status (1)

Country Link
CN (1) CN116246649A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863046A (en) * 2023-07-07 2023-10-10 广东明星创意动画有限公司 Virtual mouth shape generation method, device, equipment and storage medium
CN116863046B (en) * 2023-07-07 2024-03-19 广东明星创意动画有限公司 Virtual mouth shape generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10891472B2 (en) Automatic body movement recognition and association system
CN110175596B (en) Virtual learning environment micro-expression recognition and interaction method based on double-current convolutional neural network
Beal et al. A graphical model for audiovisual object tracking
Hong et al. Real-time speech-driven face animation with expressions using neural networks
CN113378806B (en) Audio-driven face animation generation method and system integrating emotion coding
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN112308949A (en) Model training method, human face image generation device and storage medium
CN110458046B (en) Human motion trajectory analysis method based on joint point extraction
CN112597814A (en) Improved Openpos classroom multi-person abnormal behavior and mask wearing detection method
CN113838174B (en) Audio-driven face animation generation method, device, equipment and medium
CN116246649A (en) Head action simulation method in three-dimensional image pronunciation process
CN1952850A (en) Three-dimensional face cartoon method driven by voice based on dynamic elementary access
Jarabese et al. Sign to speech convolutional neural network-based filipino sign language hand gesture recognition system
CN113283372A (en) Method and apparatus for processing image of person
Tur et al. Isolated sign recognition with a siamese neural network of RGB and depth streams
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
RU2737231C1 (en) Method of multimodal contactless control of mobile information robot
CN114882590A (en) Lip reading method based on multi-granularity space-time feature perception of event camera
Mishra et al. Environment descriptor for the visually impaired
CN114466178A (en) Method and device for measuring synchronism of voice and image
Shreekumar et al. Improved viseme recognition using generative adversarial networks
Hsieh et al. Consonant Classification in Mandarin Based on the Depth Image Feature: A Pilot Study.
CN117153195B (en) Method and system for generating speaker face video based on adaptive region shielding
Sams et al. SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation
CN113838218B (en) Speech driving virtual human gesture synthesis method for sensing environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination