CN116246649A - Head action simulation method in three-dimensional image pronunciation process - Google Patents
- Publication number
- CN116246649A (application number CN202211671532.9A)
- Authority
- CN
- China
- Prior art keywords
- mouth
- head
- image
- lip
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/18—Details of the transformation process
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Abstract
The invention provides a head action simulation method for the three-dimensional image pronunciation process, belonging to the technical field of three-dimensional virtual images. The method obtains face videos and corresponding audio from a video library, aligns video frames with audio frames, and extracts multi-frame face images, head posture parameters and Mel spectra as training samples; preprocesses each face image to generate a face image with the mouth erased; establishes a three-dimensional image head model, comprising an audio feature extraction module, a lip synchronization module, a mouth generation module, a head posture module and a fusion module, and trains it with the training samples; and uses the trained three-dimensional image head model to generate head motion for a specific audio input. The method greatly reduces the amount of computation while keeping the head posture well linked with pronunciation, avoiding a stiff appearance during three-dimensional image pronunciation.
Description
Technical Field
The invention belongs to the technical field of three-dimensional virtual images, and particularly relates to a head action simulation method in the three-dimensional image pronunciation process.
Background
Many people make tiny head movements while speaking without noticing them. When a camera is used to capture images of a speaking face, these head movements mean the mouth must be tracked, which brings a large amount of computation. On the other hand, each person's tiny head posture changes during speech are not universal; ignoring them during acquisition allows the speaking-face images to be captured more quickly and improves processing efficiency.
Chinese invention patent CN111081270B (application number CN201911314031.3) discloses a real-time audio-driven virtual character mouth shape synchronous control method, comprising: identifying viseme probabilities from a real-time speech stream; filtering the viseme probabilities; converting the sampling rate of the viseme probabilities to the virtual character rendering frame rate; and converting the viseme probabilities into standard mouth shape configurations for mouth shape rendering. The method avoids the need to transmit a phoneme or mouth-shape sequence synchronously with the audio stream, significantly reduces system complexity, coupling and implementation difficulty, and suits various application scenarios in which virtual characters are rendered on display devices.
In that invention, as in many current three-dimensional images, the pronunciation process shows only simple mouth shape changes; the head posture is not linked with pronunciation, so the pronunciation process of the three-dimensional image appears stiff.
Disclosure of Invention
In view of the above, the invention provides a head action simulation method for the three-dimensional image pronunciation process, which solves the technical problem that the three-dimensional image pronunciation process involves only simple mouth shape changes, with no linkage between head posture and pronunciation, making the pronunciation process appear stiff.
The invention is realized in the following way:
the invention provides a head action simulation method in a three-dimensional image pronunciation process, which comprises the following steps:
s10: acquiring face videos and corresponding audio from a video signal library, aligning the video frames with the audio frames, and extracting multi-frame face images, head postures and Mel spectra as training samples; preprocessing each face image to generate a face image with the mouth erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generating module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features from the final audio features, generating a lip image from the last-stage lip image features, and calculating the lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises a mean-square-error loss and a contrastive loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image features according to the center point (the plastic pellet) coordinates;
the fusion module is used for fusing the head image features and the multi-stage mouth image features into the face image after the mouth is erased in the step S10, and calculating fusion loss, wherein the fusion loss uses the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
s30: generating head motion of the three-dimensional image for a specific audio input by using the trained three-dimensional image head model.
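The weighted loss combination used in the training step above can be sketched as follows; this is a minimal NumPy illustration, and the loss weights and the exact form of the contrastive term are assumptions (the patent states only that lip, mouth and fusion losses are weighted and summed).

```python
import numpy as np

def mse_loss(pred, target):
    """Mean-square error between a generated image and its sample."""
    return float(np.mean((pred - target) ** 2))

def contrastive_loss(feat_audio, feat_lip, same_pair, margin=1.0):
    """Minimal contrastive loss between audio and lip features:
    same_pair = 1 pulls a matching pair together, 0 pushes a
    mismatched pair at least `margin` apart (a common formulation,
    assumed here)."""
    d = float(np.linalg.norm(feat_audio - feat_lip))
    return same_pair * d ** 2 + (1 - same_pair) * max(margin - d, 0.0) ** 2

def total_loss(lip_mse, lip_contrastive, mouth_mse, fusion_loss,
               w_lip=1.0, w_mouth=1.0, w_fusion=1.0):
    """Weighted sum of the lip, mouth and fusion losses; the weight
    values are illustrative assumptions."""
    return (w_lip * (lip_mse + lip_contrastive)
            + w_mouth * mouth_mse
            + w_fusion * fusion_loss)
```

In practice these would be tensor losses backpropagated through the model; the sketch only shows how the three terms combine into the quantity used to update the model parameters.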
The mouth erasing network adopts a Unet network and is used for generating a mouth mask representing the position of the mouth, and the mouth position in the face image is erased according to the mouth mask.
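A minimal sketch of the erasing step, assuming the U-Net's output is already available as a binary mouth mask (the network itself is not reproduced here):

```python
import numpy as np

def erase_mouth(face, mouth_mask, fill_value=0.0):
    """Erase the mouth region of a face image given a binary mouth
    mask (1 = mouth pixel). In the patent the mask comes from the
    trained U-Net; here it is supplied directly as an assumption."""
    erased = face.copy()
    erased[mouth_mask.astype(bool)] = fill_value
    return erased

# toy 4x4 grayscale "face" with a 2x2 "mouth" region in the lower half
face = np.arange(16, dtype=float).reshape(4, 4)
mask = np.zeros((4, 4))
mask[2:4, 1:3] = 1.0
erased = erase_mouth(face, mask)
```

The original face image is left untouched; the erased copy is what the fusion module later completes with generated mouth features.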
The audio feature extraction module is composed of a audio down-sampling layers and an LSTM layer: the multi-frame Mel spectra first pass through the a down-sampling layers in sequence for dimension reduction, generating multi-stage audio features, and the LSTM layer then fuses the last-stage audio features of the multiple frames to generate the final audio features.
The lip synchronization module consists of b serially connected lip up-sampling layers, where b ≥ 3; taking the final audio features obtained by the audio feature extraction module as input, it generates multi-stage lip image features through the successive lip up-sampling layers and converts the last-stage lip image features into a lip image.
The mouth generation module consists of c serially connected mouth up-sampling layers, where c ≥ 3. The first-stage lip image features generated by the lip synchronization module are concatenated with the head parameters as input to the first mouth up-sampling layer; the first-stage mouth image features it outputs are concatenated with the second-stage lip image features as input to the second mouth up-sampling layer; the second-stage mouth image features are concatenated with the third-stage lip image features as input to the third mouth up-sampling layer; and from the third layer onward each layer's output mouth image features feed the next layer, until the last-stage mouth image features are generated and converted into a mouth image.
The fusion module adopts a Unet network, takes the face image after the mouth is erased as the input of an encoder in the Unet network, fuses the output of each layer of the encoder and the multi-level mouth image characteristics generated by the mouth generation module into the input of each layer of a decoder, and generates a fused complete face image.
On the basis of the technical scheme, the head action simulation method in the three-dimensional image pronunciation process can be further improved as follows:
the method for establishing the video signal library comprises the following steps:
step one: a plastic pellet with a reflective outer wall is attached to the experimenter's nose tip, and small black paper sheets are attached to the experimenter's head posture key points;
step two: a camera is placed directly in front of the experimenter, and a signal transmitting end and a signal receiving end are placed on the two sides of the experimenter's face, such that the transmitting end, the plastic pellet and the receiving end lie on a straight line and the distance between the transmitting end and the receiving end is 1 m;
step three: a three-dimensional coordinate system is established with the camera's center point as the origin, the signal transmitting end is started to transmit signals, the camera is started, and the experimenter reads aloud;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in the video signal library.
Further, the step S10 specifically includes:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter has not changed, extracting a face image set from all frames in the video, and cropping the lip part of each face image as a sample lip image;
if the head posture of the experimenter has changed, extracting the plastic pellet images from all frames in the video, establishing the three-dimensional coordinates of the pellet in the three-dimensional coordinate system, and using the corresponding lip shapes from the phoneme mouth-shape driving method as the sample lip images;
constructing a mouth erasing network: part of the face images are randomly taken from the face image set and their mouth positions labelled to train the mouth erasing network; the trained network then recognises and erases the mouth positions of the remaining unlabelled face images, and the resulting face images are retained;
and converting the audio of the time domain into a Mel frequency spectrum of the frequency domain, wherein the sampling rate of the frequency domain is consistent with the sampling rate of the video frame.
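Keeping the frequency-domain sampling rate consistent with the video frame rate amounts to choosing the spectrogram hop length so that one spectral column is produced per video frame. A minimal sketch, with an assumed 16 kHz audio rate and 25 fps video (the patent does not specify these values):

```python
import numpy as np

def hop_for_video_rate(sample_rate, fps):
    """Hop length (in audio samples) that yields exactly one
    spectrogram column per video frame, so the frequency-domain
    sampling rate matches the video frame rate."""
    assert sample_rate % fps == 0, "pick rates that divide evenly"
    return sample_rate // fps

def frame_audio(audio, hop):
    """Split time-domain audio into per-video-frame chunks (no window
    overlap here for brevity; a real Mel front end would use
    overlapping windows before the Mel filter bank)."""
    n = len(audio) // hop
    return audio[: n * hop].reshape(n, hop)

hop = hop_for_video_rate(16000, 25)         # 640 samples per video frame
chunks = frame_audio(np.zeros(16000), hop)  # 1 s of audio -> 25 chunks
```

Each chunk would then be converted to a Mel-spectrum column aligned with its video frame.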
Further, the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiving end corresponding to the video specifically includes:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
Further, the step 1 specifically includes:
the first step: representing the transmitted signal in frequency-domain form as S = [S(0), S(1), …, S(K−1)]; after underwater propagation, the receiving end receives the signal in frequency-domain form as a matrix X;
and a second step of: performing parameter estimation on the frequency-domain form of a specified number of received signals with a binary hypothesis test, specifically: under the two hypotheses H_0 and H_1 of the binary test, estimating the parameters of the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, …);
and a third step of: calculating the direct-wave multipath time delays and the small-ball scattered-wave multipath time delays with the EM time delay estimation algorithm, specifically: the EM algorithm yields the direct-wave multipath time delay estimates τ̂_d = [τ̂_d,1, …, τ̂_d,M] and the small-ball scattered-wave multipath time delay estimates τ̂_s = [τ̂_s,1, …, τ̂_s,N], where M and N are the numbers of sound rays of the direct wave and the scattered wave respectively, and each component is the estimated time delay of one sound ray.
Further, the step 2 specifically includes:
the first step: establishing the state equation and observation equation of the extended Kalman filter, specifically: following the extended Kalman filter method, set the small ball motion state x = [x, v_x, y, v_y]^T and the observation z (the multipath time delay vector), and establish:
x_k = F x_{k-1} + w_k
z_k = h(x_k) + v_k
wherein: F is the state transition matrix, determined by the form of the small ball's motion; h(·) is the observation function; w_k is the state noise, obeying w_k ~ N(0, Q); and v_k is the observation noise, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, an extended Kalman filtering method is used for obtaining a state prediction equation and a predicted covariance matrix of the next moment, and the method specifically comprises the following steps:
according to the known information of the k-1 moment, an extended Kalman filtering method is used for obtaining a state prediction equation of the k momentAnd predicted covariance matrix P k|k-1 :
P k|k-1 =FP k-1|k-1 F T +Q k-1|k-1
And a third step of: the functional relation between the ball motion state and the multipath time delay is calculated, and the method specifically comprises the following steps:
due to the functional relationship h (x k ) Nonlinear, according to the processing method of the extended Kalman filter, a first-order Taylor formula pair is usedWhich performs a linear approximation, requires a functional relation h (x k ) The expression of the function relation is expressed by using a virtual source mirror image method to obtain the movement state x of the small ball k With multipath delayA functional relationship between them.
The functional relationship h (x k ) The linearization approximation is carried out, and the propagation process of the small ball scattered wave is divided into two sections of a transmitting end, namely a small ball and a small ball, namely a receiving end, and the two sections are respectively described by using a virtual source mirroring method.
For the transmitting end-small ball (st) section, the number of sound rays is set as N st The relationship between the sound ray travel and the ball position is:
L
for the small ball-receiving end (tr) section, the number of sound rays is set to be N tr The relationship between the sound ray travel and the ball position is:
L
(x s ,y s ,z s )、(x t ,y t ,z t ) And (x) r ,y r ,z r ) Representing the coordinates of the transmitting end, the pellet, and the receiving end, respectively. For the selection of the number of these two sound rays, N is followed st ×N tr The principle of N, ensures that the dimensions of the matrix are consistent. In shallow sea, the gradient change of sound velocity is not large, the sound velocity can be set as a constant value c in the time delay calculation, and then the multipath time delay can be expressed as:
therefore, the relation between the ball motion state and the multipath time delay is as follows:
find the observation function h (x k ) Jacobian matrix of (a), i.e. observation matrix H k 。
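The virtual-source mirror idea above can be sketched numerically; this minimal version assumes a single reflecting boundary at z = 0 and a constant sound speed c, whereas the patent's shallow-water setting would use a family of mirrored sources for surface and bottom reflections:

```python
import math

def mirror_delays(src, dst, c=1500.0):
    """Delays of the direct ray and the first boundary-reflected ray
    from src to dst using the virtual-source mirror method: the
    reflected path is the straight line from the source's image,
    mirrored about the boundary z = 0."""
    direct = math.dist(src, dst) / c
    image_src = (src[0], src[1], -src[2])  # virtual source above z = 0
    reflected = math.dist(image_src, dst) / c
    return direct, reflected

# transmitting end -> small ball -> receiving end: tau = (l_st + l_tr) / c
tau_st = mirror_delays((0.0, 0.0, 1.0), (0.5, 0.0, 1.0))
tau_tr = mirror_delays((0.5, 0.0, 1.0), (1.0, 0.0, 1.0))
tau_direct = tau_st[0] + tau_tr[0]
```

All coordinates and the sound speed are illustrative; the point is that each delay is a path length over c, with reflected paths obtained from image sources.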
Fourth step: calculating the prediction of the observation and the Kalman gain, with the calculation formulas:
ẑ_{k|k-1} = h(x̂_{k|k-1})
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R)^(-1)
fifth step: updating the observed value, and representing a multipath time delay joint estimated value obtained by combining the small ball motion information by the updated observed value, wherein the method specifically comprises the following steps of:
observed quantity z at time k k Then, the state update value x is obtained after the update process k|k And error covariance update matrix P k|k 。
P k|k =P k|k-1 -K k H k P k|k-1
z k|k =h(x k|k )
Wherein the observed value updates the valueAnd representing the multipath time delay joint estimation value obtained by combining the small ball motion information.
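The predict/update cycle of steps two to five can be sketched as one function; the constant-velocity transition matrix and the single range observation standing in for the multipath time delay vector are illustrative assumptions, not the patent's exact quantities:

```python
import numpy as np

def ekf_step(x, P, z, F, Q, R, h, H_jac):
    """One predict/update cycle of the extended Kalman filter,
    matching the equations above."""
    x_pred = F @ x                      # state prediction
    P_pred = F @ P @ F.T + Q            # predicted covariance
    H = H_jac(x_pred)                   # linearise h at the prediction
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new

# constant-velocity model for x = [x, v_x, y, v_y]^T, time step 1
F = np.array([[1, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
Q = 1e-4 * np.eye(4)
R = np.array([[1e-4]])

def h(x):
    """Range from the origin: a stand-in for the delay observation."""
    return np.array([np.hypot(x[0], x[2])])

def H_jac(x):
    r = np.hypot(x[0], x[2])
    return np.array([[x[0] / r, 0.0, x[2] / r, 0.0]])

x = np.array([1.0, 0.0, 1.0, 0.0])
P = np.eye(4)
x, P = ekf_step(x, P, z=np.array([np.sqrt(2.0)]),
                F=F, Q=Q, R=R, h=h, H_jac=H_jac)
```

With a measurement equal to the predicted range, the state estimate is unchanged and only the covariance contracts, as expected from the update equations.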
Further, the step 3 specifically includes:
the first step: taking the multipath time delay joint estimate z_{k|k} as the parameter estimate in the generalised likelihood ratio test;
and a second step of: substituting the obtained parameter estimate into the likelihood functions under the hypotheses H_0 and H_1, p(X_k | H_0) and p(X_k | H_1, ẑ_{k|k}), and calculating the likelihood ratio:
L_GLRT = p(X_k | H_1, ẑ_{k|k}) / p(X_k | H_0)    (1)
and a third step of: comparing the likelihood ratio with the detection threshold to decide whether the coordinates of the small ball have changed beyond the threshold, specifically: formula (1) is simplified to obtain a test statistic T(X_k), which is compared with the corresponding detection threshold η*; if T(X_k) exceeds η*, the coordinates of the small ball are judged to have changed beyond the threshold. The matrices involved depend only on the multipath time delays of the direct wave and the small-ball scattered wave, respectively.
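A toy version of the threshold test: a known signal shape with unknown amplitude is detected in white Gaussian noise by comparing a GLRT statistic with a threshold. The signal shape, noise level and threshold are illustrative assumptions, not the patent's quantities:

```python
import numpy as np

def glrt_statistic(x, s):
    """GLRT test statistic for a known signal shape s with unknown
    amplitude in white Gaussian noise: the unknown amplitude is
    replaced by its maximum-likelihood estimate, and the statistic
    grows with the matched-filter output."""
    a_hat = (s @ x) / (s @ s)      # ML estimate of the amplitude
    return a_hat ** 2 * (s @ s)    # proportional to the log-likelihood ratio

rng = np.random.default_rng(0)
s = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))   # assumed signal shape
noise_only = rng.normal(0.0, 0.1, 64)           # hypothesis H_0
with_signal = s + rng.normal(0.0, 0.1, 64)      # hypothesis H_1
eta = 1.0                                        # illustrative threshold
change_h0 = glrt_statistic(noise_only, s) > eta
change_h1 = glrt_statistic(with_signal, s) > eta
```

In the patent the same comparison decides whether the pellet's coordinates have changed beyond the threshold; only the statistic and threshold differ.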
The step S30 specifically includes:
for the Mel spectrum of a given audio, acquiring the multi-frame mouth-erased face images of the character and the corresponding head posture parameters according to the method of step S10, and aligning the frequency-domain Mel spectrum with the multi-frame face images in time;
using the trained three-dimensional image head model: the audio feature extraction module first extracts features from the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-stage lip image features from the final audio features; the mouth generation module generates multi-stage mouth image features from the multi-stage lip image features and the head posture parameters; and the fusion module finally fuses the multi-stage mouth image features into the multi-frame mouth-erased face images of the character, generating face images with the mouth actions for the specific audio.
Further, the specific step of generating the head posture feature by the head posture module includes:
the plastic pellet is taken as a center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the change of the black paper sheet.
Further, the head posture key points at least comprise the following points on the face: the left eye, the right eye, the left mouth corner, the right mouth corner, the top center of the head, the top of the left ear, the bottom of the left ear, the top of the right ear and the bottom of the right ear.
Compared with the prior art, the head action simulation method for the three-dimensional image pronunciation process has the following beneficial effects. Head posture key points are used to describe head movements, and the small ball at the nose tip sits between the signal transmitting end and the receiving end, so that when the ball's coordinates change, the signal received by the receiving end changes, allowing tiny movements of the experimenter's head to be detected. The detection threshold sets the allowed amount of coordinate change: when the change in the ball's coordinates exceeds the threshold, the face's head posture is judged to have changed, and the lip shapes from the phoneme mouth-shape driving method replace the lip shapes collected from the video images. This greatly reduces the amount of computation while keeping head posture and pronunciation well linked, avoiding a stiff three-dimensional image pronunciation process.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a three-dimensional character pronunciation process head motion simulation method provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
As shown in fig. 1, the present invention provides a flow chart of a three-dimensional image pronunciation process head motion simulation method, which comprises the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generation module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-level mouth image features into the face image after the mouth is erased in S10, and calculating the fusion loss, wherein the fusion loss uses the loss corresponding to the PCONV (partial convolution) network; parameters of the three-dimensional image head model are updated according to the weighted sum of the lip loss, mouth loss and fusion loss;
s30: using the trained three-dimensional image head model, generate the head animation of the three-dimensional image for specific audio.
The mouth erasing network adopts a Unet network and is used for generating a mouth mask representing the mouth position, and the mouth position in the face image is erased according to the mouth mask.
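As a non-limiting illustration, the erasing operation driven by the mouth mask can be sketched in plain Python; representing the image and mask as nested lists, and the fill value, are assumptions of this sketch, not part of the method:

```python
def erase_mouth(image, mask, fill=0):
    """Blank every pixel where the mouth mask is 1, keep the rest."""
    return [
        [fill if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

# toy 3x3 "face" with the mouth mask covering the lower-right block
image = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
mask  = [[0, 0, 0],
         [0, 1, 1],
         [0, 1, 1]]
erased = erase_mouth(image, mask)
# masked pixels are blanked; unmasked face pixels are preserved
```

In the method itself the mask comes from the trained Unet; here it is given by hand only to show how the erasing step consumes it.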
The audio feature extraction module consists of a audio down-sampling layers and an LSTM layer: the multi-frame Mel spectrum is first passed sequentially through the a down-sampling layers for dimensionality reduction, generating multi-stage audio features, and the LSTM layer then fuses the last-stage audio features of the multiple frames to generate the final audio features.
The lip synchronous module consists of b lip up-sampling layers which are connected in series, wherein b is more than or equal to 3; and taking the final audio feature obtained by the audio feature extraction module as input, sequentially generating multi-stage lip image features by utilizing a plurality of lip up-sampling layers, and converting the lip image features of the last stage into lip images.
The mouth generating module consists of c mouth up-sampling layers connected in series, where c is greater than or equal to 3. The first-stage lip image features generated by the lip synchronization module are spliced with the head parameters as the input of the first mouth up-sampling layer; the first-stage mouth image features output by the first mouth up-sampling layer are spliced with the second-stage lip image features as the input of the second mouth up-sampling layer; the second-stage mouth image features output by the second mouth up-sampling layer are spliced with the third-stage lip image features as the input of the third mouth up-sampling layer; and the third-stage mouth image features output by the third mouth up-sampling layer serve as the input of the next mouth up-sampling layer, and so on until the last-stage mouth image features are generated and converted into a mouth image.
The fusion module adopts a Unet network, takes the face image after the mouth is erased as the input of an encoder in the Unet network, fuses the output of each layer of the encoder and the multi-level mouth image characteristics generated by the mouth generation module into the input of each layer of a decoder, and generates a fused complete face image.
In the above technical solution, the method for establishing the video signal library includes:
step one: plastic pellets with reflective outer walls are attached to the nose tips of the experimenters, and black small paper sheets are attached to the head posture key points of the experimenters;
step two: a camera is arranged directly opposite the experimenter, and a signal transmitting end and a signal receiving end are arranged on the two sides of the experimenter's face, wherein the signal transmitting end, the plastic pellet and the signal receiving end lie on a straight line, and the distance between the signal transmitting end and the signal receiving end is 1 m;
step three: the method comprises the steps of taking a center point of a camera as a center, establishing a three-dimensional coordinate system, starting a signal transmitting end to transmit a signal, starting the camera, and reading by experimenters;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
Further, in the above technical solution, S10 specifically includes:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter is not changed, extracting a face image set from all frames in the video, and intercepting a lip part in the face image as a sample lip image;
if the head posture of the experimenter is changed, extracting plastic pellet images from all frames in the video, establishing three-dimensional coordinates of the plastic pellets in the three-dimensional coordinate system, and using the corresponding lip shapes in the phoneme mouth shape driving method as sample lip shape images;
constructing a mouth erasing network, randomly taking out part of face images from a face image set, marking the mouth positions, training the mouth erasing network, recognizing and erasing the mouth positions of the face images with the untagged mouth positions by using the trained mouth erasing network, and reserving the face images;
and converting the audio of the time domain into a Mel frequency spectrum of the frequency domain, wherein the sampling rate of the frequency domain is consistent with the sampling rate of the video frame.
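Matching the frequency-domain sampling rate to the video frame rate amounts to choosing the analysis hop so that one Mel frame corresponds to one video frame. A minimal sketch, with the sampling rate and frame rate values assumed purely for illustration:

```python
SR = 16000   # assumed audio sampling rate (Hz)
FPS = 25     # assumed video frame rate (frames/s)

# One Mel frame per video frame: pick the STFT hop so that the
# Mel frame rate equals the video frame rate.
hop_length = SR // FPS            # samples advanced per analysis frame

def n_mel_frames(n_samples, hop):
    """Number of hop-aligned analysis frames for n_samples of audio."""
    return n_samples // hop

# a 2 s clip yields 50 video frames and, with this hop, 50 Mel frames
frames = n_mel_frames(2 * SR, hop_length)
```

With these assumed values the hop is 640 samples, so video frame i and Mel frame i cover the same time span, which is the alignment the preprocessing step requires.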
The phoneme mouth-shape driving method first converts speech or text into a phoneme sequence, each phoneme corresponding to a specific viseme (a specific mouth shape). In order to make the mouth shapes fit the real scene, designed temporal-smoothing rules must be applied to the viseme sequence. The algorithm includes two phases:
the first stage is irrelevant to a specific speaker and comprises three parallel networks which are respectively used for generating three groups of action parameters of mouth shape, eye-brow expression and head movement;
and in the second stage, synthesizing the specific speaker videos, and generating the speaking videos of different specific persons based on the self-adaptive attention network supervised by the three-dimensional face information.
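The phoneme-to-viseme conversion and the temporal smoothing mentioned above can be sketched as follows; the mapping table and the single-frame absorption rule are invented for illustration only, since real viseme tables and smoothing rules are language- and system-specific:

```python
# hypothetical phoneme -> viseme table (illustrative, not a real standard)
VISEME = {"AA": "open", "B": "closed", "M": "closed",
          "F": "lip-teeth", "UW": "rounded", "S": "spread"}

def phonemes_to_visemes(phonemes):
    """Map each phoneme of the sequence to its viseme label."""
    return [VISEME.get(p, "neutral") for p in phonemes]

def smooth(visemes, min_run=2):
    """Toy temporal-smoothing rule: a viseme lasting fewer than
    min_run frames is absorbed into its predecessor, so the mouth
    shape does not flicker frame to frame."""
    out = []
    i = 0
    while i < len(visemes):
        j = i
        while j < len(visemes) and visemes[j] == visemes[i]:
            j += 1
        run = j - i
        if run < min_run and out:
            out.extend([out[-1]] * run)   # absorb the short run
        else:
            out.extend([visemes[i]] * run)
        i = j
    return out

frames = smooth(phonemes_to_visemes(["M", "AA", "M", "UW", "UW"]))
```

In this toy run the single-frame "open" and the second "closed" are absorbed, leaving a stable closed-then-rounded mouth sequence.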
Further, in the above technical solution, the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiving end corresponding to the video specifically includes:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
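Steps 3 and 4 reduce to a threshold comparison that also determines which lip-sample source S10 uses for the frame. A minimal sketch (the function and variable names are illustrative, not from the source):

```python
def head_pose_changed(likelihood_ratio, eta):
    """Steps 3-4: the ball position, and hence the head pose, is
    judged changed when the likelihood ratio exceeds the detection
    threshold."""
    return likelihood_ratio > eta

def lip_sample_for_frame(likelihood_ratio, eta, video_lip, phoneme_lip):
    """S10 branching: crop the lips from the video frame while the
    head is still; once the pose changes, substitute the lip shape
    from the phoneme mouth-shape driving method."""
    if head_pose_changed(likelihood_ratio, eta):
        return phoneme_lip
    return video_lip

sample = lip_sample_for_frame(0.9, 0.5, "video crop", "viseme lip")
```

Here a ratio of 0.9 against a threshold of 0.5 selects the phoneme-driven lip shape, matching the branch for a changed head posture.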
Further, in the above technical solution, step 1 specifically includes:
the first step: express the signal transmitted by the transmitting end in frequency-domain form, S = [S(0), S(1), ..., S(K-1)]; after underwater propagation, the receiving end receives the signal in the frequency-domain form of a matrix X;
and a second step of: the binary hypothesis testing method is adopted to carry out parameter estimation on the frequency domain form of the received signal with appointed times, and the method specifically comprises the following steps:
based on the different hypotheses H_0 and H_1 in binary hypothesis testing, perform parameter estimation on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, ...);
and a third step of: use the EM time-delay estimation algorithm to calculate the direct-wave multipath time delay and the ball-scattered-wave multipath time delay, specifically: the EM time-delay estimation algorithm gives the multipath time-delay estimates of the direct wave and of the ball-scattered wave, the numbers of sound rays of the direct wave and of the ball-scattered wave being M and N respectively, each sound ray contributing one time-delay estimate.
Further, in the above technical solution, step 2 specifically includes:
the first step: according to the method of the extended Kalman filtering, a state equation and an observation equation of the extended Kalman filtering are established, and the method specifically comprises the following steps:
according to the extended Kalman filtering method, set the state quantity of the ball motion x = [x, v_x, y, v_y]^T and the observation z (the multipath time-delay vector), and establish the state equation and observation equation of the extended Kalman filter:

x_k = F·x_{k-1} + w_k

z_k = h(x_k) + v_k

wherein F is the state transition matrix, determined by the form of motion of the ball; h(·) is the observation function; w_k is the state noise matrix, obeying w_k ~ N(0, Q); and v_k is the observation noise matrix, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, use the extended Kalman filtering method to obtain the state prediction equation and predicted covariance matrix of the next moment, specifically: according to the known information at time k-1, the extended Kalman filtering method gives the state prediction x̂_{k|k-1} = F·x̂_{k-1|k-1} and the predicted covariance matrix:

P_{k|k-1} = F·P_{k-1|k-1}·F^T + Q
And a third step of: calculate the functional relationship between the ball motion state and the multipath time delay, specifically: since the observation function h(x_k) is nonlinear, according to the processing method of the extended Kalman filter a first-order Taylor formula is used to linearize it; the expression of h(x_k) is obtained with the virtual-source mirror-image method, yielding the functional relationship between the ball motion state x_k and the multipath time delay.
To linearize the functional relationship h(x_k), the propagation of the ball-scattered wave is divided into two sections, transmitting end-ball and ball-receiving end, each described with the virtual-source mirror-image method.
For the transmitting end-small ball (st) section, let the number of sound rays be N_st; the travel r_st,i of the i-th ray is the straight-line distance from the i-th virtual image of the transmitting end to the ball position.

For the small ball-receiving end (tr) section, let the number of sound rays be N_tr; the travel r_tr,j of the j-th ray is the straight-line distance from the ball position to the j-th virtual image of the receiving end.

(x_s, y_s, z_s), (x_t, y_t, z_t) and (x_r, y_r, z_r) respectively denote the coordinates of the transmitting end, the small ball, and the receiving end. The two ray numbers are selected following the principle N_st × N_tr = N, which ensures that the matrix dimensions are consistent. In shallow sea the gradient change of the sound velocity is not large, so the sound velocity can be set to a constant value c in the delay calculation; the multipath time delay is then the total ray travel divided by c:

τ_ij = (r_st,i + r_tr,j) / c, i = 1, ..., N_st, j = 1, ..., N_tr

Therefore the relationship between the ball motion state and the multipath time delay is z = h(x_k), whose components are the delays τ_ij above. The Jacobian matrix of the observation function h(x_k), i.e. the observation matrix H_k, is then found.
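Under the constant-sound-speed assumption, the delay contributed by each (source-image, receiver-image) ray pair can be sketched as follows; the virtual-image coordinates and the value of c are assumed purely for illustration:

```python
import math

C = 1500.0  # assumed constant sound speed (m/s)

def ray_travel(p, q):
    """Straight-line travel of one sound ray between two 3-D points."""
    return math.dist(p, q)

def multipath_delays(src_images, ball, rcv_images, c=C):
    """Delay of each ray pair (source image i, receiver image j):
    tau_ij = (r_st_i + r_tr_j) / c."""
    return [
        (ray_travel(s, ball) + ray_travel(ball, r)) / c
        for s in src_images
        for r in rcv_images
    ]

# one image on each side, ball midway on a 1 m transmitter-receiver line
taus = multipath_delays([(0.0, 0.0, 0.0)],   # virtual source image
                        (0.5, 0.0, 0.0),     # ball position
                        [(1.0, 0.0, 0.0)])   # virtual receiver image
```

With N_st source images and N_tr receiver images the list has N_st × N_tr entries, matching the dimension rule stated above.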
Fourth step: calculate the prediction of the observation and the Kalman gain, specifically the predicted observation ẑ_{k|k-1} = h(x̂_{k|k-1}) and the Kalman gain:

K_k = P_{k|k-1}·H_k^T·(H_k·P_{k|k-1}·H_k^T + R)^{-1}
Fifth step: update the observation; specifically, after the observation z_k at time k is obtained, the update process gives the state update value and the error covariance update matrix:

x_{k|k} = x̂_{k|k-1} + K_k·(z_k - ẑ_{k|k-1})

P_{k|k} = P_{k|k-1} - K_k·H_k·P_{k|k-1}

z_{k|k} = h(x_{k|k})

Wherein the updated observation value z_{k|k} represents the multipath time-delay joint estimate obtained by combining the ball motion information.
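The five steps can be condensed into a toy one-dimensional example: a constant-velocity state and a single delay-like scalar observation. Every constant below is assumed for illustration, and the real observation of the method is the multipath-delay vector, not this scalar:

```python
import math

DT = 0.04   # assumed frame interval (25 fps)
D = 0.5     # assumed lateral offset of the receiver (m)
C = 1500.0  # assumed sound speed (m/s)

def h(x):
    """Observation: propagation delay of a ray from position x[0]."""
    return math.sqrt(x[0] ** 2 + D ** 2) / C

def H_jac(x):
    """Jacobian of h with respect to the state [pos, vel]."""
    r = math.sqrt(x[0] ** 2 + D ** 2)
    return [x[0] / (r * C), 0.0]

def ekf_step(x, P, z, q=1e-6, r_var=1e-12):
    """One predict + update cycle of the extended Kalman filter."""
    # predict: x_k|k-1 = F x_k-1,  P_k|k-1 = F P F^T + Q  (F = [[1,dt],[0,1]])
    xp = [x[0] + DT * x[1], x[1]]
    Pp = [[P[0][0] + DT * (P[1][0] + P[0][1]) + DT * DT * P[1][1] + q,
           P[0][1] + DT * P[1][1]],
          [P[1][0] + DT * P[1][1],
           P[1][1] + q]]
    # update with the scalar observation (innovation covariance is scalar)
    Hk = H_jac(xp)                       # [h0, 0]
    S = Hk[0] * Pp[0][0] * Hk[0] + r_var
    K = [Pp[0][0] * Hk[0] / S, Pp[1][0] * Hk[0] / S]
    innov = z - h(xp)
    xn = [xp[0] + K[0] * innov, xp[1] + K[1] * innov]
    # P_k|k = P_k|k-1 - K H P_k|k-1
    Pn = [[Pp[0][0] - K[0] * Hk[0] * Pp[0][0],
           Pp[0][1] - K[0] * Hk[0] * Pp[0][1]],
          [Pp[1][0] - K[1] * Hk[0] * Pp[0][0],
           Pp[1][1] - K[1] * Hk[0] * Pp[0][1]]]
    return xn, Pn

# true ball position 0.1 m, initial estimate 0.05 m:
x1, P1 = ekf_step([0.05, 0.0], [[0.01, 0.0], [0.0, 1e-4]], h([0.1, 0.0]))
# one update pulls the position estimate toward the true position
```

The update uses exactly the gain and covariance formulas of the fourth and fifth steps, specialized to a scalar observation so that no matrix inversion is needed.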
Further, in the above technical solution, step 3 specifically includes:
the first step: take the multipath time-delay joint estimate z_{k|k} as the parameter estimate in generalized likelihood ratio detection.

And a second step of: substitute the obtained parameter estimate into the likelihood functions under the hypotheses H_0 and H_1 and calculate the likelihood ratio L_GLRT.
And a third step of: comparing the likelihood ratio with a detection threshold value to obtain a detection result of whether the coordinates of the ball are changed beyond the threshold value, wherein the detection result specifically comprises:
Formula (1) is simplified to obtain a test statistic T(X_k), which is compared with the corresponding detection threshold η* to judge whether the ball coordinates have changed beyond the threshold; the detection threshold corresponds to a coordinate change of 0.2-0.5 cm.

The matrices involved contain only the multipath time delays of the direct wave and of the ball-scattered wave, respectively.
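The threshold test can be illustrated with a toy generalized likelihood ratio for a Gaussian mean shift; this is not the underwater-acoustic likelihood of the method, and the statistic and threshold below are illustrative only:

```python
def glrt_statistic(samples, sigma=1.0):
    """Toy GLRT for H0: mean 0 vs H1: mean != 0, known variance.
    With the MLE mean plugged in, 2 ln L_GLRT = n * xbar^2 / sigma^2."""
    n = len(samples)
    xbar = sum(samples) / n
    return n * xbar * xbar / (sigma * sigma)

def coords_changed(samples, eta, sigma=1.0):
    """Declare a coordinate change when the statistic exceeds eta."""
    return glrt_statistic(samples, sigma) > eta

still = [0.1, -0.2, 0.05, 0.0]   # ball essentially at rest
moved = [1.1, 0.9, 1.2, 1.0]     # ball clearly displaced
eta = 3.84                       # roughly the 95% chi-square(1) point
```

Plugging the estimated parameter into the likelihood, forming the ratio, and comparing with a fixed threshold mirrors the three sub-steps above, just with a much simpler signal model.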
In the above technical solution, step S30 specifically includes:
aiming at the Mel frequency spectrum of a given audio, acquire the multi-frame mouth-erased face images of the ball-marked person and the corresponding head posture parameters according to the method of step 1, and align the frequency-domain Mel frequency spectrum with the multi-frame face images in time;

using the trained face generation model, the audio feature extraction module first performs feature extraction on the Mel frequency spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module then generates multi-level mouth image features from the multi-level lip image features and the head posture parameters; finally the multi-level mouth image features are fused into the multi-frame mouth-erased face images of the ball-marked person, generating face images whose mouth actions match the specific audio.
Further, in the above technical solution, the specific step of generating the head posture feature by the head posture module includes:
the plastic pellet is taken as the center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the movement of the black paper markers.
Further, in the above technical solution, the head posture key points at least include points where the left eye corner, the right eye corner, the left mouth corner, the right mouth corner, the top center, the top of the head, the top of the left ear, the bottom of the left ear, the top of the right ear, and the bottom of the right ear of the human face are located.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (10)
1. A three-dimensional image pronunciation process head action simulation method is characterized by comprising the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generating module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-stage mouth image features into the face image after the mouth is erased in the step S10, and calculating fusion loss, wherein the fusion loss uses the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
s30: using the trained three-dimensional image head model, generate the head animation of the three-dimensional image for specific audio.
2. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 1, wherein the method for creating the video signal library is as follows:
step one: plastic pellets with reflective outer walls are attached to the nose tips of the experimenters, and black small paper sheets are attached to the head posture key points of the experimenters;
step two: the method comprises the steps that a camera is arranged on the right opposite side of an experimenter, a signal transmitting end and a signal receiving end are arranged on two sides of the face of the experimenter, wherein the signal transmitting end and the signal receiving end form a straight line with the plastic pellets, and the distance between the signal transmitting end and the signal receiving end is 1m;
step three: the method comprises the steps of taking a center point of a camera as a center, establishing a three-dimensional coordinate system, starting a signal transmitting end to transmit a signal, starting the camera, and reading by experimenters;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
3. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 2, wherein S10 specifically comprises:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter is not changed, extracting a face image set from all frames in the video, and intercepting a lip part in the face image as a sample lip image;
if the head posture of the experimenter is changed, extracting plastic pellet images from all frames in the video, establishing three-dimensional coordinates of the plastic pellets in the three-dimensional coordinate system, and using the corresponding lip shapes in the phoneme mouth shape driving method as sample lip shape images;
constructing a mouth erasing network, randomly taking out part of face images from a face image set, marking the mouth positions, training the mouth erasing network, recognizing and erasing the mouth positions of the face images with the untagged mouth positions by using the trained mouth erasing network, and reserving the face images;
and converting the audio of the time domain into a Mel frequency spectrum of the frequency domain, wherein the sampling rate of the frequency domain is consistent with the sampling rate of the video frame.
4. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 3, wherein the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiver corresponding to the video specifically comprises:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
5. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 1 specifically comprises:
the first step: express the signal transmitted by the transmitting end in frequency-domain form, S = [S(0), S(1), ..., S(K-1)]; after underwater propagation, the receiving end receives the signal in the frequency-domain form of a matrix X;
and a second step of: the binary hypothesis testing method is adopted to carry out parameter estimation on the frequency domain form of the received signal with appointed times, and the method specifically comprises the following steps:
based on the different hypotheses H_0 and H_1 in binary hypothesis testing, perform parameter estimation on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, ...);
and a third step of: use the EM time-delay estimation algorithm to calculate the direct-wave multipath time delay and the ball-scattered-wave multipath time delay, specifically: the EM time-delay estimation algorithm gives the multipath time-delay estimates of the direct wave and of the ball-scattered wave, the numbers of sound rays of the direct wave and of the ball-scattered wave being M and N respectively, each sound ray contributing one time-delay estimate.
6. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 2 specifically comprises:
the first step: according to the method of the extended Kalman filtering, a state equation and an observation equation of the extended Kalman filtering are established, and the method specifically comprises the following steps:
according to the extended Kalman filtering method, set the state quantity of the ball motion x = [x, v_x, y, v_y]^T and the observation z (the multipath time-delay vector), and establish the state equation and observation equation of the extended Kalman filter:

x_k = F·x_{k-1} + w_k

z_k = h(x_k) + v_k

wherein F is the state transition matrix, determined by the form of motion of the ball; h(·) is the observation function; w_k is the state noise matrix, obeying w_k ~ N(0, Q); and v_k is the observation noise matrix, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, use the extended Kalman filtering method to obtain the state prediction equation and predicted covariance matrix of the next moment, specifically: according to the known information at time k-1, the extended Kalman filtering method gives the state prediction x̂_{k|k-1} = F·x̂_{k-1|k-1} and the predicted covariance matrix:

P_{k|k-1} = F·P_{k-1|k-1}·F^T + Q
And a third step of: calculate the functional relationship between the ball motion state and the multipath time delay, specifically: since the observation function h(x_k) is nonlinear, according to the processing method of the extended Kalman filter a first-order Taylor formula is used to linearize it; the expression of h(x_k) is obtained with the virtual-source mirror-image method, yielding the functional relationship between the ball motion state x_k and the multipath time delay.
To linearize the functional relationship h(x_k), the propagation of the ball-scattered wave is divided into two sections, transmitting end-ball and ball-receiving end, each described with the virtual-source mirror-image method.
For the transmitting end-small ball (st) section, let the number of sound rays be N_st; the travel r_st,i of the i-th ray is the straight-line distance from the i-th virtual image of the transmitting end to the ball position.

For the small ball-receiving end (tr) section, let the number of sound rays be N_tr; the travel r_tr,j of the j-th ray is the straight-line distance from the ball position to the j-th virtual image of the receiving end.

(x_s, y_s, z_s), (x_t, y_t, z_t) and (x_r, y_r, z_r) respectively denote the coordinates of the transmitting end, the small ball, and the receiving end. The two ray numbers are selected following the principle N_st × N_tr = N, which ensures that the matrix dimensions are consistent. In shallow sea the gradient change of the sound velocity is not large, so the sound velocity can be set to a constant value c in the delay calculation; the multipath time delay is then the total ray travel divided by c:

τ_ij = (r_st,i + r_tr,j) / c, i = 1, ..., N_st, j = 1, ..., N_tr

Therefore the relationship between the ball motion state and the multipath time delay is z = h(x_k), whose components are the delays τ_ij above. The Jacobian matrix of the observation function h(x_k), i.e. the observation matrix H_k, is then found.
Fourth step: calculate the prediction of the observation and the Kalman gain, specifically the predicted observation ẑ_{k|k-1} = h(x̂_{k|k-1}) and the Kalman gain:

K_k = P_{k|k-1}·H_k^T·(H_k·P_{k|k-1}·H_k^T + R)^{-1}
Fifth step: update the observation; specifically, after the observation z_k at time k is obtained, the update process gives the state update value and the error covariance update matrix:

x_{k|k} = x̂_{k|k-1} + K_k·(z_k - ẑ_{k|k-1})

P_{k|k} = P_{k|k-1} - K_k·H_k·P_{k|k-1}

z_{k|k} = h(x_{k|k})

wherein the updated observation value z_{k|k} represents the multipath time-delay joint estimate obtained by combining the ball motion information.
7. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 3 specifically comprises:
the first step: take the multipath time-delay joint estimate z_{k|k} as the parameter estimate in generalized likelihood ratio detection.

And a second step of: substitute the obtained parameter estimate into the likelihood functions under the hypotheses H_0 and H_1 and calculate the likelihood ratio L_GLRT.
the third step: comparing the likelihood ratio with a detection threshold to obtain the detection result of whether the ball coordinates have changed beyond the threshold, specifically:
simplifying formula (1) to obtain the test statistic T(X_k), and comparing it with the corresponding detection threshold eta* to judge whether the ball coordinates have changed beyond the threshold,
where the matrices contain only the multipath delays of the direct (straight-through) wave and of the ball-scattered wave, respectively.
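A minimal sketch of the threshold comparison, assuming for illustration i.i.d. Gaussian observations with known variance (the actual densities in the claim are not recoverable from the text, so this Gaussian form is an assumption):

```python
import numpy as np

def glrt_detect(x, mu0, mu1_hat, sigma2, eta):
    """GLRT for a shift in the mean of a Gaussian observation vector:
    H0 (ball unchanged, mean mu0) vs H1 (ball moved, mean replaced by
    its estimate mu1_hat); returns the statistic and the decision."""
    ll0 = -np.sum((x - mu0) ** 2) / (2 * sigma2)      # log-lik under H0
    ll1 = -np.sum((x - mu1_hat) ** 2) / (2 * sigma2)  # log-lik under H1
    T = 2 * (ll1 - ll0)        # test statistic T(X_k)
    return T, T > eta          # compare with threshold eta*
```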
8. The head action simulation method in a three-dimensional image pronunciation process according to claim 1, wherein step S30 specifically comprises:
for the Mel spectrum of a given audio clip, acquiring, by the method of step 1, the multi-frame face images of the ball character with the mouth erased and the corresponding head-pose parameters, and aligning the frequency-domain Mel spectrum with the multi-frame face images in time;
using the trained face-forgery generation model: the audio feature extraction module first performs feature extraction on the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module next generates multi-level mouth image features from the multi-level lip image features and the head-pose parameters; finally, the multi-level mouth image features are fused into the multi-frame face images of the ball character with the mouth erased, generating a forged face image of the mouth action under the specific audio.
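The alignment and generation pipeline of step S30 can be sketched as a chain of module calls; everything below is a hypothetical stand-in (function names, module callables, array shapes), not the patented model itself:

```python
import numpy as np

def align_mel_to_frames(mel, mel_fps, n_frames, video_fps):
    """For each video frame, pick the mel-spectrogram column whose
    timestamp is nearest to that frame (the time alignment of S30)."""
    idx = np.round(np.arange(n_frames) / video_fps * mel_fps).astype(int)
    return mel[:, np.clip(idx, 0, mel.shape[1] - 1)]

def generate_fake_mouth(mel_aligned, face_frames, head_pose,
                        audio_enc, lip_sync, mouth_gen, fuse):
    """Chain the claimed stages: audio features -> multi-level lip
    features -> multi-level mouth features (conditioned on head pose)
    -> mouth features fused back into the mouth-erased face frames."""
    audio_feat = audio_enc(mel_aligned)            # final audio features
    lip_feats = lip_sync(audio_feat)               # multi-level lip features
    mouth_feats = mouth_gen(lip_feats, head_pose)  # multi-level mouth features
    return fuse(face_frames, mouth_feats)          # forged face frames
```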
9. The head action simulation method in a three-dimensional image pronunciation process according to claim 3, wherein the specific step of generating the head posture feature by the head posture module comprises:
taking the plastic ball as the center point, the head-pose change of the experimenter is determined from the coordinate change of the plastic ball and the change of the black paper sheet.
10. The head action simulation method in a three-dimensional image pronunciation process according to claim 2, wherein the head-pose key points at least comprise: the left eye corner, right eye corner, left mouth corner, right mouth corner, top center of the head, top of the left ear, bottom of the left ear, top of the right ear and bottom of the right ear of the human face.
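As a hypothetical illustration only (the identifier names and the roll computation are assumptions, not part of the claim), the nine key points and one pose quantity derivable from them might look like:

```python
import math

# The nine key points enumerated in claim 10, as illustrative labels.
HEAD_POSE_KEYPOINTS = [
    "left_eye_corner", "right_eye_corner", "left_mouth_corner",
    "right_mouth_corner", "head_top_center", "left_ear_top",
    "left_ear_bottom", "right_ear_top", "right_ear_bottom",
]

def roll_from_eye_corners(left_eye, right_eye):
    """Head roll angle (degrees) from the line joining the eye corners,
    in image coordinates (x right, y down)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))
```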
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211671532.9A CN116246649A (en) | 2022-12-26 | 2022-12-26 | Head action simulation method in three-dimensional image pronunciation process |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116246649A true CN116246649A (en) | 2023-06-09 |
Family
ID=86630472
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211671532.9A Pending CN116246649A (en) | 2022-12-26 | 2022-12-26 | Head action simulation method in three-dimensional image pronunciation process |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116246649A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116863046A (en) * | 2023-07-07 | 2023-10-10 | 广东明星创意动画有限公司 | Virtual mouth shape generation method, device, equipment and storage medium |
CN116863046B (en) * | 2023-07-07 | 2024-03-19 | 广东明星创意动画有限公司 | Virtual mouth shape generation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10891472B2 (en) | Automatic body movement recognition and association system | |
CN110175596B (en) | Virtual learning environment micro-expression recognition and interaction method based on double-current convolutional neural network | |
Beal et al. | A graphical model for audiovisual object tracking | |
Hong et al. | Real-time speech-driven face animation with expressions using neural networks | |
CN113378806B (en) | Audio-driven face animation generation method and system integrating emotion coding | |
CN110610534B (en) | Automatic mouth shape animation generation method based on Actor-Critic algorithm | |
CN112308949A (en) | Model training method, human face image generation device and storage medium | |
CN110458046B (en) | Human motion trajectory analysis method based on joint point extraction | |
CN112597814A (en) | Improved Openpos classroom multi-person abnormal behavior and mask wearing detection method | |
CN113838174B (en) | Audio-driven face animation generation method, device, equipment and medium | |
CN116246649A (en) | Head action simulation method in three-dimensional image pronunciation process | |
CN1952850A (en) | Three-dimensional face cartoon method driven by voice based on dynamic elementary access | |
Jarabese et al. | Sign to speech convolutional neural network-based filipino sign language hand gesture recognition system | |
CN113283372A (en) | Method and apparatus for processing image of person | |
Tur et al. | Isolated sign recognition with a siamese neural network of RGB and depth streams | |
CN116312512A (en) | Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device | |
RU2737231C1 (en) | Method of multimodal contactless control of mobile information robot | |
CN114882590A (en) | Lip reading method based on multi-granularity space-time feature perception of event camera | |
Mishra et al. | Environment descriptor for the visually impaired | |
CN114466178A (en) | Method and device for measuring synchronism of voice and image | |
Shreekumar et al. | Improved viseme recognition using generative adversarial networks | |
Hsieh et al. | Consonant Classification in Mandarin Based on the Depth Image Feature: A Pilot Study. | |
CN117153195B (en) | Method and system for generating speaker face video based on adaptive region shielding | |
Sams et al. | SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation | |
CN113838218B (en) | Speech driving virtual human gesture synthesis method for sensing environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||