CN116246649A - Head action simulation method in three-dimensional image pronunciation process - Google Patents

Head action simulation method in three-dimensional image pronunciation process

Info

Publication number
CN116246649A
CN116246649A (application CN202211671532.9A)
Authority
CN
China
Prior art keywords
mouth
head
image
lip
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211671532.9A
Other languages
Chinese (zh)
Inventor
周安斌
晏武志
李鑫
彭辰
潘见见
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jindong Digital Creative Co ltd
Original Assignee
Shandong Jindong Digital Creative Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jindong Digital Creative Co ltd filed Critical Shandong Jindong Digital Creative Co ltd
Priority to CN202211671532.9A
Publication of CN116246649A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/18Details of the transformation process
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads

Abstract

The invention provides a head action simulation method for the three-dimensional image pronunciation process, belonging to the technical field of three-dimensional virtual images. The method comprises: obtaining face videos and corresponding audio from a video library, aligning video frames with audio frames, and extracting multi-frame face images, head posture parameters and Mel spectra as training samples; preprocessing the face images to generate face images with the mouth erased; establishing a three-dimensional image head model and training it with the training samples, the model comprising an audio feature extraction module, a lip synchronization module, a mouth generation module, a head posture module and a fusion module; and generating, with the trained three-dimensional image head model, the head motion for a specific audio. The method greatly reduces the amount of computation while keeping the head posture well linked with the pronunciation, avoiding stiffness in the three-dimensional image pronunciation process.

Description

Head action simulation method in three-dimensional image pronunciation process
Technical Field
The invention belongs to the technical field of three-dimensional virtual figures, and particularly relates to a head action simulation method in a three-dimensional figure pronunciation process.
Background
Many people make small head movements while speaking without noticing it. When a camera is used to capture images of a speaking face, the mouth must be tracked as the head moves, which introduces a large amount of computation. On the other hand, these small head-posture changes differ from person to person and are not universal; ignoring them during acquisition allows speaking-face images to be captured more quickly and improves processing efficiency.
Chinese invention patent CN111081270B (application number CN201911314031.3) discloses a real-time audio-driven virtual character mouth-shape synchronization control method. The method comprises: identifying phoneme probabilities from a real-time speech stream; filtering the phoneme probabilities; converting the sampling rate of the phoneme probabilities to match the virtual-character rendering frame rate; and converting the phoneme probabilities into a standard mouth-shape configuration and rendering the mouth shape. The method avoids having to transmit a phoneme sequence or mouth-shape sequence synchronously with the audio stream, significantly reduces the complexity, coupling and implementation difficulty of the system, and is suitable for various application scenarios in which virtual characters are rendered on display devices.
However, in that invention and in many current three-dimensional figures, the pronunciation process involves only simple mouth-shape changes; the head posture and the pronunciation lack linkage, so the pronunciation process of the three-dimensional figure appears stiff.
Disclosure of Invention
In view of the above, the invention provides a head action simulation method for the three-dimensional image pronunciation process, which solves the technical problem that only simple mouth-shape changes occur during three-dimensional image pronunciation while the head posture lacks linkage with the pronunciation, making the pronunciation process stiff.
The invention is realized in the following way:
the invention provides a head action simulation method in a three-dimensional image pronunciation process, which comprises the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generating module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-stage mouth image features into the face image after the mouth is erased in the step S10, and calculating fusion loss, wherein the fusion loss uses the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
s30: and generating the three-dimensional image head model aiming at the specific audio frequency by using the trained three-dimensional image head model.
The mouth erasing network adopts a Unet network and is used for generating a mouth mask representing the position of the mouth, and the mouth position in the face image is erased according to the mouth mask.
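As an illustration of the erasing step, the sketch below applies a predicted mouth mask to blank out the mouth region of a face image; the U-Net segmentation network is assumed to be already trained, and the `unet` handle and threshold value are placeholders rather than the patent's implementation.

```python
# Minimal sketch: erase the mouth region using a predicted mouth mask.
# `unet` is an assumed, already-trained segmentation network returning a
# 1-channel mouth-probability map; the threshold is an illustrative value.
import torch

def erase_mouth(face, unet, threshold=0.5):
    """face: (batch, 3, H, W) tensor in [0, 1]."""
    with torch.no_grad():
        mask = (torch.sigmoid(unet(face)) > threshold).float()  # 1 inside the mouth region
    return face * (1.0 - mask)                                  # zero out the mouth pixels
```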
The audio feature extraction module is composed of a audio downsampling layers and an LSTM layer, firstly, the multi-frame Mel frequency spectrum is subjected to dimension reduction processing sequentially through the audio downsampling layers to generate multi-stage audio features, and then the LSTM layer is used for fusing the last-stage audio features of the multi-frame Mel frequency spectrum to generate final audio features.
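A minimal PyTorch sketch of this kind of structure is given below: a stack of down-sampling layers produces per-frame audio features from the Mel spectrum, and an LSTM fuses the last-stage features across frames into the final audio feature. The layer count, channel widths and tensor layout are illustrative assumptions, not the patent's parameters.

```python
# Illustrative sketch only: "a" down-sampling layers followed by an LSTM that fuses
# the last-stage features of several Mel-spectrum frames. All sizes are assumptions.
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    def __init__(self, hidden=256, num_down=4):   # num_down plays the role of "a"
        super().__init__()
        layers, ch = [], 1
        for i in range(num_down):
            out_ch = 32 * (2 ** i)
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.down = nn.Sequential(*layers)         # multi-stage dimension reduction
        self.pool = nn.AdaptiveAvgPool2d(1)        # collapse to one vector per frame
        self.lstm = nn.LSTM(ch, hidden, batch_first=True)

    def forward(self, mel):                        # mel: (batch, frames, n_mels, time_bins)
        b, t = mel.shape[:2]
        x = mel.reshape(b * t, 1, *mel.shape[2:])  # treat each frame independently
        x = self.pool(self.down(x)).flatten(1)     # last-stage audio feature per frame
        x = x.reshape(b, t, -1)
        out, _ = self.lstm(x)                      # fuse features across frames
        return out[:, -1]                          # final audio feature
```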
The lip synchronous module consists of b lip up-sampling layers which are connected in series, wherein b is more than or equal to 3; and taking the final audio feature obtained by the audio feature extraction module as input, sequentially generating multi-stage lip image features by utilizing a plurality of lip up-sampling layers, and converting the lip image features of the last stage into lip images.
The mouth generation module consists of c mouth up-sampling layers connected in series, where c is greater than or equal to 3. The first-stage lip image features generated by the lip synchronization module are concatenated with the head parameters and used as the input of the first mouth up-sampling layer; the first-stage mouth image features output by the first mouth up-sampling layer are concatenated with the second-stage lip image features as the input of the second mouth up-sampling layer; the second-stage mouth image features output by the second mouth up-sampling layer are concatenated with the third-stage lip image features as the input of the third mouth up-sampling layer; and the third-stage mouth image features output by the third mouth up-sampling layer are taken as the input of the next mouth up-sampling layer, and so on, until the last-stage mouth image features are generated and converted into a mouth image.
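The cascade described above (lip features plus head parameters feeding a chain of mouth up-sampling layers, with stage-wise concatenation) can be sketched as follows; the channel sizes, number of stages shown and head-parameter dimension are assumptions for illustration only.

```python
# Sketch of the cascaded up-sampling described for the mouth-generation module:
# each mouth up-sampling layer consumes the previous mouth feature concatenated with
# the lip feature of the matching stage. Shapes and channel counts are assumptions.
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.ReLU(inplace=True))

class MouthGenerator(nn.Module):
    def __init__(self, lip_chs=(256, 128, 64), head_dim=16):
        super().__init__()
        self.up1 = up_block(lip_chs[0] + head_dim, 128)   # lip stage 1 + head parameters
        self.up2 = up_block(128 + lip_chs[1], 64)         # + lip stage 2
        self.up3 = up_block(64 + lip_chs[2], 32)          # + lip stage 3
        self.to_img = nn.Conv2d(32, 3, 3, padding=1)      # last-stage feature -> mouth image

    def forward(self, lip_feats, head_params):
        l1, l2, l3 = lip_feats                            # coarsest to finest lip features
        h = head_params[..., None, None].expand(-1, -1, *l1.shape[2:])
        m1 = self.up1(torch.cat([l1, h], dim=1))
        m2 = self.up2(torch.cat([m1, l2], dim=1))
        m3 = self.up3(torch.cat([m2, l3], dim=1))
        return torch.sigmoid(self.to_img(m3)), (m1, m2, m3)
```

The stage-wise concatenation is what keeps the mouth features aligned with the lip features at each resolution before the final mouth image is produced.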
The fusion module adopts a Unet network, takes the face image after the mouth is erased as the input of an encoder in the Unet network, fuses the output of each layer of the encoder and the multi-level mouth image characteristics generated by the mouth generation module into the input of each layer of a decoder, and generates a fused complete face image.
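A compact sketch of the described fusion is shown below: the mouth-erased face is encoded, and at each decoder level the matching encoder feature and a mouth feature from the mouth generation module are concatenated in. Channel counts and feature resolutions are assumed for illustration.

```python
# U-Net-style fusion sketch: encoder features of the mouth-erased face are concatenated
# with mouth features at each decoder level. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

def down(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))

def up(in_ch, out_ch):
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))

class FusionUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1, self.e2, self.e3 = down(3, 32), down(32, 64), down(64, 128)
        self.d1 = up(128, 64)             # H/8 -> H/4
        self.d2 = up(64 + 64 + 64, 32)    # cat(d1, e2, mouth feature at H/4)
        self.d3 = up(32 + 32 + 32, 16)    # cat(d2, e1, mouth feature at H/2)
        self.out = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, erased_face, mouth_quarter, mouth_half):
        # mouth_quarter: (B, 64, H/4, W/4); mouth_half: (B, 32, H/2, W/2) -- assumed shapes
        e1 = self.e1(erased_face)         # H/2, 32 ch
        e2 = self.e2(e1)                  # H/4, 64 ch
        e3 = self.e3(e2)                  # H/8, 128 ch
        x = self.d1(e3)
        x = self.d2(torch.cat([x, e2, mouth_quarter], dim=1))
        x = self.d3(torch.cat([x, e1, mouth_half], dim=1))
        return torch.sigmoid(self.out(x))  # fused complete face image
```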
On the basis of the technical scheme, the head action simulation method in the three-dimensional image pronunciation process can be further improved as follows:
the method for establishing the video signal library comprises the following steps:
step one: plastic pellets with reflective outer walls are attached to the nose tips of the experimenters, and black small paper sheets are attached to the head posture key points of the experimenters;
step two: the method comprises the steps that a camera is arranged on the right opposite side of an experimenter, a signal transmitting end and a signal receiving end are arranged on two sides of the face of the experimenter, wherein the signal transmitting end and the signal receiving end form a straight line with the plastic pellets, and the distance between the signal transmitting end and the signal receiving end is 1m;
step three: the method comprises the steps of taking a center point of a camera as a center, establishing a three-dimensional coordinate system, starting a signal transmitting end to transmit a signal, starting the camera, and reading by experimenters;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
Further, the step S10 specifically includes:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter is not changed, extracting a face image set from all frames in the video, and intercepting a lip part in the face image as a sample lip image;
if the head posture of the experimenter is changed, extracting plastic pellet images from all frames in the video, establishing the three-dimensional coordinates of the plastic pellet in the three-dimensional coordinate system, and using the corresponding lip shapes in the phoneme mouth shape driving method as sample lip images;
constructing a mouth erasing network, randomly taking out part of face images from a face image set, marking the mouth positions, training the mouth erasing network, recognizing and erasing the mouth positions of the face images with the untagged mouth positions by using the trained mouth erasing network, and reserving the face images;
and converting the audio of the time domain into a Mel frequency spectrum of the frequency domain, wherein the sampling rate of the frequency domain is consistent with the sampling rate of the video frame.
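For example, the alignment between the Mel spectrum and the video frames can be obtained by choosing the hop length as the number of audio samples per video frame; the sampling rate, FFT size and frame rate below are assumed values.

```python
# Sketch: time-domain audio -> Mel spectrum whose frame rate matches the video frame rate
# (hop length = audio samples per video frame). Parameter values are assumptions.
import librosa
import numpy as np

def video_aligned_mel(wav_path, fps=25, n_mels=80):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps                       # one spectrum column per video frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, ~number of video frames)
```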
Further, the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiving end corresponding to the video specifically includes:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
Further, the step 1 specifically includes:
The first step: the signal transmitted by the transmitting end is represented in frequency-domain form as S = [S(0), S(1), …, S(K-1)]; after underwater propagation, the signal received by the receiving end is represented in frequency-domain form as the matrix X.
The second step: a binary hypothesis test is used to perform parameter estimation on the frequency-domain form of a specified number of received signals, specifically: under the two hypotheses H_0 and H_1 of the binary hypothesis test, parameter estimation is performed on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, …).
The third step: an EM time-delay estimation algorithm is used to compute the multipath delays of the direct wave and of the small-ball scattered wave, specifically: the EM time-delay estimation algorithm gives the direct-wave multipath delay estimates, abbreviated τ̂_d = [τ̂_d,1, …, τ̂_d,M], and the small-ball scattered-wave multipath delay estimates, abbreviated τ̂_s = [τ̂_s,1, …, τ̂_s,N], where M and N are the numbers of sound rays of the direct wave and of the small-ball scattered wave respectively, and each entry is the estimated delay of one sound ray.
Further, the step 2 specifically includes:
the first step: according to the method of the extended Kalman filtering, a state equation and an observation equation of the extended Kalman filtering are established, and the method specifically comprises the following steps:
According to the extended Kalman filtering method, the state quantity of the small-ball motion is set as x = [x, v_x, y, v_y]^T and the observation quantity as z (the vector of multipath delays), and the state equation and observation equation of the extended Kalman filter are established:
x_k = F x_{k-1} + w_k
z_k = h(x_k) + v_k
where F is the state transition matrix, determined by the form of motion of the small ball; h(·) is the observation function; w_k is the state noise matrix, obeying w_k ~ N(0, Q); and v_k is the observation noise matrix, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, an extended Kalman filtering method is used for obtaining a state prediction equation and a predicted covariance matrix of the next moment, and the method specifically comprises the following steps:
According to the known information at time k-1, the extended Kalman filtering method gives the state prediction x_{k|k-1} and the predicted covariance matrix P_{k|k-1}:
x_{k|k-1} = F x_{k-1|k-1}
P_{k|k-1} = F P_{k-1|k-1} F^T + Q_{k-1|k-1}
And a third step of: the functional relation between the ball motion state and the multipath time delay is calculated, and the method specifically comprises the following steps:
Because the functional relation h(x_k) is nonlinear, it is linearised with a first-order Taylor expansion according to the processing method of the extended Kalman filter, which requires the explicit expression of h(x_k); the virtual-source mirror-image method is used to express this relation and obtain the functional relationship between the small-ball motion state x_k and the multipath delays.
The propagation of the small-ball scattered wave is divided into two segments, transmitting end–small ball and small ball–receiving end, each described with the virtual-source mirror-image method.
For the transmitting end–small ball (st) segment, the number of sound rays is set to N_st, and the travel r_i^st of each sound ray (i = 1, …, N_st) is determined by the small-ball position as the distance from the corresponding mirror image of the transmitting end to the small ball.
For the small ball–receiving end (tr) segment, the number of sound rays is set to N_tr, and the travel r_j^tr of each sound ray (j = 1, …, N_tr) is likewise determined by the small-ball position.
(x_s, y_s, z_s), (x_t, y_t, z_t) and (x_r, y_r, z_r) denote the coordinates of the transmitting end, the small ball and the receiving end respectively. The numbers of sound rays are chosen so that N_st × N_tr matches N, which keeps the matrix dimensions consistent. In shallow sea the sound-speed gradient is small, so the sound speed can be taken as a constant value c in the delay calculation, and the multipath delay of each scattered path is then
τ_ij = (r_i^st + r_j^tr) / c,
which gives the functional relation z = h(x_k) between the small-ball motion state and the multipath delays.
The Jacobian matrix of the observation function h(x_k), i.e. the observation matrix H_k, is then obtained as H_k = ∂h(x)/∂x evaluated at x_{k|k-1}.
Fourth step: compute the prediction of the observation and the Kalman gain, specifically the predicted observation z_{k|k-1} = h(x_{k|k-1}) and the Kalman gain
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R)^{-1}.
fifth step: updating the observed value, and representing a multipath time delay joint estimated value obtained by combining the small ball motion information by the updated observed value, wherein the method specifically comprises the following steps of:
With the observation z_k at time k, the update step gives the state update value x_{k|k} and the error-covariance update matrix P_{k|k}:
x_{k|k} = x_{k|k-1} + K_k (z_k − h(x_{k|k-1}))
P_{k|k} = P_{k|k-1} − K_k H_k P_{k|k-1}
z_{k|k} = h(x_{k|k})
where the updated observation value z_{k|k} represents the multipath-delay joint estimate obtained by combining the small-ball motion information.
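The five steps above amount to one standard extended-Kalman-filter cycle. The NumPy sketch below illustrates that cycle with the observation function modelled as the delay of a single transmitting end – small ball – receiving end path at constant propagation speed, and a numerical Jacobian standing in for the first-order Taylor linearisation; the geometry, noise covariances and single-path simplification are assumptions, not the patent's virtual-source expansion.

```python
# Sketch of one EKF cycle for state x = [x, vx, y, vy] observed through a path delay.
# Geometry, Q, R and the single-path observation are illustrative assumptions.
import numpy as np

c = 343.0                                   # assumed constant propagation speed
tx = np.array([-0.5, 0.0, 0.0])             # transmitting-end position (assumed)
rx = np.array([0.5, 0.0, 0.0])              # receiving-end position (assumed)
dt = 0.04                                   # one video frame at 25 fps

F = np.array([[1, dt, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, dt],
              [0, 0, 0, 1]], dtype=float)   # constant-velocity state transition
Q = 1e-6 * np.eye(4)                        # state noise covariance (assumed)
R = 1e-9 * np.eye(1)                        # observation noise covariance (assumed)

def h(x):
    ball = np.array([x[0], x[2], 0.0])      # small-ball position from the state vector
    return np.array([(np.linalg.norm(ball - tx) + np.linalg.norm(rx - ball)) / c])

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian, standing in for the first-order Taylor linearisation."""
    J = np.zeros((len(f(x)), len(x)))
    for i in range(len(x)):
        dx = np.zeros_like(x); dx[i] = eps
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

def ekf_step(x_prev, P_prev, z_k):
    x_pred = F @ x_prev                                  # state prediction
    P_pred = F @ P_prev @ F.T + Q                        # predicted covariance
    H = jacobian(h, x_pred)                              # observation matrix H_k
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)                  # Kalman gain
    x_upd = x_pred + K @ (z_k - h(x_pred))               # state update
    P_upd = P_pred - K @ H @ P_pred                      # covariance update
    return x_upd, P_upd, h(x_upd)                        # h(x_upd): updated delay estimate
```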
Further, the step 3 specifically includes:
The first step: the multipath-delay joint estimate z_{k|k} is taken as the parameter estimate in the generalized likelihood-ratio detection.
The second step: the likelihood ratio is calculated by substituting the obtained parameter estimate into the likelihood functions under hypotheses H_0 and H_1, giving the likelihood ratio L_GLRT.
The third step: the likelihood ratio is compared with a detection threshold to decide whether the small-ball coordinates have changed beyond the threshold, specifically: the likelihood-ratio expression (1) is simplified to obtain a test statistic T(X_k), which is compared with the corresponding detection threshold η* to judge whether the small-ball coordinates have changed beyond the threshold. The matrices appearing in T(X_k) contain only the multipath delays of the direct wave and of the small-ball scattered wave respectively, together with the projection matrix onto the subspace spanned by the corresponding delay components.
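The patent's exact likelihood-ratio expressions are given only as equation images; as a generic illustration of the comparison step, the sketch below projects the frequency-domain received vector onto the subspace spanned by steering vectors built from the estimated delays and compares a normalised statistic with a threshold. All symbols and formulas here are assumptions about the general GLRT pattern, not the patent's formulas.

```python
# Generic subspace-projection detector sketch (assumed form, not the patent's equations).
# A(tau) stacks frequency-domain steering vectors for the estimated multipath delays;
# T(X) measures how much received energy lies in that subspace vs. the threshold eta_star.
import numpy as np

def steering_matrix(delays, freqs):
    # one column exp(-j*2*pi*f*tau_i) per estimated delay
    return np.exp(-2j * np.pi * np.outer(freqs, delays))

def detect_change(X, delays, freqs, eta_star):
    A = steering_matrix(delays, freqs)                         # (K, n_delays)
    P = A @ np.linalg.pinv(A)                                  # projector onto span(A)
    T = np.real(X.conj() @ P @ X) / np.real(X.conj() @ X)      # normalised test statistic
    return bool(T > eta_star), float(T)
```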
The step S30 specifically includes:
For the Mel spectrum of a given audio, multi-frame mouth-erased face images of the pellet-marked character and the corresponding head posture parameters are acquired according to the method of step S10, and the frequency-domain Mel spectrum is aligned in time with the multi-frame face images;
using the trained three-dimensional image head model, the audio feature extraction module first performs feature extraction on the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module then generates multi-level mouth image features from the multi-level lip image features and the head posture parameters; and finally the multi-level mouth image features are fused into the multi-frame mouth-erased face images of the character, generating synthesized face images with the mouth actions for the specific audio.
Further, the specific step of generating the head posture feature by the head posture module includes:
the plastic pellet is taken as a center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the change of the black paper sheet.
Further, the head posture key points at least include the points at the left eye corner, right eye corner, left mouth corner, right mouth corner, the top center of the head, the top of the left ear, the bottom of the left ear, the top of the right ear, and the bottom of the right ear of the face.
Compared with the prior art, the head action simulation method for the three-dimensional image pronunciation process has the following beneficial effects: head posture key points are used to describe head movements, and the small ball at the nose tip serves as a scatterer between the signal transmitting end and the receiving end, so that when the ball's coordinates change the signal received by the receiving end changes, allowing tiny changes of the experimenter's head to be detected; the detection threshold sets the threshold for the change of the ball's coordinates, and when the change exceeds the threshold the head posture of the face is judged to have changed and the lip shape from the phoneme mouth-shape driving method is used in place of the lip shape collected from the video images. This greatly reduces the amount of computation while keeping the head posture well linked with the pronunciation, avoiding stiffness in the three-dimensional image pronunciation process.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a three-dimensional character pronunciation process head motion simulation method provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
As shown in fig. 1, the present invention provides a flow chart of a three-dimensional image pronunciation process head motion simulation method, which comprises the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generation module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-level mouth image features into the face image after the mouth is erased in the S10, calculating fusion loss, and using the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
s30: and generating the three-dimensional image head model aiming at the specific audio frequency by using the trained three-dimensional image head model.
The mouth erasing network adopts a Unet network and is used for generating a mouth mask representing the mouth position, and the mouth position in the face image is erased according to the mouth mask.
The audio feature extraction module is composed of a audio downsampling layers and an LSTM layer, firstly, the multi-frame Mel frequency spectrum is subjected to dimension reduction processing sequentially through the audio downsampling layers to generate multi-stage audio features, and then the LSTM layer is used for fusing the last-stage audio features of the multi-frame Mel frequency spectrum to generate final audio features.
The lip synchronous module consists of b lip up-sampling layers which are connected in series, wherein b is more than or equal to 3; and taking the final audio feature obtained by the audio feature extraction module as input, sequentially generating multi-stage lip image features by utilizing a plurality of lip up-sampling layers, and converting the lip image features of the last stage into lip images.
The mouth generation module consists of c mouth up-sampling layers connected in series, where c is greater than or equal to 3. The first-stage lip image features generated by the lip synchronization module are concatenated with the head parameters and used as the input of the first mouth up-sampling layer; the first-stage mouth image features output by the first mouth up-sampling layer are concatenated with the second-stage lip image features as the input of the second mouth up-sampling layer; the second-stage mouth image features output by the second mouth up-sampling layer are concatenated with the third-stage lip image features as the input of the third mouth up-sampling layer; and the third-stage mouth image features output by the third mouth up-sampling layer are taken as the input of the next mouth up-sampling layer, and so on, until the last-stage mouth image features are generated and converted into a mouth image.
The fusion module adopts a Unet network, takes the face image after the mouth is erased as the input of an encoder in the Unet network, fuses the output of each layer of the encoder and the multi-level mouth image characteristics generated by the mouth generation module into the input of each layer of a decoder, and generates a fused complete face image.
In the above technical solution, the method for establishing the video signal library includes:
step one: plastic pellets with reflective outer walls are attached to the nose tips of the experimenters, and black small paper sheets are attached to the head posture key points of the experimenters;
step two: the method comprises the steps that a camera is arranged on the right opposite side of an experimenter, a signal transmitting end and a signal receiving end are arranged on two sides of the face of the experimenter, wherein the signal transmitting end and the signal receiving end are in a straight line with each other on a plastic pellet, and the distance between the signal transmitting end and the signal receiving end is 1m;
step three: the method comprises the steps of taking a center point of a camera as a center, establishing a three-dimensional coordinate system, starting a signal transmitting end to transmit a signal, starting the camera, and reading by experimenters;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
Further, in the above technical solution, S10 specifically includes:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter is not changed, extracting a face image set from all frames in the video, and intercepting a lip part in the face image as a sample lip image;
if the head posture of the experimenter is changed, extracting plastic pellet images from all frames in the video, establishing the three-dimensional coordinates of the plastic pellet in the three-dimensional coordinate system, and using the corresponding lip shapes in the phoneme mouth shape driving method as sample lip images;
constructing a mouth erasing network, randomly taking out part of face images from a face image set, marking the mouth positions, training the mouth erasing network, recognizing and erasing the mouth positions of the face images with the untagged mouth positions by using the trained mouth erasing network, and reserving the face images;
and converting the audio of the time domain into a Mel frequency spectrum of the frequency domain, wherein the sampling rate of the frequency domain is consistent with the sampling rate of the video frame.
The phoneme mouth-shape driving method first converts speech or text into a phoneme sequence, and each phoneme corresponds to a specific viseme (a specific mouth shape). To make the mouth shapes fit the real scene, the designed rules applied to the video sequence need to be smoothed in time. The algorithm comprises two stages:
In the first stage, which is independent of the specific speaker, three parallel networks respectively generate three groups of action parameters: mouth shape, eye-brow expression and head movement;
In the second stage, speaker-specific videos are synthesized, and speaking videos of different specific persons are generated based on an adaptive attention network supervised by three-dimensional face information.
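As an illustration of phoneme mouth-shape driving with temporal smoothing, the sketch below maps a per-frame phoneme sequence to viseme mouth-shape parameters and applies a moving-average filter; the phoneme inventory, the viseme table and the window size are toy assumptions, not the patent's data.

```python
# Illustrative phoneme-to-viseme driving with simple temporal smoothing.
# The phoneme set, viseme table and window size are toy assumptions.
import numpy as np

PHONEME_TO_VISEME = {"AA": 0, "IY": 1, "UW": 2, "M": 3, "F": 4, "SIL": 5}
VISEME_PARAMS = np.array([   # one row of mouth-shape parameters per viseme (toy values)
    [1.0, 0.2], [0.4, 0.8], [0.3, 0.3], [0.0, 0.0], [0.2, 0.5], [0.1, 0.1]])

def mouth_params(phoneme_per_frame, window=3):
    raw = np.stack([VISEME_PARAMS[PHONEME_TO_VISEME[p]] for p in phoneme_per_frame])
    kernel = np.ones(window) / window
    # moving-average smoothing along time so mouth shapes do not jump between frames
    return np.stack([np.convolve(raw[:, d], kernel, mode="same")
                     for d in range(raw.shape[1])], axis=1)

params = mouth_params(["SIL", "M", "AA", "AA", "IY", "SIL"])   # (frames, parameters)
```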
Further, in the above technical solution, the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiving end corresponding to the video specifically includes:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
Further, in the above technical solution, step 1 specifically includes:
The first step: the signal transmitted by the transmitting end is represented in frequency-domain form as S = [S(0), S(1), …, S(K-1)]; after underwater propagation, the signal received by the receiving end is represented in frequency-domain form as the matrix X.
The second step: a binary hypothesis test is used to perform parameter estimation on the frequency-domain form of a specified number of received signals, specifically: under the two hypotheses H_0 and H_1 of the binary hypothesis test, parameter estimation is performed on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, …).
The third step: an EM time-delay estimation algorithm is used to compute the multipath delays of the direct wave and of the small-ball scattered wave, specifically: the EM time-delay estimation algorithm gives the direct-wave multipath delay estimates, abbreviated τ̂_d = [τ̂_d,1, …, τ̂_d,M], and the small-ball scattered-wave multipath delay estimates, abbreviated τ̂_s = [τ̂_s,1, …, τ̂_s,N], where M and N are the numbers of sound rays of the direct wave and of the small-ball scattered wave respectively, and each entry is the estimated delay of one sound ray.
Further, in the above technical solution, step 2 specifically includes:
the first step: according to the method of the extended Kalman filtering, a state equation and an observation equation of the extended Kalman filtering are established, and the method specifically comprises the following steps:
According to the extended Kalman filtering method, the state quantity of the small-ball motion is set as x = [x, v_x, y, v_y]^T and the observation quantity as z (the vector of multipath delays), and the state equation and observation equation of the extended Kalman filter are established:
x_k = F x_{k-1} + w_k
z_k = h(x_k) + v_k
where F is the state transition matrix, determined by the form of motion of the small ball; h(·) is the observation function; w_k is the state noise matrix, obeying w_k ~ N(0, Q); and v_k is the observation noise matrix, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, an extended Kalman filtering method is used for obtaining a state prediction equation and a predicted covariance matrix of the next moment, and the method specifically comprises the following steps:
According to the known information at time k-1, the extended Kalman filtering method gives the state prediction x_{k|k-1} and the predicted covariance matrix P_{k|k-1}:
x_{k|k-1} = F x_{k-1|k-1}
P_{k|k-1} = F P_{k-1|k-1} F^T + Q_{k-1|k-1}
And a third step of: the functional relation between the ball motion state and the multipath time delay is calculated, and the method specifically comprises the following steps:
Because the functional relation h(x_k) is nonlinear, it is linearised with a first-order Taylor expansion according to the processing method of the extended Kalman filter, which requires the explicit expression of h(x_k); the virtual-source mirror-image method is used to express this relation and obtain the functional relationship between the small-ball motion state x_k and the multipath delays.
The propagation of the small-ball scattered wave is divided into two segments, transmitting end–small ball and small ball–receiving end, each described with the virtual-source mirror-image method.
For the transmitting end–small ball (st) segment, the number of sound rays is set to N_st, and the travel r_i^st of each sound ray (i = 1, …, N_st) is determined by the small-ball position as the distance from the corresponding mirror image of the transmitting end to the small ball.
For the small ball–receiving end (tr) segment, the number of sound rays is set to N_tr, and the travel r_j^tr of each sound ray (j = 1, …, N_tr) is likewise determined by the small-ball position.
(x_s, y_s, z_s), (x_t, y_t, z_t) and (x_r, y_r, z_r) denote the coordinates of the transmitting end, the small ball and the receiving end respectively. The numbers of sound rays are chosen so that N_st × N_tr matches N, which keeps the matrix dimensions consistent. In shallow sea the sound-speed gradient is small, so the sound speed can be taken as a constant value c in the delay calculation, and the multipath delay of each scattered path is then
τ_ij = (r_i^st + r_j^tr) / c,
which gives the functional relation z = h(x_k) between the small-ball motion state and the multipath delays.
The Jacobian matrix of the observation function h(x_k), i.e. the observation matrix H_k, is then obtained as H_k = ∂h(x)/∂x evaluated at x_{k|k-1}.
Fourth step: compute the prediction of the observation and the Kalman gain, specifically the predicted observation z_{k|k-1} = h(x_{k|k-1}) and the Kalman gain
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R)^{-1}.
fifth step: updating the observed value, and representing a multipath time delay joint estimated value obtained by combining the small ball motion information by the updated observed value, wherein the method specifically comprises the following steps of:
With the observation z_k at time k, the update step gives the state update value x_{k|k} and the error-covariance update matrix P_{k|k}:
x_{k|k} = x_{k|k-1} + K_k (z_k − h(x_{k|k-1}))
P_{k|k} = P_{k|k-1} − K_k H_k P_{k|k-1}
z_{k|k} = h(x_{k|k})
where the updated observation value z_{k|k} represents the multipath-delay joint estimate obtained by combining the small-ball motion information.
Further, in the above technical solution, step 3 specifically includes:
The first step: the multipath-delay joint estimate z_{k|k} is taken as the parameter estimate in the generalized likelihood-ratio detection.
The second step: the likelihood ratio is calculated by substituting the obtained parameter estimate into the likelihood functions under hypotheses H_0 and H_1, giving the likelihood ratio L_GLRT.
The third step: the likelihood ratio is compared with a detection threshold to decide whether the small-ball coordinates have changed beyond the threshold, specifically: the likelihood-ratio expression (1) is simplified to obtain a test statistic T(X_k), which is compared with the corresponding detection threshold η* to judge whether the small-ball coordinates have changed beyond the threshold, the detection threshold being 0.2–0.5 cm. The matrices appearing in T(X_k) contain only the multipath delays of the direct wave and of the small-ball scattered wave respectively, together with the projection matrix onto the subspace spanned by the corresponding delay components.
In the above technical solution, step S30 specifically includes:
For the Mel spectrum of a given audio, multi-frame mouth-erased face images of the pellet-marked character and the corresponding head posture parameters are acquired according to the method of step S10, and the frequency-domain Mel spectrum is aligned in time with the multi-frame face images;
using the trained three-dimensional image head model, the audio feature extraction module first performs feature extraction on the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module then generates multi-level mouth image features from the multi-level lip image features and the head posture parameters; and finally the multi-level mouth image features are fused into the multi-frame mouth-erased face images of the character, generating synthesized face images with the mouth actions for the specific audio.
Further, in the above technical solution, the specific step of generating the head posture feature by the head posture module includes:
the plastic pellet is taken as a center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the change of the black paper sheet.
Further, in the above technical solution, the head posture key points at least include points where the left eye corner, the right eye corner, the left mouth corner, the right mouth corner, the top center, the top of the head, the top of the left ear, the bottom of the left ear, the top of the right ear, and the bottom of the right ear of the human face are located.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A three-dimensional image pronunciation process head action simulation method is characterized by comprising the following steps:
s10: acquiring face videos and corresponding audios from a video signal library, aligning video frames with audio frames, and extracting face images, head postures and Mel frequency spectrums of multiple frames as training samples; preprocessing a face image to generate a face image after a mouth is erased;
s20: the three-dimensional image head model is established and trained by using a training sample, and comprises an audio feature extraction module, a lip synchronization module, a mouth generating module, a head posture control module and a fusion module, wherein:
the audio feature extraction module is used for carrying out feature extraction on the Mel frequency spectrum obtained in the step S10 to generate final audio features;
the lip synchronization module is used for generating multi-stage lip image features according to the final audio features, generating a lip image according to the final-stage lip image features, and calculating lip loss between the generated lip image and the lip image in the face image sample, wherein the lip loss comprises mean square error loss and contrast loss;
the mouth generating module is used for generating multi-stage mouth image features according to the multi-stage lip image features, generating mouth images according to the last-stage mouth image features, and calculating mouth loss between the generated mouth images and mouth images in the face image sample, wherein the mouth loss uses mean square error loss;
the head posture control module is used for generating head image characteristics according to the central point;
the fusion module is used for fusing the head image features and the multi-stage mouth image features into the face image after the mouth is erased in the step S10, and calculating fusion loss, wherein the fusion loss uses the fusion loss corresponding to the PCONV network; updating parameters of the three-dimensional image head model according to the sum of the weighted losses of lip loss, mouth loss and fusion loss;
S30: generating, by using the trained three-dimensional image head model, the head action for the specific audio.
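As a rough, non-authoritative illustration of the loss combination in S20 (the exact loss definitions, embeddings and weights are not specified numerically in the claim, so every function, margin and weight below is a placeholder assumption): the lip loss pairs a mean-squared error with a contrastive term, the mouth loss is a mean-squared error, the fusion loss follows a partial-convolution-style reconstruction split, and the model parameters are updated from their weighted sum.

import torch
import torch.nn.functional as F

def contrastive_term(lip_emb, audio_emb_pos, audio_emb_neg, margin=0.5):
    """Pull the embedding of the synced audio close, push a mismatched one away."""
    d_pos = F.pairwise_distance(lip_emb, audio_emb_pos)
    d_neg = F.pairwise_distance(lip_emb, audio_emb_neg)
    return (d_pos.pow(2) + torch.clamp(margin - d_neg, min=0).pow(2)).mean()

def lip_loss(pred_lip, true_lip, lip_emb, audio_pos, audio_neg):
    """Mean-squared error plus a contrastive term, as listed for the lip synchronization module."""
    return F.mse_loss(pred_lip, true_lip) + contrastive_term(lip_emb, audio_pos, audio_neg)

def mouth_loss(pred_mouth, true_mouth):
    """Mean-squared error, as listed for the mouth generating module."""
    return F.mse_loss(pred_mouth, true_mouth)

def fusion_loss(pred_face, true_face, mouth_mask):
    """Rough PConv-style split between the erased (hole) region and the kept region."""
    hole = F.l1_loss(pred_face * mouth_mask, true_face * mouth_mask)
    valid = F.l1_loss(pred_face * (1 - mouth_mask), true_face * (1 - mouth_mask))
    return 6.0 * hole + valid           # 6:1 hole/valid weighting, borrowed from PConv practice

def total_loss(l_lip, l_mouth, l_fusion, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses used to update the model parameters in S20."""
    return w[0] * l_lip + w[1] * l_mouth + w[2] * l_fusion

# Example with dummy tensors (batch size 2, small crops, 128-dim embeddings):
B = 2
loss = total_loss(
    lip_loss(torch.rand(B, 3, 32, 32), torch.rand(B, 3, 32, 32),
             torch.rand(B, 128), torch.rand(B, 128), torch.rand(B, 128)),
    mouth_loss(torch.rand(B, 3, 48, 48), torch.rand(B, 3, 48, 48)),
    fusion_loss(torch.rand(B, 3, 96, 96), torch.rand(B, 3, 96, 96),
                (torch.rand(B, 1, 96, 96) > 0.5).float()))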
2. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 1, wherein the method for creating the video signal library is as follows:
step one: a plastic pellet with a reflective outer wall is attached to the tip of the experimenter's nose, and small black paper sheets are attached to the head posture key points of the experimenter;
step two: a camera is arranged directly facing the experimenter, and a signal transmitting end and a signal receiving end are arranged on the two sides of the experimenter's face, wherein the signal transmitting end, the plastic pellet and the signal receiving end lie on a straight line, and the distance between the signal transmitting end and the signal receiving end is 1 m;
step three: a three-dimensional coordinate system is established with the center point of the camera as the origin, the signal transmitting end is started to transmit signals, the camera is started, and the experimenter reads aloud;
step four: after the experimenter finishes reading, the face audio and video recorded by the camera and the signal data received by the corresponding receiving end are stored in a video signal library.
3. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 2, wherein S10 specifically comprises:
acquiring videos in a video signal library, wherein each frame in the videos comprises a complete face image and audio of a person speaking;
judging whether the head posture of the experimenter is changed or not according to the signal data received by the corresponding receiving end of the video;
if the head posture of the experimenter has not changed, extracting a face image set from all frames in the video, and cropping the lip region of the face images as the sample lip images;
if the head posture of the experimenter has changed, extracting the plastic pellet images from all frames in the video, establishing the three-dimensional coordinates of the plastic pellet in the three-dimensional coordinate system, and using the corresponding lip shape from a phoneme mouth-shape driving method as the sample lip image;
constructing a mouth erasing network, randomly taking out part of the face images from the face image set, labeling their mouth positions, and training the mouth erasing network; recognizing and erasing the mouth regions of the face images whose mouth positions are unlabeled by using the trained mouth erasing network, and retaining the resulting face images;
and converting the time-domain audio into a frequency-domain Mel spectrum, wherein the sampling rate of the Mel spectrum is consistent with the frame rate of the video.
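A minimal sketch of this time-domain-to-Mel conversion with the Mel frame rate matched to the video frame rate, assuming librosa is available; the sampling rate, FPS and Mel parameters below are illustrative values rather than values taken from this disclosure.

import librosa
import numpy as np

def mel_for_video(wav_path, fps=25, n_mels=80):
    y, sr = librosa.load(wav_path, sr=16000)          # time-domain audio
    hop = sr // fps                                   # one Mel frame per video frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)       # (n_mels, ~number of video frames)

# Each column of the returned spectrum then lines up with one face-image frame.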
4. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 3, wherein the step of determining whether the head posture of the experimenter is changed according to the signal data received by the receiver corresponding to the video specifically comprises:
step 1: data processing is carried out on signals received by a receiving end;
step 2: the detection of the small ball is realized by using an extended Kalman filtering method;
step 3: calculating to obtain a likelihood ratio by using the obtained multipath time delay joint estimation value, and comparing the obtained likelihood ratio with a detection threshold value to obtain a detection result of whether the position of the ball is changed or not;
step 4: if the position of the small ball is changed, judging that the head posture of the experimenter is changed; if the position of the ball is not changed, judging that the head posture of the experimenter is not changed.
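Taken together, steps 1 to 4 amount to the following per-frame decision flow. The three callbacks are placeholders standing in for the delay estimation, extended-Kalman-filter tracking and likelihood-ratio detection detailed in claims 5 to 7; the toy usage at the end only demonstrates the control flow.

from typing import Callable, Iterable, List

def head_pose_changes(received: Iterable,
                      estimate_delays: Callable,
                      ekf_step: Callable,
                      glrt: Callable,
                      threshold: float) -> List[bool]:
    flags, track = [], None
    for X_k in received:
        delays = estimate_delays(X_k)                  # step 1: process the received signal
        track, z_joint = ekf_step(track, delays)       # step 2: track the ball, joint delay estimate
        flags.append(glrt(X_k, z_joint) > threshold)   # steps 3-4: ratio vs. threshold -> pose changed
    return flags

# Toy usage with stand-in callbacks (the real ones come from claims 5-7):
frames = [0.1, 0.9, 0.2]
print(head_pose_changes(frames,
                        estimate_delays=lambda x: [x],
                        ekf_step=lambda t, d: (t, d[0]),
                        glrt=lambda x, z: z,
                        threshold=0.5))                # -> [False, True, False]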
5. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 1 specifically comprises:
The first step: the transmitted signal of the transmitting end is expressed in frequency-domain form as S = [S(0), S(1), …, S(K−1)]; after underwater propagation, the signal received by the receiving end is expressed in frequency-domain form as the matrix X;
The second step: a binary hypothesis testing method is adopted to perform parameter estimation on the frequency-domain form of the received signal a specified number of times, which specifically comprises the following steps:
based on the two hypotheses H_0 and H_1 of the binary hypothesis test, parameter estimation is performed on the frequency-domain form X_k of the k-th received signal (k = 1, 2, 3, …);
The third step: the EM time-delay estimation algorithm is adopted to calculate the direct-wave multipath delays and the small-ball scattered-wave multipath delays, which specifically comprises the following steps:
the EM time-delay estimation algorithm is used to obtain the direct-wave multipath delay vector τ^d = [τ^d_1, τ^d_2, …, τ^d_M] and the small-ball scattered-wave multipath delay vector τ^s = [τ^s_1, τ^s_2, …, τ^s_N], where M and N are the numbers of sound rays of the direct wave and of the small-ball scattered wave respectively, each component is the delay estimate of one sound ray, and the two vectors are abbreviated as τ^d and τ^s.
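The EM-based joint delay estimator itself is not reproduced here. As a greatly simplified stand-in, the sketch below picks multipath delays as the strongest peaks of the cross-correlation between the transmitted and received signals; this is not the EM algorithm of the claim, only an illustration of what estimating several path delays means in code, and all parameters are hypothetical.

import numpy as np

def correlate_delays(tx, rx, fs, n_paths):
    """Return the n_paths strongest cross-correlation lags (in seconds) between tx and rx."""
    corr = np.correlate(rx, tx, mode="full")
    lags = np.arange(-len(tx) + 1, len(rx))
    order = np.argsort(np.abs(corr))[::-1]             # strongest peaks first
    return np.array(sorted(lags[order[:n_paths]] / fs))

# Toy usage: a pulse received over two paths delayed by 10 and 25 samples.
fs = 1000.0
tx = np.zeros(100); tx[0] = 1.0
rx = np.zeros(200); rx[10] = 1.0; rx[25] = 0.6
print(correlate_delays(tx, rx, fs, n_paths=2))          # ~[0.010, 0.025] seconds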
6. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 2 specifically comprises:
The first step: according to the extended Kalman filtering method, the state equation and the observation equation of the extended Kalman filter are established, which specifically comprises the following steps:
according to the extended Kalman filtering method, the state vector of the small-ball motion is set as x = [x, v_x, y, v_y]^T and the observation vector is set as the multipath delay vector z; the state equation and the observation equation of the extended Kalman filter are established as
x_k = F x_{k-1} + w_k
z_k = h(x_k) + v_k
wherein F is the state transition matrix, determined by the form of motion of the small ball, h(·) is the observation function, w_k is the state noise, obeying w_k ~ N(0, Q), and v_k is the observation noise, obeying v_k ~ N(0, R).
And a second step of: according to the known information of the appointed moment, an extended Kalman filtering method is used for obtaining a state prediction equation and a predicted covariance matrix of the next moment, and the method specifically comprises the following steps:
according to the known information of the k-1 moment, an extended Kalman filtering method is used for obtaining a state prediction equation of the k moment
Figure FDA0004016585280000041
And predicted covariance matrix P k|k-1
Figure FDA0004016585280000042
P k|k-1 =FP k-1|k-1 F T +Q k-1|k-1
And a third step of: the functional relation between the ball motion state and the multipath time delay is calculated, and the method specifically comprises the following steps:
due to the functional relationship h (x k ) Is nonlinear, and according to the processing method of the extended Kalman filter, a first-order Taylor formula is used for carrying out linearization approximation on the extended Kalman filter, and a functional relation h (x is needed to be obtained k ) The expression of the function relation is expressed by using a virtual source mirror image method to obtain the movement state x of the small ball k With multipath delay
Figure FDA0004016585280000043
A functional relationship between them.
Figure FDA0004016585280000044
The functional relationship h (x k ) The linearization approximation is carried out, and the propagation process of the small ball scattered wave is divided into two sections of a transmitting end, namely a small ball and a small ball, namely a receiving end, and the two sections are respectively described by using a virtual source mirroring method.
For the transmitting end-small ball (st) section, the number of sound rays is set as N st The relationship between the sound ray travel and the ball position is:
Figure FDA0004016585280000045
Figure FDA0004016585280000046
Figure FDA0004016585280000047
L
for the small ball-receiving end (tr) section, the number of sound rays is set to be N tr The relationship between the sound ray travel and the ball position is:
Figure FDA0004016585280000051
Figure FDA0004016585280000052
Figure FDA0004016585280000053
L
(x s ,y s ,z s )、(x t ,y t ,z t ) And (x) r ,y r ,z r ) Representing the coordinates of the transmitting end, the pellet, and the receiving end, respectively. For the selection of the number of these two sound rays, N is followed st ×N tr The principle of N, ensures that the dimensions of the matrix are consistent. In shallow sea, the gradient change of sound velocity is not large, the sound velocity can be set as a constant value c in the time delay calculation, and then the multipath time delay can be expressed as:
Figure FDA0004016585280000054
therefore, the relation between the ball motion state and the multipath time delay is as follows:
Figure FDA0004016585280000055
find the observation function h (x k ) Jacobian matrix of (a), i.e. observation matrix H k
Figure FDA0004016585280000056
The fourth step: the prediction of the observation and the Kalman gain are calculated; specifically, the predicted observation z_{k|k-1} and the Kalman gain K_k are calculated as
z_{k|k-1} = h(x_{k|k-1})
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R)^{-1}
The fifth step: the observation is updated, and the updated observation represents the multipath-delay joint estimate obtained by incorporating the small-ball motion information, which specifically comprises the following steps:
after the observation z_k at time k is obtained, the update process yields the state update value x_{k|k} and the error covariance update matrix P_{k|k}:
x_{k|k} = x_{k|k-1} + K_k (z_k − z_{k|k-1})
P_{k|k} = P_{k|k-1} − K_k H_k P_{k|k-1}
z_{k|k} = h(x_{k|k})
wherein the updated observation z_{k|k} represents the multipath-delay joint estimate obtained by incorporating the small-ball motion information.
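A compact extended Kalman filter following the predict/update cycle of the second to fifth steps is sketched below. The constant-velocity state transition, the noise covariances and the toy two-path observation function are illustrative assumptions; in the claim the observation function is the virtual-source mirror-image delay model of the third step.

import numpy as np

def ekf_step(x, P, z, F, Q, R, h, H_jac):
    # Prediction: x_{k|k-1} = F x_{k-1|k-1},  P_{k|k-1} = F P F^T + Q
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: Kalman gain, state and covariance update, updated observation
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new, h(x_new)               # h(x_new) plays the role of z_{k|k}

# Toy setup: state [x, vx, y, vy], constant-velocity motion, observation = two ranges / c.
dt, c = 0.04, 1500.0
F = np.array([[1, dt, 0, 0], [0, 1, 0, 0], [0, 0, 1, dt], [0, 0, 0, 1]], dtype=float)
Q = 1e-4 * np.eye(4)
R = 1e-6 * np.eye(2)

def h(x):                                       # two toy "path delays" seen from two sensors
    r1 = np.hypot(x[0], x[2])
    r2 = np.hypot(x[0] - 1.0, x[2])
    return np.array([r1, r2]) / c

def H_jac(x, eps=1e-6):                         # numerical Jacobian of h at x
    H = np.zeros((2, 4))
    for i in range(4):
        d = np.zeros(4); d[i] = eps
        H[:, i] = (h(x + d) - h(x - d)) / (2 * eps)
    return H

x, P = np.array([0.5, 0.0, 0.3, 0.0]), np.eye(4)
z = h(np.array([0.52, 0.0, 0.31, 0.0]))         # simulated measurement from a slightly moved ball
x, P, z_joint = ekf_step(x, P, z, F, Q, R, h, H_jac)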
7. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 4, wherein the step 3 specifically comprises:
The first step: the multipath-delay joint estimate is taken as the parameter estimate in the generalized likelihood ratio detection, the multipath-delay joint estimate being denoted z_{k|k};
And a second step of: the likelihood ratio is calculated by using likelihood function with the obtained parameter estimation value, specifically: using hypothesis H 0 And H 1 The following likelihood function:
Figure FDA0004016585280000064
calculating to obtain likelihood ratio L GLRT Wherein
Figure FDA0004016585280000065
And a third step of: comparing the likelihood ratio with a detection threshold value to obtain a detection result of whether the coordinates of the ball are changed beyond the threshold value, wherein the detection result specifically comprises:
the formula (1) is simplified to obtain a test statistic T (X) k ) And compares it with a corresponding detection threshold eta * And comparing to judge whether the coordinates of the ball are changed beyond the threshold value.
Figure FDA0004016585280000071
/>
Wherein the matrix
Figure FDA0004016585280000072
Only the multipath delays of the straight-through wave and the small-sphere scattered wave, respectively, and:
Figure FDA0004016585280000073
Figure FDA0004016585280000074
is->
Figure FDA0004016585280000075
Spatially projected matrix,/->
Figure FDA0004016585280000076
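Because the explicit likelihood functions and projection matrices are available only as equation images, the sketch below illustrates the final comparison step in generic form: a GLRT-style energy-ratio statistic built from a projection onto an assumed signal subspace, compared against a threshold η*. Both the statistic and the threshold value are illustrative stand-ins, not formula (1) of the claim.

import numpy as np

def subspace_projector(A):
    """Orthogonal projector onto the column space of A."""
    return A @ np.linalg.pinv(A)

def glrt_decision(x, A, eta):
    """Return (statistic, changed?) for received vector x and an assumed delay-steering matrix A."""
    P = subspace_projector(A)
    num = np.real(x.conj().T @ P @ x)                    # energy explained by the assumed delays
    den = np.real(x.conj().T @ (np.eye(len(x)) - P) @ x) + 1e-12
    T = num / den
    return T, T > eta

# Toy usage with a random steering matrix standing in for the delay structure.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 3))
x = A @ rng.standard_normal(3) + 0.1 * rng.standard_normal(16)
print(glrt_decision(x, A, eta=10.0))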
8. The method for simulating the head motion of a three-dimensional avatar pronunciation process according to claim 1, wherein the step S30 specifically comprises:
for the Mel spectrum of a given audio, acquiring the multi-frame mouth-erased face images of the small-ball character and the corresponding head posture parameters according to the method of step S10, and aligning the frequency-domain Mel spectrum with the multi-frame face images in time;
by using the trained face forgery generation model, the audio feature extraction module first performs feature extraction on the Mel spectrum of the given audio to generate the final audio features; the lip synchronization module then generates multi-level lip image features from the final audio features; the mouth generation module then generates multi-level mouth image features from the multi-level lip image features and the head posture parameters; and finally the multi-level mouth image features are fused into the multi-frame mouth-erased face images of the small-ball character, generating a forged face image of the mouth action under the specific audio.
9. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 3, wherein the specific step of generating the head posture feature by the head posture control module comprises:
the plastic pellet is taken as the center point, and the head posture change of the experimenter is determined according to the coordinate change of the plastic pellet and the changes of the black paper sheets.
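As a simple, assumption-laden illustration of this step (not the procedure of the disclosure): with the reflective pellet as the centre point, a head-posture change can be flagged when the pellet's three-dimensional coordinate moves by more than a tolerance, and a rotation can be estimated from the displacement of the black paper markers around that centre with an orthogonal-Procrustes (Kabsch-style) fit. All thresholds and point sets below are hypothetical.

import numpy as np

def pose_change(pellet_prev, pellet_now, markers_prev, markers_now, tol=5e-3):
    moved = np.linalg.norm(pellet_now - pellet_prev) > tol          # centre-point displacement
    P = markers_prev - pellet_prev                                  # markers relative to the centre
    Q = markers_now - pellet_now
    U, _, Vt = np.linalg.svd(Q.T @ P)                               # Kabsch / orthogonal Procrustes
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt                             # estimated head rotation
    angle = np.degrees(np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1)))
    return moved or angle > 1.0, R

# Toy usage: three markers, head turned 5 degrees about the vertical axis.
prev = np.array([[0.03, 0.02, 0.0], [-0.03, 0.02, 0.0], [0.0, 0.06, 0.0]])
theta = np.radians(5.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0], [np.sin(theta), np.cos(theta), 0], [0, 0, 1]])
now = prev @ Rz.T
print(pose_change(np.zeros(3), np.zeros(3), prev, now))             # -> (True, rotation matrix)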
10. The method for simulating the head motion in the three-dimensional avatar pronunciation process according to claim 2, wherein the head posture key points at least comprise the points where the left eye corner, the right eye corner, the left mouth corner, the right mouth corner, the top center of the head, the top of the left ear, the bottom of the left ear, the top of the right ear and the bottom of the right ear of the human face are located.
CN202211671532.9A 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process Pending CN116246649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211671532.9A CN116246649A (en) 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211671532.9A CN116246649A (en) 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process

Publications (1)

Publication Number Publication Date
CN116246649A true CN116246649A (en) 2023-06-09

Family

ID=86630472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211671532.9A Pending CN116246649A (en) 2022-12-26 2022-12-26 Head action simulation method in three-dimensional image pronunciation process

Country Status (1)

Country Link
CN (1) CN116246649A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863046A (en) * 2023-07-07 2023-10-10 广东明星创意动画有限公司 Virtual mouth shape generation method, device, equipment and storage medium
CN116863046B (en) * 2023-07-07 2024-03-19 广东明星创意动画有限公司 Virtual mouth shape generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10891472B2 (en) Automatic body movement recognition and association system
CN110175596B (en) Virtual learning environment micro-expression recognition and interaction method based on double-current convolutional neural network
Beal et al. A graphical model for audiovisual object tracking
Hong et al. Real-time speech-driven face animation with expressions using neural networks
CN113378806B (en) Audio-driven face animation generation method and system integrating emotion coding
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN112308949A (en) Model training method, human face image generation device and storage medium
CN110458046B (en) Human motion trajectory analysis method based on joint point extraction
CN112597814A (en) Improved Openpos classroom multi-person abnormal behavior and mask wearing detection method
CN113838174B (en) Audio-driven face animation generation method, device, equipment and medium
CN116246649A (en) Head action simulation method in three-dimensional image pronunciation process
CN1952850A (en) Three-dimensional face cartoon method driven by voice based on dynamic elementary access
Jarabese et al. Sign to speech convolutional neural network-based filipino sign language hand gesture recognition system
CN113283372A (en) Method and apparatus for processing image of person
Tur et al. Isolated sign recognition with a siamese neural network of RGB and depth streams
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
RU2737231C1 (en) Method of multimodal contactless control of mobile information robot
CN114882590A (en) Lip reading method based on multi-granularity space-time feature perception of event camera
Mishra et al. Environment descriptor for the visually impaired
CN114466178A (en) Method and device for measuring synchronism of voice and image
Shreekumar et al. Improved viseme recognition using generative adversarial networks
Hsieh et al. Consonant Classification in Mandarin Based on the Depth Image Feature: A Pilot Study.
CN117153195B (en) Method and system for generating speaker face video based on adaptive region shielding
Sams et al. SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation
CN113838218B (en) Speech driving virtual human gesture synthesis method for sensing environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination