CN114495908A - Method and system for driving mouth shape by voice based on time sequence convolution - Google Patents

Method and system for driving mouth shape by voice based on time sequence convolution

Info

Publication number
CN114495908A
CN114495908A (application CN202210116972.1A)
Authority
CN
China
Prior art keywords
mouth
fourier transform
unit
time sequence
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210116972.1A
Other languages
Chinese (zh)
Inventor
王松坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd filed Critical Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202210116972.1A
Publication of CN114495908A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method and a system for driving a mouth shape by voice based on time sequence convolution, comprising the following steps: representing the movement of the mouth with blendshapes, outputting the weights of a plurality of blendshapes through a neural network, and combining the blendshape values to obtain a reasonable representation of the mouth movement; the mouth-movement representation and the sound signal are both discretized, the discretized sound signal is a time-domain signal, and the time-domain signal is converted into the frequency domain through a Fourier transform to complete the feature conversion. The invention introduces time sequence convolution and uses a time sequence convolutional network to process the voice spectrum features, better addressing the problems of dependence on time sequence information and of a single, averaged generation mode.

Description

Method and system for driving mouth shape by voice based on time sequence convolution
Technical Field
The invention belongs to the technical field of animation production, and particularly relates to a method and a system for driving a mouth shape by voice based on time sequence convolution.
Background
Speech-driven mouth shapes are typically implemented with either linguistics-based models or neural-network-based models.
The linguistics-based approach divides the audio into phonemes according to its characteristics and sculpts a corresponding mouth shape for each phoneme; the resulting mouth shape is a weighted average of these phoneme shapes. The neural-network-based approach does not need to extract specific phoneme classes from the data: thanks to the strong function-fitting capability of neural networks, audio data can be mapped directly to a mouth shape, and the network output can take any form depending on the task setting. For a neural-network scheme, the most important choices are a reasonable data representation and a network structure. In the currently common schemes, the facial mouth shape is represented with a mesh, and the network structure is a convolutional neural network or a recurrent neural network. Convolutional neural networks have proven their powerful feature-extraction ability in computer vision, and speech can be processed by a convolutional network once it is converted into a spectrum; however, audio signals are continuous in time, and a plain convolutional network loses information in the time dimension. A recurrent neural network can make good use of past temporal features, but as a generative network it easily suffers from a single generation mode, with outputs that tend toward the average.
Therefore, how to provide a method and a system for driving a mouth shape by voice based on time sequence convolution has become an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a method and a system for driving a mouth shape by voice based on time sequence convolution, which introduce time sequence convolution and use a time sequence convolutional network to process the voice spectrum features, better addressing the problems of dependence on time sequence information and of a single generation mode.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method of speech-driven mouth-shaping based on time-series convolution, comprising: representing the mouth action by adopting a blendshape, outputting weights of a plurality of blendshapes through a neural network, and combining the values of the blendshapes to obtain reasonable representation of the mouth action; the reasonable representation of mouth movements needs discretization, the discretized sound signals are time domain signals, and the time domain signals are converted into a frequency domain through Fourier transformation to complete feature conversion.
Further, the feature conversion method is: pre-emphasize and window the voice signal, perform a discrete Fourier transform, pass the result through a Mel filter bank and take the logarithm, and then perform a discrete cosine transform to obtain the MFCC features.
Further, the neural network adopts a time sequence convolution network, the loss adopts mean square error loss MSE, and the calculation formula is as follows:
MSE = (1/T) · Σ_{i=1}^{T} (y_i - ŷ_i)²
where T is the period (number of frames), y_i is the true value and ŷ_i is the predicted value; the network is constrained by measuring the Euclidean distance between the predicted value and the true value.
A system for driving a mouth shape by voice based on time sequence convolution comprises a data acquisition module and an audio feature processing module. The data acquisition module represents the mouth movement with blendshapes, outputs the weights of a plurality of blendshapes through a neural network, and combines the blendshape values to obtain a reasonable representation of the mouth movement. The audio feature processing module discretizes the mouth-movement representation and the sound signal; the discretized sound signal is a time-domain signal, which is converted into the frequency domain through a Fourier transform to complete the feature conversion.
Further, the audio feature processing module comprises a pre-emphasis unit, a windowing unit, a discrete Fourier transform unit, a Mel filter bank, a logarithm calculation unit and a discrete cosine transform unit: the pre-emphasis unit emphasizes the energy of the high-frequency part of the speech signal; the windowing unit weights the data within a sliding window; the discrete Fourier transform unit performs a discrete Fourier transform on the weighted data; the Mel filter bank converts the spectrum after the discrete Fourier transform to the Mel scale; the logarithm calculation unit handles the conversion between the Mel scale and Hertz; and the discrete cosine transform unit performs an inverse discrete Fourier transform to obtain the MFCC features.
The invention has the beneficial effects that:
the invention introduces time sequence convolution, uses the time sequence convolution network for processing the voice frequency spectrum characteristics, and better solves the problems of time sequence information dependence and single generation mode; in the network part, compared with a cyclic neural network and a traditional convolutional network, the method gives consideration to information dependence on a time sequence and also reflects the accuracy of data generation. The use of the blendshape is simpler than the mesh, and the complex mouth action representation can be represented by using less data.
Drawings
In order to illustrate the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only the present embodiments of the invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of a method of feature transformation according to the present invention.
Fig. 2 is a schematic diagram of the feature transformation after pre-emphasis according to the present invention.
FIG. 3 is a graphical representation of several window functions of the present invention.
FIG. 4 is a schematic diagram of the nonlinear relationship between the Mel scale and the Hertz scale according to the present invention.
FIG. 5 is a diagram illustrating the conversion performed by the Mel filter bank according to the present invention.
FIG. 6 is a partial structural diagram of the TCN of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Example 1
Referring to fig. 1, the present invention provides a method for driving a mouth shape by voice based on time sequence convolution, comprising: representing the mouth movement with 27 blendshapes, outputting the weights of these 27 blendshapes through a neural network, and combining the blendshape values to obtain a reasonable representation of the mouth movement; the sound and the corresponding face blendshape data can be obtained by recording with a mobile phone or similar device. The mouth-movement representation and the sound signal are both discretized, the discretized sound signal is a time-domain signal, and the time-domain signal is converted into the frequency domain through a Fourier transform to complete the feature conversion.
For blendshape acquisition, the facial expression can be captured with an iPhone (iPhone X and above) and represented with 51 blendshapes (reference link: https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation); the present invention uses only the blendshapes of the mouth, 27 in total (see the Mouth and Jaw sections of the above link).
The neural network used here is based on the TCN, with the inputs and outputs modified to fit the data structure of the present invention; the structure is shown in fig. 6, and a more detailed reference is the paper "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". The present invention uses this network structure to model the processed speech features and the mouth-movement blendshapes, so that the trained TCN can complete the mapping between speech and mouth movement.
Each blendshape takes a value in the range 0-100, and the TCN outputs the specific values of the 27 corresponding blendshapes; together these values represent a particular mouth movement.
The discretization mentioned above is the change from the continuous representation of the real world to the discontinuous representation of the digital world. Specifically, in the present invention the mouth movement is discretized into a mouth-blendshape representation at 30 frames per second.
The discretized mouth-movement representation and the discretized sound signal are in one-to-one correspondence, and the mapping between them is implemented by the TCN network: the input of the TCN is the discretized sound signal, and the output is the discretized mouth-movement representation.
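The combination of blendshape weights into one mouth pose is a simple linear blend. The following is a minimal sketch under the assumption that each blendshape is stored as a per-vertex offset from the neutral mouth; the names neutral_mouth and blend_deltas are illustrative and not taken from the patent.
    import numpy as np

    def combine_blendshapes(neutral_mouth, blend_deltas, weights):
        # neutral_mouth: (V, 3) neutral mouth vertex positions
        # blend_deltas:  (27, V, 3) per-blendshape vertex offsets from the neutral pose
        # weights:       (27,) values in [0, 100] as output by the TCN, one per blendshape
        w = np.clip(np.asarray(weights, dtype=np.float64), 0.0, 100.0) / 100.0
        return neutral_mouth + np.tensordot(w, blend_deltas, axes=1)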
Sound is a wave; to store and represent it, the invention needs to discretize it, and this discretization inevitably causes some loss of information. Several key parameters are involved:
(1) Sampling rate: the number of sampling points per second, commonly 16 kHz, 44.1 kHz and so on. The higher the sampling rate, the higher the sound-wave frequencies that can be described and the more faithful and natural the reconstructed sound.
(2) Number of channels: commonly left and right (stereo) channels, 5.1 channels and so on; a mono channel is used here when processing the audio data.
(3) Bit depth: the resolution with which the loudness (amplitude) of the sound is sampled; it affects the signal-to-noise ratio and dynamic range, and 16-bit and 32-bit depths are common.
(4) Bit rate: the number of bits processed per second; for example, with a 16 kHz sampling rate and a 16-bit depth the bit rate is 16000 × 16 = 256 kbit/s (a short check of this arithmetic follows below).
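The bit-rate relation in item (4) can be checked with a trivial calculation (a mono channel is assumed, as stated above):
    sampling_rate = 16000   # samples per second
    bit_depth = 16          # bits per sample
    channels = 1            # mono audio is used when processing the data
    bit_rate = sampling_rate * bit_depth * channels
    print(bit_rate)         # 256000 bit/s, i.e. 256 kbit/s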
Discretization alone is not a sufficient representation; the discretized sound signal is a time-domain signal, and to extract more information it is converted into the frequency domain through a Fourier transform.
After conversion into the frequency domain, further feature conversions are performed. The benefits of these conversions are: the speech information is exposed more readily, which lowers the difficulty of algorithm optimization; the robustness of the signal with respect to speakers, noise, channels and the like is enhanced; and the dimensionality is reduced, for example a 25 ms frame at a 16 kHz sampling rate contains 400 sample values, but only 40 feature dimensions remain after conversion.
"Fourier transforming the time-domain signal into the frequency domain" is only one step of the feature conversion; all of the steps are shown in fig. 1.
In this embodiment, the feature conversion method is: pre-emphasize and window the voice signal, perform a discrete Fourier transform, pass the result through a Mel filter bank and take the logarithm, and then perform a discrete cosine transform to obtain the MFCC features.
Pre-emphasis emphasizes the energy of the high-frequency part of the speech signal, because the low-frequency part of the original speech signal carries more energy; this imbalance is called spectral tilt.
Calculation method
x′[t]=x[t]-αx[t-1]
Where t is the time step, α is the weighting factor, x is the discrete speech signal, x [ t ] is the signal value at time t, x [ t-1] is the signal value at time t-1, and x' [ t ] is the signal value at time t after pre-emphasis.
The diagram of the feature transformation after pre-emphasis is shown in fig. 2.
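A minimal NumPy sketch of this pre-emphasis step follows; the coefficient alpha = 0.97 is a common choice and an assumption here, not a value specified by the patent.
    import numpy as np

    def pre_emphasis(x, alpha=0.97):
        # x'[t] = x[t] - alpha * x[t-1]; the first sample is kept unchanged
        x = np.asarray(x, dtype=np.float64)
        return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))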
Windowing refers to weighting the data within a sliding window; it is needed because an implicit rectangular window would cause spectral leakage in the FFT calculation.
Calculation method
x′[n]=w[n]x[n]
Several window functions are shown schematically in FIG. 3. In the formula, w[n] is the window coefficient at index n, N is the window size, x[n] is the signal value at index n within the window, and x′[n] is the windowed signal value at index n.
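As a sketch, framing and Hamming windowing could look as follows; the 25 ms frame length and 10 ms hop at 16 kHz (400 and 160 samples) are illustrative assumptions rather than values fixed by the patent.
    import numpy as np

    def frame_and_window(x, frame_len=400, hop=160):
        # slice the signal into overlapping frames and weight each frame with a Hamming window
        n_frames = 1 + (len(x) - frame_len) // hop
        window = np.hamming(frame_len)                                    # w[n]
        frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
        return frames * window                                            # x'[n] = w[n] * x[n]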
Discrete Fourier transform: after the above preprocessing, the discrete Fourier transform is applied. For an N-point sequence {x[n]}, 0 ≤ n < N, the transform is
X[k] = Σ_{n=0}^{N-1} x[n] · exp(-j·2πkn/N), 0 ≤ k < N
where X is the Fourier-transformed sequence, k is the frequency index, N is the window size, x[n] is the signal value at index n within the window, exp denotes the exponential function, j is the imaginary unit, and π is the circle constant.
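In practice the transform of each windowed frame is computed with a fast Fourier transform; a short sketch of the resulting power spectrum using NumPy's real FFT (the 512-point FFT size is an assumption):
    import numpy as np

    def power_spectrum(frames, n_fft=512):
        # X[k] for 0 <= k <= N/2 of every frame, returned as the magnitude-squared (power) spectrum
        spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)
        return (np.abs(spectrum) ** 2) / n_fft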
Mel filter bank: the sensitivity of the human ear to sounds of different frequencies is not linear, and the mel scale is a depiction of this sensitivity.
The conversion relations between the Mel scale and Hertz are
m = 2595 · log10(1 + f/700)
f = 700 · (10^(m/2595) - 1)
The non-linear relationship between the two scales is shown in fig. 4.
The mel filter bank converts the frequency spectrum to the mel scale, which is schematically shown in fig. 5.
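The two conversion formulas and a triangular Mel filter bank can be sketched as follows; 40 filters, a 512-point FFT and a 16 kHz sampling rate are illustrative assumptions.
    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_bank(n_filters=40, n_fft=512, sample_rate=16000):
        # triangular filters whose centres are equally spaced on the Mel scale
        mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        return fbank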
MFCC features: the output of the previous step is called the FilterBank feature; removing the harmonic fine structure from the FilterBank feature yields the MFCC feature. The specific steps are as follows:
The envelope and the harmonics can be separated by performing an inverse discrete Fourier transform (equivalent here to a discrete cosine transform) on the FilterBank feature.
The low-order coefficients of the result (usually the first 13) are retained; these are the MFCC features, also called Mel-frequency cepstral coefficients.
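A sketch of these last two steps, assuming SciPy is available for the discrete cosine transform and that power_frames and fbank come from the earlier sketches:
    import numpy as np
    from scipy.fftpack import dct

    def mfcc_from_power(power_frames, fbank, n_mfcc=13):
        # log Mel filter-bank energies followed by a DCT; keep the first 13 coefficients
        filterbank_feats = np.log(power_frames @ fbank.T + 1e-10)
        return dct(filterbank_feats, type=2, axis=-1, norm='ortho')[:, :n_mfcc]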
The audio data processing is now complete and the data can be taken directly to the neural network for processing.
The network structure of the invention adopts a time sequence convolutional network (TCN). Unlike a traditional convolutional neural network, it can extract features over the whole input sequence: as shown in fig. 6, the input layer samples every point, the second layer samples every other point, and the receptive field of the last layer can cover the entire input. Compared with a recurrent neural network, it can complete feature extraction and training over the whole sequence in one pass, which greatly improves training and inference efficiency; the TCN is also light, simple and effective.
As in traditional deep convolutional networks, a residual connection structure is also adopted; this ensures that the gradient does not vanish when training a deep network.
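A minimal sketch of one dilated causal convolution block with a residual connection, in the spirit of the structure in fig. 6; PyTorch is assumed, and the channel sizes, kernel width and dilation are illustrative rather than the patent's exact configuration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TemporalBlock(nn.Module):
        # dilated causal Conv1d -> ReLU -> dilated causal Conv1d, plus a residual (skip) connection
        def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation   # left padding keeps the convolution causal
            self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
            self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
            self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
            self.relu = nn.ReLU()

        def forward(self, x):                          # x: (batch, channels, time)
            y = self.relu(self.conv1(F.pad(x, (self.pad, 0))))
            y = self.relu(self.conv2(F.pad(y, (self.pad, 0))))
            return self.relu(y + self.downsample(x))   # residual connection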
The neural network adopts a time sequence convolution network, the loss adopts mean square error loss MSE, and the calculation formula is as follows:
MSE = (1/T) · Σ_{i=1}^{T} (y_i - ŷ_i)²
where T is the period (number of frames), y_i is the true value and ŷ_i is the predicted value; the network is constrained by measuring the Euclidean distance between the predicted value and the true value.
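As a sketch, the loss measures the squared Euclidean distance between the 27 predicted and ground-truth blendshape weights, averaged over the T frames of a training clip:
    import numpy as np

    def mse_loss(y_true, y_pred):
        # (1/T) * sum over i of ||y_i - y_hat_i||^2, with y of shape (T, 27)
        y_true = np.asarray(y_true, dtype=np.float64)
        y_pred = np.asarray(y_pred, dtype=np.float64)
        return np.mean(np.sum((y_true - y_pred) ** 2, axis=-1))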
Example 2
This embodiment provides a system for driving a mouth shape by voice based on time sequence convolution, comprising a data acquisition module and an audio feature processing module. The data acquisition module represents the mouth movement with blendshapes, outputs the weights of a plurality of blendshapes through a neural network, and combines the blendshape values to obtain a reasonable representation of the mouth movement. The audio feature processing module discretizes the mouth-movement representation and the sound signal; the discretized sound signal is a time-domain signal, which is converted into the frequency domain through a Fourier transform to complete the feature conversion.
The audio feature processing module comprises a pre-emphasis unit, a windowing unit, a discrete Fourier transform unit, a Mel filter bank, a logarithm calculation unit and a discrete cosine transform unit: the pre-emphasis unit emphasizes the energy of the high-frequency part of the speech signal; the windowing unit weights the data within a sliding window; the discrete Fourier transform unit performs a discrete Fourier transform on the weighted data; the Mel filter bank converts the spectrum after the discrete Fourier transform to the Mel scale; the logarithm calculation unit handles the conversion between the Mel scale and Hertz; and the discrete cosine transform unit performs an inverse discrete Fourier transform to obtain the MFCC features.
The invention introduces time sequence convolution and uses a time sequence convolutional network to process the voice spectrum features, better addressing the problems of dependence on time sequence information and of a single generation mode. In the network part, compared with a recurrent neural network and a traditional convolutional network, the method both respects the information dependence along the time sequence and preserves the accuracy of data generation. Using blendshapes is simpler than using a mesh, and complex mouth movements can be represented with less data.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for driving a mouth shape by voice based on time sequence convolution, comprising: representing the mouth movement with blendshapes, outputting the weights of a plurality of blendshapes through a neural network, and combining the blendshape values to obtain a reasonable representation of the mouth movement; the mouth-movement representation and the sound signal are both discretized, the discretized sound signal is a time-domain signal, and the time-domain signal is converted into the frequency domain through a Fourier transform to complete the feature conversion.
2. The method for driving a mouth shape by voice based on time sequence convolution as claimed in claim 1, wherein the feature conversion method is: pre-emphasize and window the voice signal, perform a discrete Fourier transform, pass the result through a Mel filter bank and take the logarithm, and then perform a discrete cosine transform to obtain the MFCC features.
3. The method for driving mouth shape by voice based on time series convolution according to claim 1, wherein the neural network adopts a time series convolution network, the loss adopts mean square error loss (MSE), and the calculation formula is as follows:
MSE = (1/T) · Σ_{i=1}^{T} (y_i - ŷ_i)²
wherein T is the period (number of frames), y_i is the true value and ŷ_i is the predicted value; the network is constrained by measuring the Euclidean distance between the predicted value and the true value.
4. A system for driving a mouth shape by voice based on time sequence convolution, characterized by comprising a data acquisition module and an audio feature processing module, wherein the data acquisition module represents the mouth movement with blendshapes, outputs the weights of a plurality of blendshapes through a neural network, and combines the blendshape values to obtain a reasonable representation of the mouth movement; and the audio feature processing module discretizes the mouth-movement representation and the sound signal, the discretized sound signal being a time-domain signal that is converted into the frequency domain through a Fourier transform to complete the feature conversion.
5. The system of claim 4, wherein the audio feature processing module comprises a pre-emphasis unit, a windowing unit, a discrete Fourier transform unit, a Mel filter bank, a logarithm calculation unit and a discrete cosine transform unit; the pre-emphasis unit emphasizes the energy of the high-frequency part of the speech signal; the windowing unit weights the data within a sliding window; the discrete Fourier transform unit performs a discrete Fourier transform on the weighted data; the Mel filter bank converts the spectrum after the discrete Fourier transform to the Mel scale; the logarithm calculation unit handles the conversion between the Mel scale and Hertz; and the discrete cosine transform unit performs an inverse discrete Fourier transform to obtain the MFCC features.
CN202210116972.1A 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution Pending CN114495908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210116972.1A CN114495908A (en) 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210116972.1A CN114495908A (en) 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution

Publications (1)

Publication Number Publication Date
CN114495908A true CN114495908A (en) 2022-05-13

Family

ID=81477987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210116972.1A Pending CN114495908A (en) 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution

Country Status (1)

Country Link
CN (1) CN114495908A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019174131A1 (en) * 2018-03-12 2019-09-19 平安科技(深圳)有限公司 Identity authentication method, server, and computer readable storage medium
CN110277099A (en) * 2019-06-13 2019-09-24 北京百度网讯科技有限公司 Voice-based nozzle type generation method and device
CN113035198A (en) * 2021-02-26 2021-06-25 北京百度网讯科技有限公司 Lip movement control method, device and medium for three-dimensional face
CN113314145A (en) * 2021-06-09 2021-08-27 广州虎牙信息科技有限公司 Sample generation method, model training method, mouth shape driving device, mouth shape driving equipment and mouth shape driving medium
CN113592985A (en) * 2021-08-06 2021-11-02 宿迁硅基智能科技有限公司 Method and device for outputting mixed deformation value, storage medium and electronic device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220513)