WO2021051606A1 - 基于双向lstm的唇形样本生成方法、装置和存储介质 - Google Patents

基于双向lstm的唇形样本生成方法、装置和存储介质

Info

Publication number
WO2021051606A1
WO2021051606A1 (PCT/CN2019/118373)
Authority
WO
WIPO (PCT)
Prior art keywords
lip
audio information
preset
face image
image
Prior art date
Application number
PCT/CN2019/118373
Other languages
English (en)
French (fr)
Inventor
韦嘉楠
王义文
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021051606A1 publication Critical patent/WO2021051606A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Definitions

  • This application relates to the field of computer processing technology, and in particular to a method, device and storage medium for generating a lip sample based on a two-way LSTM.
  • Lip language (lip-reading) recognition technology has gradually been applied in business scenarios such as financial security. Unlike speech recognition, lip-reading recognition is a technology that combines machine vision with natural language processing. As a means of liveness detection, its main working method is to prompt the user with a string of digits and ask the user to read that string aloud to confirm the user's identity; machine vision then recognizes the speaker's lip movements, interprets what the speaker is saying, and thereby determines whether the speaker is the target user.
  • To achieve this, the back-end database of a lip-reading system usually stores sample data of the target user.
  • In existing lip-reading technology, sample data is mainly added by manually labeling data.
  • Manual labeling consumes a lot of manpower, and the manually labeled data may contain a large amount of data recorded in extreme environments, which makes it difficult to meet the sample-data requirements of lip-reading technology.
  • This in turn affects the recognition accuracy of lip-reading technology.
  • The main purpose of this application is to provide a bidirectional-LSTM-based lip sample generation method, device and storage medium, aiming to solve the technical problem that sample data recorded in extreme environments degrades the recognition accuracy of lip-reading technology.
  • this application provides a method for generating a lip sample based on a two-way LSTM, which includes the following steps:
  • the first MFCC feature is used as the input of a preset bidirectional LSTM model;
  • the second lip key points are used as the output of the preset bidirectional LSTM model, where the first MFCC feature sequence and the second lip key-point sequence have the same length, and the preset bidirectional LSTM model is trained to obtain a trained bidirectional LSTM model;
  • the first lip key points and the lip mask face image are input into the trained image completion model to obtain newly added sample data.
  • The present application also provides a device, the device including: a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor; when the computer-readable instructions are executed by the processor, the steps of the bidirectional-LSTM-based lip sample generation method described above are implemented.
  • The present application also provides a non-volatile readable storage medium storing computer-readable instructions; when the computer-readable instructions are executed by a processor, the steps of the bidirectional-LSTM-based lip sample generation method described above are implemented.
  • This application discloses a method, device and storage medium for generating a lip sample based on a two-way LSTM.
  • FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiment of the present application;
  • FIG. 2 is a schematic flowchart of an embodiment of a method for generating a lip shape sample based on a two-way LSTM according to the present application;
  • FIG. 3 is a detailed flowchart of the steps of collecting user sample data from the sample database and training a preset bidirectional LSTM model based on the sample data to obtain a trained bidirectional LSTM model;
  • FIG. 4 is a detailed flowchart of the steps of obtaining the corresponding second lip key points according to the image information;
  • FIG. 5 is a schematic flowchart of another embodiment of the bidirectional-LSTM-based lip sample generation method of the present application.
  • FIG. 1 is a schematic diagram of a terminal structure of a hardware operating environment involved in a solution of an embodiment of the present application.
  • the terminal of this application is a device, and the device may be a terminal device with a storage function such as a mobile phone, a computer, or a mobile computer.
  • the terminal may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory, or a non-volatile memory, such as a disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • the terminal may also include a camera, a Wi-Fi module, etc., which will not be repeated here.
  • terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, and may include more or fewer components than shown in the figure, or combine some components, or arrange different components.
  • the network interface 1004 is mainly used to connect to a back-end server and communicate with the back-end server;
  • the user interface 1003 mainly includes an input unit such as a keyboard, where the keyboard may be a wireless keyboard or a wired keyboard used to connect to a client and exchange data with the client; and the processor 1001 can be used to call the computer-readable instructions stored in the memory 1005 and perform the following operations:
  • the first lip key points and the lip mask face image are input into the trained image completion model to obtain newly added sample data.
  • processor 1001 may call computer-readable instructions stored in the memory 1005, and also perform the following operations:
  • the first MFCC feature is used as the input of a preset bidirectional LSTM model,
  • and the second lip key points are used as the output of the preset bidirectional LSTM model, where the first MFCC feature sequence and the second lip key-point sequence have the same length, and the preset bidirectional LSTM model is trained to obtain a trained bidirectional LSTM model.
  • processor 1001 may call computer-readable instructions stored in the memory 1005, and also perform the following operations:
  • processor 1001 may call computer-readable instructions stored in the memory 1005, and also perform the following operations:
  • the audio information is input into a preset first algorithm to perform pre-emphasis processing on the audio information and obtain a corresponding audio sequence, where the pre-emphasis formula is H(Z) = 1 - μZ⁻¹, μ is the filter parameter, and Z is the data volume of the audio information;
  • Framing and windowing are performed on the audio sequence to obtain the first MFCC feature of the audio sequence.
  • processor 1001 may call computer-readable instructions stored in the memory 1005, and also perform the following operations:
  • the face image is input into a preset second algorithm for convolution and dimensionality reduction to obtain the corresponding second lip key points.
  • processor 1001 may call computer-readable instructions stored in the memory 1005, and also perform the following operations:
  • the lip region is masked, and a face image whose lip region is masked is used as the lip masked face image.
  • processor 1001 may call computer-readable instructions stored in the memory 1005, and also perform the following operations:
  • the first MFCC feature and the second lip key point are input into a preset linear interpolation algorithm to adjust the sequence of the first MFCC feature and the second lip key point to be equal.
  • the optional embodiments of this device are basically the same as the following embodiments of the method for generating lip samples based on two-way LSTM, and will not be repeated here.
  • FIG. 2 is a schematic flowchart of an embodiment of a method for generating lip samples based on bidirectional LSTM according to the present application.
  • the method for generating lip samples based on bidirectional LSTM provided in this embodiment includes the following steps:
  • Step S10 collecting user sample data from the sample database, and training a preset bidirectional long-short-term memory network LSTM model based on the sample data to obtain a trained bidirectional LSTM model;
  • it is easy to understand that the training of both the preset bidirectional LSTM model and the preset image completion model uses existing sample data.
  • the user's original sample data is stored in the sample database, and any piece of sample data, that is, any piece of sample video, is collected from the sample database.
  • a sample video with a duration longer than 1 second is collected.
  • the preset two-way LSTM model is trained, and the trained two-way LSTM model is obtained.
  • Step S20: obtain a lip mask face image according to the sample data, and train a preset image completion model according to the sample data and the lip mask face image to obtain a trained image completion model;
  • the collected sample data is used to process the original face image information to obtain the lip mask face image.
  • an image completion model is also preset, and the preset image completion model is trained using the sample data and the lip mask face image, and the trained image completion model is obtained.
  • Optionally, the image completion model is a U-Net model. The U-Net model is an improvement on the FCN model: it can be trained with fewer training images than the FCN model and produces more precise segmentation. Because its network structure is U-shaped, it is also called a U-Net network; it includes a feature extraction part and an up-sampling part, where each up-sampling step is fused with the feature map of the same scale from the feature extraction part to achieve multi-scale feature fusion.
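  • The application does not specify the layer configuration of the U-Net completion model; as a rough sketch of the idea only (an encoder-decoder whose up-sampling path is fused with same-scale encoder features), a minimal PyTorch version might look as follows. The channel counts, depth and input size are illustrative assumptions, not values taken from the application.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions with ReLU, as in a typical U-Net stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Illustrative encoder-decoder with skip connections (channel counts are assumptions)."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottom = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 channels from the skip + 64 from up-sampling
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.out = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottom(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # multi-scale feature fusion
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)

# lip-masked face image in, completed face image out
completed = TinyUNet()(torch.randn(1, 3, 128, 128))
```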
  • Step S30 Obtain newly added user audio information, and input the user audio information into the trained two-way LSTM model to obtain the corresponding first lip key points;
  • In this embodiment, after the trained bidirectional LSTM model and image completion model are obtained, the sample data in the sample database is increased by acquiring newly added user audio information, and it can be ensured that the newly added sample data is not recorded in an extreme environment.
  • Step S40: input the first lip key points and the lip mask face image into the trained image completion model to obtain newly added sample data.
  • After the above steps, the first lip key points obtained from the bidirectional LSTM model and the lip mask face image are input into the trained image completion model to obtain a newly generated lip-synchronized face video, and the lip-synchronized face video is used as the newly added sample data.
  • In this embodiment, the user's sample data is collected from the sample database, and a preset bidirectional LSTM model is trained on the sample data to obtain a trained bidirectional LSTM model; a lip mask face image is obtained from the sample data, and a preset image completion model is trained on the sample data and the lip mask face image to obtain a trained image completion model; newly added user audio information is acquired and input into the trained bidirectional LSTM model to obtain the corresponding first lip key points; and the first lip key points and the lip mask face image are input into the trained image completion model to obtain newly added sample data.
  • This embodiment trains the two-way LSTM model and the image completion model, and only needs to input the user's audio information into the trained two-way LSTM model and the image completion model to obtain the newly added user data.
  • The bidirectional LSTM model and the image completion model ensure the accuracy of the newly added user data, thereby avoiding the generation of sample data recorded in extreme environments; a large number of new samples can be generated in this way.
  • When lip recognition is subsequently performed on the user, the expanded sample data further improves the recognition accuracy.
  • FIG. 3 is a schematic diagram of the detailed process, described in this application, of collecting user sample data from the sample database and training a preset bidirectional LSTM model based on the sample data to obtain a trained bidirectional LSTM model.
  • the step of collecting user sample data from a sample database and training a preset two-way LSTM model based on the sample data to obtain the trained two-way LSTM model includes:
  • Step S11 performing format separation on the sample data to obtain corresponding audio information and image information
  • The sample data originally stored by the user in the sample database is a recorded video of the user speaking. Since a video file contains two different formats, audio and image, the sample data is first format-separated; a common format-separation method, or external format-separation software, can be used to separate the image and the audio, so as to obtain the image information and the audio information in the sample data.
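  • As an illustration of this format-separation step, one possible approach is sketched below using ffmpeg and OpenCV; the application only refers to common format-separation methods or software, so this specific tooling is an assumption.

```python
import subprocess
import cv2  # OpenCV

def split_video(video_path, wav_path="audio.wav"):
    """Separate a sample video into an audio track and a list of image frames.
    Using ffmpeg + OpenCV here is an assumption made for illustration only."""
    # extract the audio track as 16 kHz mono PCM (requires ffmpeg on the PATH)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-ac", "1", "-ar", "16000", wav_path], check=True)
    # read the image frames
    frames, cap = [], cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return wav_path, frames
```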
  • Step S12 Obtain the corresponding first Mel frequency cepstrum coefficient MFCC feature according to the audio information, and obtain the corresponding second lip key point according to the image information;
  • After the audio information of the sample data is obtained, the audio information is processed to extract the first MFCC feature corresponding to it.
  • The MFCC feature is a set of feature vectors obtained by encoding the spectral envelope and spectral details of the audio information; the corresponding second lip key points are obtained from the image information in the sample data.
  • Step S13: the first MFCC feature is used as the input of a preset bidirectional LSTM model, and the second lip key points are used as the output of the preset bidirectional LSTM model, where the first MFCC feature sequence and the second lip key-point sequence have the same length, and the preset bidirectional LSTM model is trained to obtain a trained bidirectional LSTM model.
  • a two-way LSTM model is preset.
  • The bidirectional LSTM model is an improvement on the traditional RNN model.
  • The RNN model cannot solve the long-term dependency problem well because of gradient vanishing during optimization.
  • The bidirectional LSTM model applied here learns long-term dependencies better than the RNN model, and LSTM training is much simpler than with other models, which is why the bidirectional LSTM model is selected.
  • Compared with a plain RNN, the preset bidirectional LSTM model adds three gates, namely an input gate, a forget gate and an output gate, as well as a hidden state;
  • the hidden state is used to store the information of previous time steps. Through these mechanisms, additional information is recorded to cope with the gradient decay problem in recurrent neural networks and to better capture dependencies across large time-step distances in the time series, which reflects the model's strong ability to learn long-term dependencies.
  • In this embodiment, the first MFCC feature is used as the input of the preset bidirectional LSTM model,
  • the lip key points are used as the output of the preset bidirectional LSTM model,
  • and the preset bidirectional LSTM model is trained on these pairs.
  • After training of the bidirectional LSTM model is completed, a set of functions is obtained that expresses the mapping relationship between MFCC features and lip key points.
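  • A minimal sketch of such a model and one training step is shown below, assuming PyTorch, a 13-dimensional MFCC input and a hidden size of 128; only the 40-dimensional output (20 x/y lip key points per frame) is fixed by the description above, the remaining sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioToLips(nn.Module):
    """Bidirectional LSTM mapping an MFCC sequence to a lip key-point sequence.
    Hidden size, layer count and the 13-dim MFCC input are assumptions."""
    def __init__(self, mfcc_dim=13, hidden=128, keypoints=20):
        super().__init__()
        self.lstm = nn.LSTM(mfcc_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2 * keypoints)

    def forward(self, mfcc):              # mfcc: (batch, T, mfcc_dim)
        out, _ = self.lstm(mfcc)          # (batch, T, 2*hidden)
        return self.head(out)             # (batch, T, 40) lip key points per frame

model = AudioToLips()
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

mfcc_seq = torch.randn(4, 60, 13)        # one second of features at 60 frames/s
target_kp = torch.randn(4, 60, 40)       # aligned lip key points (same sequence length)
loss = loss_fn(model(mfcc_seq), target_kp)
opt.zero_grad(); loss.backward(); opt.step()
```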
  • In this embodiment, the sample data is separated to obtain the corresponding audio information and image information, and the audio information and image information extracted from the sample data are used to train the preset bidirectional LSTM model, thereby ensuring that the bidirectional LSTM model is fully trained.
  • the step of inputting the user audio information into the trained two-way LSTM model to obtain the corresponding first lip key points includes:
  • Step S31 input the newly added user audio information into the preset first algorithm to obtain the second MFCC feature of the user audio information
  • the first algorithm is also preset in this embodiment.
  • the preset algorithm is the MFCC extraction algorithm.
  • the main purpose of the MFCC feature extraction algorithm is to extract the MFCC features in the audio information.
  • The audio information in the above sample data can be understood as a one-dimensional sequence; the audio sequence is input into the preset MFCC feature extraction algorithm to obtain the second MFCC feature of the audio information.
  • Step S32 Input the second MFCC feature into the trained two-way LSTM model to obtain the corresponding first lip key points.
  • Since the input of the bidirectional LSTM model is an MFCC feature and its output is lip key points, after the preset bidirectional LSTM model has been trained, the second MFCC feature obtained in the above steps is used as the input of the trained bidirectional LSTM model, and the corresponding output of the model is the first lip key points.
  • the newly added user audio information is input into the trained two-way LSTM model to obtain the first lip key points, thereby ensuring the accuracy of the newly generated sample data subsequently.
  • the step of obtaining the corresponding first MFCC feature according to the audio information includes:
  • Step S121 Input the audio information into a preset first algorithm to perform pre-emphasis processing on the audio information to obtain a corresponding audio sequence;
  • where the pre-emphasis formula is H(Z) = 1 - μZ⁻¹, μ is the filter parameter, and Z is the data volume of the audio information;
  • the audio information is input into the preset first algorithm to obtain the corresponding first MFCC feature, and the preset first algorithm processing step is to first perform pre-emphasis processing on the audio information to obtain an audio sequence.
  • the pre-emphasis processing is actually the process of passing the voice signal through a high-pass filter.
  • the formula is shown above.
  • the value range of the filter parameter ⁇ is (0.9, 1), and the value is usually 0.97.
  • the value of the filter parameter can also be adjusted according to the actual situation, which is not limited in this embodiment.
  • Step S122 Perform framing and windowing processing on the audio sequence to obtain the first MFCC feature of the audio sequence.
  • After the pre-emphasis processing, the audio sequence is subjected to framing and windowing.
  • After framing and windowing, the spectrum of the signal becomes flat, so that across the whole band from low to high frequencies the spectrum can be computed with the same signal-to-noise ratio.
  • In particular, a fast Fourier transform is applied to the framed and windowed audio sequence, and the result is fed into a bank of triangular band-pass filters to obtain the first MFCC feature of the audio sequence.
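  • The following sketch strings together the steps named above (pre-emphasis with H(Z) = 1 - μZ⁻¹, framing, Hamming windowing, FFT and a triangular mel filter bank); the frame length, hop size, filter count and number of coefficients are assumptions, not values from the application.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(signal, sr=16000, mu=0.97, frame_len=400, hop=None,
                  n_mels=26, n_mfcc=13, n_fft=512):
    """Pre-emphasis, framing, windowing, FFT and a triangular mel filter bank."""
    signal = np.asarray(signal, dtype=np.float64)
    hop = hop or sr // 60                               # ~60 feature frames per second
    # pre-emphasis: y[n] = x[n] - mu * x[n-1], i.e. H(Z) = 1 - mu * Z^-1
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # framing followed by a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # power spectrum via FFT
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # triangular band-pass (mel) filter bank
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # discrete cosine transform, keep the first n_mfcc coefficients
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]

mfcc = mfcc_features(np.random.randn(16000))   # one second of audio -> (frames, 13)
```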
  • In this embodiment, pre-emphasis, framing and windowing are performed on the newly added audio data using the preset first algorithm to obtain the corresponding first MFCC feature, which ensures the accuracy of the sample data generated subsequently.
  • FIG. 4 is a detailed flow diagram of the steps of obtaining the corresponding second lip key points according to the image information as described in this application.
  • the step of obtaining corresponding key points of the second lip according to the image information includes:
  • Step S123 Perform face detection on the image information to obtain a corresponding face image
  • Optionally, the SSD key-point algorithm or the MTCNN algorithm can be used to perform face detection on the image information and obtain the face image contained in it;
  • the particular face-detection algorithm is not limited in this embodiment.
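  • For illustration only, a face-detection stand-in using dlib's frontal face detector is sketched below; the embodiment names the SSD key-point algorithm or MTCNN as options and does not fix a particular detector, so this choice is an assumption.

```python
import cv2
import dlib  # dlib's HOG detector is used here purely as a stand-in

detector = dlib.get_frontal_face_detector()

def detect_face(frame_bgr):
    """Return the cropped face region of a video frame, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)              # up-sample once to catch smaller faces
    if not faces:
        return None
    f = faces[0]
    return frame_bgr[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
```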
  • Step S124 input the human face image into a preset second algorithm for convolution and dimensionality reduction to obtain the corresponding second lip key points.
  • a second algorithm is also preset.
  • Optionally, the second algorithm is an improved dlib face detection algorithm. If the face image obtained from the sample data is an RGB image, the face image can be used directly as the input of the dlib face detection algorithm;
  • alternatively, the RGB face image can first be converted into a grayscale image and then used as the input of the dlib face detection algorithm.
  • the second algorithm is preset to convolve and reduce the dimensionality of the face image.
  • Optionally, skip (residual) connections are used on the input image,
  • and a total of 4 convolutional layers are stacked, with kernel sizes of 5*5, 3*3, 3*3 and 3*3 and 16, 32, 64 and 128 convolution kernels per layer, respectively.
  • After each convolution, the ReLU activation function is applied to the convolved data,
  • and each convolutional layer is followed by a max-pooling layer with a 2*2 kernel and a stride of 2 to achieve downsampling.
  • In this way, after the four convolutional layers the shape of the convolution tensor is 128*2*2;
  • a global average pooling layer then reduces the convolution tensor to a 128-dimensional feature vector,
  • and a fully connected layer regresses 20 lip key-point coordinates, i.e. the output after the fully connected layer is a 40-dimensional vector.
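  • A sketch following this layer description is given below (four convolutions with 5*5, 3*3, 3*3, 3*3 kernels and 16/32/64/128 channels, ReLU, 2*2 max pooling with stride 2, global average pooling and a 40-dimensional fully connected output); the skip connections mentioned above are omitted for brevity, and the 32*32 input size is an assumption chosen so that the tensor after the convolutions is 128*2*2 as stated.

```python
import torch
import torch.nn as nn

class LipKeypointNet(nn.Module):
    """Conv net regressing 20 (x, y) lip key points, following the text's layer description."""
    def __init__(self, in_ch=3):
        super().__init__()
        chans = [16, 32, 64, 128]
        kernels = [5, 3, 3, 3]
        layers, prev = [], in_ch
        for c, k in zip(chans, kernels):
            layers += [nn.Conv2d(prev, c, k, padding=k // 2),   # 5x5 then 3x3 kernels
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2, 2)]                      # 2x2 max pooling, stride 2
            prev = c
        self.features = nn.Sequential(*layers)
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling -> 128-dim vector
        self.fc = nn.Linear(128, 40)            # 20 lip key-point coordinates (x, y)

    def forward(self, x):
        x = self.gap(self.features(x)).flatten(1)
        return self.fc(x)

keypoints = LipKeypointNet()(torch.randn(1, 3, 32, 32))   # -> (1, 40)
```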
  • The preset formula used in the second algorithm to reduce the dimensionality of the face image is given as an embedded expression in the application, in which:
  • t represents the index of the second lip key point,
  • i represents the face image data,
  • Φ(w_t) is the regularization term,
  • and the remaining expression represents the loss function.
  • In this embodiment, face detection is performed on the image information in the sample data to obtain the corresponding face image, and the preset second algorithm is then used to accurately extract the second lip key points corresponding to the face image, thereby ensuring the accuracy of the sample data generated subsequently.
  • the step of obtaining a lip mask face image according to the sample data includes:
  • Step S21 Obtain the lip area in the face image in the image information according to the second lip key point;
  • the number of lip key points in this embodiment is 20, and 20 lip key points are connected to obtain the lip area in the face image.
  • step S22 the lip region is masked, and the face image on which the lip region is masked is used as the lip mask face image.
  • After the lip area in the face image is obtained, the lip area in the image information is masked, that is, the mask bit of each pixel in the lip area of the face image is set to the masked state; when the face image is subsequently processed, the pixels whose mask bit is in the masked state are not processed.
  • In addition, lip-synchronized face videos of different users can be generated by replacing the lip mask face image.
  • Optionally, after the lip key points are obtained, the lip mask face image of the original target user is not used; instead, the face image of any other user can be lip-masked according to the lip key points to obtain a new lip mask face image, and the lip key points together with the new lip mask face image are input into the trained image completion model to obtain a lip-synchronized face video.
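  • A simple way to realize the masking described above, assuming OpenCV and 20 (x, y) lip key points in pixel coordinates, is sketched below; zeroing the masked pixels is an illustrative choice, since the application only requires that masked pixels be excluded from later processing.

```python
import numpy as np
import cv2

def mask_lips(face_img, lip_points):
    """Mask the lip region of a face image given its 20 lip key points.
    lip_points: array of shape (20, 2) with (x, y) pixel coordinates."""
    mask = np.zeros(face_img.shape[:2], dtype=np.uint8)
    polygon = np.asarray(lip_points, dtype=np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(mask, [polygon], 255)     # connect the key points into a lip region
    masked = face_img.copy()
    masked[mask == 255] = 0                # masked pixels are excluded from later processing
    return masked, mask
```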
  • FIG. 5 is a schematic flowchart of another embodiment of a method for generating a lip sample based on a two-way LSTM according to the present application.
  • Step S14: input the first MFCC feature and the second lip key points into a preset linear interpolation algorithm to adjust the sequences of the first MFCC feature and the second lip key points to be equal in length.
  • It is easy to understand that, since the audio is framed at 60 frames per second, the first MFCC feature sequence has 60 frames per second,
  • while the second lip key-point sequence has 24 frames per second.
  • The lengths of the second lip key-point sequence and the first MFCC feature sequence are therefore not necessarily equal, so linear interpolation is applied to make the first MFCC feature sequence and the lip key-point sequence equal in length.
  • Linear interpolation is an interpolation method whose interpolation function is a first-degree polynomial, with zero interpolation error at the interpolation nodes; it can be used to approximate the original function, or to compute values that are not present in a lookup table.
  • The lip key-point sequence is interpolated to the length of the first MFCC feature sequence, which yields a pair of sequences, from MFCC features to lip key points, of equal length.
  • This embodiment adjusts the sequences of the first MFCC feature and the second lip key points to be equal in length, which satisfies the input and output data requirements of the preset bidirectional LSTM model, correspondingly reduces the amount of computation, and improves the training efficiency of the bidirectional LSTM model.
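  • A minimal sketch of this length alignment with NumPy's linear interpolation is shown below; the 24- and 60-frame-per-second rates follow the example above, and interpolating each coordinate independently is an implementation assumption.

```python
import numpy as np

def resample_keypoints(keypoints, target_len):
    """Linearly interpolate a lip key-point sequence (e.g. 24 frames/s) so that
    its length matches the MFCC feature sequence (e.g. 60 frames/s).
    keypoints: array of shape (T, 40), i.e. 20 (x, y) pairs per frame."""
    kp = np.asarray(keypoints, dtype=np.float32)
    src_t = np.linspace(0.0, 1.0, num=len(kp))
    dst_t = np.linspace(0.0, 1.0, num=target_len)
    # interpolate each of the 40 coordinates independently
    return np.stack([np.interp(dst_t, src_t, kp[:, d]) for d in range(kp.shape[1])], axis=1)

# stretch 24 key-point frames to align with 60 MFCC frames for one second of audio
aligned = resample_keypoints(np.random.rand(24, 40), target_len=60)
```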
  • In addition, an embodiment of the present application also proposes a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the operations of the bidirectional-LSTM-based lip sample generation method described above are implemented.
  • the optional embodiments of the computer-readable storage medium of the present application are basically the same as the above-mentioned embodiments of the method for generating lip samples based on the bidirectional LSTM, and will not be repeated here.
  • The technical solution of this application, in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A bidirectional-LSTM-based lip sample generation method, device and storage medium. The method comprises: collecting a user's sample data from a sample database, and training a preset bidirectional LSTM model on the sample data to obtain a trained bidirectional LSTM model (S10); obtaining a lip mask face image from the sample data, and training a preset image completion model on the sample data and the lip mask face image to obtain a trained image completion model (S20); acquiring newly added user audio information, and inputting the user audio information into the trained bidirectional LSTM model to obtain corresponding first lip key points (S30); and inputting the first lip key points and the lip mask face image into the trained image completion model to obtain newly added sample data (S40).

Description

基于双向LSTM的唇形样本生成方法、装置和存储介质
本申请要求于2019年9月18日提交中国专利局、申请号为201910896546.2、发明名称为“基于双向LSTM的唇形样本生成方法、装置和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在申请中。
技术领域
本申请涉及计算机处理技术领域,尤其涉及一种基于双向LSTM的唇形样本生成方法、装置和存储介质。
背景技术
唇语识别技术目前已经逐渐被应用在金融安防之类的业务场景中,与语音识别不同,唇语识别是基于机器视觉与自然语言处理于一体的技术。唇语识别技术作为活体检测的一种手段,它的主要工作方式为,向用户提示一串数字,并要求用户阅读该串数字,以确定用户的身份,通过机器视觉识别说话人唇部动作,解读说话者的说话内容,并以此判断说话者是否为目标用户。
为了实现达到上述技术效果,唇语识别技术的后台数据库往往存储有目标用户的样本数据。但是,现有的唇语识别技术中,主要通过人工标注数据的方式增加样本数据,人工标注方式会消耗大量的人力,且人工标注的数据可能存在大量极端环境数据,难以满足唇语识别技术对样本数据的要求,进而影响唇语识别技术的识别准确率。
发明内容
本申请的主要目的在于提供了一种基于双向LSTM的唇形样本生成方法、装置和存储介质,旨在解决唇语识别技术中因存在极端环境的样本数据,进而影响识别准确率的技术问题。
为实现上述目的,本申请提供了一种基于双向LSTM的唇形样 本生成方法,包括以下步骤:
对所述样本数据进行格式分离,得到对应的音频信息以及图象信息;
根据所述音频信息得到对应的第一Mel频率倒谱系数MFCC特征,并根据所述图象信息得到对应的第二唇部关键点;
将所述第一MFCC特征作为预设双向LSTM模型的输入,所述第二唇部关键点作为预设双向LSTM模型的输出,其中,所述第一MFCC特征和所述第二唇部关键点序列相同,训练预设双向LSTM模型,以得到训练完成的双向LSTM模型;
从样本数据库中采集用户的样本数据,根据所述样本数据训练预设双向长短期记忆网络LSTM模型,以得到训练完成的双向LSTM模型;
根据所述样本数据得到唇部掩码人脸图象,并根据所述样本数据和所述唇部掩码人脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;
获取新增的用户音频信息,并将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;
将所述第一唇部关键点和所述唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。
此外,为实现上述目的,本申请还提供一种装置,所述装置包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机可读指令,所述计算机可读指令被所述处理器执行时实现如上所述基于双向LSTM的唇形样本生成方法的步骤。
此外,为实现上述目的,本申请还提供一种非显失性可读存储介质,所述非显失性可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如上所述基于双向LSTM的唇形样本生成方法的步骤。
本申请公开了一种基于双向LSTM的唇形样本生成方法、装置和存储介质,通过从样本数据库中采集用户的样本数据,根据样本数据训练预设双向LSTM模型,以得到训练完成的双向LSTM模型; 根据样本数据得到唇部掩码人脸图象,并根据样本数据和唇部掩码人脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;获取新增的用户音频信息,并将用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;将第一唇部关键点和唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。通过对双向LSTM模型和图象补全模型进行训练,只需要将用户的音频信息输入至训练完成的双向LSTM模型和图象补全模型中,就能得到新增的用户数据,使用双向LSTM模型和图象补全模型保证新增的用户数据的准确性,从而避免极端环境下的样本数据的产生,通过上述方式产生大量新增样本,以此提高唇语识别的准确率。
附图说明
图1是本申请实施例方案涉及的硬件运行环境的装置结构示意图;
图2为本申请基于双向LSTM的唇形样本生成方法一实施例的流程示意图;
图3为本申请所述从样本数据库中采集用户的样本数据,根据所述样本数据训练预设双向LSTM模型,以得到训练完成的双向LSTM模型的步骤细化流程示意图;
图4为本申请所述根据所述图象信息得到对应的第二唇部关键点的步骤细化流程示意图;
图5为本申请基于双向LSTM的唇形样本生成方法另一实施例的流程示意图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的可选实施例仅仅用以解释本申请,并不用于限定本申请。
如图1所示,图1是本申请实施例方案涉及的硬件运行环境的终端结构示意图。
本申请终端是一种装置,该装置可以是一种手机、电脑、移动电脑等具有存储功能的终端设备。
如图1所示,该终端可以包括:处理器1001,例如CPU,通信总线1002,用户接口1003,网络接口1004,存储器1005。其中,通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard),可选的用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是稳定的存储器(non-volatile memory),例如磁盘存储器。存储器1005可选的还可以是独立于前述处理器1001的存储装置。
可选地,终端还可以包括摄像头、Wi-Fi模块等等,在此不再赘述。
本领域技术人员可以理解,图1中示出的终端结构并不构成对终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
在图1所示的终端中,网络接口1004主要用于连接后台服务器,与后台服务器进行数据通信;用户接口1003主要包括输入单元比如键盘,键盘包括无线键盘和有线键盘,用于连接客户端,与客户端进行数据通信;而处理器1001可以用于调用存储器1005中存储的计算机可读指令,并执行以下操作:
从样本数据库中采集用户的样本数据,根据所述样本数据训练预设双向长短期记忆网络LSTM模型,以得到训练完成的双向LSTM 模型;
根据所述样本数据得到唇部掩码人脸图象,并根据所述样本数据和所述唇部掩码人脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;
获取新增的用户音频信息,并将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;
将所述第一唇部关键点和所述唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。
进一步地,处理器1001可以调用存储器1005中存储的计算机可读指令,还执行以下操作:
对所述样本数据进行格式分离,得到对应的音频信息以及图象信息;
根据所述音频信息得到对应的第一Mel频率倒谱系数MFCC特征,并根据所述图象信息得到对应的第二唇部关键点;
将所述第一MFCC特征作为预设双向LSTM模型的输入,所述第二唇部关键点作为预设双向LSTM模型的输出,其中,所述第一MFCC特征和所述第二唇部关键点序列相同,训练预设双向LSTM模型,以得到训练完成的双向LSTM模型。
进一步地,处理器1001可以调用存储器1005中存储的计算机可读指令,还执行以下操作:
将新增的用户音频信息输入至预设第一算法中,得到所述用户音频信息的第二MFCC特征;
将所述第二MFCC特征输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点。
进一步地,处理器1001可以调用存储器1005中存储的计算机可读指令,还执行以下操作:
将所述音频信息输入至预设第一算法中,以对所述音频信息进行预加重处理,得到对应的音频序列;
其中,对所述音频信息进行预加重处理的公式为:
H(Z)=1-μZ⁻¹
μ为滤波参数,Z为音频信息的数据量;
对所述音频序列进行分帧和加窗处理,以得到所述音频序列的第一MFCC特征。
进一步地,处理器1001可以调用存储器1005中存储的计算机可读指令,还执行以下操作:
对所述图象信息进行人脸检测,得到对应的人脸图象;
将所述人脸图象输入至预设第二算法中进行卷积和降维,得到对应的第二唇部关键点。
进一步地,处理器1001可以调用存储器1005中存储的计算机可读指令,还执行以下操作:
根据所述第二唇部关键点,得到图象信息中所述人脸图象中的唇部区域;
对所述唇部区域掩码处理,将唇部区域进行掩码处理的人脸图象作为所述唇部掩码人脸图象。
进一步地,处理器1001可以调用存储器1005中存储的计算机可读指令,还执行以下操作:
将所述第一MFCC特征和所述第二唇部关键点输入至预设线性插值算法中,以调整所述第一MFCC特征和所述第二唇部关键点的序列相等。
本装置的可选实施例与下述基于双向LSTM的唇形样本生成方法各实施例基本相同,在此不作赘述。
请参阅图2,图2为本申请基于双向LSTM的唇形样本生成方法一实施例的流程示意图,本实施例提供的基于双向LSTM的唇形样本生成方法包括如下步骤:
步骤S10,从样本数据库中采集用户的样本数据,根据所述样本数据训练预设双向长短期记忆网络LSTM模型,以得到训练完成的双向LSTM模型;
容易理解的是,对于预设双向LSTM模型和预设图象补全模型的训练都是利用的现有的样本数据。在样本数据库中存储有用户原先的样本数据,从所述样本数据库中采集任意一段样本数据,即任意一 段样本视频。为了进行后续的特征分离,可选地,采集时长大于1秒的样本视频。根据采集的样本数据对预设双向LSTM模型进行训练,并得到训练完成的双向LSTM模型。
步骤S20,根据所述样本数据得到唇部掩码人脸图象,并根据所述样本数据和所述唇部掩码人脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;
本实施例中,使用采集的样本数据对原有的人脸图象信息进行处理,得到唇部掩码人脸图象。本实施例中还预设有图像补全模型,使用所述样本数据和唇部掩码人脸图象训练预设图象补全模型,并得到训练完成的图象补全模型。可选地,所述图像补全模型为U-NET模型,U-Net模型是基于FCN模型改进所得到的,U-Net模型较比FCN模型能够在更少的训练图像的情况下运行,并做出更为精确的分割操作,由于网络结构像U型,所以也叫U-Net网络,包括特征提取部分和上采样部分。其中,上采样部分,每上采样一次,就和特征提取部分对应的通道数相同尺度融合,实现多尺度特征的融合。
步骤S30,获取新增的用户音频信息,并将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;
本实施例中,得到训练完成的双向LSTM模型和图象补全模型后,通过获取新增的用户音频信息增加样本数据库中的样本数据,且能保证新增的样本数据不处于极端环境的情况。可选地,先将所述用户音频信息输入至训练完成的双向LSTM模型,得到与所述用户新增的音频信息对应的第一唇部关键点。
步骤S40,将所述第一唇部关键点和所述唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。
上述步骤后,将从双向LSTM模型得到的第一唇部关键点和唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的唇形人脸同步视频,并将所述唇形人脸同步视频作为新增的样本数据。
本实施例通过从样本数据库中采集用户的样本数据,根据样本数据训练预设双向LSTM模型,以得到训练完成的双向LSTM模型;根据样本数据得到唇部掩码人脸图象,并根据样本数据和唇部掩码人 脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;获取新增的用户音频信息,并将用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;将第一唇部关键点和唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。本实施例通过对双向LSTM模型和图象补全模型进行训练,只需要将用户的音频信息输入至训练完成的双向LSTM模型和图象补全模型中,就能得到新增的用户数据,使用双向LSTM模型和图象补全模型保证新增的用户数据的准确性,从而避免极端环境下的样本数据的产生,通过上述方式产生大量新增样本,当对用户进行唇语识别时,由于样本数据的扩充,实现进一步地提高唇语识别的准确率。
进一步地,请参阅图3,图3为本申请所述从样本数据库中采集用户的样本数据,根据所述样本数据训练预设双向LSTM模型,以得到训练完成的双向LSTM模型步骤细化流程示意图。所述从样本数据库中采集用户的样本数据,根据所述样本数据训练预设双向LSTM模型,以得到训练完成的双向LSTM模型的步骤包括:
步骤S11,对所述样本数据进行格式分离,得到对应的音频信息以及图象信息;
应当理解的是,样本数据库中用户原先存储的样本数据为录制的用户说话视频,由于视频文件中具有音频和图像两种不同的格式,先对样本数据进行格式分离,可以使用常见的格式分离的方法,或链接格式分离的软件来实现图像和音频的分离,以此得到样本数据中的图像信息和音频信息。
步骤S12,根据所述音频信息得到对应的第一Mel频率倒谱系数MFCC特征,并根据所述图象信息得到对应的第二唇部关键点;
得到样本数据的音频信息后,对所述音频信息进行处理,提取出音频信息对应的第一MFCC特征,MFCC特征是将音频信息的频谱包络和音频细节进行编码运算得到的一组特征向量;并根据样本数据中的图像信息得到所述图像信息中对应的第二唇部关键点。
步骤S13,将所述第一MFCC特征作为预设双向LSTM模型的输入,所述第二唇部关键点作为预设双向LSTM模型的输出,其中, 所述第一MFCC特征和所述第二唇部关键点序列相同,训练预设双向LSTM模型,以得到训练完成的双向LSTM模型。
本实施例中,预设有双向LSTM模型,双向LSTM模型在传统RNN模型上进行了改进,RNN模型由于其优化过程中运用到了梯度消失,因此不能很好的解决长期依赖问题,而本实施例应用的双向LSTM模型对于长期依赖关系的学习能力强于RNN模型,且LSTM训练上远比其他模型简单,因此选用双向LSTM模型。预设双向LSTM模型中新增了3个门,分别为输入门、遗忘门和输出门,以及隐藏状态,隐藏状态用于存储之前时间步的信息;通过上述改进记录额外的信息,以应对循环神经网络(RNN)中的梯度衰减问题,并更好地捕捉时间序列中时间步距离较大的依赖关系,体现了对长期依赖关系的学习能力较强的特点。
本实施例中,将所述第一MFCC特征作为预设双向LSTM模型的输入,所述唇部关键点作为预设双向LSTM模型的输出,训练预设双向LSTM模型,在所述双向LSTM模型训练完成后,得到一组可表现MFCC特征和唇部关键点映射关系的函数。
本实施例通过对样本数据进行数据分离,得到对应的音频信息和图像信息,并利用从样本数据提取出的音频信息和图像信息训练预设的双向LSTM模型,从而保证双向LSTM模型的训练完成度。
进一步地,所述将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点的步骤包括:
步骤S31,将新增的用户音频信息输入至预设第一算法中,得到所述用户音频信息的第二MFCC特征;
本实施例中还预设有第一算法,所述预设算法为MFCC提取算法,MFCC特征提取算法的主要目的在于提取音频信息中的MFCC特征,可以将上述样本数据中的音频信息理解为一组一维序列,将所述音频序列输入至预设MFCC特征提取算法中,得到该音频信息的第二MFCC特征。
步骤S32,将所述第二MFCC特征输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点。
由于双向LSTM模型的输入为MFCC特征,双向LSTM模型的输出为唇部关键点。因此将预设双向LSTM模型训练完成后,将通过上述步骤得到的第二MFCC特征作为训练完成的双向LSTM模型的输入,则所述双向LSTM模型对应的输出为第一唇部关键点。
本实施例通过将新增的用户音频信息输入至训练完成的双向LSTM模型中,以此得到第一唇部关键点,从而保证后续生成的新增样本数据的准确性。
进一步地,所述根据所述音频信息得到对应的第一MFCC特征的步骤包括:
步骤S121,将所述音频信息输入至预设第一算法中,以对所述音频信息进行预加重处理,得到对应的音频序列;
其中,对所述音频信息进行预加重处理的公式为:
H(Z)=1-μZ⁻¹
μ为滤波参数,Z为音频信息的数据量;
将音频信息输入至预设第一算法中,得到对应的第一MFCC特征,预设第一算法处理步骤为,先对所述音频信息进行预加重处理,得到音频序列。预加重处理其实是将语音信号通过一个高通滤波器的过程,公式如上所示。其中,滤波参数μ的数值范围为(0.9,1),通常取值0.97,当然,也可以根据实际情况对应调整滤波参数的数值,本实施例在此不做限制。
步骤S122,对所述音频序列进行分帧和加窗处理,以得到所述音频序列的第一MFCC特征。
对所述音频信息进行预加重处理后,对所述音频序列进行分帧和加窗处理,经过分帧和加窗后,使信号的频谱变得平坦,保持在低频到高频的整个频带中,能用同样的信噪比求频谱。特别的,对经过分帧和加窗的音频序列进行快速傅里叶变化,并输入至三角带通滤波器中,以得到所述音频序列的第一MFCC特征。
本实施例通过预设第一算法对新增的音频数据进行预加重处理、分帧以及加窗处理,得到对应的第一MFCC特征,保证后续生成的 新增样本数据的准确性。
进一步地,请参阅图4,图4为本申请所述根据所述图象信息得到对应的第二唇部关键点的步骤细化流程示意图。所述根据所述图象信息得到对应的第二唇部关键点的步骤包括:
步骤S123,对所述图象信息进行人脸检测,得到对应的人脸图象;
得到所述样本数据中的图像信息后,对所述图象信息进行人脸检测,可选地,可以使用SSD关键点算法或MTCNN算法实现对图像信息的人脸检测,并得到图像信息中的人脸图像,人脸检测的算法,本实施例在此不作限制。
步骤S124,将所述人脸图象输入至预设第二算法中进行卷积和降维,得到对应的第二唇部关键点。
本实施例中,还预设有第二算法,可选地,所述第二算法为改进的dlib人脸检测算法,若获得的样本数据中的人脸图像为RGB图像,则可以将所述人脸图像作为dlib人脸检测算法的输入,当然也可以将人脸图像的RGB图像处理形成灰度图像后作为dlib人脸检测算法的输入。
预设第二算法对人脸图像进行卷积和降维,可选地,对输入的图像使用skipconnection连接(残差连接),总共堆叠4层卷积层,每层卷积核的宽度依次为5*5、3*3、3*3以及3*3每层卷积核的数量对应分别为16、32、64以及128。每实现一次卷积后使用ReLu激活函数对卷积后的数据进行处理,可选地,使每层卷积层后接一层内核为2*2,步长为2的maxpooling层(最大池化层)达到降采样的目的。如此,经过四层卷积之后,卷积张量的形状为128*2*2,通过一层global average pooling(全局均值池化层),从而将卷积张量降维到128的特征向量,经过全连接层回归出20个唇部关键点坐标,即全连接层后的输出为40维向量。
进一步地,预设第二算法中对人脸图象进行降维的公式为:
Figure PCTCN2019118373-appb-000001
其中,t表示第二唇部关键点的序号,i表示人脸图象数据,Φ(w t)为正则项,
Figure PCTCN2019118373-appb-000002
表示损失函数。
可选地,预设第二算法中对人脸图象进行降维的公式如上所示,本实施例通过对样本数据中的图像信息进行人脸检测,得到对应的人脸图像,再利用预设第二算法精准的提取与所述人脸图像对应的第二唇部关键点,从而保证后续生成的新增样本数据的准确性。
进一步地,所述根据所述样本数据得到唇部掩码人脸图象的步骤包括:
步骤S21,根据所述第二唇部关键点,得到图象信息中所述人脸图象中的唇部区域;
可选地,本实施例中唇部关键点的数目为20,将20个唇部关键点进行连线,就得到人脸图像中的唇部区域。
步骤S22,对所述唇部区域掩码处理,将唇部区域进行掩码处理的人脸图象作为所述唇部掩码人脸图象。
得到人脸图像中的唇部区域后,对图像信息中的所述唇部区域进行掩码处理,即将人脸图像中该唇部区域中各个像素对应的掩码位设置为屏蔽状态,后续对人脸图像进行处理时,并不会对掩码位状态为屏蔽状态的像素点进行处理。
此外,也可以通过更换唇部掩码人脸图像的方式,生成不同用户的唇形人脸同步视频。可选地,在得到唇部关键点后,并不使用原目标用户的唇部掩码人脸图像,而可以根据唇部关键点对任一其他用户的人脸图像信息进行唇部掩码处理,得到新的唇部掩码人脸图像,将唇部关键点和新的唇部掩码人脸图像输入至训练完成的图像补全模型中,得到唇形人脸同步视频。
进一步地,请参阅图5,图5为本申请基于双向LSTM的唇形样本生成方法另一实施例的流程示意图。上述步骤S12根据所述音频信息得到对应的第一MFCC特征,并根据所述图象信息得到对应的第二唇部关键点之后,还包括:
步骤S14,将所述第一MFCC特征和所述第二唇部关键点输入至预设线性插值算法中,以调整所述第一MFCC特征和所述第二唇部 关键点的序列相等。
容易理解的是,由于音频分帧为一秒60帧,则第一MFCC特征序列是60帧/秒,而第二唇部关键点序列则是24帧/秒,上述第二唇部关键点序列和第一MFCC特征序列的长度不一定相等,因此应用线性插值法使第一MFCC特征和唇部关键点序列相等。
线性插值是指插值函数为一次多项式的插值方式,其在插值节点上的插值误差为零,线性插值可以用来近似代替原函数,也可以用来计算得到查表过程中表中没有的数值,将唇部关键点序列长度插值到第一MFCC特征序列长度,得到一段MFCC特征序列到唇部关键点的序列。
本实施例将第一MFCC特征和第二唇部关键点的序列调整相等,满足预设双向LSTM模型对输入数据和输出数据的要求,对应的减少计算量,提高所述双向LSTM模型的训练效率。
此外,本申请实施例还提出一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如上所述基于双向LSTM的唇形样本生成方法的操作。
本申请计算机可读存储介质的可选实施例与上述基于双向LSTM的唇形样本生成方法各实施例基本相同,在此不作赘述。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于 这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。
以上仅为本申请的可选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种基于双向LSTM的唇形样本生成方法,其中,包括以下步骤:
    对样本数据进行格式分离,得到对应的音频信息以及图象信息;
    根据所述音频信息得到对应的第一Mel频率倒谱系数MFCC特征,并根据所述图象信息得到对应的第二唇部关键点;
    将所述第一MFCC特征作为预设双向LSTM模型的输入,所述第二唇部关键点作为预设双向LSTM模型的输出,其中,所述第一MFCC特征和所述第二唇部关键点序列相同,训练预设双向LSTM模型,以得到训练完成的双向LSTM模型;
    根据所述样本数据得到唇部掩码人脸图象,并根据所述样本数据和所述唇部掩码人脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;
    获取新增的用户音频信息,并将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;
    将所述第一唇部关键点和所述唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。
  2. 如权利要求1所述的基于双向LSTM的唇形样本生成方法,其中,所述将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点的步骤包括:
    将新增的用户音频信息输入至预设第一算法中,得到所述用户音频信息的第二MFCC特征;
    将所述第二MFCC特征输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点。
  3. 如权利要求1所述的基于双向LSTM的唇形样本生成方法,其中,所述根据所述音频信息得到对应的第一MFCC特征的步骤包括:
    将所述音频信息输入至预设第一算法中,以对所述音频信息进行预加重处理,得到对应的音频序列;
    其中,对所述音频信息进行预加重处理的公式为:
    H(Z)=1-μZ⁻¹
    μ为滤波参数,Z为音频信息的数据量;
    对所述音频序列进行分帧和加窗处理,以得到所述音频序列的第一MFCC特征。
  4. 如权利要求1所述的基于双向LSTM的唇形样本生成方法,其中,所述根据所述图象信息得到对应的第二唇部关键点的步骤包括:
    对所述图象信息进行人脸检测,得到对应的人脸图象;
    将所述人脸图象输入至预设第二算法中进行卷积和降维,得到对应的第二唇部关键点。
  5. 如权利要求4所述的基于双向LSTM的唇形样本生成方法,其中,预设第二算法中对人脸图象进行降维的公式为:
    Figure PCTCN2019118373-appb-100001
    其中,t表示第二唇部关键点的序号,i表示人脸图象数据,Φ(w t)为正则项,
    Figure PCTCN2019118373-appb-100002
    表示损失函数。
  6. 如权利要求4所述的基于双向LSTM的唇形样本生成方法,其中,所述根据所述样本数据得到唇部掩码人脸图象的步骤包括:
    根据所述第二唇部关键点,得到图象信息中所述人脸图象中的唇部区域;
    对所述唇部区域掩码处理,将唇部区域进行掩码处理的人脸图象作为所述唇部掩码人脸图象。
  7. 如权利要求1所述的基于双向LSTM的唇形样本生成方法,其中,所述根据所述音频信息得到对应的第一MFCC特征,并根据所述图象信息得到对应的第二唇部关键点的步骤之后,还包括:
    将所述第一MFCC特征和所述第二唇部关键点输入至预设线性插值算法中,以调整所述第一MFCC特征和所述第二唇部关键点的序列相等。
  8. 一种装置,其中,所述装置包括:存储器、处理器及存储在 所述存储器上并可在所述处理器上运行的计算机可读指令,所述计算机可读指令被所述处理器执行时,执行如下步骤:
    对样本数据进行格式分离,得到对应的音频信息以及图象信息;
    根据所述音频信息得到对应的第一Mel频率倒谱系数MFCC特征,并根据所述图象信息得到对应的第二唇部关键点;
    将所述第一MFCC特征作为预设双向LSTM模型的输入,所述第二唇部关键点作为预设双向LSTM模型的输出,其中,所述第一MFCC特征和所述第二唇部关键点序列相同,训练预设双向LSTM模型,以得到训练完成的双向LSTM模型;
    从样本数据库中采集用户的样本数据,根据所述样本数据训练预设双向长短期记忆网络LSTM模型,以得到训练完成的双向LSTM模型;
    根据所述样本数据得到唇部掩码人脸图象,并根据所述样本数据和所述唇部掩码人脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;
    获取新增的用户音频信息,并将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;
    将所述第一唇部关键点和所述唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。
  9. 如权利要求8所述的装置,所述计算机可读指令被所述处理器执行时,还执行如下步骤:
    将新增的用户音频信息输入至预设第一算法中,得到所述用户音频信息的第二MFCC特征;
    将所述第二MFCC特征输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点。
  10. 如权利要求8所述的装置,所述计算机可读指令被所述处理器执行时,还执行如下步骤:
    将所述音频信息输入至预设第一算法中,以对所述音频信息进行预加重处理,得到对应的音频序列;
    其中,对所述音频信息进行预加重处理的公式为:
    H(Z)=1-μZ⁻¹
    μ为滤波参数,Z为音频信息的数据量;
    对所述音频序列进行分帧和加窗处理,以得到所述音频序列的第一MFCC特征。
  11. 如权利要求8所述的装置,所述计算机可读指令被所述处理器执行时,还执行如下步骤:
    对所述图象信息进行人脸检测,得到对应的人脸图象;
    将所述人脸图象输入至预设第二算法中进行卷积和降维,得到对应的第二唇部关键点。
  12. 如权利要求11所述的装置,其中,预设第二算法中对人脸图象进行降维的公式为:
    Figure PCTCN2019118373-appb-100003
    其中,t表示第二唇部关键点的序号,i表示人脸图象数据,Φ(w t)为正则项,
    Figure PCTCN2019118373-appb-100004
    表示损失函数。
  13. 如权利要求11所述的装置,所述计算机可读指令被所述处理器执行时,还执行如下步骤:
    根据所述第二唇部关键点,得到图象信息中所述人脸图象中的唇部区域;
    对所述唇部区域掩码处理,将唇部区域进行掩码处理的人脸图象作为所述唇部掩码人脸图象。
  14. 如权利要求8所述的装置,所述计算机可读指令被所述处理器执行时,还执行如下步骤:
    将所述第一MFCC特征和所述第二唇部关键点输入至预设线性插值算法中,以调整所述第一MFCC特征和所述第二唇部关键点的序列相等。
  15. 一种非显失性可读存储介质,其中,所述非显失性可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时,执行如下步骤:
    对样本数据进行格式分离,得到对应的音频信息以及图象信息;
    根据所述音频信息得到对应的第一Mel频率倒谱系数MFCC特征,并根据所述图象信息得到对应的第二唇部关键点;
    将所述第一MFCC特征作为预设双向LSTM模型的输入,所述第二唇部关键点作为预设双向LSTM模型的输出,其中,所述第一MFCC特征和所述第二唇部关键点序列相同,训练预设双向LSTM模型,以得到训练完成的双向LSTM模型;
    根据所述样本数据得到唇部掩码人脸图象,并根据所述样本数据和所述唇部掩码人脸图象训练预设图象补全模型,以得到训练完成的图象补全模型;
    获取新增的用户音频信息,并将所述用户音频信息输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点;
    将所述第一唇部关键点和所述唇部掩码人脸图象输入至训练完成的图象补全模型,得到新增的样本数据。
  16. 如权利要求15所述的非显失性可读存储介质,所述计算机可读指令被处理器执行时,还执行如下步骤:
    将新增的用户音频信息输入至预设第一算法中,得到所述用户音频信息的第二MFCC特征;
    将所述第二MFCC特征输入至训练完成的双向LSTM模型,得到对应的第一唇部关键点。
  17. 如权利要求15所述的非显失性可读存储介质,所述计算机可读指令被处理器执行时,还执行如下步骤:
    将所述音频信息输入至预设第一算法中,以对所述音频信息进行预加重处理,得到对应的音频序列;
    其中,对所述音频信息进行预加重处理的公式为:
    H(Z)=1-μZ⁻¹
    μ为滤波参数,Z为音频信息的数据量;
    对所述音频序列进行分帧和加窗处理,以得到所述音频序列的第一MFCC特征。
  18. 如权利要求15所述的非显失性可读存储介质,所述计算机可读指令被处理器执行时,还执行如下步骤:
    对所述图象信息进行人脸检测,得到对应的人脸图象;
    将所述人脸图象输入至预设第二算法中进行卷积和降维,得到对应的第二唇部关键点。
  19. 如权利要求18所述的非显失性可读存储介质,所述计算机可读指令被处理器执行时,还执行如下步骤:
    Figure PCTCN2019118373-appb-100005
    其中,t表示第二唇部关键点的序号,i表示人脸图象数据,Φ(w t)为正则项,
    Figure PCTCN2019118373-appb-100006
    表示损失函数。
  20. 如权利要求15所述的非显失性可读存储介质,所述计算机可读指令被处理器执行时,还执行如下步骤:
    根据所述第二唇部关键点,得到图象信息中所述人脸图象中的唇部区域;
    对所述唇部区域掩码处理,将唇部区域进行掩码处理的人脸图象作为所述唇部掩码人脸图象。
PCT/CN2019/118373 2019-09-18 2019-11-14 基于双向lstm的唇形样本生成方法、装置和存储介质 WO2021051606A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910896546.2A CN110796000B (zh) 2019-09-18 2019-09-18 基于双向lstm的唇形样本生成方法、装置和存储介质
CN201910896546.2 2019-09-18

Publications (1)

Publication Number Publication Date
WO2021051606A1 true WO2021051606A1 (zh) 2021-03-25

Family

ID=69439662

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118373 WO2021051606A1 (zh) 2019-09-18 2019-11-14 基于双向lstm的唇形样本生成方法、装置和存储介质

Country Status (2)

Country Link
CN (1) CN110796000B (zh)
WO (1) WO2021051606A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094682A (zh) * 2021-04-12 2021-07-09 中国工商银行股份有限公司 反欺诈身份识别方法及装置
CN114338959A (zh) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 端到端即文本到视频的视频合成方法、系统介质及应用
CN114419702A (zh) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 数字人生成模型、模型的训练方法以及数字人生成方法
CN116071472A (zh) * 2023-02-08 2023-05-05 华院计算技术(上海)股份有限公司 图像生成方法及装置、计算机可读存储介质、终端
CN116741198A (zh) * 2023-08-15 2023-09-12 合肥工业大学 一种基于多尺度字典的唇形同步方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000181333A (ja) * 1998-12-21 2000-06-30 Nippon Telegr & Teleph Corp <Ntt> 発音訓練支援装置、その方法及びプログラム記録媒体
CN108763897A (zh) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 身份合法性的校验方法、终端设备及介质
CN108847234A (zh) * 2018-06-28 2018-11-20 广州华多网络科技有限公司 唇语合成方法、装置、电子设备及存储介质
CN109409195A (zh) * 2018-08-30 2019-03-01 华侨大学 一种基于神经网络的唇语识别方法及系统
CN109685724A (zh) * 2018-11-13 2019-04-26 天津大学 一种基于深度学习的对称感知人脸图像补全方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578017B (zh) * 2017-09-08 2020-11-17 百度在线网络技术(北京)有限公司 用于生成图像的方法和装置
CN109377539B (zh) * 2018-11-06 2023-04-11 北京百度网讯科技有限公司 用于生成动画的方法和装置
CN110111399B (zh) * 2019-04-24 2023-06-30 上海理工大学 一种基于视觉注意力的图像文本生成方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000181333A (ja) * 1998-12-21 2000-06-30 Nippon Telegr & Teleph Corp <Ntt> 発音訓練支援装置、その方法及びプログラム記録媒体
CN108763897A (zh) * 2018-05-22 2018-11-06 平安科技(深圳)有限公司 身份合法性的校验方法、终端设备及介质
CN108847234A (zh) * 2018-06-28 2018-11-20 广州华多网络科技有限公司 唇语合成方法、装置、电子设备及存储介质
CN109409195A (zh) * 2018-08-30 2019-03-01 华侨大学 一种基于神经网络的唇语识别方法及系统
CN109685724A (zh) * 2018-11-13 2019-04-26 天津大学 一种基于深度学习的对称感知人脸图像补全方法

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094682A (zh) * 2021-04-12 2021-07-09 中国工商银行股份有限公司 反欺诈身份识别方法及装置
CN114338959A (zh) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 端到端即文本到视频的视频合成方法、系统介质及应用
CN114419702A (zh) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 数字人生成模型、模型的训练方法以及数字人生成方法
CN114419702B (zh) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 数字人生成模型、模型的训练方法以及数字人生成方法
CN116071472A (zh) * 2023-02-08 2023-05-05 华院计算技术(上海)股份有限公司 图像生成方法及装置、计算机可读存储介质、终端
CN116071472B (zh) * 2023-02-08 2024-04-30 华院计算技术(上海)股份有限公司 图像生成方法及装置、计算机可读存储介质、终端
CN116741198A (zh) * 2023-08-15 2023-09-12 合肥工业大学 一种基于多尺度字典的唇形同步方法
CN116741198B (zh) * 2023-08-15 2023-10-20 合肥工业大学 一种基于多尺度字典的唇形同步方法

Also Published As

Publication number Publication date
CN110796000A (zh) 2020-02-14
CN110796000B (zh) 2023-12-22

Similar Documents

Publication Publication Date Title
WO2021051606A1 (zh) 基于双向lstm的唇形样本生成方法、装置和存储介质
US10885608B2 (en) Super-resolution with reference images
WO2020119350A1 (zh) 视频分类方法、装置、计算机设备和存储介质
US9418319B2 (en) Object detection using cascaded convolutional neural networks
US11227638B2 (en) Method, system, medium, and smart device for cutting video using video content
US11244157B2 (en) Image detection method, apparatus, device and storage medium
WO2020169051A1 (zh) 一种全景视频数据处理的方法、终端以及存储介质
TWI769725B (zh) 圖像處理方法、電子設備及電腦可讀儲存介質
US20140044359A1 (en) Landmark Detection in Digital Images
CN111429338B (zh) 用于处理视频的方法、装置、设备和计算机可读存储介质
WO2021128817A1 (zh) 视频音频识别方法、设备、存储介质及装置
KR20150032822A (ko) 이미지를 필터링하기 위한 방법 및 장치
CN113780326A (zh) 一种图像处理方法、装置、存储介质及电子设备
WO2022081226A1 (en) Dual-stage system for computational photography, and technique for training same
US20230335148A1 (en) Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium
CN112614110B (zh) 评估图像质量的方法、装置及终端设备
WO2023123873A1 (zh) 一种基于注意力机制稠密光流计算方法
CN117496990A (zh) 语音去噪方法、装置、计算机设备及存储介质
JP2014063377A (ja) 画像処理装置およびプログラム
WO2023284236A1 (zh) 图像盲去噪方法、装置、电子设备和存储介质
CN114882226A (zh) 图像处理方法、智能终端及存储介质
CN117151987A (zh) 一种图像增强方法、装置及电子设备
CN114764839A (zh) 动态视频生成方法、装置、可读存储介质及终端设备
CN115866332B (zh) 一种视频帧插帧模型的处理方法、装置以及处理设备
WO2024082928A1 (zh) 语音处理方法、装置、设备和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946075

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946075

Country of ref document: EP

Kind code of ref document: A1