CN115132201A - Lip language identification method, computer device and storage medium - Google Patents

Lip language identification method, computer device and storage medium

Info

Publication number
CN115132201A
CN115132201A
Authority
CN
China
Prior art keywords
data
lip
audio
target
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210913645.9A
Other languages
Chinese (zh)
Inventor
庄晓滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202210913645.9A priority Critical patent/CN115132201A/en
Publication of CN115132201A publication Critical patent/CN115132201A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/24 - Speech recognition using non-acoustical features
    • G10L15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a lip language identification method, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring audio data and face picture data corresponding to the audio data; carrying out equalization processing on the audio data to obtain target audio data, and carrying out preprocessing on the face picture data to obtain target lip data; inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data; and inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data. In this way, the effectiveness, flexibility and accuracy of lip language recognition can be improved.

Description

Lip language identification method, computer device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a lip language identification method, a computer device, and a storage medium.
Background
In the audio field, audio recognition is widely applied and of great significance. Existing audio recognition methods are mainly implemented based on model prediction, are used to recognize audio of a single language, and usually require a large amount of labelled data for pre-training the model, so the operation is complex and the prediction effect of the model is not ideal. Therefore, how to improve the efficiency of audio recognition is very important.
Disclosure of Invention
The embodiment of the invention provides a lip language identification method, computer equipment and a storage medium, which can improve the effectiveness, flexibility and accuracy of lip language identification.
In a first aspect, an embodiment of the present invention provides a lip language identification method, including:
acquiring audio data and face picture data corresponding to the audio data;
carrying out equalization processing on the audio data to obtain target audio data, and carrying out preprocessing on the face picture data to obtain target lip data;
inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data;
and inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, wherein the lip data comprises position data of lips on each frame of image in the face picture data.
In a second aspect, an embodiment of the present invention provides a lip language identification device, including:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring audio data and face picture data corresponding to the audio data;
the processing unit is used for carrying out equalization processing on the audio data to obtain target audio data and carrying out preprocessing on the face picture data to obtain target lip data;
the extraction unit is used for inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data;
and the recognition unit is used for inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, wherein the lip data comprises position data of lips on each frame of image in the face picture data.
In a third aspect, an embodiment of the present invention provides a computer device, where the computer device includes: a processor and a memory, the processor to perform:
acquiring audio data and face picture data corresponding to the audio data;
carrying out equalization processing on the audio data to obtain target audio data, and carrying out preprocessing on the face picture data to obtain target lip data;
inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data;
and inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, wherein the lip data comprises position data of lips on each frame of image in the face picture data.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where program instructions are stored, and when the program instructions are executed, the computer-readable storage medium is configured to implement the method according to the first aspect.
The embodiment of the invention obtains the audio data and the face picture data corresponding to the audio data; carrying out equalization processing on the audio data to obtain target audio data, and carrying out preprocessing on the face picture data to obtain target lip data; inputting target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data; the audio feature vector and the target lip data are input into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, and the effectiveness, flexibility and accuracy of lip recognition can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a lip language identification system according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a lip language identification method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a face calibration point according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of three lip calibration points provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of an audio feature extraction model provided by an embodiment of the invention;
FIG. 6 is a diagram of a lip language identification model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a lip language identification device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The lip language identification method provided by the embodiment of the invention can be applied to a lip language identification system. The lip language identification system comprises a lip language identification device and a computer device, and the lip language identification device can be arranged in a terminal. In some embodiments, the terminal can include, but is not limited to, intelligent terminal devices such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted intelligent terminal and a smart watch. In some embodiments, one or more databases are included in the computer device and may be used to store audio data and face picture data, such as songs and face images of songs. In some embodiments, the lip language identification method provided by the embodiments of the present invention may be applied to scenarios of identifying lip language in multiple languages and/or with multiple timbres: for example, recognizing voice data, singing voice data, etc., of arbitrary languages and arbitrary timbres. Of course, the above application scenario is only an example, and the lip language recognition according to the embodiment of the present invention may be applied to any scenario associated with lip language recognition.
The lip language recognition system provided by the embodiment of the invention is schematically described below with reference to fig. 1.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a lip language identification system according to an embodiment of the present invention, where the system includes a terminal 11 and a computer device 12, and in some embodiments, the terminal 11 and the computer device 12 may establish a communication connection in a wireless communication manner; in some scenarios, the terminal 11 and the computer device 12 may also establish a communication connection through a wired communication manner. In some embodiments, the terminal 11 may include, but is not limited to, a smart terminal device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted smart terminal, and a smart watch.
In the embodiment of the present invention, the terminal 11 may collect audio data and face picture data corresponding to the audio data, and send the collected audio data and face picture data to the computer device 12, and the computer device 12 may perform equalization processing on the obtained audio data to obtain target audio data, and perform preprocessing on the face picture data to obtain target lip data; inputting target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data; and inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data. Alternatively, the actions performed by the computer device 12 or the terminal 11 may be performed by the other party, that is, the lip language identification method may be performed by the terminal 11 side or the computer device 12 side alone.
The following describes schematically a lip language identification method provided by an embodiment of the present invention with reference to fig. 2.
Referring to fig. 2, fig. 2 is a schematic flowchart of a lip language identification method according to an embodiment of the present invention, where the lip language identification method according to the embodiment of the present invention may be executed by a lip language identification device, where the lip language identification device is disposed in a terminal or a computer device, and a specific explanation of the terminal or the computer device is as described above. Specifically, the method of the embodiment of the present invention includes the following steps.
S201: and acquiring audio data and face picture data corresponding to the audio data.
In the embodiment of the invention, the lip language recognition device can acquire the audio data and the face picture data corresponding to the audio data. In some embodiments, the audio data may include, but is not limited to, multiple languages and/or multiple timbres. In some embodiments, the face picture data may be unvoiced (silent) video data. In some embodiments, the face picture data corresponding to the audio data may indicate that the face picture data corresponds to time points of the audio data. In some embodiments, the frequency of the audio data is between 60 Hz and 1200 Hz. Because the frequency of the human voice lies between 60 Hz and 1200 Hz, selecting audio data within this interval avoids the situation that lip language data cannot be identified because the audio frequency falls outside the range of the human voice frequency, and facilitates the subsequent effective identification of lip language data from the audio data.
In one embodiment, when the lip language recognition device acquires audio data and face picture data corresponding to the audio data, it may collect voiced video data from a specified platform by using a sampling probability of an audio segment corresponding to each piece of audio training data calculated in a pre-training audio feature extraction model training process, and extract the audio data and the face picture data (i.e., unvoiced video data) from the voiced video data.
In an embodiment, when the lip language recognition device acquires audio data and face picture data corresponding to the audio data, the lip language recognition device may acquire the audio data from a specified platform by using a sampling probability of an audio clip corresponding to each audio training data calculated in a pre-trained audio feature extraction model training process, acquire time point information of the audio data after acquiring the audio data, and acquire the face picture data corresponding to the time point information according to the time point information of the audio data.
In one embodiment, when acquiring the face picture data corresponding to the time point information, a start time point and an end time point of the audio data may be acquired, and the face picture data corresponding to the start time point and the end time point may be acquired from a specified platform.
For example, assume that the sampling probability of the audio segment corresponding to each piece of audio training data, calculated in the pre-trained audio feature extraction model training process, is P, and that the specified platform is a spoken-language examination platform that includes audio data and face picture data of users' spoken-language examinations. The spoken-language audio data may then be acquired from the spoken-language examination platform according to the sampling probability P, where the start time point of the spoken-language audio data is t and the end time point is t + z (z is greater than 0), and the face picture data in the time period from t to t + z may be acquired from the spoken-language examination platform.
S202: and carrying out equalization processing on the audio data to obtain target audio data, and carrying out pretreatment on the face picture data to obtain target lip data.
In the embodiment of the invention, the lip language recognition device can perform equalization processing on the audio data to obtain target audio data, and performs preprocessing on the face picture data to obtain target lip data.
In one embodiment, when the lip language recognition device performs equalization processing on audio data to obtain target audio data, one or more audio segments can be extracted from the audio data, and each audio segment is analyzed to obtain a pitch distribution vector corresponding to each audio segment; and determining pitch data according to the pitch distribution vector corresponding to each audio fragment, and performing equalization processing on the pitch data to obtain equalized target audio data.
In one embodiment, when the lip language identification device performs the equalization processing on the pitch data, the equalization processing on the pitch data can be realized by combining the audio segments into a pitch distribution matrix. In one embodiment, assuming that m audio segments are extracted from the audio data, a pitch distribution matrix a with dimension m × n is formed, and in order to achieve pitch distribution equalization, weights corresponding to the m audio segments need to be solved, as shown in the following formula (1):
AᵀX = B,  s.t. Xᵢ > 0,  Bᵢ = sum(A)/m    (1)
where the dimension of A is m × n, the dimension of X is m × 1, the dimension of B is n × 1, and every element of the vector B is equal.
The weight X of each audio segment can be calculated through formula (1), and X can be used as the sampling probability of the audio segment corresponding to each piece of audio training data in the training process of the pre-trained audio feature extraction model, so that this sampling probability can be used for acquiring the audio data to be recognized in the subsequent lip language recognition process.
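The equalization step above amounts to solving the constrained linear system of formula (1) for the segment weights. The following is a minimal Python sketch of that solve, assuming NumPy/SciPy are available; the function and variable names are illustrative, and normalising the solution into sampling probabilities is an added assumption rather than part of the original description.

```python
import numpy as np
from scipy.optimize import lsq_linear

def solve_segment_weights(A):
    """A: (m, n) pitch distribution matrix (m audio segments, n pitch bins).

    Solves A^T X = B, s.t. X_i > 0, where every element of B equals sum(A)/m,
    following formula (1).
    """
    m, n = A.shape
    b = np.full(n, A.sum() / m)                      # B_i = sum(A) / m
    res = lsq_linear(A.T, b, bounds=(1e-8, np.inf))  # positivity constraint on X
    x = res.x
    return x / x.sum()   # normalising into sampling probabilities is an added assumption

# Example with random data: 4 segments, 8 pitch bins
A = np.abs(np.random.rand(4, 8))
print(solve_segment_weights(A))
```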
In one embodiment, when the lip language recognition device preprocesses the face picture data to obtain the target lip data, the lip language recognition device may perform frame interpolation on the face picture data to obtain the target face picture data, and extract the lip data from the target face picture data; and correcting the lip data to obtain target lip data.
In one embodiment, when the lip language recognition device performs frame interpolation on the face picture data, the frame interpolation may be performed according to a preset interpolation rate; in some embodiments, the interpolation may target a preset frame rate, for example, 50 frames per second.
In one embodiment, when extracting lip data from the target face picture data, the lip recognition device may extract lip calibration data in the face picture data from the target face picture data, the lip calibration data including a plurality of calibration points.
In one embodiment, when the lip language recognition device corrects lip data to obtain target lip data, the lip language recognition device may acquire center positions of a plurality of calibration points and translate the center positions of the plurality of calibration points to a specified position; and horizontally correcting the plurality of translated calibration points, and carrying out scaling processing on lips in the corrected face picture data according to the specified size to obtain target lip data with the specified size. In some embodiments, the lip speech recognition device may use an open-source MMPose tool to extract multiple, such as 20, landmark points in the face picture data. In some embodiments, the center position of the plurality of calibration points may be the center of symmetry (e.g., the center of a circle) of the plurality of calibration point connection patterns.
In one embodiment, when extracting lip calibration data from target face picture data, the lip language recognition device may extract a plurality of face calibration points from the target face picture data, and extract a plurality of calibration points corresponding to lips from the plurality of face calibration points.
Specifically, as shown in fig. 3, fig. 3 is a schematic diagram of a face landmark point provided in an embodiment of the present invention, and a lip language recognition device may extract landmark points corresponding to 49 to 68 from the face landmark point shown in fig. 3 as landmark points corresponding to lips. In some embodiments, the positions of the calibration points corresponding to lips of different mouth shapes are different, as shown in fig. 4, fig. 4 is a schematic diagram of calibration points of three lips provided by an embodiment of the present invention, where the three lips include three shapes, respectively: lips a, lips b and lips c.
In one embodiment, each calibration point includes two dimensions, X and Y, so the calibration point data is n-dimensional (for example, 40-dimensional for 20 points). Because the sizes and positions of the faces in different face picture data are not fixed, it is necessary to further perform horizontal correction on the translated calibration points and to scale the lips in the corrected face picture data to a specified size, so as to obtain lip data of the specified size. A specific implementation can be illustrated with 40-dimensional calibration data as an example: the 20 lip calibration points are translated so that their center position moves to the specified position; the left and right mouth-corner points (for example, points 49 and 55 in fig. 3) are connected; the included angle θ between this connecting line and the X axis is calculated; the mouth is rotated by θ in the opposite direction according to the included angle θ to correct it horizontally; and then the length of the connecting line when the mouth is closed is taken as the measurement of the mouth size, and the mouth sizes of all pictures are scaled to the specified size.
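The normalisation steps described above (translation of the calibration-point centre, horizontal correction by the mouth-corner angle θ, and scaling to a specified mouth size) can be sketched as follows. This is an illustrative Python sketch only: the mouth-corner indices, the specified position and the specified size are assumed parameters, and the patent's use of the closed-mouth frame as the size reference is represented here by an optional pre-computed reference width.

```python
import numpy as np

def normalize_lip_points(points, left_corner_idx=0, right_corner_idx=6,
                         target_center=(0.0, 0.0), target_width=1.0,
                         reference_width=None):
    """points: (20, 2) array of lip calibration points (X, Y) for one frame.

    Indices 0 and 6 are assumed to be the left/right mouth corners
    (points 49 and 55 of the face calibration scheme in fig. 3).
    """
    center = np.asarray(target_center, dtype=float)
    pts = np.asarray(points, dtype=float)

    # 1) translate the centre of the calibration points to the specified position
    pts = pts - pts.mean(axis=0) + center

    # 2) horizontal correction: rotate by -theta, where theta is the angle between
    #    the mouth-corner connecting line and the X axis
    dx, dy = pts[right_corner_idx] - pts[left_corner_idx]
    theta = np.arctan2(dy, dx)
    c, s = np.cos(-theta), np.sin(-theta)
    rot = np.array([[c, -s], [s, c]])
    pts = (pts - center) @ rot.T + center

    # 3) scale the mouth to the specified size; the patent measures the mouth size on
    #    the closed-mouth frame, so a pre-computed reference width may be passed in
    width = reference_width or np.linalg.norm(pts[right_corner_idx] - pts[left_corner_idx])
    return (pts - center) * (target_width / width) + center
```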
S203: and inputting the target audio data into the pre-trained audio feature extraction model to obtain the audio feature vector corresponding to the target audio data.
In the embodiment of the invention, the lip language recognition equipment can input the target audio data into the pre-trained audio feature extraction model to obtain the audio feature vector corresponding to the target audio data.
In one embodiment, when the lip language recognition device inputs target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data, corresponding spectral envelope features can be extracted from the target audio data, and dimension reduction processing is performed on the spectral envelope features; and inputting the spectral envelope characteristics obtained by the dimension reduction processing into an audio characteristic extraction model to obtain audio characteristic vectors corresponding to the target audio data.
In one embodiment, the lip language recognition device may use a specified tool (e.g., the World vocoder) to extract the spectral envelope features when extracting corresponding spectral envelope features from the target audio data. For example, when performing dimension reduction processing on the spectral envelope features, a 60-dimensional spectral envelope feature may be reduced to a 40-dimensional spectral envelope feature.
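A possible sketch of this step, assuming the open-source pyworld binding of the WORLD vocoder, is given below. The exact dimension-reduction method is not specified in the text, so WORLD's coded spectral envelope is used here purely as an illustrative choice.

```python
import numpy as np
import pyworld as pw

def extract_envelope_features(wave, fs=16000, coded_dims=40):
    """Extract a reduced-dimension spectral envelope feature sequence from a waveform."""
    wave = np.asarray(wave, dtype=np.float64)
    f0, t = pw.harvest(wave, fs)                           # pitch contour and frame times
    sp = pw.cheaptrick(wave, f0, t, fs)                    # spectral envelope, shape [T, fft_size/2 + 1]
    coded = pw.code_spectral_envelope(sp, fs, coded_dims)  # reduced envelope, shape [T, coded_dims]
    return coded.astype(np.float32)
```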
In one embodiment, before inputting the spectral envelope features obtained by the dimension reduction processing into the audio feature extraction model to obtain the audio feature vector corresponding to the target audio data, the lip language identification device may obtain a training data set, where the training data set includes a plurality of audio training data; carrying out equalization processing on the plurality of audio training data to obtain a plurality of target audio training data; inputting a plurality of target audio training data into a preset residual error neural network model for training to obtain a first loss function value; adjusting a first model parameter of the residual error neural network model according to the first loss function value, and inputting a plurality of target audio training data into the residual error neural network model after the first model parameter is adjusted for retraining; and when the first loss function value obtained by retraining meets a first preset threshold value, determining the obtained residual error neural network model as an audio feature extraction model.
In one embodiment, when the lip language identification device inputs target audio training data into a preset residual error neural network model for training, a World vocoder can be used to extract corresponding spectral envelope training features from the target audio training data, and the spectral envelope training features are input into the preset residual error neural network model for training.
Specifically, taking fig. 5 as an example, fig. 5 is a schematic diagram of an audio feature extraction model according to an embodiment of the present invention. As shown in fig. 5, the residual neural network model includes a linear layer (Linear), a leaky rectified linear unit (LeakyReLU) and a dilated convolution layer (Dilated Conv), and a basic module composed of LeakyReLU and Dilated Conv is repeated 5 times. The spectral envelope training features are changed from [T, 60] to [T, 128] through a first linear layer; the convolution kernel sizes of the dilated convolution layers in the five basic modules are all 3, the channel numbers are all 128, and the dilation coefficients are [2, 4, 8, 2, 4] respectively, which increases the receptive field of the model; finally, the audio feature training vectors are obtained through another linear layer, and the dimensions are changed from [T, 128] to [T, 64]. The first loss function value of the model is a contrastive loss (Contrastive Loss) function value, whose basic idea is that the smaller the distance between audio feature vectors obtained from the same phoneme (i.e., the audio feature training vectors in the corresponding model training process), the better, and the larger the distance between audio feature vectors obtained from different phonemes, the better. The first loss function value is calculated as shown in the following formula (2).
(Formula (2) is presented as an image in the original publication.)
where Zᵢ denotes the audio feature vector of the i-th frame, and sim(x, y) denotes the cosine similarity distance between x and y. Through training, the similarity between adjacent audio feature vectors finally becomes high, that is, the audio features of the same phoneme have a clustering effect, while the audio features of different phonemes have a higher degree of discrimination.
The audio feature extraction model trained in the way is helpful for extracting feature vectors with stronger robustness.
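For illustration, the audio feature extraction model described above can be sketched in PyTorch as follows: a linear layer mapping [T, 60] to [T, 128], five basic modules of LeakyReLU plus dilated 1-D convolution (kernel size 3, 128 channels, dilation coefficients 2, 4, 8, 2 and 4), and a final linear layer mapping [T, 128] to [T, 64]. The residual connections and padding choices are assumptions where the text is silent.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.act = nn.LeakyReLU()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)  # keeps sequence length

    def forward(self, x):                      # x: [B, C, T]
        return x + self.conv(self.act(x))      # residual connection (assumed)

class AudioFeatureExtractor(nn.Module):
    def __init__(self, in_dim=60, hidden=128, out_dim=64, dilations=(2, 4, 8, 2, 4)):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, hidden)
        self.blocks = nn.Sequential(*[DilatedBlock(hidden, d) for d in dilations])
        self.proj_out = nn.Linear(hidden, out_dim)

    def forward(self, sp):                     # sp: [B, T, 60] spectral envelope features
        x = self.proj_in(sp).transpose(1, 2)   # [B, 128, T]
        x = self.blocks(x).transpose(1, 2)     # [B, T, 128]
        return self.proj_out(x)                # [B, T, 64] audio feature vectors

# quick shape check
z = AudioFeatureExtractor()(torch.randn(2, 100, 60))
print(z.shape)  # torch.Size([2, 100, 64])
```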
S204: and inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data.
In the embodiment of the invention, the lip language recognition device can input the audio characteristic vector and the target lip data into a pre-trained lip language recognition model to obtain the lip language data corresponding to the audio data and the target lip data, wherein the lip language data comprises the position data of the lips on each frame of image in the face picture data. In some embodiments, the lip language identification model includes a convolutional neural network model and a bi-directional recurrent neural network model.
In some embodiments, the lip language data includes position data and motion data, the position data including the position data (e.g., position coordinates) of each calibration point on the lips, and the motion data including the motion data (e.g., motion direction, motion distance, etc.) of each calibration point on the lips. In some embodiments, the motion data may be determined according to the position data of each calibration point in the multi-frame lip images; optionally, the motion direction and the motion distance of the lips may be determined according to the position coordinates of each calibration point in the multi-frame lip images.
Taking fig. 4 as an example, assuming that the starting time point of the audio data is t, the ending time point is t + z, z is greater than 0, and the face picture data is face picture data in a time period from t to t + z, position data and motion data of the lips a in fig. 4 are obtained through a pre-trained lip recognition model, where the position data includes position coordinates of each landmark point on each frame of lip image corresponding to the lips a, and the motion data is determined according to the position coordinates of each landmark point on each frame of lip image, where the motion data includes a motion direction and a motion distance of each landmark point on the lips a.
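As described above, the motion data can be derived from the per-frame position data of the calibration points. A small illustrative Python sketch (function and variable names are hypothetical):

```python
import numpy as np

def motion_from_positions(positions):
    """positions: [T, 20, 2] lip calibration point coordinates per frame."""
    deltas = np.diff(positions, axis=0)                        # [T-1, 20, 2] frame-to-frame motion vectors
    distances = np.linalg.norm(deltas, axis=-1)                # [T-1, 20] motion distance per point
    directions = np.arctan2(deltas[..., 1], deltas[..., 0])    # [T-1, 20] motion direction (radians)
    return deltas, distances, directions
```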
In one embodiment, the lip language recognition model comprises a convolutional neural network model and a bidirectional recurrent neural network model; when the lip language identification device inputs the audio feature vector and the target lip data into a pre-trained lip language identification model to obtain lip language data corresponding to the audio data and the target lip data, the lip language identification device can extract the target lip feature vector from the target lip data; inputting the audio feature vector and the target lip feature vector into the convolutional neural network model for dimensionality reduction to obtain a dimensionality reduction feature vector; and inputting the dimensionality reduction feature vector into the bidirectional circulation neural network model, and predicting to obtain lip language data corresponding to the audio data and the target lip data. In some embodiments, the target lip feature vector may be extracted by using a preset feature extraction algorithm, and the extraction method of the target lip feature vector is not specifically limited in the present invention.
In one embodiment, when the lip language recognition device inputs the audio feature vector and the target lip feature vector into the convolutional neural network model for dimension reduction processing to obtain a dimension reduction feature vector, the lip language recognition device may input the audio feature vector and the target lip feature vector into a plurality of convolution modules of the convolutional neural network model to obtain a plurality of convolution calculation results; splicing the plurality of convolution calculation results to obtain a splicing characteristic vector; and inputting the spliced feature vector into a linear layer and a linear rectification unit of the convolutional neural network model to perform dimension reduction processing to obtain the dimension reduction feature vector.
In one embodiment, after inputting the audio feature vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, the lip recognition device may drive a preset visualization engine according to the lip data, so that the visualization engine visualizes the lip data based on a virtual user to obtain a plurality of virtual lip images; and determining a virtual lip language animation video according to the plurality of virtual lip language images.
According to the embodiment of the invention, lip language data is obtained through calculation, so that the audio data and the lip language data can be rendered accurately and effectively to generate vivid lip language synthetic data, and lip language visualization can be realized more effectively.
In one embodiment, before inputting the audio feature vector and the target lip data into the pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, the lip recognition device may input the training data set into the audio feature extraction model to obtain an audio feature training vector set corresponding to the training data set; inputting the audio feature training vector set and the lip training data into a preset convolutional neural network model to obtain a first feature vector; inputting the first feature vector into a preset bidirectional recurrent neural network model for training to obtain a second loss function value; adjusting a second model parameter of the convolutional neural network model and a third model parameter of the bidirectional recurrent neural network model according to the second loss function value; inputting the audio characteristic training vector set into the convolutional neural network model after the second model parameter is adjusted and the bidirectional cyclic neural network model after the third model parameter is adjusted for retraining; and when the second loss function value obtained by retraining meets a second preset threshold value, determining that the obtained convolutional neural network model and the bidirectional recurrent neural network model form a lip language recognition model.
In one embodiment, when the lip language recognition device inputs the audio feature training vector set and the lip training data into a preset convolutional neural network model to obtain a first feature vector, the lip language recognition device may input the audio feature training vector set and the lip training data into a plurality of convolution modules in the convolutional neural network model to obtain a plurality of convolution calculation results; splicing the convolution calculation results to obtain a second feature vector; and inputting the second feature vector into a plurality of linear modules in the convolutional neural network model for dimension reduction processing to obtain a first feature vector.
In one embodiment, when the set of audio feature training vectors and the lip training data are input to the plurality of convolution modules in the convolutional neural network model, the lip feature training vectors corresponding to the lip training data may be extracted, and the audio feature training vectors and the lip feature training vectors may be input to the plurality of convolution modules in the convolutional neural network model. In some embodiments, the lip feature training vector may be extracted by using a preset feature extraction algorithm, and the extraction method of the lip feature training vector is not specifically limited. In some embodiments, the dimensions of the audio feature training vector and the dimensions of the lip feature training vector may be the same.
Specifically, taking fig. 6 as an example, fig. 6 is a schematic diagram of a lip language recognition model provided in an embodiment of the present invention. Assuming that the dimensions of the audio feature training vector extracted by the audio feature extraction model and of the lip feature training vector are both [T, 64], the audio feature training vector and the lip feature training vector are input into a plurality of different convolution modules (convolution kernel sizes of 1, 3, 5, 7 and 9, with 64 channels each) in the preset convolutional neural network model; the convolution calculation results are then concatenated to obtain a feature vector of [T, 64 × 5]; a linear layer (Linear) and a linear rectification unit (ReLU) in the preset convolutional neural network model reduce the dimensions back to [T, 64]; and finally a bi-directional recurrent neural network (Bi-GRU) model and a linear layer (Linear) produce a result of [T, 40] as the prediction result of the model.
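An illustrative PyTorch sketch of the lip language recognition model of fig. 6 is given below: parallel convolution branches with kernel sizes 1, 3, 5, 7 and 9 (64 channels each), concatenation to [T, 64 × 5], a Linear + ReLU reduction back to [T, 64], a Bi-GRU, and a final linear layer producing the [T, 40] prediction. How the audio and lip feature vectors are fused before the branches, and the GRU hidden size, are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class LipReadingModel(nn.Module):
    def __init__(self, feat_dim=64, kernels=(1, 3, 5, 7, 9), gru_hidden=64, out_dim=40):
        super().__init__()
        in_dim = feat_dim * 2                       # concatenated audio + lip features (assumption)
        self.branches = nn.ModuleList([
            nn.Conv1d(in_dim, feat_dim, kernel_size=k, padding=k // 2) for k in kernels
        ])
        self.reduce = nn.Sequential(nn.Linear(feat_dim * len(kernels), feat_dim), nn.ReLU())
        self.gru = nn.GRU(feat_dim, gru_hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(gru_hidden * 2, out_dim)

    def forward(self, audio_feat, lip_feat):        # both: [B, T, 64]
        x = torch.cat([audio_feat, lip_feat], dim=-1).transpose(1, 2)   # [B, 128, T]
        x = torch.cat([b(x) for b in self.branches], dim=1)             # [B, 320, T]
        x = self.reduce(x.transpose(1, 2))                              # [B, T, 64]
        x, _ = self.gru(x)                                              # [B, T, 128]
        return self.head(x)                                             # [B, T, 40] predicted lip data

pred = LipReadingModel()(torch.randn(2, 100, 64), torch.randn(2, 100, 64))
print(pred.shape)  # torch.Size([2, 100, 40])
```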
In order to prevent the calculation results of the lip language recognition model from all falling around the mean value, the present invention takes the errors of both the position result and the motion result (namely the difference result between adjacent frames) as part of the calculation of the second loss function value, and the calculation formula is shown as the following formula (3):
(Formula (3) is presented as an image in the original publication.)
where yₜ denotes the target output of the t-th frame, ŷₜ denotes the predicted output of the t-th frame, and w₁ and w₂ denote the weights of the position data and the motion data, respectively. The optimizer used by the lip language recognition model is Adam, and the learning rate is 0.001. When the second loss function value converges, the model training is finished and the lip language recognition model is obtained. In the process of identifying lip language by using the lip language recognition model, the corresponding lip language data can be obtained simply by inputting any segment of data that includes audio data and face picture data, such as speech or singing voice.
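The second loss function and training setup described above can be sketched as follows; since formula (3) is only available as an image, the use of mean-squared error for the position and motion terms, and the default values of w₁ and w₂, are assumptions.

```python
import torch
import torch.nn.functional as F

def lip_loss(pred, target, w1=1.0, w2=1.0):
    """pred, target: [B, T, 40] predicted / ground-truth lip data."""
    position_err = F.mse_loss(pred, target)
    # motion of a frame = difference from the previous frame
    motion_err = F.mse_loss(pred[:, 1:] - pred[:, :-1],
                            target[:, 1:] - target[:, :-1])
    return w1 * position_err + w2 * motion_err

# Optimizer per the text: Adam with learning rate 0.001
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```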
In this way, the limitation of language and timbre can be overcome, and more accurate lip language data (namely, including position data and motion data) can be predicted and used for driving a graphics rendering engine (namely, a visualization engine), so as to facilitate the visualization of the lip language data.
According to the embodiment of the invention, the audio feature extraction model is obtained by training on a large amount of audio training data, so that the workload and the complexity of model training are reduced, and more accurate audio feature vectors can be extracted by the audio feature extraction model; and the lip language recognition model is established through the audio feature training vectors obtained by the audio feature extraction model and the lip language labelling data, so that the lip language recognition model can effectively, accurately and flexibly recognize the lip language data corresponding to audio data of any language and/or any timbre and the corresponding face picture data.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a lip language identification device according to an embodiment of the present invention. Specifically, the lip language identification device is disposed in a computer device, and the device includes: an acquisition unit 701, a processing unit 702, an extraction unit 703, and a recognition unit 704;
an obtaining unit 701, configured to obtain audio data and face picture data corresponding to the audio data;
a processing unit 702, configured to perform equalization processing on the audio data to obtain target audio data, and perform preprocessing on the face picture data to obtain target lip data;
the extracting unit 703 is configured to input the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data;
and the recognition unit 704 is configured to input the audio feature vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, where the lip data includes position data of a lip on each frame of image in the face picture data.
Further, when the processing unit 702 performs equalization processing on the audio data to obtain target audio data, the processing unit is specifically configured to:
extracting one or more audio segments from the audio data, and analyzing each audio segment to obtain a pitch distribution vector corresponding to each audio segment;
and determining pitch data according to the pitch distribution vector corresponding to each audio fragment, and performing equalization processing on the pitch data to obtain equalized target audio data.
Further, the processing unit 702 is configured to, when performing preprocessing on the face picture data to obtain target lip data, specifically:
performing frame interpolation processing on the face picture data to obtain target face picture data, and extracting lip data from the target face picture data;
and correcting the lip data to obtain target lip data.
Further, when the processing unit 702 extracts lip data from the target face picture data, it is specifically configured to:
extracting lip calibration data in the target face picture data from the target face picture data, wherein the lip calibration data comprises a plurality of calibration points;
the processing unit 702 performs correction processing on the lip data to obtain target lip data, and is specifically configured to:
acquiring the central positions of a plurality of calibration points, and translating the central positions of the calibration points to a specified position;
and horizontally correcting the plurality of translated calibration points, and carrying out scaling processing on lips in the corrected face picture data according to the specified size to obtain target lip data with the specified size.
Further, when the extracting unit 703 inputs the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data, the extracting unit is specifically configured to:
extracting corresponding spectral envelope characteristics from the target audio data, and performing dimensionality reduction on the spectral envelope characteristics;
and inputting the spectral envelope characteristics obtained by the dimension reduction processing into the audio characteristic extraction model to obtain the audio characteristic vector corresponding to the target audio data.
Further, the lip language identification model comprises a convolution neural network model and a bidirectional cyclic neural network model; the recognition unit 704 inputs the audio feature vector and the target lip data into a pre-trained lip recognition model, and when lip data corresponding to the audio data and the target lip data are obtained, the recognition unit is configured to:
extracting a target lip feature vector from the target lip data;
inputting the audio feature vector and the target lip feature vector into the convolutional neural network model for dimension reduction processing to obtain a dimension reduction feature vector;
and inputting the dimensionality reduction feature vector into the bidirectional recurrent neural network model, and predicting to obtain lip language data corresponding to the audio data and the target lip data.
Further, the identifying unit 704 inputs the audio feature vector and the target lip feature vector into the convolutional neural network model for dimension reduction processing, and when obtaining a dimension reduction feature vector, is specifically configured to:
inputting the audio feature vectors and the target lip feature vectors into a plurality of convolution modules of the convolution neural network model to obtain a plurality of convolution calculation results;
splicing the plurality of convolution calculation results to obtain a splicing characteristic vector;
and inputting the spliced feature vector into a linear layer and a linear rectification unit of the convolutional neural network model for dimensionality reduction to obtain the dimensionality reduction feature vector.
Further, after the recognition unit 704 inputs the audio feature vector and the target lip data into a pre-trained lip language recognition model and obtains lip language data corresponding to the audio data and the target lip data, the recognition unit is further configured to:
driving a preset visualization engine according to the lip language data, so that the visualization engine visualizes the lip language data based on a virtual user to obtain a plurality of virtual lip language images;
and determining a virtual lip language animation video according to the plurality of virtual lip language images.
The embodiment of the invention obtains the audio data and the face picture data corresponding to the audio data; carrying out equalization processing on the audio data to obtain target audio data, and carrying out preprocessing on the face picture data to obtain target lip data; inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data; the audio feature vector and the target lip data are input into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, and the effectiveness, flexibility and accuracy of lip recognition can be improved.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention. Specifically, the computer device includes: memory 801, processor 802.
In one embodiment, the computer device further comprises a data interface 803, the data interface 803 being used to transfer data information between the computer device and other devices.
The memory 801 may include a volatile memory (volatile memory); the memory 801 may also include a non-volatile memory (non-volatile memory); the memory 801 may also comprise a combination of memories of the kind described above. The processor 802 may be a Central Processing Unit (CPU). The processor 802 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), or any combination thereof.
The memory 801 is used for storing programs, and the processor 802 can call the programs stored in the memory 801 for executing the following steps:
acquiring audio data and face picture data corresponding to the audio data;
carrying out equalization processing on the audio data to obtain target audio data, and carrying out preprocessing on the face picture data to obtain target lip data;
inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data;
and inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, wherein the lip data comprises position data of lips on each frame of image in the face picture data.
Further, when the processor 802 performs equalization processing on the audio data to obtain target audio data, the method is specifically configured to:
extracting one or more audio segments from the audio data, and analyzing each audio segment to obtain a pitch distribution vector corresponding to each audio segment;
and determining pitch data according to the pitch distribution vector corresponding to each audio fragment, and performing equalization processing on the pitch data to obtain equalized target audio data.
Further, the processor 802 performs preprocessing on the face picture data to obtain target lip data, and is specifically configured to:
performing frame interpolation processing on the face picture data to obtain target face picture data, and extracting lip data from the target face picture data;
and correcting the lip data to obtain target lip data.
Further, when the processor 802 extracts lip data from the target face picture data, it is specifically configured to:
extracting lip calibration data in the target face picture data from the target face picture data, wherein the lip calibration data comprises a plurality of calibration points;
the processor 802 performs correction processing on the lip data to obtain target lip data, and is specifically configured to:
acquiring the central positions of a plurality of calibration points, and translating the central positions of the calibration points to a specified position;
and horizontally correcting the plurality of translated calibration points, and carrying out scaling processing on lips in the corrected human face picture data according to the specified size to obtain target lip data of the specified size.
Further, when the processor 802 inputs the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data, it is specifically configured to:
extracting corresponding spectral envelope characteristics from the target audio data, and performing dimension reduction processing on the spectral envelope characteristics;
and inputting the spectral envelope characteristics obtained by the dimension reduction processing into the audio characteristic extraction model to obtain the audio characteristic vector corresponding to the target audio data.
Further, the lip language identification model comprises a convolution neural network model and a bidirectional cyclic neural network model; the processor 802 inputs the audio feature vector and the target lip data into a pre-trained lip recognition model, and when lip data corresponding to the audio data and the target lip data is obtained, the processor is specifically configured to:
extracting a target lip feature vector from the target lip data;
inputting the audio feature vector and the target lip feature vector into the convolutional neural network model for dimension reduction processing to obtain a dimension reduction feature vector;
and inputting the dimensionality reduction feature vector into the bidirectional recurrent neural network model, and predicting to obtain lip language data corresponding to the audio data and the target lip data.
Further, when the processor 802 inputs the audio feature vector and the target lip feature vector into the convolutional neural network model for dimension reduction processing to obtain a dimension reduction feature vector, the processor is specifically configured to:
inputting the audio feature vectors and the target lip feature vectors into a plurality of convolution modules of the convolution neural network model to obtain a plurality of convolution calculation results;
splicing the plurality of convolution calculation results to obtain a splicing characteristic vector;
and inputting the spliced feature vector into a linear layer and a linear rectification unit of the convolutional neural network model for dimensionality reduction to obtain the dimensionality reduction feature vector.
Further, after the processor 802 inputs the audio feature vector and the target lip data into a pre-trained lip recognition model and obtains lip data corresponding to the audio data and the target lip data, the processor is further configured to:
driving a preset visualization engine according to the lip language data, so that the visualization engine visualizes the lip language data based on a virtual user to obtain a plurality of virtual lip language images;
and determining a virtual lip language animation video according to the plurality of virtual lip language images.
The embodiment of the invention obtains the audio data and the face picture data corresponding to the audio data; carrying out equalization processing on the audio data to obtain target audio data, and carrying out preprocessing on the face picture data to obtain target lip data; inputting target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data; the audio feature vector and the target lip data are input into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data, and the effectiveness, flexibility and accuracy of lip recognition can be improved.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method described in the embodiment corresponding to fig. 2 of the present invention is implemented, and the apparatus according to the embodiment corresponding to the present invention described in fig. 7 may also be implemented, which is not described herein again.
The computer readable storage medium may be an internal storage unit of the device according to any of the foregoing embodiments, for example, a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may also include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A lip language identification method is characterized by comprising the following steps:
acquiring audio data and face picture data corresponding to the audio data;
carrying out equalization processing on the audio data to obtain target audio data, and carrying out pretreatment on the face picture data to obtain target lip data;
inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data;
and inputting the audio characteristic vector and the target lip data into a pre-trained lip recognition model to obtain lip data corresponding to the audio data and the target lip data.
2. The method of claim 1, wherein the equalizing the audio data to obtain target audio data comprises:
extracting one or more audio segments from the audio data, and analyzing each audio segment to obtain a pitch distribution vector corresponding to each audio segment;
and determining pitch data according to the pitch distribution vector corresponding to each audio fragment, and performing equalization processing on the pitch data to obtain equalized target audio data.
3. The method according to claim 1, wherein the preprocessing the face picture data to obtain target lip data comprises:
performing frame interpolation processing on the face picture data to obtain target face picture data, and extracting lip data from the target face picture data;
and correcting the lip data to obtain target lip data.
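For illustration only: one simple way to realize the frame interpolation of claim 3 is linear blending between neighbouring face pictures so that the video length matches the audio frame count. The (T, H, W, C) tensor layout is an assumption of this sketch.

import numpy as np

def interpolate_frames(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly interpolate a (T, H, W, C) stack of face pictures to target_len frames."""
    src = np.linspace(0.0, len(frames) - 1, num=target_len)   # fractional source indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, len(frames) - 1)
    w = (src - lo)[:, None, None, None]                       # blending weights
    return (1.0 - w) * frames[lo] + w * frames[hi]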
4. The method according to claim 3, wherein the extracting lip data from the target face picture data comprises:
extracting lip calibration data from the target face picture data, wherein the lip calibration data comprises a plurality of calibration points;
the correcting the lip data to obtain target lip data includes:
acquiring the central position of the plurality of calibration points, and translating the central position of the plurality of calibration points to a specified position;
and horizontally correcting the plurality of translated calibration points, and scaling the lips in the corrected face picture data according to a specified size to obtain the target lip data of the specified size.
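A minimal sketch of the correction described in claims 3-4, assuming OpenCV is available and that two of the calibration points are the mouth corners; the corner indices, crop margin and output size are illustrative choices, not values taken from the disclosure.

import numpy as np
import cv2

def align_lips(face_img: np.ndarray, lip_points: np.ndarray, out_size: int = 96) -> np.ndarray:
    """Translate the lip center to a specified position, level the mouth, then crop and scale."""
    center = lip_points.mean(axis=0)                      # central position of the calibration points
    left, right = lip_points[0], lip_points[6]            # assumed mouth-corner landmarks
    angle = np.degrees(np.arctan2(right[1] - left[1], right[0] - left[0]))
    M = cv2.getRotationMatrix2D((float(center[0]), float(center[1])), angle, 1.0)
    h, w = face_img.shape[:2]
    M[:, 2] += np.array([w / 2.0, h / 2.0]) - center      # move lip center to the image center
    warped = cv2.warpAffine(face_img, M, (w, h))
    half = max(int(1.5 * np.linalg.norm(right - left)) // 2, out_size // 2)
    cy, cx = h // 2, w // 2
    crop = warped[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(crop, (out_size, out_size))         # target lip data of the specified size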
5. The method of claim 1, wherein the inputting the target audio data into a pre-trained audio feature extraction model to obtain an audio feature vector corresponding to the target audio data comprises:
extracting corresponding spectral envelope features from the target audio data, and performing dimensionality reduction on the spectral envelope features;
and inputting the spectral envelope features obtained by the dimensionality reduction into the audio feature extraction model to obtain the audio feature vector corresponding to the target audio data.
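Purely as a sketch of claim 5: a cepstral-liftering approximation of the spectral envelope followed by an SVD (PCA-style) dimensionality reduction. The frame layout, FFT size, lifter cutoff and number of retained dimensions are assumptions of this example.

import numpy as np

def spectral_envelope(frames: np.ndarray, n_fft: int = 512, lifter: int = 30) -> np.ndarray:
    """Rough per-frame spectral envelope: log-magnitude spectrum smoothed by cepstral liftering."""
    window = np.hanning(frames.shape[-1])
    spec = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=-1)) + 1e-8
    cep = np.fft.irfft(np.log(spec), axis=-1)     # real cepstrum
    cep[:, lifter:-lifter] = 0.0                  # keep only low-quefrency coefficients
    return np.fft.rfft(cep, axis=-1).real         # smoothed log spectrum (the envelope features)

def reduce_dims(features: np.ndarray, k: int = 40) -> np.ndarray:
    """PCA-style dimensionality reduction via SVD on mean-centred features."""
    centred = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T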
6. The method of claim 5, wherein the lip language recognition model comprises a convolutional neural network model and a bidirectional recurrent neural network model; the inputting the audio feature vector and the target lip data into a pre-trained lip language recognition model to obtain lip language data corresponding to the audio data and the target lip data comprises:
extracting a target lip feature vector from the target lip data;
inputting the audio feature vector and the target lip feature vector into the convolutional neural network model for dimensionality reduction to obtain a dimension-reduced feature vector;
and inputting the dimension-reduced feature vector into the bidirectional recurrent neural network model, and predicting lip language data corresponding to the audio data and the target lip data.
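A minimal PyTorch sketch of the two-stage structure named in claim 6, with a 1-D convolution standing in for the convolutional neural network model and a bidirectional GRU standing in for the bidirectional recurrent neural network model; all layer sizes and the choice of GRU are assumptions of this example.

import torch
import torch.nn as nn

class LipLanguageRecognizer(nn.Module):
    """Fuse audio and lip feature vectors with a CNN, then predict with a bidirectional RNN."""

    def __init__(self, audio_dim: int = 40, lip_dim: int = 256,
                 hidden: int = 256, vocab: int = 1000):
        super().__init__()
        self.fuse = nn.Sequential(                       # convolutional dimension reduction
            nn.Conv1d(audio_dim + lip_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab)   # per-frame lip language logits

    def forward(self, audio_feats: torch.Tensor, lip_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, audio_dim); lip_feats: (B, T, lip_dim)
        x = torch.cat([audio_feats, lip_feats], dim=-1)
        x = self.fuse(x.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        x, _ = self.rnn(x)                                # bidirectional temporal context
        return self.classifier(x)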
7. The method according to claim 6, wherein the inputting the audio feature vector and the target lip feature vector into the convolutional neural network model for dimensionality reduction to obtain the dimension-reduced feature vector comprises:
inputting the audio feature vector and the target lip feature vector into a plurality of convolution modules of the convolutional neural network model to obtain a plurality of convolution calculation results;
concatenating the plurality of convolution calculation results to obtain a concatenated feature vector;
and inputting the concatenated feature vector into a linear layer and a rectified linear unit (ReLU) of the convolutional neural network model for dimensionality reduction to obtain the dimension-reduced feature vector.
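Continuing the same illustrative PyTorch style, a sketch of claim 7: several parallel convolution modules (here, different kernel sizes), concatenation of their outputs, and a linear layer followed by a rectified linear unit for the dimensionality reduction. Kernel sizes and channel counts are assumptions of this example.

import torch
import torch.nn as nn

class MultiBranchReducer(nn.Module):
    """Several convolution modules in parallel, concatenation, then linear + ReLU reduction."""

    def __init__(self, in_dim: int = 296, branch_dim: int = 128,
                 out_dim: int = 256, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, branch_dim, k, padding=k // 2) for k in kernel_sizes]
        )
        self.reduce = nn.Sequential(
            nn.Linear(branch_dim * len(kernel_sizes), out_dim),
            nn.ReLU(),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (B, T, in_dim) -- concatenated audio and lip feature vectors
        x = fused.transpose(1, 2)                      # (B, in_dim, T) for Conv1d
        outs = [branch(x) for branch in self.branches] # a plurality of convolution results
        concat = torch.cat(outs, dim=1)                # concatenated feature vector
        return self.reduce(concat.transpose(1, 2))     # dimension-reduced feature vector (B, T, out_dim)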
8. The method of claim 1, wherein after the audio feature vector and the target lip data are input into the pre-trained lip language recognition model and the lip language data corresponding to the audio data and the target lip data is obtained, the method further comprises:
driving a preset visualization engine according to the lip language data, so that the visualization engine visualizes the lip language data based on a virtual user to obtain a plurality of virtual lip language images;
and determining a virtual lip language animation video according to the plurality of virtual lip language images.
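The rendering of virtual lip language images by a visualization engine is outside the scope of a short example, but once such images exist they can be assembled into the animation video of claim 8; the sketch below assumes OpenCV and BGR uint8 frames.

import numpy as np
import cv2

def frames_to_video(frames: list, path: str, fps: float = 25.0) -> None:
    """Assemble rendered virtual lip language images into an animation video file."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)        # each frame is expected to be a BGR uint8 image of size (h, w)
    writer.release()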
9. A computer device comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program, the computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1-8.
10. A computer-readable storage medium storing program instructions which, when executed, implement the method according to any one of claims 1-8.
CN202210913645.9A 2022-07-29 2022-07-29 Lip language identification method, computer device and storage medium Pending CN115132201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210913645.9A CN115132201A (en) 2022-07-29 2022-07-29 Lip language identification method, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210913645.9A CN115132201A (en) 2022-07-29 2022-07-29 Lip language identification method, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN115132201A true CN115132201A (en) 2022-09-30

Family

ID=83385702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210913645.9A Pending CN115132201A (en) 2022-07-29 2022-07-29 Lip language identification method, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN115132201A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934930A (en) * 2023-07-18 2023-10-24 杭州一知智能科技有限公司 Multilingual lip data generation method and system based on virtual 2d digital person
CN116631380A (en) * 2023-07-24 2023-08-22 之江实验室 Method and device for waking up audio and video multi-mode keywords
CN116631380B (en) * 2023-07-24 2023-11-07 之江实验室 Method and device for waking up audio and video multi-mode keywords

Similar Documents

Publication Publication Date Title
CN113343707B (en) Scene text recognition method based on robustness characterization learning
CN115132201A (en) Lip language identification method, computer device and storage medium
CN112633290A (en) Text recognition method, electronic device and computer readable medium
WO2020253051A1 (en) Lip language recognition method and apparatus
CN111160533A (en) Neural network acceleration method based on cross-resolution knowledge distillation
CN111259940A (en) Target detection method based on space attention map
CN111428727B (en) Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN111259785B (en) Lip language identification method based on time offset residual error network
CN110619334B (en) Portrait segmentation method based on deep learning, architecture and related device
US20190266443A1 (en) Text image processing using stroke-aware max-min pooling for ocr system employing artificial neural network
CN112070114A (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN112861842A (en) Case text recognition method based on OCR and electronic equipment
CN114550239A (en) Video generation method and device, storage medium and terminal
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114022882A (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN113469148B (en) Text erasing method, model training method, device and storage medium
CN111178363A (en) Character recognition method and device, electronic equipment and readable storage medium
CN110909578A (en) Low-resolution image recognition method and device and storage medium
CN111414959B (en) Image recognition method, device, computer readable medium and electronic equipment
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN116071472B (en) Image generation method and device, computer readable storage medium and terminal
CN113689527A (en) Training method of face conversion model and face image conversion method
CN116844573A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination