WO2020232867A1 - Lip language recognition method and apparatus, computer device, and storage medium

Lip language recognition method and apparatus, computer device, and storage medium

Info

Publication number
WO2020232867A1
Authority
WO
WIPO (PCT)
Prior art keywords: lip, training, model, text, image
Prior art date
Application number
PCT/CN2019/102569
Other languages
English (en)
French (fr)
Inventor
王义文
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date: 2019-05-21 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020232867A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G06V40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/24: Speech recognition using non-acoustical features
    • G10L15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • This application relates to a lip language recognition method and apparatus, a computer device, and a storage medium.
  • In recent years, lip language recognition has found good applications in public-security-related fields such as intelligent human-computer interaction, recovery of damaged audio, video surveillance, and military and criminal-investigation security, and has become a research hotspot in the industry. It also has important practical significance for people with hearing impairment and aphasia.
  • The inventor of this application found that traditional lip language recognition techniques mostly rely on classical algorithms such as HMM (Hidden Markov Model) and texture-feature LBP (Local Binary Pattern), or on deep-learning methods such as convolutional neural networks. Their input is a single frame, so the semantic information of the frames before and after it is ignored: only spatial-channel features are captured, not temporal-channel features. The recognized sentences therefore lack contextual coherence, and the lip language corresponding to a video stream cannot be recognized accurately.
  • The embodiments of the present application provide a lip language recognition method and apparatus, a computer device, and a storage medium, to solve the problem that the prior art cannot accurately recognize the lip language corresponding to a video stream.
  • A lip language recognition method includes:
  • calculating the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, using the lip image recognition result as the lip language recognition result of the original video.
  • A lip language recognition apparatus includes:
  • an original video processing module, configured to obtain an original video, standardize the frame rate of the original video, and obtain a standard video;
  • a standard video processing module, configured to separate the standard video to obtain a valid audio stream and a valid video stream;
  • a frame video acquisition module, configured to use a face recognition algorithm to track the face in the valid video stream and extract the mouth area of the face to obtain a frame lip motion video;
  • a frame video processing module, configured to process the frame lip motion video to obtain a lip image sequence;
  • an image sequence segmentation module, configured to segment the lip image sequence using a sequence segmentation rule to obtain segmented image sequences;
  • a first model recognition module, configured to input the segmented image sequences corresponding to the lip image sequence into a lip image recognition model in turn for recognition and obtain a lip image recognition result;
  • a second model recognition module, configured to input the valid audio stream into a speech recognition model to obtain a speech recognition result; and
  • a result verification module, configured to calculate the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, use the lip image recognition result as the lip language recognition result of the original video.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • calculating the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, using the lip image recognition result as the lip language recognition result of the original video.
  • One or more readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • calculating the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, using the lip image recognition result as the lip language recognition result of the original video.
  • Fig. 1 is a diagram of an application scenario of a lip language recognition method in an embodiment of the present application;
  • Fig. 2 is a flowchart of the lip language recognition method in an embodiment of the present application;
  • Fig. 3 is a detailed flowchart of step S60 in Fig. 2;
  • Fig. 4 is another flowchart of the lip language recognition method in an embodiment of the present application;
  • Fig. 5 is another flowchart of the lip language recognition method in an embodiment of the present application;
  • Fig. 6 is a detailed flowchart of step S705 in Fig. 5;
  • Fig. 7 is a detailed flowchart of step S7052 in Fig. 6;
  • Fig. 8 is a schematic diagram of a lip language recognition apparatus in an embodiment of the present application;
  • Fig. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • The lip language recognition method provided by the present application can be applied in the application environment shown in Fig. 1, in which a terminal device communicates with a server through a network.
  • The terminal device includes, but is not limited to, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • The server can be implemented as an independent server or as a server cluster composed of multiple servers.
  • In an embodiment, as shown in Fig. 2, a lip language recognition method is provided. Taking its application to the server in Fig. 1 as an example for description, the method includes the following steps:
  • S10: Obtain an original video, standardize its frame rate, and obtain a standard video.
  • The original video refers to video captured by a video device. Because different video devices capture video at different frame rates, videos of different frame rates need to be uniformly processed into standard videos at a standard frame rate to facilitate subsequent model recognition.
  • The standard frame rate is a preset frame rate that meets the requirements, such as 30 frames per second. A standard video is a video whose frame rate has been converted from the original frame rate to the standard frame rate.
  • Specifically, a frame rate processing script is used to standardize the frame rate of the original video, so that original videos with different frame rates are converted into videos at the standard frame rate, that is, standard videos. The frame rate processing script is a script written by a developer to adjust the frame rate of the original video to the standard frame rate.
  • S20: Separate the standard video to obtain a valid audio stream and a valid video stream.
  • The valid audio stream is the audio data stream in the standard video that contains only speech. The valid video stream is the video data stream in the standard video that contains no speech.
  • Specifically, because the standard video contains both speech data and video data, FFmpeg is used in this embodiment to separate the standard video into a valid audio stream and a valid video stream, which facilitates the subsequent training of the lip language recognition model.
  • FFmpeg (Fast Forward MPEG) is a multimedia framework that can decode, encode, transcode, and play videos in many formats (such as asx, asf, mpg, wmv, 3gp, mp4, mov, avi, and flv). It can also separate the speech data and the video data in the standard video, and it runs on operating systems such as Windows, Linux, and macOS.
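  • For illustration only, both preprocessing steps above could be driven from Python by calling FFmpeg, assuming FFmpeg is installed. The patent does not name the exact commands, so the flags used below (a 30 fps output rate and stream-copy extraction) are assumptions of this sketch:

        import subprocess

        def standardize_frame_rate(original_video: str, standard_video: str, fps: int = 30) -> None:
            """Convert the original video to the standard frame rate (step S10)."""
            subprocess.run(["ffmpeg", "-y", "-i", original_video, "-r", str(fps), standard_video], check=True)

        def separate_streams(standard_video: str, audio_out: str, video_out: str) -> None:
            """Split the standard video into a valid audio stream and a silent valid video stream (step S20)."""
            subprocess.run(["ffmpeg", "-y", "-i", standard_video, "-vn", "-acodec", "copy", audio_out], check=True)
            subprocess.run(["ffmpeg", "-y", "-i", standard_video, "-an", "-vcodec", "copy", video_out], check=True)

        standardize_frame_rate("original.mp4", "standard.mp4")
        separate_streams("standard.mp4", "audio.aac", "video_only.mp4")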
  • S30: Use a face recognition algorithm to track the face in the valid video stream, and extract the mouth area of the face to obtain a frame lip motion video.
  • The face recognition algorithm is the algorithm used to recognize faces in the valid video stream. In this embodiment, the face recognition algorithm in the Dlib library is used to track and recognize faces in the video.
  • The Dlib library is a C++ open-source toolkit containing machine learning algorithms. Because everything in the Dlib library is highly modular, it executes quickly, and because the Dlib library provides an API, it is easy to use. In addition, the Dlib library is suitable for a wide range of applications, including robotics, embedded devices, mobile phones, and large high-performance computing environments.
  • Specifically, after the valid video stream is obtained, the Dlib face recognition algorithm is used to track the face in the valid video stream, the mouth region of each video frame is marked out, and the lip motion in each frame is extracted to obtain the frame lip motion video. The frame lip motion video refers to each frame of the valid video stream that contains lip motion.
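  • A minimal sketch of the mouth-region extraction with Dlib follows. The patent only states that the Dlib face recognition algorithm is used; the 68-point landmark predictor file and the use of landmark points 48-67 for the mouth are assumptions of this illustration:

        import cv2
        import dlib

        detector = dlib.get_frontal_face_detector()
        # Assumed pre-trained 68-point landmark model; not named in the patent.
        predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

        def extract_mouth(frame):
            """Track the face in one frame of the valid video stream and crop the mouth region."""
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector(gray, 1)
            if not faces:
                return None
            shape = predictor(gray, faces[0])
            points = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]  # mouth landmarks
            xs, ys = zip(*points)
            return frame[min(ys):max(ys), min(xs):max(xs)]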
  • S40: Process the frame lip motion video to obtain a lip image sequence.
  • Specifically, after the frame lip motion video is obtained, the server resizes each frame of the lip motion video to the same frame width and frame height so that the frames can subsequently be spliced into a lip image sequence.
  • The lip image sequence is the image sequence formed by splicing the frame lip motion video with all frames at the same width and height. Obtaining the lip image sequence provides the data source for the subsequent steps.
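  • The resizing and splicing can be sketched as below; the 64x64 target size is an assumption, since the patent only requires a uniform frame width and height:

        import cv2
        import numpy as np

        def build_lip_image_sequence(mouth_frames, size=(64, 64)):
            """Resize each frame lip motion image to the same width and height and splice them into a sequence."""
            resized = [cv2.resize(f, size) for f in mouth_frames if f is not None]
            return np.stack(resized)  # shape: (num_frames, height, width, channels)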
  • S50: Use a sequence segmentation rule to segment the lip image sequence and obtain segmented image sequences.
  • The sequence segmentation rule is a rule for segmenting the lip image sequence according to a preset length (for example, 9 consecutive images).
  • Specifically, after the lip image sequence is obtained, it is segmented according to the sequence segmentation rule to obtain the segmented image sequences. A segmented image sequence is a lip image sequence of the preset length.
  • For example, if the acquired lip image sequence is 001-020 and the preset length in the segmentation rule is 9, the server segments the lip image sequence according to the preset length into segmented image sequences of length 9: 001-009, 002-010, ..., 012-020.
  • Segmenting the lip image sequence with the sequence segmentation rule facilitates subsequent recognition with the lip image recognition model.
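  • The sequence segmentation rule can be implemented as a sliding window over the lip image sequence; the sketch below reproduces the 001-020 example with a preset length of 9:

        def segment_sequence(lip_sequence, preset_length=9):
            """Segment the lip image sequence into windows of the preset length (sequence segmentation rule)."""
            return [lip_sequence[i:i + preset_length]
                    for i in range(len(lip_sequence) - preset_length + 1)]

        frames = [f"{i:03d}" for i in range(1, 21)]      # 001 ... 020
        windows = segment_sequence(frames)               # 001-009, 002-010, ..., 012-020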
  • S60: Input the segmented image sequences corresponding to the lip image sequence into the lip image recognition model in turn for recognition, and obtain a lip image recognition result.
  • The lip image recognition model is the model used to recognize the segmented image sequences. By recognizing a segmented image sequence, the model obtains the meaning expressed by the speaker's lip motion contained in that sequence. The lip image recognition model in this embodiment is obtained by training a 3D CNN model to recognize segmented image sequences.
  • Specifically, after the segmented image sequences are obtained, they are input into the lip image recognition model in turn for recognition, the recognition result corresponding to each segmented image sequence is obtained, and the recognition results of the segmented image sequences are then spliced in chronological order to obtain the lip image recognition result.
  • The lip image recognition result is the text obtained by recognizing the segmented image sequences with the lip image recognition model; this text is the meaning expressed by the speaker's lip motion in the segmented image sequences.
  • S70: Input the valid audio stream into the speech recognition model to obtain a speech recognition result.
  • The speech recognition model in this embodiment is obtained by training a bidirectional recurrent neural network (Bi-directional Recurrent Neural Networks, BRNN) model that incorporates an attention mechanism. Using the attention mechanism in the BRNN model allows the model to recognize the valid audio stream in batches, that is, to focus on only part of the valid audio stream at a time instead of on the entire stream.
  • The speech recognition model first recognizes the first part of the valid audio stream and computes the probabilities of the candidate words for that part, then selects the word with the highest probability as the speech recognition result of the first part. That result is then combined with the second part of the valid audio stream to compute the probabilities of the candidate words for the second part, and the word with the highest probability is selected as its speech recognition result. This process repeats until the entire valid audio stream input into the speech recognition model has been recognized.
  • Finally, the speech recognition results of the individual parts are concatenated to obtain the speech recognition result of the valid audio stream. This ensures that the result for each part is obtained according to the context of the valid audio stream, which improves the accuracy of the speech recognition result corresponding to the valid audio stream.
  • S80: Calculate the similarity between the lip image recognition result and the speech recognition result; when the similarity reaches a preset value, use the lip image recognition result as the lip language recognition result of the original video.
  • In this embodiment, the cosine similarity algorithm is used to calculate the similarity between the lip image recognition result and the speech recognition result. The two results are first converted into vectors A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn), and the cosine similarity formula is then used to compute the similarity between them. When the similarity reaches the preset value, the lip image recognition result is considered accurate and can be used as the lip language recognition result of the original video.
  • Further, the cosine similarity computed in this way lies in [-1, 1]: the closer the value is to 1, the more similar the directions of the two vectors; the closer it is to -1, the more opposite the directions; a value near 0 indicates that the two vectors are nearly orthogonal. The preset value in this embodiment is therefore a value within [0, 1], such as 0.98.
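  • A minimal sketch of the similarity check follows. The patent states only that the two recognition results are converted into vectors A and B and compared with the cosine formula; the bag-of-words term counts used for vectorization here are an assumption:

        import math
        from collections import Counter

        def cosine_similarity(result_a: str, result_b: str) -> float:
            """Cosine similarity between the lip image recognition result and the speech recognition result."""
            a, b = Counter(result_a.split()), Counter(result_b.split())
            dot = sum(a[t] * b[t] for t in set(a) | set(b))
            norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
            return dot / norm if norm else 0.0

        lip_result, speech_result = "open the door", "open the door"
        if cosine_similarity(lip_result, speech_result) >= 0.98:   # preset value from this embodiment
            lip_language_result = lip_result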
  • Further, if the obtained lip image recognition result or speech recognition result is a sentence, keywords are extracted from the lip image recognition result or the speech recognition result in advance (after step S70 and before step S80) to make the similarity calculation easier. The keyword extraction algorithms used in this embodiment include, but are not limited to, the TextRank keyword extraction algorithm and the LSA (Latent Semantic Analysis) algorithm.
  • In steps S10 to S80, a standard video is obtained by adjusting the frame rate of the original video to the standard frame rate, and the audio data and the video data in the standard video are then separated to obtain a valid audio stream and a valid video stream.
  • A face recognition algorithm is used to track the face in the valid video stream and extract the mouth area of the face to obtain the frame lip motion video, whose frames are resized to the same width and height so that they can be spliced into a lip image sequence.
  • To facilitate recognition by the lip image recognition model, the lip image sequence is segmented in advance so that what is input into the model are segmented image sequences of the preset length, and the recognition results corresponding to the segmented image sequences are spliced in chronological order to obtain the lip image recognition result.
  • To further verify the lip image recognition result, the valid audio stream is input into the speech recognition model and the similarity between the two results is calculated; when the similarity reaches the preset value, the lip image recognition result is accurate and can be used as the lip language recognition result of the original video, which ensures the accuracy of the lip image recognition result.
  • In an embodiment, as shown in Fig. 3, step S60, inputting the segmented image sequences corresponding to the lip image sequence into the lip image recognition model in turn for recognition and obtaining the lip image recognition result, specifically includes the following steps:
  • S61: Recognize each segmented image sequence corresponding to the lip image sequence with the lip image recognition model, and obtain segmented image features.
  • Specifically, after the segmented image sequences corresponding to the lip image sequence are obtained, each segmented image sequence is input into the lip image recognition model. The model extracts data features through its convolutional layers and pooling layers, and then uses fully connected layers to integrate all the data features into segmented image features. A segmented image feature is the result obtained by the lip image recognition model from recognizing a segmented image sequence.
  • Preferably, the hidden-layer structure in this embodiment consists of 4 convolutional layers, 3 pooling layers, and 2 fully connected layers. The convolution kernel size of the convolutional layers is set to 3*3*3, the max-pooling size of the pooling layers is set to 1*3*3, and the stride is 1*2*2, which improves the recognition efficiency and accuracy of the lip image recognition model.
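  • A sketch of this hidden-layer structure in PyTorch is given below. The kernel, pooling, and stride sizes follow the embodiment; the channel widths, the input size (9 frames of 64x64 mouth crops), and the number of output classes are assumptions of this illustration:

        import torch
        import torch.nn as nn

        class Lip3DCNN(nn.Module):
            """4 conv layers (3x3x3 kernels), 3 max-pooling layers (1x3x3, stride 1x2x2), 2 fully connected layers."""
            def __init__(self, num_classes=500):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
                    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
                    nn.Conv3d(64, 96, kernel_size=3, padding=1), nn.ReLU(),
                    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
                    nn.Conv3d(96, 128, kernel_size=3, padding=1), nn.ReLU(),
                )
                self.classifier = nn.Sequential(
                    nn.Flatten(),
                    nn.LazyLinear(256), nn.ReLU(),   # fully connected layer 1
                    nn.Linear(256, num_classes),     # fully connected layer 2 (logits)
                )

            def forward(self, x):                    # x: (batch, 3, 9, 64, 64)
                return self.classifier(self.features(x))

        logits = Lip3DCNN()(torch.randn(1, 3, 9, 64, 64))
        probs = logits.softmax(dim=-1)               # classification with the softmax function (step S62)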
  • S62: Use a classification function to classify the segmented image features, and obtain segmented image recognition results.
  • Specifically, after the segmented image features are obtained, a classification function (the softmax function) is used to classify them and obtain the image recognition result corresponding to each segmented image feature. Because a segmented image feature in this embodiment is an image feature obtained from a training image sequence containing lip motion, the segmented image recognition result is a word or sentence corresponding to that segmented image feature.
  • S63: Splice the segmented image recognition results in chronological order to obtain the lip image recognition result.
  • Specifically, because the lip image recognition model recognizes one segmented image sequence at a time and each segmented image feature represents only the image feature of one segmented image sequence in the lip image sequence, the segmented image recognition results need to be spliced after they are obtained, so as to generate the lip image recognition result corresponding to the lip image sequence.
  • For example, for a 3 s valid video stream with a frame rate of 30 fps and a segmented image sequence length of 9, the lip image sequence corresponding to the valid video stream is segmented every 9 frames (0.3 s), yielding 10 segmented image sequences of 0.3 s each. The 10 segmented image sequences are input into the lip image recognition model in turn, and the segmented image feature corresponding to each of them is obtained. The classification function is then used to classify each segmented image feature, giving the segmented image recognition result, that is, the word or sentence corresponding to that feature. Finally, the 10 segmented image recognition results are spliced in chronological order to obtain the meaning expressed by the lip motion in the valid video stream.
  • In steps S61 to S63, each segmented image sequence corresponding to the lip image sequence is recognized by the lip image recognition model to obtain segmented image features, the classification function is used to classify the segmented image features to obtain segmented image recognition results, and the segmented image recognition results are then spliced in chronological order to obtain the lip image recognition result. No manual intervention is needed; the result is derived automatically by the lip image recognition model, which improves recognition efficiency and accuracy.
  • In an embodiment, as shown in Fig. 4, the lip language recognition method further includes obtaining the lip image recognition model, which specifically includes the following steps:
  • S601: Obtain training image sequences, where each training image sequence carries an image text label, and divide the training image sequences into an image sequence training set and an image sequence test set.
  • A training image sequence is an image sequence formed by multiple images that contain only lip motion and meet the preset length. The image text label is a text label used to represent the training image sequence; in this embodiment it is a word or a sentence.
  • Specifically, after the training image sequences are obtained, they are divided into an image sequence training set and an image sequence test set, so that the training set is used to train the 3D CNN model and the test set is used to test the accuracy of the trained 3D CNN model.
  • S602: Input the training image sequences in the image sequence training set into the 3D convolutional neural network model to obtain training results.
  • Specifically, after the image sequence training set is obtained, the training image sequences are input into the 3D convolutional neural network (3D CNN) model for training. The output of each convolutional layer is computed as a_m^l = σ(z_m^l) = σ(a_m^{l-1} * W^l + b^l), where a_m^l is the output of the m-th training image sequence at the l-th convolutional layer, z_m^l is that output before the activation function is applied, a_m^{l-1} is the output of the m-th training image sequence at convolutional layer l-1 (that is, the output of the previous layer), σ is the activation function, * denotes the convolution operation, W^l is the convolution kernel (weights) of the l-th convolutional layer, and b^l is the bias of the l-th convolutional layer. The activation function σ used for the convolutional layers is ReLU (Rectified Linear Unit), which performs better than other activation functions here.
  • If layer l is a pooling layer, max-pooling downsampling is used to reduce the dimensionality of the convolutional layer's output, that is, a_m^l = pool(a_m^{l-1}), where pool denotes the downsampling computation; max pooling simply takes the maximum value within each m*m sample window. The fully connected layers are then used to integrate all the data features into segmented image features.
  • Finally, the output layer produces T(m), the output of the 3D CNN for the m-th training image sequence, that is, the training result corresponding to that sequence.
  • S603: Construct a loss function from the training results and the image text labels, and update and adjust the weights and biases of the 3D convolutional neural network model through the loss function to obtain a lip image training model.
  • Specifically, after the training results are obtained, the 3D CNN model constructs a loss function from the training results and the image text labels, and by taking partial derivatives of the loss function, the weights and biases in the 3D CNN model are updated and adjusted to obtain the lip image training model.
  • S604: Use the training image sequences in the image sequence test set to test the lip image training model; when the error between the output corresponding to the image sequence test set and the image text labels is within a preset error range, the lip image training model is used as the lip image recognition model.
  • In steps S601 to S604, the training image sequences in the image sequence training set are input into the 3D convolutional neural network model for training to obtain the lip image training model, and the training image sequences in the image sequence test set are then used to test and verify the model.
  • When the error between the output corresponding to the image sequence test set and the image text labels is within the preset error range, the lip image training model meets the requirements and can be used as the lip image recognition model.
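  • The training of steps S602-S603 can be sketched as a standard supervised update. The patent only says that a loss function is built from the training result and the image text label and that the weights and biases are adjusted through its partial derivatives; the cross-entropy loss and SGD optimizer below are assumptions:

        import torch
        import torch.nn as nn

        model = nn.Sequential(nn.Flatten(), nn.LazyLinear(500))   # stand-in for the 3D CNN sketched above
        criterion = nn.CrossEntropyLoss()                          # assumed form of the loss function
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

        def training_step(batch_sequences, batch_labels):
            """One update of the weights and biases from the partial derivatives of the loss."""
            optimizer.zero_grad()
            loss = criterion(model(batch_sequences), batch_labels)  # training result vs. image text label
            loss.backward()                                         # partial derivatives of the loss
            optimizer.step()                                        # update and adjust weights and biases
            return loss.item()

        training_step(torch.randn(4, 3, 9, 64, 64), torch.randint(0, 500, (4,)))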
  • In an embodiment, as shown in Fig. 5, the lip language recognition method further includes obtaining the speech recognition model, which specifically includes the following steps:
  • S701: Acquire training speech, preprocess the training speech, and obtain target speech.
  • The training speech is the lip-reading speech obtained from the original video for model training. The collected training speech inevitably includes silent segments and noise segments, so it needs to be preprocessed to remove them, retaining only the target speech with continuously changing voiceprint.
  • A silent segment is a part of the training speech with no utterance, for example while the speaker is thinking or breathing. A noise segment is the environmental-noise part of the training speech, such as the sound of doors and windows opening and closing or of objects colliding. The target speech is the data obtained after preprocessing the training speech, containing only clearly and continuously changing voiceprint.
  • S702: Use speech-to-text technology to convert the target speech into original text.
  • The speech-to-text technology used in this embodiment is ASR (Automatic Speech Recognition), a technology that converts a speaker's speech into text.
  • Specifically, after the target speech is obtained, the server uses ASR technology to convert the target speech into original text, that is, the text form corresponding to the target speech.
  • Converting the target speech into original text facilitates text labeling of the target text: if text labeling were applied directly to the target speech, the labeler would have to listen to the speech content, which is inconvenient to operate and store and slow to process. Once the target speech is converted into original text and expressed as text, the content can be labeled by reading it, which is convenient and efficient.
  • S703: Preprocess the original text to obtain target text, where the target text carries a corresponding text label.
  • The target text is the text obtained by preprocessing the original text to remove digits and special symbols. The digits in this embodiment are the numerals that appear after the target speech is converted into the original text; the special symbols are unrecognizable characters that appear after the conversion, such as $, *, &, #, + and ?.
  • Specifically, after the original text is obtained, the server preprocesses it, removes the digits and special symbols, and obtains target text containing only Chinese characters.
  • The server then sends the target text to the client, and the staff member at the client reads the content of the target text and labels it, so that the target text obtains its corresponding text label for subsequent model training with the target text and text label.
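  • The text preprocessing of step S703 can be sketched with a regular expression; the CJK character range used to keep only Chinese characters is an assumption of this illustration:

        import re

        def clean_original_text(original_text: str) -> str:
            """Remove digits and special symbols such as $, *, &, #, + and ?, keeping only Chinese characters."""
            return re.sub(r"[^\u4e00-\u9fa5]", "", original_text)

        print(clean_original_text("今天*天气#很好123?"))   # -> 今天天气很好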
  • S704: Divide the target text into training text and test text, which are used, respectively, to train the bidirectional RNN model and to test whether the trained bidirectional RNN model is accurate.
  • The training text is the text used to adjust the parameters in the bidirectional RNN model; the test text is used to test the recognition accuracy of the trained bidirectional RNN model.
  • S705: Input the training text into the original bidirectional recurrent neural network model for training, and obtain an effective bidirectional recurrent neural network model.
  • The bidirectional recurrent neural network (Bi-directional Recurrent Neural Networks, BRNN) model is composed of two RNNs (Recurrent Neural Networks). The forward RNN and the backward RNN in the BRNN model each have their own hidden layer, while the input layer and the output layer are shared; that is, the bidirectional RNN model is a neural network model composed of an input layer, two hidden layers, and an output layer.
  • The bidirectional RNN model includes the weights and biases of the neuron connections between layers; these weights and biases are the parameters of the bidirectional RNN model and determine its properties and recognition performance.
  • S706: Input the test text into the effective bidirectional recurrent neural network model for testing, and obtain the accuracy rate corresponding to the test text; if the accuracy rate reaches a preset threshold, determine the effective bidirectional recurrent neural network model to be the speech recognition model.
  • Specifically, after the effective bidirectional recurrent neural network model is obtained, the test text is used to test it in order to prevent overfitting, that is, to prevent the model from being accurate only on the training text but not on other content. If the accuracy rate reaches the preset threshold (such as 95%), the accuracy of the effective bidirectional recurrent neural network model is deemed to meet the requirements, and it can be used as the speech recognition model.
  • In steps S701 to S706, the target speech is obtained by preprocessing the training speech, and speech-to-text technology is used to convert the target speech into original text to facilitate the subsequent steps. The original text is then preprocessed to obtain the target text, which is divided into training text and test text used, respectively, to train the bidirectional RNN model and to test the trained model. This ensures that the effective bidirectional recurrent neural network model meets the accuracy requirements and can be used as the speech recognition model.
  • In an embodiment, as shown in Fig. 6, step S705, inputting the training text into the original bidirectional recurrent neural network model for training and obtaining an effective bidirectional recurrent neural network model, specifically includes the following steps:
  • S7051: Initialize the weights and biases in the original bidirectional recurrent neural network model.
  • Specifically, preset values are used to initialize the weights and biases; a preset value is a value set in advance by the developer based on experience.
  • Using preset values to initialize the weights and biases of the bidirectional RNN model shortens the training time of the model and improves its recognition accuracy when the model is subsequently trained on the training text. If the initial settings of the weights and biases are not appropriate, the model adjusts poorly in the initial stage, which affects the subsequent speech discrimination performance of the bidirectional RNN model.
  • S7052: Convert the training text into word vectors, input the word vectors into the original bidirectional recurrent neural network model for training, and obtain the model output result.
  • Specifically, a word vector conversion tool is used to convert the words in the training text into word vectors; one piece of training text includes at least one word vector.
  • The word vector conversion tool used in this embodiment is word2vec (word to vector), a tool that converts words into vectors, in which each word can be mapped to a corresponding vector.
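  • A minimal word2vec sketch using the gensim library (gensim 4.x API) is shown below; the vector size and the toy sentences are assumptions, and in the embodiment the sentences would come from the segmented training text:

        from gensim.models import Word2Vec

        sentences = [["今天", "天气", "很好"], ["明天", "天气", "不好"]]   # toy training text, already segmented
        w2v = Word2Vec(sentences, vector_size=100, min_count=1)          # map each word to a 100-dimensional vector
        word_vector = w2v.wv["天气"]                                      # word vector fed into the bidirectional RNN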
  • The word vectors are input into the hidden layer of the forward RNN and into the hidden layer of the backward RNN to obtain the forward hidden-layer output and the backward hidden-layer output. The attention mechanism is then used to assign a degree of attention to the forward hidden-layer output and to the backward hidden-layer output separately, and the two attention-processed outputs are finally fused to obtain the value that is input into the output layer of the bidirectional recurrent neural network model; the model output result is obtained through the computation of the output layer.
  • The attention mechanism assigns different weights to data according to their importance: greater importance corresponds to a larger weight, and smaller importance to a smaller weight. The model output result is the output obtained for the training text through training of the bidirectional RNN model.
  • The fusion processing in this embodiment includes, but is not limited to, the arithmetic-mean method and the weighted-mean method; the subsequent steps use the arithmetic-mean method to fuse the two attention-processed outputs.
  • S7053: Update the weights and biases in the original bidirectional recurrent neural network model based on the model output result to obtain an effective bidirectional recurrent neural network model.
  • Specifically, after the output layer of the original bidirectional RNN model computes the model output result, a loss function is constructed from the model output result and the text label y_t. According to the loss function, the back-propagation algorithm is used to take partial derivatives with respect to the weights and biases in the bidirectional RNN model, and the weights and biases of the forward RNN and the backward RNN are adjusted to obtain an effective bidirectional RNN model.
  • The back-propagation algorithm adjusts, in the reverse order of the time steps, the weights and biases between the hidden layers and the output layer of the original bidirectional RNN model and the weights and biases between the input layer and the hidden layers. In the loss function, T represents the training sequence carried by the training text, θ represents the set of weights and biases (U, V, W, b, c), and y_t represents the text label corresponding to the word vector at time t.
  • In steps S7051 to S7053, the weights and biases in the original bidirectional recurrent neural network model are initialized to shorten the training time of the subsequent model, the training text is converted into word vectors and input into the model for training to obtain the model output result, and the weights and biases are then updated based on the model output result. The resulting bidirectional recurrent neural network model, which can recognize the training text, is called the effective bidirectional recurrent neural network model.
  • In an embodiment, as shown in Fig. 7, step S7052, converting the training text into word vectors, inputting the word vectors into the original bidirectional recurrent neural network model for training, and obtaining the model output result, specifically includes the following steps:
  • S70521: Convert the training text into word vectors and input the word vectors into the input layer of the original bidirectional recurrent neural network model; the input layer feeds the word vectors into the forward hidden layer of the forward recurrent neural network, and the attention mechanism is used for processing to obtain the forward output. The forward hidden layer is the hidden layer of the forward recurrent neural network.
  • Specifically, the training text is input into the input layer of the original bidirectional RNN model, the input layer feeds the acquired word vectors into the forward hidden layer, and the output of the forward hidden layer is computed with the formula h_t1 = σ(U x_t + W h_{t-1} + b), where σ is the activation function of the forward RNN hidden layer, U is the weight between the input layer of the original bidirectional RNN model and the forward RNN hidden layer, W is the weight between the hidden states of the forward RNN, b is the bias between the input layer of the original bidirectional RNN model and the forward RNN, x_t is the word vector input at time t at the input layer, h_t1 is the output of the forward RNN hidden layer for the word vector at time t, and h_{t-1} is the forward hidden-layer output for the word vector at time t-1.
  • The forward output is the value obtained by processing the forward hidden-layer output with the attention mechanism: c_t1 = Σ_j α_tj h_j, where c_t1 is the degree of attention (the importance value) that the attention mechanism assigns to the semantic vector at time t in the forward hidden layer, α_tj is the correlation between the j-th input word vector and the word vector at time t, and h_j is the output of the j-th input word vector through the forward hidden layer. The correlation is normalized as α_tj = exp(e_tj) / Σ_k exp(e_tk), where k runs over the input word vectors.
  • S70522: The input layer feeds the acquired word vectors into the backward hidden layer of the backward recurrent neural network, and the attention mechanism is used for processing to obtain the backward output. The backward hidden layer is the hidden layer of the backward recurrent neural network.
  • Specifically, the training text is input into the input layer of the original bidirectional RNN model, the input layer feeds the acquired word vectors into the backward hidden layer, and the output of the backward hidden layer is computed with the formula h_t2 = σ(U x_t + W h_{t-1} + b), where σ is the activation function of the backward RNN hidden layer, U is the weight between the input layer of the original bidirectional RNN model and the backward RNN hidden layer, W is the weight between the hidden states of the backward RNN, b is the bias between the input layer of the original bidirectional RNN model and the backward RNN, x_t is the word vector input at time t at the input layer, h_t2 is the output of the backward RNN hidden layer for the word vector at time t, and h_{t-1} is the backward hidden-layer output for the word vector at time t-1.
  • The backward output is the value obtained by processing the backward hidden-layer output with the attention mechanism: c_t2 = Σ_j α_tj h_j, where c_t2 is the degree of attention (the importance value) that the attention mechanism assigns to the semantic vector at time t in the backward hidden layer, α_tj is the correlation between the j-th input word vector and the word vector at time t, and h_j is the output of the j-th input word vector through the backward hidden layer. The correlation is normalized as α_tj = exp(e_tj) / Σ_k exp(e_tk), where k runs over the input word vectors and e_tj = V^T · tanh(U h_j + W S_{t-1} + b), with V the weight between the hidden layer and the output layer, V^T the transpose of V, and S_{t-1} the output of the output layer of the bidirectional recurrent neural network at time t-1.
  • S70523: Perform fusion processing on the forward output and the backward output to obtain the model output result, which is the value finally input into the output layer.
  • In steps S70521 to S70523, the attention mechanism is used while training the original bidirectional recurrent neural network model so that the forward output and the backward output obtained correspond to the important word vectors in the training text, and the model output result subsequently obtained can therefore reflect the main meaning of the training text.
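  • The structure of steps S70521-S70523 is sketched below in PyTorch. The GRU cells, hidden sizes, the simplified additive attention (which omits the S_{t-1} term), and the arithmetic-mean fusion are assumptions; the patent specifies only a bidirectional RNN with an attention mechanism whose forward and backward outputs are fused before the shared output layer:

        import torch
        import torch.nn as nn

        class AttentiveBiRNN(nn.Module):
            def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128, num_classes=5000):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim)      # word vectors (e.g. from word2vec)
                self.fwd_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
                self.bwd_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
                self.attn = nn.Linear(hidden_dim, 1)                  # produces the scores e_tj (simplified)
                self.out = nn.Linear(hidden_dim, num_classes)         # shared output layer

            def attend(self, h):                                      # h: (batch, T, hidden)
                alpha = torch.softmax(torch.tanh(self.attn(h)), dim=1)  # normalized correlations
                return (alpha * h).sum(dim=1)                         # c = sum_j alpha_j * h_j

            def forward(self, tokens):                                # tokens: (batch, T) word indices
                x = self.embed(tokens)
                h_fwd, _ = self.fwd_rnn(x)                            # forward hidden-layer outputs
                h_bwd, _ = self.bwd_rnn(torch.flip(x, dims=[1]))      # backward pass on reversed input
                fused = 0.5 * (self.attend(h_fwd) + self.attend(h_bwd))  # arithmetic-mean fusion
                return self.out(fused)                                # model output from the output layer

        logits = AttentiveBiRNN()(torch.randint(0, 5000, (2, 12)))    # two sentences of 12 tokens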
  • To sum up, the lip language recognition method provided in this application obtains a standard video by adjusting the frame rate of the original video to the standard frame rate, and then separates the audio data and the video data in the standard video to obtain a valid audio stream and a valid video stream.
  • A face recognition algorithm is used to track the face in the valid video stream and extract the mouth area of the face to obtain the frame lip motion video, whose frames are resized to the same width and height so that they can be spliced into a lip image sequence.
  • The lip image sequence is segmented into segmented image sequences of the preset length before being input into the lip image recognition model, and the recognition results corresponding to the segmented image sequences are spliced in chronological order to obtain the lip image recognition result.
  • When the similarity between the lip image recognition result and the speech recognition result reaches the preset value, the lip image recognition result is accurate and is used as the lip language recognition result of the original video, which ensures the accuracy of the lip image recognition result.
  • In an embodiment, a lip language recognition apparatus is provided, and the lip language recognition apparatus corresponds one-to-one to the lip language recognition method in the above embodiment.
  • As shown in Fig. 8, the lip language recognition apparatus includes an original video processing module 10, a standard video processing module 20, a frame video acquisition module 30, a frame video processing module 40, an image sequence segmentation module 50, a first model recognition module 60, a second model recognition module 70, and a result verification module 80.
  • The detailed description of each functional module is as follows:
  • The original video processing module 10 is configured to obtain an original video, standardize the frame rate of the original video, and obtain a standard video.
  • The standard video processing module 20 is configured to separate the standard video to obtain a valid audio stream and a valid video stream.
  • The frame video acquisition module 30 is configured to use a face recognition algorithm to track the face in the valid video stream and extract the mouth area of the face to obtain a frame lip motion video.
  • The frame video processing module 40 is configured to process the frame lip motion video to obtain a lip image sequence.
  • The image sequence segmentation module 50 is configured to segment the lip image sequence using the sequence segmentation rule to obtain segmented image sequences.
  • The first model recognition module 60 is configured to input the segmented image sequences corresponding to the lip image sequence into the lip image recognition model in turn for recognition and obtain the lip image recognition result.
  • The second model recognition module 70 is configured to input the valid audio stream into the speech recognition model to obtain the speech recognition result.
  • The result verification module 80 is configured to calculate the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches the preset value, use the lip image recognition result as the lip language recognition result of the original video.
  • Preferably, the first model recognition module 60 includes an image feature acquisition unit 61, an image feature processing unit 62, and an image recognition result acquisition unit 63.
  • The image feature acquisition unit 61 is configured to recognize each segmented image sequence corresponding to the lip image sequence through the lip image recognition model and obtain segmented image features.
  • The image feature processing unit 62 is configured to classify the segmented image features using the classification function and obtain segmented image recognition results.
  • The image recognition result acquisition unit 63 is configured to splice the segmented image recognition results in chronological order to obtain the lip image recognition result.
  • Preferably, the lip language recognition apparatus further includes a training image data acquisition unit, an image training result acquisition unit, a first model training unit, and a first model acquisition unit.
  • The training image data acquisition unit is configured to acquire training image sequences that carry image text labels and divide the training image sequences into an image sequence training set and an image sequence test set.
  • The image training result acquisition unit is configured to input the training image sequences in the image sequence training set into the 3D convolutional neural network model to obtain training results.
  • The first model training unit is configured to construct a loss function from the training results and the image text labels and to update and adjust the weights and biases of the 3D convolutional neural network model through the loss function to obtain the lip image training model.
  • The first model acquisition unit is configured to test the lip image training model using the training image sequences in the image sequence test set and, when the error is within the preset error range, to use the lip image training model as the lip image recognition model.
  • Preferably, the lip language recognition apparatus further includes a training speech data acquisition unit, a speech processing unit, a text processing unit, a text division unit, a second model training unit, and a second model acquisition unit.
  • The training speech data acquisition unit is configured to acquire training speech, preprocess the training speech, and obtain target speech.
  • The speech processing unit is configured to convert the target speech into original text using speech-to-text technology.
  • The text processing unit is configured to preprocess the original text to obtain target text that carries a corresponding text label.
  • The text division unit is configured to divide the target text into training text and test text.
  • The second model training unit is configured to input the training text into the original bidirectional recurrent neural network model for training and obtain an effective bidirectional recurrent neural network model.
  • The second model acquisition unit is configured to input the test text into the effective bidirectional recurrent neural network model for testing and obtain the accuracy rate corresponding to the test text; if the accuracy rate reaches the preset threshold, the effective bidirectional recurrent neural network model is determined to be the speech recognition model.
  • Preferably, the second model training unit includes a parameter initialization unit, a model output result acquisition unit, and a parameter update unit.
  • The parameter initialization unit is configured to initialize the weights and biases in the original bidirectional recurrent neural network model.
  • The model output result acquisition unit is configured to convert the training text into word vectors, input the word vectors into the original bidirectional recurrent neural network model for training, and obtain the model output result.
  • The parameter update unit is configured to update the weights and biases in the original bidirectional recurrent neural network model based on the model output result and obtain the effective bidirectional recurrent neural network model.
  • Preferably, the model output result acquisition unit includes a forward output acquisition unit, a backward output acquisition unit, and an output processing unit.
  • The forward output acquisition unit is configured to convert the training text into word vectors and input the word vectors into the input layer of the original bidirectional recurrent neural network model; the input layer feeds the word vectors into the forward hidden layer of the forward recurrent neural network, and the attention mechanism is used for processing to obtain the forward output.
  • The backward output acquisition unit is configured to have the input layer feed the acquired word vectors into the backward hidden layer of the backward recurrent neural network and use the attention mechanism for processing to obtain the backward output.
  • The output processing unit is configured to perform fusion processing on the forward output and the backward output to obtain the model output result.
  • Each module in the above lip language recognition apparatus can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
  • In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 9.
  • The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database, and the internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium.
  • The database of the computer device stores the data involved in the lip language recognition method. The network interface of the computer device communicates with an external terminal through a network connection.
  • The computer-readable instructions, when executed by the processor, implement a lip language recognition method.
  • The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, it implements the lip language recognition method of the above embodiment, for example steps S10 to S80 shown in Fig. 2 or the steps shown in Fig. 3 to Fig. 7, which are not repeated here to avoid repetition.
  • Alternatively, when the processor executes the computer-readable instructions, it realizes the functions of the modules/units in the above embodiment of the lip language recognition apparatus, such as the functions of modules 10 to 80 shown in Fig. 8, which are likewise not repeated here.
  • In an embodiment, one or more readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the lip language recognition method of the above embodiment, for example steps S10 to S80 shown in Fig. 2 or the steps shown in Fig. 3 to Fig. 7, which are not repeated here.
  • Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors realize the functions of the modules/units in the above embodiment of the lip language recognition apparatus, such as the functions of modules 10 to 80 shown in Fig. 8, which are likewise not repeated here.
  • The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory. RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a lip language recognition method and apparatus, a computer device, and a storage medium. The method includes: standardizing the frame rate of an acquired original video and separating the resulting standard video to obtain a valid audio stream and a valid video stream; using a face recognition algorithm to track the face in the valid video stream and extracting the mouth area of the face to obtain a frame lip motion video and thereby a lip image sequence; segmenting the lip image sequence using a sequence segmentation rule to obtain segmented image sequences; inputting the segmented image sequences corresponding to the lip image sequence into a lip image recognition model in turn for recognition to obtain a lip image recognition result; inputting the valid audio stream into a speech recognition model to obtain a speech recognition result; and, when the similarity between the two results reaches a preset value, using the lip image recognition result as the lip language recognition result of the original video, thereby ensuring the accuracy of the lip image recognition result.

Description

Lip language recognition method and apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201910424466.7, filed on May 21, 2019 and entitled "Lip language recognition method and apparatus, computer device, and storage medium".
Technical Field
This application relates to a lip language recognition method and apparatus, a computer device, and a storage medium.
Background
In recent years, lip language recognition has found good applications in public-security-related fields such as intelligent human-computer interaction, recovery of damaged audio, video surveillance, and military and criminal-investigation security, and has become a research hotspot in the industry; it also has important practical significance for people with hearing impairment and aphasia. In their research, the inventors of this application found that traditional lip language recognition techniques mostly use classical algorithms such as HMM (Hidden Markov Model) and texture-feature LBP (Local Binary Pattern), or deep-learning algorithms such as convolutional neural networks, whose input is a single frame. The semantic information of the frames before and after that frame is not considered: only spatial-channel features are obtained, not temporal-channel features, so the recognized sentences lack contextual coherence and the lip language corresponding to a video stream cannot be recognized accurately.
Summary
The embodiments of this application provide a lip language recognition method and apparatus, a computer device, and a storage medium, to solve the problem that the prior art cannot accurately recognize the lip language corresponding to a video stream.
A lip language recognition method includes:
obtaining an original video, standardizing the frame rate of the original video, and obtaining a standard video;
separating the standard video to obtain a valid audio stream and a valid video stream;
using a face recognition algorithm to track the face in the valid video stream, and extracting the mouth area of the face to obtain a frame lip motion video;
processing the frame lip motion video to obtain a lip image sequence;
segmenting the lip image sequence using a sequence segmentation rule to obtain segmented image sequences;
inputting the segmented image sequences corresponding to the lip image sequence into a lip image recognition model in turn for recognition, and obtaining a lip image recognition result;
inputting the valid audio stream into a speech recognition model to obtain a speech recognition result; and
calculating the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, using the lip image recognition result as the lip language recognition result of the original video.
A lip language recognition apparatus includes:
an original video processing module, configured to obtain an original video, standardize the frame rate of the original video, and obtain a standard video;
a standard video processing module, configured to separate the standard video to obtain a valid audio stream and a valid video stream;
a frame video acquisition module, configured to use a face recognition algorithm to track the face in the valid video stream and extract the mouth area of the face to obtain a frame lip motion video;
a frame video processing module, configured to process the frame lip motion video to obtain a lip image sequence;
an image sequence segmentation module, configured to segment the lip image sequence using a sequence segmentation rule to obtain segmented image sequences;
a first model recognition module, configured to input the segmented image sequences corresponding to the lip image sequence into a lip image recognition model in turn for recognition and obtain a lip image recognition result;
a second model recognition module, configured to input the valid audio stream into a speech recognition model to obtain a speech recognition result; and
a result verification module, configured to calculate the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, use the lip image recognition result as the lip language recognition result of the original video.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
obtaining an original video, standardizing the frame rate of the original video, and obtaining a standard video;
separating the standard video to obtain a valid audio stream and a valid video stream;
using a face recognition algorithm to track the face in the valid video stream, and extracting the mouth area of the face to obtain a frame lip motion video;
processing the frame lip motion video to obtain a lip image sequence;
segmenting the lip image sequence using a sequence segmentation rule to obtain segmented image sequences;
inputting the segmented image sequences corresponding to the lip image sequence into a lip image recognition model in turn for recognition, and obtaining a lip image recognition result;
inputting the valid audio stream into a speech recognition model to obtain a speech recognition result; and
calculating the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, using the lip image recognition result as the lip language recognition result of the original video.
One or more readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
obtaining an original video, standardizing the frame rate of the original video, and obtaining a standard video;
separating the standard video to obtain a valid audio stream and a valid video stream;
using a face recognition algorithm to track the face in the valid video stream, and extracting the mouth area of the face to obtain a frame lip motion video;
processing the frame lip motion video to obtain a lip image sequence;
segmenting the lip image sequence using a sequence segmentation rule to obtain segmented image sequences;
inputting the segmented image sequences corresponding to the lip image sequence into a lip image recognition model in turn for recognition, and obtaining a lip image recognition result;
inputting the valid audio stream into a speech recognition model to obtain a speech recognition result; and
calculating the similarity between the lip image recognition result and the speech recognition result and, when the similarity reaches a preset value, using the lip image recognition result as the lip language recognition result of the original video.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a diagram of an application scenario of a lip language recognition method in an embodiment of this application;
Fig. 2 is a flowchart of the lip language recognition method in an embodiment of this application;
Fig. 3 is a detailed flowchart of step S60 in Fig. 2;
Fig. 4 is another flowchart of the lip language recognition method in an embodiment of this application;
Fig. 5 is another flowchart of the lip language recognition method in an embodiment of this application;
Fig. 6 is a detailed flowchart of step S705 in Fig. 5;
Fig. 7 is a detailed flowchart of step S7052 in Fig. 6;
Fig. 8 is a schematic diagram of a lip language recognition apparatus in an embodiment of this application;
Fig. 9 is a schematic diagram of a computer device in an embodiment of this application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
The lip language recognition method provided by this application can be applied in the application environment shown in Fig. 1, in which a terminal device communicates with a server through a network. The terminal device includes, but is not limited to, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in Fig. 2, a lip language recognition method is provided. Taking its application to the server in Fig. 1 as an example for description, the method includes the following steps:
S10: Obtain an original video, standardize the frame rate of the original video, and obtain a standard video.
The original video is the video captured by a video device. Because different video devices capture video at different frame rates, videos of different frame rates need to be uniformly processed into standard videos at a standard frame rate to facilitate subsequent model recognition. The standard frame rate is a preset frame rate that meets the requirements, such as 30 frames per second. A standard video is a video whose frame rate has been converted from the original frame rate to the standard frame rate.
Specifically, a frame rate processing script is used to standardize the frame rate of the original video so that original videos with different frame rates are converted into videos at the standard frame rate, that is, standard videos. The frame rate processing script is a script written by a developer to adjust the frame rate of the original video to the standard frame rate.
S20: Separate the standard video to obtain a valid audio stream and a valid video stream.
The valid audio stream is the audio data stream in the standard video that contains only speech, and the valid video stream is the video data stream in the standard video that contains no speech. Specifically, because the standard video contains both speech data and video data, FFmpeg is used in this embodiment to separate the standard video into a valid audio stream and a valid video stream, which facilitates the subsequent training of the lip language recognition model. FFmpeg (Fast Forward MPEG) is a multimedia framework that can decode, encode, transcode, and play videos in many formats (such as asx, asf, mpg, wmv, 3gp, mp4, mov, avi, and flv); it can also separate the speech data and the video data in the standard video, and it runs on operating systems such as Windows, Linux, and macOS.
S30: Use a face recognition algorithm to track the face in the valid video stream, and extract the mouth area of the face to obtain a frame lip motion video.
The face recognition algorithm is the algorithm used to recognize faces in the valid video stream; in this embodiment the face recognition algorithm in the Dlib library is used to track and recognize faces in the video. The Dlib library is a C++ open-source toolkit containing machine learning algorithms. Because everything in the Dlib library is highly modular, it executes quickly, and because the Dlib library provides an API, it is easy to use. In addition, the Dlib library is suitable for a wide range of applications, including robotics, embedded devices, mobile phones, and large high-performance computing environments.
Specifically, after the valid video stream is obtained, the face recognition algorithm in the Dlib library is used to track the face in the valid video stream, the mouth region of each video frame is then marked out, and the lip motion in each frame of the valid video stream is extracted to obtain the frame lip motion video. The frame lip motion video refers to each frame of the valid video stream that contains lip motion.
S40: Process the frame lip motion video to obtain a lip image sequence.
Specifically, after the frame lip motion video is obtained, the server resizes the frames of the lip motion video to the same frame width and frame height so that they can subsequently be spliced into a lip image sequence. The lip image sequence is the image sequence formed by splicing the frame lip motion video with all frames at the same width and height; obtaining the lip image sequence provides the data source for the subsequent steps.
S50: Segment the lip image sequence using a sequence segmentation rule to obtain segmented image sequences.
The sequence segmentation rule is a rule for segmenting the lip image sequence according to a preset length (for example, 9 consecutive images).
Specifically, after the lip image sequence is obtained, it is segmented according to the sequence segmentation rule to obtain the segmented image sequences, where a segmented image sequence is a lip image sequence of the preset length. For example, if the acquired lip image sequence is 001-020 and the preset length in the segmentation rule is 9, the server segments the lip image sequence according to the preset length into segmented image sequences of length 9: 001-009, 002-010, ..., 012-020. Segmenting the lip image sequence with the sequence segmentation rule facilitates subsequent recognition with the lip image recognition model.
S60:将唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果。
其中,唇部图像识别模型指用于识别切分图像序列的模型。该唇部图像识别模型通过对切分图像序列进行识别,获取切分图像序列中说话人的唇部动作所表达的含义。本实施例中的唇部图像识别模型是通过对3D CNN模型进行训练得到的用于识别切分图像序列的模型。
具体地,在获取切分图像序列后,将唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取每个切分图像序列对应的识别结果,然后将各切分图像序列对应的识别结果按照时间顺序进行拼接,获取唇部图像识别结果。其中,唇部图像识别结果指根据唇部图像识别模型对切分图像序列进行识别后得到的文本信息,该文本信息即为切分图像序列中说话人的唇部动作表达的含义。
S70:将有效音频流输入到语音识别模型中,获取语音识别结果。
本实施例中的语音识别模型是通过对加入注意力(attention)机制的双向循环神经网络(Bi-directional Recurrent Neural Networks,简称BRNN)模型进行训练获取的。其中,attention机制即注意力机制,在BRNN模型中使用attention机制,可以使BRNN模型在识别有效音频流时对其进行分批识别,即一次只关注有效音频流的一部分内容,而不是关注整个有效音频流的内容。语音识别模型首先识别有效音频流中的第一部分内容,计算出该部分内容对应的可能出现的词的概率,并选取概率最大的词作为第一部分内容的语音识别结果;再使用该语音识别结果与有效音频流中的第二部分内容共同计算第二部分内容对应的可能出现的词的概率,选取概率最大的词作为第二部分内容的语音识别结果;依次循环,直至输入到语音识别模型中的有效音频流被完全识别为止。最后将获取的各部分内容对应的语音识别结果连接在一起,即可获取有效音频流对应的语音识别结果,保证了每部分内容得到的语音识别结果是根据有效音频流中的上下文得到的,提高了有效音频流对应的语音识别结果的准确性。
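上述"分批关注、逐段识别并将已识别结果反馈给下一段"的过程,可用如下贪心解码的示意性流程表示(其中 speech_model.predict 为假设的接口,代表已训练好的带attention机制的BRNN语音识别模型,audio_chunks、vocab 亦为假设输入):

```python
import numpy as np

def greedy_decode(audio_chunks, speech_model, vocab):
    """逐段识别有效音频流:每段结合已识别结果计算各词出现的概率,取概率最大的词。"""
    results = []
    context = []                                   # 已得到的语音识别结果,作为下一段的上下文
    for chunk in audio_chunks:                     # 一次只关注有效音频流的一部分内容
        probs = speech_model.predict(chunk, context)   # 假设接口:返回词表上的概率分布
        best = vocab[int(np.argmax(probs))]        # 选取概率最大的词作为该部分内容的识别结果
        results.append(best)
        context.append(best)
    return "".join(results)                        # 将各部分内容的识别结果连接在一起
```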
S80:计算唇部图像识别结果和语音识别结果的相似度,当相似度达到预设值,则将唇部图像识别结果作为原始视频的唇语识别结果。
本实施例采用余弦相似度算法计算唇部图像识别结果和语音识别结果的相似度。具体过程如下:先将唇部图像识别结果和语音识别结果分别转换成向量A=(A1,A2,……,An)和向量B=(B1,B2,……,Bn),然后采用余弦相似度公式 $\cos(A,B)=\frac{A\cdot B}{\|A\|\,\|B\|}$ 计算唇部图像识别结果和语音识别结果之间的相似度。当二者的相似度达到预设值时,则表示唇部图像识别结果是准确的,可以作为原始视频的唇语识别结果。
进一步地,由于采用余弦相似度算法计算出来的余弦相似度范围在[-1,1]之间,相似度的值越趋近于1,代表两个向量的方向越接近;相似度的值越趋近于-1,代表两个向量的方向越相反;相似度的值接近于0,表示两个向量近乎于正交。因此,本实施例中的预设值为[0,1]内的数值,如0.98。
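余弦相似度的计算可参考如下示意性片段(其中将识别结果文本转换为向量的方式以字频向量为例,仅作示意,并非本申请限定的向量化方法):

```python
import numpy as np

def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """计算两个向量的余弦相似度,取值范围为[-1, 1]。"""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

def text_to_vector(text: str, vocab: list) -> np.ndarray:
    """按给定字表将识别结果文本转换为字频向量(示意性做法)。"""
    return np.array([text.count(w) for w in vocab], dtype=float)

vocab = sorted(set("今天天气很好") | set("今天天气真好"))
sim = cosine_similarity(text_to_vector("今天天气很好", vocab),
                        text_to_vector("今天天气真好", vocab))
# 当 sim 达到预设值(如0.98)时,将唇部图像识别结果作为原始视频的唇语识别结果
```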
进一步地,若获取的唇部图像识别结果或者语音识别结果是一个句子,为了方便计算唇部图像识别结果和语音识别结果的相似度,在步骤S70之后,步骤S80之前,还需要预先对唇部图像识别结果或者语音识别结果提取关键词。本实施例中使用的提取关键词的算法包括但不限于TextRank关键词提取算法和LSA(Latent Semantic Analysis,潜在语义分析)算法。
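TextRank关键词提取可参考如下示意性片段(此处假设使用开源中文分词库 jieba 提供的 textrank 接口,topK 取值为示例值):

```python
import jieba.analyse

sentence = "今天下午我们一起去公园散步"
# 基于TextRank提取关键词,topK为保留的关键词个数
keywords = jieba.analyse.textrank(sentence, topK=3, withWeight=False)
print(keywords)
```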
步骤S10-步骤S80,通过将原始视频的帧率调整为标准帧率,以获取标准视频。然后对标准视频中的音频数据和视频数据进行分离,获取有效音频流和有效视频流。使用人脸识别算法跟踪有效视频流中的人脸,并提取人脸中的嘴部区域,获取帧唇部动作视频,并将帧唇部动作视频调整为相同的帧宽和帧高,以方便将帧唇部动作视频拼接为唇部图像序列。为了方便唇部图像识别模型进行识别,还需要预先对唇部图像序列进行切分,以使唇部图像序列在输入唇部图像识别模型时,为满足预设长度的切分图像序列。最后将切分图像序列对应的识别结果按照时间顺序进行拼接,获取唇部图像识别结果。为了进一步验证唇部图像识别结果是否准确,还需要将有效音频流输入到语音识别模型中,获取对应的语音识别结果,并计算语音识别结果和唇部图像识别结果的相似性,当二者的相似度达到预设值,表示唇部图像识别结果是准确的,可以作为原始视频的唇语识别结果,以保证唇部图像识别结果的准确性。
在一实施例中,如图3所示,步骤S60,将唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果,具体包括如下步骤:
S61:通过唇部图像识别模型对唇部图像序列对应的各切分图像序列进行识别,获取切分图像特征。
具体地,在获取唇部图像序列对应的各切分图像序列后,将各切分图像序列输入到唇部图像识别模型中,唇部图像识别模型通过模型中的卷积层和池化层获取数据特征,然后再使用全连接层将所有数据特征进行整合,形成切分图像特征。其中,切分图像特征指唇部图像识别模型对切分图像序列进行识别得到的结果。
优选地,本实施例中唇部图像识别模型的隐藏层结构具体为4层卷积层、3层池化层和2层全连接层,其中卷积层的卷积核大小设置为3*3*3,池化层的最大池化大小设置为1*3*3、步长为1*2*2,以提高唇部图像识别模型的识别效率和准确性。
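按照"4层卷积层、3层池化层、2层全连接层,卷积核3*3*3,最大池化1*3*3、步长1*2*2"的结构,可搭建如下示意性的3D CNN(基于PyTorch的示意实现,并非本申请限定的网络;各层通道数、全连接维度、类别数与输入尺寸均为假设值):

```python
import torch
import torch.nn as nn

class LipNet3DCNN(nn.Module):
    """按"4层卷积+3层池化+2层全连接"搭建的3D CNN示意结构(通道数等为假设值)。"""
    def __init__(self, num_classes: int = 500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),   # 卷积核3*3*3,激活函数ReLU
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),           # 最大池化1*3*3,步长1*2*2
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
            nn.Conv3d(64, 96, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
            nn.Conv3d(96, 128, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),        # 全连接层1:整合所有数据特征,得到切分图像特征
            nn.Linear(256, num_classes),          # 全连接层2:输出各类别得分,供softmax分类
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x形状: (批大小, 通道数, 帧数, 高, 宽),如 (N, 3, 9, 112, 112)
        return self.classifier(self.features(x))

model = LipNet3DCNN(num_classes=500)
scores = model(torch.randn(2, 3, 9, 112, 112))    # 输出形状: (2, 500)
```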
S62:采用分类函数对切分图像特征进行分类,获取切分图像识别结果。
具体地,在获取切分图像特征后,使用分类函数(softmax函数)对切分图像特征进行分类,获取切分图像特征对应的切分图像识别结果。由于本实施例中的切分图像特征是从包含唇部动作的切分图像序列中得到的图像特征,因此,切分图像识别结果具体为切分图像特征对应的单词或者句子。
S63:按照时间顺序对切分图像识别结果进行拼接,获取唇部图像识别结果。
具体地,由于唇部图像识别模型每次识别的是切分图像序列对应的切分图像特征,每一切分图像特征仅代表唇部图像序列中每一个切分图像序列对应的图像特征,因此,在获取切分图像特征对应的切分图像识别结果后,需要对切分图像识别结果进行拼接,生成唇部图像对应的唇部图像识别结果。
如一段3s的有效视频流,其帧率为30帧/秒,切分图像序列的长度为9,即将该有效视频流对应的唇部图像序列按照每9帧(0.3s)的长度进行切分,获取10个长度为0.3s的切分图像序列。然后依次将10个切分图像序列输入到唇部图像识别模型中,获取每个切分图像序列对应的切分图像特征。接着采用分类函数对每个切分图像特征进行分类,获取切分图像识别结果,即该切分图像特征对应的某个词或者句子。最后在获取切分图像识别结果后,按照时间顺序将这10个切分图像识别结果进行拼接,则可以得到该有效视频流中唇部动作表达的含义。
步骤S61-步骤S63,通过唇部图像识别模型对唇部图像序列对应的各切分图像序列进行识别,获取切分图像特征,并采用分类函数对切分图像特征进行分类,获取切分图像识别结果,然后按照时间顺序对切分图像识别结果进行拼接,以获取唇部图像识别结果,无需人工干预,可由唇部图像识别模型自动得出,提高了识别效率和准确性。
在一实施例中,如图4所示,该唇语识别方法还包括获取唇部图像识别模型,具体包括如下步骤:
S601:获取训练图像序列,训练图像序列携带有图像文本标签,将训练图像序列划分为图像序列训练集和图像序列测试集。
其中,训练图像序列指多个仅包含唇部动作且满足预设长度的图像形成的图像序列。图像文本标签指用于表示训练图像序列的文本标签,本实施例中的图像文本标签具体为一个词或者一句话。
具体地,在获取训练图像序列后,将训练图像序列划分为图像序列训练集和图像序列测试集,以使得图像序列训练集用于训练3D CNN模型,图像序列测试集用于测试训练好的3D CNN模型的准确性。
S602:将图像序列训练集中的训练图像序列输入到3D卷积神经网络模型中,获取训练结果。
具体地,在获取图像序列训练集后,将对应的图像序列训练集输入3D卷积神经网络(3D CNN)模型中进行训练,通过每一层卷积层的计算,获取每一层卷积层的输出。卷积层的输出可以通过公式 $a_m^l=\sigma(z_m^l)=\sigma(a_m^{l-1}*W^l+b^l)$ 计算,其中,$a_m^l$ 表示第l层卷积层对第m个训练图像序列的输出,$z_m^l$ 表示未采用激活函数处理前第m个训练图像序列的输出,$a_m^{l-1}$ 表示第l-1层卷积层对第m个训练图像序列的输出(即上一层的输出),σ表示激活函数,对于卷积层采用的激活函数σ为ReLU(Rectified Linear Unit,线性整流函数),相比其他激活函数效果更好,*表示卷积运算,$W^l$ 表示第l层卷积层的卷积核(权值),$b^l$ 表示第l层卷积层的偏置。若第l层是池化层,则在池化层采用最大池化的下采样对卷积层的输出进行降维处理,具体公式为 $a_m^l=\mathrm{pool}(a_m^{l-1})$,其中pool指下采样计算,该下采样计算可以选择最大池化的方法,最大池化实际上就是在池化窗口内的样本中取最大值。然后再使用全连接层将所有数据特征进行整合,形成切分图像特征。
最后通过输出层的计算公式 $T^{(m)}=\mathrm{softmax}(W^{o}a_m^{L}+b^{o})$ 获取输出层的输出,其中,$T^{(m)}$ 表示3D CNN输出层对第m个训练图像序列的输出,即第m个训练图像序列对应的训练结果,$a_m^{L}$ 为全连接层整合得到的切分图像特征,$W^{o}$、$b^{o}$ 分别为输出层的权值和偏置。
S603:根据训练结果和图像文本标签,构建损失函数,并通过损失函数更新调整3D卷积神经网络模型的权值和偏置,获取唇部图像训练模型。
具体地,在获取训练结果后,会通过训练结果与图像文本标签构建损失函数,并通过对损失函数求偏导,更新调整3D CNN模型中的权值和偏置,获取唇部图像训练模型。
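"构建损失函数并通过求偏导更新权值和偏置"的训练过程,可用如下示意性训练片段表示(沿用上文示意的 LipNet3DCNN 实例 model;损失此处采用交叉熵,优化器与学习率均为假设配置,并非本申请限定的实现):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                          # 以训练结果与图像文本标签构建损失函数
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # 假设的优化器与学习率

def train_step(batch_sequences: torch.Tensor, batch_labels: torch.Tensor) -> float:
    """单步训练:前向计算训练结果,构建损失,反向求偏导并更新3D CNN的权值和偏置。"""
    optimizer.zero_grad()
    outputs = model(batch_sequences)         # 训练图像序列输入3D CNN,得到训练结果
    loss = criterion(outputs, batch_labels)  # 与图像文本标签比较,构建损失函数
    loss.backward()                          # 对损失函数求偏导(反向传播)
    optimizer.step()                         # 更新调整权值和偏置
    return loss.item()
```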
S604:使用图像序列测试集中的训练图像序列对唇部图像训练模型进行测试,当图像序列测试集对应的输出结果与图像文本标签的误差在预设误差范围内,则将唇部图像训练模型作为唇部图像识别模型。
具体地,在获取唇部图像训练模型后,为了防止出现过拟合问题,还需要使用图像序列测试集中的训练图像序列对唇部图像训练模型进行测试,以确定训练好的唇部图像训练模型是否准确。当图像序列测试集对应的输出结果与图像文本标签的误差在预设误差范围内(如0-10%),则将唇部图像训练模型作为唇部图像识别模型。
步骤S601-步骤S604,通过将图像序列训练集中的训练图像序列输入到3D卷积神经网络模型中进行训练,获取唇部图像训练模型,并使用图像序列测试集中的训练图像序列对唇部图像训练模型进行验证测试,当图像序列测试集对应的输出结果与图像文本标签的误差在预设误差范围内,则表示唇部图像训练模型满足要求,可以作为唇部图像识别模型。
在一实施例中,如图5所示,该唇语识别方法还包括获取语音识别模型,具体包括如下步骤:
S701:获取训练语音,对训练语音进行预处理,获取目标语音。
其中,训练语音指从原始视频中获取的用于进行模型训练的唇读语音。
具体地,在获取训练语音后,训练语音中不可避免地会包括静音段和噪音段,为了不影响训练结果的准确性,在获取训练语音后,需要对训练语音进行预处理,去除训练语音中的静音段和噪音段,保留声纹连续变化明显的目标语音。其中,静音段指训练语音中由于静默而没有发音的语音部分,如说话人在说话过程中进行思考、呼吸等情况。噪音段是指训练语音中的环境噪音部分,如门窗的开关和物体的碰撞等发出的声音。目标语音指对训练语音进行预处理后得到的仅包含声纹连续变化明显的数据。
S702:采用语音转文本技术,将目标语音转换为原始文本。
本实施例使用的语音转文本技术为ASR(Automatic Speech Recognition,自动语音识别技术),其中ASR是一种将说话人的语音转换为文本信息的技术。
具体地,在获取目标语音后,服务器采用ASR技术,将目标语音转换为原始文本。其中,原始文本指目标语音通过ASR技术转换生成的对应文字形式的文本。将目标语音转换为原始文本,是为了方便后续对文本进行文本标签处理:若直接对目标语音进行文本标签处理,由于目标语音是以语音的形式表达的,需要通过听取语音的内容来打标签,不方便操作和保存,处理速度慢;而将目标语音转换为原始文本后,以文本的形式表达出来,通过阅读文本的方式对文本的内容进行文本标签处理,方便操作,处理效率高。
S703:对原始文本进行预处理,获取目标文本,目标文本携带有对应的文本标签。
其中,目标文本指对原始文本进行预处理,去除数据和特殊符号后得到的文本。本实施例中的数据指将目标语音转换为原始文本后出现的数字;特殊符号指在将目标语音转换为原始文本后出现的不能识别的字符。如$、*、&、#、+和?。
具体地,在获取原始文本后,服务器需要对原始文本进行预处理,将原始文本中的数据和特殊符号去除,获取仅包含汉字的目标文本。在获取目标文本后,服务器将目标文本发送给客户端,客户端对应的工作人员通过阅读目标文本的内容,对目标文本进行文本标签化处理,使得目标文本获取对应的文本标签,以便后续根据目标文本和文本标签进行模型训练。
S704:将目标文本划分为训练文本和测试文本。
具体地,在获取目标文本后,将目标文本划分为训练文本和测试文本,用来训练双向RNN模型和测试训练好的双向RNN模型是否准确。其中,训练文本是用于调整双向RNN模型中的参数的文本。测试文本是用于测试训练好的双向RNN模型的识别准确率的文本。
S705:将训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型。
其中,双向循环神经网络(Bi-directional Recurrent Neural Networks,简称BRNN)模型是由两个RNN(Recurrent Neural Networks,RNN)组成的,为了便于描述,将其中一个RNN称之为向前RNN,另外一个RNN称为向后RNN。双向循环神经网络(BRNN)模型中的向前RNN和向后RNN有各自对应的隐藏层,输入层和输出层共用一个。即双向RNN模型是由一个输入层、两个隐藏层和一个输出层组成的神经网络模型。该双向RNN模型包括各层之间的神经元连接的权值和偏置,权值和偏置是双向RNN模型中的参数,这些权值和偏置决定双向RNN模型的性质及识别效果。
S706:将测试文本输入到有效双向循环神经网络模型中进行测试,获取测试文本对应的准确率,若准确率达到预设阈值,则将有效双向循环神经网络模型确定为语音识别模型。
具体地,在获取有效双向循环神经网络模型后,为了防止出现过拟合问题,即防止出现只在识别训练文本时具有准确性,在识别其他不是训练文本的内容时不具备准确性的情况,还需要使用测试文本对有效双向循环神经网络模型进行测试,以确定训练好的有效双向循环神经网络模型是否准确。若准确率达到预设阈值(如95%),则表示有效双向循环神经网络模型的准确性满足要求,可以作为语音识别模型。
步骤S701-步骤S706,通过对训练语音进行预处理,获取目标语音,并采用语音转文本技术,将目标语音转换为原始文本,以方便执行后续步骤。然后对原始文本进行预处理,获取目标文本,并将目标文本划分为训练文本和测试文本,用来训练双向RNN模型和测试训练好的双向RNN模型,保证有效双向循环神经网络模型的准确性满足要求,可以作为语音识别模型。
在一实施例中,如图6所示,步骤S705,将训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型,具体包括如下步骤:
S7051:对原始双向循环神经网络模型中的权值和偏置进行初始化设置。
本实施例中,采用预设值对权值和偏置进行初始化设置,该预设值是开发人员根据经验预先设置好的值。采用预设值对双向RNN模型的权值和偏置进行初始化设置,可以在后续根据训练文本进行双向RNN模型训练时,缩短模型的训练时间,提高模型的识别准确率。若对权值和偏置的初始化设置不恰当,则会导致模型在初始阶段的调整能力很差,从而影响该双向RNN模型后续对语音的区分效果。
S7052:将训练文本转换成词向量,并将词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果。
具体地,通过词向量转换工具将训练文本中的词转换为词向量,一个训练文本中包括至少一个词向量。本实施例中使用的词向量转换工具为word2vec(word to vector,单词转换向量),其中,word2vec是一种将单词转换为向量的工具,该工具可以将每一个词映射成对应的向量。
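将训练文本转换为词向量的过程可参考如下示意性片段(此处假设使用开源库 gensim 中的 word2vec 实现,gensim 4.x 中向量维度参数名为 vector_size;语料与维度均为示例值):

```python
from gensim.models import Word2Vec

# 已分词的训练文本语料(示例)
corpus = [["今天", "天气", "很", "好"],
          ["我们", "一起", "去", "公园", "散步"]]

w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

def text_to_vectors(tokens):
    """将一条已分词的训练文本映射为对应的词向量序列,作为双向RNN模型的输入。"""
    return [w2v.wv[t] for t in tokens if t in w2v.wv]

vectors = text_to_vectors(["今天", "天气", "很", "好"])  # 每个词映射成100维词向量
```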
将训练文本转换成词向量后,将词向量分别输入到向前RNN和向后RNN的隐藏层进行计算,获取向前隐藏层和向后隐藏层的输出;然后使用attention机制分别对向前隐藏层的输出和向后隐藏层的输出进行注意程度分配;最后将经attention机制处理后的两个输出进行融合处理,得到最终输入到双向循环神经网络模型输出层的值,并通过输出层的计算,获取模型输出结果。其中,attention机制即注意力机制,指根据数据重要性的不同对数据赋予不同的权重,重要性大的对应的权重大,重要性小的对应的权重小。模型输出结果是训练文本通过双向RNN模型训练获取的输出。本实施例中的融合处理包括但不限于算术平均值法和加权平均值法,为了便于描述,后续步骤使用算术平均值法对attention机制处理后的两个输出进行融合处理。
S7053:基于模型输出结果更新原始双向循环神经网络模型中的权值和偏置,获取有效双向循环神经网络模型。
具体地,原始双向RNN模型的输出层计算出模型输出结果后,与文本标签 $y_t$ 构建损失函数。然后根据损失函数采用反向传播算法,分别对双向RNN模型中的权值和偏置求偏导,调整向前RNN和向后RNN的权值和偏置,获取有效双向RNN模型。其中,反向传播(Back Propagation)算法是指按照时序状态的反向顺序调整隐藏层与原始双向RNN模型的输出层之间的权值和偏置、以及输入层与隐藏层之间的权值和偏置的算法。
进一步地,损失函数以模型输出结果与词向量对应的文本标签构建,可表示为交叉熵形式 $L(\theta)=-\sum_{t=1}^{T}y_t\log\hat{y}_t$,其中,T表示训练文本对应的序列长度,θ表示权值和偏置的集合(U、V、W、b、c),$y_t$ 表示词向量对应的文本标签,$\hat{y}_t$ 表示t时刻的模型输出结果。然后按照 $\theta\leftarrow\theta-\eta\frac{\partial L(\theta)}{\partial\theta}$ 对双向RNN模型中的权值和偏置求偏导并进行更新,其中η为学习率。
步骤S7051-步骤S7053,通过对原始双向循环神经网络模型中的权值和偏置进行初始化设置,以缩短后续模型的训练时间。将训练文本对应的词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果,并基于模型输出结果构建损失函数,以更新原始双向循环神经网络模型的权值和偏置,使得原始双向循环神经网络模型成为可以识别训练文本的有效双向循环神经网络模型。
在一实施例中,如图7所示,步骤S7052,将训练文本转换成词向量,并将词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果,具体包括如下步骤:
S70521:将训练文本转换成词向量,并将词向量输入到原始双向循环神经网络模型的输入层,输入层将获取到的词向量输入到向前循环神经网络的向前隐藏层中,并使用注意力机制进行处理,获取向前输出。
其中,向前隐藏层指向前循环神经网络的隐藏层。具体地,将训练文本对应的词向量输入到原始双向RNN模型的输入层,输入层将获取到的词向量输入到向前隐藏层中,在向前隐藏层中通过公式 $h_{t1}=\sigma(Ux_t+Wh_{t-1}+b)$ 计算向前隐藏层的输出。其中,σ表示向前RNN隐藏层的激活函数,U表示原始双向RNN模型的输入层与向前RNN隐藏层之间的权值,W表示向前RNN各隐藏层之间的权值,b表示原始双向RNN模型的输入层与向前RNN之间的偏置,$x_t$ 表示原始双向RNN模型的输入层中t时刻输入的词向量,$h_{t1}$ 表示向前RNN的隐藏层对t时刻对应的词向量的输出,$h_{t-1}$ 表示向前RNN的隐藏层对t-1时刻对应的词向量的输出。
使用attention机制对向前隐藏层的输出进行处理,获取向前输出。其中,向前输出指使用attention机制对向前隐藏层的输出进行处理后得到的值。具体地,根据公式 $c_{t1}=\sum_{j}\alpha_{tj}h_j$ 计算语义向量的重要值,其中,$c_{t1}$ 指attention机制对向前循环神经网络的隐藏层中t时刻的语义向量的注意程度(即重要值),$\alpha_{tj}$ 指第j个输入的词向量与t时刻对应的词向量的相关性,$h_j$ 指第j个输入的词向量通过向前隐藏层得到的输出。进一步地,归一化过程为 $\alpha_{tj}=\frac{\exp(e_{tj})}{\sum_{k}\exp(e_{tk})}$,其中,k指求和时遍历的第k个输入的词向量。然后 $e_{tj}=V^{\mathsf T}\tanh(U\cdot h_j+WS_{t-1}+b)$,其中,V表示隐藏层和输出层之间的权重,$V^{\mathsf T}$ 表示权重V的转置,$S_{t-1}$ 指t-1时刻双向循环神经网络输出层得到的输出。
S70522:输入层将获取到的词向量输入到向后循环神经网络的向后隐藏层中,并使用注意力机制进行处理,获取向后输出。
其中,向后隐藏层指向后循环神经网络的隐藏层。具体地,将训练文本对应的词向量输入到原始双向RNN模型的输入层,输入层将获取到的词向量输入到向后隐藏层中,在向后隐藏层中通过公式 $h_{t2}=\sigma(Ux_t+Wh_{t-1}+b)$ 计算向后隐藏层的输出。其中,σ表示向后RNN隐藏层的激活函数,U表示原始双向RNN模型的输入层与向后RNN隐藏层之间的权值,W表示向后RNN各隐藏层之间的权值,b表示原始双向RNN模型的输入层与向后RNN之间的偏置,$x_t$ 表示原始双向RNN模型的输入层中t时刻输入的词向量,$h_{t2}$ 表示向后RNN的隐藏层对t时刻对应的词向量的输出,$h_{t-1}$ 表示向后RNN的隐藏层对t-1时刻对应的词向量的输出。
使用attention机制对向后隐藏层的输出进行处理,获取向后输出。其中,向后输出指使用attention机制对向后隐藏层的输出进行处理后得到的值。具体地,根据公式 $c_{t2}=\sum_{j}\alpha_{tj}h_j$ 计算语义向量的重要值,其中,$c_{t2}$ 指attention机制对向后循环神经网络的隐藏层中t时刻的语义向量的注意程度(即重要值),$\alpha_{tj}$ 指第j个输入的词向量与t时刻对应的词向量的相关性,$h_j$ 指第j个输入的词向量通过向后隐藏层得到的输出。进一步地,归一化过程为 $\alpha_{tj}=\frac{\exp(e_{tj})}{\sum_{k}\exp(e_{tk})}$,其中,k指求和时遍历的第k个输入的词向量。然后 $e_{tj}=V^{\mathsf T}\tanh(U\cdot h_j+WS_{t-1}+b)$,其中,V表示隐藏层和输出层之间的权重,$V^{\mathsf T}$ 表示权重V的转置,$S_{t-1}$ 指t-1时刻双向循环神经网络输出层得到的输出。
S70523:对向前输出和向后输出进行融合处理,获取模型输出结果。
具体地,获取向前输出和向后输出后,使用算术平均值公式 $c_t=\frac{c_{t1}+c_{t2}}{2}$ 对向前输出和向后输出进行融合处理,获取模型输出结果。其中,模型输出结果指最终要输入到输出层的输出(即融合后得到的值 $c_t$)。
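上述向前/向后注意力及其融合过程,可用如下NumPy示意性片段表示(维度与各参数均为随机初始化的假设值,仅用于说明 $e_{tj}$、$\alpha_{tj}$、$c_t$ 的计算顺序,并非本申请限定的实现):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # 假设的隐藏层维度
U, W, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
V = rng.normal(size=d)                     # 隐藏层与输出层之间的权重
s_prev = rng.normal(size=d)                # S_{t-1}: 上一时刻输出层得到的输出

def attention_context(h_list):
    """按 e_tj = V^T tanh(U·h_j + W·S_{t-1} + b) 计算注意程度并加权求和,得到语义向量 c_t。"""
    e = np.array([V @ np.tanh(U @ h + W @ s_prev + b) for h in h_list])
    alpha = np.exp(e) / np.exp(e).sum()    # 归一化,得到各词向量的相关性权重 α_tj
    return sum(a * h for a, h in zip(alpha, h_list))

h_forward = [rng.normal(size=d) for _ in range(5)]    # 向前隐藏层各时刻的输出 h_j
h_backward = [rng.normal(size=d) for _ in range(5)]   # 向后隐藏层各时刻的输出 h_j
c_t1, c_t2 = attention_context(h_forward), attention_context(h_backward)
fused = (c_t1 + c_t2) / 2                  # 算术平均值法融合向前输出与向后输出
```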
步骤S70521-步骤S70523,在对原始双向循环神经网络模型中进行训练过程中,使用注意力机制,使得获取的向前输出和向后输出为训练文本中重要的词向量对应的输出,以使后续获取的模型输出结果为可以反映训练文本主要的含义的结果。
本申请提供的唇语识别方法,通过将原始视频的帧率调整为标准帧率,以获取标准视频。然后对标准视频中的音频数据和视频数据进行分离,获取有效音频流和有效视频流。使用人脸识别算法跟踪有效视频流中的人脸,并提取人脸中的嘴部区域,获取帧唇部动作视频,并将帧唇部动作视频调整为相同的帧宽和帧高,以方便将帧唇部动作视频拼接为唇部图像序列。为了方便唇部图像识别模型进行识别,还需要预先对唇部图像序列进行切分,以使唇部图像序列在输入唇部图像识别模型时,为满足预设长度的切分图像序列。最后将切分图像序列对应的识别结果按照时间顺序进行拼接,获取唇部图像识别结果。为了进一步验证唇部图像识别结果是否准确,还需要将有效音频流输入到语音识别模型中,获取对应的语音识别结果,并计算语音识别结果和唇部图像识别结果的相似性,当二者的相似度达到预设值,表示唇部图像识别结果是准确的,可以作为原始视频的唇语识别结果,以保证唇部图像识别结果的准确性。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
在一实施例中,提供一种唇语识别装置,该唇语识别装置与上述实施例中唇语识别方法一一对应。如图8所示,该唇语识别装置包括原始视频处理模块10、标准视频处理模块20、帧视频获取模块30、帧视频处理模块40、图像序列切分模块50、第一模型识别模块60、第二模型识别模块70和结果验证模块80。各功能模块详细说明如下:
原始视频处理模块10,用于获取原始视频,对原始视频的帧率进行标准化处理,获取标准视频。
标准视频处理模块20,用于对标准视频进行分离,获取有效音频流和有效视频流。
帧视频获取模块30,用于使用人脸识别算法跟踪有效视频流中的人脸,并提取人脸中的嘴部区域,获取帧唇部动作视频。
帧视频处理模块40,用于对帧唇部动作视频进行处理,获取唇部图像序列。
图像序列切分模块50,用于采用序列切分规则对唇部图像序列进行切分,获取切分图像序列。
第一模型识别模块60,用于将唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果。
第二模型识别模块70,用于将有效音频流输入到语音识别模型中,获取语音识别结果。
结果验证模块80,用于计算唇部图像识别结果和语音识别结果的相似度,当相似度达到预设值,则将唇部图像识别结果作为原始视频的唇语识别结果。
进一步地,第一模型识别模块60包括图像特征获取单元61、图像特征处理单元62和图像识别结果获取单元63。
图像特征获取单元61,用于通过唇部图像识别模型对唇部图像序列对应的各切分图像序列进行识别,获取切分图像特征。
图像特征处理单元62,用于采用分类函数对切分图像特征进行分类,获取切分图像识别结果。
图像识别结果获取单元63,用于按照时间顺序对切分图像识别结果进行拼接,获取唇部图像识别结果。
进一步地,唇语识别装置还包括训练图像数据获取单元、图像训练结果获取单元、第一模型训练单元和第一模型获取单元。
训练图像数据获取单元,用于获取训练图像序列,训练图像序列携带有图像文本标签,将训练图像序列划分为图像序列训练集和图像序列测试集。
图像训练结果获取单元,用于将图像序列训练集中的训练图像序列输入到3D卷积神经网络模型中,获取训练结果。
第一模型训练单元,用于根据训练结果和图像文本标签,构建损失函数,并通过损失函数更新调整3D卷积神经网络模型的权值和偏置,获取唇部图像训练模型。
第一模型获取单元,用于使用图像序列测试集中的训练图像序列对唇部图像训练模型进行测试,当图像序列测试集对应的输出结果与图像文本标签的误差在预设误差范围内,则将唇部图像训练模型作为唇部图像识别模型。
进一步地,唇语识别装置还包括训练语音数据获取单元、语音处理单元、文本处理单元、文本划分单元、第二模型训练单元和第二模型获取单元。
训练语音数据获取单元,用于获取训练语音,对训练语音进行预处理,获取目标语音。
语音处理单元,用于采用语音转文本技术,将目标语音转换为原始文本。
文本处理单元,用于对原始文本进行预处理,获取目标文本,目标文本携带有对应的文本标签。
文本划分单元,用于将目标文本划分为训练文本和测试文本。
第二模型训练单元,用于将训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型。
第二模型获取单元,用于将测试文本输入到有效双向循环神经网络模型中进行测试,获取测试文本对应的准确率,若准确率达到预设阈值,则将有效双向循环神经网络模型确定为语音识别模型。
进一步地,第二模型训练单元包括参数初始化单元、模型输出结果获取单元和参数更新单元。
参数初始化单元,用于对原始双向循环神经网络模型中的权值和偏置进行初始化设置。
模型输出结果获取单元,用于将训练文本转换成词向量,并将词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果。
参数更新单元,用于基于模型输出结果更新原始双向循环神经网络模型中的权值和偏置,获取有效双向循环神经网络模型。
进一步地,模型输出结果获取单元包括向前输出获取单元、向后输出获取单元和输出处理单元。
向前输出获取单元,用于将训练文本转换成词向量,并将词向量输入到原始双向循环神经网络模型的输入层,输入层将获取到的词向量输入到向前循环神经网络的向前隐藏层中,并使用注意力机制进行处理,获取向前输出。
向后输出获取单元,用于输入层将获取到的词向量输入到向后循环神经网络的向后隐藏层中,并使用注意力机制进行处理,获取向后输出。
输出处理单元,用于对向前输出和向后输出进行融合处理,获取模型输出结果。
关于唇语识别装置的具体限定可以参见上文中对于唇语识别方法的限定,在此不再赘述。上述唇语识别装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图9所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括可读存储介质、内存储器。该可读存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为可读存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储唇语识别方法涉及到的数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种唇语识别方法。本实施例所提供的可读存储介质包括非易失性可读存储介质和易失性可读存储介质。
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,处理器执行计算机可读指令时实现上述实施例的唇语识别方法,如图2所示的步骤S10-步骤S80,或者图3至图7中所示的步骤,为避免重复,这里不再赘述。或者,处理器执行计算机可读指令时实现上述唇语识别装置这一实施例中的各模块/单元的功能,例如图8所示的模块10至模块80的功能,为避免重复,这里不再赘述。
在一个实施例中,提供一个或多个存储有计算机可读指令的可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行上述实施例的唇语识别方法,如图2所示的步骤S10-步骤S80,或者图3至图7中所示的步骤,为避免重复,这里不再赘述。或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时实现上述唇语识别装置这一实施例中的各模块/单元的功能,例如图8所示的模块10至模块80的功能,为避免重复,这里不再赘述。本实施例所提供的可读存储介质包括非易失性可读存储介质和易失性可读存储介质。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种唇语识别方法,其特征在于,包括:
    获取原始视频,对所述原始视频的帧率进行标准化处理,获取标准视频;
    对所述标准视频进行分离,获取有效音频流和有效视频流;
    使用人脸识别算法跟踪所述有效视频流中的人脸,并提取所述人脸中的嘴部区域,获取帧唇部动作视频;
    对所述帧唇部动作视频进行处理,获取唇部图像序列;
    采用序列切分规则对所述唇部图像序列进行切分,获取切分图像序列;
    将所述唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果;
    将所述有效音频流输入到语音识别模型中,获取语音识别结果;
    计算所述唇部图像识别结果和所述语音识别结果的相似度,当所述相似度达到预设值,则将所述唇部图像识别结果作为所述原始视频的唇语识别结果。
  2. 如权利要求1所述的唇语识别方法,其特征在于,所述将所述唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果,包括:
    通过所述唇部图像识别模型对所述唇部图像序列对应的各切分图像序列进行识别,获取切分图像特征;
    采用分类函数对所述切分图像特征进行分类,获取切分图像识别结果;
    按照时间顺序对所述切分图像识别结果进行拼接,获取唇部图像识别结果。
  3. 如权利要求1所述的唇语识别方法,其特征在于,所述唇语识别方法还包括:
    获取训练图像序列,所述训练图像序列携带有图像文本标签,将所述训练图像序列划分为图像序列训练集和图像序列测试集;
    将所述图像序列训练集中的训练图像序列输入到3D卷积神经网络模型中,获取训练结果;
    根据所述训练结果和所述图像文本标签,构建损失函数,并通过所述损失函数更新调整所述3D卷积神经网络模型的权值和偏置,获取唇部图像训练模型;
    使用所述图像序列测试集中的训练图像序列对所述唇部图像训练模型进行测试,当所述图像序列测试集对应的输出结果与所述图像文本标签的误差在预设误差范围内,则将所述唇部图像训练模型作为唇部图像识别模型。
  4. 如权利要求1所述的唇语识别方法,其特征在于,所述唇语识别方法还包括:
    获取训练语音,对所述训练语音进行预处理,获取目标语音;
    采用语音转文本技术,将所述目标语音转换为原始文本;
    对所述原始文本进行预处理,获取目标文本,所述目标文本携带有对应的文本标签;
    将所述目标文本划分为训练文本和测试文本;
    将所述训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型;
    将所述测试文本输入到所述有效双向循环神经网络模型中进行测试,获取所述测试文本对应的准确率,若所述准确率达到预设阈值,则将所述有效双向循环神经网络模型确定为语音识别模型。
  5. 如权利要求4所述的唇语识别方法,其特征在于,所述将所述训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型,包括:
    对原始双向循环神经网络模型中的权值和偏置进行初始化设置;
    将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果;
    基于所述模型输出结果更新所述原始双向循环神经网络模型中的权值和偏置,获取有效双向循环神经网络模型。
  6. 如权利要求5所述的唇语识别方法,其特征在于,所述将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果,包括:
    将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型的输入层,输入层将获取到的所述词向量输入到向前循环神经网络的向前隐藏层中,并使用注意力机制进行处理,获取向前输出;
    输入层将获取到的所述词向量输入到向后循环神经网络的向后隐藏层中,并使用注意力机制进行处理,获取向后输出;
    对向前输出和向后输出进行融合处理,获取模型输出结果。
  7. 一种唇语识别装置,其特征在于,包括:
    原始视频处理模块,用于获取原始视频,对所述原始视频的帧率进行标准化处理,获取标准视频;
    标准视频处理模块,用于对所述标准视频进行分离,获取有效音频流和有效视频流;
    帧视频获取模块,用于使用人脸识别算法跟踪所述有效视频流中的人脸,并提取所述人脸中的嘴部区域,获取帧唇部动作视频;
    帧视频处理模块,用于对所述帧唇部动作视频进行处理,获取唇部图像序列;
    图像序列切分模块,用于采用序列切分规则对所述唇部图像序列进行切分,获取切分图像序列;
    第一模型识别模块,用于将所述唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果;
    第二模型识别模块,用于将所述有效音频流输入到语音识别模型中,获取语音识别结果;
    结果验证模块,用于计算所述唇部图像识别结果和所述语音识别结果的相似度,当所述相似度达到预设值,则将所述唇部图像识别结果作为所述原始视频的唇语识别结果。
  8. 如权利要求7所述的唇语识别装置,其特征在于,第一模型识别模块包括:
    图像特征获取单元,用于通过所述唇部图像识别模型对所述唇部图像序列对应的各切分图像序列进行识别,获取切分图像特征;
    图像特征处理单元,用于采用分类函数对所述切分图像特征进行分类,获取切分图像识别结果;
    图像识别结果获取单元,用于按照时间顺序对所述切分图像识别结果进行拼接,获取唇部图像识别结果。
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:
    获取原始视频,对所述原始视频的帧率进行标准化处理,获取标准视频;
    对所述标准视频进行分离,获取有效音频流和有效视频流;
    使用人脸识别算法跟踪所述有效视频流中的人脸,并提取所述人脸中的嘴部区域,获取帧唇部动作视频;
    对所述帧唇部动作视频进行处理,获取唇部图像序列;
    采用序列切分规则对所述唇部图像序列进行切分,获取切分图像序列;
    将所述唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果;
    将所述有效音频流输入到语音识别模型中,获取语音识别结果;
    计算所述唇部图像识别结果和所述语音识别结果的相似度,当所述相似度达到预设值,则将所述唇部图像识别结果作为所述原始视频的唇语识别结果。
  10. 如权利要求9所述的计算机设备,其特征在于,所述将所述唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果,包括:
    通过所述唇部图像识别模型对所述唇部图像序列对应的各切分图像序列进行识别,获取切分图像特征;
    采用分类函数对所述切分图像特征进行分类,获取切分图像识别结果;
    按照时间顺序对所述切分图像识别结果进行拼接,获取唇部图像识别结果。
  11. 如权利要求9所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:
    获取训练图像序列,所述训练图像序列携带有图像文本标签,将所述训练图像序列划分为图像序列训练集和图像序列测试集;
    将所述图像序列训练集中的训练图像序列输入到3D卷积神经网络模型中,获取训练结果;
    根据所述训练结果和所述图像文本标签,构建损失函数,并通过所述损失函数更新调整所述3D卷积神经网络模型的权值和偏置,获取唇部图像训练模型;
    使用所述图像序列测试集中的训练图像序列对所述唇部图像训练模型进行测试,当所述图像序列测试集对应的输出结果与所述图像文本标签的误差在预设误差范围内,则将所述唇部图像训练模型作为唇部图像识别模型。
  12. 如权利要求9所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:
    获取训练语音,对所述训练语音进行预处理,获取目标语音;
    采用语音转文本技术,将所述目标语音转换为原始文本;
    对所述原始文本进行预处理,获取目标文本,所述目标文本携带有对应的文本标签;
    将所述目标文本划分为训练文本和测试文本;
    将所述训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型;
    将所述测试文本输入到所述有效双向循环神经网络模型中进行测试,获取所述测试文本对应的准确率,若所述准确率达到预设阈值,则将所述有效双向循环神经网络模型确定为语音识别模型。
  13. 如权利要求12所述的计算机设备,其特征在于,所述将所述训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型,包括:
    对原始双向循环神经网络模型中的权值和偏置进行初始化设置;
    将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果;
    基于所述模型输出结果更新所述原始双向循环神经网络模型中的权值和偏置,获取有效双向循环神经网络模型。
  14. 如权利要求13所述的计算机设备,其特征在于,所述将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果,包括:
    将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型的输入层,输入层将获取到的所述词向量输入到向前循环神经网络的向前隐藏层中,并使用注意力机制进行处理,获取向前输出;
    输入层将获取到的所述词向量输入到向后循环神经网络的向后隐藏层中,并使用注意力机制进行处理,获取向后输出;
    对向前输出和向后输出进行融合处理,获取模型输出结果。
  15. 一个或多个存储有计算机可读指令的可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:
    获取原始视频,对所述原始视频的帧率进行标准化处理,获取标准视频;
    对所述标准视频进行分离,获取有效音频流和有效视频流;
    使用人脸识别算法跟踪所述有效视频流中的人脸,并提取所述人脸中的嘴部区域,获取帧唇部动作视频;
    对所述帧唇部动作视频进行处理,获取唇部图像序列;
    采用序列切分规则对所述唇部图像序列进行切分,获取切分图像序列;
    将所述唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果;
    将所述有效音频流输入到语音识别模型中,获取语音识别结果;
    计算所述唇部图像识别结果和所述语音识别结果的相似度,当所述相似度达到预设值,则将所述唇部图像识别结果作为所述原始视频的唇语识别结果。
  16. 如权利要求15所述的可读存储介质,其特征在于,所述将所述唇部图像序列对应的各切分图像序列依次输入到唇部图像识别模型中进行识别,获取唇部图像识别结果,包括:
    通过所述唇部图像识别模型对所述唇部图像序列对应的各切分图像序列进行识别,获取切分图像特征;
    采用分类函数对所述切分图像特征进行分类,获取切分图像识别结果;
    按照时间顺序对所述切分图像识别结果进行拼接,获取唇部图像识别结果。
  17. 如权利要求15所述的可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还执行如下步骤:
    获取训练图像序列,所述训练图像序列携带有图像文本标签,将所述训练图像序列划分为图像序列训练集和图像序列测试集;
    将所述图像序列训练集中的训练图像序列输入到3D卷积神经网络模型中,获取训练结果;
    根据所述训练结果和所述图像文本标签,构建损失函数,并通过所述损失函数更新调整所述3D卷积神经网络模型的权值和偏置,获取唇部图像训练模型;
    使用所述图像序列测试集中的训练图像序列对所述唇部图像训练模型进行测试,当所述图像序列测试集对应的输出结果与所述图像文本标签的误差在预设误差范围内,则将所述唇部图像训练模型作为唇部图像识别模型。
  18. 如权利要求15所述的可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还执行如下步骤:
    获取训练语音,对所述训练语音进行预处理,获取目标语音;
    采用语音转文本技术,将所述目标语音转换为原始文本;
    对所述原始文本进行预处理,获取目标文本,所述目标文本携带有对应的文本标签;
    将所述目标文本划分为训练文本和测试文本;
    将所述训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型;
    将所述测试文本输入到所述有效双向循环神经网络模型中进行测试,获取所述测试文本对应的准确率,若所述准确率达到预设阈值,则将所述有效双向循环神经网络模型确定为语音识别模型。
  19. 如权利要求18所述的可读存储介质,其特征在于,所述将所述训练文本输入到原始双向循环神经网络模型中进行训练,获取有效双向循环神经网络模型,包括:
    对原始双向循环神经网络模型中的权值和偏置进行初始化设置;
    将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果;
    基于所述模型输出结果更新所述原始双向循环神经网络模型中的权值和偏置,获取有效双向循环神经网络模型。
  20. 如权利要求19所述的可读存储介质,其特征在于,所述将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型中进行训练,获取模型输出结果,包括:
    将训练文本转换成词向量,并将所述词向量输入到原始双向循环神经网络模型的输入层,输入层将获取到的所述词向量输入到向前循环神经网络的向前隐藏层中,并使用注意力机制进行处理,获取向前输出;
    输入层将获取到的所述词向量输入到向后循环神经网络的向后隐藏层中,并使用注意力机制进行处理,获取向后输出;
    对向前输出和向后输出进行融合处理,获取模型输出结果。
PCT/CN2019/102569 2019-05-21 2019-08-26 唇语识别方法、装置、计算机设备及存储介质 WO2020232867A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910424466.7A CN110276259B (zh) 2019-05-21 2019-05-21 唇语识别方法、装置、计算机设备及存储介质
CN201910424466.7 2019-05-21



Also Published As

Publication number Publication date
CN110276259A (zh) 2019-09-24
CN110276259B (zh) 2024-04-02

