CN113571051A - Voice recognition system and method for lip voice activity detection and result error correction - Google Patents


Info

Publication number
CN113571051A
Authority
CN
China
Prior art keywords
voice
original
result
recognition
lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110654992.XA
Other languages
Chinese (zh)
Inventor
冯伟
史鹏
高丽清
刘泽康
刘之谏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110654992.XA
Publication of CN113571051A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a voice recognition system and method for lip voice activity detection and result error correction. The audio/video processing module processes collected videos containing human faces, divides them into video segments, and extracts the segments that contain audio. The voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each one contains speech. The voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, producing an original recognition result. The recognition result error corrector then corrects the recognizer's output.

Description

Voice recognition system and method for lip voice activity detection and result error correction
Technical Field
The invention belongs to the field of artificial intelligence, computer vision and voice recognition, and particularly relates to a voice recognition system and method for lip voice activity detection and result error correction.
Background
With the development of computer technology, human-machine interaction has become ever more frequent, and among the various interaction modes, voice is one that cannot be ignored. Speech recognition technology is widely used in smart homes, mobile phone voice assistants, and the like. According to the Mary Meeker annual Internet report of March 2017, the word accuracy of Google's machine-learning-based speech recognition system reached 95% for English, approaching the accuracy of human speech recognition. Recognition accuracy in quiet scenes has therefore reached a high level. In noisy scenes, however, accuracy suffers greatly, and there are many sources of recognition error in a noisy environment. Voice Activity Detection (VAD) is a technique in the field of speech signal processing that determines from an input signal whether a user is speaking and intercepts the valid speech segments for subsequent recognition. Voice activity detection reduces the computational load of speech recognition and reduces false recognition under noise. Conversely, poor voice activity detection prevents accurate sentence segmentation of the audio, so the recognizer's built-in language model cannot exploit sentence context and recognition accuracy drops.
At present, voice activity detection is realized in two main ways: based on the audio signal or based on the video signal. (China 201810864097) uses a posterior probability calculation to determine whether an audio frame is a speech frame. (China 202011332443.2) uses a deep learning algorithm to classify audio frames, taking frames that reach a preset silence-length threshold as segmentation points and dividing the continuous speech signal into valid speech segments. Audio-based voice activity detection is easily affected by background noise, while in practical speech recognition scenarios a large amount of electronic equipment captures the user's audio and facial video simultaneously. This patent therefore uses the video signal, performing voice activity detection from the user's lip movements, to improve the accuracy of speech recognition in noisy environments.
The LPN (Landmark Pooling Network) model proposed at Stony Brook University (State University of New York) differs from the traditional approach of using audio for silence detection: it realizes voice activity detection from the video signal by feeding face information into a deep neural network for feature learning, reaching 79.9% accuracy on the public LSW dataset. The LPN model, however, requires that the input image contain only the lip region. The open-source face detection algorithm RetinaFace, proposed by Imperial College London, achieves good accuracy on multiple datasets and can predict face key points while detecting faces.
In addition, domain-specific proper nouns make speech recognition considerably more difficult. (China 201710952988) proposes a method, based on domain recognition, for correcting the text produced by speech recognition; it computes a similarity score from the edit distance and corrects errors accordingly.
Disclosure of Invention
The invention aims to provide a voice recognition system and method with good robustness to noise. The face key point prediction module of RetinaFace is modified and trained with a dataset containing lip key points so that it can output lip region images, and voice activity detection is completed by the LPN. In addition, a domain-specific proper noun database is established and, based on the longest common subsequence method, once the specific field of the speech recognition task is designated, the recognition result is corrected so that misrecognized proper nouns are repaired. The technical scheme is as follows:
a voice recognition system for lip voice activity detection and result error correction is characterized by comprising an audio and video processing module, a voice activity detector, a voice recognizer, a proper noun database and a recognition result error corrector. Wherein the content of the first and second substances,
the audio/video processing module processes the collected videos containing human faces, divides them into video segments, and extracts the segments containing audio;
the voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each is a segment containing speech; it is divided into two parts, a lip region extractor and a lip voice activity detector. The lip region extractor is realized with the RetinaFace model and obtains lip key points and a lip region picture by detecting the human face in each audio-bearing segment; the lip voice activity detector is realized with the LPN model and judges, from the lip key points and lip region pictures of the frames in a segment, whether the segment contains speech;
the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining an original recognition result;
the proper noun database stores the proper nouns of a specific field together with the pinyin sequence corresponding to each word;
the recognition result error corrector corrects the recognizer's output as follows: the original recognition result is converted into a pinyin sequence, the longest common subsequence between this pinyin sequence and the pinyin sequence of each proper noun in the database is computed, and the original result is corrected accordingly.
Further, error correction of the original result using the longest common subsequence comprises the following steps (a dynamic-programming sketch of step (2) follows this list):
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
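Step (2) is the classic longest-common-subsequence problem. Below is a minimal Python sketch using standard dynamic programming over lists of tone-marked pinyin syllables; the function name and variables are illustrative, not taken from the patent.

def longest_common_subsequence(a, b):
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover one longest common subsequence itself
    lcs, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            lcs.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return lcs[::-1]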
The invention also provides a voice recognition method, characterized by comprising the following steps:
a preparation stage: collecting the proper nouns of a specific field, converting them into pinyin sequences, and establishing the proper noun database;
a training stage: training the voice activity detector, i.e. training the lip region extractor and the lip voice activity detector separately;
a prediction stage:
step one, the audio/video processing module divides the collected video into segments and extracts the audio from the video;
step two, the voice activity detector performs voice activity detection on each video segment;
step three, the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining the original recognition result;
step four, the recognition result error corrector corrects the original recognition result; when the original result contains proper noun recognition errors, they are corrected with the error correction method based on the longest common subsequence.
Further, the lip region extractor uses the RetinaFace model with a modified structure, extending the original 5 face key points with 18 additional lip key points so that 23 key points are predicted in total.
Further, the backbone network of the RetinaFace model is MobileNet V1-0.25; the optimizer is Adam with learning rate 0.001 and weight_decay 5e-4, and the learning rate is multiplied by 0.92 after each training epoch.
Further, the lip voice activity detector uses the LPN model, trained as follows: training on the LSW dataset with the Adagrad optimizer, initial learning rate 0.0001, the learning rate divided by 10 every 50,000 iterations, momentum 0.9, attenuation coefficient 0.0005; training ends after 200,000 iterations.
Further, error correction of the original result using the longest common subsequence comprises the following steps:
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
The invention uses deep learning and facial video information to realize voice activity detection through lip motion detection, improving the robustness of speech recognition to noisy environments, and corrects the recognition result with the longest common subsequence method, improving recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of the system.
FIG. 2 is an example picture from the WFLW dataset.
FIG. 3 is a diagram of the voice activity detection model.
FIG. 4 shows the experimental results.
Detailed Description
In order to make the technical scheme of the invention clearer, the invention is further explained with reference to the attached drawings.
Referring to fig. 1, the present invention provides a speech recognition system, which includes an audio/video processing module, a speech activity detector, a speech recognizer, a proper noun database, and a recognition result error corrector.
The audio/video processing module processes the collected videos containing human faces, divides them into segments, and extracts the segments containing audio. The voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each contains speech. The voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining the original recognition result. The proper noun database stores the proper nouns of a specific field and their corresponding pinyin sequences. The recognition result error corrector corrects the recognizer's output.
In this embodiment, the voice activity detector is divided into two parts: a lip region extractor and a lip voice activity detector. The lip region extractor is realized with the RetinaFace model and obtains lip key points and a lip region picture by detecting the human face in a video frame; the lip voice activity detector is realized with the LPN model and judges, from the lip key points and lip region pictures of several video frames, whether a video segment contains speech.
The invention provides a voice recognition method that distinguishes speech from non-speech (noise and silence) by analyzing lip movement in a video segment, performs speech recognition on the audio of segments detected as speech, and corrects the recognition result against the contents of a domain-specific lexicon, thereby improving recognition accuracy in noisy environments.
Preparation phase
Collect proper nouns of a specific field and establish the proper noun database. In this embodiment, the collected proper nouns concern hotels, restaurants, and scenic spots in the Chaoyang, Xicheng, and Dongcheng districts of Beijing, including names, hotel types, addresses, telephone numbers, facilities, recommended dishes, and so on. Each proper noun is converted into a pinyin sequence using the Pypinyin library of Python.
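For illustration, a minimal sketch of this conversion, assuming the Pypinyin package is installed; the tone-marked output matches the format used by the error corrector described below.

from pypinyin import pinyin, Style

# Convert a proper noun to its tone-marked pinyin sequence.
word = "北京烤鸭"
pinyin_seq = pinyin(word, style=Style.TONE)
print(pinyin_seq)  # [['běi'], ['jīng'], ['kǎo'], ['yā']]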
Training phase
Training a voice activity detector, respectively training a lip region extractor and a lip voice activity detector.
In this embodiment, the lip region extractor uses the RetinaFace model.
(1) Modifying the structure of the RetinaFace model: the face key point regression module of RetinaFace predicts the x and y coordinates of 5 face key points (left eye, right eye, nose tip, left mouth corner, right mouth corner) through a convolution layer with anchor_num * 10 output channels. This embodiment expands the number of key points by changing the output channels of that convolution layer from anchor_num * 10 to anchor_num * 46, so that the module predicts the x and y coordinates of 23 key points: 18 lip key points are added while the original 5 key points are retained.
(2) Training the modified RetinaFace model: training uses the WFLW dataset, whose pictures are annotated with 98 face key points, as shown in fig. 2. Key points numbered 96 (left eye), 97 (right eye), 54 (nose tip), and 76-95 (lips) are selected as the training targets of the key point regression module. The backbone network is MobileNet V1-0.25; the optimizer is Adam with learning rate 0.001 and weight_decay 5e-4, and the learning rate is multiplied by 0.92 after each training epoch.
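A minimal sketch of the head modification described in (1), assuming a PyTorch-style RetinaFace implementation whose landmark head is a 1x1 convolution; the class and parameter names are illustrative, not taken from any official repository.

import torch.nn as nn

class LandmarkHead(nn.Module):
    # Landmark regression head: predicts (x, y) per key point, per anchor.
    # num_points=5 gives the original anchor_num*10 output channels;
    # num_points=23 gives anchor_num*46, adding the 18 lip key points.
    def __init__(self, in_channels=64, anchor_num=2, num_points=23):
        super().__init__()
        self.num_points = num_points
        self.conv = nn.Conv2d(in_channels, anchor_num * num_points * 2, kernel_size=1)

    def forward(self, x):
        out = self.conv(x)  # (N, anchor_num*num_points*2, H, W)
        out = out.permute(0, 2, 3, 1).contiguous()
        return out.view(out.shape[0], -1, self.num_points * 2)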
The lip voice activity detector uses the LPN (Landmark Pooling Network) model, trained as follows: training uses the LSW dataset with the Adagrad optimizer, initial learning rate 0.0001, the learning rate divided by 10 every 50,000 iterations, momentum 0.9, and attenuation coefficient 0.0005; training ends after 200,000 iterations.
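A sketch of that schedule in PyTorch, under stated assumptions: the placeholder module stands in for the actual LPN, and since torch.optim.Adagrad has no momentum argument, the momentum 0.9 above is not reproduced here.

import torch

lpn = torch.nn.Linear(40, 2)  # placeholder module standing in for the LPN
optimizer = torch.optim.Adagrad(lpn.parameters(), lr=1e-4, weight_decay=5e-4)

for iteration in range(200000):        # training ends after 200,000 iterations
    # ... forward pass, loss computation, loss.backward(), optimizer.step() ...
    if (iteration + 1) % 50000 == 0:   # divide the learning rate by 10
        for group in optimizer.param_groups:
            group["lr"] /= 10.0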
A model diagram of a voice activity detector is shown in fig. 3.
Prediction phase
Step one, the audio/video processing module divides the collected video into segments 0.4 seconds long and extracts the segments containing audio;
and step two, the voice activity detector detects voice activity of the video clips, the input of the lip region extractor Retina face is a video frame, and the output is 20 lip key points and lip region pictures of the video frame. The input of the lip voice activity detector LPN is 20 lip keypoints and lip region pictures of each frame of the video segment, and the output is whether the current video segment is voice or not.
Thirdly, the voice recognizer performs voice recognition on the audio extracted from the video clip detected as the voice to obtain an original result of the voice recognition;
and fourthly, correcting the original result of the voice recognition by the recognition result corrector. An error correction method based on the longest common subsequence is used. The error correction method based on the longest public subsequence can correct the recognition error of proper nouns, and correct the wrong proper nouns in the original result of voice recognition on the premise of giving the word bank of the voice recognition proper nouns. The method comprises the following steps:
(1) Index each word of the domain-specific proper noun lexicon; the whole lexicon is denoted Term:
Term = [term_1, term_2, …, term_n]
where term_i denotes the i-th word in the lexicon.
(2) Convert each word in the lexicon into a tone-marked pinyin sequence. For a lexicon of n words, the set of pinyin sequences is denoted Word:
Word = [word_1, word_2, …, word_n]
where word_i is a sequence whose elements are tone-marked pinyin syllables; for example, for the word "北京烤鸭" (Beijing roast duck), the corresponding word_i is [['běi'], ['jīng'], ['kǎo'], ['yā']].
(3) Convert the original result of the speech recognition into a tone-marked pinyin sequence, denoted Result.
(4) Traverse the set Word, computing for each word the longest common subsequence of its pinyin sequence word_i and the pinyin sequence Result of the original recognition result; this subsequence is denoted lcs_i. The set of longest common subsequences between the lexicon and the original recognition result is denoted LCS:
LCS = [lcs_1, lcs_2, …, lcs_i, …, lcs_n]
where lcs_i is the longest common subsequence of the pinyin of the i-th lexicon word and the pinyin of the original recognition result.
(5) Traverse LCS. For each lcs_i, take its first character f_i and its last character l_i, and intercept from Result the subsequence beginning with f_i and ending with l_i; this is the part to be replaced, denoted replace_i. Replace denotes the set of all replace_i:
Replace = [replace_1, replace_2, …, replace_n]
(6) Traverse LCS, Word, and Replace, computing L_i, S_i, and P_i, where L_i is the length of lcs_i; S_i is the ratio of the length of lcs_i to the length of word_i; and P_i is the ratio of the smaller to the larger of the lengths of replace_i and word_i. Through experimentation, a set of rules can be determined that selects, from L_i, S_i, and P_i, the most suitable replacement word in the current lexicon. In this embodiment the rules are: L_i greater than 1, S_i greater than or equal to 0.6, and P_i greater than 0.66; among the candidates satisfying these conditions, choose the i that maximizes the product of the three and take term_i as the replacement, with the additional requirement that the lengths of replace_i and word_i differ by at most 1:
argmax_i (L_i * S_i * P_i)
(7) Replace the text corresponding to replace_i with term_i.
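Putting steps (1)-(7) together, a minimal Python sketch of the error corrector, assuming Pypinyin supplies the tone-marked pinyin and the longest_common_subsequence helper sketched earlier is in scope; all other names are illustrative, and the patent's own implementation may differ (it also assumes one Chinese character per syllable).

from pypinyin import pinyin, Style

def to_pinyin(text):
    # One tone-marked syllable per character, flattened into a list
    return [s[0] for s in pinyin(text, style=Style.TONE)]

def correct_result(result_text, lexicon):
    # Steps (3)-(7): replace one misrecognized proper noun in result_text.
    result_py = to_pinyin(result_text)
    best, best_score = None, 0.0
    for term in lexicon:                                      # steps (1)-(2)
        word_py = to_pinyin(term)
        lcs = longest_common_subsequence(word_py, result_py)  # step (4)
        if not lcs:
            continue
        # Step (5): span of Result from the first to the last LCS syllable
        start = result_py.index(lcs[0])
        end = len(result_py) - 1 - result_py[::-1].index(lcs[-1])
        replace_len = end - start + 1
        # Step (6): rule values L_i, S_i, P_i
        L = len(lcs)
        S = L / len(word_py)
        P = min(replace_len, len(word_py)) / max(replace_len, len(word_py))
        score = L * S * P
        if (L > 1 and S >= 0.6 and P > 0.66
                and abs(replace_len - len(word_py)) <= 1
                and score > best_score):
            best, best_score = (start, end, term), score
    if best is None:
        return result_text
    start, end, term = best                                   # step (7)
    return result_text[:start] + term + result_text[end + 1:]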
The feasibility of the method is verified with the following embodiment:
The trained models were tested on collected audio-visual data; using the lip movement information in the video, voice activity detection is performed accurately, which greatly improves speech recognition in noisy environments. The experimental data are 100 sentences of human-recorded dialogue, to which babble noise at fixed signal-to-noise ratios was added algorithmically. The results are shown in fig. 4: recognition was tested at different signal-to-noise ratios, and the accuracy of speech recognition improves markedly when the environmental SNR is below 15 dB.

Claims (7)

1. A voice recognition system for lip voice activity detection and result error correction, characterized by comprising an audio/video processing module, a voice activity detector, a voice recognizer, a proper noun database, and a recognition result error corrector, wherein the audio/video processing module processes the collected videos containing human faces, divides them into video segments, and extracts the segments containing audio;
the voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each is a segment containing speech; it is divided into two parts, a lip region extractor and a lip voice activity detector; the lip region extractor is realized with the RetinaFace model and obtains lip key points and a lip region picture by detecting the human face in each audio-bearing segment; the lip voice activity detector is realized with the LPN model and judges, from the lip key points and lip region pictures of the frames in a segment, whether the segment contains speech;
the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining an original recognition result;
the proper noun database stores the proper nouns of a specific field together with the pinyin sequence corresponding to each word;
the recognition result error corrector corrects the recognizer's output as follows: converting the original recognition result into a pinyin sequence, computing the longest common subsequence between this pinyin sequence and the pinyin sequence of each proper noun in the database, and correcting the original result accordingly.
2. The speech recognition system of claim 1, wherein error correction of the original result using the longest common subsequence comprises the steps of:
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
3. A method of speech recognition with lip voice activity detection and result error correction, implemented using the speech recognition system of claim 1, comprising the steps of:
a preparation stage: collecting the proper nouns of a specific field, converting them into pinyin sequences, and establishing the proper noun database;
a training stage: training the voice activity detector, i.e. training the lip region extractor and the lip voice activity detector separately;
a prediction stage:
step one, the audio/video processing module divides the collected video into segments and extracts the audio from the video;
step two, the voice activity detector performs voice activity detection on each video segment;
step three, the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining the original recognition result;
step four, the recognition result error corrector corrects the original recognition result; when the original result contains proper noun recognition errors, they are corrected with the error correction method based on the longest common subsequence.
4. The speech recognition method of claim 3, wherein the lip region extractor uses the RetinaFace model with a modified structure, extending the original 5 face key points with 18 additional lip key points so that 23 key points are predicted.
5. The speech recognition method of claim 3, wherein the backbone network of the RetinaFace model is MobileNet V1-0.25, the optimizer is Adam with a learning rate of 0.001, and the learning rate is multiplied by 0.92 after each training round.
6. The speech recognition method of claim 3, wherein the lip voice activity detector uses the LPN model, trained as follows: training on the LSW dataset with the Adagrad optimizer, initial learning rate 0.0001, the learning rate divided by 10 every 50,000 iterations, momentum 0.9, attenuation coefficient 0.0005; training ends after 200,000 iterations.
7. The speech recognition method of claim 3, wherein error correction of the original result using the longest common subsequence comprises the steps of:
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
CN202110654992.XA 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction Pending CN113571051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110654992.XA CN113571051A (en) 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110654992.XA CN113571051A (en) 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction

Publications (1)

Publication Number Publication Date
CN113571051A 2021-10-29

Family

ID=78161988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110654992.XA Pending CN113571051A (en) 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction

Country Status (1)

Country Link
CN (1) CN113571051A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273359A (en) * 2017-06-20 2017-10-20 北京四海心通科技有限公司 A kind of text similarity determines method
CN109905764A (en) * 2019-03-21 2019-06-18 广州国音智能科技有限公司 Target person voice intercept method and device in a kind of video
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium
CN110276277A (en) * 2019-06-03 2019-09-24 罗普特科技集团股份有限公司 Method and apparatus for detecting facial image
CN110879986A (en) * 2019-11-21 2020-03-13 上海眼控科技股份有限公司 Face recognition method, apparatus and computer-readable storage medium
CN111311634A (en) * 2020-01-23 2020-06-19 支付宝实验室(新加坡)有限公司 Face image detection method, device and equipment
CN112863516A (en) * 2020-12-31 2021-05-28 竹间智能科技(上海)有限公司 Text error correction method and system and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Boyu Wang et al., "Are You Speaking: Real-Time Speech Activity Detection via Landmark Pooling Network", 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093380A (en) * 2022-01-24 2022-02-25 荣耀终端有限公司 Voice enhancement method, electronic equipment, chip system and readable storage medium

Similar Documents

Publication Publication Date Title
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
CN110717031B (en) Intelligent conference summary generation method and system
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN107305541B (en) Method and device for segmenting speech recognition text
US20160163318A1 (en) Metadata extraction of non-transcribed video and audio streams
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
CN112599128B (en) Voice recognition method, device, equipment and storage medium
JP2015187684A (en) Unsupervised training method, training apparatus, and training program for n-gram language model
US20130030794A1 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
CN112802461B (en) Speech recognition method and device, server and computer readable storage medium
CN109872714A (en) A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN105869622B (en) Chinese hot word detection method and device
CN113468891A (en) Text processing method and device
CN114550718A (en) Hot word speech recognition method, device, equipment and computer readable storage medium
US20190103110A1 (en) Information processing device, information processing method, and program
CN113571051A (en) Voice recognition system and method for lip voice activity detection and result error correction
WO2020238681A1 (en) Audio processing method and device, and man-machine interactive system
CN113129895A (en) Voice detection processing system
CN107507627B (en) Voice data heat analysis method and system
CN113539235B (en) Text analysis and speech synthesis method, device, system and storage medium
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN112687291B (en) Pronunciation defect recognition model training method and pronunciation defect recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination