CN113571051A - Voice recognition system and method for lip voice activity detection and result error correction - Google Patents


Info

Publication number
CN113571051A
Authority
CN
China
Prior art keywords
voice
original
result
recognition
lip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110654992.XA
Other languages
Chinese (zh)
Inventor
冯伟
史鹏
高丽清
刘泽康
刘之谏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110654992.XA
Publication of CN113571051A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a voice recognition system and method for lip voice activity detection and result error correction. The audio/video processing module processes collected videos containing human faces, divides them into video segments, and extracts the segments that contain audio. The voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each one contains speech. The voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, producing an original recognition result. The recognition result error corrector then corrects the recognizer's output.

Description

Voice recognition system and method for lip voice activity detection and result error correction
Technical Field
The invention belongs to the field of artificial intelligence, computer vision and voice recognition, and particularly relates to a voice recognition system and method for lip voice activity detection and result error correction.
Background
With the development of computer technology, human-machine interaction has become ever more frequent, and among the various interaction modes, voice is one that cannot be ignored. Speech recognition technology is widely used in smart homes, mobile phone voice assistants, and the like. According to the Mary Meeker annual Internet report of March 2017, the word accuracy of Google's machine-learning-based speech recognition system reached 95% for English, approaching the accuracy of human speech recognition. Recognition accuracy in quiet scenes has therefore reached a high level. In noisy scenes, however, accuracy suffers greatly, and there are many sources of recognition error in a noisy environment. Voice Activity Detection (VAD) is a technique in the field of speech signal processing that determines from an input signal whether a user is speaking and intercepts the valid speech segments for subsequent recognition. Voice activity detection reduces the computational load of speech recognition and reduces false recognition under noise. Conversely, poor voice activity detection prevents accurate sentence segmentation of the audio, so the recognizer's built-in language model cannot exploit sentence context and recognition accuracy drops.
At present, voice activity detection is realized in two main ways: based on the audio signal or based on the video signal. (China 201810864097) uses a posterior probability calculation to determine whether an audio frame is a speech frame. (China 202011332443.2) uses a deep learning algorithm to classify audio frames, taking frames that reach a preset silence-length threshold as segmentation points and dividing the continuous speech signal into valid speech segments. Audio-based voice activity detection is easily affected by background noise, while in practical speech recognition scenarios a large amount of electronic equipment captures the user's audio and facial video simultaneously. This patent therefore uses the video signal, performing voice activity detection from the user's lip movements, to improve the accuracy of speech recognition in noisy environments.
The LPN (Landmark Pooling Network) model proposed at Stony Brook University (State University of New York) differs from the traditional approach of using audio for silence detection: it realizes voice activity detection from the video signal by feeding face information into a deep neural network for feature learning, reaching 79.9% accuracy on the public LSW dataset. The LPN model, however, requires that the input image contain only the lip region. The open-source face detection algorithm RetinaFace, proposed by Imperial College London, achieves good accuracy on multiple datasets and can predict face key points while detecting faces.
In addition, domain-specific proper nouns make speech recognition considerably more difficult. (China 201710952988) proposes a method, based on domain recognition, for correcting the text produced by speech recognition; it computes a similarity score from the edit distance and corrects errors accordingly.
Disclosure of Invention
The invention aims to provide a voice recognition system and method with good robustness to noise. The face key point prediction module of RetinaFace is modified and trained with a dataset containing lip key points so that it can output lip region images, and voice activity detection is completed by the LPN. In addition, a domain-specific proper noun database is established and, based on the longest common subsequence method, once the specific field of the speech recognition task is designated, the recognition result is corrected so that misrecognized proper nouns are repaired. The technical scheme is as follows:
a voice recognition system for lip voice activity detection and result error correction is characterized by comprising an audio and video processing module, a voice activity detector, a voice recognizer, a proper noun database and a recognition result error corrector. Wherein the content of the first and second substances,
the audio/video processing module processes the collected videos containing human faces, divides them into video segments, and extracts the segments containing audio;
the voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each is a segment containing speech; it is divided into two parts, a lip region extractor and a lip voice activity detector. The lip region extractor is realized with the RetinaFace model and obtains lip key points and a lip region picture by detecting the human face in each audio-bearing segment; the lip voice activity detector is realized with the LPN model and judges, from the lip key points and lip region pictures of the frames in a segment, whether the segment contains speech;
the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining an original recognition result;
the proper noun database stores the proper nouns of a specific field together with the pinyin sequence corresponding to each word;
the recognition result error corrector corrects the recognizer's output as follows: the original recognition result is converted into a pinyin sequence, the longest common subsequence between this pinyin sequence and the pinyin sequence of each proper noun in the database is computed, and the original result is corrected accordingly.
Further, error correction of the original result using the longest common subsequence comprises the following steps (a dynamic-programming sketch of step (2) follows this list):
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
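Step (2) is the classic longest-common-subsequence problem. Below is a minimal Python sketch using standard dynamic programming over lists of tone-marked pinyin syllables; the function name and variables are illustrative, not taken from the patent.

def longest_common_subsequence(a, b):
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover one longest common subsequence itself
    lcs, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            lcs.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return lcs[::-1]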
The invention also provides a voice recognition method, characterized by comprising the following steps:
a preparation stage: collecting the proper nouns of a specific field, converting them into pinyin sequences, and establishing the proper noun database;
a training stage: training the voice activity detector, i.e. training the lip region extractor and the lip voice activity detector separately;
a prediction stage:
step one, the audio/video processing module divides the collected video into segments and extracts the audio from the video;
step two, the voice activity detector performs voice activity detection on each video segment;
step three, the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining the original recognition result;
step four, the recognition result error corrector corrects the original recognition result; when the original result contains proper noun recognition errors, they are corrected with the error correction method based on the longest common subsequence.
Further, the lip region extractor uses the RetinaFace model with a modified structure, extending the original 5 face key points with 18 additional lip key points so that 23 key points are predicted in total.
Further, the backbone network of the RetinaFace model is MobileNet V1-0.25; the optimizer is Adam with learning rate 0.001 and weight_decay 5e-4, and the learning rate is multiplied by 0.92 after each training epoch.
Further, the lip voice activity detector uses the LPN model, trained as follows: training on the LSW dataset with the Adagrad optimizer, initial learning rate 0.0001, the learning rate divided by 10 every 50,000 iterations, momentum 0.9, attenuation coefficient 0.0005; training ends after 200,000 iterations.
Further, error correction of the original result using the longest common subsequence comprises the following steps:
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
The invention uses deep learning and facial video information to realize voice activity detection through lip motion detection, improving the robustness of speech recognition to noisy environments, and corrects the recognition result with the longest common subsequence method, improving recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of the system.
FIG. 2 is an example picture from the WFLW dataset.
FIG. 3 is a diagram of the voice activity detection model.
FIG. 4 shows the experimental results.
Detailed Description
In order to make the technical scheme of the invention clearer, the invention is further explained with reference to the attached drawings.
Referring to fig. 1, the present invention provides a speech recognition system, which includes an audio/video processing module, a speech activity detector, a speech recognizer, a proper noun database, and a recognition result error corrector.
The audio/video processing module processes the collected videos containing human faces, divides them into segments, and extracts the segments containing audio. The voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each contains speech. The voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining the original recognition result. The proper noun database stores the proper nouns of a specific field and their corresponding pinyin sequences. The recognition result error corrector corrects the recognizer's output.
In this embodiment, the voice activity detector is divided into two parts: a lip region extractor and a lip voice activity detector. The lip region extractor is realized with the RetinaFace model and obtains lip key points and a lip region picture by detecting the human face in a video frame; the lip voice activity detector is realized with the LPN model and judges, from the lip key points and lip region pictures of several video frames, whether a video segment contains speech.
The invention provides a voice recognition method that distinguishes speech from non-speech (noise and silence) by analyzing lip movement in a video segment, performs speech recognition on the audio of segments detected as speech, and corrects the recognition result against the contents of a domain-specific lexicon, thereby improving recognition accuracy in noisy environments.
Preparation phase
Collect proper nouns of a specific field and establish the proper noun database. In this embodiment, the collected proper nouns concern hotels, restaurants, and scenic spots in the Chaoyang, Xicheng, and Dongcheng districts of Beijing, including names, hotel types, addresses, telephone numbers, facilities, recommended dishes, and so on. Each proper noun is converted into a pinyin sequence using the Pypinyin library of Python.
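For illustration, a minimal sketch of this conversion, assuming the Pypinyin package is installed; the tone-marked output matches the format used by the error corrector described below.

from pypinyin import pinyin, Style

# Convert a proper noun to its tone-marked pinyin sequence.
word = "北京烤鸭"
pinyin_seq = pinyin(word, style=Style.TONE)
print(pinyin_seq)  # [['běi'], ['jīng'], ['kǎo'], ['yā']]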
Training phase
Training a voice activity detector, respectively training a lip region extractor and a lip voice activity detector.
In this embodiment, the lip region extractor uses the RetinaFace model.
(1) Modifying the structure of the RetinaFace model: the face key point regression module of RetinaFace predicts the x and y coordinates of 5 face key points (left eye, right eye, nose tip, left mouth corner, right mouth corner) through a convolution layer with anchor_num * 10 output channels. This embodiment expands the number of key points by changing the output channels of that convolution layer from anchor_num * 10 to anchor_num * 46, so that the module predicts the x and y coordinates of 23 key points: 18 lip key points are added while the original 5 key points are retained.
(2) Training the modified RetinaFace model: training uses the WFLW dataset, whose pictures are annotated with 98 face key points, as shown in fig. 2. Key points numbered 96 (left eye), 97 (right eye), 54 (nose tip), and 76-95 (lips) are selected as the training targets of the key point regression module. The backbone network is MobileNet V1-0.25; the optimizer is Adam with learning rate 0.001 and weight_decay 5e-4, and the learning rate is multiplied by 0.92 after each training epoch.
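A minimal sketch of the head modification described in (1), assuming a PyTorch-style RetinaFace implementation whose landmark head is a 1x1 convolution; the class and parameter names are illustrative, not taken from any official repository.

import torch.nn as nn

class LandmarkHead(nn.Module):
    # Landmark regression head: predicts (x, y) per key point, per anchor.
    # num_points=5 gives the original anchor_num*10 output channels;
    # num_points=23 gives anchor_num*46, adding the 18 lip key points.
    def __init__(self, in_channels=64, anchor_num=2, num_points=23):
        super().__init__()
        self.num_points = num_points
        self.conv = nn.Conv2d(in_channels, anchor_num * num_points * 2, kernel_size=1)

    def forward(self, x):
        out = self.conv(x)  # (N, anchor_num*num_points*2, H, W)
        out = out.permute(0, 2, 3, 1).contiguous()
        return out.view(out.shape[0], -1, self.num_points * 2)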
The lip voice activity detector uses the LPN (Landmark Pooling Network) model, trained as follows: training uses the LSW dataset with the Adagrad optimizer, initial learning rate 0.0001, the learning rate divided by 10 every 50,000 iterations, momentum 0.9, and attenuation coefficient 0.0005; training ends after 200,000 iterations.
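A sketch of that schedule in PyTorch, under stated assumptions: the placeholder module stands in for the actual LPN, and since torch.optim.Adagrad has no momentum argument, the momentum 0.9 above is not reproduced here.

import torch

lpn = torch.nn.Linear(40, 2)  # placeholder module standing in for the LPN
optimizer = torch.optim.Adagrad(lpn.parameters(), lr=1e-4, weight_decay=5e-4)

for iteration in range(200000):        # training ends after 200,000 iterations
    # ... forward pass, loss computation, loss.backward(), optimizer.step() ...
    if (iteration + 1) % 50000 == 0:   # divide the learning rate by 10
        for group in optimizer.param_groups:
            group["lr"] /= 10.0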
A model diagram of a voice activity detector is shown in fig. 3.
Prediction phase
Step one, the audio/video processing module divides the collected video into segments 0.4 seconds long and extracts the segments containing audio;
and step two, the voice activity detector detects voice activity of the video clips, the input of the lip region extractor Retina face is a video frame, and the output is 20 lip key points and lip region pictures of the video frame. The input of the lip voice activity detector LPN is 20 lip keypoints and lip region pictures of each frame of the video segment, and the output is whether the current video segment is voice or not.
Thirdly, the voice recognizer performs voice recognition on the audio extracted from the video clip detected as the voice to obtain an original result of the voice recognition;
and fourthly, correcting the original result of the voice recognition by the recognition result corrector. An error correction method based on the longest common subsequence is used. The error correction method based on the longest public subsequence can correct the recognition error of proper nouns, and correct the wrong proper nouns in the original result of voice recognition on the premise of giving the word bank of the voice recognition proper nouns. The method comprises the following steps:
(1) Index each word of the domain-specific proper noun lexicon; the whole lexicon is denoted Term:
Term = [term_1, term_2, …, term_n]
where term_i denotes the i-th word in the lexicon.
(2) Convert each word in the lexicon into a tone-marked pinyin sequence. For a lexicon of n words, the set of pinyin sequences is denoted Word:
Word = [word_1, word_2, …, word_n]
where word_i is a sequence whose elements are tone-marked pinyin syllables; for example, for the word "北京烤鸭" (Beijing roast duck), the corresponding word_i is [['běi'], ['jīng'], ['kǎo'], ['yā']].
(3) Convert the original result of the speech recognition into a tone-marked pinyin sequence, denoted Result.
(4) Traverse the set Word, computing for each word the longest common subsequence of its pinyin sequence word_i and the pinyin sequence Result of the original recognition result; this subsequence is denoted lcs_i. The set of longest common subsequences between the lexicon and the original recognition result is denoted LCS:
LCS = [lcs_1, lcs_2, …, lcs_i, …, lcs_n]
where lcs_i is the longest common subsequence of the pinyin of the i-th lexicon word and the pinyin of the original recognition result.
(5) Traverse LCS. For each lcs_i, take its first character f_i and its last character l_i, and intercept from Result the subsequence beginning with f_i and ending with l_i; this is the part to be replaced, denoted replace_i. Replace denotes the set of all replace_i:
Replace = [replace_1, replace_2, …, replace_n]
(6) Traverse LCS, Word, and Replace, computing L_i, S_i, and P_i, where L_i is the length of lcs_i; S_i is the ratio of the length of lcs_i to the length of word_i; and P_i is the ratio of the smaller to the larger of the lengths of replace_i and word_i. Through experimentation, a set of rules can be determined that selects, from L_i, S_i, and P_i, the most suitable replacement word in the current lexicon. In this embodiment the rules are: L_i greater than 1, S_i greater than or equal to 0.6, and P_i greater than 0.66; among the candidates satisfying these conditions, choose the i that maximizes the product of the three and take term_i as the replacement, with the additional requirement that the lengths of replace_i and word_i differ by at most 1:
argmax_i (L_i * S_i * P_i)
(7) Replace the text corresponding to replace_i with term_i.
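Putting steps (1)-(7) together, a minimal Python sketch of the error corrector, assuming Pypinyin supplies the tone-marked pinyin and the longest_common_subsequence helper sketched earlier is in scope; all other names are illustrative, and the patent's own implementation may differ (it also assumes one Chinese character per syllable).

from pypinyin import pinyin, Style

def to_pinyin(text):
    # One tone-marked syllable per character, flattened into a list
    return [s[0] for s in pinyin(text, style=Style.TONE)]

def correct_result(result_text, lexicon):
    # Steps (3)-(7): replace one misrecognized proper noun in result_text.
    result_py = to_pinyin(result_text)
    best, best_score = None, 0.0
    for term in lexicon:                                      # steps (1)-(2)
        word_py = to_pinyin(term)
        lcs = longest_common_subsequence(word_py, result_py)  # step (4)
        if not lcs:
            continue
        # Step (5): span of Result from the first to the last LCS syllable
        start = result_py.index(lcs[0])
        end = len(result_py) - 1 - result_py[::-1].index(lcs[-1])
        replace_len = end - start + 1
        # Step (6): rule values L_i, S_i, P_i
        L = len(lcs)
        S = L / len(word_py)
        P = min(replace_len, len(word_py)) / max(replace_len, len(word_py))
        score = L * S * P
        if (L > 1 and S >= 0.6 and P > 0.66
                and abs(replace_len - len(word_py)) <= 1
                and score > best_score):
            best, best_score = (start, end, term), score
    if best is None:
        return result_text
    start, end, term = best                                   # step (7)
    return result_text[:start] + term + result_text[end + 1:]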
The feasibility of the method is verified with the following embodiment:
The trained models were tested on collected audio-visual data; using the lip movement information in the video, voice activity detection is performed accurately, which greatly improves speech recognition in noisy environments. The experimental data are 100 sentences of human-recorded dialogue, to which babble noise at fixed signal-to-noise ratios was added algorithmically. The results are shown in fig. 4: recognition was tested at different signal-to-noise ratios, and the accuracy of speech recognition improves markedly when the environmental SNR is below 15 dB.

Claims (7)

1. A voice recognition system for lip voice activity detection and result error correction, characterized by comprising an audio/video processing module, a voice activity detector, a voice recognizer, a proper noun database, and a recognition result error corrector, wherein the audio/video processing module processes the collected videos containing human faces, divides them into video segments, and extracts the segments containing audio;
the voice activity detector performs voice activity detection on the audio-bearing segments, judging whether each is a segment containing speech; it is divided into two parts, a lip region extractor and a lip voice activity detector; the lip region extractor is realized with the RetinaFace model and obtains lip key points and a lip region picture by detecting the human face in each audio-bearing segment; the lip voice activity detector is realized with the LPN model and judges, from the lip key points and lip region pictures of the frames in a segment, whether the segment contains speech;
the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining an original recognition result;
the proper noun database stores the proper nouns of a specific field together with the pinyin sequence corresponding to each word;
the recognition result error corrector corrects the recognizer's output as follows: converting the original recognition result into a pinyin sequence, computing the longest common subsequence between this pinyin sequence and the pinyin sequence of each proper noun in the database, and correcting the original result accordingly.
2. The speech recognition system of claim 1, wherein error correction of the original result using the longest common subsequence comprises the steps of:
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
3. A method of speech recognition with lip voice activity detection and result error correction, implemented using the speech recognition system of claim 1, comprising the steps of:
a preparation stage: collecting the proper nouns of a specific field, converting them into pinyin sequences, and establishing the proper noun database;
a training stage: training the voice activity detector, i.e. training the lip region extractor and the lip voice activity detector separately;
a prediction stage:
step one, the audio/video processing module divides the collected video into segments and extracts the audio from the video;
step two, the voice activity detector performs voice activity detection on each video segment;
step three, the voice recognizer performs speech recognition on the audio extracted from segments detected as containing speech, obtaining the original recognition result;
step four, the recognition result error corrector corrects the original recognition result; when the original result contains proper noun recognition errors, they are corrected with the error correction method based on the longest common subsequence.
4. The speech recognition method of claim 3, wherein the lip region extractor uses the RetinaFace model with a modified structure, extending the original 5 face key points with 18 additional lip key points so that 23 key points are predicted.
5. The speech recognition method of claim 3, wherein the backbone network of the RetinaFace model is MobileNet V1-0.25, the optimizer is Adam with a learning rate of 0.001, and the learning rate is multiplied by 0.92 after each training round.
6. The speech recognition method of claim 3, wherein the lip voice activity detector uses the LPN model, trained as follows: training on the LSW dataset with the Adagrad optimizer, initial learning rate 0.0001, the learning rate divided by 10 every 50,000 iterations, momentum 0.9, attenuation coefficient 0.0005; training ends after 200,000 iterations.
7. The speech recognition method of claim 3, wherein error correction of the original result using the longest common subsequence comprises the steps of:
(1) converting the original result of the speech recognition into a pinyin sequence;
(2) computing the longest common subsequence between the pinyin sequence of the original recognition result and the pinyin sequence of each word in the lexicon;
(3) locating the part of the original recognition result to be replaced, using the first and last characters of the longest common subsequence;
(4) selecting a replacement word by a set of rules over numerical values derived from the longest common subsequence, the pinyin sequence of the original recognition result, and the pinyin sequence of each lexicon word;
(5) replacing the located part of the original recognition result.
CN202110654992.XA 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction Pending CN113571051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110654992.XA CN113571051A (en) 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110654992.XA CN113571051A (en) 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction

Publications (1)

Publication Number Publication Date
CN113571051A 2021-10-29

Family

ID=78161988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110654992.XA Pending CN113571051A (en) 2021-06-11 2021-06-11 Voice recognition system and method for lip voice activity detection and result error correction

Country Status (1)

Country Link
CN (1) CN113571051A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273359A (en) * 2017-06-20 2017-10-20 北京四海心通科技有限公司 A kind of text similarity determines method
CN109905764A (en) * 2019-03-21 2019-06-18 广州国音智能科技有限公司 Target person voice intercept method and device in a kind of video
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium
CN110276277A (en) * 2019-06-03 2019-09-24 罗普特科技集团股份有限公司 Method and apparatus for detecting facial image
CN110879986A (en) * 2019-11-21 2020-03-13 上海眼控科技股份有限公司 Face recognition method, apparatus and computer-readable storage medium
CN111311634A (en) * 2020-01-23 2020-06-19 支付宝实验室(新加坡)有限公司 Face image detection method, device and equipment
CN112863516A (en) * 2020-12-31 2021-05-28 竹间智能科技(上海)有限公司 Text error correction method and system and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Boyu Wang et al., "Are You Speaking: Real-Time Speech Activity Detection via Landmark Pooling Network", 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093380A (en) * 2022-01-24 2022-02-25 荣耀终端有限公司 Voice enhancement method, electronic equipment, chip system and readable storage medium

Similar Documents

Publication Publication Date Title
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
CN110717031B (en) Intelligent conference summary generation method and system
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN107305541B (en) Method and device for segmenting speech recognition text
US20160163318A1 (en) Metadata extraction of non-transcribed video and audio streams
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
CN112599128B (en) Voice recognition method, device, equipment and storage medium
JP2015187684A (en) Unsupervised training method, training apparatus, and training program for n-gram language model
US20130030794A1 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
CN112802461B (en) Speech recognition method and device, server and computer readable storage medium
CN109872714A (en) A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN105869622B (en) Chinese hot word detection method and device
CN113468891A (en) Text processing method and device
CN114550718A (en) Hot word speech recognition method, device, equipment and computer readable storage medium
US20190103110A1 (en) Information processing device, information processing method, and program
CN113571051A (en) Voice recognition system and method for lip voice activity detection and result error correction
WO2020238681A1 (en) Audio processing method and device, and man-machine interactive system
CN113129895A (en) Voice detection processing system
CN107507627B (en) Voice data heat analysis method and system
CN113539235B (en) Text analysis and speech synthesis method, device, system and storage medium
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN112687291B (en) Pronunciation defect recognition model training method and pronunciation defect recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination