CN109637521A - A lip reading recognition method and device based on deep learning - Google Patents
A lip reading recognition method and device based on deep learning
- Publication number
- CN109637521A (application CN201811389295.0A)
- Authority
- CN
- China
- Prior art keywords
- lip
- text
- feature vector
- image sequence
- obtains
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
Abstract
Embodiments of the present invention provide a lip reading recognition method and device based on deep learning. The method comprises: obtaining a voice signal and a video of a user, wherein the video is captured of the user's face while the user utters the voice signal; recognizing the voice signal by speech recognition technology to obtain a first text; obtaining a lip image sequence to be recognized from the video; extracting a lip feature vector from the lip image sequence to be recognized, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text, to obtain the text corresponding to the user's voice signal. The technical solution provided by the embodiments of the present invention solves the prior-art problem of low speech recognition accuracy in noisy environments.
Description
[Technical field]
The present invention relates to the field of lip reading recognition technology, and in particular to a lip reading recognition method and device based on deep learning.
[Background art]
Keyboard and voice input are currently the common human-computer interaction modes on the market. In noisy environments, the captured user speech is mixed with ambient noise, which interferes with recognizing the user's speech and leads to low speech recognition accuracy.
Therefore, how to improve the accuracy of speech recognition in noisy environments has become one of the problems urgently needing a solution.
[Summary of the invention]
In view of this, embodiments of the present invention provide a lip reading recognition method and device based on deep learning, to solve the prior-art problem of low speech recognition accuracy in noisy environments.
To achieve the above object, according to an aspect of the invention, there is provided a lip reading recognition method based on deep learning. The method comprises: obtaining a voice signal and a video of a user, wherein the video is captured of the user's face while the user utters the voice signal; recognizing the voice signal by speech recognition technology to obtain a first text; obtaining a lip image sequence to be recognized from the video; extracting a lip feature vector from the lip image sequence to be recognized, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text, to obtain the text corresponding to the user's voice signal.
Further, recognizing the voice signal by speech recognition technology to obtain the first text comprises: performing feature extraction on the voice signal to obtain feature information; identifying the voice characteristic according to the feature information and a pre-established discrimination model; and recognizing the voice signal using a speech recognition model matching the voice characteristic, to obtain the first text.
Further, the video includes depth image information and infrared image information, and obtaining the lip image sequence to be recognized from the video comprises: extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting a first lip-region image sequence of the user from the depth image sequence; extracting a second lip-region image sequence of the user from the infrared image sequence; and taking the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be recognized.
Further, extracting the lip feature vector from the lip image sequence to be recognized and obtaining the second text according to the lip feature vector comprises: locating the lip contours in the lip image sequence to be recognized using a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; fusing the first lip contour curve and the second lip contour curve to obtain a target lip curve; extracting the lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in a lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; calculating the similarity value between the lip feature vector and the standard feature vector; selecting the lip feature vector whose similarity value exceeds a preset threshold as the target feature vector; outputting the lip reading text corresponding to the target feature vector; and ordering the multiple lip reading texts according to the video, to obtain the second text.
Further, correcting the first text according to the second text to obtain the text corresponding to the user's voice signal comprises: matching the second text against the first text; pre-outputting the successfully matched characters together with blanks, a blank marking a character that failed to match, to obtain a base text; obtaining, through context-based semantic analysis, the word associated with the character in the second text corresponding to the blank; and filling the blank in the base text with the associated word, to obtain the text corresponding to the user's voice signal.
To achieve the above object, according to another aspect of the invention, there is provided a lip reading recognition device based on deep learning. The device comprises: a first acquisition unit, for obtaining a voice signal and a video of a user, wherein the video is captured of the user's face while the user utters the voice signal; a recognition unit, for recognizing the voice signal by speech recognition technology to obtain a first text; a second acquisition unit, for obtaining a lip image sequence to be recognized from the video; a generation unit, for extracting a lip feature vector from the lip image sequence to be recognized and obtaining a second text according to the lip feature vector; and a correction unit, for correcting the first text according to the second text to obtain the text corresponding to the user's voice signal.
Further, the generation unit comprises: a locating subunit, for locating the lip contours in the lip image sequence to be recognized using a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; a fusion subunit, for fusing the first lip contour curve and the second lip contour curve to obtain a target lip curve; a first obtaining subunit, for extracting the lip feature vector from the target lip curve; a first matching subunit, for matching the extracted lip feature vector against the standard feature vectors stored in the lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; a calculation subunit, for calculating the similarity value between the lip feature vector and the standard feature vector; a confirmation subunit, for selecting the lip feature vector whose similarity value exceeds a preset threshold as the target feature vector; a first output subunit, for outputting the lip reading text corresponding to the target feature vector; and a composition subunit, for ordering the multiple lip reading texts according to the video to obtain the second text.
Further, the correction unit comprises: a second matching subunit, for matching the second text against the first text; a second output subunit, for pre-outputting the successfully matched characters together with blanks, a blank marking a character that failed to match, to obtain a base text; a second obtaining subunit, for obtaining, through context-based semantic analysis, the word associated with the character in the second text corresponding to the blank; and a fill-in subunit, for filling the blank in the base text with the associated word, to obtain the text corresponding to the user's voice signal.
To achieve the above object, according to another aspect of the invention, there is provided a storage medium. The storage medium includes a stored program, and when the program runs, the device on which the storage medium resides is controlled to execute the above lip reading recognition method based on deep learning.
To achieve the above object, according to another aspect of the invention, there is provided a server, including a memory and a processor. The memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and the program instructions, when loaded and executed by the processor, implement the steps of the above lip reading recognition method based on deep learning.
In this solution, the voice signal and the video of the user are obtained; a lip reading recognition algorithm based on deep learning recognizes the lip feature vector of the user in the video; a second text is obtained from the lip feature vector; and the second text is used to correct the first text recognized from the voice signal. The lip shape can thus be used in a noisy environment to obtain more accurately what the user said, improving speech recognition accuracy in noisy environments. Embodiments of the present invention therefore solve the prior-art problem of low speech recognition accuracy in noisy environments.
[Brief description of the drawings]
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without any creative effort.
Fig. 1 is a flowchart of a lip reading recognition method based on deep learning according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a lip reading recognition device based on deep learning according to an embodiment of the present invention.
[Detailed description of the embodiments]
For a better understanding of the technical solutions of the present invention, the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
It should be clear that the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present invention to describe terminals, the terminals should not be limited by these terms; the terms are only used to distinguish terminals from one another. For example, without departing from the scope of the embodiments of the present invention, a first acquisition unit could also be called a second acquisition unit, and similarly a second acquisition unit could be called a first acquisition unit.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a lip reading recognition method based on deep learning according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S101: obtain the voice signal and the video of a user, wherein the video is captured of the user's face while the user utters the voice signal;
Step S102: recognize the voice signal by speech recognition technology to obtain a first text;
Step S103: obtain a lip image sequence to be recognized from the video;
Step S104: extract a lip feature vector from the lip image sequence to be recognized, and obtain a second text according to the lip feature vector;
Step S105: correct the first text according to the second text, to obtain the text corresponding to the user's voice signal.
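Steps S101 to S105 can be sketched as the following pipeline. Every helper in it is a hypothetical placeholder (stubbed with toy return values) for a component the embodiment describes; it shows only the control flow, not an actual implementation.

```python
# Sketch of the S101-S105 pipeline. Each helper is a hypothetical stub
# standing in for a component described in the embodiment.

def recognize_speech(voice_signal):
    # S102: speech recognition yields the first text (stub).
    return "see you next qi"

def extract_lip_sequence(video):
    # S103: lip image sequence to be recognized (stub).
    return ["frame%d" % i for i in range(len(video))]

def lip_reading(lip_sequence):
    # S104: lip feature vectors -> second text (stub).
    return "see you next term"

def correct(first_text, second_text):
    # S105: keep words the two texts agree on; where they disagree,
    # trust the lip reading result (toy correction rule).
    out = []
    for a, b in zip(first_text.split(), second_text.split()):
        out.append(a if a == b else b)
    return " ".join(out)

def recognize(voice_signal, video):
    first_text = recognize_speech(voice_signal)   # S102
    lip_seq = extract_lip_sequence(video)         # S103
    second_text = lip_reading(lip_seq)            # S104
    return correct(first_text, second_text)       # S105

print(recognize(b"\x00" * 16, ["f0", "f1", "f2"]))  # -> see you next term
```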
In this solution, the voice signal and the video of the user are obtained; the lip reading recognition algorithm based on deep learning recognizes the lip feature vector of the user in the video; the second text is obtained from the lip feature vector; and the second text is used to correct the first text recognized from the voice signal. The lip shape can thus be used in a noisy environment to obtain what the user said, improving speech recognition accuracy in noisy environments. The present embodiment therefore solves the prior-art problem of low speech recognition accuracy in noisy environments.
Optionally, the user's voice signal may be obtained through a terminal device; it may be voice data in wav, mp3 or another format.
Optionally, the video includes depth image information and infrared image information. In one embodiment, the depth image information may be obtained by a 3D structured-light camera and the infrared image information by an infrared camera, so that the video is less susceptible to environmental factors such as light intensity; this effectively improves the legibility of the video and provides a basis for recognizing the correct lip reading text.
Optionally, recognizing the voice signal by speech recognition technology to obtain the first text comprises: performing feature extraction on the voice signal to obtain feature information; identifying the voice characteristic according to the feature information and the pre-established discrimination model; and recognizing the voice signal using the speech recognition model matching the voice characteristic, to obtain the first text.
Specifically, the feature extraction may be, for example, spectral feature extraction, fundamental-frequency feature extraction, power feature extraction or zero-crossing-rate extraction. The discrimination model may be established using modeling techniques such as a support vector machine (SVM) or a hidden Markov model (HMM), and may include a Mandarin model, a Chongqing-accent model, a Wu-dialect-accent model, a Henan-accent model, a Guangdong-accent model, etc., so that the identified voice characteristic is Mandarin, a Chongqing accent, a Wu-dialect accent, a Henan accent, a Guangdong accent, etc. Recognizing the voice signal with the speech recognition model matching the voice characteristic can effectively improve the accuracy of speech recognition.
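As one concrete illustration of the discrimination step, the sketch below picks the nearest accent centroid for a voice-feature vector and uses it to select a recognition model. This is a deliberate simplification: a nearest-centroid classifier stands in for the SVM/HMM discrimination model named above, and all feature values and model names are invented.

```python
import math

# Toy accent discrimination: nearest-centroid stand-in for the SVM/HMM
# discrimination model described above. A feature vector might hold
# spectral, fundamental-frequency and zero-crossing-rate statistics;
# the numbers below are invented for illustration.
CENTROIDS = {
    "mandarin":  [0.2, 0.5, 0.3],
    "chongqing": [0.7, 0.4, 0.1],
    "wu":        [0.4, 0.8, 0.6],
}

def identify_accent(features):
    # Euclidean distance to each accent centroid; pick the closest.
    def dist(c):
        return math.sqrt(sum((f - x) ** 2 for f, x in zip(features, c)))
    return min(CENTROIDS, key=lambda name: dist(CENTROIDS[name]))

# The identified accent then selects the matching recognition model
# (hypothetical model names).
RECOGNIZERS = {name: "%s_asr_model" % name for name in CENTROIDS}

accent = identify_accent([0.25, 0.55, 0.35])
print(accent, RECOGNIZERS[accent])  # -> mandarin mandarin_asr_model
```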
Optionally, the video includes depth image information and infrared image information, and obtaining the lip image sequence to be recognized from the video comprises: extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting the first lip-region image sequence of the user from the depth image sequence; extracting the second lip-region image sequence of the user from the infrared image sequence; and taking the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be recognized.
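The dual-stream extraction might be sketched as follows, assuming each frame arrives as a 2-D array and that a fixed region of interest stands in for a real face/lip detector (both assumptions, not part of the embodiment):

```python
import numpy as np

def crop_lip_region(frame, roi):
    # roi = (top, bottom, left, right) pixel bounds of the lip area.
    top, bottom, left, right = roi
    return frame[top:bottom, left:right]

def lip_sequences(depth_frames, ir_frames, roi):
    # One lip-region image sequence per stream; together they form the
    # lip image sequence to be recognized.
    first = [crop_lip_region(f, roi) for f in depth_frames]
    second = [crop_lip_region(f, roi) for f in ir_frames]
    return first, second

depth = [np.zeros((120, 160)) for _ in range(3)]  # fake depth frames
ir = [np.zeros((120, 160)) for _ in range(3)]     # fake infrared frames
roi = (80, 110, 60, 100)                          # assumed fixed lip box
d_seq, i_seq = lip_sequences(depth, ir, roi)
print(len(d_seq), d_seq[0].shape)  # -> 3 (30, 40)
```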
Optionally, extracting the lip feature vector from the lip image sequence to be recognized and obtaining the second text according to the lip feature vector comprises: locating the lip contours in the lip image sequence to be recognized using the lip reading recognition algorithm based on deep learning, to obtain the first lip contour curve and the second lip contour curve; fusing the first lip contour curve and the second lip contour curve to obtain the target lip curve; extracting the lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in the lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; calculating the similarity value between the lip feature vector and the standard feature vector; selecting the lip feature vector whose similarity value exceeds a preset threshold as the target feature vector; outputting the lip reading text corresponding to the target feature vector; and ordering the multiple lip reading texts according to the video, to obtain the second text.
The lip reading recognition algorithm of deep learning is trained on a large number of training samples and can effectively improve lip reading recognition efficiency. The fusion may merge the first lip contour curve and the second lip contour curve in terms of shape, size, texture and contrast, and the fused lip contour curve is then used for feature extraction. Extracting lip contour curves from both the depth image sequence and the infrared image sequence effectively avoids low video legibility where the user is filmed in weak light, making the lip reading recognition result more accurate.
Optionally, correcting the first text according to the second text to obtain the text corresponding to the user's voice signal comprises: matching the second text against the first text; pre-outputting the successfully matched characters together with blanks, a blank marking a character that failed to match, to obtain a base text; obtaining, through context-based semantic analysis, the word associated with the character in the second text corresponding to the blank; and filling the blank in the base text with the associated word, to obtain the text corresponding to the user's voice signal. For example: "我们下学_再见" ("see you next _"); the character in the second text corresponding to the blank is "气" (qì, "gas"), whose associated candidates include "七", "其", "期" and "奇"; semantic analysis selects "期" ("term"), yielding "我们下学期再见" ("see you next term").
An embodiment of the present invention provides a lip reading recognition device based on deep learning, for executing the above lip reading recognition method based on deep learning. As shown in Fig. 2, the device comprises: a first acquisition unit 10, a recognition unit 20, a second acquisition unit 30, a generation unit 40 and a correction unit 50.
The first acquisition unit 10 is configured to obtain the voice signal and the video of a user, wherein the video is captured of the user's face while the user utters the voice signal.
The recognition unit 20 is configured to recognize the voice signal by speech recognition technology to obtain the first text.
The second acquisition unit 30 is configured to obtain the lip image sequence to be recognized from the video.
The generation unit 40 is configured to extract the lip feature vector from the lip image sequence to be recognized and to obtain the second text according to the lip feature vector.
The correction unit 50 is configured to correct the first text according to the second text, to obtain the text corresponding to the user's voice signal.
In this solution, the voice signal and the video of the user are obtained; the lip reading recognition algorithm based on deep learning recognizes the lip feature vector of the user in the video; the second text is obtained from the lip feature vector; and the second text is used to correct the first text recognized from the voice signal. The lip shape can thus be used in a noisy environment to obtain what the user said, improving speech recognition accuracy in noisy environments. The present embodiment therefore solves the prior-art problem of low speech recognition accuracy in noisy environments.
Optionally, the user's voice signal may be obtained through a terminal device; it may be voice data in wav, mp3 or another format.
Optionally, the video includes depth image information and infrared image information. In one embodiment, the depth image information may be obtained by a 3D structured-light camera and the infrared image information by an infrared camera, so that the video is less susceptible to environmental factors such as light intensity; this effectively improves the legibility of the video and provides a basis for recognizing the correct lip reading text.
Optionally, the recognition unit 20 comprises a first extraction subunit, an identification subunit and a generation subunit.
The first extraction subunit is configured to perform feature extraction on the voice signal to obtain feature information; the identification subunit is configured to identify the voice characteristic according to the feature information and the pre-established discrimination model; the generation subunit is configured to recognize the voice signal using the speech recognition model matching the voice characteristic, to obtain the first text.
Specifically, the feature extraction may be, for example, spectral feature extraction, fundamental-frequency feature extraction, power feature extraction or zero-crossing-rate extraction. The discrimination model may be established using modeling techniques such as a support vector machine (SVM) or a hidden Markov model (HMM), and may include a Mandarin model, a Chongqing-accent model, a Wu-dialect-accent model, a Henan-accent model, a Guangdong-accent model, etc., so that the identified voice characteristic is Mandarin, a Chongqing accent, a Wu-dialect accent, a Henan accent, a Guangdong accent, etc. Recognizing the voice signal with the speech recognition model matching the voice characteristic can effectively improve the accuracy of speech recognition.
Optionally, the second acquisition unit 30 comprises a second extraction subunit, a third extraction subunit, a fourth extraction subunit and a first processing subunit.
The second extraction subunit is configured to extract the depth image sequence from the depth image information and the infrared image sequence from the infrared image information; the third extraction subunit is configured to extract the first lip-region image sequence of the user from the depth image sequence; the fourth extraction subunit is configured to extract the second lip-region image sequence of the user from the infrared image sequence; the first processing subunit is configured to take the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be recognized.
Optionally, the generation unit 40 comprises a locating subunit, a fusion subunit, a first obtaining subunit, a first matching subunit, a calculation subunit, a confirmation subunit, a first output subunit and a composition subunit.
The locating subunit is configured to locate the lip contours in the lip image sequence to be recognized using the lip reading recognition algorithm based on deep learning, to obtain the first lip contour curve and the second lip contour curve; the fusion subunit is configured to fuse the first lip contour curve and the second lip contour curve to obtain the target lip curve; the first obtaining subunit is configured to extract the lip feature vector from the target lip curve; the first matching subunit is configured to match the extracted lip feature vector against the standard feature vectors stored in the lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; the calculation subunit is configured to calculate the similarity value between the lip feature vector and the standard feature vector; the confirmation subunit is configured to select the lip feature vector whose similarity value exceeds the preset threshold as the target feature vector; the first output subunit is configured to output the lip reading text corresponding to the target feature vector; the composition subunit is configured to order the multiple lip reading texts according to the video, to obtain the second text.
The lip reading recognition algorithm of deep learning is trained on a large number of training samples and can effectively improve lip reading recognition efficiency. The fusion may merge the first lip contour curve and the second lip contour curve in terms of shape, size, texture and contrast, and the fused lip contour curve is then used for feature extraction. Extracting lip contour curves from both the depth image sequence and the infrared image sequence effectively avoids low video legibility where the user is filmed in weak light, making the lip reading recognition result more accurate.
Optionally, the amending unit 50 includes: a second matching subelement, a second output subelement, a second obtaining subelement, and a fill-in subelement.
The second matching subelement is configured to match the second text with the first text. The second output subelement is configured to pre-output the successfully matched text together with spaces, where a space indicates text that was not successfully matched, to obtain a base text. The second obtaining subelement is configured to perform context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space. The fill-in subelement is configured to fill the spaces in the base text with the associated words to obtain the text corresponding to the user's voice signal. For example, for the utterance "see you next term" (我们下学期再见): the character in the second text corresponding to the space is "气" ("gas"), and its associated near-homophones include "七" ("seven"), "其" ("its"), "期" ("term"), "奇" ("odd"), and so on; semantic analysis selects "期", yielding "我们下学期再见" ("see you next term").
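The space-filling step can be illustrated with a toy context model. The sketch below assumes "_" marks an unmatched character and scores each associated candidate with a hypothetical character-trigram table; a real system would use a trained semantic or language model rather than this hand-built table:

```python
def fill_gaps(base_text, candidates, context_score):
    """Fill '_' placeholders in the base text with the candidate word that
    scores highest given the surrounding characters (toy semantic analysis)."""
    chars = list(base_text)
    for i, ch in enumerate(chars):
        if ch == "_":
            left = chars[i - 1] if i > 0 else ""
            right = chars[i + 1] if i + 1 < len(chars) else ""
            chars[i] = max(candidates[i],
                           key=lambda c: context_score(left, c, right))
    return "".join(chars)

# Hypothetical trigram counts standing in for semantic analysis.
trigram_counts = {("学", "期", "再"): 9, ("学", "气", "再"): 0, ("学", "七", "再"): 1}
score = lambda l, c, r: trigram_counts.get((l, c, r), 0)
base = "我们下学_再见"            # '_' marks the unmatched character
cands = {4: ["气", "七", "期"]}  # associated words for the gap at index 4
print(fill_gaps(base, cands, score))  # 我们下学期再见
```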
An embodiment of the present invention provides a storage medium. The storage medium includes a stored program, and when the program runs, a device where the storage medium is located is controlled to execute the following steps:
obtaining the voice signal and video of a user, where the video is obtained by shooting the user's face while the user utters the voice signal; recognizing the voice signal through speech recognition technology to obtain a first text; obtaining a lip image sequence to be identified from the video; extracting a lip feature vector from the lip image sequence to be identified, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text to obtain the text corresponding to the user's voice signal.
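The five steps above can be sketched as a single orchestration function. All model functions here are injected stand-ins, since the patent does not fix any concrete speech recognition or lip reading implementation:

```python
def recognize_with_lip_correction(audio, video, asr, lip_reader, correct):
    """Orchestrate the claimed steps: speech recognition gives a first text,
    lip reading the video gives a second text, and the second corrects the
    first. `asr`, `lip_reader`, and `correct` are injected model stubs."""
    first_text = asr(audio)               # speech recognition technology
    lip_frames = video                    # lip image sequence to be identified
    second_text = lip_reader(lip_frames)  # lip feature vectors -> text
    return correct(first_text, second_text)

# Illustrative stubs showing the data flow only; '?' marks an uncertain match.
asr = lambda audio: "see you next te?m"
lip_reader = lambda frames: "see you next term"
correct = lambda first, second: second if "?" in first else first
print(recognize_with_lip_correction(b"pcm", ["frame0", "frame1"],
                                    asr, lip_reader, correct))
```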
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: performing feature extraction on the voice signal to obtain characteristic information; identifying voice features according to the characteristic information and a pre-established discrimination model; and recognizing the voice signal with a speech recognition model that matches the voice features to obtain the first text.
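As an illustration of the feature-extraction and model-selection steps, the sketch below computes two simple per-frame descriptors (energy and zero-crossing rate) and uses a toy discrimination model to pick a recognition profile. The descriptors and the profile rule are assumptions for illustration; a production system would use richer features such as MFCCs and a trained classifier:

```python
def extract_features(samples, frame_len=160):
    """Split a PCM signal into frames and compute two simple descriptors
    per frame: mean energy and zero-crossing rate."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:])
                  if (a < 0) != (b < 0)) / frame_len
        feats.append((energy, zcr))
    return feats

def pick_model(features, discrimination_model):
    """Toy discrimination model: map mean energy to the profile that
    selects the matching speech recognition model."""
    mean_energy = sum(e for e, _ in features) / len(features)
    return discrimination_model(mean_energy)

samples = [10 if i % 4 < 2 else -10 for i in range(320)]  # toy square wave
feats = extract_features(samples)
model = pick_model(feats, lambda e: "loud-profile" if e > 50 else "soft-profile")
print(model)  # loud-profile
```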
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: the video includes depth image information and infrared image information; extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting a first lip-region image sequence of the user from the depth image sequence; extracting a second lip-region image sequence of the user from the infrared image sequence; and using the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be identified.
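Extracting a lip-region image sequence from a depth or infrared sequence reduces, per frame, to cropping a lip bounding box. A minimal sketch, assuming the bounding box comes from an external face or landmark detector (not shown here):

```python
def crop_lip_region(frame, lip_box):
    """Crop the lip region from one frame given a bounding box.
    `frame` is a 2-D list of pixel values; `lip_box` is (top, left, h, w),
    assumed to be supplied by a face/landmark detector."""
    top, left, h, w = lip_box
    return [row[left:left + w] for row in frame[top:top + h]]

def lip_sequence(frames, lip_box):
    """Apply the same crop to every frame of a depth or infrared sequence."""
    return [crop_lip_region(f, lip_box) for f in frames]

frame = [[r * 10 + c for c in range(6)] for r in range(6)]  # 6x6 toy image
seq = lip_sequence([frame, frame], lip_box=(4, 1, 2, 4))
print(seq[0])  # [[41, 42, 43, 44], [51, 52, 53, 54]]
```

Running the same crop over the depth and infrared sequences yields the first and second lip-region image sequences, respectively.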
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: locating the lip contours in the lip image sequence to be identified with the lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; performing fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve; extracting a lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in the lip reading feature library, where the lip reading feature library includes a Mandarin feature library and multiple dialect feature libraries; calculating similarity values between the lip feature vector and the standard feature vectors; selecting lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors; outputting the lip reading text corresponding to each target feature vector; and sorting the multiple lip reading texts according to the video to obtain the second text.
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: matching the second text with the first text; pre-outputting the successfully matched text together with spaces, where a space indicates text that was not successfully matched, to obtain a base text; performing context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and filling the spaces in the base text with the associated words to obtain the text corresponding to the user's voice signal.
An embodiment of the present invention provides a server, including a memory and a processor. The memory is configured to store information including program instructions, and the processor is configured to control the execution of the program instructions. When loaded and executed by the processor, the program instructions implement the following steps:
obtaining the voice signal and video of a user, where the video is obtained by shooting the user's face while the user utters the voice signal; recognizing the voice signal through speech recognition technology to obtain a first text; obtaining a lip image sequence to be identified from the video; extracting a lip feature vector from the lip image sequence to be identified, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text to obtain the text corresponding to the user's voice signal.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: performing feature extraction on the voice signal to obtain characteristic information; identifying voice features according to the characteristic information and a pre-established discrimination model; and recognizing the voice signal with a speech recognition model that matches the voice features to obtain the first text.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: the video includes depth image information and infrared image information; extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting a first lip-region image sequence of the user from the depth image sequence; extracting a second lip-region image sequence of the user from the infrared image sequence; and using the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be identified.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: locating the lip contours in the lip image sequence to be identified with the lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; performing fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve; extracting a lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in the lip reading feature library, where the lip reading feature library includes a Mandarin feature library and multiple dialect feature libraries; calculating similarity values between the lip feature vector and the standard feature vectors; selecting lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors; outputting the lip reading text corresponding to each target feature vector; and sorting the multiple lip reading texts according to the video to obtain the second text.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: matching the second text with the first text; pre-outputting the successfully matched text together with spaces, where a space indicates text that was not successfully matched, to obtain a base text; performing context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and filling the spaces in the base text with the associated words to obtain the text corresponding to the user's voice signal.
It should be noted that the terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (Personal Computer, PC), personal digital assistants (Personal Digital Assistant, PDA), wireless handheld devices, tablet computers (Tablet Computer), mobile phones, MP3 players, MP4 players, and the like.
It can be understood that the application may be a native application (nativeApp) installed on the terminal, or a web application (webApp) running in a browser on the terminal; this is not limited in the embodiments of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, devices, and units described above, and details are not described herein again.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the units is only a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (Processor) to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A lip reading recognition method based on deep learning, characterized in that the method comprises:
obtaining a voice signal and a video of a user, wherein the video is obtained by shooting the face of the user while the user utters the voice signal;
recognizing the voice signal through speech recognition technology to obtain a first text;
obtaining a lip image sequence to be identified from the video;
extracting a lip feature vector from the lip image sequence to be identified, and obtaining a second text according to the lip feature vector; and
correcting the first text according to the second text to obtain a text corresponding to the voice signal of the user.
2. The method according to claim 1, characterized in that recognizing the voice signal through speech recognition technology to obtain the first text comprises:
performing feature extraction on the voice signal to obtain characteristic information;
identifying voice features according to the characteristic information and a pre-established discrimination model; and
recognizing the voice signal with a speech recognition model that matches the voice features to obtain the first text.
3. The method according to claim 1, characterized in that the video comprises depth image information and infrared image information, and obtaining the lip image sequence to be identified from the video comprises:
extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information;
extracting a first lip-region image sequence of the user from the depth image sequence;
extracting a second lip-region image sequence of the user from the infrared image sequence; and
using the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be identified.
4. The method according to any one of claims 1 to 3, characterized in that extracting the lip feature vector from the lip image sequence to be identified and obtaining the second text according to the lip feature vector comprises:
locating lip contours in the lip image sequence to be identified with a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve;
performing fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve;
extracting the lip feature vector from the target lip curve;
matching the extracted lip feature vector against standard feature vectors stored in a lip reading feature library, wherein the lip reading feature library comprises a Mandarin feature library and multiple dialect feature libraries;
calculating similarity values between the lip feature vector and the standard feature vectors;
selecting lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors;
outputting lip reading texts corresponding to the target feature vectors; and
sorting the multiple lip reading texts according to the video to obtain the second text.
5. The method according to any one of claims 1 to 3, characterized in that correcting the first text according to the second text to obtain the text corresponding to the voice signal of the user comprises:
matching the second text with the first text;
pre-outputting the successfully matched text together with spaces, wherein a space indicates text that was not successfully matched, to obtain a base text;
performing context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and
filling the spaces in the base text with the associated words to obtain the text corresponding to the voice signal of the user.
6. A lip reading recognition device based on deep learning, characterized in that the device comprises:
a first acquisition unit, configured to obtain a voice signal and a video of a user, wherein the video is obtained by shooting the face of the user while the user utters the voice signal;
a recognition unit, configured to recognize the voice signal through speech recognition technology to obtain a first text;
a second acquisition unit, configured to obtain a lip image sequence to be identified from the video;
a generation unit, configured to extract a lip feature vector from the lip image sequence to be identified and obtain a second text according to the lip feature vector; and
an amending unit, configured to correct the first text according to the second text to obtain a text corresponding to the voice signal of the user.
7. The device according to claim 6, characterized in that the generation unit comprises:
a locating subelement, configured to locate lip contours in the lip image sequence to be identified with a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve;
a fusion subelement, configured to perform fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve;
a first obtaining subelement, configured to extract the lip feature vector from the target lip curve;
a first matching subelement, configured to match the extracted lip feature vector against standard feature vectors stored in a lip reading feature library, wherein the lip reading feature library comprises a Mandarin feature library and multiple dialect feature libraries;
a computation subelement, configured to calculate similarity values between the lip feature vector and the standard feature vectors;
a confirmation subelement, configured to select lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors;
a first output subelement, configured to output lip reading texts corresponding to the target feature vectors; and
a composition subelement, configured to sort the multiple lip reading texts according to the video to obtain the second text.
8. The device according to claim 6, characterized in that the amending unit comprises:
a second matching subelement, configured to match the second text with the first text;
a second output subelement, configured to pre-output the successfully matched text together with spaces, wherein a space indicates text that was not successfully matched, to obtain a base text;
a second obtaining subelement, configured to perform context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and
a fill-in subelement, configured to fill the spaces in the base text with the associated words to obtain the text corresponding to the voice signal of the user.
9. A storage medium, comprising a stored program, characterized in that, when the program runs, a device where the storage medium is located is controlled to perform the lip reading recognition method based on deep learning according to any one of claims 1 to 5.
10. A server, comprising a memory and a processor, wherein the memory is configured to store information including program instructions and the processor is configured to control execution of the program instructions, characterized in that the program instructions, when loaded and executed by the processor, implement the steps of the lip reading recognition method based on deep learning according to any one of claims 1 to 5.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811269809 | 2018-10-29 | ||
CN2018112698099 | 2018-10-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109637521A true CN109637521A (en) | 2019-04-16 |
Family
ID=66068629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811389295.0A Pending CN109637521A (en) | 2018-10-29 | 2018-11-21 | A kind of lip reading recognition methods and device based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109637521A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101752A (en) * | 2007-07-19 | 2008-01-09 | 华中科技大学 | Monosyllabic language lip-reading recognition system based on vision character |
CN106504751A (en) * | 2016-08-01 | 2017-03-15 | 深圳奥比中光科技有限公司 | Self adaptation lip reading exchange method and interactive device |
CN106875941A (en) * | 2017-04-01 | 2017-06-20 | 彭楚奥 | A kind of voice method for recognizing semantics of service robot |
CN107045385A (en) * | 2016-08-01 | 2017-08-15 | 深圳奥比中光科技有限公司 | Lip reading exchange method and lip reading interactive device based on depth image |
CN108346427A (en) * | 2018-02-05 | 2018-07-31 | 广东小天才科技有限公司 | A kind of audio recognition method, device, equipment and storage medium |
CN108537207A (en) * | 2018-04-24 | 2018-09-14 | Oppo广东移动通信有限公司 | Lip reading recognition methods, device, storage medium and mobile terminal |
- 2018-11-21 CN CN201811389295.0A patent/CN109637521A/en active Pending
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110210310A (en) * | 2019-04-30 | 2019-09-06 | 北京搜狗科技发展有限公司 | A kind of method for processing video frequency, device and the device for video processing |
CN112417925A (en) * | 2019-08-21 | 2021-02-26 | 北京中关村科金技术有限公司 | In-vivo detection method and device based on deep learning and storage medium |
CN110992958A (en) * | 2019-11-19 | 2020-04-10 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN110992958B (en) * | 2019-11-19 | 2021-06-22 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN111028833A (en) * | 2019-12-16 | 2020-04-17 | 广州小鹏汽车科技有限公司 | Interaction method and device for interaction and vehicle interaction |
CN111028833B (en) * | 2019-12-16 | 2022-08-16 | 广州小鹏汽车科技有限公司 | Interaction method and device for interaction and vehicle interaction |
CN111447325A (en) * | 2020-04-03 | 2020-07-24 | 上海闻泰电子科技有限公司 | Call auxiliary method, device, terminal and storage medium |
CN111583916A (en) * | 2020-05-19 | 2020-08-25 | 科大讯飞股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111832412A (en) * | 2020-06-09 | 2020-10-27 | 北方工业大学 | Sound production training correction method and system |
CN111832412B (en) * | 2020-06-09 | 2024-04-09 | 北方工业大学 | Sounding training correction method and system |
WO2022033556A1 (en) * | 2020-08-14 | 2022-02-17 | 华为技术有限公司 | Electronic device and speech recognition method therefor, and medium |
CN112037788B (en) * | 2020-09-10 | 2021-08-24 | 中航华东光电(上海)有限公司 | Voice correction fusion method |
CN112037788A (en) * | 2020-09-10 | 2020-12-04 | 中航华东光电(上海)有限公司 | Voice correction fusion technology |
CN112820274B (en) * | 2021-01-08 | 2021-09-28 | 上海仙剑文化传媒股份有限公司 | Voice information recognition correction method and system |
CN112820274A (en) * | 2021-01-08 | 2021-05-18 | 上海仙剑文化传媒股份有限公司 | Voice information recognition correction method and system |
CN113345436B (en) * | 2021-08-05 | 2021-11-12 | 创维电器股份有限公司 | Remote voice recognition control system and method based on multi-system integration high recognition rate |
CN113345436A (en) * | 2021-08-05 | 2021-09-03 | 创维电器股份有限公司 | Remote voice recognition control system and method based on multi-system integration high recognition rate |
CN113660501A (en) * | 2021-08-11 | 2021-11-16 | 云知声(上海)智能科技有限公司 | Method and device for matching subtitles |
CN113722513A (en) * | 2021-09-06 | 2021-11-30 | 北京字节跳动网络技术有限公司 | Multimedia data processing method and equipment |
CN113722513B (en) * | 2021-09-06 | 2022-12-20 | 抖音视界有限公司 | Multimedia data processing method and equipment |
CN114676282A (en) * | 2022-04-11 | 2022-06-28 | 北京女娲补天科技信息技术有限公司 | Event entry method and device based on audio and video data and computer equipment |
CN114676282B (en) * | 2022-04-11 | 2023-02-03 | 北京女娲补天科技信息技术有限公司 | Event entry method and device based on audio and video data and computer equipment |
CN116805272A (en) * | 2022-10-29 | 2023-09-26 | 武汉行已学教育咨询有限公司 | Visual education teaching analysis method, system and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109637521A (en) | A kind of lip reading recognition methods and device based on deep learning | |
EP3553773B1 (en) | Training and testing utterance-based frameworks | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
US9779730B2 (en) | Method and apparatus for speech recognition and generation of speech recognition engine | |
KR102582291B1 (en) | Emotion information-based voice synthesis method and device | |
CN106575502B (en) | System and method for providing non-lexical cues in synthesized speech | |
CN110838289A (en) | Awakening word detection method, device, equipment and medium based on artificial intelligence | |
CN109686383B (en) | Voice analysis method, device and storage medium | |
US7792671B2 (en) | Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments | |
CN105654940B (en) | Speech synthesis method and device | |
CN109036471B (en) | Voice endpoint detection method and device | |
WO2014183373A1 (en) | Systems and methods for voice identification | |
JP2010152751A (en) | Statistic model learning device, statistic model learning method and program | |
CN109166569B (en) | Detection method and device for phoneme mislabeling | |
CN112735371B (en) | Method and device for generating speaker video based on text information | |
CN111402862A (en) | Voice recognition method, device, storage medium and equipment | |
CN109461459A (en) | Speech assessment method, apparatus, computer equipment and storage medium | |
CN110503941B (en) | Language ability evaluation method, device, system, computer equipment and storage medium | |
CN106847273B (en) | Awakening word selection method and device for voice recognition | |
CN111552777A (en) | Audio identification method and device, electronic equipment and storage medium | |
US20110224985A1 (en) | Model adaptation device, method thereof, and program thereof | |
US11615787B2 (en) | Dialogue system and method of controlling the same | |
CN110853669A (en) | Audio identification method, device and equipment | |
CN111680514A (en) | Information processing and model training method, device, equipment and storage medium | |
CN114783424A (en) | Text corpus screening method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190416 |