CN109637521A - A lip reading recognition method and device based on deep learning - Google Patents
A lip reading recognition method and device based on deep learning
- Publication number
- CN109637521A (application CN201811389295.0A)
- Authority
- CN
- China
- Prior art keywords
- lip
- text
- feature vector
- image sequence
- obtains
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
Abstract
Embodiments of the present invention provide a lip reading recognition method and device based on deep learning. The method comprises: obtaining a voice signal and a video of a user, wherein the video is captured of the user's face while the user utters the voice signal; recognizing the voice signal by speech recognition technology to obtain a first text; obtaining a lip image sequence to be recognized from the video; extracting a lip feature vector from the lip image sequence to be recognized, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text, to obtain the text corresponding to the user's voice signal. The technical solution provided by the embodiments of the present invention solves the prior-art problem of low speech recognition accuracy in noisy environments.
Description
[Technical field]
The present invention relates to the field of lip reading recognition technology, and in particular to a lip reading recognition method and device based on deep learning.
[Background art]
Keyboard and voice input are currently the common human-computer interaction modes on the market. In noisy environments, the captured user speech is mixed with ambient noise, which interferes with recognizing the user's speech and leads to low speech recognition accuracy.
Therefore, how to improve the accuracy of speech recognition in noisy environments has become one of the problems urgently needing a solution.
[Summary of the invention]
In view of this, embodiments of the present invention provide a lip reading recognition method and device based on deep learning, to solve the prior-art problem of low speech recognition accuracy in noisy environments.
To achieve the above object, according to an aspect of the invention, there is provided a lip reading recognition method based on deep learning. The method comprises: obtaining a voice signal and a video of a user, wherein the video is captured of the user's face while the user utters the voice signal; recognizing the voice signal by speech recognition technology to obtain a first text; obtaining a lip image sequence to be recognized from the video; extracting a lip feature vector from the lip image sequence to be recognized, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text, to obtain the text corresponding to the user's voice signal.
Further, recognizing the voice signal by speech recognition technology to obtain the first text comprises: performing feature extraction on the voice signal to obtain feature information; identifying the voice characteristic according to the feature information and a pre-established discrimination model; and recognizing the voice signal using a speech recognition model matching the voice characteristic, to obtain the first text.
Further, the video includes depth image information and infrared image information, and obtaining the lip image sequence to be recognized from the video comprises: extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting a first lip-region image sequence of the user from the depth image sequence; extracting a second lip-region image sequence of the user from the infrared image sequence; and taking the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be recognized.
Further, extracting the lip feature vector from the lip image sequence to be recognized and obtaining the second text according to the lip feature vector comprises: locating the lip contours in the lip image sequence to be recognized using a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; fusing the first lip contour curve and the second lip contour curve to obtain a target lip curve; extracting the lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in a lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; calculating the similarity value between the lip feature vector and the standard feature vector; selecting the lip feature vector whose similarity value exceeds a preset threshold as the target feature vector; outputting the lip reading text corresponding to the target feature vector; and ordering the multiple lip reading texts according to the video, to obtain the second text.
Further, correcting the first text according to the second text to obtain the text corresponding to the user's voice signal comprises: matching the second text against the first text; pre-outputting the successfully matched characters together with blanks, a blank marking a character that failed to match, to obtain a base text; obtaining, through context-based semantic analysis, the word associated with the character in the second text corresponding to the blank; and filling the blank in the base text with the associated word, to obtain the text corresponding to the user's voice signal.
To achieve the above object, according to another aspect of the invention, there is provided a lip reading recognition device based on deep learning. The device comprises: a first acquisition unit, for obtaining a voice signal and a video of a user, wherein the video is captured of the user's face while the user utters the voice signal; a recognition unit, for recognizing the voice signal by speech recognition technology to obtain a first text; a second acquisition unit, for obtaining a lip image sequence to be recognized from the video; a generation unit, for extracting a lip feature vector from the lip image sequence to be recognized and obtaining a second text according to the lip feature vector; and a correction unit, for correcting the first text according to the second text to obtain the text corresponding to the user's voice signal.
Further, the generation unit comprises: a locating subunit, for locating the lip contours in the lip image sequence to be recognized using a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; a fusion subunit, for fusing the first lip contour curve and the second lip contour curve to obtain a target lip curve; a first obtaining subunit, for extracting the lip feature vector from the target lip curve; a first matching subunit, for matching the extracted lip feature vector against the standard feature vectors stored in the lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; a calculation subunit, for calculating the similarity value between the lip feature vector and the standard feature vector; a confirmation subunit, for selecting the lip feature vector whose similarity value exceeds a preset threshold as the target feature vector; a first output subunit, for outputting the lip reading text corresponding to the target feature vector; and a composition subunit, for ordering the multiple lip reading texts according to the video to obtain the second text.
Further, the correction unit comprises: a second matching subunit, for matching the second text against the first text; a second output subunit, for pre-outputting the successfully matched characters together with blanks, a blank marking a character that failed to match, to obtain a base text; a second obtaining subunit, for obtaining, through context-based semantic analysis, the word associated with the character in the second text corresponding to the blank; and a fill-in subunit, for filling the blank in the base text with the associated word, to obtain the text corresponding to the user's voice signal.
To achieve the above object, according to another aspect of the invention, there is provided a storage medium. The storage medium includes a stored program, and when the program runs, the device on which the storage medium resides is controlled to execute the above lip reading recognition method based on deep learning.
To achieve the above object, according to another aspect of the invention, there is provided a server, including a memory and a processor. The memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and the program instructions, when loaded and executed by the processor, implement the steps of the above lip reading recognition method based on deep learning.
In this solution, the voice signal and the video of the user are obtained; a lip reading recognition algorithm based on deep learning recognizes the lip feature vector of the user in the video; a second text is obtained from the lip feature vector; and the second text is used to correct the first text recognized from the voice signal. The lip shape can thus be used in a noisy environment to obtain more accurately what the user said, improving speech recognition accuracy in noisy environments. Embodiments of the present invention therefore solve the prior-art problem of low speech recognition accuracy in noisy environments.
[Brief description of the drawings]
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without any creative effort.
Fig. 1 is a flowchart of a lip reading recognition method based on deep learning according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a lip reading recognition device based on deep learning according to an embodiment of the present invention.
[Detailed description of the embodiments]
For a better understanding of the technical solutions of the present invention, the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
It should be clear that the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present invention to describe terminals, the terminals should not be limited by these terms; the terms are only used to distinguish terminals from one another. For example, without departing from the scope of the embodiments of the present invention, a first acquisition unit could also be called a second acquisition unit, and similarly a second acquisition unit could be called a first acquisition unit.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a lip reading recognition method based on deep learning according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S101: obtain the voice signal and the video of a user, wherein the video is captured of the user's face while the user utters the voice signal;
Step S102: recognize the voice signal by speech recognition technology to obtain a first text;
Step S103: obtain a lip image sequence to be recognized from the video;
Step S104: extract a lip feature vector from the lip image sequence to be recognized, and obtain a second text according to the lip feature vector;
Step S105: correct the first text according to the second text, to obtain the text corresponding to the user's voice signal.
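Steps S101 to S105 can be sketched as the following pipeline. Every helper in it is a hypothetical placeholder (stubbed with toy return values) for a component the embodiment describes; it shows only the control flow, not an actual implementation.

```python
# Sketch of the S101-S105 pipeline. Each helper is a hypothetical stub
# standing in for a component described in the embodiment.

def recognize_speech(voice_signal):
    # S102: speech recognition yields the first text (stub).
    return "see you next qi"

def extract_lip_sequence(video):
    # S103: lip image sequence to be recognized (stub).
    return ["frame%d" % i for i in range(len(video))]

def lip_reading(lip_sequence):
    # S104: lip feature vectors -> second text (stub).
    return "see you next term"

def correct(first_text, second_text):
    # S105: keep words the two texts agree on; where they disagree,
    # trust the lip reading result (toy correction rule).
    out = []
    for a, b in zip(first_text.split(), second_text.split()):
        out.append(a if a == b else b)
    return " ".join(out)

def recognize(voice_signal, video):
    first_text = recognize_speech(voice_signal)   # S102
    lip_seq = extract_lip_sequence(video)         # S103
    second_text = lip_reading(lip_seq)            # S104
    return correct(first_text, second_text)       # S105

print(recognize(b"\x00" * 16, ["f0", "f1", "f2"]))  # -> see you next term
```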
In this solution, the voice signal and the video of the user are obtained; the lip reading recognition algorithm based on deep learning recognizes the lip feature vector of the user in the video; the second text is obtained from the lip feature vector; and the second text is used to correct the first text recognized from the voice signal. The lip shape can thus be used in a noisy environment to obtain what the user said, improving speech recognition accuracy in noisy environments. The present embodiment therefore solves the prior-art problem of low speech recognition accuracy in noisy environments.
Optionally, the user's voice signal may be obtained through a terminal device; it may be voice data in wav, mp3 or another format.
Optionally, the video includes depth image information and infrared image information. In one embodiment, the depth image information may be obtained by a 3D structured-light camera and the infrared image information by an infrared camera, so that the video is less susceptible to environmental factors such as light intensity; this effectively improves the legibility of the video and provides a basis for recognizing the correct lip reading text.
Optionally, recognizing the voice signal by speech recognition technology to obtain the first text comprises: performing feature extraction on the voice signal to obtain feature information; identifying the voice characteristic according to the feature information and the pre-established discrimination model; and recognizing the voice signal using the speech recognition model matching the voice characteristic, to obtain the first text.
Specifically, the feature extraction may be, for example, spectral feature extraction, fundamental-frequency feature extraction, power feature extraction or zero-crossing-rate extraction. The discrimination model may be established using modeling techniques such as a support vector machine (SVM) or a hidden Markov model (HMM), and may include a Mandarin model, a Chongqing-accent model, a Wu-dialect-accent model, a Henan-accent model, a Guangdong-accent model, etc., so that the identified voice characteristic is Mandarin, a Chongqing accent, a Wu-dialect accent, a Henan accent, a Guangdong accent, etc. Recognizing the voice signal with the speech recognition model matching the voice characteristic can effectively improve the accuracy of speech recognition.
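As one concrete illustration of the discrimination step, the sketch below picks the nearest accent centroid for a voice-feature vector and uses it to select a recognition model. This is a deliberate simplification: a nearest-centroid classifier stands in for the SVM/HMM discrimination model named above, and all feature values and model names are invented.

```python
import math

# Toy accent discrimination: nearest-centroid stand-in for the SVM/HMM
# discrimination model described above. A feature vector might hold
# spectral, fundamental-frequency and zero-crossing-rate statistics;
# the numbers below are invented for illustration.
CENTROIDS = {
    "mandarin":  [0.2, 0.5, 0.3],
    "chongqing": [0.7, 0.4, 0.1],
    "wu":        [0.4, 0.8, 0.6],
}

def identify_accent(features):
    # Euclidean distance to each accent centroid; pick the closest.
    def dist(c):
        return math.sqrt(sum((f - x) ** 2 for f, x in zip(features, c)))
    return min(CENTROIDS, key=lambda name: dist(CENTROIDS[name]))

# The identified accent then selects the matching recognition model
# (hypothetical model names).
RECOGNIZERS = {name: "%s_asr_model" % name for name in CENTROIDS}

accent = identify_accent([0.25, 0.55, 0.35])
print(accent, RECOGNIZERS[accent])  # -> mandarin mandarin_asr_model
```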
Optionally, the video includes depth image information and infrared image information, and obtaining the lip image sequence to be recognized from the video comprises: extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting the first lip-region image sequence of the user from the depth image sequence; extracting the second lip-region image sequence of the user from the infrared image sequence; and taking the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be recognized.
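The dual-stream extraction might be sketched as follows, assuming each frame arrives as a 2-D array and that a fixed region of interest stands in for a real face/lip detector (both assumptions, not part of the embodiment):

```python
import numpy as np

def crop_lip_region(frame, roi):
    # roi = (top, bottom, left, right) pixel bounds of the lip area.
    top, bottom, left, right = roi
    return frame[top:bottom, left:right]

def lip_sequences(depth_frames, ir_frames, roi):
    # One lip-region image sequence per stream; together they form the
    # lip image sequence to be recognized.
    first = [crop_lip_region(f, roi) for f in depth_frames]
    second = [crop_lip_region(f, roi) for f in ir_frames]
    return first, second

depth = [np.zeros((120, 160)) for _ in range(3)]  # fake depth frames
ir = [np.zeros((120, 160)) for _ in range(3)]     # fake infrared frames
roi = (80, 110, 60, 100)                          # assumed fixed lip box
d_seq, i_seq = lip_sequences(depth, ir, roi)
print(len(d_seq), d_seq[0].shape)  # -> 3 (30, 40)
```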
Optionally, extracting the lip feature vector from the lip image sequence to be recognized and obtaining the second text according to the lip feature vector comprises: locating the lip contours in the lip image sequence to be recognized using the lip reading recognition algorithm based on deep learning, to obtain the first lip contour curve and the second lip contour curve; fusing the first lip contour curve and the second lip contour curve to obtain the target lip curve; extracting the lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in the lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; calculating the similarity value between the lip feature vector and the standard feature vector; selecting the lip feature vector whose similarity value exceeds a preset threshold as the target feature vector; outputting the lip reading text corresponding to the target feature vector; and ordering the multiple lip reading texts according to the video, to obtain the second text.
The lip reading recognition algorithm of deep learning is trained on a large number of training samples and can effectively improve lip reading recognition efficiency. The fusion may merge the first lip contour curve and the second lip contour curve in terms of shape, size, texture and contrast, and the fused lip contour curve is then used for feature extraction. Extracting lip contour curves from both the depth image sequence and the infrared image sequence effectively avoids low video legibility where the user is filmed in weak light, making the lip reading recognition result more accurate.
Optionally, correcting the first text according to the second text to obtain the text corresponding to the user's voice signal comprises: matching the second text against the first text; pre-outputting the successfully matched characters together with blanks, a blank marking a character that failed to match, to obtain a base text; obtaining, through context-based semantic analysis, the word associated with the character in the second text corresponding to the blank; and filling the blank in the base text with the associated word, to obtain the text corresponding to the user's voice signal. For example: "我们下学_再见" ("see you next _"); the character in the second text corresponding to the blank is "气" (qì, "gas"), whose associated candidates include "七", "其", "期" and "奇"; semantic analysis selects "期" ("term"), yielding "我们下学期再见" ("see you next term").
An embodiment of the present invention provides a lip reading recognition device based on deep learning, for executing the above lip reading recognition method based on deep learning. As shown in Fig. 2, the device comprises: a first acquisition unit 10, a recognition unit 20, a second acquisition unit 30, a generation unit 40 and a correction unit 50.
The first acquisition unit 10 is configured to obtain the voice signal and the video of a user, wherein the video is captured of the user's face while the user utters the voice signal.
The recognition unit 20 is configured to recognize the voice signal by speech recognition technology to obtain the first text.
The second acquisition unit 30 is configured to obtain the lip image sequence to be recognized from the video.
The generation unit 40 is configured to extract the lip feature vector from the lip image sequence to be recognized and to obtain the second text according to the lip feature vector.
The correction unit 50 is configured to correct the first text according to the second text, to obtain the text corresponding to the user's voice signal.
In this solution, the voice signal and the video of the user are obtained; the lip reading recognition algorithm based on deep learning recognizes the lip feature vector of the user in the video; the second text is obtained from the lip feature vector; and the second text is used to correct the first text recognized from the voice signal. The lip shape can thus be used in a noisy environment to obtain what the user said, improving speech recognition accuracy in noisy environments. The present embodiment therefore solves the prior-art problem of low speech recognition accuracy in noisy environments.
Optionally, the user's voice signal may be obtained through a terminal device; it may be voice data in wav, mp3 or another format.
Optionally, the video includes depth image information and infrared image information. In one embodiment, the depth image information may be obtained by a 3D structured-light camera and the infrared image information by an infrared camera, so that the video is less susceptible to environmental factors such as light intensity; this effectively improves the legibility of the video and provides a basis for recognizing the correct lip reading text.
Optionally, the recognition unit 20 comprises a first extraction subunit, an identification subunit and a generation subunit.
The first extraction subunit is configured to perform feature extraction on the voice signal to obtain feature information; the identification subunit is configured to identify the voice characteristic according to the feature information and the pre-established discrimination model; the generation subunit is configured to recognize the voice signal using the speech recognition model matching the voice characteristic, to obtain the first text.
Specifically, the feature extraction may be, for example, spectral feature extraction, fundamental-frequency feature extraction, power feature extraction or zero-crossing-rate extraction. The discrimination model may be established using modeling techniques such as a support vector machine (SVM) or a hidden Markov model (HMM), and may include a Mandarin model, a Chongqing-accent model, a Wu-dialect-accent model, a Henan-accent model, a Guangdong-accent model, etc., so that the identified voice characteristic is Mandarin, a Chongqing accent, a Wu-dialect accent, a Henan accent, a Guangdong accent, etc. Recognizing the voice signal with the speech recognition model matching the voice characteristic can effectively improve the accuracy of speech recognition.
Optionally, the second acquisition unit 30 comprises a second extraction subunit, a third extraction subunit, a fourth extraction subunit and a first processing subunit.
The second extraction subunit is configured to extract the depth image sequence from the depth image information and the infrared image sequence from the infrared image information; the third extraction subunit is configured to extract the first lip-region image sequence of the user from the depth image sequence; the fourth extraction subunit is configured to extract the second lip-region image sequence of the user from the infrared image sequence; the first processing subunit is configured to take the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be recognized.
Optionally, the generation unit 40 comprises a locating subunit, a fusion subunit, a first obtaining subunit, a first matching subunit, a calculation subunit, a confirmation subunit, a first output subunit and a composition subunit.
The locating subunit is configured to locate the lip contours in the lip image sequence to be recognized using the lip reading recognition algorithm based on deep learning, to obtain the first lip contour curve and the second lip contour curve; the fusion subunit is configured to fuse the first lip contour curve and the second lip contour curve to obtain the target lip curve; the first obtaining subunit is configured to extract the lip feature vector from the target lip curve; the first matching subunit is configured to match the extracted lip feature vector against the standard feature vectors stored in the lip speech feature library, the lip speech feature library including a Mandarin feature library and multiple dialect feature libraries; the calculation subunit is configured to calculate the similarity value between the lip feature vector and the standard feature vector; the confirmation subunit is configured to select the lip feature vector whose similarity value exceeds the preset threshold as the target feature vector; the first output subunit is configured to output the lip reading text corresponding to the target feature vector; the composition subunit is configured to order the multiple lip reading texts according to the video, to obtain the second text.
The lip reading recognition algorithm of deep learning is trained on a large number of training samples and can effectively improve lip reading recognition efficiency. The fusion may merge the first lip contour curve and the second lip contour curve in terms of shape, size, texture and contrast, and the fused lip contour curve is then used for feature extraction. Extracting lip contour curves from both the depth image sequence and the infrared image sequence effectively avoids low video legibility where the user is filmed in weak light, making the lip reading recognition result more accurate.
Optionally, the amending unit 50 includes: a second matching subelement, a second output subelement, a second obtaining subelement, and a fill-in subelement.
The second matching subelement is configured to match the second text with the first text. The second output subelement is configured to pre-output the successfully matched text together with spaces, where a space indicates text that was not successfully matched, to obtain a base text. The second obtaining subelement is configured to perform context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space. The fill-in subelement is configured to fill the spaces in the base text with the associated words to obtain the text corresponding to the user's voice signal. For example, for the utterance "see you next term" (我们下学期再见): the character in the second text corresponding to the space is "气" ("gas"), and its associated near-homophones include "七" ("seven"), "其" ("its"), "期" ("term"), "奇" ("odd"), and so on; semantic analysis selects "期", yielding "我们下学期再见" ("see you next term").
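The space-filling step can be illustrated with a toy context model. The sketch below assumes "_" marks an unmatched character and scores each associated candidate with a hypothetical character-trigram table; a real system would use a trained semantic or language model rather than this hand-built table:

```python
def fill_gaps(base_text, candidates, context_score):
    """Fill '_' placeholders in the base text with the candidate word that
    scores highest given the surrounding characters (toy semantic analysis)."""
    chars = list(base_text)
    for i, ch in enumerate(chars):
        if ch == "_":
            left = chars[i - 1] if i > 0 else ""
            right = chars[i + 1] if i + 1 < len(chars) else ""
            chars[i] = max(candidates[i],
                           key=lambda c: context_score(left, c, right))
    return "".join(chars)

# Hypothetical trigram counts standing in for semantic analysis.
trigram_counts = {("学", "期", "再"): 9, ("学", "气", "再"): 0, ("学", "七", "再"): 1}
score = lambda l, c, r: trigram_counts.get((l, c, r), 0)
base = "我们下学_再见"            # '_' marks the unmatched character
cands = {4: ["气", "七", "期"]}  # associated words for the gap at index 4
print(fill_gaps(base, cands, score))  # 我们下学期再见
```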
An embodiment of the present invention provides a storage medium. The storage medium includes a stored program, and when the program runs, a device where the storage medium is located is controlled to execute the following steps:
obtaining the voice signal and video of a user, where the video is obtained by shooting the user's face while the user utters the voice signal; recognizing the voice signal through speech recognition technology to obtain a first text; obtaining a lip image sequence to be identified from the video; extracting a lip feature vector from the lip image sequence to be identified, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text to obtain the text corresponding to the user's voice signal.
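The five steps above can be sketched as a single orchestration function. All model functions here are injected stand-ins, since the patent does not fix any concrete speech recognition or lip reading implementation:

```python
def recognize_with_lip_correction(audio, video, asr, lip_reader, correct):
    """Orchestrate the claimed steps: speech recognition gives a first text,
    lip reading the video gives a second text, and the second corrects the
    first. `asr`, `lip_reader`, and `correct` are injected model stubs."""
    first_text = asr(audio)               # speech recognition technology
    lip_frames = video                    # lip image sequence to be identified
    second_text = lip_reader(lip_frames)  # lip feature vectors -> text
    return correct(first_text, second_text)

# Illustrative stubs showing the data flow only; '?' marks an uncertain match.
asr = lambda audio: "see you next te?m"
lip_reader = lambda frames: "see you next term"
correct = lambda first, second: second if "?" in first else first
print(recognize_with_lip_correction(b"pcm", ["frame0", "frame1"],
                                    asr, lip_reader, correct))
```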
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: performing feature extraction on the voice signal to obtain characteristic information; identifying voice features according to the characteristic information and a pre-established discrimination model; and recognizing the voice signal with a speech recognition model that matches the voice features to obtain the first text.
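As an illustration of the feature-extraction and model-selection steps, the sketch below computes two simple per-frame descriptors (energy and zero-crossing rate) and uses a toy discrimination model to pick a recognition profile. The descriptors and the profile rule are assumptions for illustration; a production system would use richer features such as MFCCs and a trained classifier:

```python
def extract_features(samples, frame_len=160):
    """Split a PCM signal into frames and compute two simple descriptors
    per frame: mean energy and zero-crossing rate."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:])
                  if (a < 0) != (b < 0)) / frame_len
        feats.append((energy, zcr))
    return feats

def pick_model(features, discrimination_model):
    """Toy discrimination model: map mean energy to the profile that
    selects the matching speech recognition model."""
    mean_energy = sum(e for e, _ in features) / len(features)
    return discrimination_model(mean_energy)

samples = [10 if i % 4 < 2 else -10 for i in range(320)]  # toy square wave
feats = extract_features(samples)
model = pick_model(feats, lambda e: "loud-profile" if e > 50 else "soft-profile")
print(model)  # loud-profile
```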
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: the video includes depth image information and infrared image information; extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting a first lip-region image sequence of the user from the depth image sequence; extracting a second lip-region image sequence of the user from the infrared image sequence; and using the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be identified.
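Extracting a lip-region image sequence from a depth or infrared sequence reduces, per frame, to cropping a lip bounding box. A minimal sketch, assuming the bounding box comes from an external face or landmark detector (not shown here):

```python
def crop_lip_region(frame, lip_box):
    """Crop the lip region from one frame given a bounding box.
    `frame` is a 2-D list of pixel values; `lip_box` is (top, left, h, w),
    assumed to be supplied by a face/landmark detector."""
    top, left, h, w = lip_box
    return [row[left:left + w] for row in frame[top:top + h]]

def lip_sequence(frames, lip_box):
    """Apply the same crop to every frame of a depth or infrared sequence."""
    return [crop_lip_region(f, lip_box) for f in frames]

frame = [[r * 10 + c for c in range(6)] for r in range(6)]  # 6x6 toy image
seq = lip_sequence([frame, frame], lip_box=(4, 1, 2, 4))
print(seq[0])  # [[41, 42, 43, 44], [51, 52, 53, 54]]
```

Running the same crop over the depth and infrared sequences yields the first and second lip-region image sequences, respectively.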
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: locating the lip contours in the lip image sequence to be identified with the lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; performing fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve; extracting a lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in the lip reading feature library, where the lip reading feature library includes a Mandarin feature library and multiple dialect feature libraries; calculating similarity values between the lip feature vector and the standard feature vectors; selecting lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors; outputting the lip reading text corresponding to each target feature vector; and sorting the multiple lip reading texts according to the video to obtain the second text.
Optionally, when the program runs, the device where the storage medium is located also executes the following steps: matching the second text with the first text; pre-outputting the successfully matched text together with spaces, where a space indicates text that was not successfully matched, to obtain a base text; performing context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and filling the spaces in the base text with the associated words to obtain the text corresponding to the user's voice signal.
An embodiment of the present invention provides a server, including a memory and a processor. The memory is configured to store information including program instructions, and the processor is configured to control the execution of the program instructions. When loaded and executed by the processor, the program instructions implement the following steps:
obtaining the voice signal and video of a user, where the video is obtained by shooting the user's face while the user utters the voice signal; recognizing the voice signal through speech recognition technology to obtain a first text; obtaining a lip image sequence to be identified from the video; extracting a lip feature vector from the lip image sequence to be identified, and obtaining a second text according to the lip feature vector; and correcting the first text according to the second text to obtain the text corresponding to the user's voice signal.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: performing feature extraction on the voice signal to obtain characteristic information; identifying voice features according to the characteristic information and a pre-established discrimination model; and recognizing the voice signal with a speech recognition model that matches the voice features to obtain the first text.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: the video includes depth image information and infrared image information; extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information; extracting a first lip-region image sequence of the user from the depth image sequence; extracting a second lip-region image sequence of the user from the infrared image sequence; and using the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be identified.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: locating the lip contours in the lip image sequence to be identified with the lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve; performing fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve; extracting a lip feature vector from the target lip curve; matching the extracted lip feature vector against the standard feature vectors stored in the lip reading feature library, where the lip reading feature library includes a Mandarin feature library and multiple dialect feature libraries; calculating similarity values between the lip feature vector and the standard feature vectors; selecting lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors; outputting the lip reading text corresponding to each target feature vector; and sorting the multiple lip reading texts according to the video to obtain the second text.
Optionally, when loaded and executed by the processor, the program instructions also implement the following steps: matching the second text with the first text; pre-outputting the successfully matched text together with spaces, where a space indicates text that was not successfully matched, to obtain a base text; performing context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and filling the spaces in the base text with the associated words to obtain the text corresponding to the user's voice signal.
It should be noted that the terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (Personal Computer, PC), personal digital assistants (Personal Digital Assistant, PDA), wireless handheld devices, tablet computers (Tablet Computer), mobile phones, MP3 players, MP4 players, and the like.
It can be understood that the application may be a native application (nativeApp) installed on the terminal, or a web application (webApp) running in a browser on the terminal; this is not limited in the embodiments of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, devices, and units described above, and details are not described herein again.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the units is only a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (Processor) to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A lip reading recognition method based on deep learning, characterized in that the method comprises:
obtaining a voice signal and a video of a user, wherein the video is obtained by shooting the face of the user while the user utters the voice signal;
recognizing the voice signal through speech recognition technology to obtain a first text;
obtaining a lip image sequence to be identified from the video;
extracting a lip feature vector from the lip image sequence to be identified, and obtaining a second text according to the lip feature vector; and
correcting the first text according to the second text to obtain a text corresponding to the voice signal of the user.
2. The method according to claim 1, characterized in that recognizing the voice signal through speech recognition technology to obtain the first text comprises:
performing feature extraction on the voice signal to obtain characteristic information;
identifying voice features according to the characteristic information and a pre-established discrimination model; and
recognizing the voice signal with a speech recognition model that matches the voice features to obtain the first text.
3. The method according to claim 1, characterized in that the video comprises depth image information and infrared image information, and obtaining the lip image sequence to be identified from the video comprises:
extracting a depth image sequence from the depth image information, and extracting an infrared image sequence from the infrared image information;
extracting a first lip-region image sequence of the user from the depth image sequence;
extracting a second lip-region image sequence of the user from the infrared image sequence; and
using the first lip-region image sequence and the second lip-region image sequence as the lip image sequence to be identified.
4. The method according to any one of claims 1 to 3, characterized in that extracting the lip feature vector from the lip image sequence to be identified and obtaining the second text according to the lip feature vector comprises:
locating lip contours in the lip image sequence to be identified with a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve;
performing fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve;
extracting the lip feature vector from the target lip curve;
matching the extracted lip feature vector against standard feature vectors stored in a lip reading feature library, wherein the lip reading feature library comprises a Mandarin feature library and multiple dialect feature libraries;
calculating similarity values between the lip feature vector and the standard feature vectors;
selecting lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors;
outputting lip reading texts corresponding to the target feature vectors; and
sorting the multiple lip reading texts according to the video to obtain the second text.
5. The method according to any one of claims 1 to 3, characterized in that correcting the first text according to the second text to obtain the text corresponding to the voice signal of the user comprises:
matching the second text with the first text;
pre-outputting the successfully matched text together with spaces, wherein a space indicates text that was not successfully matched, to obtain a base text;
performing context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and
filling the spaces in the base text with the associated words to obtain the text corresponding to the voice signal of the user.
6. A lip reading recognition device based on deep learning, characterized in that the device comprises:
a first acquisition unit, configured to obtain a voice signal and a video of a user, wherein the video is obtained by shooting the face of the user while the user utters the voice signal;
a recognition unit, configured to recognize the voice signal through speech recognition technology to obtain a first text;
a second acquisition unit, configured to obtain a lip image sequence to be identified from the video;
a generation unit, configured to extract a lip feature vector from the lip image sequence to be identified and obtain a second text according to the lip feature vector; and
an amending unit, configured to correct the first text according to the second text to obtain a text corresponding to the voice signal of the user.
7. The device according to claim 6, characterized in that the generation unit comprises:
a locating subelement, configured to locate lip contours in the lip image sequence to be identified with a lip reading recognition algorithm based on deep learning, to obtain a first lip contour curve and a second lip contour curve;
a fusion subelement, configured to perform fusion processing on the first lip contour curve and the second lip contour curve to obtain a target lip curve;
a first obtaining subelement, configured to extract the lip feature vector from the target lip curve;
a first matching subelement, configured to match the extracted lip feature vector against standard feature vectors stored in a lip reading feature library, wherein the lip reading feature library comprises a Mandarin feature library and multiple dialect feature libraries;
a computation subelement, configured to calculate similarity values between the lip feature vector and the standard feature vectors;
a confirmation subelement, configured to select lip feature vectors whose similarity value is greater than a preset threshold as target feature vectors;
a first output subelement, configured to output lip reading texts corresponding to the target feature vectors; and
a composition subelement, configured to sort the multiple lip reading texts according to the video to obtain the second text.
8. The device according to claim 6, characterized in that the amending unit comprises:
a second matching subelement, configured to match the second text with the first text;
a second output subelement, configured to pre-output the successfully matched text together with spaces, wherein a space indicates text that was not successfully matched, to obtain a base text;
a second obtaining subelement, configured to perform context-based semantic analysis to obtain associated words for the text in the second text that corresponds to a space; and
a fill-in subelement, configured to fill the spaces in the base text with the associated words to obtain the text corresponding to the voice signal of the user.
9. A storage medium, comprising a stored program, characterized in that, when the program runs, a device where the storage medium is located is controlled to perform the lip reading recognition method based on deep learning according to any one of claims 1 to 5.
10. A server, comprising a memory and a processor, wherein the memory is configured to store information including program instructions and the processor is configured to control execution of the program instructions, characterized in that the program instructions, when loaded and executed by the processor, implement the steps of the lip reading recognition method based on deep learning according to any one of claims 1 to 5.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811269809 | 2018-10-29 | ||
CN2018112698099 | 2018-10-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109637521A true CN109637521A (en) | 2019-04-16 |
Family
ID=66068629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811389295.0A Pending CN109637521A (en) | 2018-10-29 | 2018-11-21 | A kind of lip reading recognition methods and device based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109637521A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101752A (en) * | 2007-07-19 | 2008-01-09 | 华中科技大学 | Monosyllabic language lip-reading recognition system based on vision character |
CN106504751A (en) * | 2016-08-01 | 2017-03-15 | 深圳奥比中光科技有限公司 | Self adaptation lip reading exchange method and interactive device |
CN106875941A (en) * | 2017-04-01 | 2017-06-20 | 彭楚奥 | A kind of voice method for recognizing semantics of service robot |
CN107045385A (en) * | 2016-08-01 | 2017-08-15 | 深圳奥比中光科技有限公司 | Lip reading exchange method and lip reading interactive device based on depth image |
CN108346427A (en) * | 2018-02-05 | 2018-07-31 | 广东小天才科技有限公司 | A kind of audio recognition method, device, equipment and storage medium |
CN108537207A (en) * | 2018-04-24 | 2018-09-14 | Oppo广东移动通信有限公司 | Lip reading recognition methods, device, storage medium and mobile terminal |
- 2018-11-21 CN CN201811389295.0A patent/CN109637521A/en active Pending
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110210310A (en) * | 2019-04-30 | 2019-09-06 | 北京搜狗科技发展有限公司 | A kind of method for processing video frequency, device and the device for video processing |
CN112417925A (en) * | 2019-08-21 | 2021-02-26 | 北京中关村科金技术有限公司 | In-vivo detection method and device based on deep learning and storage medium |
CN110992958A (en) * | 2019-11-19 | 2020-04-10 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN110992958B (en) * | 2019-11-19 | 2021-06-22 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN111028833A (en) * | 2019-12-16 | 2020-04-17 | 广州小鹏汽车科技有限公司 | Interaction method and device for interaction and vehicle interaction |
CN111028833B (en) * | 2019-12-16 | 2022-08-16 | 广州小鹏汽车科技有限公司 | Interaction method and device for interaction and vehicle interaction |
CN111447325A (en) * | 2020-04-03 | 2020-07-24 | 上海闻泰电子科技有限公司 | Call auxiliary method, device, terminal and storage medium |
CN111583916A (en) * | 2020-05-19 | 2020-08-25 | 科大讯飞股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111832412A (en) * | 2020-06-09 | 2020-10-27 | 北方工业大学 | Sound production training correction method and system |
CN111832412B (en) * | 2020-06-09 | 2024-04-09 | 北方工业大学 | Sounding training correction method and system |
WO2022033556A1 (en) * | 2020-08-14 | 2022-02-17 | 华为技术有限公司 | Electronic device and speech recognition method therefor, and medium |
CN112037788B (en) * | 2020-09-10 | 2021-08-24 | 中航华东光电(上海)有限公司 | Voice correction fusion method |
CN112037788A (en) * | 2020-09-10 | 2020-12-04 | 中航华东光电(上海)有限公司 | Voice correction fusion technology |
CN112820274B (en) * | 2021-01-08 | 2021-09-28 | 上海仙剑文化传媒股份有限公司 | Voice information recognition correction method and system |
CN112820274A (en) * | 2021-01-08 | 2021-05-18 | 上海仙剑文化传媒股份有限公司 | Voice information recognition correction method and system |
CN113345436B (en) * | 2021-08-05 | 2021-11-12 | 创维电器股份有限公司 | Remote voice recognition control system and method based on multi-system integration high recognition rate |
CN113345436A (en) * | 2021-08-05 | 2021-09-03 | 创维电器股份有限公司 | Remote voice recognition control system and method based on multi-system integration high recognition rate |
CN113660501A (en) * | 2021-08-11 | 2021-11-16 | 云知声(上海)智能科技有限公司 | Method and device for matching subtitles |
CN113722513A (en) * | 2021-09-06 | 2021-11-30 | 北京字节跳动网络技术有限公司 | Multimedia data processing method and equipment |
CN113722513B (en) * | 2021-09-06 | 2022-12-20 | 抖音视界有限公司 | Multimedia data processing method and equipment |
CN114676282A (en) * | 2022-04-11 | 2022-06-28 | 北京女娲补天科技信息技术有限公司 | Event entry method and device based on audio and video data and computer equipment |
CN114676282B (en) * | 2022-04-11 | 2023-02-03 | 北京女娲补天科技信息技术有限公司 | Event entry method and device based on audio and video data and computer equipment |
CN116805272A (en) * | 2022-10-29 | 2023-09-26 | 武汉行已学教育咨询有限公司 | Visual education teaching analysis method, system and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109637521A (en) | A kind of lip reading recognition methods and device based on deep learning | |
EP3553773B1 (en) | Training and testing utterance-based frameworks | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
US9779730B2 (en) | Method and apparatus for speech recognition and generation of speech recognition engine | |
KR102582291B1 (en) | Emotion information-based voice synthesis method and device | |
CN106575502B (en) | System and method for providing non-lexical cues in synthesized speech | |
CN110838289A (en) | Awakening word detection method, device, equipment and medium based on artificial intelligence | |
CN109686383B (en) | Voice analysis method, device and storage medium | |
US7792671B2 (en) | Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments | |
CN105654940B (en) | Speech synthesis method and device | |
CN109036471B (en) | Voice endpoint detection method and device | |
WO2014183373A1 (en) | Systems and methods for voice identification | |
JP2010152751A (en) | Statistic model learning device, statistic model learning method and program | |
CN109166569B (en) | Detection method and device for phoneme mislabeling | |
CN112735371B (en) | Method and device for generating speaker video based on text information | |
CN111402862A (en) | Voice recognition method, device, storage medium and equipment | |
CN109461459A (en) | Speech assessment method, apparatus, computer equipment and storage medium | |
CN110503941B (en) | Language ability evaluation method, device, system, computer equipment and storage medium | |
CN106847273B (en) | Awakening word selection method and device for voice recognition | |
CN111552777A (en) | Audio identification method and device, electronic equipment and storage medium | |
US20110224985A1 (en) | Model adaptation device, method thereof, and program thereof | |
US11615787B2 (en) | Dialogue system and method of controlling the same | |
CN110853669A (en) | Audio identification method, device and equipment | |
CN111680514A (en) | Information processing and model training method, device, equipment and storage medium | |
CN114783424A (en) | Text corpus screening method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190416 |