CN107293300A - Speech recognition method and apparatus, computer device and readable storage medium - Google Patents
Speech recognition method and apparatus, computer device and readable storage medium
- Publication number
- CN107293300A (application CN201710648985.2A)
- Authority
- CN
- China
- Prior art keywords
- voice information
- lip
- pause information
- user
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention provides a speech recognition method. The speech recognition method includes: obtaining voice information input by a user; obtaining a lip image of the user while the user inputs the voice information; recognizing pause information in the voice information according to the lip image; and performing speech recognition on the voice information according to the pause information. The invention also provides a speech recognition apparatus, a computer device and a computer-readable storage medium. The invention can perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
Description
Technical field
The present invention relates to the technical field of intelligent speech, and in particular to a speech recognition method and apparatus, a computer device and a readable storage medium.
Background
At present, with the development of electronic and communication technology, terminals such as mobile phones and tablet computers are widely used, and the modes of human-computer interaction are increasingly diverse. Voice input, as one of the most convenient and natural modes of human-computer interaction, is increasingly accepted by users. However, the accuracy of current speech recognition is not high, and the user experience is poor.
Summary of the invention
In view of the above, it is necessary to provide a speech recognition method and apparatus, a computer device and a readable storage medium that can perform speech recognition with the aid of lip images, thereby improving the accuracy of speech recognition.
A first aspect of the present application provides a speech recognition method, the method comprising:
obtaining voice information input by a user;
obtaining a lip image of the user while the user inputs the voice information;
recognizing pause information in the voice information according to the lip image;
performing speech recognition on the voice information according to the pause information.
In another possible implementation, performing speech recognition on the voice information according to the pause information comprises:
inserting the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
removing the pause information from the voice information, and performing speech recognition on the voice information from which the pause information has been removed.
In another possible implementation, recognizing pause information in the voice information according to the lip image comprises:
recognizing word-break pause information and/or punctuation pause information in the voice information according to the lip image;
and performing speech recognition on the voice information according to the pause information comprises:
performing speech recognition on the voice information according to the word-break pause information and/or the punctuation pause information.
In another possible implementation, obtaining the voice information input by the user and obtaining the lip image of the user while the user inputs the voice information comprises:
when the user inputs the voice information, collecting the voice information through a microphone of a terminal and capturing the lip image through a camera of the terminal.
In another possible implementation, the method further comprises:
judging whether lip motion information matches the voice information;
if the lip motion information does not match the voice information, controlling the camera to stop capturing the lip image.
In another possible implementation, the method further comprises:
obtaining a motion amplitude of the user's lips according to the lip image, and recognizing a tone corresponding to the voice information according to the motion amplitude of the user's lips; or
obtaining lip characteristics of the user's pronunciation, determining user features according to the lip characteristics, and performing speech recognition on the voice information according to the user features and the pause information.
A second aspect of the present application provides a speech recognition apparatus, the apparatus comprising:
a first obtaining unit, configured to obtain voice information input by a user;
a second obtaining unit, configured to obtain a lip image of the user while the user inputs the voice information;
a first recognition unit, configured to recognize pause information in the voice information according to the lip image;
a second recognition unit, configured to perform speech recognition on the voice information according to the pause information.
In another possible implementation, the second recognition unit is specifically configured to:
insert the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
remove the pause information from the voice information, and perform speech recognition on the voice information from which the pause information has been removed.
A third aspect of the present application provides a computer device. The computer device comprises a processor, and the processor is configured to implement the steps of the speech recognition method when executing a computer program stored in a memory.
A fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the speech recognition method.
The present invention obtains voice information input by a user, obtains a lip image of the user while the voice information is being input, recognizes pause information in the voice information according to the lip image, and performs speech recognition on the voice information according to the pause information. The present invention can thus perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
Brief description of the drawings
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment 1 of the present invention;
Fig. 2 is a structural diagram of the speech recognition apparatus provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of the computer device provided by Embodiment 3 of the present invention.
Description of main element symbols
Computer device 1
Speech recognition apparatus 10
Memory 20
Processor 30
Computer program 40
First obtaining unit 201
Second obtaining unit 202
First recognition unit 203
Second recognition unit 204
The following embodiments will further illustrate the present invention with reference to the above drawings.
Detailed description of the embodiments
In order to make the above objects, features and advantages of the present invention more clearly understood, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention. The described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present invention belongs. The terms used herein in the description of the present invention are only for the purpose of describing specific embodiments and are not intended to limit the present invention.
Preferably, the speech recognition method of the present invention is applied in one or more terminals. A terminal is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The terminal may be, but is not limited to, any electronic product capable of human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, a voice-controlled device or the like, for example a tablet computer, a smartphone, a personal digital assistant (PDA), or a smart wearable device.
Embodiment 1
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method specifically includes the following steps:
101: Obtain voice information input by a user.
The voice information is speech data obtained from the user's natural speech. For example, the voice information is a speech signal obtained by converting the user's natural speech into an electrical signal through a microphone.
The voice information may be collected through a microphone of a terminal while the user inputs it. For example, it may be detected whether a voice-input start instruction is received (for example, whether the home key of the terminal is long-pressed); if the voice-input start instruction is received, collection of the voice information input by the user through the microphone of the terminal is started. It may also be detected whether a voice-input end instruction is received (for example, whether the home key of the terminal is released); if the voice-input end instruction is received, collection of the voice information input by the user through the microphone of the terminal is stopped.
Alternatively, voice information collected in advance may be read. For example, the voice information input by the user may be collected in advance, and when speech recognition needs to be performed on the voice information, the voice information is read.
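As a purely illustrative aid to the start/stop collection scheme described above, the following Python sketch records microphone audio between a hypothetical key-press and key-release event using the sounddevice library; the key-event hooks and the 16 kHz sample rate are assumptions, not part of the original disclosure.

```python
# Illustrative sketch only: push-to-talk voice collection between a start and an
# end instruction, as in step 101. The key-event hooks are hypothetical.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000  # assumed sample rate
_frames = []         # collected audio blocks

def _callback(indata, frames, time, status):
    # Append each incoming block while the stream is running.
    _frames.append(indata.copy())

stream = sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=_callback)

def on_voice_input_start():      # e.g. home key long-pressed
    _frames.clear()
    stream.start()

def on_voice_input_end():        # e.g. home key released
    stream.stop()
    return np.concatenate(_frames) if _frames else np.zeros((0, 1))
```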
102: Obtain a lip image of the user while the voice information is being input.
The lip image, also referred to as a lip motion video or lip-reading image, is an image of the changing lip movements of a speaker while speaking. The lip images over a period of time may constitute an image sequence or an image video.
A face image of the user while inputting the voice information may be obtained, and the lip position may be determined from the face image, so as to obtain the lip image.
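As a rough sketch of determining the lip position from a face image, the snippet below uses OpenCV's stock Haar face detector and simply crops the lower third of the detected face box as an approximate lip region; the crop ratio and the choice of detector are illustrative assumptions.

```python
# Illustrative sketch: locate a face and crop an approximate lip region (the
# lower third of the face box). The 1/3 crop ratio is an assumed heuristic.
import cv2

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_region(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                              # take the first detected face
    return frame[y + 2 * h // 3 : y + h, x : x + w]    # lower third ~ mouth area
```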
The camera may also be aimed directly at the user's lips for shooting, so as to obtain the lip image. For example, the camera may be built into the microphone (for example, into a headset), or the microphone may be built into the camera; in use, the camera is aimed directly at the user's lips, so that lip images can be obtained conveniently.
The lip image may be captured through a camera of the terminal while the user inputs the voice information. For example, it may be detected whether a voice-input start instruction is received; if the voice-input start instruction is received, the lip image of the user is captured through the camera of the terminal while the voice information input by the user is being collected through the microphone of the terminal. It may also be detected whether a voice-input end instruction is received; if the voice-input end instruction is received, capture of the lip image through the camera of the terminal is stopped at the same time as collection of the voice information through the microphone of the terminal is stopped.
Alternatively, a lip image captured in advance may be read. For example, the lip image may be captured while the voice information input by the user is collected in advance, and when speech recognition needs to be performed on the voice information, the lip image is read.
While the voice information input by the user is being collected through the microphone of the terminal and the lip image is being captured through the camera of the terminal, it may be judged whether the lip motion information matches the voice information; if the lip motion information does not match the voice information, the camera is controlled to stop capturing the lip image.
It may be detected whether the lip motion information is synchronized with the voice information; if the lip motion information is not synchronized with the voice information, the lip motion information does not match the voice information. For example, if it is determined from the voice information that the user starts speaking at the 1st second, but it is determined from the lip motion information that the user starts speaking at the 5th second, the lip motion information is not synchronized with the voice information, and therefore the lip motion information does not match the voice information.
Alternatively, it may be detected whether the text information corresponding to the lip motion information is consistent with the text information corresponding to the voice information; if the two are inconsistent, the lip motion information does not match the voice information. For example, if the text information corresponding to the lip motion information in a certain time period is "I have a meeting" while the text information corresponding to the voice information is "the weather is nice today", the text information corresponding to the lip motion information is inconsistent with the text information corresponding to the voice information, and therefore the lip motion information does not match the voice information.
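One simple way to implement the synchronization check described above is to compare the onset time of speech energy in the audio with the onset time of lip motion; the sketch below does this with assumed energy and motion thresholds and a one-second tolerance, none of which are fixed by this disclosure.

```python
# Illustrative sketch: judge whether lip motion and voice information match by
# comparing their onset times. Thresholds and the 1 s tolerance are assumptions.
import numpy as np

def first_onset(values, threshold, times):
    """Timestamp of the first sample exceeding `threshold`, or None."""
    above = np.flatnonzero(np.asarray(values) > threshold)
    return times[above[0]] if above.size else None

def lips_match_voice(audio_energy, audio_times, lip_motion, lip_times,
                     energy_thr=0.01, motion_thr=0.5, tolerance_s=1.0):
    speech_start = first_onset(audio_energy, energy_thr, audio_times)
    motion_start = first_onset(lip_motion, motion_thr, lip_times)
    if speech_start is None or motion_start is None:
        return False
    # e.g. speech at the 1st second vs lip motion at the 5th second -> mismatch
    return abs(speech_start - motion_start) <= tolerance_s
```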
103: Recognize pause information in the voice information according to the lip image.
Pauses often occur while the user speaks. Therefore, the lip images include lip images during pauses, and the voice information includes voice information during pauses (pause information); the pause information contained in the voice information can be recognized from the lip images during pauses.
The user may pause while speaking when a word break or punctuation is needed. Therefore, the pause information may represent a word break and/or punctuation (in this case the pause information may be a mute signal), and the pause information may include word-break pause information and/or punctuation pause information.
Alternatively, the user may pause while speaking when the other party is speaking or when thinking. Therefore, the pause information may represent a period of silence; in this case the pause information is invalid voice input.
Alternatively, the user may pause while speaking when there is noise (for example, when the noise is excessive). Therefore, the pause information may represent noise (in this case the pause information may be a noise signal); in this case the pause information is invalid voice input.
When the pause information represents a word break and/or punctuation, word-break pause information and/or punctuation pause information in the voice information may be recognized according to the lip image.
It may be detected from the lip image whether the user's lips do not change, or change by no more than a preset amplitude, within a first preset time (for example, 0.1 second); if so, the voice information corresponding to the first preset time in the voice information is identified as word-break pause information.
It may be detected from the lip image whether the user's lips do not change, or change by no more than the preset amplitude, within a second preset time (for example, 0.5 second); if so, the voice information corresponding to the second preset time in the voice information is identified as punctuation pause information. The second preset time may be longer than the first preset time.
When the pause information represents a period of silence or noise, it may be detected from the lip image whether the user's lips do not change, or change by no more than the preset amplitude, within a third preset time (for example, 3 seconds); if so, the voice information corresponding to the third preset time in the voice information is identified as pause information. Alternatively, if the user's lips do not change, or change by no more than the preset amplitude, within the third preset time, and the amplitude of the voice signal corresponding to the third preset time in the voice information is greater than a preset threshold, the voice information corresponding to the third preset time in the voice information is identified as pause information. The third preset time may be longer than the second preset time.
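To make the three duration thresholds concrete, the following sketch classifies low-motion intervals of a per-frame lip motion amplitude sequence into word-break, punctuation and invalid pauses. The 0.1 s / 0.5 s / 3 s durations follow the examples above, while the frame rate, the amplitude threshold and the noise check are illustrative assumptions.

```python
# Illustrative sketch of step 103: classify pauses from per-frame lip motion
# amplitude. Durations follow the examples in the text; other values are assumed.
import numpy as np

FPS = 25                 # assumed camera frame rate
WORD_BREAK_S = 0.1       # first preset time
PUNCTUATION_S = 0.5      # second preset time
INVALID_S = 3.0          # third preset time
MOTION_THR = 0.5         # assumed "preset amplitude" of lip change
NOISE_THR = 0.05         # assumed audio-amplitude threshold for noise

def classify_pauses(lip_amplitude, audio_amplitude):
    """lip_amplitude / audio_amplitude: per-frame lip motion and audio level."""
    still = np.asarray(lip_amplitude) <= MOTION_THR
    pauses, start = [], None
    for i, is_still in enumerate(np.append(still, False)):   # sentinel closes last run
        if is_still and start is None:
            start = i
        elif not is_still and start is not None:
            duration = (i - start) / FPS
            if duration >= INVALID_S:
                noisy = np.mean(audio_amplitude[start:i]) > NOISE_THR
                kind = "noise" if noisy else "silence"        # invalid voice input
            elif duration >= PUNCTUATION_S:
                kind = "punctuation"
            elif duration >= WORD_BREAK_S:
                kind = "word_break"
            else:
                start = None
                continue
            pauses.append((start / FPS, i / FPS, kind))
            start = None
    return pauses  # list of (start_s, end_s, type)
```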
104: Perform speech recognition on the voice information according to the pause information.
If the pause information includes word-break pause information, speech recognition may be performed on the voice information according to the word-break pause information.
Alternatively, if the pause information includes punctuation pause information, speech recognition may be performed on the voice information according to the punctuation pause information.
Alternatively, if the pause information includes word-break pause information and punctuation pause information, speech recognition may be performed on the voice information according to the word-break pause information and the punctuation pause information.
The pause information may be inserted into the text information converted from the voice information according to the time mapping relationship (i.e., the corresponding time relationship) between the pause information and the voice information. For example, speech recognition may be performed on the voice information to obtain the text information corresponding to the voice information, and the pause information (word-break pause information and/or punctuation pause information) may then be inserted into the text information according to its time of occurrence in the voice information, so as to obtain text information that includes the pause information.
Alternatively, the pause information may be removed from the voice information, and speech recognition may be performed on the voice information from which the pause information has been removed. As described above, the pause information may represent noise or silence, i.e., invalid voice input; performing speech recognition on the voice information from which the pause information has been removed removes the noise or silence in the voice information.
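A minimal sketch of the two options just described is given below: inserting punctuation-type pause information into the recognized text using the time mapping, and cutting invalid pause segments out of the audio before recognition. The per-word timestamp format and the comma rendering are assumptions, since neither is fixed by this disclosure.

```python
# Illustrative sketch of step 104: (a) insert pause information into the converted
# text via the time mapping; (b) remove invalid pauses from the audio first.
import numpy as np

def insert_pauses(words, pauses):
    """words: [(word, start_s, end_s)] from an assumed recognizer;
    pauses: [(start_s, end_s, kind)] from the lip-based pause detection."""
    text = ""
    for word, _, w_end in words:
        text += word
        # A punctuation pause starting right after this word is rendered as a comma.
        if any(k == "punctuation" and abs(ps - w_end) < 0.05 for ps, _, k in pauses):
            text += ","
        text += " "
    return text.strip()

def remove_invalid_segments(samples, sample_rate, pauses):
    """Cut silence/noise pauses (invalid voice input) out of the raw audio array."""
    keep, cursor = [], 0
    for p_start, p_end, kind in sorted(pauses):
        if kind in ("silence", "noise"):
            keep.append(samples[cursor:int(p_start * sample_rate)])
            cursor = int(p_end * sample_rate)
    keep.append(samples[cursor:])
    return np.concatenate(keep)
```

For example, `insert_pauses([("hello", 0.0, 0.4), ("world", 0.5, 0.9)], [(0.9, 1.5, "punctuation")])` returns "hello world,".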
Various speech recognition techniques, such as dynamic time warping (DTW), hidden Markov models (HMM), vector quantization (VQ) and artificial neural networks (ANN), may be used to perform speech recognition on the voice information or on the voice information from which the pause information has been removed.
The speech recognition method of Embodiment 1 obtains voice information input by a user, obtains a lip image of the user while the voice information is being input, recognizes pause information in the voice information according to the lip image, and performs speech recognition on the voice information according to the pause information. The speech recognition method of Embodiment 1 can thus perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
In another embodiment, the method may further include: obtaining a motion amplitude of the user's lips according to the lip image, and recognizing a tone corresponding to the voice information according to the motion amplitude of the user's lips. The tone may include a declarative tone, an interrogative tone, an imperative tone, an exclamatory tone and the like. For example, if the motion amplitude of the user's lips is within a first preset amplitude range, it is determined that the tone corresponding to the voice information is an exclamatory tone; if the motion amplitude of the user's lips is within a second preset amplitude range, it is determined that the tone corresponding to the voice information is an imperative tone.
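As a toy illustration of this amplitude-to-tone mapping, the sketch below assigns a tone label according to which amplitude range the average lip motion falls into; the concrete range boundaries are invented for illustration and are not specified in this disclosure.

```python
# Illustrative sketch: map the average lip motion amplitude to a tone label.
# The range boundaries are assumed values, not part of the disclosure.
def recognize_tone(lip_amplitude):
    avg = sum(lip_amplitude) / max(len(lip_amplitude), 1)
    if avg >= 0.8:           # first preset amplitude range (assumed)
        return "exclamatory"
    if 0.6 <= avg < 0.8:     # second preset amplitude range (assumed)
        return "imperative"
    if 0.4 <= avg < 0.6:
        return "interrogative"
    return "declarative"
```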
In another embodiment, the method may further include: obtaining lip characteristics of the user's pronunciation; determining user features according to the lip characteristics; and performing speech recognition on the voice information according to the user features and the pause information. The user features may include the user's gender, language type, dialect type and/or habitual expressions, and the like. For example, the language type (for example, Chinese) may be determined according to the lip characteristics of the user's pronunciation, and speech recognition may be performed on the voice information according to the language type and the pause information. Obtaining more auxiliary information (i.e., user features) before performing speech recognition on the voice information can further improve the accuracy of speech recognition.
Embodiment 2
Fig. 2 is a structural diagram of the speech recognition apparatus provided by Embodiment 2 of the present invention. As shown in Fig. 2, the speech recognition apparatus 10 may include: a first obtaining unit 201, a second obtaining unit 202, a first recognition unit 203 and a second recognition unit 204.
The first obtaining unit 201 is configured to obtain voice information input by a user.
The voice information is speech data obtained from the user's natural speech. For example, the voice information is a speech signal obtained by converting the user's natural speech into an electrical signal through a microphone.
The voice information may be collected through a microphone of a terminal while the user inputs it. For example, it may be detected whether a voice-input start instruction is received (for example, whether the home key of the terminal is long-pressed); if the voice-input start instruction is received, collection of the voice information input by the user through the microphone of the terminal is started. It may also be detected whether a voice-input end instruction is received (for example, whether the home key of the terminal is released); if the voice-input end instruction is received, collection of the voice information input by the user through the microphone of the terminal is stopped.
Alternatively, voice information collected in advance may be read. For example, the voice information input by the user may be collected in advance, and when speech recognition needs to be performed on the voice information, the voice information is read.
The second obtaining unit 202 is configured to obtain a lip image of the user while the voice information is being input.
The lip image, also referred to as a lip motion video or lip-reading image, is an image of the changing lip movements of a speaker while speaking. The lip images over a period of time may constitute an image sequence or an image video.
A face image of the user while inputting the voice information may be obtained, and the lip position may be determined from the face image, so as to obtain the lip image.
The camera may also be aimed directly at the user's lips for shooting, so as to obtain the lip image. For example, the camera may be built into the microphone (for example, into a headset), or the microphone may be built into the camera; in use, the camera is aimed directly at the user's lips, so that lip images can be obtained conveniently.
The lip image may be captured through a camera of the terminal while the user inputs the voice information. For example, it may be detected whether a voice-input start instruction is received; if the voice-input start instruction is received, the lip image of the user is captured through the camera of the terminal while the voice information input by the user is being collected through the microphone of the terminal. It may also be detected whether a voice-input end instruction is received; if the voice-input end instruction is received, capture of the lip image through the camera of the terminal is stopped at the same time as collection of the voice information through the microphone of the terminal is stopped.
Alternatively, a lip image captured in advance may be read. For example, the lip image may be captured while the voice information input by the user is collected in advance, and when speech recognition needs to be performed on the voice information, the lip image is read.
While the voice information input by the user is being collected through the microphone of the terminal and the lip image is being captured through the camera of the terminal, it may be judged whether the lip motion information matches the voice information; if the lip motion information does not match the voice information, the camera is controlled to stop capturing the lip image.
It may be detected whether the lip motion information is synchronized with the voice information; if the lip motion information is not synchronized with the voice information, the lip motion information does not match the voice information. For example, if it is determined from the voice information that the user starts speaking at the 1st second, but it is determined from the lip motion information that the user starts speaking at the 5th second, the lip motion information is not synchronized with the voice information, and therefore the lip motion information does not match the voice information.
Alternatively, it may be detected whether the text information corresponding to the lip motion information is consistent with the text information corresponding to the voice information; if the two are inconsistent, the lip motion information does not match the voice information. For example, if the text information corresponding to the lip motion information in a certain time period is "I have a meeting" while the text information corresponding to the voice information is "the weather is nice today", the text information corresponding to the lip motion information is inconsistent with the text information corresponding to the voice information, and therefore the lip motion information does not match the voice information.
The first recognition unit 203 is configured to recognize pause information in the voice information according to the lip image.
Pauses often occur while the user speaks. Therefore, the lip images include lip images during pauses, and the voice information includes voice information during pauses (pause information); the pause information contained in the voice information can be recognized from the lip images during pauses.
The user may pause while speaking when a word break or punctuation is needed. Therefore, the pause information may represent a word break and/or punctuation (in this case the pause information may be a mute signal), and the pause information may include word-break pause information and/or punctuation pause information.
Alternatively, the user may pause while speaking when the other party is speaking or when thinking. Therefore, the pause information may represent a period of silence; in this case the pause information is invalid voice input.
Alternatively, the user may pause while speaking when there is noise (for example, when the noise is excessive). Therefore, the pause information may represent noise (in this case the pause information may be a noise signal); in this case the pause information is invalid voice input.
When the pause information represents a word break and/or punctuation, word-break pause information and/or punctuation pause information in the voice information may be recognized according to the lip image.
It may be detected from the lip image whether the user's lips do not change, or change by no more than a preset amplitude, within a first preset time (for example, 0.1 second); if so, the voice information corresponding to the first preset time in the voice information is identified as word-break pause information.
It may be detected from the lip image whether the user's lips do not change, or change by no more than the preset amplitude, within a second preset time (for example, 0.5 second); if so, the voice information corresponding to the second preset time in the voice information is identified as punctuation pause information. The second preset time may be longer than the first preset time.
When the pause information represents a period of silence or noise, it may be detected from the lip image whether the user's lips do not change, or change by no more than the preset amplitude, within a third preset time (for example, 3 seconds); if so, the voice information corresponding to the third preset time in the voice information is identified as pause information. Alternatively, if the user's lips do not change, or change by no more than the preset amplitude, within the third preset time, and the amplitude of the voice signal corresponding to the third preset time in the voice information is greater than a preset threshold, the voice information corresponding to the third preset time in the voice information is identified as pause information. The third preset time may be longer than the second preset time.
The second recognition unit 204 is configured to perform speech recognition on the voice information according to the pause information.
If the pause information includes word-break pause information, speech recognition may be performed on the voice information according to the word-break pause information.
Alternatively, if the pause information includes punctuation pause information, speech recognition may be performed on the voice information according to the punctuation pause information.
Alternatively, if the pause information includes word-break pause information and punctuation pause information, speech recognition may be performed on the voice information according to the word-break pause information and the punctuation pause information.
The pause information may be inserted into the text information converted from the voice information according to the time mapping relationship (i.e., the corresponding time relationship) between the pause information and the voice information. For example, speech recognition may be performed on the voice information to obtain the text information corresponding to the voice information, and the pause information (word-break pause information and/or punctuation pause information) may then be inserted into the text information according to its time of occurrence in the voice information, so as to obtain text information that includes the pause information.
Alternatively, the pause information may be removed from the voice information, and speech recognition may be performed on the voice information from which the pause information has been removed. As described above, the pause information may represent noise or silence, i.e., invalid voice input; performing speech recognition on the voice information from which the pause information has been removed removes the noise or silence in the voice information.
Various speech recognition techniques, such as dynamic time warping (DTW), hidden Markov models (HMM), vector quantization (VQ) and artificial neural networks (ANN), may be used to perform speech recognition on the voice information or on the voice information from which the pause information has been removed.
The speech recognition apparatus 10 of Embodiment 2 obtains voice information input by a user, obtains a lip image of the user while the voice information is being input, recognizes pause information in the voice information according to the lip image, and performs speech recognition on the voice information according to the pause information. The speech recognition apparatus 10 of Embodiment 2 can thus perform speech recognition with the aid of lip images and improve the accuracy of speech recognition.
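To show how the four units of the apparatus fit together, here is a minimal Python sketch in which each unit is a method of one class. The helper objects it relies on (a microphone, a camera, a pause detector and a recognition engine) stand in for the processing described above and are assumed placeholders, not part of the disclosure.

```python
# Illustrative sketch of the apparatus in Fig. 2: one class whose methods mirror
# the first/second obtaining units and the first/second recognition units.
class SpeechRecognitionApparatus:
    def __init__(self, recognizer, pause_detector):
        self.recognizer = recognizer          # e.g. an HMM/ANN engine (assumed)
        self.pause_detector = pause_detector  # e.g. classify_pauses above (assumed)

    def first_obtaining_unit(self, microphone):
        return microphone.record()            # voice information input by the user

    def second_obtaining_unit(self, camera):
        return camera.capture_lip_frames()    # lip images captured during input

    def first_recognition_unit(self, lip_frames, audio):
        return self.pause_detector(lip_frames, audio)   # pause information

    def second_recognition_unit(self, audio, pauses):
        text = self.recognizer.transcribe(audio)        # convert voice to text
        return text, pauses                  # pauses can then be inserted or removed
```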
In another embodiment, the speech recognition apparatus 10 may further include:
a third recognition unit, configured to obtain a motion amplitude of the user's lips according to the lip image and to recognize the tone corresponding to the voice information according to the motion amplitude of the user's lips. The tone may include a declarative tone, an interrogative tone, an imperative tone, an exclamatory tone and the like. For example, if the motion amplitude of the user's lips is within a first preset amplitude range, it is determined that the tone corresponding to the voice information is an exclamatory tone; if the motion amplitude of the user's lips is within a second preset amplitude range, it is determined that the tone corresponding to the voice information is an imperative tone.
In another embodiment, the speech recognition apparatus 10 may further include:
a fourth recognition unit, configured to obtain lip characteristics of the user's pronunciation, determine user features according to the lip characteristics, and perform speech recognition on the voice information according to the user features and the pause information. The user features may include the user's gender, language type, dialect type and/or habitual expressions, and the like. For example, the language type (for example, Chinese) may be determined according to the lip characteristics of the user's pronunciation, and speech recognition may be performed on the voice information according to the language type and the pause information. Obtaining more auxiliary information (i.e., user features) before performing speech recognition on the voice information can further improve the accuracy of speech recognition.
Embodiment 3
Fig. 3 is a schematic diagram of the computer device provided by Embodiment 3 of the present invention. The computer device 1 includes a memory 20, a processor 30, and a computer program 40, such as a speech recognition program, which is stored in the memory 20 and can run on the processor 30. When executing the computer program 40, the processor 30 implements the steps in the above speech recognition method embodiment, for example steps 101 to 104 shown in Fig. 1. Alternatively, when executing the computer program 40, the processor 30 implements the functions of the modules/units in the above apparatus embodiment, for example units 201 to 204.
Exemplarily, the computer program 40 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 20 and executed by the processor 30 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 40 in the computer device 1. For example, the computer program 40 may be divided into the first obtaining unit 201, the second obtaining unit 202, the first recognition unit 203 and the second recognition unit 204 in Fig. 2; for the specific function of each unit, see Embodiment 2.
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. Those skilled in the art will understand that the schematic diagram in Fig. 3 is only an example of the computer device 1 and does not constitute a limitation on the computer device 1; it may include more or fewer components than illustrated, combine certain components, or have different components. For example, the computer device 1 may also include input/output devices, network access devices, buses and the like.
The processor 30 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control center of the computer device 1 and connects the various parts of the whole computer device 1 by means of various interfaces and lines.
The memory 20 may be used to store the computer program 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer program and/or the modules/units stored in the memory 20 and by calling the data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device 1 (such as audio data or a phone book). In addition, the memory 20 may include a high-speed random access memory and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device or another non-volatile solid-state storage device.
If the integrated modules/units of the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention implements all or part of the flow of the above method embodiments, which may also be completed by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of each of the above method embodiments can be implemented. The computer program includes computer program code, and the computer program code may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiment described above is only schematic; for example, the division of the units is only a division by logical function, and there may be other division modes in actual implementation.
In addition, the functional units in the embodiments of the present invention may be integrated in the same processing unit, or each unit may exist alone physically, or two or more units may be integrated in the same unit. The above integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and that the present invention can be realized in other specific forms without departing from the spirit or essential attributes of the present invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalency of the claims be embraced by the present invention. Any reference signs in the claims should not be regarded as limiting the claims involved. In addition, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or computer devices stated in the computer device claims may also be implemented by the same unit or computer device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not restrictive. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A speech recognition method, characterized in that the method comprises:
obtaining voice information input by a user;
obtaining a lip image of the user while the user inputs the voice information;
recognizing pause information in the voice information according to the lip image;
performing speech recognition on the voice information according to the pause information.
2. The speech recognition method according to claim 1, characterized in that performing speech recognition on the voice information according to the pause information comprises:
inserting the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
removing the pause information from the voice information, and performing speech recognition on the voice information from which the pause information has been removed.
3. The speech recognition method according to claim 1, characterized in that recognizing pause information in the voice information according to the lip image comprises:
recognizing word-break pause information and/or punctuation pause information in the voice information according to the lip image;
and performing speech recognition on the voice information according to the pause information comprises:
performing speech recognition on the voice information according to the word-break pause information and/or the punctuation pause information.
4. The speech recognition method according to claim 1, characterized in that obtaining the voice information input by the user and obtaining the lip image of the user while the user inputs the voice information comprises:
when the user inputs the voice information, collecting the voice information through a microphone of a terminal and capturing the lip image through a camera of the terminal.
5. The speech recognition method according to claim 4, characterized in that the method further comprises:
judging whether lip motion information matches the voice information;
if the lip motion information does not match the voice information, controlling the camera to stop capturing the lip image.
6. The speech recognition method according to any one of claims 1-5, characterized in that the method further comprises:
obtaining a motion amplitude of the user's lips according to the lip image, and recognizing a tone corresponding to the voice information according to the motion amplitude of the user's lips; or
obtaining lip characteristics of the user's pronunciation, determining user features according to the lip characteristics, and performing speech recognition on the voice information according to the user features and the pause information.
7. A speech recognition apparatus, characterized in that the apparatus comprises:
a first obtaining unit, configured to obtain voice information input by a user;
a second obtaining unit, configured to obtain a lip image of the user while the user inputs the voice information;
a first recognition unit, configured to recognize pause information in the voice information according to the lip image;
a second recognition unit, configured to perform speech recognition on the voice information according to the pause information.
8. The speech recognition apparatus according to claim 7, characterized in that the second recognition unit is specifically configured to:
insert the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
remove the pause information from the voice information, and perform speech recognition on the voice information from which the pause information has been removed.
9. A computer device, characterized in that the computer device comprises a processor, and the processor is configured to implement the steps of the speech recognition method according to any one of claims 1-6 when executing a computer program stored in a memory.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710648985.2A CN107293300A (en) | 2017-08-01 | 2017-08-01 | Speech recognition method and apparatus, computer device and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710648985.2A CN107293300A (en) | 2017-08-01 | 2017-08-01 | Speech recognition method and apparatus, computer device and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107293300A true CN107293300A (en) | 2017-10-24 |
Family
ID=60104131
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710648985.2A Withdrawn CN107293300A (en) | 2017-08-01 | 2017-08-01 | Audio recognition method and device, computer installation and readable storage medium storing program for executing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107293300A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070136071A1 (en) * | 2005-12-08 | 2007-06-14 | Lee Soo J | Apparatus and method for speech segment detection and system for speech recognition |
CN103745723A (en) * | 2014-01-13 | 2014-04-23 | 苏州思必驰信息科技有限公司 | Method and device for identifying audio signal |
CN105022470A (en) * | 2014-04-17 | 2015-11-04 | 中兴通讯股份有限公司 | Method and device of terminal operation based on lip reading |
CN105389097A (en) * | 2014-09-03 | 2016-03-09 | 中兴通讯股份有限公司 | Man-machine interaction device and method |
CN104409075A (en) * | 2014-11-28 | 2015-03-11 | 深圳创维-Rgb电子有限公司 | Voice identification method and system |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726536A (en) * | 2017-10-31 | 2019-05-07 | 百度(美国)有限责任公司 | Method for authenticating, electronic equipment and computer-readable program medium |
CN107799125A (en) * | 2017-11-09 | 2018-03-13 | 维沃移动通信有限公司 | A kind of audio recognition method, mobile terminal and computer-readable recording medium |
WO2019134463A1 (en) * | 2018-01-02 | 2019-07-11 | Boe Technology Group Co., Ltd. | Lip language recognition method and mobile terminal |
US11495231B2 (en) | 2018-01-02 | 2022-11-08 | Beijing Boe Technology Development Co., Ltd. | Lip language recognition method and mobile terminal using sound and silent modes |
CN108389573A (en) * | 2018-02-09 | 2018-08-10 | 北京易真学思教育科技有限公司 | Language Identification and device, training method and device, medium, terminal |
CN108389573B (en) * | 2018-02-09 | 2022-03-08 | 北京世纪好未来教育科技有限公司 | Language identification method and device, training method and device, medium and terminal |
CN108847237A (en) * | 2018-07-27 | 2018-11-20 | 重庆柚瓣家科技有限公司 | continuous speech recognition method and system |
CN109599130B (en) * | 2018-12-10 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Sound reception method, device and storage medium |
CN109599130A (en) * | 2018-12-10 | 2019-04-09 | 百度在线网络技术(北京)有限公司 | Reception method, device and storage medium |
CN109697976A (en) * | 2018-12-14 | 2019-04-30 | 北京葡萄智学科技有限公司 | A kind of pronunciation recognition methods and device |
CN111768799A (en) * | 2019-03-14 | 2020-10-13 | 富泰华工业(深圳)有限公司 | Voice recognition method, voice recognition apparatus, computer apparatus, and storage medium |
CN110534109A (en) * | 2019-09-25 | 2019-12-03 | 深圳追一科技有限公司 | Audio recognition method, device, electronic equipment and storage medium |
CN110827823A (en) * | 2019-11-13 | 2020-02-21 | 联想(北京)有限公司 | Voice auxiliary recognition method and device, storage medium and electronic equipment |
CN110992958A (en) * | 2019-11-19 | 2020-04-10 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN110992958B (en) * | 2019-11-19 | 2021-06-22 | 深圳追一科技有限公司 | Content recording method, content recording apparatus, electronic device, and storage medium |
CN111091824A (en) * | 2019-11-30 | 2020-05-01 | 华为技术有限公司 | Voice matching method and related equipment |
CN111091824B (en) * | 2019-11-30 | 2022-10-04 | 华为技术有限公司 | Voice matching method and related equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107293300A (en) | Speech recognition method and apparatus, computer device and readable storage medium | |
US11475897B2 (en) | Method and apparatus for response using voice matching user category | |
CN108922525B (en) | Voice processing method, device, storage medium and electronic equipment | |
WO2017084197A1 (en) | Smart home control method and system based on emotion recognition | |
CN111368609A (en) | Voice interaction method based on emotion engine technology, intelligent terminal and storage medium | |
WO2019000991A1 (en) | Voice print recognition method and apparatus | |
CN109346076A (en) | Interactive voice, method of speech processing, device and system | |
WO2020253128A1 (en) | Voice recognition-based communication service method, apparatus, computer device, and storage medium | |
CN104252226B (en) | The method and electronic equipment of a kind of information processing | |
CN107506166A (en) | Information cuing method and device, computer installation and readable storage medium storing program for executing | |
CN110265011B (en) | Electronic equipment interaction method and electronic equipment | |
CN102404278A (en) | Song request system based on voiceprint recognition and application method thereof | |
CN107704612A (en) | Dialogue exchange method and system for intelligent robot | |
CN107452382A (en) | Voice operating method and device, computer installation and computer-readable recording medium | |
CN107393529A (en) | Audio recognition method, device, terminal and computer-readable recording medium | |
CN111312222A (en) | Awakening and voice recognition model training method and device | |
JP2021108142A (en) | Information processing system, information processing method, and information processing program | |
CN108052250A (en) | Virtual idol deductive data processing method and system based on multi-modal interaction | |
CN110706707B (en) | Method, apparatus, device and computer-readable storage medium for voice interaction | |
CN109994106A (en) | A kind of method of speech processing and equipment | |
JP6915637B2 (en) | Information processing equipment, information processing methods, and programs | |
CN108345612A (en) | A kind of question processing method and device, a kind of device for issue handling | |
CN107291704A (en) | Treating method and apparatus, the device for processing | |
US20210082405A1 (en) | Method for Location Reminder and Electronic Device | |
CN107463684A (en) | Voice replying method and device, computer installation and computer-readable recording medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20171024 |