CN107293300A - Speech recognition method and device, computer device and readable storage medium - Google Patents

Info

Publication number
CN107293300A
Authority
CN
China
Prior art keywords
voice information
lip
pause information
user
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710648985.2A
Other languages
Chinese (zh)
Inventor
关超雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meizu Technology Co Ltd
Original Assignee
Meizu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meizu Technology Co Ltd
Priority to CN201710648985.2A
Publication of CN107293300A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a speech recognition method comprising: obtaining voice information input by a user; obtaining lip images of the user while the user inputs the voice information; identifying pause information in the voice information according to the lip images; and performing speech recognition on the voice information according to the pause information. The invention also provides a speech recognition device, a computer device, and a computer-readable storage medium. By using lip images in speech recognition, the invention can improve recognition accuracy.

Description

Speech recognition method and device, computer device and readable storage medium
Technical field
The present invention relates to the field of intelligent speech technology, and in particular to a speech recognition method and device, a computer device, and a readable storage medium.
Background
With the development of electronics and communication technology, terminals such as mobile phones and tablet computers are widely used, and the modes of human-computer interaction are increasingly diverse. Voice input, as one of the most convenient and natural modes of human-computer interaction, is accepted by more and more users. However, current speech recognition accuracy is not high, which results in a poor user experience.
Summary of the invention
In view of the foregoing, it is necessary to provide a speech recognition method and device, a computer device, and a readable storage medium that use lip images in speech recognition and thereby improve recognition accuracy.
A first aspect of the application provides a speech recognition method, the method comprising:
Obtaining voice information input by a user;
Obtaining lip images of the user while the user inputs the voice information;
Identifying pause information in the voice information according to the lip images;
Performing speech recognition on the voice information according to the pause information.
In another possible implementation, performing speech recognition on the voice information according to the pause information includes:
According to the time mapping relationship between the pause information and the voice information, inserting the pause information into the text information converted from the voice information; or
Removing the pause information from the voice information, and performing speech recognition on the voice information from which the pause information has been removed.
In another possible implementation, identifying the pause information in the voice information according to the lip images includes:
Identifying word-break pause information and/or sentence-break pause information in the voice information according to the lip images;
Performing speech recognition on the voice information according to the pause information includes:
Performing speech recognition on the voice information according to the word-break pause information and/or the sentence-break pause information.
In another possible implementation, obtaining the voice information input by the user and obtaining the lip images of the user while the user inputs the voice information includes:
When the user inputs the voice information, collecting the voice information through the microphone of a terminal and capturing the lip images through the camera of the terminal.
In another possible implementation, the method also includes:
Judging whether the lip motion information matches the voice information;
If the lip motion information does not match the voice information, controlling the camera to stop capturing the lip images.
In another possible implementation, the method also includes:
Obtaining the motion amplitude of the user's lips from the lip images, and identifying the tone corresponding to the voice information according to that motion amplitude; or
Obtaining the lip characteristics of the user's pronunciation, determining user characteristics according to the lip characteristics, and performing speech recognition on the voice information according to the user characteristics and the pause information.
A second aspect of the application provides a speech recognition device, the device comprising:
A first acquisition unit, for obtaining voice information input by a user;
A second acquisition unit, for obtaining lip images of the user while the user inputs the voice information;
A first recognition unit, for identifying pause information in the voice information according to the lip images;
A second recognition unit, for performing speech recognition on the voice information according to the pause information.
In another possible implementation, the second recognition unit is specifically configured to:
According to the time mapping relationship between the pause information and the voice information, insert the pause information into the text information converted from the voice information; or
Remove the pause information from the voice information, and perform speech recognition on the voice information from which the pause information has been removed.
A third aspect of the application provides a computer device comprising a processor, the processor being configured to implement the steps of the speech recognition method when executing a computer program stored in a memory.
A fourth aspect of the application provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the speech recognition method when executed by a processor.
The present invention obtains voice information input by a user; obtains lip images of the user while the user inputs the voice information; identifies pause information in the voice information according to the lip images; and performs speech recognition on the voice information according to the pause information. By using lip images in speech recognition, the invention can improve recognition accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment One of the present invention;
Fig. 2 is a structural diagram of the speech recognition device provided by Embodiment Two of the present invention;
Fig. 3 is a schematic diagram of the computer device provided by Embodiment Three of the present invention.
Description of main reference numerals
Computer device 1
Speech recognition device 10
Memory 20
Processor 30
Computer program 40
First acquisition unit 201
Second acquisition unit 202
First recognition unit 203
Second recognition unit 204
The following detailed description further illustrates the present invention with reference to the above drawings.
Detailed description
In order that the above objects, features, and advantages of the present invention may be understood more clearly, the present invention is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the invention. The terms used in this description are intended only to describe specific embodiments and are not intended to limit the invention.
Preferably, the speech recognition method of the invention is applied in one or more terminals. A terminal is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The terminal may be, but is not limited to, any electronic product capable of human-computer interaction with a user by means of a keyboard, mouse, remote control, touch pad, voice-control device, or the like, for example a tablet computer, a smartphone, a personal digital assistant (Personal Digital Assistant, PDA), or a smart wearable device.
Embodiment One
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment One of the present invention. As shown in Fig. 1, the method includes the following steps:
101: Obtain the voice information input by the user.
The voice information is speech data obtained from the user's natural speech. For example, the voice information is the voice signal obtained when a microphone converts the user's natural speech into an electrical signal.
The voice information can be collected by the microphone of the terminal while the user inputs it. For example, the terminal can detect whether a voice input start instruction has been received (for example, whether the Home key of the terminal is long-pressed) and, if so, start collecting the voice information input by the user through its microphone. It can also detect whether a voice input end instruction has been received (for example, whether the Home key of the terminal is released) and, if so, stop collecting the voice information input by the user through the microphone.
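The start/end behaviour described above can be sketched as a small state holder. This is only an illustration: the class name `CaptureSession` and its methods are hypothetical, and the audio chunks stand in for whatever the terminal's microphone driver would deliver.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaptureSession:
    """Keeps audio chunks only between a start and an end instruction.

    Hypothetical helper: the source only says that collection starts on a
    voice input start instruction and stops on an end instruction.
    """
    recording: bool = False
    chunks: List[bytes] = field(default_factory=list)

    def on_home_key(self, pressed: bool) -> None:
        # Long press is treated as the start instruction, release as the end
        # instruction, following the Home key example in the text.
        self.recording = pressed

    def on_audio_chunk(self, chunk: bytes) -> None:
        # Audio arriving outside the start/end window is discarded.
        if self.recording:
            self.chunks.append(chunk)


session = CaptureSession()
session.on_audio_chunk(b"ignored")   # before the start instruction
session.on_home_key(True)
session.on_audio_chunk(b"speech")    # collected
session.on_home_key(False)
session.on_audio_chunk(b"ignored")   # after the end instruction
```

The same window could gate the camera, so that lip images and audio cover the same interval.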
Alternatively, voice information collected in advance can be read. For example, the voice information input by the user can be collected in advance and read when speech recognition needs to be performed on it.
102: Obtain lip images of the user while the user inputs the voice information.
The lip images, also called lip-motion video or lip-reading images, are images of the changes in a speaker's lip movement while the person speaks. The lip images over a period of time may form an image sequence or a video.
A facial image of the user while inputting the voice information can be obtained, and the lip position determined from the facial image, so as to obtain the lip images.
The camera can also be aimed directly at the user's lips to obtain the lip images. For example, the camera can be built into the microphone (for example, in a headset), or the microphone can be built into the camera, so that during use the camera points directly at the user's lips and the lip images are easy to obtain.
The lip images can be captured by the camera of the terminal while the user inputs the voice information. For example, the terminal can detect whether a voice input start instruction has been received and, if so, capture the lip images of the user through its camera while collecting the voice information input by the user through its microphone. It can also detect whether a voice input end instruction has been received and, if so, stop capturing the lip images through the camera at the same time as it stops collecting the voice information through the microphone.
Alternatively, lip images captured in advance can be read. For example, the lip images can be captured while the voice information input by the user is collected in advance, and read when speech recognition needs to be performed on the voice information.
While the voice information is collected through the microphone of the terminal and the lip images are captured through the camera of the terminal, it can be judged whether the lip motion information matches the voice information; if the lip motion information does not match the voice information, the camera is controlled to stop capturing the lip images.
One option is to detect whether the lip motion information is synchronized with the voice information; if the two are not synchronized, the lip motion information does not match the voice information. For example, if the voice information indicates that the user started speaking at the 1st second while the lip motion information indicates that the user started speaking at the 5th second, the two are not synchronized, and therefore the lip motion information and the voice information do not match.
Alternatively, it can be detected whether the text corresponding to the lip motion information is consistent with the text corresponding to the voice information; if the two texts are inconsistent, the lip motion information does not match the voice information. For example, if within a certain time period the text corresponding to the lip motion information is "I am in a meeting" while the text corresponding to the voice information is "The weather is nice today", the two texts are inconsistent, and therefore the lip motion information and the voice information do not match.
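The two matching checks above can be illustrated with a short sketch. The function names, the 0.5-second tolerance, and the word-overlap measure are all assumptions made for illustration; the source does not say how synchronization or text consistency is measured, and the whitespace tokenization used here would need replacing for Chinese text.

```python
from typing import Optional


def is_synchronized(audio_onset_s: Optional[float],
                    lip_onset_s: Optional[float],
                    max_offset_s: float = 0.5) -> bool:
    """Compare the moment speech starts in the audio and in the lip images.

    max_offset_s is a hypothetical tolerance, not a value from the source.
    """
    if audio_onset_s is None or lip_onset_s is None:
        return False
    return abs(audio_onset_s - lip_onset_s) <= max_offset_s


def texts_consistent(lip_text: str, speech_text: str,
                     min_overlap: float = 0.5) -> bool:
    """Compare lip-read text and recognized text by word overlap (Jaccard)."""
    lip_words = set(lip_text.lower().split())
    speech_words = set(speech_text.lower().split())
    if not lip_words or not speech_words:
        return False
    overlap = len(lip_words & speech_words) / len(lip_words | speech_words)
    return overlap >= min_overlap


def should_stop_camera(audio_onset_s: Optional[float],
                       lip_onset_s: Optional[float]) -> bool:
    # A mismatch means the camera is probably not filming the speaker.
    return not is_synchronized(audio_onset_s, lip_onset_s)
```

With the onsets from the example in the text (1st second versus 5th second), `should_stop_camera(1.0, 5.0)` evaluates to true under the assumed tolerance.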
103: Identify pause information in the voice information according to the lip images.
Users often pause while speaking. The lip images therefore include lip images captured during pauses, and the voice information includes the voice information recorded during pauses (that is, the pause information); the pause information contained in the voice information can be identified from the lip images captured during pauses.
A user may pause while speaking where a word break or a sentence break is needed. The pause information can therefore indicate a word break and/or a sentence break (in which case the pause information may be a silent signal), and can include word-break pause information and/or sentence-break pause information.
Alternatively, a user may pause while speaking when the other party speaks or when thinking. The pause information can therefore represent a period of silence, in which case it is invalid voice input.
Alternatively, a user may pause while speaking when noise occurs (for example, when the noise is too loud). The pause information can therefore represent noise (in which case the pause information may be a noise signal), and it is then invalid voice input.
When the pause information indicates a word break and/or a sentence break, the word-break pause information and/or sentence-break pause information in the voice information can be identified according to the lip images.
It can be detected from the lip images whether, within a first preset time (for example, 0.1 second), the user's lips do not change or change by an amplitude less than or equal to a preset amplitude. If so, the voice information corresponding to the first preset time is identified as word-break pause information.
It can be detected from the lip images whether, within a second preset time (for example, 0.5 second), the user's lips do not change or change by an amplitude less than or equal to the preset amplitude. If so, the voice information corresponding to the second preset time is identified as sentence-break pause information. The second preset time can be greater than the first preset time.
When the pause information represents a period of silence or noise, it can be detected from the lip images whether, within a third preset time (for example, 3 seconds), the user's lips do not change or change by an amplitude less than or equal to the preset amplitude. If so, the voice information corresponding to the third preset time is identified as pause information. Alternatively, the voice information corresponding to the third preset time is identified as pause information only if, in addition, the amplitude of the voice signal corresponding to the third preset time is greater than a preset threshold. The third preset time can be greater than the second preset time.
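The three preset times can be read as thresholds on how long the lips stay still. The sketch below follows that reading, which is an interpretation rather than something the source states outright: it takes a per-frame lip-motion amplitude (how that amplitude is computed from the images is left open), finds runs of still frames, and labels each run by its duration. The function names and labels are hypothetical; the 0.1, 0.5, and 3 second values are the examples given in the text.

```python
from typing import List, Optional, Sequence, Tuple

WORD_BREAK_S = 0.1      # first preset time (example value from the text)
SENTENCE_BREAK_S = 0.5  # second preset time
SILENCE_S = 3.0         # third preset time


def classify_pause(duration_s: float) -> Optional[str]:
    """Label a still-lip interval by its length; longest threshold wins."""
    if duration_s >= SILENCE_S:
        return "silence_or_noise"
    if duration_s >= SENTENCE_BREAK_S:
        return "sentence_break"
    if duration_s >= WORD_BREAK_S:
        return "word_break"
    return None


def find_pauses(amplitudes: Sequence[float], frame_dt: float,
                still_threshold: float) -> List[Tuple[float, float, str]]:
    """Return (start_s, end_s, label) for each run of still lip frames.

    amplitudes[i] is the lip-motion amplitude of frame i; a frame counts as
    still when its amplitude is <= still_threshold (the preset amplitude).
    """
    pauses = []
    run_start = None
    # A trailing sentinel with infinite amplitude closes a run at the end.
    for i, amp in enumerate(list(amplitudes) + [float("inf")]):
        if amp <= still_threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            label = classify_pause((i - run_start) * frame_dt)
            if label is not None:
                pauses.append((run_start * frame_dt, i * frame_dt, label))
            run_start = None
    return pauses


# 4 frames per second: 1 still frame, then 2, then 12 (0.25 s, 0.5 s, 3 s).
example = [1, 0, 1, 0, 0, 1] + [0] * 12 + [1]
pauses = find_pauses(example, frame_dt=0.25, still_threshold=0.1)
```

Because the returned intervals are in seconds, they can be mapped directly onto the audio timeline of the voice information.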
104: Perform speech recognition on the voice information according to the pause information.
If the pause information includes word-break pause information, speech recognition can be performed on the voice information according to the word-break pause information.
Alternatively, if the pause information includes sentence-break pause information, speech recognition can be performed on the voice information according to the sentence-break pause information.
Alternatively, if the pause information includes both word-break pause information and sentence-break pause information, speech recognition can be performed on the voice information according to both.
The pause information can be inserted into the text information converted from the voice information according to the time mapping relationship (that is, the corresponding time relationship) between the pause information and the voice information. For example, speech recognition can be performed on the voice information to obtain the corresponding text information, and the pause information (word-break pause information and/or sentence-break pause information) can then be inserted into the text information according to the time at which it occurs in the voice information, yielding text information that includes the pause information.
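As a sketch of the time-mapping insertion described above, assume the recognizer returns each word with start and end times (an assumption; the source does not specify the recognizer's output format). Pauses and words are then merged on a common timeline. The marker characters in `MARKS` are hypothetical placeholders.

```python
from typing import List, Tuple

# Hypothetical markers: "/" for a word break, "." for a sentence break.
MARKS = {"word_break": "/", "sentence_break": "."}

Word = Tuple[str, float, float]    # (text, start_s, end_s)
Pause = Tuple[float, float, str]   # (start_s, end_s, label)


def insert_pauses(words: List[Word], pauses: List[Pause]) -> str:
    """Merge recognized words and pause markers in time order."""
    events = [(start, 0, text) for text, start, _end in words]
    # Pauses with labels outside MARKS (such as silence) add no marker.
    events += [(start, 1, MARKS[label])
               for start, _end, label in pauses if label in MARKS]
    # Sort by time; at equal times a word precedes a pause marker.
    events.sort()
    return " ".join(token for _t, _kind, token in events)


words = [("today", 0.0, 0.4), ("is", 0.5, 0.6), ("sunny", 0.6, 1.0)]
pauses = [(0.4, 0.5, "word_break"), (1.0, 1.6, "sentence_break")]
text = insert_pauses(words, pauses)
```

In a real system the sentence-break marker would more likely become punctuation attached to the preceding word; the separate token is kept here only to make the merge visible.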
Alternatively, the pause information can be removed from the voice information, and speech recognition performed on the voice information from which the pause information has been removed. As noted above, the pause information can represent noise or silence, that is, invalid voice input; performing speech recognition on the voice information with the pause information removed therefore removes the noise or silence from the voice information.
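Removing the pause intervals before recognition amounts to cutting sample ranges out of the audio. A minimal sketch, assuming the audio is a flat sequence of samples and the pauses are given as time intervals in seconds (the function name `remove_pauses` is hypothetical):

```python
from typing import List, Sequence, Tuple


def remove_pauses(samples: Sequence[float], sample_rate: int,
                  pauses: List[Tuple[float, float]]) -> List[float]:
    """Return the samples with every (start_s, end_s) pause interval cut out."""
    kept: List[float] = []
    cursor = 0  # index of the first sample not yet kept or skipped
    for start_s, end_s in sorted(pauses):
        start = max(cursor, int(round(start_s * sample_rate)))
        end = min(len(samples), int(round(end_s * sample_rate)))
        if end <= start:
            continue  # interval is empty or already covered by an earlier pause
        kept.extend(samples[cursor:start])
        cursor = end
    kept.extend(samples[cursor:])
    return kept


# Ten samples at 1 sample per second, with pauses at 2-4 s and 7-8 s.
cleaned = remove_pauses(list(range(10)), sample_rate=1,
                        pauses=[(2, 4), (7, 8)])
```

The cleaned sample list would then be passed to the recognizer in place of the original audio.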
Various speech recognition technologies, such as dynamic time warping (Dynamic Time Warping, DTW), hidden Markov models (Hidden Markov Model, HMM), vector quantization (Vector Quantization, VQ), and artificial neural networks (Artificial Neural Network, ANN), can be used to perform speech recognition on the voice information or on the voice information from which the pause information has been removed.
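Of the techniques listed, dynamic time warping is the simplest to show in a few lines. The sketch below is the textbook DTW distance between two one-dimensional feature sequences, not the patent's own recognizer; a template-matching recognizer would compare an utterance's features against stored templates and pick the closest one.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0, because the repeated 2 is absorbed by the warping; this tolerance to differences in speaking rate is why DTW was used for template-based recognition.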
The speech recognition method of Embodiment One obtains voice information input by a user; obtains lip images of the user while the user inputs the voice information; identifies pause information in the voice information according to the lip images; and performs speech recognition on the voice information according to the pause information. By using lip images in speech recognition, the method of Embodiment One can improve recognition accuracy.
In another embodiment, the method can also include: obtaining the motion amplitude of the user's lips from the lip images, and identifying the tone corresponding to the voice information according to that motion amplitude. The tone can include a declarative tone, an interrogative tone, an imperative tone, an exclamatory tone, and so on. For example, if the motion amplitude of the user's lips falls within a first preset amplitude range, the tone corresponding to the voice information is determined to be exclamatory; if it falls within a second preset amplitude range, the tone is determined to be imperative.
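The range lookup in this example can be sketched as a small table. The numeric ranges below are invented for illustration, since the source gives no values for the first and second preset amplitude ranges, and the amplitude is assumed to be normalized to the interval from 0 to 1.

```python
from typing import Optional

# Hypothetical normalized amplitude ranges; the source names the ranges
# but does not give their bounds.
TONE_RANGES = [
    ((0.8, 1.0), "exclamatory"),  # first preset amplitude range
    ((0.5, 0.8), "imperative"),   # second preset amplitude range
]


def tone_from_amplitude(amplitude: float) -> Optional[str]:
    """Return the tone whose range contains the amplitude, if any."""
    for (low, high), tone in TONE_RANGES:
        if low <= amplitude <= high:
            return tone
    return None  # no preset range matched; tone left undetermined
```

Ranges are checked in order, so a boundary value such as 0.8 resolves to the first matching entry.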
In another embodiment, the method can also include: obtaining the lip characteristics of the user's pronunciation; determining user characteristics according to the lip characteristics; and performing speech recognition on the voice information according to the user characteristics and the pause information. The user characteristics can include the user's gender, language type, dialect type, and/or habitual catchphrases. For example, the language type (such as Chinese) can be determined from the lip characteristics of the user's pronunciation, and speech recognition can then be performed on the voice information according to the language type and the pause information. Obtaining more auxiliary information (that is, the user characteristics) before performing speech recognition on the voice information can further improve recognition accuracy.
Embodiment Two
Fig. 2 is a structural diagram of the speech recognition device provided by Embodiment Two of the present invention. As shown in Fig. 2, the speech recognition device 10 can include: a first acquisition unit 201, a second acquisition unit 202, a first recognition unit 203, and a second recognition unit 204.
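The four-unit structure can be pictured as a pipeline in which each unit is a pluggable function. This is only a structural sketch: the class name, the field names, and the `run` method are hypothetical, and the units are represented by plain callables rather than real microphone, camera, or recognizer components.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpeechRecognitionDevice:
    """Wires the four units of Fig. 2 together (illustrative only)."""
    acquire_voice: Callable[[], Any]              # first acquisition unit 201
    acquire_lip_images: Callable[[], Any]         # second acquisition unit 202
    recognize_pauses: Callable[[Any, Any], Any]   # first recognition unit 203
    recognize_speech: Callable[[Any, Any], str]   # second recognition unit 204

    def run(self) -> str:
        voice = self.acquire_voice()
        lip_images = self.acquire_lip_images()
        pauses = self.recognize_pauses(lip_images, voice)
        return self.recognize_speech(voice, pauses)


# Stub units standing in for real hardware and models.
device = SpeechRecognitionDevice(
    acquire_voice=lambda: "audio",
    acquire_lip_images=lambda: "frames",
    recognize_pauses=lambda lips, voice: ["pause"],
    recognize_speech=lambda voice, pauses: f"{voice} with {len(pauses)} pause(s)",
)
result = device.run()
```

Keeping the units as separate callables mirrors the device claims, where each unit corresponds to one step of the method in Embodiment One.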
The first acquisition unit 201 is configured to obtain the voice information input by the user.
The voice information is speech data obtained from the user's natural speech. For example, the voice information is the voice signal obtained when a microphone converts the user's natural speech into an electrical signal.
The voice information can be collected by the microphone of the terminal while the user inputs it. For example, the terminal can detect whether a voice input start instruction has been received (for example, whether the Home key of the terminal is long-pressed) and, if so, start collecting the voice information input by the user through its microphone. It can also detect whether a voice input end instruction has been received (for example, whether the Home key of the terminal is released) and, if so, stop collecting the voice information input by the user through the microphone.
Alternatively, voice information collected in advance can be read. For example, the voice information input by the user can be collected in advance and read when speech recognition needs to be performed on it.
The second acquisition unit 202 is configured to obtain lip images of the user while the user inputs the voice information.
The lip images, also called lip-motion video or lip-reading images, are images of the changes in a speaker's lip movement while the person speaks. The lip images over a period of time may form an image sequence or a video.
A facial image of the user while inputting the voice information can be obtained, and the lip position determined from the facial image, so as to obtain the lip images.
The camera can also be aimed directly at the user's lips to obtain the lip images. For example, the camera can be built into the microphone (for example, in a headset), or the microphone can be built into the camera, so that during use the camera points directly at the user's lips and the lip images are easy to obtain.
The lip images can be captured by the camera of the terminal while the user inputs the voice information. For example, the terminal can detect whether a voice input start instruction has been received and, if so, capture the lip images of the user through its camera while collecting the voice information input by the user through its microphone. It can also detect whether a voice input end instruction has been received and, if so, stop capturing the lip images through the camera at the same time as it stops collecting the voice information through the microphone.
Alternatively, lip images captured in advance can be read. For example, the lip images can be captured while the voice information input by the user is collected in advance, and read when speech recognition needs to be performed on the voice information.
While the voice information is collected through the microphone of the terminal and the lip images are captured through the camera of the terminal, it can be judged whether the lip motion information matches the voice information; if the lip motion information does not match the voice information, the camera is controlled to stop capturing the lip images.
One option is to detect whether the lip motion information is synchronized with the voice information; if the two are not synchronized, the lip motion information does not match the voice information. For example, if the voice information indicates that the user started speaking at the 1st second while the lip motion information indicates that the user started speaking at the 5th second, the two are not synchronized, and therefore the lip motion information and the voice information do not match.
Alternatively, it can be detected whether the text corresponding to the lip motion information is consistent with the text corresponding to the voice information; if the two texts are inconsistent, the lip motion information does not match the voice information. For example, if within a certain time period the text corresponding to the lip motion information is "I am in a meeting" while the text corresponding to the voice information is "The weather is nice today", the two texts are inconsistent, and therefore the lip motion information and the voice information do not match.
First recognition unit 203, for the pause information in the voice messaging according to the lip image recognition.
Pause often occurs when speaking by user, therefore, and the lip image includes lip image when pausing, described Voice messaging includes the voice messaging (information of pausing) when pausing, and the voice can be recognized according to lip image when pausing The pause information that packet contains.
User can be paused during speaking when needing disconnected word or punctuate, and therefore, the pause information can be with table Show disconnected word and/or punctuate (now pause information can be mute signal), the pause information can include disconnected word pause information And/or punctuate pause information.
Or, user can be paused during speaking when other side speaks or thinks deeply, therefore, and the pause information can With represent one section it is Jing Yin.Now the pause information is invalid phonetic entry.
Or, user can be paused during speaking when there is noise (such as when noise is excessive), therefore, described Pause information can represent noise (now pause information can be noise signal).Now the pause information is invalid voice Input.
When the pause information represents disconnected word and/or punctuate, it can believe voice according to the lip image recognition Disconnected word pause information and/or punctuate pause information in breath.
It may be detected, according to the lip images, whether within a first preset time (for example, 0.1 second) the user's lips do not change or change by an amplitude less than or equal to a preset amplitude. If so, the voice information corresponding to the first preset time in the voice information is recognized as word-break pause information.
It may be detected, according to the lip images, whether within a second preset time (for example, 0.5 second) the user's lips do not change or change by an amplitude less than or equal to the preset amplitude. If so, the voice information corresponding to the second preset time in the voice information is recognized as sentence-break pause information. The second preset time may be greater than the first preset time.
When the pause information represents a period of silence or noise, it may be detected, according to the lip images, whether within a third preset time (for example, 3 seconds) the user's lips do not change or change by an amplitude less than or equal to the preset amplitude. If so, the voice information corresponding to the third preset time in the voice information is recognized as pause information. Alternatively, if the lip images show that within the third preset time the user's lips do not change or change by an amplitude less than or equal to the preset amplitude, and the amplitude of the voice signal corresponding to the third preset time in the voice information is greater than a preset threshold, the voice information corresponding to the third preset time is recognized as pause information. The third preset time may be greater than the second preset time.
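The three duration thresholds above can be illustrated with a small sketch. It assumes that a per-frame lip motion amplitude has already been extracted from the lip images (the extraction itself is not shown), and it classifies each maximal run of still frames by its length. The names `Pause` and `classify_pauses`, the amplitude scale, and the default `max_amplitude` are hypothetical; only the example durations of 0.1 s, 0.5 s and 3 s come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Pause:
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    kind: str     # "word_break", "sentence_break" or "silence_or_noise"


def classify_pauses(amplitudes: Sequence[float], fps: float,
                    max_amplitude: float = 0.05,
                    t_word: float = 0.1, t_sentence: float = 0.5,
                    t_invalid: float = 3.0) -> List[Pause]:
    """Find runs of frames where the lips are (almost) still and label them."""
    pauses: List[Pause] = []
    run_start: Optional[int] = None
    # A sentinel frame with infinite amplitude closes a run at the very end.
    for i, amp in enumerate(list(amplitudes) + [float("inf")]):
        still = amp <= max_amplitude
        if still and run_start is None:
            run_start = i
        elif not still and run_start is not None:
            duration = (i - run_start) / fps
            if duration >= t_invalid:
                kind = "silence_or_noise"   # third preset time
            elif duration >= t_sentence:
                kind = "sentence_break"     # second preset time
            elif duration >= t_word:
                kind = "word_break"         # first preset time
            else:
                kind = ""                   # too short to count as a pause
            if kind:
                pauses.append(Pause(run_start / fps, i / fps, kind))
            run_start = None
    return pauses


# 10 frames per second: a 0.2 s, a 0.6 s and a 3.0 s stretch of still lips.
frames = [1.0] * 5 + [0.0] * 2 + [1.0] * 5 + [0.0] * 6 + [1.0] * 3 + [0.0] * 30
found = classify_pauses(frames, fps=10)
```

Because the timestamps are in seconds, each detected pause can be mapped directly onto the corresponding span of the voice information.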
The second recognition unit 204 is configured to perform speech recognition on the voice information according to the pause information.
If the pause information includes word-break pause information, speech recognition may be performed on the voice information according to the word-break pause information.
Alternatively, if the pause information includes sentence-break pause information, speech recognition may be performed on the voice information according to the sentence-break pause information.
Alternatively, if the pause information includes both word-break pause information and sentence-break pause information, speech recognition may be performed on the voice information according to both.
The pause information may be inserted into the text information converted from the voice information according to the time mapping relationship (that is, the corresponding time relationship) between the pause information and the voice information. For example, speech recognition may be performed on the voice information to obtain the corresponding text information, and the pause information (word-break pause information and/or sentence-break pause information) may then be inserted into the text information according to the time at which it occurs in the voice information, yielding text information that includes the pause information.
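As a rough sketch of this time-based insertion, assume the recognizer returns each recognized unit with its start and end time, and that pauses are given as (start, end, kind) tuples. The function name `insert_pauses`, the tuple layouts, and the choice of a space and a comma as markers are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, float, float]      # (text, start_s, end_s), hypothetical layout
PauseSpan = Tuple[float, float, str]  # (start_s, end_s, kind), hypothetical layout


def insert_pauses(tokens: List[Token], pauses: List[PauseSpan],
                  markers: Optional[Dict[str, str]] = None) -> str:
    """Merge recognized tokens and pause markers in order of time."""
    markers = markers or {"word_break": " ", "sentence_break": ", "}
    # Priority 0 puts a pause marker before a token that starts at the same time.
    events = [(start, 1, text) for text, start, _end in tokens]
    events += [(start, 0, markers[kind])
               for start, _end, kind in pauses if kind in markers]
    events.sort(key=lambda e: (e[0], e[1]))
    return "".join(text for _time, _prio, text in events).strip()


tokens = [("good", 0.0, 0.3), ("morning", 0.45, 0.8),
          ("every", 1.4, 1.7), ("one", 1.7, 2.0)]
pauses = [(0.3, 0.45, "word_break"), (0.8, 1.4, "sentence_break")]
text = insert_pauses(tokens, pauses)  # "good morning, everyone"
```

Pause kinds without a marker (for example silence or noise) are simply skipped here, since the description treats them as invalid input rather than as text structure.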
Alternatively, the pause information may be removed from the voice information, and speech recognition may be performed on the voice information from which the pause information has been removed. As described above, the pause information may represent noise or silence, that is, invalid voice input. Performing speech recognition on the voice information from which the pause information has been removed eliminates the noise or silence in the voice information.
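Removing the pause spans before recognition could look like the following sketch. It assumes the voice information is available as a flat sequence of samples at a known sample rate and that the pauses are given as (start, end) pairs in seconds; the name `remove_pauses` and this data layout are assumptions, not part of the description.

```python
from typing import List, Sequence, Tuple


def remove_pauses(samples: Sequence[float], sample_rate: int,
                  pauses: Sequence[Tuple[float, float]]) -> List[float]:
    """Return the samples with every (start_s, end_s) pause span cut out."""
    kept: List[float] = []
    cursor = 0  # index of the first sample not yet copied or skipped
    for start, end in sorted(pauses):
        begin = max(cursor, int(round(start * sample_rate)))
        stop = min(len(samples), int(round(end * sample_rate)))
        kept.extend(samples[cursor:begin])  # keep the speech before the pause
        cursor = max(cursor, stop)          # skip the pause itself
    kept.extend(samples[cursor:])           # keep whatever follows the last pause
    return kept


# Ten samples at 10 Hz; the spans 0.2-0.4 s and 0.7-0.9 s are pauses.
cleaned = remove_pauses(list(range(10)), 10, [(0.2, 0.4), (0.7, 0.9)])
```

The cleaned sample sequence would then be passed to whichever recognizer is in use.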
Various speech recognition techniques, such as dynamic time warping (DTW), hidden Markov models (HMM), vector quantization (VQ), and artificial neural networks (ANN), may be used to perform speech recognition on the voice information or on the voice information from which the pause information has been removed.
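Of the techniques listed, dynamic time warping is the simplest to illustrate. The sketch below is the textbook DTW distance between two one-dimensional feature sequences, using absolute difference as the local cost. It is a generic illustration only: the description does not say how DTW is applied, and a real recognizer would compare multi-dimensional acoustic feature vectors against stored templates.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# The same contour spoken at a different speed aligns with zero cost.
same = dtw_distance([1, 2, 3], [1, 2, 2, 3])
different = dtw_distance([0, 0], [1, 1])
```

The first call shows why DTW suits speech: a stretched version of the same pattern still aligns perfectly.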
The speech recognition device 10 of Embodiment 2 obtains the voice information input by a user; obtains lip images of the user when inputting the voice information; recognizes the pause information in the voice information according to the lip images; and performs speech recognition on the voice information according to the pause information. The speech recognition device 10 of Embodiment 2 can thus use lip images to assist speech recognition, improving the accuracy of speech recognition.
In another embodiment, the speech recognition device 10 may further include:
a third recognition unit, configured to obtain the motion amplitude of the user's lips according to the lip images and to recognize the tone corresponding to the voice information according to the motion amplitude of the user's lips. The tone may include a declarative tone, an interrogative tone, an imperative tone, an exclamatory tone, and so on. For example, if the motion amplitude of the user's lips is within a first preset amplitude range, the tone corresponding to the voice information is determined to be exclamatory; if the motion amplitude of the user's lips is within a second preset amplitude range, the tone corresponding to the voice information is determined to be imperative.
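The range lookup performed by the third recognition unit might be sketched as below. The description does not give the boundaries of the preset amplitude ranges, so the numeric values in `EXAMPLE_RANGES` are purely hypothetical placeholders, as are the names `tone_from_amplitude` and `EXAMPLE_RANGES`.

```python
from typing import List, Optional, Tuple

# (low, high, tone): hypothetical preset amplitude ranges on a 0..1 scale.
EXAMPLE_RANGES: List[Tuple[float, float, str]] = [
    (0.6, 1.0, "exclamatory"),  # stands in for the first preset amplitude range
    (0.3, 0.6, "imperative"),   # stands in for the second preset amplitude range
]


def tone_from_amplitude(amplitude: float,
                        ranges: List[Tuple[float, float, str]]) -> Optional[str]:
    """Return the tone whose half-open range [low, high) contains the amplitude."""
    for low, high, tone in ranges:
        if low <= amplitude < high:
            return tone
    return None  # no preset range applies


tone = tone_from_amplitude(0.7, EXAMPLE_RANGES)
```

Returning `None` when no range applies leaves the tone decision to other cues, which is a design choice of this sketch rather than something the description states.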
In another embodiment, the speech recognition device 10 may further include:
a fourth recognition unit, configured to obtain the lip features of the user's pronunciation, determine user features according to the lip features, and perform speech recognition on the voice information according to the user features and the pause information. The user features may include the user's gender, language type, dialect type, and/or habitual catchphrases. For example, the language type (such as Chinese) may be determined according to the lip features of the user's pronunciation, and speech recognition may be performed on the voice information according to the language type and the pause information. Obtaining more auxiliary information (that is, the user features) before performing speech recognition on the voice information can further improve the accuracy of speech recognition.
Embodiment three
Fig. 3 is a schematic diagram of the computer device provided by Embodiment 3 of the present invention. The computer device 1 includes a memory 20, a processor 30, and a computer program 40, such as a speech recognition program, that is stored in the memory 20 and can run on the processor 30. When executing the computer program 40, the processor 30 implements the steps of the above speech recognition method embodiment, for example steps 101 to 104 shown in Fig. 1. Alternatively, when executing the computer program 40, the processor 30 implements the functions of the modules/units of the above device embodiment, for example units 201 to 204.
Exemplarily, the computer program 40 may be divided into one or more modules/units, which are stored in the memory 20 and executed by the processor 30 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program 40 in the computer device 1. For example, the computer program 40 may be divided into the first acquisition unit 201, the second acquisition unit 202, the first recognition unit 203, and the second recognition unit 204 of Fig. 2. For the specific function of each module, see Embodiment 2.
The computer device 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that Fig. 3 is only an example of the computer device 1 and does not limit it. The computer device 1 may include more or fewer components than shown, may combine certain components, or may use different components; for example, the computer device 1 may further include input/output devices, network access devices, a bus, and so on.
The processor 30 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor 30 may be any conventional processor. The processor 30 is the control center of the computer device 1 and connects the various parts of the entire computer device 1 through various interfaces and lines.
The memory 20 may be used to store the computer program 40 and/or the modules/units. The processor 30 implements the various functions of the computer device 1 by running or executing the computer program and/or modules/units stored in the memory 20 and by calling the data stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function). The data storage area may store data created through use of the computer device 1 (such as audio data or a phone book). In addition, the memory 20 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the above method embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed computer device and method may be implemented in other ways. For example, the computer device embodiment described above is merely illustrative; the division of the units is only a division by logical function, and other divisions are possible in actual implementation.
In addition, the functional units in the embodiments of the present invention may be integrated in the same processing unit, each unit may exist physically on its own, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalents of the claims are intended to be included in the present invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or computer devices recited in a computer device claim may also be implemented by the same unit or computer device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A speech recognition method, characterized in that the method comprises:
obtaining voice information input by a user;
obtaining lip images of the user when inputting the voice information;
recognizing pause information in the voice information according to the lip images;
performing speech recognition on the voice information according to the pause information.
2. The speech recognition method according to claim 1, characterized in that performing speech recognition on the voice information according to the pause information comprises:
inserting the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
removing the pause information from the voice information, and performing speech recognition on the voice information from which the pause information has been removed.
3. The speech recognition method according to claim 1, characterized in that recognizing the pause information in the voice information according to the lip images comprises:
recognizing word-break pause information and/or sentence-break pause information in the voice information according to the lip images;
and performing speech recognition on the voice information according to the pause information comprises:
performing speech recognition on the voice information according to the word-break pause information and/or sentence-break pause information.
4. The speech recognition method according to claim 1, characterized in that obtaining the voice information input by the user and obtaining the lip images of the user when inputting the voice information comprises:
when the user inputs the voice information, collecting the voice information through a microphone of a terminal and capturing the lip images through a camera of the terminal.
5. The speech recognition method according to claim 4, characterized in that the method further comprises:
judging whether the lip motion information matches the voice information;
if the lip motion information does not match the voice information, controlling the camera to stop capturing the lip images.
6. The speech recognition method according to any one of claims 1-5, characterized in that the method further comprises:
obtaining the motion amplitude of the user's lips according to the lip images, and recognizing the tone corresponding to the voice information according to the motion amplitude of the user's lips; or
obtaining the lip features of the user's pronunciation, determining user features according to the lip features, and performing speech recognition on the voice information according to the user features and the pause information.
7. A speech recognition device, characterized in that the device comprises:
a first acquisition unit, configured to obtain voice information input by a user;
a second acquisition unit, configured to obtain lip images of the user when inputting the voice information;
a first recognition unit, configured to recognize pause information in the voice information according to the lip images;
a second recognition unit, configured to perform speech recognition on the voice information according to the pause information.
8. The speech recognition device according to claim 7, characterized in that the second recognition unit is specifically configured to:
insert the pause information into the text information converted from the voice information according to the time mapping relationship between the pause information and the voice information; or
remove the pause information from the voice information, and perform speech recognition on the voice information from which the pause information has been removed.
9. A computer device, characterized in that the computer device comprises a processor, and the processor implements the steps of the speech recognition method according to any one of claims 1-6 when executing a computer program stored in a memory.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1-6.
Application number: CN201710648985.2A; priority date: 2017-08-01; filing date: 2017-08-01; title: Audio recognition method and device, computer installation and readable storage medium storing program for executing; status: Withdrawn; publication: CN107293300A (en)

Publications (1)

Publication Number Publication Date
CN107293300A (en) 2017-10-24

Family

ID=60104131

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20171024