CN109448711A - Speech recognition method, apparatus, and computer storage medium - Google Patents
- Publication number: CN109448711A (application CN201811238626.0A)
- Authority
- CN
- China
- Prior art keywords
- user
- speech
- control instruction
- voice
- prediction
- Prior art date
- Legal status: Pending (an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/28—Constructional details of speech recognition systems
- G10L2015/223—Execution procedure of a spoken command
Abstract
The invention discloses a speech recognition method, apparatus, and computer storage medium, to solve the low recognition rate and inconvenience of voice control in the prior art. The method comprises: while collecting user speech through a voice acquisition device, collecting the user's facial image through an image acquisition device; predicting, with a prediction model, the prediction voice corresponding to the user speech based on the user speech and the facial image, the prediction model being trained on the voices and corresponding facial images of different groups of people for each control instruction; matching, based on the prediction voice, the standard speech audio data corresponding to the control instruction from a speech database, the speech database storing the mapping between control instructions and their corresponding standard speech audio data; and calculating, with a matching model, the matching degree between the user speech and the standard speech audio data, and controlling the smart home device according to the control instruction corresponding to the standard speech audio data when the matching degree reaches a set threshold.
Description
Technical field
The present invention relates to the field of smart home technology, and in particular to a speech recognition method, apparatus, and computer storage medium.
Background art
With the development of science and technology, speech recognition technology is used more and more widely in the smart home field.
For example, a user can operate a smart home device by issuing a voice instruction to it. When a user says "power on" to a smart air conditioner, the air conditioner can identify the user's voice instruction through speech recognition technology and then perform the power-on action.
However, when a smart home device is controlled by speech recognition, the voice issued by the user is easily affected by factors such as noise and distance. This reduces the recognition rate, so the smart home device may fail to execute the action that corresponds to the user's voice instruction.
In the prior art, noise reduction is usually applied to the collected user speech to improve the recognition rate. Two processing methods are common: one segments the collected user speech (applying noise reduction, gain adjustment, and so on) and then extracts the effective voice information for algorithmic recognition; the other trains an end-to-end deep learning algorithm on user speech to obtain a speech recognition model and uses that model to identify the user speech. However, both methods improve the recognition rate only to a limited extent, and training the speech recognition model takes considerable time, which degrades the user experience.
In view of this, how to improve the recognition rate of voice conveniently, quickly, and effectively has become an urgent technical problem.
Summary of the invention
The present invention provides a speech recognition method, apparatus, and computer storage medium to solve the low recognition rate and inconvenience of voice control in the prior art.
In a first aspect, to solve the above technical problem, an embodiment of the present invention provides a speech recognition method applied to a smart home device. The method is as follows:
While collecting user speech through a voice acquisition device, collect the user's facial image through an image acquisition device.
Based on the user speech and the facial image, predict the prediction voice corresponding to the user speech with a prediction model. The prediction model is trained on the voices and corresponding standard facial images of different groups of people for each control instruction, so that after predicting from the voices issued by different people for the same control instruction and the facial images they present, the model can output a voice similar to the standard voice corresponding to that control instruction.
Based on the prediction voice, match the standard speech audio data corresponding to the control instruction from a speech database. The speech database stores the mapping between the control instructions of the smart home device and their corresponding standard speech audio data.
Calculate the matching degree between the user speech and the standard speech audio data with a matching model. When the matching degree reaches a set threshold, control the smart home device according to the control instruction corresponding to the standard speech audio data.
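The four steps of the first aspect can be read as a single control loop. A minimal sketch is given below; the model objects, function names, and the 0-to-1 matching score are illustrative assumptions, not part of the patent.

```python
def handle_voice_command(speech, face_image, prediction_model, speech_db,
                         matching_model, device, threshold=0.9):
    """Sketch of the claimed pipeline: predict, match, verify, execute.

    `speech_db` maps each control instruction to its standard speech
    audio data; all model interfaces here are hypothetical.
    """
    # Step 2: predict the "prediction voice" from speech plus facial image.
    predicted_voice = prediction_model.predict(speech, face_image)

    # Step 3: find the instruction whose standard audio best fits the
    # prediction voice.
    instruction, standard_audio = max(
        speech_db.items(),
        key=lambda item: matching_model.score(predicted_voice, item[1]))

    # Step 4: verify the raw user speech against the standard audio and
    # only act when the matching degree reaches the threshold.
    if matching_model.score(speech, standard_audio) >= threshold:
        device.execute(instruction)
        return instruction
    return None
```

Note that the raw user speech, not the prediction voice, is what is verified against the standard audio in step 4, matching the claim's wording.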
When the smart home device collects user speech through the voice acquisition device, it simultaneously collects the user's facial image through the image acquisition device. Based on the collected user speech and facial image, the prediction model predicts the prediction voice corresponding to the user speech; the prediction model is trained on the voices and corresponding standard facial images of different groups of people for each control instruction, so that after predicting from the voices issued by different people for the same control instruction and the facial images they present, it can output a voice similar to the standard voice corresponding to that control instruction. Based on the prediction voice, the standard speech audio data corresponding to the control instruction is then matched from the speech database, which stores the mapping between the control instructions of the smart home device and their corresponding standard speech audio data. Finally, the matching model calculates the matching degree between the user speech and the standard speech audio data, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve the recognition rate of voice quickly and conveniently, reduce errors caused by incorrect speech recognition, and improve the user experience.
Preferably, predicting the prediction voice corresponding to the user speech with the prediction model, based on the user speech and the facial image, comprises:
identifying, through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech from the user speech;
obtaining, based on the facial image, a second control instruction set corresponding to the facial image from a facial image database in the prediction model, wherein the facial image database stores the mapping between control instructions and standard user expressions and/or standard user lip shapes;
matching the first control instruction set one by one against each control instruction in the second control instruction set, and taking the audio data corresponding to the control instruction with the highest matching degree as the prediction voice.
Preferably, obtaining the second control instruction set corresponding to the facial image from the facial image database in the prediction model, based on the facial image, comprises:
extracting the corresponding user expression and/or user lip shape from the facial image to obtain user expression data and/or user lip shape data;
obtaining the second control instruction set from the facial image database based on the user expression data and/or user lip shape data.
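The lookup described above can be sketched as follows, assuming the facial image database is a mapping from control instructions to reference lip-shape feature vectors and that closeness is a simple Euclidean distance; the feature representation and tolerance are illustrative assumptions.

```python
def second_instruction_set(lip_features, face_db, tolerance=1.0):
    """Return the control instructions whose standard lip-shape
    features are close enough to the lip shape extracted from the
    user's facial image (`face_db`: instruction -> feature vector)."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [instruction for instruction, reference in face_db.items()
            if distance(lip_features, reference) <= tolerance]
```

In a real system the same shape of lookup would apply to expression features, or to a combination of expression and lip-shape data.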
Preferably, after calculating the similarity between the user speech and the standard speech audio data, the method further comprises:
if the similarity does not reach the set threshold, instructing the user to re-record the user speech through a preset prompt message, wherein the preset prompt message is a sound and/or light prompt.
In a second aspect, an embodiment of the present invention provides a speech recognition device applied to a smart home device. The device comprises:
an acquisition unit, configured to collect the user's facial image through an image acquisition device while user speech is collected through a voice acquisition device;
a prediction unit, configured to predict the prediction voice corresponding to the user speech with a prediction model based on the user speech and the facial image, wherein the prediction model is trained on the voices and corresponding facial images of different groups of people for each control instruction, so that after predicting from the voices issued by different people for the same control instruction and the facial images they present, it can output a voice similar to the standard voice corresponding to that control instruction;
an obtaining unit, configured to match, based on the prediction voice, the standard speech audio data corresponding to the control instruction from a speech database, wherein the speech database stores the mapping between the control instructions of the smart home device and their corresponding standard speech audio data;
a computing unit, configured to calculate the matching degree between the user speech and the standard speech audio data with a matching model and, when the matching degree reaches a set threshold, control the smart home device according to the control instruction corresponding to the standard speech audio data.
Preferably, the prediction unit is specifically configured to:
identify, through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech from the user speech;
obtain, based on the facial image, a second control instruction set corresponding to the facial image from the facial image database in the prediction model, wherein the facial image database stores the mapping between control instructions and standard user expressions and/or standard user lip shapes;
match the first control instruction set one by one against each control instruction in the second control instruction set, and take the audio data corresponding to the control instruction with the highest matching degree as the prediction voice.
Preferably, the prediction unit is further configured to:
extract the corresponding user expression and/or user lip shape from the facial image to obtain user expression data and/or user lip shape data;
obtain the second control instruction set from the facial image database based on the user expression data and/or user lip shape data.
Preferably, the computing unit is further configured to:
if the similarity does not reach the set threshold, instruct the user to re-record the user speech through a preset prompt message, wherein the preset prompt message is a sound and/or light prompt.
In a third aspect, an embodiment of the present invention further provides a speech recognition device applied to a smart home device. The device comprises:
at least one processor; and
a memory connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor executes the instructions stored in the memory to perform the method described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the method described in the first aspect.
Through one or more of the above technical solutions, the embodiments of the present invention achieve at least the following technical effects:
In the embodiments provided by the invention, while the smart home device collects user speech through the voice acquisition device, it simultaneously collects the user's facial image through the image acquisition device. Based on the collected user speech and facial image, the prediction model predicts the prediction voice corresponding to the user speech; the prediction model is trained on the voices and corresponding facial images of different groups of people for each control instruction, so that after predicting from the voices issued by different people for the same control instruction and the facial images they present, it can output a voice similar to the standard voice corresponding to that control instruction. Based on the prediction voice, the standard speech audio data corresponding to the control instruction is then matched from the speech database, which stores the mapping between the control instructions of the smart home device and their corresponding standard speech audio data. Finally, the matching model calculates the matching degree between the user speech and the standard speech audio data, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve the recognition rate of voice quickly and conveniently, reduce errors caused by incorrect speech recognition, and improve the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of an air conditioner performing speech recognition according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of obtaining the second control instruction set according to an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of a speech recognition device provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention provide a speech recognition method, apparatus, and computer storage medium to solve the low recognition rate and inconvenience of voice control in the prior art.
To solve the above technical problem, the general idea of the technical solution in the embodiments of the present application is as follows:
A speech recognition method is provided, comprising: while collecting user speech through a voice acquisition device, collecting the user's facial image through an image acquisition device; predicting the prediction voice corresponding to the user speech with a prediction model based on the user speech and the facial image, the prediction model being trained on the voices and corresponding facial images of different groups of people for each control instruction so that, after predicting from the voices issued by different people for the same control instruction and the facial images they present, it can output a voice similar to the standard voice corresponding to that control instruction; matching, based on the prediction voice, the standard speech audio data corresponding to the control instruction from a speech database storing the mapping between the control instructions of the smart home device and their corresponding standard speech audio data; and calculating the matching degree between the user speech and the standard speech audio data with a matching model, and controlling the smart home device according to the control instruction corresponding to the standard speech audio data when the matching degree reaches a set threshold.
In the above scheme, while the smart home device collects user speech through the voice acquisition device, it simultaneously collects the user's facial image through the image acquisition device, and predicts the prediction voice corresponding to the user speech with the prediction model based on the collected user speech and facial image; the prediction model is trained on the voices and corresponding facial images of different groups of people for each control instruction, so that after predicting from the voices issued by different people for the same control instruction and the facial images they present, it can output a voice similar to the standard voice corresponding to that control instruction. Based on the prediction voice, the standard speech audio data corresponding to the control instruction is then matched from the speech database, which stores the mapping between the control instructions of the smart home device and their corresponding standard speech audio data. Finally, the matching model calculates the matching degree between the user speech and the standard speech audio data, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve the recognition rate of voice quickly and conveniently, reduce errors caused by incorrect speech recognition, and improve the user experience.
To better understand the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments of the present invention and the specific features therein are a detailed explanation of the technical solution of the present invention rather than a limitation of it, and that, where no conflict arises, the embodiments and the technical features therein can be combined with each other.
Referring to Fig. 1, an embodiment of the present invention provides a speech recognition method applied to a smart home device. The processing flow of the method is as follows.
Step 101: While collecting user speech through a voice acquisition device, collect the user's facial image through an image acquisition device.
When smart home devices such as smart air conditioners and smart televisions are controlled by voice, the user may be far from the device, or other noises may be present while the user speaks (a door opening or closing, a washing machine running, and so on), so the smart home device the user wants to control may fail to accurately identify the instruction corresponding to the user speech.
For this reason, in the embodiments provided by the invention, while the smart home device collects user speech through the voice acquisition device, it also collects the user's facial expression through the image acquisition device, so that the smart home device can comprehensively analyze and judge the user speech together with the facial expression, determine the correct instruction corresponding to the user speech, and work according to that instruction.
The voice acquisition device can be a microphone, a microphone array, or the like. It can be a component of the smart home device, an external voice acquisition device, or the microphone on a smartphone; an external voice acquisition device can communicate with the smart home device in a wired or wireless manner, without specific limitation.
The image acquisition device can be a camera, a CCD sensor, a webcam, or the like. It can be a component of the smart home device, an external image acquisition device, or the camera on a smartphone; an external image acquisition device can communicate with the smart home device in a wired or wireless manner, without specific limitation.
After the user speech and the user's facial image are collected through the voice acquisition device and the image acquisition device, step 102 is executed.
Step 102: Based on the user speech and the facial image, predict the prediction voice corresponding to the user speech with a prediction model. The prediction model is trained on the voices and corresponding facial images of different groups of people for each control instruction, so that after predicting from the voices issued by different people for the same control instruction and the facial images they present, it can output a voice similar to the standard voice corresponding to that control instruction.
The prediction model can be obtained by training on the voices and corresponding facial images of different groups of people; the prediction model used in the smart home device is the trained model.
For example, take a smart air conditioner, and suppose the prediction voice for "turn on the air conditioner" is to be trained. Different groups of people, such as men, women, children, and the elderly, each read "turn on the air conditioner", and while each group reads it, the sound they make (audio data) and their facial images while speaking are collected, yielding audio data and facial images whose similarity to the standard audio and standard image of the turn-on instruction is, for example, 90%. In use, after the user speech and facial image are collected, the trained prediction model can directly output the similar voice.
Further, to adapt to local dialects, different groups of people in each region can also read the control instructions in the local dialect, and the corresponding audio data and facial images collected while they read are used to train the prediction model to produce the similar voice and facial image corresponding to the trained control instruction. The training process is similar to the process above and is not repeated here.
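The training-data collection described above can be sketched as a simple data structure; the field names, group labels, and feature representations are illustrative assumptions about how such recordings might be organized, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One (audio, face) pair read by one speaker for one instruction."""
    instruction: str   # e.g. "turn on the air conditioner"
    group: str         # e.g. "man", "woman", "child", "elderly"
    dialect: str       # e.g. "standard", or a regional dialect
    audio: list        # audio features captured while speaking
    face: list         # lip/expression features captured while speaking

def build_training_set(recordings):
    """Group raw recordings by instruction, as described for training:
    every instruction is read by several demographic groups (and
    dialects), and each reading contributes one sample."""
    per_instruction = {}
    for sample in recordings:
        per_instruction.setdefault(sample.instruction, []).append(sample)
    return per_instruction
```

Each per-instruction bucket would then be used to train the prediction model to map any of its (audio, face) pairs toward the instruction's standard voice.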
Specifically, predicting the prediction voice corresponding to the user speech with the prediction model, based on the user speech and the facial image, can be realized by the following procedure:
First, through the speech recognition technology in the prediction model, identify the first control instruction set corresponding to the user speech from the user speech.
Second, based on the facial image, obtain the second control instruction set corresponding to the facial image from the facial image database in the prediction model; the facial image database stores the mapping between control instructions and standard user expressions and/or standard user lip shapes.
Specifically, the corresponding user expression and/or user lip shape can first be extracted from the facial image to obtain user expression data and/or user lip shape data; the second control instruction set is then obtained from the facial image database based on the user expression data and/or user lip shape data.
Finally, match the first control instruction set one by one against each control instruction in the second control instruction set, and take the audio data corresponding to the control instruction with the highest matching degree as the prediction voice.
For example, referring to Fig. 2, take an air conditioner as the smart home device, with an external camera as its image acquisition device. When the user says "turn on the air conditioner", the air conditioner collects the user speech through the voice acquisition device and also controls the camera to collect the user's facial image. As the user utters "turn on the air conditioner", the washing machine is working and produces noise 1, and another family member, stopping a child from watching television, shouts noise 2: "turn off the TV!". The user speech obtained by the air conditioner therefore contains, in addition to the voice "turn on the air conditioner", the washing machine's noise 1 and the other mixed voice, noise 2 "turn off the TV!".
After the air conditioner obtains the facial image and the user speech, the built-in prediction model identifies the first control instruction set corresponding to the user speech: the "power on" instruction and the "power off" instruction. Meanwhile, the corresponding user lip shapes are extracted from the facial image and compared one by one with the lip shape data in the facial image database to determine the word corresponding to each lip shape, and then the recognition words corresponding to these lip shapes (recognition word 1 is "turn on the air conditioner", recognition word 2 is "like the air conditioner"). Then, according to the correspondence between recognition words and air conditioner instructions in the prediction model, the air conditioner control instruction corresponding to each recognition word is determined, yielding the second control instruction set corresponding to the facial image: instruction 1 "power on" and instruction 2 "auto clean"; see Fig. 3.
After obtaining the first control instruction set ("power on" and "power off") corresponding to the user speech and the second control instruction set ("power on" and "auto clean") corresponding to the facial image, the first control instruction set is matched one by one against each control instruction in the second control instruction set, and the audio data corresponding to the control instruction with the highest matching degree (the "power on" instruction) is taken as the prediction voice.
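The cross-matching in this example can be sketched as follows; exact string equality stands in for the matching degree the model would compute, and `audio_db` is a hypothetical store of per-instruction standard audio.

```python
def pick_prediction(first_set, second_set, audio_db):
    """Match every pair from the speech-derived and face-derived
    instruction sets and return the audio data of the best-matching
    instruction (`audio_db`: instruction -> standard audio data)."""
    best, best_score = None, -1.0
    for a in first_set:
        for b in second_set:
            score = 1.0 if a == b else 0.0  # stand-in for matching degree
            if score > best_score:
                best, best_score = a, score
    return audio_db.get(best)
```

With the sets from the example, only "power on" appears in both, so its audio data becomes the prediction voice even though the raw speech also contained "turn off the TV!".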
It should be noted that the above embodiment only takes extracting lip shapes from the facial image as an example. In actual use, the user's facial expression, body movements, and the like can also be referenced to assist in identifying the control instruction corresponding to the user speech, improving the accuracy of the recognition.
After the smart home device predicts the prediction voice corresponding to the user speech, steps 103 and 104 can be executed.
Step 103: Based on the prediction voice, match the standard speech audio data corresponding to the control instruction from the speech database; the speech database stores the mapping between the control instructions of the smart home device and their corresponding standard speech audio data.
Step 104: Calculate the matching degree between the user speech and the standard speech audio data with the matching model. When the matching degree reaches the set threshold, control the smart home device according to the control instruction corresponding to the standard speech audio data.
After the smart home device predicts the prediction voice corresponding to the user speech, it still needs to verify whether the prediction result is correct. Specifically, according to the prediction voice, the standard speech audio data corresponding to the control instruction is obtained from the speech database storing the mapping between the control instructions of the smart home device and the corresponding standard voice data, and the prediction is verified by calculating the similarity between the user speech and the standard speech audio data: when the similarity reaches a set threshold, for example 90%, the prediction voice is determined to be correct; otherwise it is incorrect.
If the prediction voice is correct, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data.
If the prediction voice is incorrect, that is, if after the similarity between the user speech and the standard speech audio data is calculated the similarity does not reach the set threshold, the prediction voice is determined to be incorrect, and the user is instructed to re-record the user speech through a preset prompt message; the preset prompt message is a sound and/or light prompt.
For example, when the similarity does not reach the set threshold, the smart home device can inform the user through an audio device to input the voice information again, as when the air conditioner plays "What did you say?" so the user repeats the speech; it can also instruct the user to input the voice information again through an indicator lamp, as when the air conditioner flashes a red light to prompt the user to repeat the speech.
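The verification and fallback-prompt behavior above can be sketched in a few lines; the `similarity` function, device interface, and prompt text are illustrative assumptions.

```python
def verify_and_act(user_speech, standard_audio, instruction, device,
                   similarity, threshold=0.9):
    """Verify the predicted instruction against the raw user speech and
    either execute it or ask the user to repeat; `similarity` is any
    0-to-1 scoring function (hypothetical here)."""
    if similarity(user_speech, standard_audio) >= threshold:
        device.execute(instruction)
        return "executed"
    # Similarity too low: fall back to a sound (or light) prompt so the
    # user re-records the speech.
    device.prompt("What did you say?")
    return "re-prompt"
```

A light-only device would implement `prompt` by flashing its indicator lamp instead of playing audio.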
Based on the same inventive concept, an embodiment of the present invention provides a speech recognition apparatus. For specific implementations of the speech recognition method performed by the apparatus, refer to the description of the method embodiments; repeated details are omitted. Referring to Fig. 4, the apparatus comprises:
an acquisition unit 401, configured to collect a user face image through an image collector while user speech is collected through a voice collector;
a predicting unit 402, configured to predict, with a prediction model and based on the user speech and the user face image, a predicted speech corresponding to the user speech; wherein the prediction model is trained on the speech of different groups of people for each control instruction and on the corresponding face images, so that after the prediction model processes the speech uttered by different people for the same control instruction and the face images they present, it can output a speech similar to the standard speech corresponding to that control instruction;
a matching unit 403, configured to match, based on the predicted speech, the standard speech-audio data corresponding to the control instruction from a speech database; wherein the speech database stores the mapping between the control instructions of the smart home device and the corresponding standard speech-audio data;
a computing unit 404, configured to calculate, with a matching model, the matching degree between the user speech and the standard speech-audio data, and, when the matching degree reaches a set threshold, control the smart home device according to the control instruction corresponding to the standard speech-audio data.
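The four units (401–404) chain into a single pipeline. The sketch below is an assumption-laden illustration only: the prediction model and matching model are stand-in callables, and the speech database is a plain mapping from predicted speech to (standard audio, control instruction).

```python
def recognize(voice, face, predict_model, speech_db, match_model, threshold):
    """One pass through the pipeline of units 401-404.

    Returns the control instruction to execute, or None when the user
    must be re-prompted (matching degree below the set threshold).
    """
    predicted = predict_model(voice, face)              # predicting unit 402
    standard_audio, instruction = speech_db[predicted]  # matching unit 403
    degree = match_model(voice, standard_audio)         # computing unit 404
    if degree >= threshold:
        return instruction  # control the smart home device
    return None             # trigger the sound/light re-prompt instead
```

With trivial stand-ins — `predict_model = lambda v, f: "on"`, a one-entry database, and a constant matching model — the function returns the instruction when the degree clears the threshold and `None` otherwise.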
Preferably, the predicting unit 402 is specifically configured to:
identify, through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
obtain, based on the user face image, a second control instruction set corresponding to the user face image from a face image database in the prediction model; wherein the face image database stores the mapping between control instructions and standard user expressions and/or standard user lip shapes;
match the first control instruction set one by one against each control instruction in the second control instruction set, and take the audio data corresponding to the control instruction with the highest matching degree as the predicted speech.
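The one-by-one cross-match between the two instruction sets can be sketched as an exhaustive pairwise comparison that keeps the best-scoring candidate. The string-similarity scorer below is purely illustrative — the patent does not specify how the matching degree is computed.

```python
from difflib import SequenceMatcher

def best_instruction(first_set, second_set):
    """Match every instruction from the speech-derived first set against
    every instruction in the face-derived second set, and return the
    candidate with the highest matching degree (None if either set is empty)."""
    def score(a, b):
        # placeholder similarity measure; a real system would compare
        # acoustic/visual features rather than instruction strings
        return SequenceMatcher(None, a, b).ratio()

    best, best_degree = None, -1.0
    for cand in first_set:
        for other in second_set:
            degree = score(cand, other)
            if degree > best_degree:
                best, best_degree = cand, degree
    return best
```

The audio data keyed by the returned instruction would then serve as the predicted speech.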
Preferably, the predicting unit 402 is further configured to:
extract the corresponding user expression and/or user lip shape from the user face image to obtain user expression data and/or user lip-shape data;
obtain the second control instruction set from the face image database based on the user expression data and/or the user lip-shape data.
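The face-image database lookup described above amounts to a mapping from (standard expression, standard lip shape) features to control-instruction sets. The feature labels and instructions in this sketch are invented examples, not values from the patent.

```python
# hypothetical face-image database: feature pair -> control instruction set
FACE_DB = {
    ("neutral", "open-round"): {"turn on air conditioner"},
    ("neutral", "closed-flat"): {"turn off air conditioner"},
}

def second_instruction_set(expression, lip_shape):
    """Return the control instructions whose standard user expression
    and/or standard user lip shape match the extracted user data."""
    return FACE_DB.get((expression, lip_shape), set())
```

An unmatched feature pair yields an empty set, in which case the speech-derived first instruction set alone would drive the prediction.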
Preferably, the computing unit 404 is further configured to:
if the similarity cannot reach the set threshold, instruct the user through a preset prompt to re-record the user speech; wherein the preset prompt is a sound and/or light prompt.
Based on the same inventive concept, an embodiment of the present invention provides a speech recognition apparatus, comprising: at least
one processor, and
a memory connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor executes the instructions stored in the memory to perform the speech recognition method described above.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, comprising:
the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the speech recognition method described above.
In the embodiments provided by the present invention, while the smart home device collects user speech through a voice collector, it simultaneously collects a user face image through an image collector. Based on the collected user speech and user face image, a prediction model predicts the predicted speech corresponding to the user speech; the prediction model is trained on the speech of different groups of people for each control instruction and on the corresponding face images, so that after processing the speech uttered by different people for the same control instruction and the face images they present, it can output a speech similar to the standard speech corresponding to that control instruction. Then, based on the predicted speech, the standard speech-audio data corresponding to the control instruction is matched from a speech database, which stores the mapping between the control instructions of the smart home device and the corresponding standard speech-audio data. Finally, a matching model calculates the matching degree between the user speech and the standard speech-audio data, and when the matching degree reaches a set threshold the smart home device is controlled according to the control instruction corresponding to the standard speech-audio data. The smart home device thereby improves the speech recognition rate quickly and conveniently, reduces errors caused by incorrect speech recognition, and improves the user experience.
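The patent leaves the matching-degree computation unspecified. One plausible realization — an assumption for illustration only — is cosine similarity between fixed-length feature vectors extracted from the user speech and from the standard audio:

```python
import math

def matching_degree(user_feats, standard_feats):
    """Cosine similarity between two equal-length feature vectors,
    in [-1, 1]; returns 0.0 when either vector is all zeros."""
    dot = sum(u * s for u, s in zip(user_feats, standard_feats))
    norm = (math.sqrt(sum(u * u for u in user_feats))
            * math.sqrt(sum(s * s for s in standard_feats)))
    return dot / norm if norm else 0.0
```

Identical feature vectors score 1.0 and orthogonal ones score 0.0, so a set threshold such as 0.8 cleanly separates accept from re-prompt.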
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical memory) containing computer-usable program code.
Embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (10)
1. A speech recognition method, applied to a smart home device, characterized by comprising:
collecting a user face image through an image collector while user speech is collected through a voice collector;
predicting, with a prediction model and based on the user speech and the user face image, a predicted speech corresponding to the user speech; wherein the prediction model is trained on the speech of different groups of people for each control instruction and on the corresponding face images, so that after the prediction model processes the speech uttered by different people for the same control instruction and the face images they present, it can output a speech similar to the standard speech corresponding to that control instruction;
matching, based on the predicted speech, the standard speech-audio data corresponding to the control instruction from a speech database; wherein the speech database stores the mapping between the control instructions of the smart home device and the corresponding standard speech-audio data;
calculating, with a matching model, the matching degree between the user speech and the standard speech-audio data, and, when the matching degree reaches a set threshold, controlling the smart home device according to the control instruction corresponding to the standard speech-audio data.
2. The method according to claim 1, characterized in that predicting, with the prediction model and based on the user speech and the user face image, the predicted speech corresponding to the user speech comprises:
identifying, through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
obtaining, based on the user face image, a second control instruction set corresponding to the user face image from a face image database in the prediction model; wherein the face image database stores the mapping between control instructions and standard user expressions and/or standard user lip shapes;
matching the first control instruction set one by one against each control instruction in the second control instruction set, and taking the audio data corresponding to the control instruction with the highest matching degree as the predicted speech.
3. The method according to claim 2, characterized in that obtaining, based on the user face image, the second control instruction set corresponding to the user face image from the face image database in the prediction model comprises:
extracting the corresponding user expression and/or user lip shape from the user face image to obtain user expression data and/or user lip-shape data;
obtaining the second control instruction set from the face image database based on the user expression data and/or the user lip-shape data.
4. The method according to any one of claims 1 to 3, characterized in that, after calculating the similarity between the user speech and the standard speech-audio data, the method further comprises:
if the similarity cannot reach the set threshold, instructing the user through a preset prompt to re-record the user speech; wherein the preset prompt is a sound and/or light prompt.
5. A speech recognition apparatus, applied to a smart home device, characterized by comprising:
an acquisition unit, configured to collect a user face image through an image collector while user speech is collected through a voice collector;
a predicting unit, configured to predict, with a prediction model and based on the user speech and the user face image, a predicted speech corresponding to the user speech; wherein the prediction model is trained on the speech of different groups of people for each control instruction and on the corresponding face images, so that after the prediction model processes the speech uttered by different people for the same control instruction and the face images they present, it can output a speech similar to the standard speech corresponding to that control instruction;
a matching unit, configured to match, based on the predicted speech, the standard speech-audio data corresponding to the control instruction from a speech database; wherein the speech database stores the mapping between the control instructions of the smart home device and the corresponding standard speech-audio data;
a computing unit, configured to calculate, with a matching model, the matching degree between the user speech and the standard speech-audio data, and, when the matching degree reaches a set threshold, control the smart home device according to the control instruction corresponding to the standard speech-audio data.
6. The apparatus according to claim 5, characterized in that the predicting unit is specifically configured to:
identify, through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
obtain, based on the user face image, a second control instruction set corresponding to the user face image from a face image database in the prediction model; wherein the face image database stores the mapping between control instructions and standard user expressions and/or standard user lip shapes;
match the first control instruction set one by one against each control instruction in the second control instruction set, and take the audio data corresponding to the control instruction with the highest matching degree as the predicted speech.
7. The apparatus according to claim 6, characterized in that the predicting unit is further configured to:
extract the corresponding user expression and/or user lip shape from the user face image to obtain user expression data and/or user lip-shape data;
obtain the second control instruction set from the face image database based on the user expression data and/or the user lip-shape data.
8. The apparatus according to any one of claims 5 to 7, characterized in that the computing unit is further configured to:
if the similarity cannot reach the set threshold, instruct the user through a preset prompt to re-record the user speech; wherein the preset prompt is a sound and/or light prompt.
9. A speech recognition apparatus, characterized by comprising:
at least one processor, and
a memory connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor executes the instructions stored in the memory to perform the method according to any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that:
the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811238626.0A CN109448711A (en) | 2018-10-23 | 2018-10-23 | A kind of method, apparatus and computer storage medium of speech recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811238626.0A CN109448711A (en) | 2018-10-23 | 2018-10-23 | A kind of method, apparatus and computer storage medium of speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109448711A true CN109448711A (en) | 2019-03-08 |
Family
ID=65548031
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811238626.0A Pending CN109448711A (en) | 2018-10-23 | 2018-10-23 | A kind of method, apparatus and computer storage medium of speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109448711A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047486A (en) * | 2019-05-20 | 2019-07-23 | 合肥美的电冰箱有限公司 | Sound control method, device, server, system and storage medium |
CN110262278A (en) * | 2019-07-31 | 2019-09-20 | 珠海格力电器股份有限公司 | The control method and device of intelligent appliance equipment, intelligent electric appliance |
CN110349577A (en) * | 2019-06-19 | 2019-10-18 | 深圳前海达闼云端智能科技有限公司 | Man-machine interaction method, device, storage medium and electronic equipment |
CN111028842A (en) * | 2019-12-10 | 2020-04-17 | 上海芯翌智能科技有限公司 | Method and equipment for triggering voice interaction response |
CN111276140A (en) * | 2020-01-19 | 2020-06-12 | 珠海格力电器股份有限公司 | Voice command recognition method, device, system and storage medium |
CN111312221A (en) * | 2020-01-20 | 2020-06-19 | 宁波舜韵电子有限公司 | Intelligent range hood based on voice control |
CN111739534A (en) * | 2020-06-04 | 2020-10-02 | 广东小天才科技有限公司 | Processing method and device for assisting speech recognition, electronic equipment and storage medium |
CN111803936A (en) * | 2020-07-16 | 2020-10-23 | 网易(杭州)网络有限公司 | Voice communication method and device, electronic equipment and storage medium |
CN114578705A (en) * | 2022-04-01 | 2022-06-03 | 深圳冠特家居健康系统有限公司 | Intelligent home control system based on 5G Internet of things |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030212557A1 (en) * | 2002-05-09 | 2003-11-13 | Nefian Ara V. | Coupled hidden markov model for audiovisual speech recognition |
CN102023703A (en) * | 2009-09-22 | 2011-04-20 | 现代自动车株式会社 | Combined lip reading and voice recognition multimodal interface system |
CN102324035A (en) * | 2011-08-19 | 2012-01-18 | 广东好帮手电子科技股份有限公司 | Method and system of applying lip posture assisted speech recognition technique to vehicle navigation |
EP2562746A1 (en) * | 2011-08-25 | 2013-02-27 | Samsung Electronics Co., Ltd. | Apparatus and method for recognizing voice by using lip image |
CN106157956A (en) * | 2015-03-24 | 2016-11-23 | 中兴通讯股份有限公司 | The method and device of speech recognition |
WO2017151672A2 (en) * | 2016-02-29 | 2017-09-08 | Faraday & Future Inc. | Voice assistance system for devices of an ecosystem |
CN107272607A (en) * | 2017-05-11 | 2017-10-20 | 上海斐讯数据通信技术有限公司 | A kind of intelligent home control system and method |
CN108346427A (en) * | 2018-02-05 | 2018-07-31 | 广东小天才科技有限公司 | A kind of audio recognition method, device, equipment and storage medium |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047486A (en) * | 2019-05-20 | 2019-07-23 | 合肥美的电冰箱有限公司 | Sound control method, device, server, system and storage medium |
CN110349577A (en) * | 2019-06-19 | 2019-10-18 | 深圳前海达闼云端智能科技有限公司 | Man-machine interaction method, device, storage medium and electronic equipment |
CN110349577B (en) * | 2019-06-19 | 2022-12-06 | 达闼机器人股份有限公司 | Man-machine interaction method and device, storage medium and electronic equipment |
CN110262278A (en) * | 2019-07-31 | 2019-09-20 | 珠海格力电器股份有限公司 | The control method and device of intelligent appliance equipment, intelligent electric appliance |
CN111028842A (en) * | 2019-12-10 | 2020-04-17 | 上海芯翌智能科技有限公司 | Method and equipment for triggering voice interaction response |
CN111028842B (en) * | 2019-12-10 | 2021-05-11 | 上海芯翌智能科技有限公司 | Method and equipment for triggering voice interaction response |
CN111276140A (en) * | 2020-01-19 | 2020-06-12 | 珠海格力电器股份有限公司 | Voice command recognition method, device, system and storage medium |
CN111276140B (en) * | 2020-01-19 | 2023-05-12 | 珠海格力电器股份有限公司 | Voice command recognition method, device, system and storage medium |
CN111312221B (en) * | 2020-01-20 | 2022-07-22 | 宁波舜韵电子有限公司 | Intelligent range hood based on voice control |
CN111312221A (en) * | 2020-01-20 | 2020-06-19 | 宁波舜韵电子有限公司 | Intelligent range hood based on voice control |
CN111739534A (en) * | 2020-06-04 | 2020-10-02 | 广东小天才科技有限公司 | Processing method and device for assisting speech recognition, electronic equipment and storage medium |
CN111739534B (en) * | 2020-06-04 | 2022-12-27 | 广东小天才科技有限公司 | Processing method and device for assisting speech recognition, electronic equipment and storage medium |
CN111803936A (en) * | 2020-07-16 | 2020-10-23 | 网易(杭州)网络有限公司 | Voice communication method and device, electronic equipment and storage medium |
CN114578705A (en) * | 2022-04-01 | 2022-06-03 | 深圳冠特家居健康系统有限公司 | Intelligent home control system based on 5G Internet of things |
CN114578705B (en) * | 2022-04-01 | 2022-12-27 | 深圳冠特家居健康系统有限公司 | Intelligent home control system based on 5G Internet of things |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109448711A (en) | A kind of method, apparatus and computer storage medium of speech recognition | |
CN106251874B (en) | A kind of voice gate inhibition and quiet environment monitoring method and system | |
US20180261236A1 (en) | Speaker recognition method and apparatus, computer device and computer-readable medium | |
CN108304385A (en) | A kind of speech recognition text error correction method and device | |
WO2016150001A1 (en) | Speech recognition method, device and computer storage medium | |
CN109783642A (en) | Structured content processing method, device, equipment and the medium of multi-person conference scene | |
CN105810213A (en) | Typical abnormal sound detection method and device | |
CN101923857A (en) | Extensible audio recognition method based on man-machine interaction | |
CN109360572A (en) | Call separation method, device, computer equipment and storage medium | |
WO2020180719A1 (en) | Determining input for speech processing engine | |
CN105308679A (en) | Method and system for identifying location associated with voice command to control home appliance | |
CN109960743A (en) | Conference content differentiating method, device, computer equipment and storage medium | |
US20200194006A1 (en) | Voice-Controlled Management of User Profiles | |
CN106971714A (en) | A kind of speech de-noising recognition methods and device applied to robot | |
CN102637433A (en) | Method and system for identifying affective state loaded in voice signal | |
CN103943111A (en) | Method and device for identity recognition | |
CN108520752A (en) | A kind of method for recognizing sound-groove and device | |
CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically | |
CN103236261A (en) | Speaker-dependent voice recognizing method | |
CN104103280A (en) | Dynamic time warping algorithm based voice activity detection method and device | |
CN109783049A (en) | Method of controlling operation thereof, device, equipment and storage medium | |
CN111105798B (en) | Equipment control method based on voice recognition | |
CN106205610B (en) | A kind of voice information identification method and equipment | |
CN110580897A (en) | audio verification method and device, storage medium and electronic equipment | |
KR102220964B1 (en) | Method and device for audio recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190308 |