CN109448711A

CN109448711A - Method, apparatus and computer storage medium for speech recognition

Info

Publication number: CN109448711A
Application number: CN201811238626.0A
Authority: CN
Inventors: 刘健军; 王慧君; 秦萍
Original assignee: Gree Electric Appliances Inc of Zhuhai
Current assignee: Gree Electric Appliances Inc of Zhuhai
Priority date: 2018-10-23
Filing date: 2018-10-23
Publication date: 2019-03-08

Abstract

The invention discloses a method and an apparatus for speech recognition and a computer storage medium, to solve the technical problem in the prior art that the speech recognition rate is low and recognition is inconvenient. The method comprises: acquiring a user's facial image by an image acquisition device while acquiring user speech by a voice acquisition device; predicting, with a prediction model and based on the user speech and the user's facial image, the predicted speech corresponding to the user speech, wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images; matching, based on the predicted speech, standard speech audio data corresponding to a control instruction from a speech database, wherein the speech database holds the mapping between control instructions and the corresponding standard speech audio data; and calculating, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, controlling the smart home device according to the control instruction corresponding to the standard speech audio data.

Description

Method, apparatus and computer storage medium for speech recognition

Technical field

The present invention relates to the field of smart homes, and more particularly to a method, an apparatus and a computer storage medium for speech recognition.

Background art

With the development of science and technology, speech recognition technology is used more and more widely in the smart home field.

For example, a user can make a smart home device work by issuing a voice instruction to it. For instance, when the user says "power on" to a smart air conditioner, the air conditioner can recognize the user's voice instruction through speech recognition technology and then perform the power-on action.

However, when a smart home device is controlled by speech recognition, the speech issued by the user is easily affected by factors such as noise and distance. This lowers the speech recognition rate, so that the smart home device cannot fully carry out the action corresponding to the user's voice instruction.

In the prior art, noise reduction is usually applied to the collected user speech to improve the speech recognition rate. Two processing methods are common. One is to process the collected user speech in segments (including noise reduction, gain increase and the like), then extract the effective speech information and recognize it algorithmically. The other is to train an end-to-end deep learning algorithm on user speech to obtain a speech recognition model, and to recognize user speech with that model.

However, both methods improve the speech recognition rate only to a very limited extent, and training the speech recognition model takes considerable time, which degrades the user experience.

In view of this, how to improve the speech recognition rate conveniently, quickly and effectively has become a technical problem to be solved urgently.

Summary of the invention

The present invention provides a method, an apparatus and a computer storage medium for speech recognition, to solve the technical problem in the prior art that the speech recognition rate is low and recognition is inconvenient.

In a first aspect, to solve the above technical problem, an embodiment of the present invention provides a method for speech recognition, applied to a smart home device. The technical solution of the method is as follows:

While user speech is acquired by a voice acquisition device, a user's facial image is acquired by an image acquisition device;

Based on the user speech and the user's facial image, the predicted speech corresponding to the user speech is predicted with a prediction model; wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding standard facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction;

Based on the predicted speech, standard speech audio data corresponding to the control instruction is matched from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;

The matching degree between the user speech and the standard speech audio data is calculated by a matching model, and when the matching degree reaches a set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data.

When the smart home device acquires user speech by the voice acquisition device, it simultaneously acquires the user's facial image by the image acquisition device and, based on the collected user speech and facial image, predicts the predicted speech corresponding to the user speech with the prediction model. The prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding standard facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, it can output speech similar to the standard speech corresponding to that control instruction. Then, based on the predicted speech, standard speech audio data corresponding to the control instruction is matched from the speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, the matching degree between the user speech and the standard speech audio data is calculated by the matching model, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve the speech recognition rate quickly and conveniently, reduce misoperations caused by incorrect speech recognition, and improve the user experience.
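The flow described above can be outlined in code. The following is a minimal sketch rather than the implementation of the invention: the names `recognize`, `RecognitionResult`, `predict_instruction` and `matching_degree` are hypothetical, the prediction model and the matching model are passed in as plain callables, the predicted speech is represented by the name of its control instruction for simplicity, and the default threshold of 0.9 follows the 90% figure given as an example in the description.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

Audio = Sequence[float]   # placeholder type for audio samples or features
Image = bytes             # placeholder type for an encoded facial image


@dataclass
class RecognitionResult:
    instruction: Optional[str]   # instruction to execute, or None if rejected
    matching_degree: float


def recognize(
    user_speech: Audio,
    face_image: Image,
    predict_instruction: Callable[[Audio, Image], str],   # hypothetical prediction model
    speech_database: Mapping[str, Audio],                 # instruction -> standard audio
    matching_degree: Callable[[Audio, Audio], float],     # hypothetical matching model
    threshold: float = 0.9,
) -> RecognitionResult:
    """Run the prediction, lookup and verification steps once."""
    # Prediction: speech and facial image together select a candidate.
    candidate = predict_instruction(user_speech, face_image)
    # Lookup: fetch the standard audio stored for that control instruction.
    standard_audio = speech_database[candidate]
    # Verification: compare the raw user speech with the standard audio.
    degree = matching_degree(user_speech, standard_audio)
    if degree >= threshold:
        return RecognitionResult(candidate, degree)
    return RecognitionResult(None, degree)
```

Keeping the two models as parameters mirrors the text, which leaves both the prediction model and the matching model unspecified.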

Preferably, predicting, with the prediction model and based on the user speech and the user's facial image, the predicted speech corresponding to the user speech comprises:

Identifying, from the user speech and by the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;

Obtaining, based on the user's facial image and from the facial image database in the prediction model, a second control instruction set corresponding to the user's facial image; wherein the facial image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes;

Matching the control instructions in the first control instruction set one by one against each control instruction in the second control instruction set, and taking the audio data corresponding to the control instruction with the highest matching degree as the predicted speech.

Preferably, obtaining, based on the user's facial image and from the facial image database in the prediction model, the second control instruction set corresponding to the user's facial image comprises:

Extracting the corresponding user expression and/or user lip shape from the user's facial image to obtain user expression data and/or user lip shape data;

Obtaining the second control instruction set from the facial image database based on the user expression data and/or the user lip shape data.

Preferably, after the similarity between the user speech and the standard speech audio data is calculated, the method further comprises:

If the similarity does not reach the set threshold, indicating to the user, by preset prompt information, that the user speech is to be re-acquired; wherein the preset prompt information is sound and/or light prompt information.

In a second aspect, an embodiment of the present invention provides an apparatus for speech recognition, applied to a smart home device, the apparatus comprising:

An acquisition unit, configured to acquire a user's facial image by an image acquisition device while user speech is acquired by a voice acquisition device;

A predicting unit, configured to predict, with a prediction model and based on the user speech and the user's facial image, the predicted speech corresponding to the user speech; wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction;

A matching unit, configured to match, based on the predicted speech, standard speech audio data corresponding to the control instruction from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;

A computing unit, configured to calculate, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, to control the smart home device according to the control instruction corresponding to the standard speech audio data.

Preferably, the predicting unit is specifically used for:

Preferably, the predicting unit is also used to:

Preferably, the computing unit is also used to:

In a third aspect, an embodiment of the present invention further provides an apparatus for speech recognition, applied to a smart home device, the apparatus comprising:

At least one processor, and

A memory connected to the at least one processor;

Wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs the method of the first aspect by executing the instructions stored in the memory.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, wherein:

The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the method of the first aspect.

Through the technical solutions in one or more of the above embodiments, the embodiments of the present invention have at least the following technical effects:

In the embodiments provided by the present invention, when the smart home device acquires user speech by the voice acquisition device, it simultaneously acquires the user's facial image by the image acquisition device and, based on the collected user speech and facial image, predicts the predicted speech corresponding to the user speech with the prediction model. The prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, it can output speech similar to the standard speech corresponding to that control instruction. Then, based on the predicted speech, standard speech audio data corresponding to the control instruction is matched from the speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, the matching degree between the user speech and the standard speech audio data is calculated by the matching model, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve the speech recognition rate quickly and conveniently, reduce misoperations caused by incorrect speech recognition, and improve the user experience.

Brief description of the drawings

Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of an air conditioner performing speech recognition, provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of obtaining the second control instruction set, provided by an embodiment of the present invention;

Fig. 4 is a structural schematic diagram of a speech recognition apparatus provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention provide a method, an apparatus and a computer storage medium for speech recognition, to solve the technical problem in the prior art that the speech recognition rate is low and recognition is inconvenient.

To solve the above technical problem, the general idea of the technical solutions in the embodiments of the present application is as follows:

A method for speech recognition is provided, comprising: acquiring a user's facial image by an image acquisition device while acquiring user speech by a voice acquisition device; predicting, with a prediction model and based on the user speech and the user's facial image, the predicted speech corresponding to the user speech, wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction; matching, based on the predicted speech, standard speech audio data corresponding to the control instruction from a speech database, wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data; and calculating, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, controlling the smart home device according to the control instruction corresponding to the standard speech audio data.

In the above solution, when the smart home device acquires user speech by the voice acquisition device, it simultaneously acquires the user's facial image by the image acquisition device and, based on the collected user speech and facial image, predicts the predicted speech corresponding to the user speech with the prediction model. The prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, it can output speech similar to the standard speech corresponding to that control instruction. Then, based on the predicted speech, standard speech audio data corresponding to the control instruction is matched from the speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, the matching degree between the user speech and the standard speech audio data is calculated by the matching model, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve the speech recognition rate quickly and conveniently, reduce misoperations caused by incorrect speech recognition, and improve the user experience.

For a better understanding of the above technical solutions, the technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of the present invention and the specific features in them are a detailed explanation of the technical solutions of the present invention rather than a limitation on them, and that, where there is no conflict, the embodiments of the present invention and the technical features in them may be combined with one another.

Referring to Fig. 1, an embodiment of the present invention provides a method for speech recognition, applied to a smart home device. The method proceeds as follows.

Step 101: while user speech is acquired by a voice acquisition device, a user's facial image is acquired by an image acquisition device.

When smart home devices such as smart air conditioners and smart televisions are controlled by voice, the user may be far from the device, or other noise may be present while the user speaks, such as a door opening or closing or a washing machine running. As a result, the smart home device the user wants to control cannot accurately identify the instruction corresponding to the user speech.

For this reason, in the embodiments provided by the present invention, the smart home device acquires the user's facial expression with an image acquisition device while it acquires user speech with the voice acquisition device. The smart home device can then analyze and judge the user speech and the user's facial expression together, determine the correct instruction corresponding to the user speech, and operate according to that instruction.

The voice acquisition device may be a microphone, a microphone array or the like. It may be a component of the smart home device, an external voice acquisition device, or the microphone of a smartphone. An external voice acquisition device may communicate with the smart home device in a wired or wireless manner, which is not specifically limited.

The image acquisition device may be a camera, a CCD sensor or the like. It may be a component of the smart home device, an external image acquisition device, or the camera of a smartphone. An external image acquisition device may communicate with the smart home device in a wired or wireless manner, which is not specifically limited.

After the user speech and the user's facial image have been acquired by the voice acquisition device and the image acquisition device, step 102 is executed.
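One way to acquire speech and image at the same time, as step 101 requires, is to start both acquisitions concurrently. The following is a minimal sketch under stated assumptions: `record_speech` and `capture_face` are hypothetical driver callables standing in for the microphone and the camera, and threads from the Python standard library are only one possible way to overlap the two acquisitions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def capture_speech_and_face(
    record_speech: Callable[[], bytes],   # hypothetical microphone driver call
    capture_face: Callable[[], bytes],    # hypothetical camera driver call
) -> Tuple[bytes, bytes]:
    """Start both acquisitions together and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        speech_future = pool.submit(record_speech)
        face_future = pool.submit(capture_face)
        # result() blocks until the corresponding acquisition has finished.
        return speech_future.result(), face_future.result()
```

The returned pair can then be handed to step 102 as the user speech and the user's facial image.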

Step 102: based on the user speech and the user's facial image, the predicted speech corresponding to the user speech is predicted with a prediction model; wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction.

The prediction model can be obtained by training on the speech of different groups of people and the corresponding facial images; the prediction model used in the smart home device is an already trained model.

For example, taking a smart air conditioner, suppose the predicted speech for "turn on the air conditioner" is to be trained. Different groups of people, such as men, women, children and the elderly, can each be asked to read "turn on the air conditioner". While each group reads it, the sound issued (audio data) and the facial image during utterance are collected, so as to obtain audio data and facial images whose similarity to the standard audio and standard image corresponding to the turn-on instruction is 90%. In use, after the user's speech and facial image are collected, the trained prediction model can directly output the similar speech.

Further, to adapt to local dialects, different groups of people in each region can also read the control instructions in the local dialect. The corresponding audio data and the facial images captured while reading the instructions are collected and used to train the prediction model, so that it outputs the similar speech and facial image corresponding to the control instruction. The training process is similar to the one above and is not repeated here.

Specifically, predicting, with the prediction model and based on the user speech and the user's facial image, the predicted speech corresponding to the user speech can be realized by the following procedure:

First, a first control instruction set corresponding to the user speech is identified from the user speech by the speech recognition technology in the prediction model.

Second, based on the user's facial image, a second control instruction set corresponding to the user's facial image is obtained from the facial image database in the prediction model; wherein the facial image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes.

Specifically, the corresponding user expression and/or user lip shape may first be extracted from the user's facial image to obtain user expression data and/or user lip shape data; the second control instruction set is then obtained from the facial image database based on the user expression data and/or the user lip shape data.

Finally, the control instructions in the first control instruction set are matched one by one against each control instruction in the second control instruction set, and the audio data corresponding to the control instruction with the highest matching degree is taken as the predicted speech.

For example, referring to Fig. 2, take the smart home device to be an air conditioner whose image acquisition device is an external camera. When the user says "turn on the air conditioner", the air conditioner acquires the user speech by the voice acquisition device and at the same time controls the camera to acquire the user's facial image. While the user is saying "turn on the air conditioner", a washing machine is running and produces noise 1, and another family member, who is stopping a child from watching television, produces noise 2 by saying "Turn off the TV!". The user speech obtained by the air conditioner therefore contains, in addition to the speech "turn on the air conditioner", the washing machine's noise 1 and the other speech, noise 2 ("Turn off the TV!").

After obtaining the user's facial image and the user speech, the air conditioner uses its built-in prediction model to identify from the user speech the first control instruction set corresponding to the user speech: a "power on" instruction and a "power off" instruction. Meanwhile, it extracts the user's lip shapes from the facial image and compares them one by one with the lip shape data in the facial image database to determine the word corresponding to each lip shape, and from these the recognized phrases corresponding to the lip shapes (recognized phrase 1 is "turn on the air conditioner", recognized phrase 2 is "like the air conditioner"). Then, according to the correspondence between recognized phrases and air conditioner instructions in the prediction model, it determines the air conditioner control instruction for each recognized phrase, and thereby obtains instruction 1 "power on" and instruction 2 "automatic cleaning" in the second control instruction set corresponding to the user's facial image; see Fig. 3 for details.
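The lip shape lookup in this example amounts to two table lookups: lip shapes to recognized phrases, then recognized phrases to control instructions. The following is a minimal sketch, assuming the lip shapes have already been extracted from the image. The viseme labels in `LIP_TO_PHRASES` are invented for illustration, and the function name `second_instruction_set` is hypothetical; only the phrases and instructions come from the example above.

```python
from typing import Dict, List, Tuple

LipSequence = Tuple[str, ...]  # hypothetical viseme labels, one per syllable

# Hypothetical facial image database: lip-shape sequence -> candidate phrases.
LIP_TO_PHRASES: Dict[LipSequence, List[str]] = {
    ("open", "round", "wide"): [
        "turn on the air conditioner",
        "like the air conditioner",
    ],
}

# Recognized phrase -> air conditioner control instruction, as in the example.
PHRASE_TO_INSTRUCTION: Dict[str, str] = {
    "turn on the air conditioner": "power on",
    "like the air conditioner": "automatic cleaning",
}


def second_instruction_set(lips: LipSequence) -> List[str]:
    """Map an extracted lip-shape sequence to candidate control instructions."""
    phrases = LIP_TO_PHRASES.get(lips, [])
    # Keep the phrase order and drop phrases with no known instruction.
    return [PHRASE_TO_INSTRUCTION[p] for p in phrases if p in PHRASE_TO_INSTRUCTION]
```

Because similar lip shapes can correspond to several phrases, the lookup returns a list rather than a single instruction, which is why a second set has to be reconciled with the first.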

After the first control instruction set corresponding to the user speech ("power on" and "power off") and the second control instruction set corresponding to the user's facial image ("power on" and "automatic cleaning") have been obtained, the control instructions in the first set are matched one by one against each control instruction in the second set, and the audio data corresponding to the control instruction with the highest matching degree (namely the "power on" instruction) is taken as the predicted speech.
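The pairwise matching of the two instruction sets can be sketched as follows. The description does not say how the matching degree between two instructions is computed, so this sketch assumes a plain string similarity from the Python standard library (`difflib.SequenceMatcher`) as a stand-in; the function name `best_common_instruction` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Iterable, Optional, Tuple


def best_common_instruction(
    first_set: Iterable[str], second_set: Iterable[str]
) -> Optional[Tuple[str, float]]:
    """Compare every pair and return the best-matching instruction and score."""
    best: Optional[Tuple[str, float]] = None
    for a, b in product(first_set, second_set):
        # Assumed matching degree: string similarity in the range [0, 1].
        score = SequenceMatcher(None, a, b).ratio()
        if best is None or score > best[1]:
            best = (b, score)
    return best


result = best_common_instruction(
    ["power on", "power off"], ["power on", "automatic cleaning"]
)
```

With the two sets from the example, `result` is `("power on", 1.0)`, since "power on" is the only instruction that appears in both sets. The audio data stored for that instruction would then serve as the predicted speech.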

It should be noted that the above embodiment only takes extracting lip shapes from the user's facial image as an example. In actual use, the user's facial expression, body movements and the like may also be used to assist in identifying the control instruction corresponding to the user speech, improving the accuracy of user speech recognition.

After the smart home device has predicted the predicted speech corresponding to the user speech, steps 103 and 104 can be executed.

Step 103: based on the predicted speech, standard speech audio data corresponding to the control instruction is matched from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data.

Step 104: the matching degree between the user speech and the standard speech audio data is calculated by a matching model, and when the matching degree reaches a set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data.

After the smart home device has predicted the predicted speech corresponding to the user speech, it still needs to verify whether the prediction result is correct. Specifically, according to the predicted speech, the standard speech audio data corresponding to the control instruction can be obtained from the speech database, which stores the mapping between the control instructions of the smart home device and the corresponding standard speech data. Whether the predicted speech is correct is then verified by calculating the similarity between the user speech and the standard speech audio data: the predicted speech is determined to be correct when the similarity reaches a set threshold such as 90%, and incorrect otherwise.
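The verification step can be illustrated with a concrete similarity measure. The description does not specify the matching model, so the following sketch assumes cosine similarity between fixed-length feature vectors purely for illustration. The function names and the feature representation are hypothetical, while the default threshold of 0.9 follows the 90% example above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat it as not similar.
    return dot / norm if norm else 0.0


def prediction_is_correct(
    user_features: Sequence[float],
    standard_features: Sequence[float],
    threshold: float = 0.9,
) -> bool:
    """Accept the prediction only when the similarity reaches the threshold."""
    return cosine_similarity(user_features, standard_features) >= threshold
```

Any other similarity measure with a comparable range could be substituted without changing the threshold logic.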

If the predicted speech is correct, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data.

If the predicted speech is incorrect, that is, if the similarity calculated between the user speech and the standard speech audio data does not reach the set threshold, preset prompt information is used to indicate to the user that the user speech is to be re-acquired; wherein the preset prompt information is sound and/or light prompt information.

For example, when the similarity does not reach the set threshold, the smart home device can ask the user through an audio device to input the speech again; for instance, the air conditioner plays "What did you say?" so that the user repeats the speech just spoken. The device can also ask the user through an indicator lamp to input the speech again; for instance, the air conditioner can flash a red light to indicate that the user should repeat the speech just spoken.
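The prompt-and-retry behaviour can be sketched as a small loop. This is an illustration under assumptions: `attempt` is a hypothetical callable that performs one round of acquisition and recognition and returns the accepted instruction or `None`, `prompt_user` stands in for the sound and/or light prompt, and the `max_attempts` limit is an assumption added here, since the description does not say how many times the user may be prompted.

```python
from typing import Callable, Optional


def recognize_with_retry(
    attempt: Callable[[], Optional[str]],   # one acquisition + recognition round
    prompt_user: Callable[[], None],        # sound and/or light prompt
    max_attempts: int = 3,                  # assumed limit, not from the description
) -> Optional[str]:
    """Repeat acquisition and recognition until an instruction is accepted."""
    for remaining in range(max_attempts, 0, -1):
        instruction = attempt()
        if instruction is not None:
            return instruction
        if remaining > 1:
            # Ask the user to repeat the speech before the next round.
            prompt_user()
    return None
```

The function returns `None` when every round is rejected, leaving it to the caller to decide what the device does next.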

Based on the same inventive concept, an embodiment of the present invention provides an apparatus for speech recognition. For the specific implementation of the speech recognition method of the apparatus, refer to the description of the method embodiments; repeated content is not described again. Referring to Fig. 4, the apparatus includes:

An acquisition unit 401, configured to acquire a user's facial image by an image acquisition device while user speech is acquired by a voice acquisition device;

A predicting unit 402, configured to predict, with a prediction model and based on the user speech and the user's facial image, the predicted speech corresponding to the user speech; wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction;

A matching unit 403, configured to match, based on the predicted speech, standard speech audio data corresponding to the control instruction from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;

A computing unit 404, configured to calculate, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, to control the smart home device according to the control instruction corresponding to the standard speech audio data.

Preferably, the predicting unit 402 is specifically used for:

Preferably, the predicting unit 402 is also used to:

Preferably, the computing unit 404 is also used to:

Based on the same inventive concept, an embodiment of the present invention provides an apparatus for speech recognition, comprising: at least one processor, and

A memory connected to the at least one processor;

Wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs the speech recognition method described above by executing the instructions stored in the memory.

Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, wherein:

The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the speech recognition method described above.

In the embodiments provided by the present invention, when the smart home device acquires user speech by the voice acquisition device, it simultaneously acquires the user's facial image by the image acquisition device and, based on the collected user speech and facial image, predicts the predicted speech corresponding to the user speech with the prediction model. The prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, it can output speech similar to the standard speech corresponding to that control instruction. Then, based on the predicted speech, standard speech audio data corresponding to the control instruction is matched from the speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, the matching degree between the user speech and the standard speech audio data is calculated by the matching model, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve the speech recognition rate quickly and conveniently, reduce misoperations caused by incorrect speech recognition, and improve the user experience.

It should be understood by those skilled in the art that, the embodiment of the present invention can provide as the production of method, system or computer program Product.Therefore, in terms of the embodiment of the present invention can be used complete hardware embodiment, complete software embodiment or combine software and hardware Embodiment form.Moreover, it wherein includes computer available programs generation that the embodiment of the present invention, which can be used in one or more, The meter implemented in the computer-usable storage medium (including but not limited to magnetic disk storage, CD-ROM, optical memory etc.) of code The form of calculation machine program product.

Embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

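1. A speech recognition method, applied to a smart home device, characterized by comprising:

collecting a face image of the user through an image acquisition device while collecting user speech through a voice acquisition device;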

based on the user speech and the user face image, predicting, with a prediction model, the predicted speech corresponding to the user speech; wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding face images, so that, after predicting on the speech uttered and the face images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction;

based on the predicted speech, matching, from a speech database, the standard speech audio data corresponding to the control instruction; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;

calculating, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, controlling the smart home device according to the control instruction corresponding to the standard speech audio data.

2. The method according to claim 1, characterized in that predicting, with a prediction model, the predicted speech corresponding to the user speech based on the user speech and the user face image comprises:

identifying, from the user speech and by means of the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;

obtaining, based on the user face image and from a face image database in the prediction model, a second control instruction set corresponding to the user face image; wherein the face image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes;

matching the control instructions in the first control instruction set one by one against each control instruction in the second control instruction set, and taking the audio data corresponding to the control instruction with the highest matching degree as the predicted speech.

3. The method according to claim 2, characterized in that obtaining, based on the user face image and from the face image database in the prediction model, the second control instruction set corresponding to the user face image comprises:

extracting the corresponding user expression and/or user lip shape from the user face image to obtain user expression data and/or user lip shape data;

obtaining the second control instruction set from the face image database based on the user expression data and/or the user lip shape data.

4. The method according to any one of claims 1-3, characterized in that, after calculating the similarity between the user speech and the standard speech audio data, the method further comprises:

if the similarity does not reach the set threshold, indicating to the user, by preset prompt information, that the user speech will be re-collected; wherein the preset prompt information is sound and/or light prompt information.

5. A speech recognition device, applied to a smart home device, characterized by comprising:

an acquisition unit, configured to collect a face image of the user through an image acquisition device while collecting user speech through a voice acquisition device;

a prediction unit, configured to predict, with a prediction model and based on the user speech and the user face image, the predicted speech corresponding to the user speech; wherein the prediction model is obtained by training on the speech of different groups of people corresponding to each control instruction and on the corresponding face images, so that, after predicting on the speech uttered and the face images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction;

an obtaining unit, configured to match, based on the predicted speech and from a speech database, the standard speech audio data corresponding to the control instruction; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;

a computing unit, configured to calculate, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, to control the smart home device according to the control instruction corresponding to the standard speech audio data.

6. The device according to claim 5, characterized in that the prediction unit is specifically configured to:

7. The device according to claim 6, characterized in that the prediction unit is further configured to:

8. The device according to any one of claims 5-7, characterized in that the computing unit is further configured to:

9. A speech recognition device, characterized by comprising:

at least one processor, and

a memory connected to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs the method according to any one of claims 1-4 by executing the instructions stored in the memory.

10. A computer-readable storage medium, characterized in that:

the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1-4.