CN109448711A - Method, apparatus and computer storage medium for speech recognition - Google Patents

Publication number
CN109448711A
Authority
CN
China
Prior art keywords
user
speech
control instruction
voice
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811238626.0A
Other languages
Chinese (zh)
Inventor
刘健军
王慧君
秦萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai
Priority to CN201811238626.0A
Publication of CN109448711A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The invention discloses a speech recognition method, apparatus and computer storage medium, to solve the technical problem in the prior art that the speech recognition rate is low and recognition is inconvenient. The method comprises: when user speech is collected by a voice acquisition device, collecting a user facial image by an image acquisition device; based on the user speech and the user facial image, predicting, with a prediction model, the predicted speech corresponding to the user speech, wherein the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding facial images; based on the predicted speech, matching the standard speech audio data corresponding to a control instruction from a speech database, wherein the speech database holds the mapping between control instructions and the corresponding standard speech audio data; and calculating, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, controlling the smart home device according to the control instruction corresponding to the standard speech audio data.

Description

Method, apparatus and computer storage medium for speech recognition
Technical field
The present invention relates to the field of smart homes, and in particular to a speech recognition method, apparatus and computer storage medium.
Background
With the development of science and technology, speech recognition technology is used more and more widely in the smart home field.
For example, a user can make a smart home device work by issuing a voice instruction to it. If the user says "power on" to a smart air conditioner, the air conditioner can recognize the user's voice instruction through speech recognition technology and then perform the power-on action.
However, when speech recognition technology is used to control a smart home device, the speech issued by the user is easily affected by factors such as noise and distance. This lowers the speech recognition rate, so the smart home device cannot act fully in accordance with the user's voice instruction.
In the prior art, noise reduction is usually applied to the collected user speech to improve the speech recognition rate. Two processing methods are common. One performs segment processing on the collected user speech (including noise reduction, gain increase, etc.) and then extracts the effective speech information for algorithmic recognition. The other trains on user speech with an end-to-end deep learning algorithm to obtain a speech recognition model, and recognizes user speech with that model.
However, both methods improve the speech recognition rate only to a very limited extent, and training a speech recognition model takes considerable time, which degrades the user experience.
In view of this, how to improve the speech recognition rate conveniently, quickly and effectively has become a technical problem to be solved urgently.
Summary of the invention
The present invention provides a speech recognition method, apparatus and computer storage medium, to solve the technical problem in the prior art that the speech recognition rate is low and recognition is inconvenient.
In a first aspect, to solve the above technical problem, an embodiment of the present invention provides a speech recognition method applied to a smart home device. The technical solution of the method is as follows:
When user speech is collected by a voice acquisition device, a user facial image is collected by an image acquisition device;
Based on the user speech and the user facial image, the predicted speech corresponding to the user speech is predicted with a prediction model; wherein the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding standard facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction;
Based on the predicted speech, the standard speech audio data corresponding to the control instruction is matched from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;
The matching degree between the user speech and the standard speech audio data is calculated by a matching model, and when the matching degree reaches a set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data.
When the smart home device collects user speech through the voice acquisition device, it simultaneously collects a user facial image through the image acquisition device. Based on the collected user speech and user facial image, it predicts the predicted speech corresponding to the user speech with the prediction model; the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding standard facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, it can output speech similar to the standard speech corresponding to that control instruction. Based on the predicted speech, the device then matches the standard speech audio data corresponding to the control instruction from the speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, it calculates the matching degree between the user speech and the standard speech audio data with the matching model and, when the matching degree reaches the set threshold, controls the smart home device according to the control instruction corresponding to the standard speech audio data. In this way, the smart home device can improve the speech recognition rate quickly and conveniently, reduce misoperations caused by incorrect speech recognition, and improve the user experience.
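The flow summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the callables `predict` and `match_degree`, the dictionary-based `speech_database`, and the function name `recognize_command` are hypothetical, and the prediction model is simplified so that it returns the name of a control instruction rather than predicted speech audio. The default threshold of 0.9 follows the 90% figure given as an example in the description.

```python
from typing import Callable, Dict, Optional


def recognize_command(
    user_speech: bytes,
    face_image: bytes,
    predict: Callable[[bytes, bytes], str],
    speech_database: Dict[str, bytes],
    match_degree: Callable[[bytes, bytes], float],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the control instruction to execute, or None if verification fails."""
    # Predict from the speech and the facial image together
    # (hypothetical callable standing in for the prediction model).
    predicted_instruction = predict(user_speech, face_image)

    # Look up the standard speech audio mapped to that instruction
    # (assumed layout: control instruction -> standard audio).
    standard_audio = speech_database.get(predicted_instruction)
    if standard_audio is None:
        return None

    # Verify the prediction against what the user actually said
    # (hypothetical callable standing in for the matching model).
    if match_degree(user_speech, standard_audio) >= threshold:
        return predicted_instruction
    return None
```

Returning `None` rather than a best guess mirrors the text: an instruction is executed only when the matching degree reaches the set threshold.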
Preferably, predicting, with the prediction model and based on the user speech and the user facial image, the predicted speech corresponding to the user speech comprises:
Identifying, from the user speech and through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
Obtaining, based on the user facial image and from the facial image database in the prediction model, a second control instruction set corresponding to the user facial image; wherein the facial image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes;
Matching each control instruction in the first control instruction set one by one against each control instruction in the second control instruction set, and taking the audio data corresponding to the control instruction with the highest matching degree as the predicted speech.
Preferably, obtaining, based on the user facial image and from the facial image database in the prediction model, the second control instruction set corresponding to the user facial image comprises:
Extracting the corresponding user expression and/or user lip shape from the user facial image to obtain user expression data and/or user lip shape data;
Obtaining the second control instruction set from the facial image database based on the user expression data and/or the user lip shape data.
Preferably, after the similarity between the user speech and the standard speech audio data is calculated, the method further comprises:
If the similarity cannot reach the set threshold, indicating to the user, by preset prompt information, that the user speech will be collected again; wherein the preset prompt information is sound and/or light prompt information.
In a second aspect, an embodiment of the present invention provides an apparatus for speech recognition, applied to a smart home device. The apparatus comprises:
An acquisition unit, configured to collect a user facial image by an image acquisition device when user speech is collected by a voice acquisition device;
A prediction unit, configured to predict, with a prediction model and based on the user speech and the user facial image, the predicted speech corresponding to the user speech; wherein the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction;
An obtaining unit, configured to match, based on the predicted speech, the standard speech audio data corresponding to the control instruction from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;
A computing unit, configured to calculate the matching degree between the user speech and the standard speech audio data by a matching model and, when the matching degree reaches a set threshold, to control the smart home device according to the control instruction corresponding to the standard speech audio data.
Preferably, the prediction unit is specifically configured to:
Identify, from the user speech and through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
Obtain, based on the user facial image and from the facial image database in the prediction model, a second control instruction set corresponding to the user facial image; wherein the facial image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes;
Match each control instruction in the first control instruction set one by one against each control instruction in the second control instruction set, and take the audio data corresponding to the control instruction with the highest matching degree as the predicted speech.
Preferably, the prediction unit is further configured to:
Extract the corresponding user expression and/or user lip shape from the user facial image to obtain user expression data and/or user lip shape data;
Obtain the second control instruction set from the facial image database based on the user expression data and/or the user lip shape data.
Preferably, the computing unit is further configured to:
If the similarity cannot reach the set threshold, indicate to the user, by preset prompt information, that the user speech will be collected again; wherein the preset prompt information is sound and/or light prompt information.
In a third aspect, an embodiment of the present invention further provides an apparatus for speech recognition, applied to a smart home device. The apparatus comprises:
At least one processor, and
A memory connected to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs the method described in the first aspect above by executing the instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, wherein:
The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the method described in the first aspect above.
Through the technical solutions in one or more of the above embodiments, the embodiments of the present invention have at least the following technical effects:
In the embodiments provided by the present invention, when the smart home device collects user speech through the voice acquisition device, it simultaneously collects a user facial image through the image acquisition device. Based on the collected user speech and user facial image, it predicts the predicted speech corresponding to the user speech with the prediction model; the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, it can output speech similar to the standard speech corresponding to that control instruction. Based on the predicted speech, the device then matches the standard speech audio data corresponding to the control instruction from the speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, it calculates the matching degree between the user speech and the standard speech audio data with the matching model and, when the matching degree reaches the set threshold, controls the smart home device according to the control instruction corresponding to the standard speech audio data. In this way, the smart home device can improve the speech recognition rate quickly and conveniently, reduce misoperations caused by incorrect speech recognition, and improve the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of an air conditioner performing speech recognition, provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of obtaining the second control instruction set, provided by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of a speech recognition apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention provide a speech recognition method, apparatus and computer storage medium, to solve the technical problem in the prior art that the speech recognition rate is low and recognition is inconvenient.
To solve the above technical problem, the general idea of the technical solutions in the embodiments of the present application is as follows:
A speech recognition method is provided, comprising: when user speech is collected by a voice acquisition device, collecting a user facial image by an image acquisition device; based on the user speech and the user facial image, predicting, with a prediction model, the predicted speech corresponding to the user speech, wherein the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction; based on the predicted speech, matching the standard speech audio data corresponding to the control instruction from a speech database, wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data; and calculating, by a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, controlling the smart home device according to the control instruction corresponding to the standard speech audio data.
In the above solution, when the smart home device collects user speech through the voice acquisition device, it simultaneously collects a user facial image through the image acquisition device. Based on the collected user speech and user facial image, it predicts the predicted speech corresponding to the user speech with the prediction model; the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, it can output speech similar to the standard speech corresponding to that control instruction. Based on the predicted speech, the device then matches the standard speech audio data corresponding to the control instruction from the speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, it calculates the matching degree between the user speech and the standard speech audio data with the matching model and, when the matching degree reaches the set threshold, controls the smart home device according to the control instruction corresponding to the standard speech audio data. The smart home device can therefore improve the speech recognition rate quickly and conveniently, reduce misoperations caused by incorrect speech recognition, and improve the user experience.
For a better understanding of the above technical solutions, the technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of the present invention and the specific features in them are detailed explanations of the technical solutions of the present invention rather than limitations on them, and that, where no conflict arises, the embodiments of the present invention and the technical features in them may be combined with one another.
Referring to Fig. 1, an embodiment of the present invention provides a speech recognition method applied to a smart home device. The processing flow of the method is as follows.
Step 101: when user speech is collected by the voice acquisition device, a user facial image is collected by the image acquisition device.
When smart home devices such as smart air conditioners and smart televisions are controlled by voice, the user may be far from the device, or other noise may be present while the user speaks, such as a door opening or closing or a washing machine running. As a result, the smart home device the user wants to control cannot accurately identify the instruction corresponding to the user speech.
For this reason, in the embodiments provided by the present invention, when the smart home device collects user speech with the voice acquisition device, it also collects the user's facial expression with the image acquisition device. The smart home device can then analyze and judge the user speech together with the user's facial expression, determine the correct instruction corresponding to the user speech, and operate according to that instruction.
The voice acquisition device may be a microphone, a microphone array, etc. It may be a component of the smart home device, an external voice acquisition device, or the microphone of a smartphone. An external voice acquisition device may communicate with the smart home device in a wired or wireless manner; no specific limitation is imposed.
The image acquisition device may be a camera, a CCD sensor, a video camera, etc. It may be a component of the smart home device, an external image acquisition device, or the camera of a smartphone. An external image acquisition device may communicate with the smart home device in a wired or wireless manner; no specific limitation is imposed.
After the user speech and the user facial image have been collected by the voice acquisition device and the image acquisition device, step 102 is executed.
Step 102: based on the user speech and the user facial image, the predicted speech corresponding to the user speech is predicted with the prediction model; wherein the prediction model is obtained by training on the speech of the different groups of people corresponding to each control instruction and on the corresponding facial images, so that, after predicting on the speech issued and the facial images presented by different groups of people for the same control instruction, the prediction model can output speech similar to the standard speech corresponding to that control instruction.
The prediction model can be obtained by training on the speech of different groups of people and the corresponding facial images; the prediction model used in the smart home device is an already trained model.
For example, taking a smart air conditioner, suppose the predicted speech for "turn on the air conditioner" is to be trained. Different groups of people, such as men, women, children and the elderly, can each be asked to read "turn on the air conditioner". While each group reads it, the sound that group makes (audio data) and the facial images during the utterance are collected, and training yields audio data and facial images whose similarity to the standard audio and standard image corresponding to the "turn on the air conditioner" instruction is 90%. In use, once the user's speech and facial image have been collected, the trained prediction model can directly output the similar speech.
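The collection step in the example above can be pictured as a simple data layout. The sketch below is only an assumption about how such samples might be organized before training; the names `TrainingSample` and `group_by_instruction` and the field layout are hypothetical and do not come from the patent, which does not specify a data format or a training algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingSample:
    instruction: str   # control instruction that was read aloud
    group: str         # group of people, e.g. "man", "woman", "child", "elderly"
    audio: bytes       # sound recorded while the instruction was read
    face_image: bytes  # facial image captured during the utterance


def group_by_instruction(samples: List[TrainingSample]) -> Dict[str, List[TrainingSample]]:
    """Collect the samples of all groups under the instruction they belong to."""
    grouped: Dict[str, List[TrainingSample]] = defaultdict(list)
    for sample in samples:
        grouped[sample.instruction].append(sample)
    return dict(grouped)
```

Grouping by instruction reflects the idea that every group's reading of the same instruction should map to the same standard speech.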
Further, to adapt to the dialects of different regions, different groups of people in each region can read the control instructions in the local dialect. The corresponding audio data and the facial images captured while the instructions are read are collected and used to train the prediction model, so that it learns the similar speech and facial images corresponding to each control instruction. The training process is similar to the one above and is not repeated here.
Specifically, predicting, with the prediction model and based on the user speech and the user facial image, the predicted speech corresponding to the user speech can be realized by the following procedure:
First, the first control instruction set corresponding to the user speech is identified from the user speech through the speech recognition technology in the prediction model.
Second, based on the user facial image, the second control instruction set corresponding to the user facial image is obtained from the facial image database in the prediction model; wherein the facial image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes.
Specifically, the corresponding user expression and/or user lip shape may first be extracted from the user facial image to obtain user expression data and/or user lip shape data; the second control instruction set is then obtained from the facial image database based on the user expression data and/or the user lip shape data.
Finally, each control instruction in the first control instruction set is matched one by one against each control instruction in the second control instruction set, and the audio data corresponding to the control instruction with the highest matching degree is taken as the predicted speech.
For example, referring to Fig. 2, take an air conditioner as the smart home device, with an external camera as its image acquisition device. When the user says "turn on the air conditioner", the air conditioner collects the user speech through the voice acquisition device and at the same time controls the camera to collect the user facial image. While the user is saying "turn on the air conditioner", a washing machine is running and produces noise 1, and another family member, who is stopping a child from watching television, produces noise 2: "Turn off the TV!". The user speech obtained by the air conditioner therefore contains, besides the speech "turn on the air conditioner", the washing machine's noise 1 and the other voice, noise 2 "Turn off the TV!".
After the air conditioner obtains the user facial image and the user speech, its built-in prediction model identifies from the user speech the corresponding first control instruction set: the "power on" instruction and the "power off" instruction. Meanwhile, it extracts the corresponding user lip shapes from the user facial image and compares them one by one with the lip shape data in the facial image database to determine the word corresponding to each lip shape, and thereby the recognition words corresponding to these lip shapes (recognition word 1 is "turn on the air conditioner" and recognition word 2 is "like the air conditioner"). Then, according to the correspondence between recognition words and air conditioner instructions in the prediction model, it determines the air conditioner control instruction corresponding to each recognition word, obtaining instruction 1 "power on" and instruction 2 "automatic cleaning" in the second control instruction set corresponding to the user facial image. See Fig. 3 for details.
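The last step of this example, going from recognition words to the second control instruction set, amounts to a table lookup. The sketch below is illustrative only: the two entries are the ones named in the example ("power on" is the instruction the machine translation renders as "booting"), while the dictionary layout and the names `WORD_TO_INSTRUCTION` and `second_instruction_set` are assumptions. Lip shape extraction itself is not shown.

```python
from typing import Dict, List

# Entries taken from the air conditioner example; the table layout is assumed.
WORD_TO_INSTRUCTION: Dict[str, str] = {
    "turn on the air conditioner": "power on",
    "like the air conditioner": "automatic cleaning",
}


def second_instruction_set(recognition_words: List[str]) -> List[str]:
    """Map recognition words read from lip shapes to control instructions."""
    instructions: List[str] = []
    for word in recognition_words:
        instruction = WORD_TO_INSTRUCTION.get(word)
        # Skip words with no known instruction and avoid duplicates.
        if instruction is not None and instruction not in instructions:
            instructions.append(instruction)
    return instructions
```

Words that the table does not know are dropped, so an unreadable lip shape simply contributes no candidate instruction.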
After the first control instruction set corresponding to the user speech ("power on" and "power off") and the second control instruction set corresponding to the user facial image ("power on" and "automatic cleaning") have been obtained, each control instruction in the first set is matched one by one against each control instruction in the second set, and the audio data corresponding to the control instruction with the highest matching degree (that is, the "power on" instruction) is taken as the predicted speech.
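One way to picture this pairwise matching is sketched below. The patent does not define how the matching degree between two instructions is computed, so the scoring rule here is an assumption: each candidate carries a hypothetical confidence, and a pair scores the product of the two confidences when both sets name the same instruction and zero otherwise. The function name `pick_instruction` and the confidence values in the usage note are likewise invented for illustration.

```python
from typing import Dict, Optional


def pick_instruction(
    first_set: Dict[str, float],
    second_set: Dict[str, float],
) -> Optional[str]:
    """Return the instruction with the highest pairwise match score, if any."""
    best: Optional[str] = None
    best_score = 0.0
    for speech_instr, speech_conf in first_set.items():
        for face_instr, face_conf in second_set.items():
            # Assumed rule: only instructions named by both sets can match.
            score = speech_conf * face_conf if speech_instr == face_instr else 0.0
            if score > best_score:
                best, best_score = speech_instr, score
    return best
```

With `{"power on": 0.6, "power off": 0.4}` from the speech and `{"power on": 0.7, "automatic cleaning": 0.3}` from the facial image, only "power on" appears in both sets, so it is selected, which agrees with the outcome described in the example.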
It should be noted that the above embodiment only takes extracting the lip shape from the user facial image as an example. In actual use, the user's facial expression, body movements and the like may also be used to help identify the control instruction corresponding to the user speech, improving the accuracy of user speech recognition.
After the smart home device has predicted the predicted speech corresponding to the user speech, steps 103 and 104 can be executed.
Step 103: based on the predicted speech, the standard speech audio data corresponding to the control instruction is matched from the speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data.
Step 104: the matching degree between the user speech and the standard speech audio data is calculated by the matching model, and when the matching degree reaches the set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data.
After the smart home device has predicted the predicted speech corresponding to the user speech, it still needs to verify whether the prediction result is correct. Specifically, according to the predicted speech, it can obtain the standard speech audio data corresponding to the control instruction from the speech database, which stores the mapping between the control instructions of the smart home device and the corresponding standard speech data, and then verify the predicted speech by calculating the similarity between the user speech and the standard speech audio data. For example, the predicted speech is determined to be correct when the similarity reaches a set threshold such as 90%, and incorrect otherwise.
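The verification step can be illustrated with a simple similarity measure. The patent does not say how the matching model computes similarity, so the sketch below swaps in cosine similarity over fixed-length feature vectors purely as an assumed stand-in; feature extraction is out of scope, and the names `cosine_similarity` and `prediction_is_correct` are hypothetical. The 0.9 default follows the 90% threshold given as an example in the text.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no signal to compare
    return dot / (norm_a * norm_b)


def prediction_is_correct(
    user_features: Sequence[float],
    standard_features: Sequence[float],
    threshold: float = 0.9,
) -> bool:
    """Treat the prediction as correct when similarity reaches the threshold."""
    return cosine_similarity(user_features, standard_features) >= threshold
```

Any similarity measure with a comparable range could be substituted; the point is only that the predicted instruction is accepted when the score reaches the set threshold.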
If the predicted speech is correct, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data.
If the predicted speech is incorrect, that is, if after the similarity between the user speech and the standard speech audio data is calculated the similarity cannot reach the set threshold, the predicted speech is determined to be incorrect, and preset prompt information is used to indicate to the user that the user speech will be collected again; wherein the preset prompt information is sound and/or light prompt information.
For example, when the similarity does not reach the set threshold, the smart home device can ask the user through an audio device to input the voice information again; the air conditioner might play "What did you say?" so that the user repeats the earlier speech. An indicator light can also be used to ask the user to input the voice information again; the air conditioner might flash a red light to signal the user to repeat the earlier speech.
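The two branches above (execute on success, prompt for a repeat on failure) can be sketched as a small decision function. The names `Prompt` and `handle_result` and the returned strings are hypothetical placeholders for real device actions; only the threshold comparison and the choice between a sound and a light prompt come from the text.

```python
from enum import Enum


class Prompt(Enum):
    SOUND = "sound"
    LIGHT = "light"


def handle_result(
    similarity: float,
    instruction: str,
    threshold: float = 0.9,
    prompt: Prompt = Prompt.SOUND,
) -> str:
    """Describe the action the device takes for a given similarity."""
    if similarity >= threshold:
        # Prediction verified: carry out the control instruction.
        return f"execute: {instruction}"
    if prompt is Prompt.SOUND:
        # Sound prompt asking the user to repeat the speech.
        return "play: What did you say?"
    # Light prompt asking the user to repeat the speech.
    return "flash: red light"
```

A real device would presumably trigger the speaker or indicator light here and then return to the collection step to capture the speech again.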
Based on the same inventive concept, an embodiment of the present invention provides a device for speech recognition. For the specific implementation of the speech recognition method in this device, see the description of the method embodiments; overlapping details are not repeated. Referring to Fig. 4, the device includes:
Acquisition unit 401, configured to collect a user face image through an image acquisition device while user speech is being collected through a voice acquisition device;
Predicting unit 402, configured to predict, with a prediction model and based on the user speech and the user face image, the prediction voice corresponding to the user speech; wherein the prediction model is trained on the speech and corresponding face images of different groups of people for each control instruction, so that, given the speech uttered and the face images presented by different groups of people for the same control instruction, the prediction model can output a voice similar to the standard voice corresponding to that control instruction;
Matching unit 403, configured to match, based on the prediction voice, the standard speech audio data corresponding to the control instruction from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;
Computing unit 404, configured to calculate, through a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, to control the smart home device according to the control instruction corresponding to the standard speech audio data.
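The way the four units cooperate can be sketched as a small pipeline. This is only an illustrative sketch: the class name `SpeechRecognitionDevice`, its fields, and the use of raw `bytes` for audio and images are all hypothetical. As a simplification, the predicting unit here returns the key of a control instruction rather than a synthesized voice, and the prediction model and matching model are passed in as plain callables because the text does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SpeechRecognitionDevice:
    # Predicting unit 402: (user speech, face image) -> predicted instruction key.
    predict: Callable[[bytes, bytes], str]
    # Speech database used by matching unit 403: instruction key -> standard audio.
    speech_database: Dict[str, bytes]
    # Matching model used by computing unit 404: (user speech, standard audio) -> degree.
    match_degree: Callable[[bytes, bytes], float]
    threshold: float = 0.9

    def handle(self, user_speech: bytes, face_image: bytes) -> Optional[str]:
        """Return the control instruction to execute, or None to reprompt the user.

        The two arguments stand for the output of acquisition unit 401, which
        captures speech and face image at the same time.
        """
        predicted = self.predict(user_speech, face_image)
        standard_audio = self.speech_database.get(predicted)
        if standard_audio is None:
            return None  # no standard audio is stored for this prediction
        degree = self.match_degree(user_speech, standard_audio)
        return predicted if degree >= self.threshold else None
```

Returning `None` stands for the branch in which the matching degree does not reach the threshold and the user is asked to speak again.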
Preferably, the predicting unit 402 is specifically configured to:
identify, from the user speech and through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
obtain, based on the user face image and from the face image database in the prediction model, a second control instruction set corresponding to the user face image; wherein the face image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes;
match the first control instruction set one by one against each control instruction in the second control instruction set, and take the audio data corresponding to the control instruction with the highest matching degree as the prediction voice.
Preferably, the predicting unit 402 is further configured to:
extract the corresponding user expression and/or user lip shape from the user face image to obtain user expression data and/or user lip shape data;
obtain the second control instruction set from the face image database based on the user expression data and/or user lip shape data.
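The one-by-one matching of the two control instruction sets can be sketched as below. The text does not define how the matching degree between the sets is computed, so this sketch makes an assumption: each set maps a control instruction to a confidence score, only instructions present in both sets are considered, and the product of the two scores is used as the matching degree. The function name `pick_instruction` and the instruction names in the usage example are hypothetical.

```python
from typing import Dict, Optional


def pick_instruction(
    first_set: Dict[str, float],
    second_set: Dict[str, float],
) -> Optional[str]:
    """Pick the instruction with the highest combined score across both sets.

    first_set:  candidates from speech recognition, instruction -> confidence.
    second_set: candidates from expression / lip shape, instruction -> confidence.
    Returns None when the two sets share no instruction.
    """
    best: Optional[str] = None
    best_score = float("-inf")
    for instruction, speech_score in first_set.items():
        if instruction not in second_set:
            continue  # the face image does not support this candidate
        score = speech_score * second_set[instruction]  # assumed matching degree
        if score > best_score:
            best, best_score = instruction, score
    return best


# Usage: the speech alone slightly prefers "heat", but the lip shape supports "cool".
speech_candidates = {"heat": 0.55, "cool": 0.50}
face_candidates = {"cool": 0.90, "heat": 0.20}
chosen = pick_instruction(speech_candidates, face_candidates)
```

The audio data stored for the chosen instruction would then serve as the prediction voice.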
Preferably, the computing unit 404 is further configured to:
if the similarity fails to reach the set threshold, use preset prompt information to tell the user that the user speech will be collected again; wherein the preset prompt information is a sound prompt and/or a light prompt.
Based on the same inventive concept, an embodiment of the present invention provides a device for speech recognition, comprising: at least one processor, and
a memory connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor executes the speech recognition method described above by executing the instructions stored in the memory.
Based on the same inventive concept, an embodiment of the present invention also provides a computer-readable storage medium, wherein:
the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the speech recognition method described above.
In the embodiments provided by the present invention, the smart home device collects a user face image through an image acquisition device while collecting user speech through a voice acquisition device. Based on the collected user speech and user face image, it predicts the prediction voice corresponding to the user speech with a prediction model. The prediction model is trained on the speech and corresponding face images of different groups of people for each control instruction, so that, given the speech uttered and the face images presented by different groups of people for the same control instruction, it can output a voice similar to the standard voice corresponding to that control instruction. Based on the prediction voice, the standard speech audio data corresponding to the control instruction is then matched from a speech database, which holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data. Finally, the matching degree between the user speech and the standard speech audio data is calculated through a matching model, and when the matching degree reaches a set threshold, the smart home device is controlled according to the control instruction corresponding to the standard speech audio data. The smart home device can thus improve its speech recognition rate quickly and conveniently, reduce malfunctions caused by incorrect speech recognition, and improve the user experience.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (10)

1. A speech recognition method applied to a smart home device, characterized by comprising:
collecting a user face image through an image acquisition device while collecting user speech through a voice acquisition device;
predicting, with a prediction model and based on the user speech and the user face image, a prediction voice corresponding to the user speech; wherein the prediction model is trained on the speech and corresponding face images of different groups of people for each control instruction, so that, given the speech uttered and the face images presented by different groups of people for the same control instruction, the prediction model can output a voice similar to the standard voice corresponding to that control instruction;
matching, based on the prediction voice, standard speech audio data corresponding to the control instruction from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;
calculating, through a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, controlling the smart home device according to the control instruction corresponding to the standard speech audio data.
2. The method according to claim 1, characterized in that predicting, with a prediction model and based on the user speech and the user face image, the prediction voice corresponding to the user speech comprises:
identifying, from the user speech and through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
obtaining, based on the user face image and from the face image database in the prediction model, a second control instruction set corresponding to the user face image; wherein the face image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes;
matching the first control instruction set one by one against each control instruction in the second control instruction set, and taking the audio data corresponding to the control instruction with the highest matching degree as the prediction voice.
3. The method according to claim 2, characterized in that obtaining, based on the user face image and from the face image database in the prediction model, the second control instruction set corresponding to the user face image comprises:
extracting the corresponding user expression and/or user lip shape from the user face image to obtain user expression data and/or user lip shape data;
obtaining the second control instruction set from the face image database based on the user expression data and/or user lip shape data.
4. The method according to any one of claims 1-3, characterized in that, after calculating the similarity between the user speech and the standard speech audio data, the method further comprises:
if the similarity fails to reach the set threshold, using preset prompt information to tell the user that the user speech will be collected again; wherein the preset prompt information is a sound prompt and/or a light prompt.
5. A speech recognition device applied to a smart home device, characterized by comprising:
an acquisition unit, configured to collect a user face image through an image acquisition device while user speech is being collected through a voice acquisition device;
a predicting unit, configured to predict, with a prediction model and based on the user speech and the user face image, a prediction voice corresponding to the user speech; wherein the prediction model is trained on the speech and corresponding face images of different groups of people for each control instruction, so that, given the speech uttered and the face images presented by different groups of people for the same control instruction, the prediction model can output a voice similar to the standard voice corresponding to that control instruction;
an acquiring unit, configured to match, based on the prediction voice, standard speech audio data corresponding to the control instruction from a speech database; wherein the speech database holds the mapping between the control instructions of the smart home device and the corresponding standard speech audio data;
a computing unit, configured to calculate, through a matching model, the matching degree between the user speech and the standard speech audio data, and, when the matching degree reaches a set threshold, to control the smart home device according to the control instruction corresponding to the standard speech audio data.
6. The device according to claim 5, characterized in that the predicting unit is specifically configured to:
identify, from the user speech and through the speech recognition technology in the prediction model, a first control instruction set corresponding to the user speech;
obtain, based on the user face image and from the face image database in the prediction model, a second control instruction set corresponding to the user face image; wherein the face image database holds the mapping between control instructions and standard user expressions and/or standard user lip shapes;
match the first control instruction set one by one against each control instruction in the second control instruction set, and take the audio data corresponding to the control instruction with the highest matching degree as the prediction voice.
7. The device according to claim 6, characterized in that the predicting unit is further configured to:
extract the corresponding user expression and/or user lip shape from the user face image to obtain user expression data and/or user lip shape data;
obtain the second control instruction set from the face image database based on the user expression data and/or user lip shape data.
8. The device according to any one of claims 5-7, characterized in that the computing unit is further configured to:
if the similarity fails to reach the set threshold, use preset prompt information to tell the user that the user speech will be collected again; wherein the preset prompt information is a sound prompt and/or a light prompt.
9. A speech recognition device, characterized by comprising:
at least one processor, and
a memory connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor executes the method according to any one of claims 1-4 by executing the instructions stored in the memory.
10. A computer-readable storage medium, characterized in that:
the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the method according to any one of claims 1-4.
CN201811238626.0A 2018-10-23 2018-10-23 A kind of method, apparatus and computer storage medium of speech recognition Pending CN109448711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811238626.0A CN109448711A (en) 2018-10-23 2018-10-23 A kind of method, apparatus and computer storage medium of speech recognition


Publications (1)

Publication Number Publication Date
CN109448711A true CN109448711A (en) 2019-03-08

Family

ID=65548031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811238626.0A Pending CN109448711A (en) 2018-10-23 2018-10-23 A kind of method, apparatus and computer storage medium of speech recognition

Country Status (1)

Country Link
CN (1) CN109448711A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190308