CN110853631A - Voice recognition method and device for smart home - Google Patents

Voice recognition method and device for smart home

Info

Publication number
CN110853631A
Authority
CN
China
Prior art keywords
voice
frames
user
target
smart home
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810873250.4A
Other languages
Chinese (zh)
Inventor
易斌
许权南
连园园
彭磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai
Priority to CN201810873250.4A
Publication of CN110853631A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The application discloses a voice recognition method and device for smart home. The method comprises the following steps: acquiring voice information for regulating and controlling the smart home; splitting the voice information to obtain user voice in the voice information; and generating a target control instruction according to the user voice, and regulating and controlling the smart home based on the target control instruction. Through the method and the device, the defect of user voice recognition in existing smart homes in the related art, whereby control commands cannot be effectively recognized, is overcome, and the technical problem of poor user experience is solved.

Description

Voice recognition method and device for smart home
Technical Field
The application relates to the field of intelligent home control, in particular to a voice recognition method and device for intelligent home.
Background
Most current smart homes can be controlled by user voice, but they cannot always recognize that voice accurately; for example, voice mixed with background noise may fail to be recognized.
For this defect of existing smart-home voice recognition, whereby the smart home cannot effectively recognize control commands and the user experience is poor, no effective solution has yet been proposed.
Disclosure of Invention
The application provides a voice recognition method and device for smart home, aiming to solve the technical problems that, owing to the defect of user voice recognition in existing smart homes in the related art, the smart home cannot effectively recognize control commands and the user experience is poor.
According to one aspect of the application, a voice recognition method for smart home is provided. The method comprises the following steps: acquiring voice information for regulating and controlling the smart home; splitting the voice information to obtain user voice in the voice information; and generating a target control instruction according to the user voice, and regulating and controlling the smart home based on the target control instruction.
Optionally, splitting the voice information, and obtaining the user voice in the voice information includes: determining a plurality of starting frames and a plurality of ending frames in the voice information; determining a plurality of voice segments according to a plurality of starting frames and a plurality of ending frames, wherein the starting frames and the ending frames in the voice segments are adjacent; and combining the plurality of voice fragments into the user voice according to the sequence positions of the plurality of voice fragments in the voice information.
Optionally, determining a plurality of start frames and a plurality of end frames in the speech information comprises: determining a preset number of adjacent first prepared frames in the voice information, wherein a target parameter of the first prepared frames is greater than a first preset value, and the target parameter is a short-time energy value and/or a short-time zero-crossing rate; determining a preset number of adjacent second prepared frames in the voice information, wherein the target parameter of the second prepared frames is smaller than a second preset value; a first frame of the first preliminary frames is determined to be a starting frame and a first frame of the second preliminary frames is determined to be an ending frame.
Optionally, the combining the plurality of voice segments into the user voice according to the sequential positions of the plurality of voice segments in the voice message comprises: recognizing a plurality of voice fragments by using a first model, and determining a plurality of target voices, wherein the first model is a deep belief network model trained by machine learning by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a voice segment and a target voice contained in the voice segment; and combining the target voices into the user voice according to the sequential positions of the target voices in the voice information.
Optionally, the generating the target control instruction according to the user voice includes: acquiring voice characteristics of user voice, wherein the voice characteristics of the user voice at least comprise: mel-frequency cepstral coefficients of the user's speech; the method comprises the steps of identifying voice characteristics of user voice by using a second model, and determining a control instruction corresponding to the user voice, wherein the second model is a deep belief network model trained by using multiple groups of data through machine learning, and each group of data in the multiple groups of data comprises: the voice characteristics of the user voice and the control instruction corresponding to the user voice.
Optionally, the regulating and controlling the smart home based on the target control instruction includes: and regulating and controlling the operation parameters of the target object to the target parameters according to the control instruction corresponding to the user voice, wherein the control instruction at least comprises the target object to be regulated and controlled in the smart home, the operation parameters to be regulated and controlled of the target object and the target parameters.
According to another aspect of the application, a voice recognition device for smart home is provided. The device includes: the intelligent home control system comprises an acquisition unit, a control unit and a control unit, wherein the acquisition unit is used for acquiring voice information for regulating and controlling the intelligent home; the splitting unit is used for splitting the voice information to obtain user voice in the voice information; and the regulating and controlling unit is used for generating a target control instruction according to the user voice and regulating and controlling the smart home based on the target control instruction.
Optionally, the splitting unit includes: the first determining module is used for determining a plurality of starting frames and a plurality of ending frames in the voice information; the second determining module is used for determining a plurality of voice segments according to a plurality of starting frames and a plurality of ending frames, wherein the starting frames and the ending frames in the voice segments are adjacent; and the synthesis module is used for combining the plurality of voice segments into the user voice according to the sequence positions of the plurality of voice segments in the voice information.
In order to achieve the above object, according to another aspect of the present application, there is provided a storage medium including a stored program, wherein the program performs the voice recognition method of smart home of any one of the above.
In order to achieve the above object, according to another aspect of the present application, there is provided a processor configured to execute a program, where the program executes a voice recognition method of a smart home according to any one of the above.
Through the application, the following steps are adopted: acquiring voice information for regulating and controlling the smart home; splitting the voice information to obtain the user voice in the voice information; and generating a target control instruction according to the user voice and regulating and controlling the smart home based on the target control instruction. This solves the technical problems that, owing to the defect of user voice recognition in existing smart homes in the related art, the smart home cannot effectively recognize control instructions and the user experience is poor. User voice and noise in the voice information are thereby distinguished and the noise is filtered out, achieving the technical effects of improving voice recognition accuracy and improving the user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
fig. 1 is a flowchart of a speech recognition method for smart home provided according to an embodiment of the present application; and
fig. 2 is a schematic diagram of a voice recognition device of a smart home according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, such that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of description, some terms or expressions referred to in the embodiments of the present application are explained below:
short-time zero-crossing rate: the short-term zero-crossing rate refers to the number of times the speech signal passes through a zero point (from positive to negative or from negative to positive) in each frame.
Short-time energy value: the short-time energy value of the voice signal changes along with time, so that the characteristic change condition of the voice information can be described by analyzing the short-time energy value.
Mel-frequency cepstral coefficient: a voice feature obtained by a linear transformation of the logarithmic energy spectrum computed on the nonlinear mel scale of the sound frequency; it is widely used in voice recognition.
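The two short-time parameters defined above can be computed per frame as follows. This is a sketch, not part of the patent: the frame length, hop size, and the convention of treating exact zeros as positive are illustrative choices.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def short_time_zcr(frames):
    """Number of sign changes (zero crossings) in each frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.sum(signs[:, 1:] != signs[:, :-1], axis=1)

# A pure tone has high energy and a zero-crossing rate fixed by its frequency.
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
frames = frame_signal(tone)
print(short_time_energy(frames)[0], short_time_zcr(frames)[0])
```

For a 440 Hz tone framed at 256 samples, the first frame holds about 14 cycles, so roughly 28 zero crossings; voiced speech behaves like the tone (high energy, moderate ZCR), while unvoiced noise tends toward low energy and high ZCR.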
According to the embodiment of the application, a voice recognition method for smart home is provided.
Fig. 1 is a flowchart of a speech recognition method for smart home according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
and S101, acquiring voice information for regulating and controlling the smart home.
And step S102, splitting the voice information to obtain the user voice in the voice information.
And S103, generating a target control instruction according to the user voice, and regulating and controlling the smart home based on the target control instruction.
According to the voice recognition method for the smart home, voice information for regulating and controlling the smart home is obtained; splitting the voice information to obtain user voice in the voice information; the target control instruction is generated according to the user voice, and the smart home is regulated and controlled based on the target control instruction, so that the technical problems that the smart home cannot effectively recognize the control instruction and the user experience degree is poor due to the defect of user voice recognition of the existing smart home in the related art are solved.
That is, according to the voice recognition method for smart home provided by the embodiment of the application, after the voice information is obtained, it is split to remove the noise portions that contain no user voice, yielding a "pure" user voice. When the user voice is subsequently analyzed, noise therefore no longer distorts the analysis result, and an accurate control instruction is obtained. The method can distinguish user voice from noise in the voice information and filter out the noise, achieving the technical effects of improving voice recognition accuracy and improving the user experience.
For the step S102, in the voice recognition method for smart home provided in the embodiment of the present application, the step S102 (splitting the voice information to obtain the user voice in the voice information) may further include: determining a plurality of starting frames and a plurality of ending frames in the voice information; determining a plurality of voice segments according to a plurality of starting frames and a plurality of ending frames, wherein the starting frames and the ending frames in the voice segments are adjacent; and combining the plurality of voice fragments into the user voice according to the sequence positions of the plurality of voice fragments in the voice information.
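The recombination step just described, concatenating the detected speech segments in their original order, can be sketched as follows. The helper name and the use of sample-index ranges (rather than frame indices) are assumptions for illustration.

```python
import numpy as np

def combine_segments(signal, segments):
    """Concatenate the detected (start, end) sample ranges in their original
    order, discarding everything between them (the noise-only stretches)."""
    return np.concatenate([signal[s:e + 1] for s, e in sorted(segments)])

x = np.arange(10)
print(combine_segments(x, [(6, 8), (1, 3)]))  # [1 2 3 6 7 8]
```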
Optionally, in the speech recognition method for smart home provided in the embodiment of the present application, the determining method for the multiple start frames and the multiple end frames may be: determining a preset number of adjacent first prepared frames in the voice information, wherein a target parameter of the first prepared frames is greater than a first preset value, and the target parameter is a short-time energy value and/or a short-time zero-crossing rate; determining a preset number of adjacent second prepared frames in the voice information, wherein the target parameter of the second prepared frames is smaller than a second preset value; a first frame of the first preliminary frames is determined to be a starting frame and a first frame of the second preliminary frames is determined to be an ending frame.
That is, before speech recognition, a dual-threshold end-point detection method of a short-time energy value and a short-time zero-crossing rate is used to roughly distinguish speech segments from noise segments, so that speech segments containing user speech are obtained, and pure noise segments are filtered out, so that the accuracy of subsequent speech recognition is improved.
The target parameter may be only the short-time energy value, only the short-time zero-crossing rate, or both, and is not limited herein. However, the target parameter is preferably a single one of the two, that is, either the short-time energy value or the short-time zero-crossing rate.
The preset number is plural; that is, a plurality of adjacent first preliminary frames and a plurality of adjacent second preliminary frames are determined in the voice information. Requiring several consecutive frames improves the accuracy of the voice recognition method provided in the embodiment of the present application, avoiding the situation where a sudden burst of noise causes a first or second preliminary frame to be misjudged and the starting or ending frame of a speech segment to be selected incorrectly. In addition, experiments show that a preset number of four balances the execution speed and the accuracy of the method; that is, in a preferred example, four adjacent first preliminary frames and four adjacent second preliminary frames are determined in the voice information.
On the basis that the preset number is plural, when the target parameter of the plurality of first preliminary frames is greater than the first preset value and the target parameter is the short-time energy value and/or the short-time zero-crossing rate, in one example the short-time energy value of one part of the first preliminary frames is greater than the first preset value while the short-time zero-crossing rate of another part is greater than the first preset value; in another example, the short-time zero-crossing rate of every one of the plurality of first preliminary frames is greater than the first preset value, or the short-time energy value of every one of them is greater than the first preset value.
In an optional example, detection starts from one end of the voice information. For each frame it is judged in turn whether either of the frame's current short-time energy value and short-time zero-crossing rate exceeds the first preset value; if so, the same judgment is applied to the next three adjacent frames, and if they also pass, the frame is determined to be the starting point of a speech segment, i.e., a starting frame. Each subsequent frame is then judged in turn against the second preset value: when the short-time energy value and short-time zero-crossing rate of a frame fall below the second preset value, and the same holds for the next three adjacent frames, that frame is determined to be the end point of the speech segment, i.e., an ending frame. A speech segment in the voice information can then be determined from the starting frame and the ending frame, and the same method is applied to the remaining frames to find the subsequent speech segments.
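The scan just described can be sketched as follows. This is not from the patent: the thresholds, the four-frame confirmation window, and the use of one threshold per end of a segment for both parameters are illustrative assumptions.

```python
import numpy as np

def detect_segments(energy, zcr, t_start, t_end, confirm=4):
    """Scan per-frame parameters left to right; a frame becomes a starting frame
    when it and the next confirm-1 frames all exceed t_start, and an ending
    frame when confirm consecutive frames all fall below t_end."""
    def active(i):  # "any one of the two parameters exceeds the threshold"
        return energy[i] > t_start or zcr[i] > t_start
    def quiet(i):   # both parameters below the end threshold
        return energy[i] < t_end and zcr[i] < t_end
    segments, i, n = [], 0, len(energy)
    while i <= n - confirm:
        if all(active(i + k) for k in range(confirm)):
            start = i                                   # starting frame found
            j = start + confirm
            while j <= n - confirm and not all(quiet(j + k) for k in range(confirm)):
                j += 1
            end = j if j <= n - confirm else n - 1      # ending frame (or tail)
            segments.append((start, end))
            i = end + 1
        else:
            i += 1
    return segments

energy = np.array([0, 0, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0], dtype=float)
zcr = np.zeros(13)
print(detect_segments(energy, zcr, t_start=1, t_end=1))  # [(2, 8)]
```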
In another alternative example, the method for determining the speech segment according to the plurality of start frames and the plurality of end frames in the speech information may be: each frame is detected in the speech information in turn to determine a start frame, and each subsequent frame of the speech information is further detected to determine an end frame, the start frame and the end frame corresponding to a speech segment. And then, continuously detecting each subsequent frame of the voice information to determine a second starting frame, and detecting each subsequent frame of the voice information to determine a second ending frame, wherein the second starting frame and the second ending frame correspond to another voice segment until each frame in the voice information is detected.
In another alternative example, on the basis of determining a plurality of start frames and a plurality of end frames in the speech information, the determining method of the plurality of speech segments may be: determining a plurality of groups of set frames according to a plurality of start frames and a plurality of end frames in the voice information, wherein each group of set frames comprises a start frame and an end frame adjacent to the start frame, and the position of the start frame in the voice information is positioned before the position of the end frame in the voice information; a plurality of speech segments is determined from the plurality of sets of aggregate frames.
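The grouping of "set frames" just described can be sketched as a small pairing routine; it assumes both index lists arrive in ascending order, as they would from a left-to-right scan, and the helper name is hypothetical.

```python
def pair_frames(start_frames, end_frames):
    """Group each starting frame with the nearest ending frame that follows it,
    forming (start, end) pairs; each pair delimits one speech segment."""
    pairs, e = [], 0
    for s in sorted(start_frames):
        while e < len(end_frames) and end_frames[e] <= s:
            e += 1  # skip ending frames at or before this starting frame
        if e < len(end_frames):
            pairs.append((s, end_frames[e]))
            e += 1
    return pairs

print(pair_frames([2, 40], [20, 77]))  # [(2, 20), (40, 77)]
```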
Optionally, after determining the plurality of voice segments, in the voice recognition method for smart home provided in the embodiment of the present application, combining the plurality of voice segments into the user voice according to the sequential positions of the plurality of voice segments in the voice message includes: recognizing a plurality of voice fragments by using a first model, and determining a plurality of target voices, wherein the first model is a deep belief network model trained by machine learning by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a voice segment and a target voice contained in the voice segment; and combining the target voices into the user voice according to the sequential positions of the target voices in the voice information.
That is, in an optional embodiment, on the basis of determining the plurality of voice segments, before combining the plurality of voice segments into the user voice according to the sequential positions of the plurality of voice segments in the voice message, the voice recognition method for smart home further includes: recognizing a plurality of voice fragments by using a first model, and determining a plurality of target voices, wherein the first model is a deep belief network model trained by machine learning by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a speech segment and a target speech contained in the speech segment. Furthermore, the user speech may be subsequently synthesized using a plurality of target speech combinations, that is, combining a plurality of speech segments into the user speech according to the sequential positions of the plurality of speech segments in the speech information includes: and combining the target voices into the user voice according to the sequential positions of the target voices in the voice information.
That is, because the dual-threshold endpoint detection method based on the short-time energy value and short-time zero-crossing rate can only roughly determine which segments contain the user's voice, the speech segments are further screened by the first model: the background noise mixed into the speech segments is filtered out to obtain a plurality of target voices. This prevents background noise within a speech segment from interfering with recognition of the user's voice, achieving the technical effect of increasing the accuracy of smart-home voice recognition.
It should be noted that the dual-threshold endpoint detection of short-time energy values and short-time zero-crossing rates only screens out noise segments in the voice information, while the first model screens out background noise within the speech segments. That is, the voice information includes noise segments and speech segments, and the speech segments themselves include background noise and target speech.
Further to step S103, in the voice recognition method for smart home provided in the embodiment of the present application, generating the target control instruction according to the user voice may include: acquiring voice characteristics of user voice, wherein the voice characteristics of the user voice at least comprise: mel-frequency cepstral coefficients of the user's speech; the method comprises the steps of identifying voice characteristics of user voice by using a second model, and determining a control instruction corresponding to the user voice, wherein the second model is a deep belief network model trained by using multiple groups of data through machine learning, and each group of data in the multiple groups of data comprises: the voice characteristics of the user voice and the control instruction corresponding to the user voice.
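As a sketch of the mel-frequency cepstral features the second model consumes, the following computes simplified MFCCs with NumPy. The frame sizes, filter counts, and the omission of liftering and energy terms are simplifying assumptions; a production system would typically use a dedicated library such as librosa or python_speech_features.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_mels=26, n_ceps=13):
    """Simplified MFCCs: pre-emphasis -> framing -> power spectrum ->
    mel filterbank -> log -> DCT-II, keeping the first n_ceps coefficients."""
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return log_mel @ dct.T  # shape: (n_frames, n_ceps)

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(8000) / 8000))
print(feats.shape)  # (61, 13)
```

The resulting matrix (one 13-dimensional feature vector per frame) is the kind of input a trained model would map to a control instruction.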
In another alternative example, generating the target control instruction from the user speech may include: acquiring voice characteristics of user voice, wherein the voice characteristics of the user voice at least comprise: mel cepstral coefficients of the user's speech; comparing the voice features of the user voice with a plurality of preset voice features, and determining the preset voice features meeting preset conditions; and determining a target control instruction according to a preset control instruction corresponding to the preset voice feature meeting the preset condition.
By adopting voice features to identify the control instruction corresponding to the user voice, the accuracy of confirming the control instruction is improved. In addition, using the deep belief network model for the comparison further improves that accuracy.
In another alternative example, the first model may be used to recognize a plurality of speech segments and determine a plurality of target speeches, or the first model may also be used to recognize speech features of a user speech and determine a control command corresponding to the user speech, that is, each of the plurality of sets of training data of the first model includes: the voice control method comprises the steps of a voice segment, voice characteristics of user voice contained in the voice segment and a control instruction corresponding to the user voice.
In addition, in the voice recognition method for smart home provided by the embodiment of the application, the controlling smart home based on the target control instruction includes: and regulating and controlling the operation parameters of the target object to the target parameters according to the control instruction corresponding to the user voice, wherein the control instruction at least comprises the target object to be regulated and controlled in the smart home, the operation parameters to be regulated and controlled of the target object and the target parameters.
The control instruction is made to accurately comprise the target object to be regulated, the operating parameter to be regulated of the target object and the target parameter, so that the control instruction is clear, and the technical effect of accurately regulating and controlling the smart home is achieved.
The target object to be regulated includes at least any one of the following: an air conditioner, a humidifier, a curtain, a lamp, a sweeping and mopping robot, a speaker, and the like, of the smart home. The operation parameters of the target object to be regulated include at least any one of the following: conventional use instructions (e.g., on, off) and non-conventional use instructions (e.g., indoor temperature values, indoor humidity values, indoor light values).
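The three fields the patent requires a control instruction to carry (target object, operation parameter to regulate, target parameter value) can be sketched as a small data structure. The field names and the dict-based stand-in for the smart home's state are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ControlInstruction:
    """The three fields a control instruction at least contains."""
    target_object: str  # the smart-home appliance to regulate
    parameter: str      # the operation parameter to regulate
    target_value: str   # the value the parameter should reach

def execute(instruction, home_state):
    """Apply the instruction to a dict-based stand-in for the smart home."""
    device = home_state.setdefault(instruction.target_object, {})
    device[instruction.parameter] = instruction.target_value
    return home_state

state = execute(ControlInstruction("air_conditioner", "temperature", "26"), {})
print(state)  # {'air_conditioner': {'temperature': '26'}}
```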
The invention will now be described with reference to another embodiment.
In the method, the voice and other noises sent by the user are distinguished through a double-threshold end point detection method of a short-time energy value and a short-time zero-crossing rate, and the voice of the user with part of the noises filtered is obtained.
And further, inputting the user voice with the listed partial noise into a first model, and further determining which voice stages are voice stages and which noise stages are noise stages so as to determine the fragment interval of the real user voice, wherein the first model is determined by training the established deep belief network model through a layer-by-layer greedy algorithm in combination with the voice characteristic parameters corresponding to the voice instruction for controlling the intelligent home by the user.
Finally, after the real user speech is recognized, feature extraction is performed on it to determine the energy value of each frame, and these values are compared with the speech energy values in a preset model, thereby determining the user's voice instruction together with the target object to be controlled and the specific operating parameters to be regulated that the instruction contains.
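The per-frame energy comparison can be sketched as below, assuming for illustration that the "preset model" is simply a dictionary mapping each command to a stored energy contour. The truncation-based alignment and the command names are hypothetical simplifications; a real matcher would use the trained model or dynamic time warping.

```python
import numpy as np

def frame_energies(x, frame_len=256, hop=128):
    """Per-frame short-time energy of a 1-D signal."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n)])

def match_command(speech, templates, frame_len=256, hop=128):
    """Return the command whose stored energy contour is closest."""
    e = frame_energies(speech, frame_len, hop)
    best, best_dist = None, np.inf
    for command, ref in templates.items():
        m = min(len(e), len(ref))        # naive alignment by truncation
        d = np.linalg.norm(e[:m] - ref[:m])
        if d < best_dist:
            best, best_dist = command, d
    return best
```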
Moreover, the invention can realize the following technical effects:
firstly, the dual-threshold endpoint detection method based on short-time energy and the short-time zero-crossing rate distinguishes the speech with which the user controls the smart home from noise, improving the accuracy of voice recognition. Secondly, using a deep belief network model as the user speech recognition model achieves the technical effect of accurately recognizing the user's speech. In addition, using Mel-frequency cepstral coefficients as the user's speech features achieves the technical effect of accurately determining the control instruction corresponding to the user's voice.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the present application further provides a voice recognition device for smart home, and it should be noted that the voice recognition device for smart home according to the embodiment of the present application may be used to execute the voice recognition method for smart home according to the embodiment of the present application. The following introduces a voice recognition device for smart home provided by an embodiment of the present application.
Fig. 2 is a schematic diagram of a speech recognition device of a smart home according to an embodiment of the present application. As shown in fig. 2, the apparatus includes: an acquisition unit 21, a splitting unit 22 and a regulation unit 23.
The acquiring unit 21 is configured to acquire voice information for regulating and controlling smart home.
The splitting unit 22 is configured to split the voice information to obtain the user voice in the voice information.
And the regulating unit 23 is configured to generate a target control instruction according to the user voice, and regulate and control the smart home based on the target control instruction.
Optionally, in the speech recognition device for smart homes provided in the embodiment of the present application, the splitting unit 22 includes: the first determining module is used for determining a plurality of starting frames and a plurality of ending frames in the voice information; the second determining module is used for determining a plurality of voice segments according to a plurality of starting frames and a plurality of ending frames, wherein the starting frames and the ending frames in the voice segments are adjacent; and the synthesis module is used for combining the plurality of voice segments into the user voice according to the sequence positions of the plurality of voice segments in the voice information.
Optionally, in the speech recognition apparatus for smart home provided in the embodiment of the present application, the first determining module includes: the first determining submodule is used for determining a preset number of adjacent first prepared frames in the voice information, wherein the target parameter of the first prepared frames is greater than a first preset value, and the target parameter is a short-time energy value and/or a short-time zero-crossing rate; the second determining submodule is used for determining a preset number of adjacent second prepared frames in the voice information, wherein the target parameter of the second prepared frames is smaller than a second preset value; a third determining sub-module for determining a first frame of the first preliminary frames as a start frame and a first frame of the second preliminary frames as an end frame.
Optionally, in the speech recognition apparatus for smart home provided in the embodiment of the present application, the synthesis module includes: a fourth determining submodule, configured to recognize multiple speech segments by using the first model, and determine multiple target speeches, where the first model is a deep belief network model trained by machine learning using multiple sets of data, and each set of data in the multiple sets of data includes: a voice segment and a target voice contained in the voice segment; and the synthesis submodule is used for combining the target voices into the user voice according to the sequence positions of the target voices in the voice information.
Optionally, in the speech recognition device for smart homes provided in the embodiment of the present application, the control unit 23 includes: the obtaining module is used for obtaining the voice characteristics of the user voice, wherein the voice characteristics of the user voice at least comprise: mel-frequency cepstral coefficients of the user's speech; the third determining module is used for identifying the voice characteristics of the user voice by using a second model and determining a control instruction corresponding to the user voice, wherein the second model is a deep belief network model trained by using multiple groups of data through machine learning, and each group of data in the multiple groups of data comprises: the voice characteristics of the user voice and the control instruction corresponding to the user voice.
Optionally, in the speech recognition device for smart homes provided in the embodiment of the present application, the control unit 23 includes: and the control module is used for regulating and controlling the operation parameters of the target object to the target parameters according to the control instruction corresponding to the voice of the user, wherein the control instruction at least comprises the target object to be regulated and controlled in the smart home, the operation parameters to be regulated and controlled of the target object and the target parameters.
According to the voice recognition device for the smart home, the voice information for regulating and controlling the smart home is obtained through the obtaining unit 21; the splitting unit 22 splits the voice information to obtain user voice in the voice information; the regulation and control unit 23 generates a target control instruction according to the user voice, regulates and controls the smart home based on the target control instruction, and solves the technical problems that the smart home cannot effectively recognize the control instruction and the user experience degree is poor due to the defect of user voice recognition of the existing smart home in the related art, so that the technical effects of improving the voice recognition accuracy and improving the user experience feeling are achieved.
The voice recognition device for smart home comprises a processor and a memory, wherein the acquiring unit 21, the splitting unit 22, the regulating unit 23 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more than one, the speech recognition accuracy is improved by adjusting kernel parameters, and the user experience is improved.
The memory may include forms of volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a storage medium, wherein a program is stored on the storage medium, and the program realizes a voice recognition method of smart home when being executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein a voice recognition method of an intelligent home is executed when the program runs.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps: acquiring voice information for regulating and controlling the smart home; splitting the voice information to obtain user voice in the voice information; and generating a target control instruction according to the user voice, and regulating and controlling the smart home based on the target control instruction.
Optionally, splitting the voice information, and obtaining the user voice in the voice information includes: determining a plurality of starting frames and a plurality of ending frames in the voice information; determining a plurality of voice segments according to a plurality of starting frames and a plurality of ending frames, wherein the starting frames and the ending frames in the voice segments are adjacent; and combining the plurality of voice fragments into the user voice according to the sequence positions of the plurality of voice fragments in the voice information.
Optionally, determining a plurality of start frames and a plurality of end frames in the speech information comprises: determining a preset number of adjacent first prepared frames in the voice information, wherein a target parameter of the first prepared frames is greater than a first preset value, and the target parameter is a short-time energy value and/or a short-time zero-crossing rate; determining a preset number of adjacent second prepared frames in the voice information, wherein the target parameter of the second prepared frames is smaller than a second preset value; a first frame of the first preliminary frames is determined to be a starting frame and a first frame of the second preliminary frames is determined to be an ending frame.
Optionally, the combining the plurality of voice segments into the user voice according to the sequential positions of the plurality of voice segments in the voice message comprises: recognizing a plurality of voice fragments by using a first model, and determining a plurality of target voices, wherein the first model is a deep belief network model trained by machine learning by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a voice segment and a target voice contained in the voice segment, wherein the voice segment further comprises a noise segment; and combining the target voices into the user voice according to the sequential positions of the target voices in the voice information.
Optionally, the generating the target control instruction according to the user voice includes: acquiring voice characteristics of user voice, wherein the voice characteristics of the user voice at least comprise: mel-frequency cepstral coefficients of the user's speech; the method comprises the steps of identifying voice characteristics of user voice by using a second model, and determining a control instruction corresponding to the user voice, wherein the second model is a deep belief network model trained by using multiple groups of data through machine learning, and each group of data in the multiple groups of data comprises: the voice characteristics of the user voice and the control instruction corresponding to the user voice.
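A from-scratch sketch of extracting Mel-frequency cepstral coefficients for one frame, using only NumPy. The filter count, coefficient count, and Hamming window are common defaults assumed for illustration, not values specified by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def dct2(x, n_out):
    """Type-II DCT, keeping the first n_out coefficients."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(n_out)])

def mfcc(frame, sr=8000, n_filters=26, n_coeffs=13):
    """MFCCs of one frame: power spectrum -> mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    fb = mel_filterbank(n_filters, len(frame), sr)
    log_mel = np.log(fb @ spec + 1e-10)
    return dct2(log_mel, n_coeffs)
```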
Optionally, the regulating and controlling the smart home based on the target control instruction includes: regulating the operating parameters of the target object to the target parameters according to the control instruction corresponding to the user voice, wherein the control instruction at least comprises the target object to be regulated and controlled in the smart home, the operating parameters to be regulated and controlled of the target object, and the target parameters. The device herein may be a server, a PC, a tablet computer (PAD), a mobile phone, or the like.
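The final regulation step can be illustrated as follows; the `ControlInstruction` shape, device names, and parameters are hypothetical, chosen only to show a parsed instruction (target object, operating parameter, target value) driving the smart home.

```python
from dataclasses import dataclass

@dataclass
class ControlInstruction:
    target_object: str      # e.g. "air_conditioner"  (illustrative name)
    parameter: str          # e.g. "temperature"
    target_value: object    # e.g. 26

class SmartHome:
    """Toy controller holding per-device state; devices are assumptions."""
    def __init__(self):
        self.devices = {"air_conditioner": {"power": "off", "temperature": 24},
                        "humidifier": {"power": "off", "humidity": 40}}

    def apply(self, instr):
        """Set the instructed parameter on the target device and return its state."""
        device = self.devices[instr.target_object]
        if instr.parameter not in device:
            raise ValueError(f"unknown parameter {instr.parameter!r}")
        device[instr.parameter] = instr.target_value
        return device
```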
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device: acquiring voice information for regulating and controlling the smart home; splitting the voice information to obtain user voice in the voice information; and generating a target control instruction according to the user voice, and regulating and controlling the smart home based on the target control instruction.
Optionally, splitting the voice information, and obtaining the user voice in the voice information includes: determining a plurality of starting frames and a plurality of ending frames in the voice information; determining a plurality of voice segments according to a plurality of starting frames and a plurality of ending frames, wherein the starting frames and the ending frames in the voice segments are adjacent; and combining the plurality of voice fragments into the user voice according to the sequence positions of the plurality of voice fragments in the voice information.
Optionally, determining a plurality of start frames and a plurality of end frames in the speech information comprises: determining a preset number of adjacent first prepared frames in the voice information, wherein a target parameter of the first prepared frames is greater than a first preset value, and the target parameter is a short-time energy value and/or a short-time zero-crossing rate; determining a preset number of adjacent second prepared frames in the voice information, wherein the target parameter of the second prepared frames is smaller than a second preset value; a first frame of the first preliminary frames is determined to be a starting frame and a first frame of the second preliminary frames is determined to be an ending frame.
Optionally, the combining the plurality of voice segments into the user voice according to the sequential positions of the plurality of voice segments in the voice message comprises: recognizing a plurality of voice fragments by using a first model, and determining a plurality of target voices, wherein the first model is a deep belief network model trained by machine learning by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a voice segment and a target voice contained in the voice segment, wherein the voice segment further comprises a noise segment; and combining the target voices into the user voice according to the sequential positions of the target voices in the voice information.
Optionally, the generating the target control instruction according to the user voice includes: acquiring voice characteristics of user voice, wherein the voice characteristics of the user voice at least comprise: mel-frequency cepstral coefficients of the user's speech; the method comprises the steps of identifying voice characteristics of user voice by using a second model, and determining a control instruction corresponding to the user voice, wherein the second model is a deep belief network model trained by using multiple groups of data through machine learning, and each group of data in the multiple groups of data comprises: the voice characteristics of the user voice and the control instruction corresponding to the user voice.
Optionally, the regulating and controlling the smart home based on the target control instruction includes: and regulating and controlling the operation parameters of the target object to the target parameters according to the control instruction corresponding to the user voice, wherein the control instruction at least comprises the target object to be regulated and controlled in the smart home, the operation parameters to be regulated and controlled of the target object and the target parameters.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A voice recognition method for smart home is characterized by comprising the following steps:
acquiring voice information for regulating and controlling the smart home;
splitting the voice information to obtain user voice in the voice information;
and generating a target control instruction according to the user voice, and regulating and controlling the smart home based on the target control instruction.
2. The method of claim 1, wherein splitting the voice information to obtain the user voice in the voice information comprises:
determining a plurality of start frames and a plurality of end frames in the voice information;
determining a plurality of voice segments according to a plurality of starting frames and a plurality of ending frames, wherein the starting frames and the ending frames in the voice segments are adjacent;
and combining the plurality of voice fragments into the user voice according to the sequential positions of the plurality of voice fragments in the voice information.
3. The method of claim 2, wherein determining a plurality of start frames and a plurality of end frames in the speech information comprises:
determining a preset number of adjacent first prepared frames in the voice information, wherein a target parameter of the first prepared frames is greater than a first preset value, and the target parameter is a short-time energy value and/or a short-time zero-crossing rate;
determining a preset number of adjacent second prepared frames in the voice information, wherein the target parameter of the second prepared frames is smaller than a second preset value;
determining a first frame of the first preliminary frames as a start frame and determining a first frame of the second preliminary frames as an end frame.
4. The method of claim 2, wherein combining the plurality of speech segments into the user speech according to their sequential positions in the speech message comprises:
recognizing a plurality of voice fragments by using a first model, and determining a plurality of target voices, wherein the first model is a deep belief network model trained by machine learning by using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: a voice segment and a target voice contained in the voice segment;
and combining a plurality of target voices into the user voice according to the sequential positions of the target voices in the voice information.
5. The method of claim 1, wherein generating target control instructions from the user speech comprises:
acquiring voice characteristics of the user voice, wherein the voice characteristics of the user voice at least comprise: mel-frequency cepstrum coefficients of said user speech;
recognizing the voice characteristics of the user voice by using a second model, and determining a control instruction corresponding to the user voice, wherein the second model is a deep belief network model trained by machine learning by using multiple groups of data, and each group of data in the multiple groups of data comprises: the voice characteristics of the user voice and the control instruction corresponding to the user voice.
6. The method of claim 1, wherein conditioning the smart home based on the target control instruction comprises:
and regulating and controlling the operation parameters of the target object to the target parameters according to the control instruction corresponding to the user voice, wherein the control instruction at least comprises the target object to be regulated and controlled in the smart home, the operation parameters to be regulated and controlled of the target object and the target parameters.
7. A voice recognition device for smart home, characterized by comprising:
an acquiring unit, configured to acquire voice information for regulating and controlling the smart home;
a splitting unit, configured to split the voice information to obtain the user voice in the voice information;
and a regulating and controlling unit, configured to generate a target control instruction according to the user voice and regulate and control the smart home based on the target control instruction.
8. The apparatus of claim 7, wherein the splitting unit comprises:
a first determining module, configured to determine a plurality of start frames and a plurality of end frames in the speech information;
a second determining module, configured to determine a plurality of speech segments according to the plurality of start frames and the plurality of end frames, where the start frames and the end frames in the speech segments are adjacent to each other;
and the synthesis module is used for combining the plurality of voice segments into the user voice according to the sequence positions of the plurality of voice segments in the voice information.
9. A storage medium, characterized in that the storage medium includes a stored program, wherein the program performs the voice recognition method for smart home according to any one of claims 1 to 6.
10. A processor, wherein the processor is configured to execute a program, and when the program is executed, the method for speech recognition of smart home according to any one of claims 1 to 6 is performed.
CN201810873250.4A 2018-08-02 2018-08-02 Voice recognition method and device for smart home Pending CN110853631A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810873250.4A CN110853631A (en) 2018-08-02 2018-08-02 Voice recognition method and device for smart home

Publications (1)

Publication Number Publication Date
CN110853631A true CN110853631A (en) 2020-02-28

Family

ID=69595109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810873250.4A Pending CN110853631A (en) 2018-08-02 2018-08-02 Voice recognition method and device for smart home

Country Status (1)

Country Link
CN (1) CN110853631A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111540343A (en) * 2020-03-17 2020-08-14 北京捷通华声科技股份有限公司 Corpus identification method and apparatus
CN112364779A (en) * 2020-11-12 2021-02-12 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN113380242A (en) * 2021-05-26 2021-09-10 广州朗国电子科技有限公司 Method and system for controlling multimedia playing content through voice

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN104464727A (en) * 2014-12-11 2015-03-25 福州大学 Single-channel music singing separation method based on deep belief network
US20150255090A1 (en) * 2014-03-10 2015-09-10 Samsung Electro-Mechanics Co., Ltd. Method and apparatus for detecting speech segment
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling
CN105657535A (en) * 2015-12-29 2016-06-08 北京搜狗科技发展有限公司 Audio recognition method and device
CN105702250A (en) * 2016-01-06 2016-06-22 福建天晴数码有限公司 Voice recognition method and device
CN106328123A (en) * 2016-08-25 2017-01-11 苏州大学 Method of recognizing ear speech in normal speech flow under condition of small database
CN106504756A (en) * 2016-12-02 2017-03-15 珠海市杰理科技股份有限公司 Built-in speech recognition system and method
CN106601233A (en) * 2016-12-22 2017-04-26 北京元心科技有限公司 Voice command recognition method and device and electronic equipment
CN107039036A (en) * 2017-02-17 2017-08-11 南京邮电大学 A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition
CN107481718A (en) * 2017-09-20 2017-12-15 广东欧珀移动通信有限公司 Audio recognition method, device, storage medium and electronic equipment


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111540343A (en) * 2020-03-17 2020-08-14 北京捷通华声科技股份有限公司 Corpus identification method and apparatus
CN111540343B (en) * 2020-03-17 2021-02-05 北京捷通华声科技股份有限公司 Corpus identification method and apparatus
CN112364779A (en) * 2020-11-12 2021-02-12 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112364779B (en) * 2020-11-12 2022-10-21 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN113380242A (en) * 2021-05-26 2021-09-10 广州朗国电子科技有限公司 Method and system for controlling multimedia playing content through voice

Similar Documents

Publication Publication Date Title
US11670325B2 (en) Voice activity detection using a soft decision mechanism
US9875739B2 (en) Speaker separation in diarization
KR20160014625A (en) Method and system for identifying location associated with voice command to control home appliance
JP7008638B2 (en) voice recognition
US20220317641A1 (en) Device control method, conflict processing method, corresponding apparatus and electronic device
CN107004409B (en) Neural network voice activity detection using run range normalization
CN110853631A (en) Voice recognition method and device for smart home
CN108831459B (en) Voice recognition method and device
CN102568478A (en) Video play control method and system based on voice recognition
US20180122377A1 (en) Voice interaction apparatus and voice interaction method
CN108899033B (en) Method and device for determining speaker characteristics
KR20200119377A (en) Method and apparatus for implementing neural network for identifying speaker
CN108288465A (en) Intelligent voice segmentation (axis-cutting) method, information data processing terminal, and computer program
CN109671430B (en) Voice processing method and device
EP3133472B1 (en) Adaptive user interface for an hvac system
CN116490920A (en) Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system
CN106887233A (en) Audio data processing method and system
CN109783049A (en) Operation control method, apparatus, device and storage medium
CN111105798B (en) Equipment control method based on voice recognition
CN111128174A (en) Voice information processing method, device, equipment and medium
US9978393B1 (en) System and method for automatically removing noise defects from sound recordings
CN110853642B (en) Voice control method and device, household appliance and storage medium
WO2020220345A1 (en) Voice command recognition method and apparatus
CN108573712B (en) Voice activity detection model generation method and system and voice activity detection method and system
CN112017662B (en) Control instruction determining method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2020-02-28