WO2018054361A1 - Environment-adaptive method for speech recognition, speech recognition device, and household appliance - Google Patents

Environment-adaptive method for speech recognition, speech recognition device, and household appliance

Info

Publication number
WO2018054361A1
WO2018054361A1 (PCT/CN2017/103017)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
environment
model
voice
speech recognition
Application number
PCT/CN2017/103017
Other languages
English (en)
French (fr)
Inventor
石周
唐红强
Original Assignee
合肥华凌股份有限公司
合肥美的电冰箱有限公司
美的集团股份有限公司
Application filed by 合肥华凌股份有限公司, 合肥美的电冰箱有限公司, 美的集团股份有限公司 filed Critical 合肥华凌股份有限公司
Publication of WO2018054361A1 publication Critical patent/WO2018054361A1/zh

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 — Hidden Markov Models [HMMs]
    • G10L15/144 — Training of HMMs
    • G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • the invention belongs to the technical field of electrical appliances manufacturing, and in particular relates to an environment adaptive method for voice recognition, and a voice recognition device and a home appliance including the voice recognition device.
  • currently, in a laboratory environment, the recognition rate of speech recognition has reached a relatively high level.
  • however, because the working environment of home appliances is complex and environmental noise is high, the recognition rate suffers there. If a traditional speech recognition algorithm is applied directly to a home appliance system, it will be affected by environmental noise.
  • no existing voice interaction system provides a scheme specifically optimized for the working environment of home appliances. Therefore, improving the robustness of a speech recognition system in such a use environment is the key to its application.
  • the present invention aims to solve at least one of the technical problems in the related art to some extent.
  • the present invention needs to propose an environment adaptive method for speech recognition, which can reduce the influence of environmental noise on speech recognition and improve the robustness of speech recognition.
  • the present invention also proposes a voice recognition apparatus and a home appliance including the voice recognition apparatus.
  • an environment adaptive method for speech recognition includes the steps of: acquiring speech information in the current environment; extracting speech features from the speech information and performing environment adaptive processing on the speech features; and obtaining, based on an acoustic model and a language model, the word sequence corresponding to the maximum probability of the speech features.
  • the environment adaptive method for speech recognition according to the embodiment of the present invention uses environment adaptive processing in the feature domain to remove environmental noise during feature extraction, reducing the influence of background noise on speech recognition in the actual application environment and improving the robustness of speech recognition in that environment.
  • obtaining the word sequence corresponding to the maximum probability of the speech features according to the acoustic model and the language model further comprises: calculating an acoustic probability of the speech features according to the acoustic model and a linguistic probability of the speech features according to the language model; and performing a search based on the acoustic and linguistic probabilities to obtain the word sequence corresponding to the maximum probability of the speech features.
  • the speech features are subjected to environment adaptive processing by at least one of the following methods: a feature mapping method; a vocal tract length normalization method; and a cepstrum mean normalization method.
  • the environment adaptive method further comprises: performing environment adaptive processing of the model domain based on the training speech and the ambient speech when the acoustic model is trained.
  • the environment adaptive processing of the model domain can reduce the impact of environmental noise on speech recognition during model training.
  • performing environment adaptive processing of the model domain further includes: for a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) model, performing environment adaptive processing using a maximum a posteriori probability method or a transform-based method; for a DNN-HMM model, fitting the network weights of the DNN based on the training speech, or adding a transform layer to the DNN structure, or performing environment adaptive processing using an iVector-based method or an encoding-based method.
  • the training speech is collected in one of the following ways: recording the training speech and the ambient speech separately in the actual environment; or recording the ambient speech in the actual environment, recording pure speech in a laboratory, and superimposing the ambient speech on the pure speech to obtain the training speech. Training speech containing the specific environmental noise can thereby be obtained.
  • a speech recognition apparatus includes: an acquisition module, configured to acquire voice information in the current environment; an extraction module, configured to extract speech features of the voice information; an adaptive module, configured to perform environment adaptive processing on the speech features; a model module, configured to provide an acoustic model and a language model; and an identification module, configured to obtain the word sequence corresponding to the maximum probability of the speech features according to the acoustic model and the language model.
  • the speech recognition apparatus of the embodiment of the present invention removes environmental noise during feature extraction through the adaptive module's processing in the feature domain, reducing the influence of background noise on speech recognition in the actual application environment and improving the robustness of speech recognition in that environment.
  • the identification module is further configured to calculate an acoustic probability of the speech feature according to the acoustic model, and calculate a language probability of the speech feature according to the language model, according to the acoustic probability A search is made with the language probabilities to obtain a sequence of words corresponding to the greatest probability of the speech features.
  • the adaptive module performs environment adaptive processing on the speech features by at least one of the following methods: a feature mapping method, a vocal tract length normalization method, and a cepstrum mean normalization method.
  • the adaptive module is further configured to perform environment adaptive processing of the model domain based on the training speech and the environmental speech during model training of the acoustic model.
  • the environment adaptive processing of the model domain can reduce the influence of background noise on speech recognition during model training.
  • the adaptive module is further configured to perform environment adaptive processing using a maximum a posteriori probability method or a transform-based method for the GMM-HMM model; or, for the DNN-HMM model, to fit the network weights of the DNN based on the training speech, add a transform layer to the DNN structure, or perform environment adaptive processing using an iVector-based or encoding-based method.
  • the voice recognition apparatus further includes: a collection module, configured to collect the training speech in one of the following ways: recording the training speech and the ambient speech separately in the actual environment; or recording the ambient speech in the actual environment, recording pure speech in a laboratory, and superimposing the ambient speech on the pure speech to obtain the training speech, so that training speech containing the specific environmental noise can be obtained.
  • the home appliance of still another aspect of the present invention includes: a body; and the above-described voice recognition apparatus.
  • the household appliance can reduce the influence of background noise on speech recognition by adopting the above-mentioned speech recognition device, and improve the robustness of speech recognition in a working environment.
  • Also disclosed in some embodiments of the present invention is a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the environment adaptive method of speech recognition.
  • a refrigerator includes a memory, a processor, and a computer program stored on the memory and operable on the processor; when the processor executes the program, the environment adaptive method of speech recognition is implemented.
  • FIG. 1 is a schematic diagram of a basic framework of speech recognition in accordance with one embodiment of the present invention.
  • FIG. 2 is a schematic structural view of a GMM-HMM model according to the present invention.
  • FIG. 3 is a schematic structural view of a DNN-HMM model according to the present invention.
  • FIG. 4 is a flowchart of an environment adaptive method for speech recognition according to an embodiment of the present invention.
  • FIG. 5 is a schematic illustration of a DNN network in accordance with an embodiment of the present invention.
  • FIG. 6 is a block diagram of a voice recognition apparatus according to an embodiment of the present invention.
  • FIG. 7 is a block diagram of a speech recognition apparatus in accordance with one embodiment of the present invention.
  • FIG. 8 is a block diagram of a home appliance in accordance with an embodiment of the present invention.
  • the basic framework of speech recognition is introduced: the acoustic model is formed by feature extraction and acoustic modeling of the training speech; the language model is formed by language modeling of the training corpus; features are extracted from the input speech; and the recognition result is obtained by the decoder according to the language model and the acoustic model.
  • the speech features mainly include MFCC (Mel Frequency Cepstrum Coefficient) parameters based on auditory perception, and Perceptual Linear Prediction (PLP) parameters.
  • the language model is a description of language.
  • the statistical language model of N-gram is more commonly used.
  • the basic idea is to use a Markov chain to represent the process of generating word sequences; that is, in the sequence, the probability of occurrence of the k-th word is closely related only to the previous n-1 words.
  • the model parameters of the language model can be estimated based on the frequency of the combination of individual words and related words in the corpus.
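As a minimal illustration of the N-gram idea above, the following sketch estimates bigram probabilities (n = 2) from relative frequencies in a toy corpus; the corpus and the `train_bigram` helper are illustrative, not taken from the patent:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w_k | w_{k-1}) from bigram / unigram frequency counts."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()   # <s> marks the sentence start
        for prev, cur in zip(words, words[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    # P(cur | prev) = count(prev, cur) / count(prev)
    return lambda prev, cur: bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

corpus = ["open the door", "open the fridge", "close the door"]
p = train_bigram(corpus)
print(p("the", "door"))   # "door" follows "the" in 2 of 3 cases
```

Real systems smooth these counts (e.g. back-off) so unseen combinations do not get zero probability.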
  • the role of the decoder is to combine the acoustic probabilities of the speech features calculated by the acoustic model with the linguistic probabilities calculated by the language model, obtaining the most likely word sequence by means of a search.
  • the acoustic model is a description of the sound characteristics and is the core part of the speech recognition system. Several acoustic models are described below in conjunction with Figures 2 and 3.
  • the traditional acoustic model includes the GMM-HMM model, which can be described by two state sets and three probability distributions.
  • the two state sets are an observable state O and a hidden state S: the observable state O is, as the name implies, a state that can be observed; the hidden state S conforms to the Markov property, that is, the state at time t is related only to the state at time t-1, and it cannot be observed directly.
  • the three probability distributions are an initial state probability matrix, a state transition matrix, and an observation state output probability: the initial state probability matrix expresses the probability distribution over the hidden states at the initial time; the state transition matrix expresses the transition probability between hidden states from time t to t+1; and the observation state output probability expresses the probability of observing O under the condition that the hidden state is S.
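These quantities can be made concrete with a toy two-state HMM; the forward algorithm below computes the total probability of an observation sequence from exactly the three distributions described above (the numbers are invented for illustration):

```python
def forward(pi, A, B, obs):
    """Total probability of an observation sequence under an HMM
    (pi: initial state probs, A: state transition matrix,
    B: observation output probs, obs: observation indices)."""
    n = len(pi)
    # alpha[i] = P(observations so far, hidden state i at current time)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[j] * A[j][i] for j in range(n)) * B[i][o]
                 for i in range(n)]
    return sum(alpha)

pi = [0.6, 0.4]                   # initial hidden-state distribution
A  = [[0.7, 0.3], [0.4, 0.6]]     # hidden-state transition matrix
B  = [[0.9, 0.1], [0.2, 0.8]]     # observation output probabilities
print(forward(pi, A, B, [0, 1]))  # ≈ 0.209
```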
  • FIG. 2 is a schematic structural diagram of a GMM-HMM model according to an embodiment of the present invention, wherein features are extracted from the framed speech signal, the GMM describes the observation probability distribution of the features, and the HMM describes the transition probabilities of the hidden states and their relationship to the observations of the GMM.
  • FIG. 3 is a structural diagram of a DNN-HMM model according to a specific embodiment of the present invention, wherein the DNN-HMM model replaces the GMM with a deep neural network (DNN) as the model describing the observation probability distribution of the features, while the HMM describes the transition probabilities of the hidden states and their relationship to the observation samples of the DNN.
  • the environment adaptive method for speech recognition in the embodiment of the present invention performs environment adaptive processing in both the feature domain and the model domain, improving the robustness of speech recognition in the use environment.
  • FIG. 4 is a flowchart of an environment adaptive method for speech recognition according to an embodiment of the present invention. As shown in FIG. 4, the environment adaptive method for speech recognition includes the following steps:
  • S1: acquire voice information in the current environment, for example, in the working environment of a home appliance such as a refrigerator.
  • S2: the MFCC parameters and PLP parameters of the voice information are extracted, and the extracted speech features are subjected to environment adaptive processing, that is, environment adaptive processing of the feature domain. The influence of environmental noise is thus reduced in the feature domain: the background noise is removed during feature extraction, so that speech in the actual application environment can be recognized better.
  • the speech features may be subjected to environment adaptive processing by at least one of the following methods: a feature mapping method, a vocal tract length normalization method, or a cepstrum mean normalization method; of course, other methods of implementing environment adaptive processing in the feature domain exist and are not listed here.
  • taking cepstrum mean normalization (CMN) and cepstrum variance normalization (CVN) as an example, the normalized features are

    X_CMN(n) = X(n) - E[X(n)]

    X_CVN(n) = (X(n) - E[X(n)]) / σ(n)

  • where X(n) is the cepstral feature parameter vector, n denotes the dimension, X_CMN(n) is the mean-normalized (first-order) cepstrum vector, X_CVN(n) is the variance-normalized (second-order) cepstrum vector, E denotes the mathematical expectation, and σ denotes the standard deviation.
  • the third and fourth moments can also be similarly normalized so that their distribution conforms to the standard Gaussian distribution, eliminating distortion caused by environmental noise.
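A pure-Python sketch of the mean and variance normalization described above, applied per cepstral dimension over a set of frames; the `cmvn` helper and the toy frames are illustrative:

```python
import math

def cmvn(frames):
    """Cepstrum mean and variance normalization: per dimension n,
    subtract the mean E[X(n)] and divide by the standard deviation."""
    dims, count = len(frames[0]), len(frames)
    means = [sum(f[n] for f in frames) / count for n in range(dims)]
    # guard against a zero standard deviation with `or 1.0`
    stds = [math.sqrt(sum((f[n] - means[n]) ** 2 for f in frames) / count) or 1.0
            for n in range(dims)]
    return [[(f[n] - means[n]) / stds[n] for n in range(dims)] for f in frames]

frames = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]   # toy 2-dimensional cepstra
out = cmvn(frames)
```

After normalization each dimension has zero mean and unit variance, which removes constant (e.g. channel) offsets from the features.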
  • for the process of performing environment adaptive processing using the feature mapping method and the vocal tract length normalization method, reference may be made to the description in the related art.
  • S3: the acoustic probability of the speech features is calculated according to the acoustic model, and the linguistic probability of the speech features is calculated according to the language model; a search based on the acoustic and linguistic probabilities then yields the word sequence corresponding to the maximum probability of the speech features. As shown in FIG. 1, based on the acoustic model and the language model, the decoder performs the probability calculations and searches for the optimal word sequence, thereby implementing speech recognition.
  • the specific calculation and search process can be found in the related art.
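The combination step can be sketched in miniature: score each candidate word sequence by its acoustic and language probabilities in the log domain and keep the maximum. The candidates and their probabilities below are invented for illustration; a real decoder searches a lattice rather than an explicit list:

```python
import math

def decode(candidates):
    """Combine acoustic and language probabilities in the log domain and
    return the word sequence with the maximum combined probability."""
    return max(candidates,
               key=lambda c: math.log(c["p_acoustic"]) + math.log(c["p_language"]))["words"]

candidates = [
    {"words": "open the door",   "p_acoustic": 1e-5, "p_language": 0.03},
    {"words": "open the drawer", "p_acoustic": 2e-5, "p_language": 0.01},
]
print(decode(candidates))
```

Log-domain addition avoids numerical underflow when many small probabilities are multiplied.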
  • the environment adaptive method for speech recognition can remove the environmental noise in the feature extraction process by using the environment adaptive processing in the feature domain, and reduce the influence of background noise on the speech recognition in the actual application environment. It can improve the robustness of speech recognition in practical applications.
  • the environment adaptive method for speech recognition also proposes an environment adaptive operation in the model domain, that is, removing the noise influence of the environment during model training.
  • the environment adaptive processing of the model domain is performed based on the training speech and the environmental speech.
  • the training speech can be understood as a set of speech containing the required semantics, and this part of the speech needs to be labeled. For example, many recordings of the word "hello" can be collected in the environment to train the "hello" speech model.
  • Ambient speech can be understood as a collection of different speeches in the context of use, which can be used to train background models. It can be understood that both the training speech and the environmental speech are speech with environmental noise, and can express the distribution of speech in the use environment.
  • the training speech can be collected in one of the following ways. One way is to record the training speech and the ambient speech separately in the actual environment, for example, directly using the home appliance to record the training speech and ambient speech in the actual use environment, to facilitate the adaptive operation in the model domain; it can be understood that both the training speech and the ambient speech are then voice data carrying the specific actual environment. The other way is to record the ambient speech in the actual environment, record pure speech in a laboratory, and superimpose the ambient speech on the pure speech to obtain the training speech.
  • the pure speech can be understood as human speech without background noise. In most cases, training speech acquisition is done in the laboratory: a large amount of ambient speech is recorded in the actual working environment through the home appliance, and the ambient speech is superimposed on the pure speech recorded in the laboratory, yielding training speech containing the specific environmental noise of the actual working environment.
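The superposition step can be sketched as scaling the ambient signal to a target signal-to-noise ratio before adding it to the pure speech. The SNR-based weighting and the `mix_at_snr` helper are my assumptions, since the patent does not specify how the superposition is weighted:

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale the ambient signal so the clean-to-noise power ratio equals
    the target SNR (dB), then superimpose it on the pure speech."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

clean = [1.0, -1.0, 1.0, -1.0]   # stand-in for laboratory pure speech
noise = [0.5, 0.5, -0.5, -0.5]   # stand-in for recorded ambient speech
mixed = mix_at_snr(clean, noise, 10.0)
```

Sweeping `snr_db` over a range of values yields training speech covering different noise severities of the same environment.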
  • the environment adaptive processing of the model domain can adopt different methods for different models.
  • for the GMM-HMM model, a maximum a posteriori probability method or a transform-based method may be used for the environment adaptive processing.
  • other adaptive methods that can be implemented may also be adopted.
  • the environment adaptive method based on the maximum a posteriori probability is based on the Bayesian criterion: the model parameters are modified through the prior probability so as to maximize the posterior probability of the observed data.
  • first, a background model describing all possible environmental conditions is trained from ambient speech collected in different environments; since it covers a large number of different background speeches, the model can be considered free of the speech distribution of any specific background;
  • then the acoustic model is obtained by re-estimating the background model parameters based on the training speech. It can be understood that, unlike the related art in which the acoustic model is trained directly on training speech, the trained background model here covers the speech distribution of all training environments; the resulting acoustic model is not based on pure-speech training but incorporates a variety of possible noise environments, and the re-estimated acoustic model reflects that distribution, thus eliminating the influence of the environment in which the speech is used.
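A one-dimensional sketch of MAP re-estimation of a Gaussian mean, blending the background-model (prior) mean with the adaptation data; the relevance factor `tau` and the closed-form update are the standard textbook form, not quoted from the patent:

```python
def map_adapt_mean(prior_mean, tau, data):
    """MAP update of a Gaussian mean: a tau-weighted blend of the
    background-model mean and the adaptation-data sample mean."""
    return (tau * prior_mean + sum(data)) / (tau + len(data))

# Four adaptation samples at 1.0 pull the prior mean 0.0 toward 1.0
adapted = map_adapt_mean(0.0, 2.0, [1.0, 1.0, 1.0, 1.0])
```

With more adaptation data the estimate approaches the sample mean; with none it stays at the prior, which is why MAP benefits from large amounts of adaptation data.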
  • a transform-based method, such as the maximum likelihood linear regression method, finds a transformation relationship and transforms the model parameters so that the loss function converges on the training data set.
  • an environment-independent background model is first trained, and the transformation relationship of the target speech with respect to it is then estimated, adapting the environment-independent speech recognition system to the target environment.
  • when the amount of adaptation data is large, the method based on the maximum a posteriori probability performs better; when the adaptation data are limited, the transform-based method can obtain a better effect than the method based on the maximum a posteriori probability.
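A one-dimensional analogue of the transform-based idea: estimate an affine map mu' = a * mu + b from environment-independent model means to means observed in the target environment by least squares. Real MLLR estimates a matrix transform per regression class; this scalar version, with invented numbers, is only illustrative:

```python
def estimate_affine(model_means, target_means):
    """Least-squares fit of mu' = a * mu + b mapping environment-independent
    model means to means observed in the target environment."""
    n = len(model_means)
    mx = sum(model_means) / n
    my = sum(target_means) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(model_means, target_means))
         / sum((x - mx) ** 2 for x in model_means))
    return a, my - a * mx

# Three Gaussian means shifted and scaled by the target environment
a, b = estimate_affine([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

Because one transform is shared across many Gaussians, it can be estimated reliably from little adaptation data, matching the comparison above.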
  • for the DNN-HMM model, the network weights of the DNN can be fitted based on the training speech, or a transformation layer can be added to the DNN structure, or environment adaptive processing can be performed using an iVector-based method or an encoding-based method.
  • the DNN-HMM model can also be used; however, the above maximum a posteriori probability method and transform-based method are not applicable to the DNN-HMM model.
  • one way is to adjust the weights of the DNN network. The most intuitive method is to fit the network weights directly using the speech data in the target environment (the actual application environment), but this is very prone to over-fitting.
  • a transform layer is added to the DNN structure, and the transform layer is re-evaluated using the training speech in the target environment.
  • FIG. 5 shows a schematic diagram of the DNN structure according to an embodiment of the present invention. First, a DNN is trained; then a linear transformation layer is inserted after the input layer, and its network parameters are re-estimated using training speech from different environments. Similarly, a linear transformation layer can be inserted before the output layer.
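The inserted linear transformation layer can be sketched as follows: initialized to the identity, it leaves the pre-trained DNN's input unchanged, and only its parameters would later be re-estimated on environment-specific training speech (pure-Python sketch; the helper names are mine):

```python
def linear(x, W, b):
    """Apply one linear layer: y_i = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def identity_transform(dim):
    """Identity-initialized transform layer: W = I, b = 0, so inserting it
    does not change what the frozen, pre-trained DNN layers receive."""
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    return W, [0.0] * dim

x = [0.3, -1.2, 0.7]              # input feature vector
W, b = identity_transform(len(x))
y = linear(x, W, b)               # identical to x before re-estimation
```

Training only this small layer on target-environment speech adapts the network while avoiding the over-fitting risk of re-fitting all DNN weights.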
  • the environment adaptive method for speech recognition, for a speech recognition system of home appliances, discloses an adaptive scheme for eliminating the influence of background noise in a specific working environment, including environment adaptive processing in both the feature domain and the model domain.
  • FIG. 6 is a block diagram of a voice recognition apparatus according to an embodiment of the present invention.
  • the voice recognition apparatus 100 includes an acquisition module 10, an extraction module 20, an adaptation module 30, a model module 40, and an identification module 50.
  • the obtaining module 10 is configured to obtain voice information in the current environment
  • the extracting module 20 is configured to extract voice features of the voice information, for example, extract MFCC parameters, PLP parameters, and the like of the voice information.
  • the adaptive module 30 is configured to perform environment adaptive processing on the speech features, that is, environment adaptive processing of the feature domain, reducing the influence of ambient noise in the feature domain; the background noise is removed during feature extraction, so that speech in the actual application environment can be recognized better.
  • the adaptive module 30 may perform environment adaptive processing on the speech features by at least one of the following methods: a feature mapping method, a vocal tract length normalization method, or a cepstrum mean normalization method; other methods that implement environment adaptive processing of the feature domain may also be adopted and are not enumerated here.
  • the model module 40 is used to provide an acoustic model and a language model.
  • the acoustic model is a description of the sound characteristics and is the core part of the speech recognition system.
  • FIG. 2 and FIG. 3 are schematic diagrams of typical acoustic models.
  • the language model is a description of the language; in a speech recognition framework based on statistical learning, the N-gram statistical language model is commonly used.
  • the recognition module 50 obtains the word sequence corresponding to the maximum probability of the speech features based on the acoustic model and the language model. Specifically, the recognition module 50 calculates the acoustic probability of the speech features according to the acoustic model, calculates the language probability of the speech features according to the language model, and searches according to the acoustic and language probabilities to obtain the word sequence corresponding to the maximum probability of the speech features, thereby implementing speech recognition. For the specific calculation and search process, refer to the related art.
  • the speech recognition apparatus of the embodiment of the present invention removes environmental noise during feature extraction through the adaptive module's processing in the feature domain, reducing the influence of background noise on speech recognition in the actual application environment and improving the robustness of speech recognition in that environment.
  • the adaptive module 30 is further configured to perform environment adaptive processing of the model domain based on the training speech and the environmental speech during model training of the acoustic model.
  • the voice recognition apparatus 100 further includes a collection module 60 configured to collect the training speech in one of the following ways: recording the training speech and the ambient speech separately in the actual environment; or recording the ambient speech in the actual environment, recording pure speech in a laboratory, and superimposing the ambient speech on the pure speech to obtain the training speech.
  • the pure speech can be understood as human speech without background noise.
  • the environment adaptive processing of the model domain can adopt different methods for different models.
  • for the GMM-HMM model, the adaptive module 30 may perform the environment adaptive processing using a maximum a posteriori probability method or a transform-based method.
  • in the environment adaptive method based on the maximum a posteriori probability, a background model covering all possible environmental conditions is first trained from ambient speech collected in different environments; since it covers a large number of different backgrounds, the model can be considered free of the speech distribution of any specific background; the background model parameters are then re-estimated based on the training speech to obtain the acoustic model.
  • a transform-based method, such as the maximum likelihood linear regression method, first trains an environment-independent background model and then estimates the transformation relationship of the target speech with respect to it, adapting the environment-independent speech recognition system to the target environment.
  • when the amount of adaptation data is large, the method based on the maximum a posteriori probability performs better; when the adaptation data are limited, the transform-based method can obtain a better effect than the method based on the maximum a posteriori probability.
  • for the DNN-HMM model, the adaptation module 30 may fit the network weights of the DNN based on the training speech, add a transformation layer to the DNN structure as shown in FIG. 5, or perform environment adaptive processing using an iVector-based or encoding-based method.
  • other adaptive methods that can be used for the DNN-HMM model can also be used.
  • the speech recognition apparatus 100 of the embodiment of the present invention adopts environment adaptive methods to remove the influence of environmental noise on speech recognition, including adaptive operation in the feature domain and in the model domain, applying both adaptive techniques to speech recognition simultaneously; a method of collecting speaker speech containing ambient noise is also given.
  • a home appliance 1000 such as a refrigerator, according to an embodiment of the present invention includes a body 200 and the voice recognition apparatus 100 proposed in the above aspect.
  • the home appliance 1000 can reduce the influence of background noise on speech recognition by using the above-described speech recognition apparatus 100, and improve the robustness of speech recognition in a work environment.
  • Some embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements an environment adaptive method of speech recognition of the above-described embodiments.
  • a refrigerator comprising a memory, a processor, and a computer program stored on the memory and operable on the processor; when the processor executes the computer program, the environment adaptive method for speech recognition of the embodiments of the above aspect is implemented.
  • any process or method description in the flowcharts or otherwise described herein may be understood to represent modules, segments, or portions of code including one or more executable instructions for implementing a particular logical function or step of the process, and the scope of preferred embodiments of the invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.
  • a "computer-readable medium” can be any apparatus that can contain, store, communicate, propagate, or transport a program for use in an instruction execution system, apparatus, or device, or in conjunction with the instruction execution system, apparatus, or device.
  • more specific examples of computer readable media include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CD-ROM).
  • the computer readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then storing it in a computer memory.
  • portions of the invention may be implemented in hardware, software, firmware or a combination thereof.
  • multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
  • a suitable instruction execution system For example, if implemented in hardware, as in another embodiment, it can be implemented by any one or combination of the following techniques well known in the art: having logic gates for implementing logic functions on data signals. Discrete logic circuits, application specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

An environment-adaptive method for speech recognition, a speech recognition apparatus, and a household appliance. The method comprises the following steps: acquiring speech information in the current environment (S1); extracting speech features from the speech information, and performing environment-adaptive processing on the speech features (S2); and obtaining, according to an acoustic model and a language model, a word sequence with the maximum probability corresponding to the speech features (S3).

Description

Environment-Adaptive Method for Speech Recognition, Speech Recognition Apparatus, and Household Appliance
Technical Field
The present invention belongs to the technical field of electrical appliance manufacturing, and in particular relates to an environment-adaptive method for speech recognition, as well as a speech recognition apparatus and a household appliance comprising the speech recognition apparatus.
Background Art
With the development of integrated circuits, artificial intelligence, and Internet technology, the traditional white-goods industry has taken on a new positioning: household appliances not only provide their traditional functions but have also become part of the smart home network, able to provide family members with more intelligent services. However, the traditional ways of controlling appliances can no longer satisfy the demand for more convenient human-machine interaction, and the application of voice control has become a future development trend.
At present, in laboratory environments, the recognition rate of speech recognition has reached a rather high level; however, because the working environment of household appliances is highly complex and environmental noise is high, the recognition rate remains a problem. If a traditional speech recognition algorithm is applied directly to an appliance system, it will be affected by environmental noise, and at present no voice interaction system provides a solution specifically optimized for the working environment of household appliances. Improving the robustness of a speech recognition system in such usage environments is therefore the key to its applicability.
Summary of the Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, the present invention proposes an environment-adaptive method for speech recognition that can reduce the influence of environmental noise on speech recognition and improve the robustness of speech recognition.
The present invention further proposes a speech recognition apparatus and a household appliance comprising the speech recognition apparatus.
To solve the above problems, one aspect of the present invention provides an environment-adaptive method for speech recognition, comprising the following steps: acquiring speech information in the current environment; extracting speech features from the speech information, and performing environment-adaptive processing on the speech features; and obtaining, according to an acoustic model and a language model, a word sequence with the maximum probability corresponding to the speech features.
In the environment-adaptive method for speech recognition of the embodiments of the present invention, environment-adaptive processing in the feature domain removes environmental noise during feature extraction, reduces the influence of background noise on speech recognition in real application environments, and improves the robustness of speech recognition in those environments.
In some embodiments of the present invention, obtaining, according to the acoustic model and the language model, the word sequence with the maximum probability corresponding to the speech features further comprises: computing an acoustic probability of the speech features according to the acoustic model, and computing a linguistic probability of the speech features according to the language model; and searching, based on the acoustic probability and the linguistic probability, to obtain the word sequence with the maximum probability corresponding to the speech features.
Specifically, the environment-adaptive processing of the speech features is performed by at least one of the following methods: a feature mapping method; a vocal tract length normalization method; a cepstral mean normalization method.
In some embodiments of the present invention, the environment-adaptive method further comprises: during model training of the acoustic model, performing model-domain environment-adaptive processing based on training speech and environment speech. Model-domain environment-adaptive processing reduces the influence of environmental noise on speech recognition at model training time.
Specifically, performing model-domain environment-adaptive processing further comprises: for a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) model, performing environment-adaptive processing by a maximum a posteriori probability method or a transformation-based method;
for a DNN-HMM (Deep Neural Network-Hidden Markov Model) model, fitting network weights of the DNN based on the training speech, or adding a transformation layer to the DNN structure, or performing environment-adaptive processing by an i-vector-based method, or performing environment-adaptive processing by a coding-based method.
Specifically, the training speech is collected in one of the following ways: recording the training speech and the environment speech separately in an actual environment; or recording the environment speech in the actual environment, recording clean speech in a laboratory, and superimposing the environment speech on the clean speech to obtain the training speech. Training speech containing the noise of a specific environment can thus be obtained.
To solve the above problems, another aspect of the present invention provides a speech recognition apparatus, comprising: an acquisition module, configured to acquire speech information in the current environment; an extraction module, configured to extract speech features from the speech information; an adaptation module, configured to perform environment-adaptive processing on the speech features; a model module, configured to provide an acoustic model and a language model; and a recognition module, configured to obtain, according to the acoustic model and the language model, a word sequence with the maximum probability corresponding to the speech features.
In the speech recognition apparatus of the embodiments of the present invention, the feature-domain environment-adaptive processing performed by the adaptation module removes environmental noise during feature extraction, reduces the influence of background noise on speech recognition in real application environments, and improves the robustness of speech recognition in those environments.
In some embodiments of the present invention, the recognition module is further configured to compute an acoustic probability of the speech features according to the acoustic model, compute a linguistic probability of the speech features according to the language model, and search, based on the acoustic probability and the linguistic probability, to obtain the word sequence with the maximum probability corresponding to the speech features.
Specifically, the adaptation module performs environment-adaptive processing on the speech features by at least one of the following methods: a feature mapping method; a vocal tract length normalization method; a cepstral mean normalization method.
In some embodiments of the present invention, the adaptation module is further configured to perform, during model training of the acoustic model, model-domain environment-adaptive processing based on training speech and environment speech. Model-domain environment-adaptive processing reduces the influence of background noise on speech recognition at model training time.
Specifically, the adaptation module is further configured to: for a GMM-HMM model, perform environment-adaptive processing by a maximum a posteriori probability method or a transformation-based method; or, for a DNN-HMM model, fit network weights of the DNN based on the training speech, or add a transformation layer to the DNN structure, or perform environment-adaptive processing by an i-vector-based method, or perform environment-adaptive processing by a coding-based method.
In some embodiments of the present invention, the speech recognition apparatus further comprises: a collection module, configured to collect the training speech in one of the following ways: recording the training speech and the environment speech separately in an actual environment; or recording the environment speech in the actual environment, recording clean speech in a laboratory, and superimposing the environment speech on the clean speech to obtain the training speech, thereby obtaining training speech that includes the noise of a specific environment.
Based on the speech recognition apparatus described above, a further aspect of the present invention provides a household appliance, comprising: a body; and the speech recognition apparatus described above.
By employing the above speech recognition apparatus, the household appliance can reduce the influence of background noise on speech recognition and improve the robustness of speech recognition in the working environment.
Some embodiments of the present invention further provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the environment-adaptive method for speech recognition described above.
Some embodiments of the present invention further provide a refrigerator comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the environment-adaptive method for speech recognition described above is implemented.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the basic framework of speech recognition according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a GMM-HMM model according to the present invention;
FIG. 3 is a schematic structural diagram of a DNN-HMM model according to the present invention;
FIG. 4 is a flowchart of an environment-adaptive method for speech recognition according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a DNN network according to a specific embodiment of the present invention;
FIG. 6 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention; and
FIG. 8 is a block diagram of a household appliance according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements, or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and are not to be construed as limiting it.
Speech has long been a common mode of human interaction and an important research direction in human-machine interaction. Speech recognition systems have evolved from early automatic speech recognition (ASR) systems to large-vocabulary continuous speech recognition (LVCSR).
First, the basic framework of speech recognition is introduced. As shown in FIG. 1, the framework is built on an acoustic model, a language model, and a decoder. The acoustic model is formed by extracting features from training speech and performing acoustic modeling; the language model is formed by performing language modeling on a training corpus. After features are extracted from the input speech to obtain speech features, the decoder obtains the recognition result according to the language model and the acoustic model.
The speech features mainly include MFCC (Mel Frequency Cepstral Coefficient) parameters based on auditory perception, PLP (Perceptual Linear Predictive) parameters, and the like.
The language model is a description of the language. In the statistical-learning-based speech recognition framework, the N-gram statistical language model is the most common. Its basic idea is to represent the generation of a word sequence as a Markov chain: the probability of the k-th word in the sequence depends only on the preceding n-1 words. By counting the frequencies of individual words and word combinations in the corpus, the parameters of the language model can be estimated on that basis.
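The counting-based estimation described above can be sketched as follows. This is an illustrative bigram (n=2) example on a made-up toy corpus, not part of the disclosed embodiments:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram probabilities P(w_k | w_{k-1}) by relative frequency."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()   # sentence-start symbol
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    # P(w | prev) = count(prev, w) / count(prev)
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

corpus = ["open the door", "open the fridge", "close the door"]
p = train_bigram(corpus)
print(p("open", "the"))   # 2/2 = 1.0
print(p("the", "door"))   # 2/3
```

In a real system the counts would be smoothed (e.g. Katz or Kneser-Ney) to avoid zero probabilities for unseen word pairs.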
The role of the decoder is to combine the acoustic probability of the speech features, computed by the acoustic model, with the linguistic probability computed by the language model, and to obtain the most likely word sequence through a search.
The acoustic model is a description of the acoustic characteristics of speech and is the core of a speech recognition system. Several acoustic models are introduced below with reference to FIGS. 2 and 3.
In the field of speech recognition, the traditional acoustic model is the GMM-HMM model. An HMM can be described by two state sets and three probability components. The two state sets are the observable states O and the hidden states S: the observable states O are, as the name suggests, states that can be observed; the hidden states S satisfy the Markov property, i.e., the state at time t depends only on the state at time t-1, and generally cannot be observed directly. The three probability components are the initial state probability matrix, which expresses the probability distribution of the hidden states at the initial time; the state transition matrix, which expresses the transition probabilities between hidden states from time t to time t+1; and the observation output probabilities, which express the probability of observing O given that the hidden state is S. An HMM involves three problems. The first is the evaluation problem: given an observation sequence and a model, compute the probability of a particular output; for a speech recognition task, this means determining, from a speech sequence and a model, the likelihood that the sequence corresponds to a particular utterance. The second is the decoding problem: given an observation sequence and a model, find the hidden state sequence that makes the observations most probable; for a speech recognition task, this means recognizing the speech content from the speech sequence and the model. The third is the training problem: given an observation sequence, adjust the model parameters so that the probability of producing that observation sequence is maximized; for a speech recognition task, this means training the model parameters from a large amount of speech.
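The evaluation problem above is classically solved with the forward algorithm. The following is an illustrative sketch with toy parameter values that are made up for the example, not taken from the disclosure:

```python
def forward(obs, pi, A, B):
    """HMM evaluation problem: total probability of an observation sequence.

    pi[i]   - initial probability of hidden state i
    A[i][j] - transition probability from hidden state i to state j
    B[i][o] - probability of emitting observation symbol o in state i
    """
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * B_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * A_ij) * B_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Two hidden states, two observation symbols (toy values)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward([0, 1], pi, A, B))  # 0.209
```

The decoding problem is solved analogously by the Viterbi algorithm, replacing the sum over previous states with a max.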
FIG. 2 is a schematic structural diagram of a GMM-HMM model according to a specific embodiment of the present invention. The speech signal is divided into frames and features are extracted; a GMM describes the probability distribution of the features, while the HMM describes the transition probabilities of the hidden states and their relationship to the observations of the GMM.
With the development of deep neural network technology, speech recognition systems have gradually adopted the DNN-HMM approach. FIG. 3 is a schematic structural diagram of a DNN-HMM model according to a specific embodiment of the present invention. In the DNN-HMM model, the model describing the probability of feature occurrence is changed from a GMM to a deep neural network (DNN): the DNN describes the observation probability distribution of the features, while the HMM describes the transition probabilities of the hidden states and their relationship to the observation samples of the DNN.
To address the problem in the related art that environmental noise degrades the recognition rate of speech recognition, the environment-adaptive method for speech recognition of the embodiments of the present invention performs environment-adaptive processing in both the feature domain and the model domain, improving the robustness of speech recognition in the usage environment.
An environment-adaptive method for speech recognition according to embodiments of the present invention is described below with reference to FIGS. 4 and 5.
FIG. 4 is a flowchart of an environment-adaptive method for speech recognition according to an embodiment of the present invention. As shown in FIG. 4, the method comprises the following steps:
S1: Acquire speech information in the current environment. For example, acquire speech information in the typical working environment of a household appliance such as a refrigerator.
S2: Extract speech features from the speech information, and perform environment-adaptive processing on the speech features.
For example, MFCC parameters, PLP parameters, and the like are extracted from the speech information, and environment-adaptive processing, i.e., feature-domain environment adaptation, is performed on the extracted features. Reducing the influence of environmental noise in the feature domain, that is, removing background noise during feature extraction, enables better recognition of speech in real application environments.
In embodiments of the present invention, the environment-adaptive processing of the speech features may be performed by at least one of the following methods: a feature mapping method; a vocal tract length normalization method; a cepstral mean normalization method. Other methods capable of feature-domain environment adaptation may of course also be used; they are not enumerated here.
Taking the most common method, cepstral mean normalization, as an example: in an environment free of noise, the Mel cepstral coefficients of speech follow a Gaussian distribution whose odd-order moments (mean, etc.) have an expectation of 0 and whose even-order moments (variance, etc.) have an expectation equal to a specific constant. Based on this conclusion, the mean and the variance of the cepstrum can each be normalized. The specific operations are as follows:
X_CMN(n) = X(n) - E[X(n)]
X_CVN(n) = X_CMN(n) / σ(n)
where X_CMN(n) is the mean-normalized cepstral vector, X(n) is the feature-parameter (cepstral) vector, n denotes the dimension, X_CVN(n) is the variance-normalized cepstral vector, E denotes the mathematical expectation, and σ denotes the standard deviation.
In addition, the third- and fourth-order moments can be normalized in a similar way, so that the distribution conforms to a standard Gaussian distribution and the distortion caused by environmental noise is eliminated. For the procedures of environment-adaptive processing by the feature mapping method and the vocal tract length normalization method, reference may be made to the descriptions in the related art.
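The mean and variance normalization described above can be sketched as follows; the synthetic feature matrix is made up purely for illustration:

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization over one utterance.

    features: (num_frames, num_coeffs) array of cepstral vectors X(n).
    Subtracting the per-dimension mean implements X_CMN(n) = X(n) - E[X(n)];
    dividing by the per-dimension standard deviation normalizes the variance.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-8)  # guard against zero variance

# Synthetic stand-in for MFCC frames, with nonzero mean and inflated variance
feats = np.random.randn(200, 13) * 3.0 + 5.0
norm = cmvn(feats)
# Each dimension of `norm` now has (approximately) zero mean and unit variance.
```

In practice the statistics are often accumulated per speaker or per recording session rather than per utterance; that choice is left open here.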
S3: Obtain, according to the acoustic model and the language model, the word sequence with the maximum probability corresponding to the speech features.
Specifically, the acoustic probability of the speech features is computed according to the acoustic model, and the linguistic probability of the speech features is computed according to the language model; a search based on the acoustic probability and the linguistic probability yields the word sequence with the maximum probability corresponding to the speech features. As shown in FIG. 1, based on the acoustic model and the language model, the decoder performs the probability computation and obtains the most likely word sequence through a search, thereby accomplishing speech recognition; for the specific computation and search procedures, reference may be made to the related art.
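The combination of the two probabilities can be sketched as follows. The candidate list, the probability values, and the lm_weight scale factor are illustrative assumptions, since the disclosure does not specify a particular search implementation:

```python
import math

def decode(candidates, lm_weight=1.0):
    """Pick the word sequence maximizing acoustic x language probability.

    candidates: list of (words, p_acoustic, p_language) tuples, assumed to
    have been scored by the acoustic and language models. The scores are
    combined in the log domain; lm_weight plays the usual role of a
    language-model scale factor.
    """
    def score(c):
        _, p_ac, p_lm = c
        return math.log(p_ac) + lm_weight * math.log(p_lm)
    return max(candidates, key=score)[0]

candidates = [
    ("open the door",   0.020, 0.30),
    ("open the drawer", 0.025, 0.10),
]
print(decode(candidates))  # "open the door" wins once the LM score is included
```

With lm_weight=0 the acoustically stronger but linguistically less likely candidate would win instead, which is why real decoders tune this weight.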
It can be seen that the environment-adaptive method for speech recognition of the embodiments of the present invention, through environment-adaptive processing in the feature domain, removes environmental noise during feature extraction, reduces the influence of background noise on speech recognition in real application environments, and improves the robustness of speech recognition in those environments.
Although feature-domain environment adaptation is simple to apply and can be used with any model that consumes the features, it cannot truly eliminate the influence of noise in the statistical sense.
The environment-adaptive method of the embodiments of the present invention therefore also performs environment adaptation in the model domain, i.e., removes the influence of environmental noise at model training time. Specifically, during model training of the acoustic model, model-domain environment-adaptive processing is performed based on training speech and environment speech. Training speech can be understood as a collection of utterances containing the required semantics; these utterances need to be labeled. For example, many recordings of the sentence "ni hao" ("hello") in the usage environment can be collected to train the speech model for that sentence. Environment speech can be understood as a collection of various speech in that usage environment, which can be used to train a background model. It will be appreciated that both the training speech and the environment speech carry environmental noise and can express the distribution of speech in that usage environment.
In the speech recognition system of a smart household appliance, environmental noise directly affects the recognition rate. Since the microphone position is fixed for appliances of the same model, the channel gains are similar, and the types of working-environment noise are limited, the noise can be collected. The key issue in model-domain environment adaptation is building a speech database with the specific environment, i.e., collecting training speech that is as close as possible to the working environment. By collecting the background noise of the working environment and adapting to it, the influence of the noisy environment on the acoustic model can be reduced.
In embodiments of the present invention, the training speech may be collected in one of the following ways. One way is to record the training speech and the environment speech separately in the actual environment, for example, using the appliance itself to record the training speech and the environment speech in the real usage environment, which facilitates model-domain environment adaptation. It will be appreciated that the training speech and the environment speech are then both speech data carrying that specific real environment. Alternatively, the environment speech is recorded in the actual environment, clean speech is recorded in a laboratory, and the environment speech is superimposed on the clean speech to obtain the training speech, where clean speech can be understood as human speech without background noise. In practice, the collection of training speech is more often completed in a laboratory environment: a large amount of environment speech is recorded by the appliance in the actual working environment, and in the laboratory this environment speech is superimposed on clean speech, yielding training speech and environment speech that carry the specific environmental noise of the actual working environment.
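The superposition of environment speech on clean speech can be sketched as follows. Scaling the noise to hit a target signal-to-noise ratio is one plausible realization of that step; the signals and the SNR value here are made up for illustration:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Superimpose environment noise on clean speech at a target SNR (dB).

    Scales the noise so that 10*log10(P_speech / P_noise) equals snr_db,
    then adds it sample-wise to the clean signal.
    """
    noise = np.resize(noise, clean.shape)          # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for lab speech
noise = rng.standard_normal(16000)                          # stand-in for recorded noise
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Repeating the mix over a range of SNRs (a form of multi-condition training) would cover a spread of working conditions rather than a single noise level.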
Specifically, model-domain environment-adaptive processing can use different methods for different models.
In embodiments of the present invention, for a GMM-HMM model, environment-adaptive processing can be performed by a maximum a posteriori probability method or a transformation-based method; other feasible and suitable adaptation methods may of course also be used.
The environment adaptation method based on maximum a posteriori probability is based on Bayes' rule: the model parameters are modified through the prior probability so as to maximize the posterior probability of the observed data. Specifically, a model describing all possible environmental conditions is first trained from environment speech collected in different environments; because it covers a large amount of speech from different backgrounds, this model can be considered to eliminate the distribution of speech from any specific background. The background model parameters are then re-estimated based on the training speech to obtain the acoustic model. It will be appreciated that, unlike the related art in which the acoustic model is trained directly from training speech, the background model trained here covers the speech distribution of all training environments; the resulting acoustic model is not trained on clean speech but incorporates all possible noise environments, and the re-estimated acoustic model is identically distributed, thereby eliminating the influence of the environment of the training speech.
A transformation-based method, such as maximum likelihood linear regression, seeks a transformation of the model parameters that makes the loss function converge on the training data set. First, an environment-independent background model is trained, and the transformation between the target speech and this model is estimated, adapting it to an environment-independent speech recognition system. In practical applications, when the training speech data are sufficient, the method based on maximum a posteriori probability performs better; when the training speech data are insufficient, the transformation-based method can achieve better results than the method based on maximum a posteriori probability.
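The maximum a posteriori re-estimation described above can be sketched, for the component means of a GMM, as follows. The interpolation formula and the relevance factor tau are the standard textbook form of MAP adaptation, assumed here rather than quoted from the disclosure:

```python
import numpy as np

def map_adapt_means(mu_prior, frames, posteriors, tau=10.0):
    """MAP re-estimation of GMM component means from adaptation data.

    mu_prior:   (K, D) background-model means
    frames:     (T, D) adaptation feature vectors
    posteriors: (T, K) component occupation probabilities per frame
    tau:        relevance factor balancing the prior against the new data
    Implements the standard interpolation
        mu_k = (n_k * xbar_k + tau * mu_prior_k) / (n_k + tau),
    so components with little adaptation data stay close to the prior.
    """
    n = posteriors.sum(axis=0)                                   # soft counts n_k
    xbar = (posteriors.T @ frames) / np.maximum(n, 1e-8)[:, None]  # data means
    w = (n / (n + tau))[:, None]                                 # data weight per component
    return w * xbar + (1.0 - w) * mu_prior

mu_prior = np.zeros((2, 3))
frames = np.ones((5, 3))
post = np.tile([1.0, 0.0], (5, 1))   # all frames assigned to component 0
mu_new = map_adapt_means(mu_prior, frames, post, tau=5.0)
# Component 0 moves halfway toward the data mean (weight 5/(5+5));
# component 1, which saw no data, keeps its prior mean.
```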
For model-domain environment-adaptive processing of a DNN-HMM model, the network weights of the DNN can be fitted based on the training speech, or a transformation layer can be added to the DNN structure, or environment-adaptive processing can be performed by an i-vector-based method or by a coding-based method. Other applicable adaptation methods may of course also be used for the DNN-HMM model.
Specifically, because the structure of a DNN differs from that of a GMM, the maximum-a-posteriori-based and transformation-based methods described above are not applicable to the DNN-HMM model. One approach is to adjust the weights of the DNN network; the most direct method is to fit the network weights directly with speech data from the target environment (the real application environment), but this very easily leads to overfitting. Another approach is to add a transformation layer to the DNN structure and re-estimate it with training speech from the target environment. FIG. 5 is a schematic diagram of a DNN structure according to an embodiment of the present invention: first, a DNN network is trained; then a linear transformation layer is inserted after the input layer, and the network parameters of the DNN are re-estimated for training speech from different environments. Similarly, a linear transformation layer can be inserted before the output layer.
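The inserted linear transformation layer can be sketched as follows. The identity initialization and plain gradient update are illustrative assumptions, not the disclosure's exact training procedure; only this layer would be re-estimated while the pre-trained DNN behind it stays frozen:

```python
import numpy as np

class LinearInputNetwork:
    """Identity-initialized linear transform inserted after the input layer.

    forward(x) = x @ W.T + b. With W = I and b = 0 the pre-trained DNN's
    behavior is unchanged; adaptation then updates only W and b on
    target-environment speech.
    """
    def __init__(self, dim):
        self.W = np.eye(dim)      # identity start: output equals input
        self.b = np.zeros(dim)

    def forward(self, x):
        return x @ self.W.T + self.b

    def sgd_step(self, x, grad_out, lr=0.01):
        # grad_out = dL/d(output), back-propagated from the frozen DNN above
        self.W -= lr * grad_out.T @ x
        self.b -= lr * grad_out.sum(axis=0)

lin = LinearInputNetwork(dim=13)
frames = np.random.randn(4, 13)
out = lin.forward(frames)   # identical to `frames` before any adaptation step
```

The same class could equally be placed before the output layer, mirroring the alternative mentioned above.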
In summary, the environment-adaptive method for speech recognition of the embodiments of the present invention discloses, for the speech recognition systems of household appliances, an adaptation scheme for eliminating the influence of background noise in a specific working environment, including feature-domain environment adaptation, model-domain environment adaptation, and a data collection scheme for the training speech.
A speech recognition apparatus according to embodiments of the present invention is described below with reference to the accompanying drawings.
FIG. 6 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention. As shown in FIG. 6, the speech recognition apparatus 100 comprises an acquisition module 10, an extraction module 20, an adaptation module 30, a model module 40, and a recognition module 50.
The acquisition module 10 is configured to acquire speech information in the current environment; the extraction module 20 is configured to extract speech features from the speech information, for example MFCC parameters, PLP parameters, and the like.
The adaptation module 30 is configured to perform environment-adaptive processing on the speech features, i.e., feature-domain environment adaptation, reducing the influence of environmental noise in the feature domain by removing background noise during feature extraction, so that speech in real application environments can be recognized more accurately. In embodiments of the present invention, the adaptation module 30 may perform environment-adaptive processing by at least one of the following methods: a feature mapping method; a vocal tract length normalization method; a cepstral mean normalization method. Other methods capable of feature-domain environment adaptation may of course also be used; they are not enumerated here.
The model module 40 is configured to provide the acoustic model and the language model. The acoustic model is a description of the acoustic characteristics of speech and is the core of the speech recognition system; FIGS. 2 and 3 are schematic diagrams of typical acoustic models. The language model is a description of the language; in the statistical-learning-based speech recognition framework, the N-gram statistical language model is the most common.
The recognition module 50 obtains, according to the acoustic model and the language model, the word sequence with the maximum probability corresponding to the speech features. Specifically, the recognition module 50 computes the acoustic probability of the speech features according to the acoustic model, computes the linguistic probability according to the language model, and searches based on the acoustic probability and the linguistic probability for the word sequence with the maximum probability corresponding to the speech features, thereby accomplishing speech recognition; for the specific computation and search procedures, reference may be made to the related art.
In the speech recognition apparatus of the embodiments of the present invention, the feature-domain environment adaptation performed by the adaptation module removes environmental noise during feature extraction, reduces the influence of background noise on speech recognition in real application environments, and improves the robustness of speech recognition in those environments.
Although feature-domain environment adaptation is simple to apply and can be used with any model that consumes the features, it cannot truly eliminate the influence of noise in the statistical sense. The adaptation module 30 is therefore further configured to perform, during model training of the acoustic model, model-domain environment-adaptive processing based on training speech and environment speech.
Further, as shown in FIG. 7, the speech recognition apparatus 100 also comprises a collection module 60, configured to collect the training speech in one of the following ways: recording the training speech and the environment speech separately in the actual environment; or recording the environment speech in the actual environment, recording clean speech in a laboratory, and superimposing the environment speech on the clean speech to obtain the training speech, where clean speech can be understood as human speech without background noise.
Specifically, model-domain environment-adaptive processing can use different methods for different models. For a GMM-HMM model, the adaptation module 30 can perform environment-adaptive processing by a maximum a posteriori probability method or a transformation-based method. In the method based on maximum a posteriori probability, a model describing all possible environmental conditions is first trained from environment speech collected in different environments; because it covers a large amount of speech from different backgrounds, this model can be considered to eliminate the distribution of speech from any specific background; the background model parameters are then re-estimated based on the training speech to obtain the acoustic model. In a transformation-based method such as maximum likelihood linear regression, an environment-independent background model is first trained, and the transformation between the target speech and this model is estimated, adapting it to an environment-independent speech recognition system. In practical applications, when the training speech data are sufficient, the method based on maximum a posteriori probability performs better; when the training speech data are insufficient, the transformation-based method can achieve better results.
Alternatively, for a DNN-HMM model, the adaptation module 30 can fit the network weights of the DNN based on the training speech, or add a transformation layer to the DNN structure as shown in FIG. 5, or perform environment-adaptive processing by an i-vector-based method or by a coding-based method. Other applicable adaptation methods may of course also be used for the DNN-HMM model.
In short, the speech recognition apparatus 100 of the embodiments of the present invention uses environment adaptation to remove the influence of environmental noise on speech recognition, including feature-domain adaptation and model-domain adaptation; it applies both adaptation techniques to speech recognition simultaneously and provides a scheme for collecting speaker speech that contains environmental noise.
Based on the speech recognition apparatus of the above embodiments, a household appliance according to embodiments of the present invention is described below with reference to FIG. 8.
As shown in FIG. 8, the household appliance 1000 of the embodiments of the present invention, for example a refrigerator, comprises a body 200 and the speech recognition apparatus 100 of the above aspect.
By employing the speech recognition apparatus 100 described above, the household appliance 1000 can reduce the influence of background noise on speech recognition and improve the robustness of speech recognition in the working environment.
Some embodiments of the present invention further provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the environment-adaptive method for speech recognition of the embodiments of the above aspect.
Some embodiments of the present invention further provide a refrigerator comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the environment-adaptive method for speech recognition of the embodiments of the above aspect is implemented.
It should be noted that, in the description of this specification, any process or method description in the flowcharts, or otherwise described herein, may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or step of the process; and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functionality involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise suitably processing it if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following techniques well known in the art, or a combination thereof, may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art will understand that all or part of the steps of the methods of the above embodiments can be accomplished by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine the different embodiments or examples described in this specification, and the features thereof, with one another.
Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (15)

  1. An environment-adaptive method for speech recognition, characterized by comprising the following steps:
    acquiring speech information in the current environment;
    extracting speech features from the speech information, and performing environment-adaptive processing on the speech features; and
    obtaining, according to an acoustic model and a language model, a word sequence with the maximum probability corresponding to the speech features.
  2. The environment-adaptive method for speech recognition according to claim 1, characterized in that obtaining, according to the acoustic model and the language model, the word sequence with the maximum probability corresponding to the speech features further comprises:
    computing an acoustic probability of the speech features according to the acoustic model, and computing a linguistic probability of the speech features according to the language model; and
    searching, based on the acoustic probability and the linguistic probability, to obtain the word sequence with the maximum probability corresponding to the speech features.
  3. The environment-adaptive method for speech recognition according to claim 1, characterized in that the environment-adaptive processing of the speech features is performed by at least one of the following methods:
    a feature mapping method;
    a vocal tract length normalization method;
    a cepstral mean normalization method.
  4. The environment-adaptive method for speech recognition according to claim 1, characterized by further comprising:
    during model training of the acoustic model, performing model-domain environment-adaptive processing based on training speech and environment speech.
  5. The environment-adaptive method for speech recognition according to claim 4, characterized in that performing model-domain environment-adaptive processing further comprises:
    for a GMM-HMM model, performing environment-adaptive processing by a maximum a posteriori probability method or a transformation-based method;
    for a DNN-HMM model, fitting network weights of the DNN based on the training speech, or adding a transformation layer to the DNN structure, or performing environment-adaptive processing by an i-vector-based method, or performing environment-adaptive processing by a coding-based method.
  6. The environment-adaptive method for speech recognition according to claim 4 or 5, characterized in that the training speech is collected in one of the following ways:
    recording the training speech and the environment speech separately in an actual environment; or
    recording the environment speech in the actual environment, recording clean speech in a laboratory, and superimposing the environment speech on the clean speech to obtain the training speech.
  7. A speech recognition apparatus, characterized by comprising:
    an acquisition module, configured to acquire speech information in the current environment;
    an extraction module, configured to extract speech features from the speech information;
    an adaptation module, configured to perform environment-adaptive processing on the speech features;
    a model module, configured to provide an acoustic model and a language model; and
    a recognition module, configured to obtain, according to the acoustic model and the language model, a word sequence with the maximum probability corresponding to the speech features.
  8. The speech recognition apparatus according to claim 7, characterized in that the recognition module is further configured to compute an acoustic probability of the speech features according to the acoustic model, compute a linguistic probability of the speech features according to the language model, and search, based on the acoustic probability and the linguistic probability, to obtain the word sequence with the maximum probability corresponding to the speech features.
  9. The speech recognition apparatus according to claim 7, characterized in that the adaptation module performs environment-adaptive processing on the speech features by at least one of the following methods:
    a feature mapping method;
    a vocal tract length normalization method;
    a cepstral mean normalization method.
  10. The speech recognition apparatus according to claim 7, characterized in that the adaptation module is further configured to perform, during model training of the acoustic model, model-domain environment-adaptive processing based on training speech and environment speech.
  11. The speech recognition apparatus according to claim 10, characterized in that the adaptation module is further configured to: for a GMM-HMM model, perform environment-adaptive processing by a maximum a posteriori probability method or a transformation-based method; or, for a DNN-HMM model, fit network weights of the DNN based on the training speech, or add a transformation layer to the DNN structure, or perform environment-adaptive processing by an i-vector-based method, or perform environment-adaptive processing by a coding-based method.
  12. The speech recognition apparatus according to claim 10 or 11, characterized by further comprising:
    a collection module, configured to collect the training speech in one of the following ways: recording the training speech and the environment speech separately in an actual environment; or recording the environment speech in the actual environment, recording clean speech in a laboratory, and superimposing the environment speech on the clean speech to obtain the training speech.
  13. A household appliance, characterized by comprising:
    a body; and
    the speech recognition apparatus according to any one of claims 7-12.
  14. A computer-readable storage medium having stored thereon a computer program, characterized in that, when the program is executed by a processor, the environment-adaptive method for speech recognition according to any one of claims 1-6 is implemented.
  15. A refrigerator comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the program, the environment-adaptive method for speech recognition according to any one of claims 1-6 is implemented.
PCT/CN2017/103017 2016-09-23 2017-09-22 Environment-adaptive method for speech recognition, speech recognition apparatus, and household appliance WO2018054361A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610847088.XA CN106409289B (zh) 2016-09-23 2016-09-23 Environment-adaptive method for speech recognition, speech recognition apparatus, and household appliance
CN201610847088.X 2016-09-23

Publications (1)

Publication Number Publication Date
WO2018054361A1 true WO2018054361A1 (zh) 2018-03-29

Family

ID=57998225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/103017 WO2018054361A1 (zh) 2017-09-22 2018-03-29 Environment-adaptive method for speech recognition, speech recognition apparatus, and household appliance

Country Status (2)

Country Link
CN (1) CN106409289B (zh)
WO (1) WO2018054361A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110875052A (zh) * 2018-08-31 2020-03-10 深圳市优必选科技有限公司 Speech denoising method for a robot, robot apparatus, and storage apparatus
CN111243574A (zh) * 2020-01-13 2020-06-05 苏州奇梦者网络科技有限公司 Adaptive training method, system, apparatus, and storage medium for a speech model
CN112466056A (zh) * 2020-12-01 2021-03-09 上海旷日网络科技有限公司 Self-service locker pickup system and method based on speech recognition

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106409289B (zh) 2016-09-23 2019-06-28 合肥美的智能科技有限公司 Environment-adaptive method for speech recognition, speech recognition apparatus, and household appliance
CN106991999B (zh) * 2017-03-29 2020-06-02 北京小米移动软件有限公司 Speech recognition method and apparatus
CN107680582B (zh) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 Acoustic model training method, speech recognition method, apparatus, device, and medium
CN109151218B (zh) * 2018-08-21 2021-11-19 平安科技(深圳)有限公司 Call speech quality inspection method, apparatus, computer device, and storage medium
CN109635098B (zh) * 2018-12-20 2020-08-21 东软集团股份有限公司 Intelligent question-answering method, apparatus, device, and medium
CN110099246A (zh) * 2019-02-18 2019-08-06 深度好奇(北京)科技有限公司 Monitoring and scheduling method, apparatus, computer device, and storage medium
CN112152667A (zh) * 2019-06-11 2020-12-29 华为技术有限公司 Method and apparatus for identifying an electrical appliance
CN110570845B (zh) * 2019-08-15 2021-10-22 武汉理工大学 Speech recognition method based on domain-invariant features
CN110738991A (zh) * 2019-10-11 2020-01-31 东南大学 Speech recognition device based on flexible wearable sensors
CN110930985B (zh) * 2019-12-05 2024-02-06 携程计算机技术(上海)有限公司 Telephone speech recognition model, method, system, device, and medium
CN110875050B (zh) * 2020-01-17 2020-05-08 深圳亿智时代科技有限公司 Speech data collection method, apparatus, device, and medium for real scenarios
CN113628612A (zh) * 2020-05-07 2021-11-09 北京三星通信技术研究有限公司 Speech recognition method and apparatus, electronic device, and computer-readable storage medium
CN113156826B (zh) * 2021-03-25 2022-08-16 青岛酒店管理职业技术学院 Artificial-intelligence-based home automation management method, management system, and terminal

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860062A (en) * 1996-06-21 1999-01-12 Matsushita Electric Industrial Co., Ltd. Speech recognition apparatus and speech recognition method
CN1542737A (zh) * 2003-03-12 2004-11-03 株式会社NTT都科摩 Noise adaptation system, method, and program for speech recognition
CN105448303A (zh) * 2015-11-27 2016-03-30 百度在线网络技术(北京)有限公司 Speech signal processing method and apparatus
CN106409289A (zh) * 2016-09-23 2017-02-15 合肥华凌股份有限公司 Environment-adaptive method for speech recognition, speech recognition apparatus, and household appliance

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154383B (zh) * 2006-09-29 2010-10-06 株式会社东芝 Method and apparatus for noise suppression, speech feature extraction, speech recognition, and speech model training
CN102568478B (zh) * 2012-02-07 2015-01-07 合一网络技术(北京)有限公司 Video playback control method and system based on speech recognition


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110875052A (zh) * 2018-08-31 2020-03-10 深圳市优必选科技有限公司 Speech denoising method for a robot, robot apparatus, and storage apparatus
CN111243574A (zh) * 2020-01-13 2020-06-05 苏州奇梦者网络科技有限公司 Adaptive training method, system, apparatus, and storage medium for a speech model
CN111243574B (zh) * 2020-01-13 2023-01-03 苏州奇梦者网络科技有限公司 Adaptive training method, system, apparatus, and storage medium for a speech model
CN112466056A (zh) * 2020-12-01 2021-03-09 上海旷日网络科技有限公司 Self-service locker pickup system and method based on speech recognition
CN112466056B (zh) * 2020-12-01 2022-04-05 上海旷日网络科技有限公司 Self-service locker pickup system and method based on speech recognition

Also Published As

Publication number Publication date
CN106409289A (zh) 2017-02-15
CN106409289B (zh) 2019-06-28

Similar Documents

Publication Publication Date Title
WO2018054361A1 (zh) Environment-adaptive method for speech recognition, speech recognition apparatus, and household appliance
Tripathi et al. Deep learning based emotion recognition system using speech features and transcriptions
WO2021104099A1 (zh) Context-aware multimodal depression detection method and system
CN107329996B (zh) Chatbot system and chat method based on a fuzzy neural network
US20120130716A1 (en) Speech recognition method for robot
Gangireddy et al. Unsupervised Adaptation of Recurrent Neural Network Language Models.
CN108899047A Masking threshold estimation method and apparatus for audio signals, and storage medium
CN108986788A Noise-robust acoustic modeling method based on posterior knowledge supervision
Ding et al. Audio-visual keyword spotting based on multidimensional convolutional neural network
CN112509560B Adaptive speech recognition method and system based on a cache language model
US20210056958A1 (en) System and method for tone recognition in spoken languages
Elshaer et al. Transfer learning from sound representations for anger detection in speech
CN112767921A Adaptive speech recognition method and system based on a cache language model
Kipyatkova et al. A study of neural network Russian language models for automatic continuous speech recognition systems
Wöllmer et al. Multi-stream LSTM-HMM decoding and histogram equalization for noise robust keyword spotting
KR20210009593A (ko) 강인한 음성인식을 위한 음향 및 언어모델링 정보를 이용한 음성 끝점 검출 방법 및 장치
Lecouteux et al. Distant speech recognition for home automation: Preliminary experimental results in a smart home
Ons et al. A self learning vocal interface for speech-impaired users
Higuchi et al. Speaker Adversarial Training of DPGMM-Based Feature Extractor for Zero-Resource Languages.
Yang Ensemble deep learning with HuBERT for speech emotion recognition
CN110807370A Multimodal imperceptible identity confirmation method for conference speakers
Yoshida et al. Audio-visual voice activity detection based on an utterance state transition model
WO2022176124A1 (ja) Learning apparatus, estimation apparatus, methods therefor, and program
CN112037772B Multimodal response obligation detection method, system, and apparatus
CN111933187B Training method and apparatus for an emotion recognition model, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17852429

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17852429

Country of ref document: EP

Kind code of ref document: A1