WO2018068649A1 - 一种语音激活检测方法及装置 - Google Patents

一种语音激活检测方法及装置 (Voice activation detection method and apparatus)

Info

Publication number
WO2018068649A1
WO2018068649A1 (application PCT/CN2017/103861)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
activation
neural network
module
Prior art date
Application number
PCT/CN2017/103861
Other languages
English (en)
French (fr)
Inventor
范利春
朱磊
Original Assignee
芋头科技(杭州)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 芋头科技(杭州)有限公司 filed Critical 芋头科技(杭州)有限公司
Publication of WO2018068649A1 publication Critical patent/WO2018068649A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination

Definitions

  • The present invention relates to the field of speech recognition, and in particular to a voice activation detection method and apparatus that use a bidirectional recurrent neural network for secondary confirmation.
  • Many speech recognition devices must be activated by voice before they can pick up audio for recognition.
  • Pickup begins only after the device is activated, first to reduce the device's power consumption, and second to keep speech that does not need recognition out of the recognition pipeline, where it would trigger unnecessary responses.
  • In near-field scenarios, the pickup state can also be entered by touch or a button.
  • Voice activation usually works by first setting an activation word; the user then speaks the activation word to the device, which wakes up and enters the pickup state.
  • The simplest and most intuitive voice activation method is to use speech recognition itself: feed the activating speech into a speech recognizer and activate the device if the recognition result is, or contains, the activation word. In practice it suffices to score the activating speech acoustically, computing its acoustic score on the configured activation word, and to set acceptance and rejection thresholds on that score; the threshold, however, is very hard to control, because setting it too low produces many false activations while setting it too high makes the device hard to activate. This phenomenon is especially severe for shorter activation words.
  • The present invention discloses a voice activation detection method, applied to voice detection when activating a speech recognition device provided with an activation word, comprising the following steps:
  • Step S1: performing endpoint detection on the speech data to be measured, to obtain speech data containing a speech signal;
  • Step S2: using a pre-trained speech recognition acoustic model to obtain the triphone posterior probabilities associated with the speech data containing the speech signal;
  • Step S3: performing streaming dynamic programming on the triphone posterior probabilities, to obtain the path score of the speech data containing the speech signal on the activation word;
  • Step S4: comparing the path score with a preset first threshold:
  • if the path score is less than the first threshold, determining that the speech data containing the speech signal is non-activating speech, and then exiting;
  • Step S5: performing backtracking to find the starting position of the speech data containing the speech signal, and obtaining a speech segment according to the starting position;
  • Step S6: performing forward processing on the speech segment with a pre-trained bidirectional recurrent neural network, and deciding, according to the processing result, whether to activate the speech recognition device.
  • In the above voice activation detection method, the step in S6 of deciding whether to activate the speech recognition device according to the processing result specifically comprises:
  • comparing the processing result with a preset second threshold, and activating the device when the processing result is greater than the second threshold.
  • The speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN-HMM framework.
  • The triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word.
  • The speech segment is a speech segment comprising only the activation word.
  • The bidirectional recurrent neural network is a BLSTM recurrent neural network.
  • The training steps for pre-training the bidirectional recurrent neural network comprise:
  • Step S61: processing speech containing the activation word to obtain speech segments containing only the activation word;
  • Step S62: training the bidirectional recurrent neural network with the speech segments containing only the activation word.
  • The invention also discloses a voice activation detection apparatus, applied on a speech recognition device provided with an activation word to perform voice detection when the speech recognition device is activated, comprising:
  • an endpoint detection module, which performs endpoint detection on the speech data to be measured to obtain speech data containing a speech signal;
  • an acoustic scoring module, coupled to the endpoint detection module, which uses a pre-trained speech recognition acoustic model to obtain the triphone posterior probabilities associated with the speech data containing the speech signal;
  • a dynamic programming module, connected to the acoustic scoring module, which performs streaming dynamic programming on the triphone posterior probabilities to obtain the path score of the speech data containing the speech signal on the activation word;
  • a comparison module, connected to the dynamic programming module, in which a first threshold is preset; the comparison module compares the path score with the preset first threshold and decides, according to the comparison result, whether the speech data containing the speech signal is activating speech;
  • a backtracking module, connected to the comparison module, which performs backtracking when the speech data containing the speech signal is determined to be activating speech, finds the starting position of that speech data, and obtains a speech segment according to the starting position;
  • a processing-and-comparison module, connected to the backtracking module and comprising a pre-trained bidirectional recurrent neural network, which performs forward processing on the speech segment with that network and decides, according to the processing result, whether to activate the speech recognition device.
  • The processing-and-comparison module comprises a processing unit and a comparison unit:
  • the processing unit performs forward processing on the speech segment with the pre-trained bidirectional recurrent neural network;
  • the comparison unit compares the processing result with a preset second threshold and activates the device when the processing result is greater than the second threshold.
  • The endpoint detection module is an endpoint detection module based on short-time energy, pitch, or a neural network.
  • The triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word.
  • The speech segment is a speech segment comprising only the activation word.
  • The voice activation detection method and apparatus disclosed by the present invention adopt two-stage activation detection. The first activation confirmation uses only acoustic scoring plus dynamic programming, comparing the path score against a threshold to decide whether the speech data containing the speech signal could be activating speech; segments that could be activating are then sent into the second confirmation, which uses a BLSTM recurrent neural network and computes over all frames of the entire segment to make the final decision on whether to activate the speech recognition device. Of the two confirmations, the threshold of the first can be set relatively loose.
  • The second activation confirmation, with the starting point already known, is comparatively more accurate; the two detections together reduce both false activations and missed activations, i.e., they effectively lower the equal error rate of activation and thus better guarantee activation performance.
  • FIG. 1 is a flowchart of a voice activation detecting method in an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a voice activation detecting apparatus according to an embodiment of the present invention.
  • This embodiment relates to a voice activation detection method, applied to voice detection when activating a speech recognition device provided with an activation word; the method mainly comprises the following steps:
  • Step S1: performing endpoint detection on the speech data to be measured, to obtain speech data containing a speech signal.
  • The endpoint detection step is placed first in the flow because continuously running acoustic computation on the speech data to be measured (a continuous speech signal) would waste considerable resources; after endpoint detection, the subsequent acoustic computation is performed only on speech data containing a speech signal, which saves computing resources.
  • There are many endpoint detection methods, for example methods using short-time energy, methods using pitch, and methods using neural networks (i.e., the endpoint detection can be based on short-time energy, pitch, a neural network, etc.).
  • In a preferred embodiment, a neural network performs the endpoint detection on the speech data to be measured to obtain speech data containing a speech signal; specifically, the input of the neural network is the speech features of each frame, and its output has 2 nodes, corresponding to speech and non-speech.
  • In the continuous frame-by-frame decision, a preset number of consecutive speech frames is taken as the starting endpoint, and a preset number of consecutive non-speech frames as the ending endpoint.
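The consecutive-frame decision above can be sketched as a small hysteresis routine. This is a minimal illustration; the function name, the boolean label input, and the run lengths `n_start`/`n_end` are assumptions, since the text does not fix the counts:

```python
def detect_endpoints(frame_is_speech, n_start=5, n_end=20):
    # frame_is_speech: one boolean per frame, e.g. the argmax of the
    # 2-node (speech / non-speech) network output described above.
    # n_start consecutive speech frames mark the starting endpoint; once a
    # start is found, n_end consecutive non-speech frames mark the end.
    frames = list(frame_is_speech)
    start = None
    speech_run = nonspeech_run = 0
    for t, is_speech in enumerate(frames):
        if is_speech:
            speech_run += 1
            nonspeech_run = 0
            if start is None and speech_run >= n_start:
                start = t - n_start + 1          # starting endpoint
        else:
            nonspeech_run += 1
            speech_run = 0
            if start is not None and nonspeech_run >= n_end:
                return start, t - n_end + 1      # ending endpoint
    if start is None:
        return None, None
    return start, len(frames)                    # speech ran to end of stream
```

For example, with 3 leading non-speech frames, 10 speech frames, and a long non-speech tail, the routine returns the first speech frame and the frame where the trailing silence begins.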
  • In step S2, the triphone posterior probabilities associated with the speech data containing the speech signal are obtained with the pre-trained speech recognition acoustic model.
  • The triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word (i.e., the score computation obtains, for every frame of speech, its score on the triphones contained in the activation word, finally yielding an acoustic score matrix).
  • The above speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN (deep neural network)-HMM framework.
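A minimal sketch of how the acoustic score matrix could be assembled from the acoustic model's per-frame output. The function name and the `(T, N)` posterior layout are assumptions for illustration, not from the source:

```python
import numpy as np

def activation_score_matrix(posteriors, state_ids):
    # posteriors: (T, N) per-frame posteriors over all tied states of the
    # acoustic model (e.g. the DNN-HMM softmax output).
    # state_ids: indices of the states belonging to the activation word's
    # triphones (12 of them in the keyword example below).
    # Returns the (n_states, T) acoustic score matrix consumed by the
    # streaming dynamic programming stage.
    return np.asarray(posteriors)[:, state_ids].T
```

Each column of the result is one frame's scores on the activation-word states, matching the 12×T matrix described later in the text.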
  • In step S3, streaming dynamic programming is performed on the triphone posterior probabilities to obtain the path score of the speech data containing the speech signal on the activation word.
  • In the streaming dynamic programming of the first activation confirmation, the shortest and longest time spans of the activation word must be set to limit the size of the search space; this also guarantees the duration of the activation-word segment, which increases reliability. More specifically, the shortest and longest time spans of each phone in the activation word are set.
  • A dynamic programming algorithm is run on the acoustic score matrix to compute the matching score of each speech segment; if some segment's matching score exceeds the threshold, the speech contains the wake-up word. The details are as follows:
  • Take a keyword such as "数字" ("digits"): it contains 2 characters and 4 initials/finals, equivalent to 4 triphones, i.e., 12 states; suppose these states are numbered 1-12. For a test utterance, the probabilities of these 12 states are extracted from the acoustic model's output for every frame as that frame's acoustic score under the keyword. An utterance of T frames can thus be converted into a 12×T matrix.
  • From this 12×T matrix, the matching score of any speech segment can be computed.
  • The computation details are as follows: in general, each state lasts 2-10 frames, so the "数字" keyword spans 24-120 frames. For any frame t in the speech stream, take it as the ending frame of a candidate segment and reach 24 to 120 frames backwards, i.e., take t-120, t-119, ..., t-24 respectively as the starting frame of the segment. This yields 96 candidate cases; dynamic programming is run on the matrix for each of the 96 cases, each result is divided by the frame length to obtain an average score, and the highest average score among the 96 cases is taken as the matching score of frame t.
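The 96-case search can be sketched as follows. This is a simplified illustration: `segment_score` aligns a candidate segment left-to-right over the activation-word states and normalises by frame length; log-domain scores are assumed, and the per-state 2-10-frame duration limits mentioned above are omitted for brevity:

```python
import numpy as np

def segment_score(P):
    # P: (S, L) slice of the acoustic score matrix for one candidate segment.
    # Left-to-right alignment: each frame either stays in its current state or
    # advances to the next; the path starts in state 0 and ends in state S-1.
    S, L = P.shape
    dp = np.full((S, L), -np.inf)
    dp[0, 0] = P[0, 0]
    for t in range(1, L):
        dp[0, t] = dp[0, t - 1] + P[0, t]
        for s in range(1, S):
            dp[s, t] = max(dp[s, t - 1], dp[s - 1, t - 1]) + P[s, t]
    return dp[S - 1, L - 1] / L      # divide by frame length: average score

def matching_score(scores, t, min_len=24, max_len=120):
    # scores: (S, T) acoustic score matrix; frame t is the candidate segment's
    # ending frame. Every start from t-120 to t-24 is tried (the 96 cases),
    # and the highest average score is the matching score of frame t.
    best = -np.inf
    for length in range(min_len, min(max_len, t + 1) + 1):
        best = max(best, segment_score(scores[:, t - length + 1 : t + 1]))
    return best
```

`matching_score(scores, t)` then yields the length-normalised path score that would be compared against the first threshold in step S4.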
  • In step S4, the path score is compared with the preset first threshold: if the path score is less than the first threshold, the speech data containing the speech signal is judged to be non-activating speech, and the procedure exits.
  • After the first activation decision (which comprises steps S3 and S4), the path score of the dynamic programming is available.
  • This path score is compared with the preset first threshold: a score below the threshold is regarded as non-activating speech, and the procedure exits; a score above it passes the first activation detection, and step S5 continues.
  • In step S5, backtracking is performed to find the starting position of the speech data containing the speech signal, and a speech segment is obtained according to the starting position.
  • For speech that passed the first activation detection, the backtracking algorithm of the dynamic programming is used to find the starting point, thereby obtaining a speech segment that may contain the activation word.
  • The choice of this segment has a considerable impact on the secondary confirmation of activation with the bidirectional recurrent neural network; ideally it is a segment containing exactly the activation word, which gives the best results.
  • In step S6, the speech segment is forward-processed with a pre-trained BLSTM (Bidirectional Long Short-Term Memory) recurrent neural network, and whether to activate the speech recognition device is decided according to the processing result.
  • In a BLSTM recurrent neural network, bidirectional long short-term memory is a neural network learning model: "bidirectional" means the input is fed forwards and backwards into two separate recurrent networks, both connected to the same output layer, and "long short-term memory" denotes an alternative neural architecture capable of learning long-term dependencies.
  • Neural networks, and recurrent neural networks in particular, are widely adopted in speech recognition for their powerful modeling capability.
  • A bidirectional recurrent network has even more powerful modeling capability than a unidirectional one, but the requirement that the starting and ending points be known before accurate computation is possible has made it hard to apply in speech.
  • Here, the backtracking algorithm of the dynamic programming finds the starting point for speech that passed the first activation detection, yielding a segment that may contain the activation word, which in turn makes the bidirectional recurrent network applicable to voice activation detection.
  • In step S6 the BLSTM recurrent neural network must be trained in advance; it contains several hidden layers, its input is the features of the speech segment, and it has 2 output nodes, representing the non-activation node and the activation node respectively.
  • The training data likewise need processing: speech containing the activation word is put through the first four processing steps to obtain segments containing only the activation word for training.
  • Counter-examples are false-activation data whose pronunciation resembles the activation word; they are processed in the same way into segments for training.
  • Every frame of a segment containing the genuine activation word is labeled 1; every frame of a counter-example is labeled 0.
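The frame-labelling rule can be written down directly (a trivial sketch; the function name is illustrative):

```python
import numpy as np

def make_frame_labels(n_frames, contains_activation_word):
    # Per-frame BLSTM training targets: every frame of a segment containing
    # the genuine activation word gets label 1; every frame of a
    # counter-example (similar-sounding false-activation data) gets label 0.
    return np.full(n_frames, 1 if contains_activation_word else 0, dtype=np.int64)
```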
  • For the secondary confirmation, the entire speech segment is sent into the BLSTM recurrent neural network for computation.
  • Each speech frame yields an output result, and the final decision is based on the weighted score over all frames.
  • The BLSTM outputs over all frames of the speech segment are averaged, and a threshold is set on the label-1 node: if the output value is greater than the threshold, the segment is considered to be the activation word and the device is activated; if it is less than the threshold, the segment is considered not to be the activation word and the device is not activated.
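The second confirmation thus reduces to averaging the label-1 node's output over the whole segment and thresholding. A sketch, where the `(T, 2)` output layout and the 0.5 default are assumptions; the text only says a second threshold is preset:

```python
import numpy as np

def confirm_activation(frame_outputs, second_threshold=0.5):
    # frame_outputs: (T, 2) per-frame BLSTM outputs for the whole segment,
    # column 0 = non-activation node, column 1 = activation (label-1) node.
    # The label-1 outputs are averaged over all frames and compared against
    # the preset second threshold; the device activates only on a pass.
    mean_activation = float(np.mean(np.asarray(frame_outputs)[:, 1]))
    return mean_activation > second_threshold
```

Because the whole segment contributes to the mean, a short noise burst that fools a few frames is unlikely to push the average over the threshold, which is the point of the second-stage confirmation.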
  • This embodiment also relates to a voice activation detection apparatus, applied on a speech recognition device provided with an activation word to perform voice detection when the speech recognition device is activated.
  • The apparatus comprises: an endpoint detection module that performs endpoint detection on the speech data to be measured to obtain speech data containing a speech signal; and an acoustic scoring module, connected to the endpoint detection module, that uses a pre-trained speech recognition acoustic model to obtain the triphone posterior probabilities associated with that speech data.
  • It further comprises a dynamic programming module, connected to the acoustic scoring module, that performs streaming dynamic programming on the triphone posterior probabilities to obtain the path score of the speech data containing the speech signal on the activation word; a comparison module; a backtracking module; and a processing-and-comparison module.
  • The processing-and-comparison module comprises a pre-trained bidirectional recurrent neural network, with which it forward-processes the speech segment and decides, according to the processing result, whether to activate the speech recognition device.
  • The processing-and-comparison module comprises a processing unit, which forward-processes the speech segment with the pre-trained bidirectional recurrent neural network, and a comparison unit, which compares the processing result with a preset second threshold and activates the device when the processing result is greater than the second threshold.
  • The endpoint detection module is an endpoint detection module based on short-time energy, pitch, or a neural network.
  • The speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN-HMM framework.
  • The triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word.
  • The speech segment is a speech segment comprising only the activation word.
  • The bidirectional recurrent neural network is a BLSTM bidirectional recurrent neural network.
  • This embodiment is the structural embodiment corresponding to the voice activation detection method embodiment above, and the two can be implemented in cooperation with each other.
  • The related technical details mentioned in the method embodiment remain valid in this embodiment and are not repeated here to reduce repetition; correspondingly, the related technical details mentioned in this embodiment also apply to the method embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A voice activation detection method and apparatus. Using two-stage activation detection, the speech segment obtained in the first activation decision is fed into a BLSTM recurrent neural network, and all frames of the entire speech are processed to make the final decision on whether to activate the speech recognition device. Of the two activation confirmations, the threshold of the first can be set relatively loose to guarantee a high detection rate; the second confirmation, with the starting point already known, is comparatively more accurate. Together, the two detections reduce both false activations and missed activations, i.e., they effectively lower the equal error rate of activation and thus more effectively guarantee activation performance.

Description

Voice Activation Detection Method and Apparatus
Technical Field
The present invention relates to the field of speech recognition, and in particular to a voice activation detection method and apparatus that use a bidirectional recurrent neural network for secondary confirmation.
Background
Many speech recognition devices must be activated by voice before they can pick up audio and perform recognition. Pickup begins only after activation, first to reduce the device's power consumption, and second to keep speech that does not need to be recognized out of the recognition pipeline, where it would trigger unnecessary responses. In near-field environments, such as recognition on a mobile phone, the pickup state can be entered by touch or a button press. In far-field recognition, or in near-field recognition when hand operation is inconvenient, activating the device by voice so that it enters the pickup state becomes indispensable. Voice activation usually works by first setting an activation word; the user then speaks the activation word to the device, which wakes up and enters the pickup state.
The simplest and most intuitive approach to voice activation is to use speech recognition itself: feed the activating speech into a speech recognizer and activate the device if the recognition result is, or contains, the activation word. In practice it suffices to score the activating speech acoustically, computing its acoustic score on the configured activation word, and to set acceptance and rejection thresholds on that score. The threshold, however, is very hard to control: set it too low and many false activations result; set it too high and the device becomes hard to activate. This phenomenon is especially severe for shorter activation words.
How to reduce false activations and missed activations at the same time, i.e., lower the equal error rate of activation, has therefore become a direction of active research for those skilled in the art.
Summary of the Invention
In view of the problems above, the present invention discloses a voice activation detection method, applied to voice detection when activating a speech recognition device provided with an activation word, comprising the following steps:
Step S1: performing endpoint detection on the speech data to be measured, to obtain speech data containing a speech signal;
Step S2: processing, with a pre-trained speech recognition acoustic model, to obtain triphone posterior probabilities associated with the speech data containing the speech signal;
Step S3: performing streaming dynamic programming on the triphone posterior probabilities, to obtain the path score of the speech data containing the speech signal on the activation word;
Step S4: comparing the path score with a preset first threshold:
if the path score is less than the first threshold, determining that the speech data containing the speech signal is non-activating speech, and then exiting;
Step S5: performing backtracking to find the starting position of the speech data containing the speech signal, and obtaining a speech segment according to the starting position;
Step S6: performing forward processing on the speech segment with a pre-trained bidirectional recurrent neural network, and deciding, according to the processing result, whether to activate the speech recognition device.
In the above voice activation detection method, the decision in step S6 of whether to activate the speech recognition device according to the processing result specifically comprises:
comparing the processing result with a preset second threshold, and activating the device when the processing result is greater than the second threshold.
In the above voice activation detection method, the endpoint detection is endpoint detection based on short-time energy, pitch, or a neural network.
In the above voice activation detection method, the speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN-HMM framework.
In the above voice activation detection method, the triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word.
In the above voice activation detection method, the speech segment is a speech segment comprising only the activation word.
In the above voice activation detection method, the bidirectional recurrent neural network is a BLSTM recurrent neural network.
In the above voice activation detection method, in step S6, the training steps for pre-training the bidirectional recurrent neural network comprise:
Step S61: processing speech containing the activation word to obtain speech segments containing only the activation word;
Step S62: training the bidirectional recurrent neural network with the speech segments containing only the activation word.
The present invention also discloses a voice activation detection apparatus, applied on a speech recognition device provided with an activation word to perform voice detection when the speech recognition device is activated, comprising:
an endpoint detection module, which performs endpoint detection on the speech data to be measured to obtain speech data containing a speech signal;
an acoustic scoring module, connected to the endpoint detection module, which processes, with a pre-trained speech recognition acoustic model, to obtain triphone posterior probabilities associated with the speech data containing the speech signal;
a dynamic programming module, connected to the acoustic scoring module, which performs streaming dynamic programming on the triphone posterior probabilities to obtain the path score of the speech data containing the speech signal on the activation word;
a comparison module, connected to the dynamic programming module, in which a first threshold is preset; the comparison module compares the path score with the preset first threshold and decides, according to the comparison result, whether the speech data containing the speech signal is activating speech;
a backtracking module, connected to the comparison module, which, when the comparison result indicates that the speech data containing the speech signal is activating speech, performs backtracking to find the starting position of that speech data and obtains a speech segment according to the starting position;
a processing-and-comparison module, connected to the backtracking module and comprising a pre-trained bidirectional recurrent neural network, which performs forward processing on the speech segment with that network and decides, according to the processing result, whether to activate the speech recognition device.
In the above voice activation detection apparatus, the processing-and-comparison module comprises a processing unit and a comparison unit;
the processing unit performs forward processing on the speech segment with the pre-trained bidirectional recurrent neural network;
the comparison unit compares the processing result with a preset second threshold and activates the device when the processing result is greater than the second threshold.
In the above voice activation detection apparatus, the endpoint detection module is an endpoint detection module based on short-time energy, pitch, or a neural network.
In the above voice activation detection apparatus, the speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN-HMM framework.
In the above voice activation detection apparatus, the triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word.
In the above voice activation detection apparatus, the speech segment is a speech segment comprising only the activation word.
In the above voice activation detection apparatus, the bidirectional recurrent neural network is a BLSTM recurrent neural network.
The invention above has the following advantages or beneficial effects:
The voice activation detection method and apparatus disclosed by the present invention adopt two-stage activation detection. In the first activation confirmation only acoustic scoring is used, together with dynamic programming; the comparison of the path score against a threshold decides whether the speech data containing the speech signal could be activating speech, and segments that could be activating are then sent into the second confirmation, which uses a BLSTM recurrent neural network and computes over all frames of the entire speech to make the final decision on whether to activate the speech recognition device. Of the two activation confirmations, the threshold of the first can be set relatively loose, to guarantee a high detection rate; the second confirmation, with the starting point already known, is comparatively more accurate. The two detections together reduce both false activations and missed activations, i.e., they effectively lower the equal error rate of activation and thus more effectively guarantee activation performance.
Brief Description of the Drawings
The present invention and its features, forms, and advantages will become more apparent from the detailed description of non-limiting embodiments made with reference to the following drawings. The same reference signs indicate the same parts throughout the drawings. The drawings are not necessarily drawn to scale; the emphasis is on illustrating the gist of the present invention.
FIG. 1 is a flowchart of the voice activation detection method in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the voice activation detection apparatus in an embodiment of the present invention.
Detailed Description of the Embodiments
The present invention is further described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.
As shown in FIG. 1, this embodiment relates to a voice activation detection method, applied to voice detection when activating a speech recognition device provided with an activation word; the method mainly comprises the following steps:
Step S1: performing endpoint detection on the speech data to be measured, to obtain speech data containing a speech signal.
The endpoint detection step is placed first in the flow because continuously running acoustic computation on the speech data to be measured (a continuous speech signal) would waste considerable resources, whereas after endpoint detection the subsequent acoustic computation is performed only on speech data containing a speech signal, which saves computing resources. There are many endpoint detection methods, for example methods using short-time energy, methods using pitch, and methods using neural networks (i.e., the endpoint detection can be based on short-time energy, pitch, a neural network, etc.).
In a preferred embodiment of the present invention, a neural network is used to perform endpoint detection on the speech data to be measured, to obtain speech data containing a speech signal. Specifically, the input of the neural network is the speech features of each frame, and its output has 2 nodes, corresponding to speech and non-speech respectively. In the continuous frame-by-frame decision, a preset number of consecutive speech frames is taken as the starting endpoint, and a preset number of consecutive non-speech frames as the ending endpoint.
Step S2: processing, with a pre-trained speech recognition acoustic model, to obtain triphone posterior probabilities associated with the speech data containing the speech signal.
In a preferred embodiment of the present invention, the triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word (i.e., the score computation needs to obtain, for every frame of speech, its score on the triphones contained in the activation word, finally yielding an acoustic score matrix).
In a preferred embodiment of the present invention, the above speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN (deep neural network)-HMM framework.
Step S3: performing streaming dynamic programming on the triphone posterior probabilities, to obtain the path score of the speech data containing the speech signal on the activation word.
In the streaming dynamic programming of the first activation confirmation, the shortest and longest time spans of the activation word must be set in order to limit the size of the search space. Doing so also guarantees the duration of the activation-word segment, which increases reliability. More specifically, the shortest and longest time spans of each phone in the activation word are set.
A dynamic programming algorithm is run on the acoustic score matrix to compute the matching score of each speech segment; if some segment in the speech has a matching score above the threshold, the speech contains the wake-up word. The details are as follows:
For a keyword such as "数字" ("digits"), which contains 2 characters and 4 initials/finals, this is equivalent to 4 triphones, i.e., 12 states; suppose the states are numbered 1-12. For a test utterance, the probabilities of these 12 states are extracted from the acoustic model's output for every frame, as that frame's acoustic score under the "数字" keyword. An utterance of T frames can thus be converted into a 12×T matrix.
For this T-frame utterance, the matching score of any speech segment can be computed from its corresponding 12×T matrix. The computation details are as follows: in general, each state lasts 2-10 frames, so the "数字" keyword spans 24-120 frames. For any frame t in the speech stream, take it as the ending frame of a candidate segment and reach 24 to 120 frames backwards, i.e., take t-120, t-119, ..., t-24 respectively as the starting frame of the segment. This yields 96 candidate cases; dynamic programming is run on the matrix for each of the 96 cases, the result is divided by the frame length to obtain an average score, and the highest average score among the 96 cases is taken as the matching score of frame t.
Step S4: comparing the path score with the preset first threshold: if the path score is less than the first threshold, the speech data containing the speech signal is judged to be non-activating speech, and the procedure exits.
After the first activation decision (which comprises steps S3 and S4), the path score of the dynamic programming is available. This path score is compared with the preset first threshold: a score below this first threshold is regarded as non-activating speech, and the procedure exits; a score above the threshold is regarded as having passed the first activation detection, and step S5 continues.
Step S5: performing backtracking to find the starting position of the speech data containing the speech signal, and obtaining a speech segment according to the starting position.
Specifically, for speech that passed the first activation detection, the backtracking algorithm of the dynamic programming is used to find the starting point, thereby obtaining a speech segment that may contain the activation word. The choice of this segment has a considerable impact on the subsequent secondary confirmation of activation with the bidirectional recurrent neural network; ideally it is a segment that contains exactly the activation word, which gives the best results.
Step S6: performing forward processing on the speech segment with a pre-trained BLSTM (Bidirectional Long Short-Term Memory) recurrent neural network, and deciding, according to the processing result, whether to activate the speech recognition device.
In a BLSTM recurrent neural network, bidirectional long short-term memory is a neural network learning model: "bidirectional" means the input is fed forwards and backwards into two separate recurrent networks, both of which are connected to the same output layer, and "long short-term memory" denotes an alternative neural architecture capable of learning long-term dependencies.
It is worth mentioning here that neural networks, and recurrent neural networks in particular, are widely adopted in the speech recognition field for their powerful modeling capability, and a bidirectional recurrent network has even more powerful modeling capability than a unidirectional one. However, the requirement that the starting and ending points be known before accurate computation is possible has made bidirectional recurrent networks difficult to apply in speech. In this embodiment of the invention, the backtracking algorithm of the dynamic programming finds the starting point for speech that passed the first activation detection, yielding a segment that may contain the activation word, which in turn makes the bidirectional recurrent network applicable to voice activation detection.
In step S6, the BLSTM recurrent neural network must be trained in advance. It contains several hidden layers; its input is the features of the speech segment, and it has 2 output nodes, representing the non-activation node and the activation node respectively. The training data likewise need processing: speech containing the activation word is put through the preceding four processing steps to obtain segments containing only the activation word for training. Counter-examples are false-activation data whose pronunciation resembles the activation word; they are likewise processed into segments for training. During training, every frame of a segment containing the genuine activation word is labeled 1; otherwise every frame is labeled 0.
At the time of the secondary activation-word confirmation, the entire speech segment is sent into the BLSTM recurrent neural network for computation; each speech frame yields an output result, and the final decision is based on the weighted score over all frames.
The BLSTM outputs over all frames of the speech segment are averaged, and a threshold is set on the label-1 node: if the output value is greater than the threshold, the segment is considered to indeed be the activation word and the device is activated; if the output value is less than the threshold, the segment is considered not to be the activation word and the device is not activated.
As shown in FIG. 2, this embodiment relates to a voice activation detection apparatus, applied on a speech recognition device provided with an activation word to perform voice detection when the speech recognition device is activated. Specifically, the voice activation detection apparatus comprises: an endpoint detection module that performs endpoint detection on the speech data to be measured to obtain speech data containing a speech signal; an acoustic scoring module, connected to the endpoint detection module, that processes with a pre-trained speech recognition acoustic model to obtain triphone posterior probabilities associated with the speech data containing the speech signal; a dynamic programming module, connected to the acoustic scoring module, that performs streaming dynamic programming on the triphone posterior probabilities to obtain the path score of the speech data containing the speech signal on the activation word; a comparison module connected to the dynamic programming module; a backtracking module connected to the comparison module; and a processing-and-comparison module connected to the backtracking module. A first threshold is preset in the comparison module, which compares the path score with the preset first threshold and decides, according to the comparison result, whether the speech data containing the speech signal is activating speech. The backtracking module performs backtracking when the comparison result indicates that the speech data containing the speech signal is activating speech, finds the starting position of that speech data, and obtains a speech segment according to the starting position. The processing-and-comparison module comprises a pre-trained bidirectional recurrent neural network, with which it performs forward processing on the speech segment and decides, according to the processing result, whether to activate the speech recognition device.
In a preferred embodiment of the present invention, the above processing-and-comparison module comprises a processing unit that performs forward processing on the speech segment with the pre-trained bidirectional recurrent neural network, and a comparison unit that compares the processing result with a preset second threshold and activates the device when the processing result is greater than the second threshold.
In a preferred embodiment of the present invention, the above endpoint detection module is an endpoint detection module based on short-time energy, pitch, or a neural network.
In a preferred embodiment of the present invention, the above speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN-HMM framework.
In a preferred embodiment of the present invention, the above triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, which comprises the score of each frame of that speech data on the triphones contained in the activation word.
In a preferred embodiment of the present invention, the above speech segment is a speech segment comprising only the activation word.
In a preferred embodiment of the present invention, the above bidirectional recurrent neural network is a BLSTM bidirectional recurrent neural network.
It is easy to see that this embodiment is the structural embodiment corresponding to the embodiment of the voice activation detection method above, and the two embodiments can be implemented in cooperation with each other. The related technical details mentioned in the method embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the related technical details mentioned in this embodiment are also applicable to the method embodiment.
Those skilled in the art should understand that variations can be realized by combining the prior art with the above embodiments; such variations do not affect the substance of the present invention and are not detailed here.
Preferred embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments above, and that devices and structures not described in detail should be understood as being implemented in the ordinary ways of the art. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, make many possible variations and modifications to the technical solution of the present invention using the methods and technical content disclosed above, or modify it into equivalent embodiments of equivalent change, without affecting the substance of the present invention. Therefore, any simple amendment, equivalent change, or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (15)

  1. A voice activation detection method, characterized in that it is applied to voice detection when activating a speech recognition device provided with an activation word, and comprises the following steps:
    Step S1: performing endpoint detection on the speech data to be measured, to obtain speech data containing a speech signal;
    Step S2: processing, with a pre-trained speech recognition acoustic model, to obtain triphone posterior probabilities associated with the speech data containing the speech signal;
    Step S3: performing streaming dynamic programming on the triphone posterior probabilities, to obtain the path score of the speech data containing the speech signal on the activation word;
    Step S4: comparing the path score with a preset first threshold:
    if the path score is less than the first threshold, determining that the speech data containing the speech signal is non-activating speech, and then exiting;
    Step S5: performing backtracking to find the starting position of the speech data containing the speech signal, and obtaining a speech segment according to the starting position;
    Step S6: performing forward processing on the speech segment with a pre-trained bidirectional recurrent neural network, and deciding, according to the processing result, whether to activate the speech recognition device.
  2. The voice activation detection method of claim 1, characterized in that in step S6, the step of deciding whether to activate the speech recognition device according to the processing result specifically comprises:
    comparing the processing result with a preset second threshold, and activating the device when the processing result is greater than the second threshold.
  3. The voice activation detection method of claim 1, characterized in that the endpoint detection is endpoint detection based on short-time energy, pitch, or a neural network.
  4. The voice activation detection method of claim 1, characterized in that the speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN-HMM framework.
  5. The voice activation detection method of claim 1, characterized in that the triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, the acoustic score matrix comprising the score of each frame of the speech data containing the speech signal on the triphones contained in the activation word.
  6. The voice activation detection method of claim 1, characterized in that the speech segment is a speech segment comprising only the activation word.
  7. The voice activation detection method of claim 1, characterized in that the bidirectional recurrent neural network is a BLSTM recurrent neural network.
  8. The voice activation detection method of claim 1, characterized in that in step S6, the training steps for pre-training the bidirectional recurrent neural network comprise:
    Step S61: processing speech containing the activation word to obtain speech segments containing only the activation word;
    Step S62: training the bidirectional recurrent neural network with the speech segments containing only the activation word.
  9. A voice activation detection apparatus, characterized in that it is applied on a speech recognition device provided with an activation word to perform voice detection when the speech recognition device is activated, and comprises:
    an endpoint detection module, which performs endpoint detection on the speech data to be measured to obtain speech data containing a speech signal;
    an acoustic scoring module, connected to the endpoint detection module, which processes, with a pre-trained speech recognition acoustic model, to obtain triphone posterior probabilities associated with the speech data containing the speech signal;
    a dynamic programming module, connected to the acoustic scoring module, which performs streaming dynamic programming on the triphone posterior probabilities to obtain the path score of the speech data containing the speech signal on the activation word;
    a comparison module, connected to the dynamic programming module, in which a first threshold is preset, the comparison module comparing the path score with the preset first threshold and deciding, according to the comparison result, whether the speech data containing the speech signal is activating speech;
    a backtracking module, connected to the comparison module, which, when the comparison result indicates that the speech data containing the speech signal is activating speech, performs backtracking to find the starting position of the speech data containing the speech signal and obtains a speech segment according to the starting position;
    a processing-and-comparison module, connected to the backtracking module and comprising a pre-trained bidirectional recurrent neural network, which performs forward processing on the speech segment with the pre-trained bidirectional recurrent neural network and decides, according to the processing result, whether to activate the speech recognition device.
  10. The voice activation detection apparatus of claim 9, characterized in that the processing-and-comparison module comprises a processing unit and a comparison unit;
    the processing unit performs forward processing on the speech segment with the pre-trained bidirectional recurrent neural network;
    the comparison unit compares the processing result with a preset second threshold and activates the device when the processing result is greater than the second threshold.
  11. The voice activation detection apparatus of claim 9, characterized in that the endpoint detection module is an endpoint detection module based on short-time energy, pitch, or a neural network.
  12. The voice activation detection apparatus of claim 9, characterized in that the speech recognition acoustic model is a GMM-HMM-based acoustic model or an acoustic model based on the DNN-HMM framework.
  13. The voice activation detection apparatus of claim 9, characterized in that the triphone posterior probabilities associated with the speech data containing the speech signal form an acoustic score matrix, the acoustic score matrix comprising the score of each frame of the speech data containing the speech signal on the triphones contained in the activation word.
  14. The voice activation detection apparatus of claim 9, characterized in that the speech segment is a speech segment comprising only the activation word.
  15. The voice activation detection apparatus of claim 9, characterized in that the bidirectional recurrent neural network is a BLSTM recurrent neural network.
PCT/CN2017/103861 2016-10-11 2017-09-28 一种语音激活检测方法及装置 WO2018068649A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610886934.9A CN107919116B (zh) 2016-10-11 2016-10-11 一种语音激活检测方法及装置
CN201610886934.9 2016-10-11

Publications (1)

Publication Number Publication Date
WO2018068649A1 (zh) 2018-04-19

Family

ID=61892655

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/103861 WO2018068649A1 (zh) 2016-10-11 2017-09-28 Voice activation detection method and apparatus

Country Status (3)

Country Link
CN (1) CN107919116B (zh)
TW (1) TWI659412B (zh)
WO (1) WO2018068649A1 (zh)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10264030B2 (en) 2016-02-22 2019-04-16 Sonos, Inc. Networked microphone device control
US9772817B2 (en) 2016-02-22 2017-09-26 Sonos, Inc. Room-corrected voice detection
US9811314B2 (en) 2016-02-22 2017-11-07 Sonos, Inc. Metadata exchange involving a networked playback system and a networked microphone system
US10095470B2 (en) 2016-02-22 2018-10-09 Sonos, Inc. Audio response playback
US10134399B2 (en) 2016-07-15 2018-11-20 Sonos, Inc. Contextualization of voice inputs
US10115400B2 (en) 2016-08-05 2018-10-30 Sonos, Inc. Multiple voice services
US10181323B2 (en) 2016-10-19 2019-01-15 Sonos, Inc. Arbitration-based voice recognition
US10475449B2 (en) 2017-08-07 2019-11-12 Sonos, Inc. Wake-word detection suppression
US10048930B1 (en) 2017-09-08 2018-08-14 Sonos, Inc. Dynamic computation of system response volume
US10482868B2 (en) 2017-09-28 2019-11-19 Sonos, Inc. Multi-channel acoustic echo cancellation
US10466962B2 (en) 2017-09-29 2019-11-05 Sonos, Inc. Media playback system with voice assistance
CN108665889B (zh) * 2018-04-20 2021-09-28 百度在线网络技术(北京)有限公司 Speech signal endpoint detection method, apparatus, device and storage medium
US11175880B2 (en) 2018-05-10 2021-11-16 Sonos, Inc. Systems and methods for voice-assisted media content selection
US10959029B2 (en) 2018-05-25 2021-03-23 Sonos, Inc. Determining and adapting to changes in microphone performance of playback devices
US11076035B2 (en) 2018-08-28 2021-07-27 Sonos, Inc. Do not disturb feature for audio notifications
US10587430B1 (en) 2018-09-14 2020-03-10 Sonos, Inc. Networked devices, systems, and methods for associating playback devices based on sound codes
US11024331B2 (en) 2018-09-21 2021-06-01 Sonos, Inc. Voice detection optimization using sound metadata
US11100923B2 (en) * 2018-09-28 2021-08-24 Sonos, Inc. Systems and methods for selective wake word detection using neural network models
US11899519B2 (en) 2018-10-23 2024-02-13 Sonos, Inc. Multiple stage network microphone device with reduced power consumption and processing load
US11183183B2 (en) 2018-12-07 2021-11-23 Sonos, Inc. Systems and methods of operating media playback systems having multiple voice assistant services
US11132989B2 (en) 2018-12-13 2021-09-28 Sonos, Inc. Networked microphone devices, systems, and methods of localized arbitration
CN109360585A (zh) 2018-12-19 2019-02-19 晶晨半导体(上海)股份有限公司 Voice activation detection method
US11120794B2 (en) 2019-05-03 2021-09-14 Sonos, Inc. Voice assistant persistence across multiple network microphone devices
US11200894B2 (en) 2019-06-12 2021-12-14 Sonos, Inc. Network microphone device with command keyword eventing
US11189286B2 (en) 2019-10-22 2021-11-30 Sonos, Inc. VAS toggle based on device orientation
US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
CN113192499A (zh) * 2020-01-10 2021-07-30 青岛海信移动通信技术股份有限公司 Voice wake-up method and terminal
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
CN113593539A (zh) * 2020-04-30 2021-11-02 阿里巴巴集团控股有限公司 Streaming end-to-end speech recognition method, apparatus and electronic device
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
US11984123B2 (en) 2020-11-12 2024-05-14 Sonos, Inc. Network device interaction by range
CN112652296B (zh) * 2020-12-23 2023-07-04 北京华宇信息技术有限公司 Streaming speech endpoint detection method, apparatus and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281745A (zh) * 2008-05-23 2008-10-08 深圳市北科瑞声科技有限公司 In-vehicle voice interaction system
CN103077708A (zh) * 2012-12-27 2013-05-01 安徽科大讯飞信息科技股份有限公司 Method for improving rejection capability in a speech recognition system
CN103325370A (zh) * 2013-07-01 2013-09-25 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition system
CN104143326A (zh) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and apparatus
CN105374352A (zh) * 2014-08-22 2016-03-02 中国科学院声学研究所 Voice activation method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120446A1 (en) * 2001-02-23 2002-08-29 Motorola, Inc. Detection of inconsistent training data in a voice recognition system
US20030033143A1 (en) * 2001-08-13 2003-02-13 Hagai Aronowitz Decreasing noise sensitivity in speech processing under adverse conditions
CN102194452B (zh) * 2011-04-14 2013-10-23 西安烽火电子科技有限责任公司 Voice activation detection method in complex background noise
CN102436816A (zh) * 2011-09-20 2012-05-02 安徽科大讯飞信息科技股份有限公司 Speech data decoding method and apparatus
US8543397B1 (en) * 2012-10-11 2013-09-24 Google Inc. Mobile device voice activation
CN103839544B (zh) * 2012-11-27 2016-09-07 展讯通信(上海)有限公司 Voice activation detection method and apparatus
CN103646649B (zh) * 2013-12-30 2016-04-13 中国科学院自动化研究所 Efficient speech detection method
CN203882609U (zh) * 2014-05-08 2014-10-15 钰太芯微电子科技(上海)有限公司 Wake-up apparatus based on voice activation detection


Also Published As

Publication number Publication date
CN107919116B (zh) 2019-09-13
TW201814689A (zh) 2018-04-16
TWI659412B (zh) 2019-05-11
CN107919116A (zh) 2018-04-17

Similar Documents

Publication Publication Date Title
WO2018068649A1 (zh) Voice activation detection method and apparatus
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
US10777189B1 (en) Dynamic wakeword detection
CN107767863B (zh) Voice wake-up method, system and intelligent terminal
CN110364143B (zh) Voice wake-up method and apparatus, and intelligent electronic device therefor
US11657832B2 (en) User presence detection
JP7263492B2 (ja) End-to-end streaming keyword spotting
US10304440B1 (en) Keyword spotting using multi-task configuration
CN108711421B (zh) Speech recognition acoustic model building method and apparatus, and electronic device
US11069352B1 (en) Media presence detection
US10872599B1 (en) Wakeword training
WO2019001428A1 (zh) Voice wake-up method and apparatus, and electronic device
US20230089285A1 (en) Natural language understanding
US11205420B1 (en) Speech processing using a recurrent neural network
WO2021057038A1 (zh) Speech recognition and keyword detection apparatus and method based on a multi-task model
US11348601B1 (en) Natural language understanding using voice characteristics
Hwang et al. Online keyword spotting with a character-level recurrent neural network
US11410646B1 (en) Processing complex utterances for natural language understanding
US11398226B1 (en) Complex natural language processing
US11990122B2 (en) User-system dialog expansion
US11138858B1 (en) Event-detection confirmation by voice user interface
US11557292B1 (en) Speech command verification
US11288513B1 (en) Predictive image analysis
CN114945980A (zh) Small-size multi-channel keyword spotting
Li et al. Recurrent neural network based small-footprint wake-up-word speech recognition system with a score calibration method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17860238

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17860238

Country of ref document: EP

Kind code of ref document: A1