TW201830377A - Speech endpoint detection method and speech recognition method - Google Patents

Speech endpoint detection method and speech recognition method

Info

Publication number
TW201830377A
TW201830377A TW107104564A
Authority
TW
Taiwan
Prior art keywords
speech
voice
frame
model
node
Prior art date
Application number
TW107104564A
Other languages
Chinese (zh)
Other versions
TWI659409B (en
Inventor
范利春
Original Assignee
大陸商芋頭科技(杭州)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大陸商芋頭科技(杭州)有限公司 filed Critical 大陸商芋頭科技(杭州)有限公司
Publication of TW201830377A publication Critical patent/TW201830377A/en
Application granted granted Critical
Publication of TWI659409B publication Critical patent/TWI659409B/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech endpoint detection method and a speech recognition method in the technical field of speech recognition. The method comprises: extracting speech features from each frame of speech data and inputting the speech features into a silence model; outputting, according to the speech features, a label indicating whether each frame of speech data is a silence frame; and confirming the speech endpoints of a segment of speech according to the labels of consecutive frames of speech data. While the sound-pickup device is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, the first non-silence frame of that run is judged to be the starting endpoint of the segment of speech; while the device is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, the first silence frame of that run is judged to be the ending endpoint of the segment of speech. The benefit of this solution is that it addresses the inaccuracy of speech endpoint detection in the prior art and the excessive requirements that prior-art methods place on the detection environment.

Description

Speech endpoint detection method and speech recognition method

The present invention relates to the technical field of speech recognition, and in particular to a speech endpoint detection method and a speech recognition method.

With the development of speech recognition technology, speech recognition is used more and more widely in daily life. When a user uses speech recognition on a handheld device, a push-to-talk button is usually used to delimit the start and end of the speech segment to be recognized. In a smart-home environment, however, the user is often too far from the sound-pickup device to delimit the starting and ending endpoints of the speech segment manually with a button, so another mechanism is needed to judge the start and end of speech automatically, namely voice activity detection (Voice Activity Detection, VAD).

The traditional endpoint detection method is mainly based on sub-band energy: the energy of each frame of speech data within a certain frequency band is computed and compared with a preset energy threshold to judge the starting and ending endpoints of speech. This method places high demands on the detection environment; recognition must take place in a quiet environment to guarantee the accuracy of the detected endpoints. In a relatively noisy environment, different kinds of noise affect the energy of different sub-bands and heavily interfere with such endpoint detection; in particular, low signal-to-noise ratios and non-stationary noise severely disturb the sub-band energy computation, making the final detection result inaccurate. Only accurate speech endpoint detection can guarantee that the speech is correctly collected and therefore correctly recognized. Inaccurate endpoint detection may truncate the speech or record additional noise, preventing the recognizer from decoding the whole sentence and causing missed or false detections, or even causing the entire sentence to be decoded incorrectly, which lowers the accuracy of the speech recognition result.
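
For reference, a minimal sketch of such a sub-band-energy detector, as described in the paragraph above, might look as follows; the frame length, frequency band and energy threshold are illustrative assumptions rather than values taken from any particular prior-art system.

```python
import numpy as np

def subband_energy_vad(signal, sr, frame_len=0.025, hop=0.010,
                       band=(300, 3400), threshold=1e-3):
    """Label each frame as speech (True) or silence (False) by its band energy."""
    n_fft = int(frame_len * sr)
    hop_len = int(hop * sr)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    band_mask = (freqs >= band[0]) & (freqs <= band[1])
    labels = []
    for start in range(0, len(signal) - n_fft + 1, hop_len):
        frame = signal[start:start + n_fft] * np.hanning(n_fft)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        energy = spectrum[band_mask].sum() / band_mask.sum()
        # Compare the per-frame sub-band energy with the preset threshold.
        labels.append(energy > threshold)
    return np.array(labels)
```

As the paragraph notes, the weakness of this approach is that noise in the chosen band shifts the measured energy, so a fixed threshold only works reliably in quiet conditions.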

In view of the above problems in the prior art, a technical solution comprising a speech endpoint detection method and a speech recognition method is now provided, aiming to solve the problems in the prior art that speech endpoint detection is inaccurate and that the requirements on the detection environment are too high. The technical solution specifically includes:

A speech endpoint detection method, wherein a silence model for judging whether speech data is a silence frame is trained in advance, a segment of speech comprising consecutive frames of speech data is then acquired from external input, and the following steps are performed:
Step S1, extracting the speech features of each frame of speech data and inputting the speech features into the silence model;
Step S2, the silence model outputting, according to the speech features, a label associated with each frame of speech data, the label indicating whether the frame of speech data is a silence frame;
Step S3, confirming the speech endpoints of the segment of speech according to the labels of consecutive frames of speech data:
while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, judging the first non-silence frame of that run to be the starting endpoint of the segment of speech;
while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, judging the first silence frame of that run to be the ending endpoint of the segment of speech.

Preferably, in the speech endpoint detection method, the silence model is trained in advance by the following method:
Step A1, inputting a preset plurality of items of training speech data, and extracting the speech features of each item of training speech data;
Step A2, performing, according to the corresponding speech features, an automatic labeling operation on each frame of training speech data to obtain a label corresponding to each frame of speech data, the label indicating whether the corresponding frame of speech data is a silence frame or a non-silence frame;
Step A3, training the silence model from the training speech data and the corresponding labels;
a first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames.

Preferably, in the speech endpoint detection method, an annotation text is preset for each item of externally input training speech data to mark the text content corresponding to that training speech data; step A2 then specifically includes:
Step A21, acquiring the speech features and the corresponding annotation text;
Step A22, forcibly aligning the speech features with the corresponding annotation text using a pre-trained acoustic model, so as to obtain, for each frame of speech features, an output label corresponding to a phoneme;
Step A23, post-processing the forcibly aligned training speech data so as to map the output labels of silence phonemes onto the label representing silence frames, and the output labels of non-silence phonemes onto the label representing non-silence frames.

Preferably, in the speech endpoint detection method, the acoustic model trained in advance in step A22 is a Gaussian mixture model-hidden Markov model or a deep neural network-hidden Markov model.

Preferably, in the speech endpoint detection method, the silence model is a deep neural network model comprising a multi-layer neural network.

Preferably, in the speech endpoint detection method, at least one non-linear transformation is included between every two layers of the neural network of the silence model.

Preferably, in the speech endpoint detection method, each layer of the neural network of the silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.

Preferably, in the speech endpoint detection method, the silence model is a deep neural network model comprising a multi-layer neural network; a first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames; step S2 then specifically includes:
Step S21, after the speech features are input into the silence model, obtaining, through the forward computation of the multi-layer neural network, a first value associated with the first node and a second value associated with the second node in the output layer;
Step S22, comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the frame of speech data and outputting it;
if the first value is less than the second value, taking the second node as the label of the frame of speech data and outputting it.
A speech recognition method, wherein the above speech endpoint detection method is used to detect the starting endpoint and the ending endpoint of a segment of speech that needs to be recognized.

The above technical solution has the beneficial effects of providing a speech endpoint detection method that solves the problems in the prior art of inaccurate speech endpoint detection and excessive requirements on the detection environment, thereby improving the accuracy of speech endpoint detection, widening the applicability of the endpoint detection method, and improving the speech recognition process as a whole.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be noted that the embodiments of the present invention and the features of the embodiments may be combined with one another provided there is no conflict.

The present invention is further described below with reference to the accompanying drawings and specific embodiments, which are not to be construed as limiting the present invention.

In view of the above problems in the prior art, a speech endpoint detection method is now provided. In this method, a silence model for judging whether speech data is a silence frame is trained in advance, a segment of speech comprising consecutive frames of speech data is then acquired from external input, and the following steps, shown in FIG. 1, are performed:
Step S1, extracting the speech features of each frame of speech data and inputting the speech features into the silence model;
Step S2, the silence model outputting, according to the speech features, a label associated with each frame of speech data, the label indicating whether the frame of speech data is a silence frame;
Step S3, confirming the speech endpoints of the segment of speech according to the labels of consecutive frames of speech data:
while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, judging the first non-silence frame of that run to be the starting endpoint of the segment of speech;
while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, judging the first silence frame of that run to be the ending endpoint of the segment of speech.

Specifically, in this embodiment, a silence model is first formed; the silence model can be used to judge whether each frame of speech data in a segment of speech is a silence frame. A silence frame is a frame of speech data that contains no valid speech requiring recognition; a non-silence frame is a frame of speech data that contains valid speech requiring recognition.

Then, in this embodiment, after the silence model has been trained, the speech features of each frame of speech data in an externally input segment of speech are extracted, and the extracted speech features are input into the silence model so as to output the label associated with that frame of speech data. In this embodiment there are exactly two labels, indicating respectively that the frame of speech data is a silence frame or a non-silence frame.

In this embodiment, the speech endpoints are judged only after each frame of speech data has been classified as silence or non-silence. A segment of speech is not considered to have started as soon as a single non-silence frame appears, nor to have ended as soon as a single silence frame appears; instead, the starting and ending endpoints of the segment are judged from the number of consecutive silence/non-silence frames, specifically:

while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, the first non-silence frame of that run is judged to be the starting endpoint of the segment of speech;

while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, the first silence frame of that run is judged to be the ending endpoint of the segment of speech.

In a preferred embodiment of the present invention, the first threshold may take the value 30 and the second threshold may take the value 50. That is: while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames exceeds 30 (30 consecutive non-silence frames appear), the first non-silence frame of that run is judged to be the starting endpoint of the segment of speech.

While the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames exceeds 50 (50 consecutive silence frames appear), the first silence frame of that run is judged to be the ending endpoint of the segment of speech.

In another preferred embodiment of the present invention, the first threshold may likewise take the value 70 and the second threshold the value 50.

In other embodiments of the present invention, the values of the first threshold and the second threshold may be set freely according to the actual situation, so as to meet the requirements of speech endpoint detection in different environments.
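
As an illustration of the frame-counting decision in step S3, the following sketch locates the starting and ending endpoints from a stream of per-frame silence/non-silence labels; the input format and the exact comparison against the thresholds are assumptions made for the example, not a definitive implementation of the patented method.

```python
def detect_endpoints(frame_is_speech, start_threshold=30, end_threshold=50):
    """Scan per-frame labels and return (start, end) frame indices.

    frame_is_speech: iterable of booleans, True for a non-silence frame.
    start_threshold / end_threshold: the preset first and second thresholds
    (30 and 50 frames in the embodiment described above).
    """
    activated = False      # whether the pickup device is in the activated state
    run_start = None       # index of the first frame of the current run
    run_length = 0
    start_point = end_point = None

    for i, is_speech in enumerate(frame_is_speech):
        # Before activation we count consecutive non-silence frames,
        # after activation we count consecutive silence frames.
        in_run = is_speech if not activated else not is_speech
        if in_run:
            if run_length == 0:
                run_start = i
            run_length += 1
        else:
            run_length = 0

        if not activated and run_length >= start_threshold:
            start_point = run_start      # first non-silence frame of the run
            activated = True
            run_length = 0
        elif activated and run_length >= end_threshold:
            end_point = run_start        # first silence frame of the run
            break

    return start_point, end_point
```

For a stream of frame labels produced by the silence model, detect_endpoints(labels) would return the frame indices of the starting and ending endpoints, or None where an endpoint has not yet been found.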

In a preferred embodiment of the present invention, the silence model may be trained in advance by the following method, shown in FIG. 2:
Step A1, inputting a preset plurality of items of training speech data, and extracting the speech features of each item of training speech data;
Step A2, performing, according to the corresponding speech features, an automatic labeling operation on each frame of training speech data to obtain a label corresponding to each frame of speech data, the label indicating whether the corresponding frame of speech data is a silence frame or a non-silence frame;
Step A3, training the silence model from the training speech data and the corresponding labels.
A first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames.

Specifically, in this embodiment, a preset plurality of items of training speech data are first input. Training speech data are speech data whose text content is known in advance. The training speech data can be extracted from the Chinese speech data set of a Chinese speech recognition system that has already been trained, and come with annotation texts corresponding to the training speech data. In other words, the training speech data input in step A1 are the same speech data as those used when training the acoustic model for subsequent speech recognition.

In this embodiment, after the training speech data are input, the speech features are extracted for each item of training speech data. Feature extraction can likewise use the speech features extracted when training the acoustic model for speech recognition. Common speech features include Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC), perceptual linear prediction (Perceptual Linear Predictive, PLP) or filter-bank (Filter-Bank, FBANK) features. Likewise, in other embodiments of the present invention, other similar speech features may be used to train the silence model.
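
Purely by way of illustration, FBANK-style features of the kind mentioned above could be extracted with the librosa library roughly as follows; the sampling rate, frame length, hop size and number of mel bands are assumed values, not parameters fixed by this embodiment.

```python
import librosa
import numpy as np

def extract_fbank(wav_path, sr=16000, n_mels=40,
                  frame_len=0.025, hop=0.010):
    """Return a (num_frames, n_mels) matrix of log mel filter-bank features."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mel_spec = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_mels=n_mels,
        n_fft=int(frame_len * sr), hop_length=int(hop * sr))
    # Log compression gives the FBANK features used as model input.
    return np.log(mel_spec + 1e-10).T
```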

In this embodiment, in step A2, before the training speech data can serve as training input for the silence model, an automatic labeling operation must first be performed on them so that the frames of speech data are aligned. In this automatic labeling operation, each frame of speech data receives a label; the automatic labeling procedure is detailed below. Once the automatic labeling operation has been completed, the silence model can be trained.

In a preferred embodiment of the present invention, an annotation text is preset for each item of externally input training speech data to mark the text content corresponding to that training speech data; as shown in FIG. 3, step A2 may then include:
Step A21, acquiring the speech features and the corresponding annotation text;
Step A22, forcibly aligning the speech features with the corresponding annotation text using a pre-trained acoustic model, so as to obtain, for each frame of speech features, an output label corresponding to a phoneme;
Step A23, post-processing the forcibly aligned training speech data so as to map the output labels of silence phonemes onto the label representing silence frames, and the output labels of non-silence phonemes onto the label representing non-silence frames.

Specifically, in this embodiment, labeling the training speech data by hand would require a great deal of labor, and the labeling of noise would be inconsistent across different annotators, which would affect the subsequent model training. The technical solution of the present invention therefore provides an efficient and practicable automatic labeling method.

In this method, the speech features of each frame of training speech data and the corresponding annotation text are first acquired, and the speech features are then forcibly aligned with the annotation text.

In this embodiment, the acoustic model used for subsequent speech recognition (i.e. a pre-trained acoustic model) can be used to forcibly align the speech features with the annotation text. The acoustic model for speech recognition in the present invention may be a Gaussian mixture model-hidden Markov model (Gaussian Mixture Model-Hidden Markov Model, GMM-HMM), a deep neural network-hidden Markov model (Deep Neural Network-Hidden Markov Model, DNN-HMM), or another suitable model. The modeling unit of this acoustic model is at the phone level, for example a context-independent phone (Context Independent Phone, ci-phone) or a context-dependent phone (Context Dependent Phone, cd-phone). Performing forced alignment with such an acoustic model aligns the frames of training speech data to the phoneme level.

In this embodiment, after the post-processing of the forcibly aligned training speech data in step A23, speech data whose frames correspond to silence labels are obtained. In this post-processing operation, some phonemes are usually regarded as silence phonemes and the remaining phonemes as non-silence phonemes; after the above mapping, each frame of speech data is associated with a silence/non-silence label.
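
The post-processing of step A23 amounts to collapsing the phone-level alignment into binary frame labels. A minimal sketch of that mapping is given below; the set of phone names treated as silence (sil, sp, spn) is a common lexicon convention assumed for the example, not a list given by the embodiment.

```python
# Phone names conventionally used for silence/noise in alignment lexicons (assumed set).
SILENCE_PHONES = {"sil", "sp", "spn"}

def phones_to_frame_labels(frame_phones):
    """Map a per-frame phone alignment to binary labels: 0 = silence, 1 = non-silence."""
    return [0 if phone in SILENCE_PHONES else 1 for phone in frame_phones]

# e.g. phones_to_frame_labels(["sil", "sil", "n", "i", "h", "ao", "sil"])
# -> [0, 0, 1, 1, 1, 1, 0]
```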

In a preferred embodiment of the present invention, the silence model can then be trained with the speech features and frame-aligned labels obtained above. The silence model may be a deep neural network model comprising a multi-layer neural network. Each layer of the silence model may be a fully connected neural network, a convolutional neural network, a recurrent neural network, and so on, and one or more non-linear transformations may be included between every two layers, for example a sigmoid, tanh, maxpool, ReLU or softmax non-linear transformation.

In a preferred embodiment of the present invention, as shown in FIG. 4, the silence model comprises a multi-layer neural network 41 and an output layer 42. A first node 421 and a second node 422 are provided in the output layer 42 of the silence model. The first node 421 represents the label corresponding to silence frames, and the second node 422 represents the label corresponding to non-silence frames. A softmax non-linear transformation or another non-linear transformation may be applied to the first node 421 and the second node 422 of the output layer 42, or no non-linear transformation may be used.
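
For illustration, a silence model of this shape (several hidden layers followed by an output layer with two nodes) could be sketched in PyTorch as below; the number and width of the hidden layers and the use of ReLU between layers are assumptions made for the example rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class SilenceModel(nn.Module):
    """Multi-layer network whose output layer has two nodes:
    node 0 for the silence-frame label, node 1 for the non-silence-frame label."""

    def __init__(self, feature_dim=40, hidden_dim=256, num_hidden_layers=3):
        super().__init__()
        layers, in_dim = [], feature_dim
        for _ in range(num_hidden_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]  # non-linearity between layers
            in_dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.output = nn.Linear(in_dim, 2)   # first node: silence, second node: non-silence

    def forward(self, features):
        return self.output(self.hidden(features))
```

Training such a model then reduces to per-frame binary classification of the speech features against the silence/non-silence labels prepared in steps A21 to A23.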

In a preferred embodiment of the present invention, step S2, shown in FIG. 5, specifically includes:
Step S21, after the speech features are input into the silence model, obtaining, through the forward computation of the multi-layer neural network, a first value associated with the first node and a second value associated with the second node in the output layer;
Step S22, comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the frame of speech data and outputting it;
if the first value is less than the second value, taking the second node as the label of the frame of speech data and outputting it.

Specifically, in this embodiment, the speech features are input into the trained silence model, the multi-layer neural network performs forward computation, and finally the values of the two output nodes of the output layer (the first node and the second node) are obtained, namely the first value and the second value. The first value and the second value are then compared: if the first value is larger, the first node is selected as the label of the frame of speech data and output, i.e. the frame is a silence frame; correspondingly, if the second value is larger, the second node is selected as the label of the frame of speech data and output, i.e. the frame is a non-silence frame.
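
Continuing the sketches above, and reusing the hypothetical SilenceModel, extract_fbank and detect_endpoints helpers introduced earlier (names chosen for these examples, not defined by the patent), labeling every frame of a segment of speech and then locating its endpoints could look as follows:

```python
import torch

def label_frames(model, fbank_features):
    """Return a list of booleans, True where the frame is judged non-silence."""
    model.eval()
    with torch.no_grad():
        x = torch.as_tensor(fbank_features, dtype=torch.float32)
        first_value, second_value = model(x).unbind(dim=-1)   # node 0 and node 1 outputs
        # Step S22: the larger of the two values decides the label of the frame.
        return (second_value > first_value).tolist()

# frame_labels = label_frames(silence_model, extract_fbank("utterance.wav"))
# start, end = detect_endpoints(frame_labels)
```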

In a preferred embodiment of the present invention, a complete flow of the above speech endpoint detection method is as follows: first, a pre-trained Chinese speech recognition system is prepared; the speech recognition system chosen here has a Chinese speech data set together with annotation texts for the speech data.

The acoustic model of the above speech recognition system was trained on FBANK speech features, so FBANK features are likewise used when training the silence model.

Speech features are extracted from the training speech data and, together with the corresponding annotation texts, input into the speech recognition system for forced alignment, so that each frame of speech features corresponds to a phone-level label; the non-silence phonemes in the alignment result are then mapped onto the non-silence label and the silence phonemes onto the silence label, completing the preparation of the training data labels for the silence model.

The silence model is then trained from the above training speech data and the corresponding labels. When the silence model formed by this training is used for speech endpoint detection, the speech features of each frame of speech data in a segment of speech are extracted and fed into the trained silence model; after the forward computation of the multi-layer neural network, the first value of the first node and the second value of the second node are output, the two values are compared, and the label of the node with the larger value is output as the label of that frame of speech data, indicating whether the frame is a silence frame or a non-silence frame.
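
A bare-bones training loop for the hypothetical SilenceModel sketched earlier, using a standard cross-entropy objective over the frame labels prepared above, might look as follows; the optimizer, learning rate and number of epochs are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def train_silence_model(model, feature_batches, label_batches,
                        epochs=10, lr=1e-3):
    """feature_batches: float tensors of shape (frames, feature_dim);
    label_batches: long tensors of 0 (silence) / 1 (non-silence) per frame."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in zip(feature_batches, label_batches):
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)   # per-frame classification loss
            loss.backward()
            optimizer.step()
    return model
```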

Finally, it is judged whether runs of consecutive silence/non-silence frames exist: while the sound-pickup device collecting the speech is in the non-activated state, if 30 consecutive non-silence frames appear, the first frame of speech data in those 30 consecutive non-silence frames is taken as the starting endpoint of the whole segment of speech to be recognized; while the sound-pickup device collecting the speech is in the activated state, if 50 consecutive silence frames appear, the first frame of speech data in those 50 consecutive silence frames is taken as the ending endpoint of the whole segment of speech to be recognized.

In a preferred embodiment of the present invention, a speech recognition method is also provided, in which the above speech endpoint detection method is used to detect the starting endpoint and the ending endpoint of a segment of speech that needs to be recognized, so as to determine the range of the speech to be recognized, after which existing speech recognition technology is used to recognize that segment of speech.

The above are merely preferred embodiments of the present invention and do not thereby limit its implementations or scope of protection. Those skilled in the art should appreciate that all solutions obtained by equivalent substitutions and obvious variations made using the contents of the specification and drawings of the present invention fall within the scope of protection of the present invention.

S1‧‧‧Step S1
S2‧‧‧Step S2
S3‧‧‧Step S3
A1‧‧‧Step A1
A2‧‧‧Step A2
A3‧‧‧Step A3
A21‧‧‧Step A21
A22‧‧‧Step A22
A23‧‧‧Step A23
41‧‧‧Multi-layer neural network
42‧‧‧Output layer
421‧‧‧First node
422‧‧‧Second node
S21‧‧‧Step S21
S22‧‧‧Step S22

FIG. 1 is a schematic overall flowchart of a speech endpoint detection method in a preferred embodiment of the present invention; FIG. 2 is a schematic flowchart of training to form the silence model in a preferred embodiment of the present invention; FIG. 3 is a schematic flowchart, based on FIG. 2, of automatically labeling the training speech data in a preferred embodiment of the present invention; FIG. 4 is a schematic structural diagram of the silence model comprising a multi-layer neural network in a preferred embodiment of the present invention; FIG. 5 is a schematic flowchart, based on FIG. 1, of processing and outputting the labels associated with the speech data in a preferred embodiment of the present invention.

Claims (9)

1. A speech endpoint detection method, wherein a silence model for judging whether speech data is a silence frame is trained in advance, a segment of speech comprising consecutive frames of the speech data is then acquired from external input, and the following steps are performed:
step S1, extracting the speech features of each frame of the speech data and inputting the speech features into the silence model;
step S2, the silence model outputting, according to the speech features, a label associated with each frame of the speech data, the label indicating whether the speech data is a silence frame;
step S3, confirming the speech endpoints of the segment of speech according to the labels of consecutive frames of the speech data:
while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames of the speech data is longer than a preset first threshold, judging the first non-silence frame of that run to be the starting endpoint of the segment of speech;
while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames of the speech data is longer than a preset second threshold, judging the first silence frame of that run to be the ending endpoint of the segment of speech.

2. The speech endpoint detection method according to claim 1, wherein the silence model is trained in advance by the following method:
step A1, inputting a preset plurality of items of training speech data, and extracting the speech features of each item of the training speech data;
step A2, performing, according to the corresponding speech features, an automatic labeling operation on each frame of the training speech data to obtain a label corresponding to each frame of the speech data, the label indicating whether the corresponding frame of the speech data is a silence frame or a non-silence frame;
step A3, training the silence model from the training speech data and the corresponding labels;
wherein a first node and a second node are provided on the output layer of the silence model, the first node representing the label corresponding to silence frames and the second node representing the label corresponding to non-silence frames.
3. The speech endpoint detection method according to claim 2, wherein an annotation text is preset for each item of externally input training speech data to mark the text content corresponding to the training speech data; step A2 then specifically comprises:
step A21, acquiring the speech features and the corresponding annotation text;
step A22, forcibly aligning the speech features with the corresponding annotation text using a pre-trained acoustic model, so as to obtain, for each frame of the speech features, an output label corresponding to a phoneme;
step A23, post-processing the forcibly aligned training speech data so as to map the output labels of silence phonemes onto the label representing silence frames, and the output labels of non-silence phonemes onto the label representing non-silence frames.

4. The speech endpoint detection method according to claim 3, wherein, in step A22, the pre-trained acoustic model is a Gaussian mixture model-hidden Markov model or a deep neural network-hidden Markov model.

5. The speech endpoint detection method according to claim 1, wherein the silence model is a deep neural network model comprising a multi-layer neural network.

6. The speech endpoint detection method according to claim 5, wherein at least one non-linear transformation is included between every two layers of the neural network of the silence model.

7. The speech endpoint detection method according to claim 5, wherein each layer of the neural network of the silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.

8. The speech endpoint detection method according to claim 2, wherein the silence model is a deep neural network model comprising a multi-layer neural network; a first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames; step S2 then specifically comprises:
step S21, after the speech features are input into the silence model, obtaining, through the forward computation of the multi-layer neural network, a first value associated with the first node and a second value associated with the second node in the output layer;
step S22, comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the speech data and outputting it;
if the first value is less than the second value, taking the second node as the label of the speech data and outputting it.
9. A speech recognition method, wherein the speech endpoint detection method according to any one of claims 1 to 8 is used to detect the starting endpoint and the ending endpoint of a segment of speech that needs to be recognized.
TW107104564A 2017-02-13 2018-02-08 Speech point detection method and speech recognition method TWI659409B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710076757.2 2017-02-13
CN201710076757.2A CN108428448A (en) 2017-02-13 2017-02-13 A kind of sound end detecting method and audio recognition method

Publications (2)

Publication Number Publication Date
TW201830377A true TW201830377A (en) 2018-08-16
TWI659409B TWI659409B (en) 2019-05-11

Family

ID=63107183

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107104564A TWI659409B (en) 2017-02-13 2018-02-08 Speech point detection method and speech recognition method

Country Status (3)

Country Link
CN (1) CN108428448A (en)
TW (1) TWI659409B (en)
WO (1) WO2018145584A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN110875033A (en) * 2018-09-04 2020-03-10 蔚来汽车有限公司 Method, apparatus, and computer storage medium for determining a voice end point
CN110910905B (en) * 2018-09-18 2023-05-02 京东科技控股股份有限公司 Mute point detection method and device, storage medium and electronic equipment
CN109378016A (en) * 2018-10-10 2019-02-22 四川长虹电器股份有限公司 A kind of keyword identification mask method based on VAD
CN111063356B (en) * 2018-10-17 2023-05-09 北京京东尚科信息技术有限公司 Electronic equipment response method and system, sound box and computer readable storage medium
CN109119070B (en) * 2018-10-19 2021-03-16 科大讯飞股份有限公司 Voice endpoint detection method, device, equipment and storage medium
CN110010153A (en) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 A kind of mute detection method neural network based, terminal device and medium
CN110634483B (en) * 2019-09-03 2021-06-18 北京达佳互联信息技术有限公司 Man-machine interaction method and device, electronic equipment and storage medium
US11227601B2 (en) * 2019-09-21 2022-01-18 Merry Electronics(Shenzhen) Co., Ltd. Computer-implement voice command authentication method and electronic device
CN110827858B (en) * 2019-11-26 2022-06-10 思必驰科技股份有限公司 Voice endpoint detection method and system
CN111128174A (en) * 2019-12-31 2020-05-08 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN111583933B (en) * 2020-04-30 2023-10-27 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
US11870475B2 (en) * 2020-09-29 2024-01-09 Sonos, Inc. Audio playback management of multiple concurrent connections
CN112365899A (en) * 2020-10-30 2021-02-12 北京小米松果电子有限公司 Voice processing method, device, storage medium and terminal equipment
CN112652296B (en) * 2020-12-23 2023-07-04 北京华宇信息技术有限公司 Method, device and equipment for detecting streaming voice endpoint
CN112967739B (en) * 2021-02-26 2022-09-06 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN115910043B (en) * 2023-01-10 2023-06-30 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN116469413B (en) * 2023-04-03 2023-12-01 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IT1315917B1 (en) * 2000-05-10 2003-03-26 Multimedia Technologies Inst M VOICE ACTIVITY DETECTION METHOD AND METHOD FOR LASEGMENTATION OF ISOLATED WORDS AND RELATED APPARATUS.
US20030004720A1 (en) * 2001-01-30 2003-01-02 Harinath Garudadri System and method for computing and transmitting parameters in a distributed voice recognition system
CN100580770C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voice end detection method based on energy and harmonic
JP2007114413A (en) * 2005-10-19 2007-05-10 Toshiba Corp Voice/non-voice discriminating apparatus, voice period detecting apparatus, voice/non-voice discrimination method, voice period detection method, voice/non-voice discrimination program and voice period detection program
TWI299855B (en) * 2006-08-24 2008-08-11 Inventec Besta Co Ltd Detection method for voice activity endpoint
CN101625857B (en) * 2008-07-10 2012-05-09 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN101740024B (en) * 2008-11-19 2012-02-08 中国科学院自动化研究所 Method for automatic evaluation of spoken language fluency based on generalized fluency
CN102034475B (en) * 2010-12-08 2012-08-15 安徽科大讯飞信息科技股份有限公司 Method for interactively scoring open short conversation by using computer
CN103730110B (en) * 2012-10-10 2017-03-01 北京百度网讯科技有限公司 A kind of method and apparatus of detection sound end
CN103117060B (en) * 2013-01-18 2015-10-28 中国科学院声学研究所 For modeling method, the modeling of the acoustic model of speech recognition
JP5753869B2 (en) * 2013-03-26 2015-07-22 富士ソフト株式会社 Speech recognition terminal and speech recognition method using computer terminal
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104681036B (en) * 2014-11-20 2018-09-25 苏州驰声信息科技有限公司 A kind of detecting system and method for language audio
CN104409080B (en) * 2014-12-15 2018-09-18 北京国双科技有限公司 Sound end detecting method and device
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105374350B (en) * 2015-09-29 2017-05-17 百度在线网络技术(北京)有限公司 Speech marking method and device
CN105206258B (en) * 2015-10-19 2018-05-04 百度在线网络技术(北京)有限公司 The generation method and device and phoneme synthesizing method and device of acoustic model
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
CN105976810B (en) * 2016-04-28 2020-08-14 Tcl科技集团股份有限公司 Method and device for detecting end point of effective speech segment of voice
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium

Also Published As

Publication number Publication date
CN108428448A (en) 2018-08-21
WO2018145584A1 (en) 2018-08-16
TWI659409B (en) 2019-05-11

Similar Documents

Publication Publication Date Title
TWI659409B (en) Speech point detection method and speech recognition method
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
WO2017084360A1 (en) Method and system for speech recognition
Van Segbroeck et al. A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice.
WO2020029404A1 (en) Speech processing method and device, computer device and readable storage medium
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
US20140337024A1 (en) Method and system for speech command detection, and information processing system
WO2014153800A1 (en) Voice recognition system
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
JP5051882B2 (en) Voice dialogue apparatus, voice dialogue method, and robot apparatus
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
WO2014173325A1 (en) Gutturophony recognition method and device
CN112509568A (en) Voice awakening method and device
CN111883181A (en) Audio detection method and device, storage medium and electronic device
JP2007288242A (en) Operator evaluation method, device, operator evaluation program, and recording medium
CN111429919B (en) Crosstalk prevention method based on conference real recording system, electronic device and storage medium
TWI299855B (en) Detection method for voice activity endpoint
JP5342629B2 (en) Male and female voice identification method, male and female voice identification device, and program
CN115331670B (en) Off-line voice remote controller for household appliances
JP2000250593A (en) Device and method for speaker recognition
Sudhakar et al. Automatic speech segmentation to improve speech synthesis performance
CN107039046B (en) Voice sound effect mode detection method based on feature fusion