TW201830377A - Speech endpoint detection method and speech recognition method - Google Patents

Speech endpoint detection method and speech recognition method

Info

Publication number
TW201830377A
TW201830377A TW107104564A
Authority
TW
Taiwan
Prior art keywords
speech
voice
frame
model
node
Prior art date
Application number
TW107104564A
Other languages
Chinese (zh)
Other versions
TWI659409B (en
Inventor
范利春
Original Assignee
大陸商芋頭科技(杭州)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大陸商芋頭科技(杭州)有限公司 filed Critical 大陸商芋頭科技(杭州)有限公司
Publication of TW201830377A publication Critical patent/TW201830377A/en
Application granted granted Critical
Publication of TWI659409B publication Critical patent/TWI659409B/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech endpoint detection method and a speech recognition method in the technical field of speech recognition. The method comprises: extracting speech features from each frame of speech data and inputting the speech features into a silence model; outputting, according to the speech features, a label indicating whether each frame of speech data is a silence frame; and confirming the speech endpoints of a segment of speech according to the labels of consecutive frames of speech data. While the sound-pickup device is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, the first non-silence frame of that run is judged to be the starting endpoint of the segment of speech; while the device is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, the first silence frame of that run is judged to be the ending endpoint of the segment of speech. The benefit of this solution is that it addresses the inaccuracy of speech endpoint detection in the prior art and the excessive requirements that prior-art methods place on the detection environment.

Description

Speech endpoint detection method and speech recognition method

The present invention relates to the technical field of speech recognition, and in particular to a speech endpoint detection method and a speech recognition method.

With the development of speech recognition technology, speech recognition is used more and more widely in daily life. When a user uses speech recognition on a handheld device, a push-to-talk button is usually used to delimit the start and end of the speech segment to be recognized. In a smart-home environment, however, the user is often too far from the sound-pickup device to delimit the starting and ending endpoints of the speech segment manually with a button, so another mechanism is needed to judge the start and end of speech automatically, namely voice activity detection (Voice Activity Detection, VAD).

The traditional endpoint detection method is mainly based on sub-band energy: the energy of each frame of speech data within a certain frequency band is computed and compared with a preset energy threshold to judge the starting and ending endpoints of speech. This method places high demands on the detection environment; recognition must take place in a quiet environment to guarantee the accuracy of the detected endpoints. In a relatively noisy environment, different kinds of noise affect the energy of different sub-bands and heavily interfere with such endpoint detection; in particular, low signal-to-noise ratios and non-stationary noise severely disturb the sub-band energy computation, making the final detection result inaccurate. Only accurate speech endpoint detection can guarantee that the speech is correctly collected and therefore correctly recognized. Inaccurate endpoint detection may truncate the speech or record additional noise, preventing the recognizer from decoding the whole sentence and causing missed or false detections, or even causing the entire sentence to be decoded incorrectly, which lowers the accuracy of the speech recognition result.
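
For reference, a minimal sketch of such a sub-band-energy detector, as described in the paragraph above, might look as follows; the frame length, frequency band and energy threshold are illustrative assumptions rather than values taken from any particular prior-art system.

```python
import numpy as np

def subband_energy_vad(signal, sr, frame_len=0.025, hop=0.010,
                       band=(300, 3400), threshold=1e-3):
    """Label each frame as speech (True) or silence (False) by its band energy."""
    n_fft = int(frame_len * sr)
    hop_len = int(hop * sr)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    band_mask = (freqs >= band[0]) & (freqs <= band[1])
    labels = []
    for start in range(0, len(signal) - n_fft + 1, hop_len):
        frame = signal[start:start + n_fft] * np.hanning(n_fft)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        energy = spectrum[band_mask].sum() / band_mask.sum()
        # Compare the per-frame sub-band energy with the preset threshold.
        labels.append(energy > threshold)
    return np.array(labels)
```

As the paragraph notes, the weakness of this approach is that noise in the chosen band shifts the measured energy, so a fixed threshold only works reliably in quiet conditions.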

In view of the above problems in the prior art, a technical solution comprising a speech endpoint detection method and a speech recognition method is now provided, aiming to solve the problems in the prior art that speech endpoint detection is inaccurate and that the requirements on the detection environment are too high. The technical solution specifically includes:

A speech endpoint detection method, wherein a silence model for judging whether speech data is a silence frame is trained in advance, a segment of speech comprising consecutive frames of speech data is then acquired from external input, and the following steps are performed:
Step S1, extracting the speech features of each frame of speech data and inputting the speech features into the silence model;
Step S2, the silence model outputting, according to the speech features, a label associated with each frame of speech data, the label indicating whether the frame of speech data is a silence frame;
Step S3, confirming the speech endpoints of the segment of speech according to the labels of consecutive frames of speech data:
while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, judging the first non-silence frame of that run to be the starting endpoint of the segment of speech;
while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, judging the first silence frame of that run to be the ending endpoint of the segment of speech.

Preferably, in the speech endpoint detection method, the silence model is trained in advance by the following method:
Step A1, inputting a preset plurality of items of training speech data, and extracting the speech features of each item of training speech data;
Step A2, performing, according to the corresponding speech features, an automatic labeling operation on each frame of training speech data to obtain a label corresponding to each frame of speech data, the label indicating whether the corresponding frame of speech data is a silence frame or a non-silence frame;
Step A3, training the silence model from the training speech data and the corresponding labels;
a first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames.

Preferably, in the speech endpoint detection method, an annotation text is preset for each item of externally input training speech data to mark the text content corresponding to that training speech data; step A2 then specifically includes:
Step A21, acquiring the speech features and the corresponding annotation text;
Step A22, forcibly aligning the speech features with the corresponding annotation text using a pre-trained acoustic model, so as to obtain, for each frame of speech features, an output label corresponding to a phoneme;
Step A23, post-processing the forcibly aligned training speech data so as to map the output labels of silence phonemes onto the label representing silence frames, and the output labels of non-silence phonemes onto the label representing non-silence frames.

Preferably, in the speech endpoint detection method, the acoustic model trained in advance in step A22 is a Gaussian mixture model-hidden Markov model or a deep neural network-hidden Markov model.

Preferably, in the speech endpoint detection method, the silence model is a deep neural network model comprising a multi-layer neural network.

Preferably, in the speech endpoint detection method, at least one non-linear transformation is included between every two layers of the neural network of the silence model.

Preferably, in the speech endpoint detection method, each layer of the neural network of the silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.

Preferably, in the speech endpoint detection method, the silence model is a deep neural network model comprising a multi-layer neural network; a first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames; step S2 then specifically includes:
Step S21, after the speech features are input into the silence model, obtaining, through the forward computation of the multi-layer neural network, a first value associated with the first node and a second value associated with the second node in the output layer;
Step S22, comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the frame of speech data and outputting it;
if the first value is less than the second value, taking the second node as the label of the frame of speech data and outputting it.
A speech recognition method, wherein the above speech endpoint detection method is used to detect the starting endpoint and the ending endpoint of a segment of speech that needs to be recognized.

The above technical solution has the beneficial effects of providing a speech endpoint detection method that solves the problems in the prior art of inaccurate speech endpoint detection and excessive requirements on the detection environment, thereby improving the accuracy of speech endpoint detection, widening the applicability of the endpoint detection method, and improving the speech recognition process as a whole.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be noted that the embodiments of the present invention and the features of the embodiments may be combined with one another provided there is no conflict.

The present invention is further described below with reference to the accompanying drawings and specific embodiments, which are not to be construed as limiting the present invention.

In view of the above problems in the prior art, a speech endpoint detection method is now provided. In this method, a silence model for judging whether speech data is a silence frame is trained in advance, a segment of speech comprising consecutive frames of speech data is then acquired from external input, and the following steps, shown in FIG. 1, are performed:
Step S1, extracting the speech features of each frame of speech data and inputting the speech features into the silence model;
Step S2, the silence model outputting, according to the speech features, a label associated with each frame of speech data, the label indicating whether the frame of speech data is a silence frame;
Step S3, confirming the speech endpoints of the segment of speech according to the labels of consecutive frames of speech data:
while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, judging the first non-silence frame of that run to be the starting endpoint of the segment of speech;
while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, judging the first silence frame of that run to be the ending endpoint of the segment of speech.

Specifically, in this embodiment, a silence model is first formed; the silence model can be used to judge whether each frame of speech data in a segment of speech is a silence frame. A silence frame is a frame of speech data that contains no valid speech requiring recognition; a non-silence frame is a frame of speech data that contains valid speech requiring recognition.

Then, in this embodiment, after the silence model has been trained, the speech features of each frame of speech data in an externally input segment of speech are extracted, and the extracted speech features are input into the silence model so as to output the label associated with that frame of speech data. In this embodiment there are exactly two labels, indicating respectively that the frame of speech data is a silence frame or a non-silence frame.

In this embodiment, the speech endpoints are judged only after each frame of speech data has been classified as silence or non-silence. A segment of speech is not considered to have started as soon as a single non-silence frame appears, nor to have ended as soon as a single silence frame appears; instead, the starting and ending endpoints of the segment are judged from the number of consecutive silence/non-silence frames, specifically:

while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames is longer than a preset first threshold, the first non-silence frame of that run is judged to be the starting endpoint of the segment of speech;

while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames is longer than a preset second threshold, the first silence frame of that run is judged to be the ending endpoint of the segment of speech.

In a preferred embodiment of the present invention, the first threshold may take the value 30 and the second threshold may take the value 50. That is: while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames exceeds 30 (30 consecutive non-silence frames appear), the first non-silence frame of that run is judged to be the starting endpoint of the segment of speech.

While the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames exceeds 50 (50 consecutive silence frames appear), the first silence frame of that run is judged to be the ending endpoint of the segment of speech.

In another preferred embodiment of the present invention, the first threshold may likewise take the value 70 and the second threshold the value 50.

In other embodiments of the present invention, the values of the first threshold and the second threshold may be set freely according to the actual situation, so as to meet the requirements of speech endpoint detection in different environments.
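
As an illustration of the frame-counting decision in step S3, the following sketch locates the starting and ending endpoints from a stream of per-frame silence/non-silence labels; the input format and the exact comparison against the thresholds are assumptions made for the example, not a definitive implementation of the patented method.

```python
def detect_endpoints(frame_is_speech, start_threshold=30, end_threshold=50):
    """Scan per-frame labels and return (start, end) frame indices.

    frame_is_speech: iterable of booleans, True for a non-silence frame.
    start_threshold / end_threshold: the preset first and second thresholds
    (30 and 50 frames in the embodiment described above).
    """
    activated = False      # whether the pickup device is in the activated state
    run_start = None       # index of the first frame of the current run
    run_length = 0
    start_point = end_point = None

    for i, is_speech in enumerate(frame_is_speech):
        # Before activation we count consecutive non-silence frames,
        # after activation we count consecutive silence frames.
        in_run = is_speech if not activated else not is_speech
        if in_run:
            if run_length == 0:
                run_start = i
            run_length += 1
        else:
            run_length = 0

        if not activated and run_length >= start_threshold:
            start_point = run_start      # first non-silence frame of the run
            activated = True
            run_length = 0
        elif activated and run_length >= end_threshold:
            end_point = run_start        # first silence frame of the run
            break

    return start_point, end_point
```

For a stream of frame labels produced by the silence model, detect_endpoints(labels) would return the frame indices of the starting and ending endpoints, or None where an endpoint has not yet been found.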

In a preferred embodiment of the present invention, the silence model may be trained in advance by the following method, shown in FIG. 2:
Step A1, inputting a preset plurality of items of training speech data, and extracting the speech features of each item of training speech data;
Step A2, performing, according to the corresponding speech features, an automatic labeling operation on each frame of training speech data to obtain a label corresponding to each frame of speech data, the label indicating whether the corresponding frame of speech data is a silence frame or a non-silence frame;
Step A3, training the silence model from the training speech data and the corresponding labels.
A first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames.

Specifically, in this embodiment, a preset plurality of items of training speech data are first input. Training speech data are speech data whose text content is known in advance. The training speech data can be extracted from the Chinese speech data set of a Chinese speech recognition system that has already been trained, and come with annotation texts corresponding to the training speech data. In other words, the training speech data input in step A1 are the same speech data as those used when training the acoustic model for subsequent speech recognition.

In this embodiment, after the training speech data are input, the speech features are extracted for each item of training speech data. Feature extraction can likewise use the speech features extracted when training the acoustic model for speech recognition. Common speech features include Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC), perceptual linear prediction (Perceptual Linear Predictive, PLP) or filter-bank (Filter-Bank, FBANK) features. Likewise, in other embodiments of the present invention, other similar speech features may be used to train the silence model.
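
Purely by way of illustration, FBANK-style features of the kind mentioned above could be extracted with the librosa library roughly as follows; the sampling rate, frame length, hop size and number of mel bands are assumed values, not parameters fixed by this embodiment.

```python
import librosa
import numpy as np

def extract_fbank(wav_path, sr=16000, n_mels=40,
                  frame_len=0.025, hop=0.010):
    """Return a (num_frames, n_mels) matrix of log mel filter-bank features."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mel_spec = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_mels=n_mels,
        n_fft=int(frame_len * sr), hop_length=int(hop * sr))
    # Log compression gives the FBANK features used as model input.
    return np.log(mel_spec + 1e-10).T
```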

In this embodiment, in step A2, before the training speech data can serve as training input for the silence model, an automatic labeling operation must first be performed on them so that the frames of speech data are aligned. In this automatic labeling operation, each frame of speech data receives a label; the automatic labeling procedure is detailed below. Once the automatic labeling operation has been completed, the silence model can be trained.

In a preferred embodiment of the present invention, an annotation text is preset for each item of externally input training speech data to mark the text content corresponding to that training speech data; as shown in FIG. 3, step A2 may then include:
Step A21, acquiring the speech features and the corresponding annotation text;
Step A22, forcibly aligning the speech features with the corresponding annotation text using a pre-trained acoustic model, so as to obtain, for each frame of speech features, an output label corresponding to a phoneme;
Step A23, post-processing the forcibly aligned training speech data so as to map the output labels of silence phonemes onto the label representing silence frames, and the output labels of non-silence phonemes onto the label representing non-silence frames.

Specifically, in this embodiment, labeling the training speech data by hand would require a great deal of labor, and the labeling of noise would be inconsistent across different annotators, which would affect the subsequent model training. The technical solution of the present invention therefore provides an efficient and practicable automatic labeling method.

In this method, the speech features of each frame of training speech data and the corresponding annotation text are first acquired, and the speech features are then forcibly aligned with the annotation text.

In this embodiment, the acoustic model used for subsequent speech recognition (i.e. a pre-trained acoustic model) can be used to forcibly align the speech features with the annotation text. The acoustic model for speech recognition in the present invention may be a Gaussian mixture model-hidden Markov model (Gaussian Mixture Model-Hidden Markov Model, GMM-HMM), a deep neural network-hidden Markov model (Deep Neural Network-Hidden Markov Model, DNN-HMM), or another suitable model. The modeling unit of this acoustic model is at the phone level, for example a context-independent phone (Context Independent Phone, ci-phone) or a context-dependent phone (Context Dependent Phone, cd-phone). Performing forced alignment with such an acoustic model aligns the frames of training speech data to the phoneme level.

In this embodiment, after the post-processing of the forcibly aligned training speech data in step A23, speech data whose frames correspond to silence labels are obtained. In this post-processing operation, some phonemes are usually regarded as silence phonemes and the remaining phonemes as non-silence phonemes; after the above mapping, each frame of speech data is associated with a silence/non-silence label.
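
The post-processing of step A23 amounts to collapsing the phone-level alignment into binary frame labels. A minimal sketch of that mapping is given below; the set of phone names treated as silence (sil, sp, spn) is a common lexicon convention assumed for the example, not a list given by the embodiment.

```python
# Phone names conventionally used for silence/noise in alignment lexicons (assumed set).
SILENCE_PHONES = {"sil", "sp", "spn"}

def phones_to_frame_labels(frame_phones):
    """Map a per-frame phone alignment to binary labels: 0 = silence, 1 = non-silence."""
    return [0 if phone in SILENCE_PHONES else 1 for phone in frame_phones]

# e.g. phones_to_frame_labels(["sil", "sil", "n", "i", "h", "ao", "sil"])
# -> [0, 0, 1, 1, 1, 1, 0]
```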

In a preferred embodiment of the present invention, the silence model can then be trained with the speech features and frame-aligned labels obtained above. The silence model may be a deep neural network model comprising a multi-layer neural network. Each layer of the silence model may be a fully connected neural network, a convolutional neural network, a recurrent neural network, and so on, and one or more non-linear transformations may be included between every two layers, for example a sigmoid, tanh, maxpool, ReLU or softmax non-linear transformation.

In a preferred embodiment of the present invention, as shown in FIG. 4, the silence model comprises a multi-layer neural network 41 and an output layer 42. A first node 421 and a second node 422 are provided in the output layer 42 of the silence model. The first node 421 represents the label corresponding to silence frames, and the second node 422 represents the label corresponding to non-silence frames. A softmax non-linear transformation or another non-linear transformation may be applied to the first node 421 and the second node 422 of the output layer 42, or no non-linear transformation may be used.
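
For illustration, a silence model of this shape (several hidden layers followed by an output layer with two nodes) could be sketched in PyTorch as below; the number and width of the hidden layers and the use of ReLU between layers are assumptions made for the example rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class SilenceModel(nn.Module):
    """Multi-layer network whose output layer has two nodes:
    node 0 for the silence-frame label, node 1 for the non-silence-frame label."""

    def __init__(self, feature_dim=40, hidden_dim=256, num_hidden_layers=3):
        super().__init__()
        layers, in_dim = [], feature_dim
        for _ in range(num_hidden_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]  # non-linearity between layers
            in_dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.output = nn.Linear(in_dim, 2)   # first node: silence, second node: non-silence

    def forward(self, features):
        return self.output(self.hidden(features))
```

Training such a model then reduces to per-frame binary classification of the speech features against the silence/non-silence labels prepared in steps A21 to A23.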

In a preferred embodiment of the present invention, step S2, shown in FIG. 5, specifically includes:
Step S21, after the speech features are input into the silence model, obtaining, through the forward computation of the multi-layer neural network, a first value associated with the first node and a second value associated with the second node in the output layer;
Step S22, comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the frame of speech data and outputting it;
if the first value is less than the second value, taking the second node as the label of the frame of speech data and outputting it.

Specifically, in this embodiment, the speech features are input into the trained silence model, the multi-layer neural network performs forward computation, and finally the values of the two output nodes of the output layer (the first node and the second node) are obtained, namely the first value and the second value. The first value and the second value are then compared: if the first value is larger, the first node is selected as the label of the frame of speech data and output, i.e. the frame is a silence frame; correspondingly, if the second value is larger, the second node is selected as the label of the frame of speech data and output, i.e. the frame is a non-silence frame.
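
Continuing the sketches above, and reusing the hypothetical SilenceModel, extract_fbank and detect_endpoints helpers introduced earlier (names chosen for these examples, not defined by the patent), labeling every frame of a segment of speech and then locating its endpoints could look as follows:

```python
import torch

def label_frames(model, fbank_features):
    """Return a list of booleans, True where the frame is judged non-silence."""
    model.eval()
    with torch.no_grad():
        x = torch.as_tensor(fbank_features, dtype=torch.float32)
        first_value, second_value = model(x).unbind(dim=-1)   # node 0 and node 1 outputs
        # Step S22: the larger of the two values decides the label of the frame.
        return (second_value > first_value).tolist()

# frame_labels = label_frames(silence_model, extract_fbank("utterance.wav"))
# start, end = detect_endpoints(frame_labels)
```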

In a preferred embodiment of the present invention, a complete flow of the above speech endpoint detection method is as follows: first, a pre-trained Chinese speech recognition system is prepared; the speech recognition system chosen here has a Chinese speech data set together with annotation texts for the speech data.

The acoustic model of the above speech recognition system was trained on FBANK speech features, so FBANK features are likewise used when training the silence model.

Speech features are extracted from the training speech data and, together with the corresponding annotation texts, input into the speech recognition system for forced alignment, so that each frame of speech features corresponds to a phone-level label; the non-silence phonemes in the alignment result are then mapped onto the non-silence label and the silence phonemes onto the silence label, completing the preparation of the training data labels for the silence model.

The silence model is then trained from the above training speech data and the corresponding labels. When the silence model formed by this training is used for speech endpoint detection, the speech features of each frame of speech data in a segment of speech are extracted and fed into the trained silence model; after the forward computation of the multi-layer neural network, the first value of the first node and the second value of the second node are output, the two values are compared, and the label of the node with the larger value is output as the label of that frame of speech data, indicating whether the frame is a silence frame or a non-silence frame.
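
A bare-bones training loop for the hypothetical SilenceModel sketched earlier, using a standard cross-entropy objective over the frame labels prepared above, might look as follows; the optimizer, learning rate and number of epochs are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def train_silence_model(model, feature_batches, label_batches,
                        epochs=10, lr=1e-3):
    """feature_batches: float tensors of shape (frames, feature_dim);
    label_batches: long tensors of 0 (silence) / 1 (non-silence) per frame."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in zip(feature_batches, label_batches):
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)   # per-frame classification loss
            loss.backward()
            optimizer.step()
    return model
```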

Finally, it is judged whether runs of consecutive silence/non-silence frames exist: while the sound-pickup device collecting the speech is in the non-activated state, if 30 consecutive non-silence frames appear, the first frame of speech data in those 30 consecutive non-silence frames is taken as the starting endpoint of the whole segment of speech to be recognized; while the sound-pickup device collecting the speech is in the activated state, if 50 consecutive silence frames appear, the first frame of speech data in those 50 consecutive silence frames is taken as the ending endpoint of the whole segment of speech to be recognized.

In a preferred embodiment of the present invention, a speech recognition method is also provided, in which the above speech endpoint detection method is used to detect the starting endpoint and the ending endpoint of a segment of speech that needs to be recognized, so as to determine the range of the speech to be recognized, after which existing speech recognition technology is used to recognize that segment of speech.

The above are merely preferred embodiments of the present invention and do not thereby limit its implementations or scope of protection. Those skilled in the art should appreciate that all solutions obtained by equivalent substitutions and obvious variations made using the contents of the specification and drawings of the present invention fall within the scope of protection of the present invention.

S1‧‧‧Step S1
S2‧‧‧Step S2
S3‧‧‧Step S3
A1‧‧‧Step A1
A2‧‧‧Step A2
A3‧‧‧Step A3
A21‧‧‧Step A21
A22‧‧‧Step A22
A23‧‧‧Step A23
41‧‧‧Multi-layer neural network
42‧‧‧Output layer
421‧‧‧First node
422‧‧‧Second node
S21‧‧‧Step S21
S22‧‧‧Step S22

FIG. 1 is a schematic overall flowchart of a speech endpoint detection method in a preferred embodiment of the present invention; FIG. 2 is a schematic flowchart of training to form the silence model in a preferred embodiment of the present invention; FIG. 3 is a schematic flowchart, based on FIG. 2, of automatically labeling the training speech data in a preferred embodiment of the present invention; FIG. 4 is a schematic structural diagram of the silence model comprising a multi-layer neural network in a preferred embodiment of the present invention; FIG. 5 is a schematic flowchart, based on FIG. 1, of processing and outputting the labels associated with the speech data in a preferred embodiment of the present invention.

Claims (9)

1. A speech endpoint detection method, wherein a silence model for judging whether speech data is a silence frame is trained in advance, a segment of speech comprising consecutive frames of the speech data is then acquired from external input, and the following steps are performed:
step S1, extracting the speech features of each frame of the speech data and inputting the speech features into the silence model;
step S2, the silence model outputting, according to the speech features, a label associated with each frame of the speech data, the label indicating whether the speech data is a silence frame;
step S3, confirming the speech endpoints of the segment of speech according to the labels of consecutive frames of the speech data:
while the sound-pickup device collecting the speech is in the non-activated state, if the run of consecutive non-silence frames of the speech data is longer than a preset first threshold, judging the first non-silence frame of that run to be the starting endpoint of the segment of speech;
while the sound-pickup device collecting the speech is in the activated state, if the run of consecutive silence frames of the speech data is longer than a preset second threshold, judging the first silence frame of that run to be the ending endpoint of the segment of speech.

2. The speech endpoint detection method according to claim 1, wherein the silence model is trained in advance by the following method:
step A1, inputting a preset plurality of items of training speech data, and extracting the speech features of each item of the training speech data;
step A2, performing, according to the corresponding speech features, an automatic labeling operation on each frame of the training speech data to obtain a label corresponding to each frame of the speech data, the label indicating whether the corresponding frame of the speech data is a silence frame or a non-silence frame;
step A3, training the silence model from the training speech data and the corresponding labels;
wherein a first node and a second node are provided on the output layer of the silence model, the first node representing the label corresponding to silence frames and the second node representing the label corresponding to non-silence frames.
3. The speech endpoint detection method according to claim 2, wherein an annotation text is preset for each item of externally input training speech data to mark the text content corresponding to the training speech data; step A2 then specifically comprises:
step A21, acquiring the speech features and the corresponding annotation text;
step A22, forcibly aligning the speech features with the corresponding annotation text using a pre-trained acoustic model, so as to obtain, for each frame of the speech features, an output label corresponding to a phoneme;
step A23, post-processing the forcibly aligned training speech data so as to map the output labels of silence phonemes onto the label representing silence frames, and the output labels of non-silence phonemes onto the label representing non-silence frames.

4. The speech endpoint detection method according to claim 3, wherein, in step A22, the pre-trained acoustic model is a Gaussian mixture model-hidden Markov model or a deep neural network-hidden Markov model.

5. The speech endpoint detection method according to claim 1, wherein the silence model is a deep neural network model comprising a multi-layer neural network.

6. The speech endpoint detection method according to claim 5, wherein at least one non-linear transformation is included between every two layers of the neural network of the silence model.

7. The speech endpoint detection method according to claim 5, wherein each layer of the neural network of the silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.

8. The speech endpoint detection method according to claim 2, wherein the silence model is a deep neural network model comprising a multi-layer neural network; a first node and a second node are provided on the output layer of the silence model; the first node represents the label corresponding to silence frames; the second node represents the label corresponding to non-silence frames; step S2 then specifically comprises:
step S21, after the speech features are input into the silence model, obtaining, through the forward computation of the multi-layer neural network, a first value associated with the first node and a second value associated with the second node in the output layer;
step S22, comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the speech data and outputting it;
if the first value is less than the second value, taking the second node as the label of the speech data and outputting it.
9. A speech recognition method, wherein the speech endpoint detection method according to any one of claims 1 to 8 is used to detect the starting endpoint and the ending endpoint of a segment of speech that needs to be recognized.
TW107104564A 2017-02-13 2018-02-08 Speech point detection method and speech recognition method TWI659409B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710076757.2 2017-02-13
CN201710076757.2A CN108428448A (en) 2017-02-13 2017-02-13 A kind of sound end detecting method and audio recognition method

Publications (2)

Publication Number Publication Date
TW201830377A true TW201830377A (en) 2018-08-16
TWI659409B TWI659409B (en) 2019-05-11

Family

ID=63107183

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107104564A TWI659409B (en) 2017-02-13 2018-02-08 Speech point detection method and speech recognition method

Country Status (3)

Country Link
CN (1) CN108428448A (en)
TW (1) TWI659409B (en)
WO (1) WO2018145584A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN110875033A (en) * 2018-09-04 2020-03-10 蔚来汽车有限公司 Method, apparatus, and computer storage medium for determining a voice end point
CN110910905B (en) * 2018-09-18 2023-05-02 京东科技控股股份有限公司 Mute point detection method and device, storage medium and electronic equipment
CN109378016A (en) * 2018-10-10 2019-02-22 四川长虹电器股份有限公司 A kind of keyword identification mask method based on VAD
CN111063356B (en) * 2018-10-17 2023-05-09 北京京东尚科信息技术有限公司 Electronic equipment response method and system, sound box and computer readable storage medium
CN109119070B (en) * 2018-10-19 2021-03-16 科大讯飞股份有限公司 Voice endpoint detection method, device, equipment and storage medium
CN110010153A (en) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 A kind of mute detection method neural network based, terminal device and medium
CN110634483B (en) * 2019-09-03 2021-06-18 北京达佳互联信息技术有限公司 Man-machine interaction method and device, electronic equipment and storage medium
US11227601B2 (en) * 2019-09-21 2022-01-18 Merry Electronics(Shenzhen) Co., Ltd. Computer-implement voice command authentication method and electronic device
CN110827858B (en) * 2019-11-26 2022-06-10 思必驰科技股份有限公司 Voice endpoint detection method and system
CN111128174A (en) * 2019-12-31 2020-05-08 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN111583933B (en) * 2020-04-30 2023-10-27 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
US11870475B2 (en) * 2020-09-29 2024-01-09 Sonos, Inc. Audio playback management of multiple concurrent connections
CN112365899A (en) * 2020-10-30 2021-02-12 北京小米松果电子有限公司 Voice processing method, device, storage medium and terminal equipment
CN112652296B (en) * 2020-12-23 2023-07-04 北京华宇信息技术有限公司 Method, device and equipment for detecting streaming voice endpoint
CN112967739B (en) * 2021-02-26 2022-09-06 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN115910043B (en) * 2023-01-10 2023-06-30 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN116469413B (en) * 2023-04-03 2023-12-01 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IT1315917B1 (en) * 2000-05-10 2003-03-26 Multimedia Technologies Inst M VOICE ACTIVITY DETECTION METHOD AND METHOD FOR LASEGMENTATION OF ISOLATED WORDS AND RELATED APPARATUS.
US20030004720A1 (en) * 2001-01-30 2003-01-02 Harinath Garudadri System and method for computing and transmitting parameters in a distributed voice recognition system
CN100580770C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voice end detection method based on energy and harmonic
JP2007114413A (en) * 2005-10-19 2007-05-10 Toshiba Corp Voice/non-voice discriminating apparatus, voice period detecting apparatus, voice/non-voice discrimination method, voice period detection method, voice/non-voice discrimination program and voice period detection program
TWI299855B (en) * 2006-08-24 2008-08-11 Inventec Besta Co Ltd Detection method for voice activity endpoint
CN101625857B (en) * 2008-07-10 2012-05-09 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN101740024B (en) * 2008-11-19 2012-02-08 中国科学院自动化研究所 Method for automatic evaluation of spoken language fluency based on generalized fluency
CN102034475B (en) * 2010-12-08 2012-08-15 安徽科大讯飞信息科技股份有限公司 Method for interactively scoring open short conversation by using computer
CN103730110B (en) * 2012-10-10 2017-03-01 北京百度网讯科技有限公司 A kind of method and apparatus of detection sound end
CN103117060B (en) * 2013-01-18 2015-10-28 中国科学院声学研究所 For modeling method, the modeling of the acoustic model of speech recognition
JP5753869B2 (en) * 2013-03-26 2015-07-22 富士ソフト株式会社 Speech recognition terminal and speech recognition method using computer terminal
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104681036B (en) * 2014-11-20 2018-09-25 苏州驰声信息科技有限公司 A kind of detecting system and method for language audio
CN104409080B (en) * 2014-12-15 2018-09-18 北京国双科技有限公司 Sound end detecting method and device
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105374350B (en) * 2015-09-29 2017-05-17 百度在线网络技术(北京)有限公司 Speech marking method and device
CN105206258B (en) * 2015-10-19 2018-05-04 百度在线网络技术(北京)有限公司 The generation method and device and phoneme synthesizing method and device of acoustic model
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
CN105976810B (en) * 2016-04-28 2020-08-14 Tcl科技集团股份有限公司 Method and device for detecting end point of effective speech segment of voice
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A kind of sound end detecting method and audio recognition method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium

Also Published As

Publication number Publication date
CN108428448A (en) 2018-08-21
WO2018145584A1 (en) 2018-08-16
TWI659409B (en) 2019-05-11

Similar Documents

Publication Publication Date Title
TWI659409B (en) Speech point detection method and speech recognition method
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
WO2017084360A1 (en) Method and system for speech recognition
Van Segbroeck et al. A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice.
WO2020029404A1 (en) Speech processing method and device, computer device and readable storage medium
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
US20140337024A1 (en) Method and system for speech command detection, and information processing system
WO2014153800A1 (en) Voice recognition system
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
JP5051882B2 (en) Voice dialogue apparatus, voice dialogue method, and robot apparatus
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
WO2014173325A1 (en) Gutturophony recognition method and device
CN112509568A (en) Voice awakening method and device
CN111883181A (en) Audio detection method and device, storage medium and electronic device
JP2007288242A (en) Operator evaluation method, device, operator evaluation program, and recording medium
CN111429919B (en) Crosstalk prevention method based on conference real recording system, electronic device and storage medium
TWI299855B (en) Detection method for voice activity endpoint
JP5342629B2 (en) Male and female voice identification method, male and female voice identification device, and program
CN115331670B (en) Off-line voice remote controller for household appliances
JP2000250593A (en) Device and method for speaker recognition
Sudhakar et al. Automatic speech segmentation to improve speech synthesis performance
CN107039046B (en) Voice sound effect mode detection method based on feature fusion