TW201828279A - Voice recognition method and device capable of precisely recognizing information related to the client end when recognizing a voice signal sampled from the user end

Info

Publication number: TW201828279A
Application number: TW106102245A
Authority: TW (Taiwan)
Prior art keywords: word, sequence, preset, speech recognition, verified
Other languages: Chinese (zh)
Other versions: TWI731921B (en)
Inventors: 李曉輝, 李宏言
Original Assignee: 阿里巴巴集團服務有限公司
Priority: TW106102245A
Publication of application: TW201828279A
Application granted; grant publication: TWI731921B

Abstract

The present invention discloses a voice recognition method, which includes: using preset voice knowledge sources to generate a search space, containing client preset information, for decoding voice signals; extracting the feature vector sequence of the voice signal to be recognized; calculating the probabilities that the feature vectors correspond to basic units of the search space; and performing a decoding operation in the search space with the probabilities as input, to obtain a word sequence corresponding to the feature vector sequence. The present invention further provides a voice recognition device, as well as another voice recognition method and device. According to the method provided by the present invention, because the client preset information is included when generating the search space used for decoding, information related to the client can be recognized relatively accurately when recognizing voice signals collected at the client, thereby increasing the accuracy of voice recognition and improving the user experience.

Description

Speech recognition method and device

The present application relates to speech recognition technology, and in particular to a speech recognition method and apparatus. The present application also relates to another speech recognition method and apparatus.

Speech is the acoustic expression of language. It is the most natural, effective, and convenient means for humans to exchange information, and a support for human thinking. Automatic Speech Recognition (ASR) generally refers to the process by which a computer or similar device converts spoken language into corresponding output text or commands through the recognition and understanding of the speech. Its core framework is as follows: on the basis of statistical modeling, given the feature sequence O extracted from the speech signal to be recognized, the following Bayesian decision criterion is used to solve for the optimal word sequence W* corresponding to the speech signal to be recognized: W* = argmax_W P(O|W)P(W).
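Written with the maximization variable explicit, the criterion separates into an acoustic model term and a language model term (a standard reading of the formula above, added here for clarity rather than taken from the patent text):

```latex
W^{*} \;=\; \arg\max_{W}\; \underbrace{P(O \mid W)}_{\text{acoustic model}} \;\cdot\; \underbrace{P(W)}_{\text{language model}}
```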

In a specific implementation, the process of solving for the optimal word sequence is called the decoding process (a module implementing the decoding function is generally called a decoder); that is, the optimal word sequence given by the above formula is searched for in a search space composed of multiple knowledge sources such as a pronunciation dictionary and a language model.

With the development of technology, the computing power and storage capacity of hardware have greatly improved. Speech recognition systems have gradually been applied in industry, and various applications using voice as a human-computer interaction medium have appeared on client devices. For example, with a phone-dialing application on a smartphone, the user only needs to give a voice instruction (such as "call Zhang San") for the call to be placed automatically.

Current speech recognition applications usually adopt one of two modes. The first is a client-server mode: the client collects speech and uploads it to a server via the network; the server decodes the speech into text and returns the text to the client. This mode is used because the computing power of the client is relatively weak and its memory is limited, while the server has clear advantages in both respects; however, in this mode, the client cannot perform the speech recognition function without a network access environment. To address this problem, a second speech recognition mode has appeared that relies only on the client: the models and search space originally stored on the server are reduced in scale and placed locally on the client device, and the client itself performs the speech collection and decoding operations.

In practical applications, whether in the first or the second mode, speech recognition using the above general framework usually cannot effectively recognize content in the speech signal that relates to information local to the client device, for example contact names in the address book. This leads to low recognition accuracy, inconveniences the user, and degrades the user experience.

The embodiments of the present application provide a speech recognition method and apparatus to solve the problem that existing speech recognition technology has low recognition accuracy for information local to the client. The embodiments of the present application also provide another speech recognition method and apparatus.

The present application provides a speech recognition method, including: using preset speech knowledge sources to generate a search space, containing client preset information, for decoding a speech signal; extracting the feature vector sequence of the speech signal to be recognized; calculating the probabilities that the feature vectors correspond to basic units of the search space; and performing a decoding operation in the search space with the probabilities as input, to obtain a word sequence corresponding to the feature vector sequence.

Optionally, the search space includes: a weighted finite state transducer.

Optionally, the basic units of the search space include: context-dependent triphones; the preset knowledge sources include: a pronunciation dictionary, a language model, and a triphone state-tying table.

Optionally, using the preset speech knowledge sources to generate the search space, containing the client preset information, for decoding the speech signal includes: adding, by way of label replacement, client preset information corresponding to a preset topic category to a pre-generated weighted finite state transducer based at least on a language model, and obtaining a single weighted finite state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model; wherein the language model is pre-trained in the following manner: the preset named entities in the text used for training the language model are replaced with labels corresponding to the preset topic category, and the language model is trained with that text.

Optionally, adding, by way of label replacement, the client preset information corresponding to the preset topic category to the pre-generated weighted finite state transducer based at least on a language model, and obtaining the single weighted finite state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model, includes: adding, by way of label replacement, the client preset information corresponding to the preset topic category to the pre-generated language-model-based weighted finite state transducer; and composing the weighted finite state transducer to which the client preset information has been added with a pre-generated weighted finite state transducer based on the triphone state-tying table and the pronunciation dictionary, to obtain the single weighted finite state transducer.

Optionally, the text used for training the language model refers to text directed to the preset topic category.

Optionally, there are at least two preset topic categories; the number of language models and the number of weighted finite state transducers based at least on a language model each equal the number of preset topic categories; and adding, by way of label replacement, the client preset information corresponding to the preset topic category to the pre-generated weighted finite state transducer based at least on a language model includes: determining the preset topic category to which the speech signal to be recognized belongs; selecting the pre-generated weighted finite state transducer, based at least on a language model, corresponding to that preset topic category; and adding client preset information to the selected weighted finite state transducer by replacing the corresponding labels with the client preset information corresponding to that preset topic category.

Optionally, determining the preset topic category to which the speech signal to be recognized belongs is implemented in the following manner: determining the preset topic category according to the type of client collecting the speech signal, or according to the application program.

Optionally, the preset topic categories include: making a call or sending a text message; playing a piece of music; or setting instructions. The corresponding client preset information includes: contact names in the address book; names of pieces in the music library; or instructions in the instruction set.

Optionally, the composing operation includes: composing using a prediction-based method.

Optionally, the vocabulary used for pre-training the language model is consistent with the words contained in the pronunciation dictionary.

Optionally, calculating the probabilities that the feature vectors correspond to basic units of the search space includes: using a pre-trained DNN model to calculate the probabilities that a feature vector corresponds to the triphone states; and, based on the probabilities that the feature vector corresponds to the triphone states, using a pre-trained HMM model to calculate the probabilities that the feature vector corresponds to the triphones.

Optionally, the execution speed of the step of using the pre-trained DNN model to calculate the probabilities that the feature vector corresponds to the triphone states is increased in the following manner: using the data-parallel processing capability provided by the hardware platform.

Optionally, extracting the feature vector sequence of the speech signal to be recognized includes: framing the speech signal to be recognized according to a preset frame length to obtain a plurality of audio frames; and extracting a feature vector from each audio frame to obtain the feature vector sequence.

Optionally, extracting the feature vector of each audio frame includes: extracting MFCC features, PLP features, or LPC features.
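By way of illustration of the framing and feature extraction described in the two preceding optional features, the following is a minimal sketch (the 25 ms frame length, 10 ms frame shift, and NumPy-based implementation are assumptions for illustration, not part of the claims):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D speech signal into overlapping frames of a preset length."""
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # samples per frame shift
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    # Each row is one audio frame; an MFCC/PLP/LPC extractor would then map
    # each frame to a feature vector, yielding the feature vector sequence.
    return frames
```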

Optionally, after the word sequence corresponding to the feature vector sequence is obtained, the following operation is performed: verifying the accuracy of the word sequence through text matching against the client preset information, and generating a corresponding speech recognition result according to the verification result.

Optionally, verifying the accuracy of the word sequence through text matching against the client preset information and obtaining a corresponding speech recognition result according to the verification result includes: selecting, from the word sequence, a word to be verified that corresponds to the client preset information; looking up the word to be verified in the client preset information; if it is found, determining that the accuracy verification is passed, and taking the word sequence as the speech recognition result; otherwise, correcting the word sequence through pinyin-based fuzzy matching, and taking the corrected word sequence as the speech recognition result.

Optionally, correcting the word sequence through pinyin-based fuzzy matching includes: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client preset information into a comparison pinyin sequence; calculating in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client preset information the top-ranked word when sorted by similarity from high to low; and replacing the word to be verified in the word sequence with the selected word, to obtain the corrected word sequence.

Optionally, the similarity includes: similarity calculated based on edit distance.
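A minimal sketch of the pinyin-based fuzzy matching described in the preceding paragraphs, using an edit-distance-based similarity; the `to_pinyin` helper that maps a word to its pinyin syllable sequence is a hypothetical stand-in for a pronunciation lexicon:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, pinyin syllable lists)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def similarity(a, b):
    """Normalize edit distance into a similarity score in [0, 1]."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

def correct_word(word_to_verify, preset_words, to_pinyin):
    """Pick the preset word whose pinyin best matches the word to be verified.

    to_pinyin is an assumed helper, e.g. to_pinyin("張三") -> ["zhang", "san"].
    """
    target = to_pinyin(word_to_verify)
    ranked = sorted(preset_words,
                    key=lambda w: similarity(target, to_pinyin(w)),
                    reverse=True)
    return ranked[0]  # the top-ranked word replaces the word to be verified
```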

Optionally, the method is implemented on a client device; the client device includes: a smart mobile terminal, a smart speaker, or a robot.

Correspondingly, the present application further provides a speech recognition apparatus, including: a search space generation unit, configured to use preset speech knowledge sources to generate a search space, containing client preset information, for decoding a speech signal; a feature vector extraction unit, configured to extract the feature vector sequence of the speech signal to be recognized; a probability calculation unit, configured to calculate the probabilities that the feature vectors correspond to basic units of the search space; and a decoding search unit, configured to perform a decoding operation in the search space with the probabilities as input, to obtain a word sequence corresponding to the feature vector sequence.

Optionally, the search space generation unit is specifically configured to add, by way of label replacement, client preset information corresponding to a preset topic category to a pre-generated weighted finite state transducer based at least on a language model, and to obtain a single weighted finite state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model; the language model is pre-generated by a language model training unit, which is configured to replace the preset named entities in the text used for training the language model with labels corresponding to the preset topic category, and to train the language model with that text.

Optionally, the search space generation unit includes: a first client information adding subunit, configured to add, by way of label replacement, client preset information corresponding to the preset topic category to the pre-generated language-model-based weighted finite state transducer; and a weighted finite state transducer composition subunit, configured to compose the weighted finite state transducer to which the client preset information has been added with a pre-generated weighted finite state transducer based on the triphone state-tying table and the pronunciation dictionary, to obtain the single weighted finite state transducer.

Optionally, the decoding space generation unit includes: a second client information adding subunit, configured to add, by way of label replacement, client preset information corresponding to the preset topic category to the pre-generated weighted finite state transducer based at least on a language model; and a unified weighted finite state transducer acquisition subunit, configured to obtain, after the second client information adding subunit completes the adding operation, a single weighted finite state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model; wherein the second client information adding subunit includes: a topic determination subunit, configured to determine the preset topic category to which the speech signal to be recognized belongs; a weighted finite state transducer selection subunit, configured to select the pre-generated weighted finite state transducer, based at least on a language model, corresponding to the preset topic category; and a label replacement subunit, configured to add client preset information to the selected weighted finite state transducer by replacing the corresponding labels with the client preset information corresponding to the preset topic category.

Optionally, the topic determination subunit is specifically configured to determine the preset topic category according to the type of client collecting the speech signal, or according to the application program.

Optionally, the weighted finite state transducer composition subunit is specifically configured to perform the composition operation using a prediction-based method, and to obtain the single weighted finite state transducer.

Optionally, the probability calculation unit includes: a triphone state probability calculation subunit, configured to use a pre-trained DNN model to calculate the probabilities that a feature vector corresponds to the triphone states; and a triphone probability calculation subunit, configured to use a pre-trained HMM model to calculate the probabilities that the feature vector corresponds to the triphones, based on the probabilities that the feature vector corresponds to the triphone states.

Optionally, the feature vector extraction unit includes: a framing subunit, configured to frame the speech signal to be recognized according to a preset frame length to obtain a plurality of audio frames; and a feature extraction subunit, configured to extract a feature vector from each audio frame to obtain the feature vector sequence.

Optionally, the apparatus includes: an accuracy verification unit, configured to verify, after the decoding search unit obtains the word sequence corresponding to the feature vector sequence, the accuracy of the word sequence through text matching against the client preset information, and to generate a corresponding speech recognition result according to the verification result.

Optionally, the accuracy verification unit includes: a to-be-verified word selection subunit, configured to select, from the word sequence, a word to be verified that corresponds to the client preset information; a lookup subunit, configured to look up the word to be verified in the client preset information; a recognition result confirmation subunit, configured to determine, after the lookup subunit finds the word to be verified, that the accuracy verification is passed, and to take the word sequence as the speech recognition result; and a recognition result correction subunit, configured to correct the word sequence through pinyin-based fuzzy matching when the lookup subunit does not find the word to be verified, and to take the corrected word sequence as the speech recognition result.

Optionally, the recognition result correction subunit includes: a to-be-verified pinyin sequence conversion subunit, configured to convert the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit, configured to convert each word in the client preset information into a comparison pinyin sequence; a similarity calculation and selection subunit, configured to calculate in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and to select from the client preset information the top-ranked word when sorted by similarity from high to low; and a to-be-verified word replacement subunit, configured to replace the word to be verified in the word sequence with the selected word, to obtain the corrected word sequence.

In addition, the present application further provides another speech recognition method, including: obtaining, through decoding, a word sequence corresponding to the speech signal to be recognized; and verifying the accuracy of the word sequence through text matching against client preset information, and generating a corresponding speech recognition result according to the verification result.

Optionally, verifying the accuracy of the word sequence through text matching against the client preset information and generating a corresponding speech recognition result according to the verification result includes: selecting, from the word sequence, a word to be verified that corresponds to the client preset information; looking up the word to be verified in the client preset information; if it is found, determining that the accuracy verification is passed, and taking the word sequence as the speech recognition result; otherwise, correcting the word sequence through pinyin-based fuzzy matching, and taking the corrected word sequence as the speech recognition result.

Optionally, correcting the word sequence through pinyin-based fuzzy matching includes: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client preset information into a comparison pinyin sequence; calculating in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client preset information the top-ranked word when sorted by similarity from high to low; and replacing the word to be verified in the word sequence with the selected word, to obtain the corrected word sequence.

Correspondingly, the present application further provides another speech recognition apparatus, including: a word sequence acquisition unit, configured to obtain, through decoding, a word sequence corresponding to the speech signal to be recognized; and a word sequence verification unit, configured to verify the accuracy of the word sequence through text matching against client preset information, and to generate a corresponding speech recognition result according to the verification result.

Optionally, the word sequence verification unit includes: a to-be-verified word selection subunit, configured to select, from the word sequence, a word to be verified that corresponds to the client preset information; a lookup subunit, configured to look up the word to be verified in the client preset information; a recognition result confirmation subunit, configured to determine, after the lookup subunit finds the word to be verified, that the accuracy verification is passed, and to take the word sequence as the speech recognition result; and a recognition result correction subunit, configured to correct the word sequence through pinyin-based fuzzy matching when the lookup subunit does not find the word to be verified, and to take the corrected word sequence as the speech recognition result.

Optionally, the recognition result correction subunit includes: a to-be-verified pinyin sequence conversion subunit, configured to convert the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit, configured to convert each word in the client preset information into a comparison pinyin sequence; a similarity calculation and selection subunit, configured to calculate in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and to select from the client preset information the top-ranked word when sorted by similarity from high to low; and a to-be-verified word replacement subunit, configured to replace the word to be verified in the word sequence with the selected word, to obtain the corrected word sequence.

Compared with the prior art, the present application has the following advantages. In the speech recognition method provided by the present application, on the basis of using preset speech knowledge sources to generate a search space, containing client preset information, for decoding a speech signal, the probabilities that the feature vectors extracted from the speech signal to be recognized correspond to basic units of the search space are calculated, and a decoding operation is performed in the search space according to those probabilities, thereby obtaining the word sequence corresponding to the speech signal to be recognized. Because the client preset information is included when generating the search space used for decoding, the above method can recognize information related to the client relatively accurately when recognizing speech signals collected at the client, thereby improving the accuracy of speech recognition and the user experience.

901‧‧‧Search space generation unit

902‧‧‧Feature vector extraction unit

903‧‧‧Probability calculation unit

904‧‧‧Decoding search unit

1101‧‧‧Word sequence acquisition unit

1102‧‧‧Word sequence verification unit

FIG. 1 is a flowchart of an embodiment of a speech recognition method of the present application;
FIG. 2 is a flowchart of the process, provided by an embodiment of the present application, of generating a search space, containing client preset information, for decoding a speech signal;
FIG. 3 is a schematic diagram of a G-structure WFST before a replacement operation is performed, provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the G-structure WFST after the replacement operation is performed, provided by an embodiment of the present application;
FIG. 5 is a flowchart of the process, provided by an embodiment of the present application, of extracting the feature vector sequence of a speech signal to be recognized;
FIG. 6 is a flowchart of the process, provided by an embodiment of the present application, of calculating the probabilities that feature vectors correspond to the triphones;
FIG. 7 is a flowchart of the process, provided by an embodiment of the present application, of verifying the accuracy of a word sequence through text matching and generating a corresponding speech recognition result according to the verification result;
FIG. 8 is an overall framework diagram of speech recognition provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of a speech recognition apparatus of the present application;
FIG. 10 is a flowchart of another embodiment of a speech recognition method of the present application;
FIG. 11 is a schematic diagram of an embodiment of another speech recognition apparatus of the present application.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the substance of the present application. Therefore, the present application is not limited by the specific embodiments disclosed below.

In the present application, a speech recognition method and corresponding apparatus, as well as another speech recognition method and corresponding apparatus, are provided, and each is described in detail in the following embodiments. For ease of understanding, before the embodiments are described, the technical solution of the present application, the related technical terms, and the manner in which the embodiments are written are briefly explained.

The speech recognition method provided by the present application is generally applied in applications that use voice as a human-computer interaction medium. Such applications recognize a collected speech signal as text and then perform corresponding operations according to the text; the speech signal usually involves preset information local to the client (for example, contact names in the address book). Existing speech recognition technology decodes such speech signals using a generic search space, which does not take into account the differences of such applications across clients, and thus usually cannot effectively recognize content related to information local to the client, resulting in low recognition accuracy. To address this problem, the technical solution of the present application incorporates client preset information into the process of constructing the search space used for decoding speech signals, which amounts to customizing for the specific speech recognition needs of the client, so that local information related to the client can be effectively recognized and the accuracy of speech recognition improved.

In a speech recognition system, the process of obtaining the word sequence that best matches the speech signal to be recognized is called decoding. The search space for decoding a speech signal described in the present application refers to the space of all possible speech recognition results covered by the speech knowledge sources involved in the speech recognition system (for example, the acoustic model, the pronunciation dictionary, and the language model). Correspondingly, the decoding process is the process of searching and matching in the search space for the speech signal to be recognized, to obtain the best-matching word sequence.

The search space can take various forms. It can consist of relatively independent search spaces at different levels, one per knowledge source, in which case decoding is a level-by-level search; or it can be a search space based on a weighted finite state transducer (WFST), in which the various knowledge sources are organically integrated into a unified WFST network (also called a WFST search space). Because the latter facilitates the introduction of different knowledge sources and can improve search efficiency, it is the preferred way for the technical solution of the present application to perform speech recognition; the embodiments provided in the present application therefore focus on WFST-network-based implementations.

The core of the WFST search space is to use weighted finite state transducers to model the grammatical structure of the language and the related acoustic properties. The specific procedure is: represent the knowledge sources at different levels each in WFST form, and then use the properties of WFSTs and composition algorithms to integrate these WFSTs at different levels into a single WFST network, which constitutes the search space used for speech recognition.

The basic unit of the WFST network (that is, the basic input unit that drives the WFST through state transitions) can be chosen according to specific needs. Considering the influence of phonetic context on phoneme pronunciation, in order to obtain higher recognition accuracy, the embodiments provided in the present application use the context-dependent triphone (abbreviated as triphone) as the basic unit of the WFST network. The corresponding knowledge sources for constructing the WFST search space include: the triphone state-tying table, the pronunciation dictionary, and the language model.

The triphone state-tying table generally contains the binding relationships among triphones based on their pronunciation characteristics. When training an acoustic model with triphones as the modeling unit, the number of possible triphone combinations is very large; in order to reduce the demands on training data, different triphones can be clustered under the maximum likelihood criterion using a decision-tree clustering method based on pronunciation characteristics, and triphones with the same pronunciation characteristics can be tied together for parameter sharing, yielding the triphone state-tying table. The pronunciation dictionary generally contains the correspondence between phonemes and words; it is the bridge between the acoustic layer (the physical layer) and the semantic layer, coupling the content of the acoustic layer with the content of the semantic layer. The language model provides knowledge about linguistic structure and is used to calculate the probability that a word sequence occurs in natural language. In specific implementations an n-gram language model is usually used, which can be built by statistically modeling the likelihood of words following one another.
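For reference, the n-gram language model mentioned above factorizes the probability of a word sequence as follows (standard background; the notation is supplied here, not defined in the patent):

```latex
P(W) \;=\; P(w_1, w_2, \ldots, w_m) \;\approx\; \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right)
```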

When a WFST network constructed from the above knowledge sources is used for speech recognition, in order to drive the search performed by the WFST, the feature vector sequence of the speech signal to be recognized can first be extracted; pre-trained models are then used to calculate the probabilities that the feature vectors correspond to the triphones, and according to these probabilities a decoding operation is performed in the WFST search space to obtain the word sequence corresponding to the speech signal to be recognized.

It should be noted that the embodiments provided in the present application use the context-dependent triphone as the basic unit of the WFST network; in other implementations, other speech units can also be used as the basic unit of the WFST network, for example monophones or triphone states. Different basic units lead to certain differences when constructing the search space and when calculating probabilities from the feature vectors. For example, if the triphone state is used as the basic unit, an HMM-based acoustic model can be fused into the WFST network when it is constructed, and during speech recognition the probabilities that the feature vectors correspond to the triphone states can be calculated. These are all variations of the specific implementation; as long as client preset information is included in the process of constructing the search space, the technical solution of the present application can likewise be implemented without departing from its technical core, and all such variations fall within the protection scope of the present application.

The embodiments of the present application are described in detail below. Please refer to FIG. 1, which is a flowchart of an embodiment of a speech recognition method of the present application. The method includes steps 101 to 104. In a specific implementation, in order to improve execution efficiency, the related preparation work can usually be completed before step 101 (this stage can also be called the preparation stage): generating the class-based language models, the WFSTs of preset structures, and the acoustic models used for speech recognition, thereby preparing for the execution of step 101. The preparation stage is described in detail first.

In the preparation stage, the language model can be trained in the following manner: replace the preset named entities in the text used for training the language model with labels corresponding to the preset topic categories, and train the language model with that text. A named entity generally refers to an entity of a specific category in the text, for example: person names, song names, organization names, place names, and so on.

The following takes a phone-dialing application as an example. The preset topic category is making a call, the corresponding label is "$CONTACT", and the preset named entity is a person name. When pre-training the language model, person names in the training text can be replaced with the corresponding label; for example, "小明" (Xiao Ming) in "我要打電話給小明" ("I want to call Xiao Ming") is replaced with "$CONTACT", and the resulting training text is "我要打電話給$CONTACT" ("I want to call $CONTACT"). Training on the text after this entity replacement yields a class-based language model. On the basis of the language model obtained by this training, a WFST based on the language model, hereinafter referred to as a G-structure WFST, can also be generated in advance.
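A minimal sketch of this entity replacement preprocessing (the entity lists and the function are illustrative assumptions; a production system would typically use a named entity recognizer rather than fixed lists):

```python
import re

# Illustrative per-category entity lists (assumed for this example).
CLASS_ENTITIES = {
    "$CONTACT": ["小明", "張三", "李四"],  # person names, for the calling category
    "$SONG": ["義勇軍進行曲"],             # piece names, for the music category
}

def replace_entities(sentence, class_entities=CLASS_ENTITIES):
    """Replace preset named entities in a training sentence with class labels."""
    for label, entities in class_entities.items():
        pattern = "|".join(map(re.escape, entities))
        sentence = re.sub(pattern, lambda _m: label, sentence)
    return sentence

print(replace_entities("我要打電話給小明"))  # -> 我要打電話給$CONTACT
```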

Preferably, in order to reduce the size of the language model and of the corresponding G-structure WFST, text directed to the preset topic category (which can also be called class-based training text) can be selected when pre-training the language model. For example, if the preset topic category is making a call, text directed to that category may include: "I want to call Xiao Ming", "give Xiao Ming a call", and so on.

Considering the diversity of client devices and of applications that use voice as a human-computer interaction medium, two or more topic categories can be preset, and for each topic category a class-based language model can be pre-trained and a G-structure WFST based on that language model can be constructed.

In the preparation stage, a WFST based on the pronunciation dictionary (referred to as an L-structure WFST) and a WFST based on the triphone state-tying table (referred to as a C-structure WFST) can also be constructed in advance, and the above WFSTs can be composed in a targeted, selective manner according to a preset scheme. For example: the C-structure and L-structure WFSTs can be composed into a CL-structure WFST; the L-structure and G-structure WFSTs can be composed into an LG-structure WFST; or the C-structure, L-structure, and G-structure WFSTs can be composed into a CLG-structure WFST. In this embodiment, a CL-structure WFST and a G-structure WFST are generated in the preparation stage (for a description of the composition operation, see the relevant text under step 101).

In addition, the acoustic models used for speech recognition can be pre-trained in the preparation stage. In this embodiment, each triphone is characterized by an HMM (Hidden Markov Model), whose hidden states are the states of the triphone (each triphone usually contains three states). A GMM (Gaussian Mixture Model) is used to determine the emission probability of each feature vector for each hidden state of the HMM. With feature vectors extracted from a large amount of speech data as training samples, the parameters of the GMM and HMM models are learned with the Baum-Welch algorithm, yielding a GMM model corresponding to each state and an HMM model corresponding to each triphone. In subsequent step 103, the pre-trained GMM and HMM models can then be used to calculate the probabilities that the feature vectors correspond to the triphones.
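The GMM emission probability referred to here has the usual mixture form (standard background; M denotes the number of mixture components of state j, and the notation is supplied here rather than by the patent):

```latex
b_{j}(\mathbf{o}) \;=\; \sum_{m=1}^{M} c_{jm}\,\mathcal{N}\!\left(\mathbf{o};\,\boldsymbol{\mu}_{jm},\,\boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} c_{jm} = 1
```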

To improve the accuracy of speech recognition, this embodiment replaces the GMM model with a DNN (Deep Neural Network) model when performing speech recognition. Correspondingly, a DNN model that outputs the probabilities of the triphone states given an input feature vector can be pre-trained in the preparation stage. In a specific implementation, on the basis of the trained GMM and HMM models, labels corresponding to the triphone states can be added to the training samples through forced alignment, and the DNN model is then trained with the labeled training samples.

It should be noted that, in a specific implementation, because the computation in the preparation stage is relatively heavy and the requirements on memory and computation speed are relatively high, the operations of the preparation stage are usually completed on the server side. In order to still be able to perform the speech recognition function without a network access environment, the method provided by the present application is usually implemented on the client device; therefore, the WFSTs generated in the preparation stage and the models used for acoustic probability calculation can be pre-installed on the client device, for example packaged together with the application and installed on the client.

This concludes the detailed description of the preparation stage involved in this embodiment. The specific steps 101 to 104 of this embodiment are described in detail below.

Step 101: using preset speech knowledge sources, generate a search space, containing client preset information, for decoding a speech signal.

This step constructs the WFST search space in preparation for subsequent speech recognition. In a specific implementation, this step is usually executed during the startup stage (also called the initialization stage) of a client application that uses voice as a human-computer interaction medium: by way of label replacement, client preset information corresponding to the preset topic category is added to the pre-generated weighted finite state transducer based at least on a language model, and a single weighted finite state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model is obtained.

The processing of this step can include the following steps 101-1 to 101-4, further described below with reference to FIG. 2.

Step 101-1: determine the preset topic category to which the speech signal to be recognized belongs.

In a specific implementation, the preset topic category can be determined according to the type of client collecting the speech signal, or according to the application program. The preset topic categories include: making a call, sending a text message, playing a piece of music, setting instructions, or topic categories related to other application scenarios. The client preset information corresponding to making a call or sending a text message includes: contact names in the address book; the client preset information corresponding to playing music includes: names of pieces in the music library; the client preset information corresponding to setting instructions includes: instructions in the instruction set. Topic categories related to other application scenarios can likewise correspond to the client preset information involved in those scenarios, which is not described here one by one.

For example: for a smartphone, the preset topic category of the speech signal to be recognized can be determined according to the client type as making a call or sending a text message, and the corresponding client preset information is the contact names in the address book; for a smart speaker, the topic category can be determined as playing music, and the corresponding client preset information is the names of the pieces in the music library; for a robot, the topic category can be determined as setting instructions, and the corresponding client preset information is the instructions in the instruction set.

Considering that a client device can simultaneously have multiple applications that use voice as a human-computer interaction medium, different applications involve different client preset information. For example, a smartphone can also have a voice-interaction-based music player installed; in this case the preset topic category to which the speech signal to be recognized belongs can be determined according to the currently launched application.

Step 101-2: select the pre-generated G-structure WFST corresponding to the preset topic category.

When there are multiple preset topic categories, multiple G-structure WFSTs are usually generated in the preparation stage, each corresponding to a different preset topic category. This step selects, from the multiple pre-generated G-structure WFSTs, the G-structure WFST corresponding to the preset topic category determined in step 101-1.

Step 101-3: add client preset information to the selected G-structure WFST by replacing the corresponding labels with the client preset information corresponding to the preset topic category.

When the class-based language model for each preset topic category was trained in the preparation stage, the preset named entities in the training text were replaced with labels corresponding to the respective topic category; for example, for the category of making a call or sending a text message, person names in the training text were replaced with the "$CONTACT" label, and for the category of playing music, piece names in the training text were replaced with the "$SONG" label. Therefore, the generated G-structure WFST usually contains label information corresponding to the preset topic category. This step replaces the corresponding labels in the G-structure WFST selected in step 101-2 with the client preset information corresponding to the preset topic category determined in step 101-1, thereby adding client preset information to the selected G-structure WFST.

For example, if the topic category is making a call or sending a text message, the "$CONTACT" label in the G-structure WFST can be replaced with the contact names in the client's local address book, such as "張三" (Zhang San), "李四" (Li Si), and so on; if the topic category is playing music, the "$SONG" label in the G-structure WFST can be replaced with the song names in the client's local music library, such as "義勇軍進行曲" ("March of the Volunteers"). The specific replacement can be implemented by replacing the state transition link corresponding to the label with several groups of parallel state transition links. See the example of replacement with the contacts in the client address book given in FIG. 3 and FIG. 4, where FIG. 3 is a schematic diagram of the G-structure WFST before replacement, and FIG. 4 is a schematic diagram of the G-structure WFST obtained after replacement with "張三" and "李四" from the address book.
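A minimal sketch of this label replacement on a toy arc-list representation of a WFST (the data structure and the uniform weighting over the inserted words are illustrative assumptions; a real implementation would operate on WFST objects, for example via the replacement operation of the OpenFst library):

```python
import math

# A toy G-structure WFST as a list of arcs:
# (src_state, dst_state, input_label, output_label, weight).
arcs = [
    (0, 1, "我要打電話給", "我要打電話給", 0.0),
    (1, 2, "$CONTACT", "$CONTACT", 0.0),  # the class label arc, as in FIG. 3
]

def replace_label(arcs, label, words):
    """Replace every arc bearing `label` with parallel arcs, one per preset word."""
    extra = -math.log(1.0 / len(words))  # spread probability mass uniformly (assumed scheme)
    new_arcs = []
    for (src, dst, ilab, olab, w) in arcs:
        if ilab == label:
            # Several parallel state transition links, one per entry of the
            # client preset information, replace the single labeled link.
            new_arcs.extend((src, dst, word, word, w + extra) for word in words)
        else:
            new_arcs.append((src, dst, ilab, olab, w))
    return new_arcs

# Replacing "$CONTACT" with address-book contacts yields the structure of FIG. 4.
arcs = replace_label(arcs, "$CONTACT", ["張三", "李四"])
```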

Step 101-4: Compose the G-structure WFST to which the client-side preset information has been added with the pre-generated CL-structure WFST to obtain a single WFST network.

In this embodiment, the knowledge sources used for speech recognition cover content from the linguistic level (the language model) down to the physical level (the triphone state-tying table). The task of this step is to merge (also called compose or combine) the WFSTs of the different levels into a single WFST network.

For two WFSTs, the basic condition for composition is that the output symbols of one WFST form a subset of the input symbol set of the other. Under this premise, if two WFSTs, say A and B, are composed into a new WFST C, then each state of C consists of a state of A and a state of B, and every successful path P of C consists of a successful path Pa of A and a successful path Pb of B, with input i[P]=i[Pa] and output o[P]=o[Pb]; its weight is obtained by the corresponding operation on the weights of Pa and Pb. The resulting C combines the finite-state transducer characteristics and the search spaces shared by A and B. In a concrete implementation, the composition of two WFSTs can be performed with the composition algorithm provided by the OpenFst library, as sketched below.
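As a hedged sketch of this step, assuming the OpenFst Python wrapper pywrapfst is available and that the CL and G transducers have been written to the placeholder files cl.fst and g.fst:

    import pywrapfst as fst

    cl = fst.Fst.read("cl.fst")  # CL: triphones in, words out (via phones)
    g = fst.Fst.read("g.fst")    # G: grammar/language model with preset info

    # compose() requires the matching side to be arc-sorted; here the
    # output labels of CL are matched against the input labels of G.
    cl.arcsort(sort_type="olabel")
    clg = fst.compose(cl, g)     # single network: triphones -> word sequences
    clg.write("clg.fst")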

Specifically, in this embodiment, the L-structure WFST can be viewed as the correspondence between monophones and words, while the C-structure WFST establishes the correspondence between triphones and monophones; its output matches the input of the L-structure WFST, so the two can be composed. The CL-structure WFST has already been obtained by composition in the preparation phase of this embodiment. This step composes that CL-structure WFST with the G-structure WFST to which the client-side preset information was added in step 101-3, yielding a WFST network whose input is triphone probabilities and whose output is word sequences. The WFSTs at different levels, each corresponding to a different knowledge source, are thus integrated into a single WFST network that constitutes the search space for speech recognition.

Preferably, to speed up the composition of the CL-structure WFST and the G-structure WFST and reduce the time spent on initialization, this embodiment does not use the conventional WFST composition method when performing the composition operation, but a prediction-based composition method (the lookahead composition method). In lookahead composition, during the composition of the two WFSTs the future paths are predicted to judge whether the currently executed composition operation would lead to a non-coaccessible final state; if so, the current operation is blocked and no subsequent composition operations are performed for it. Terminating unnecessary composition operations early through prediction not only saves composition time but also shrinks the finally generated WFST and reduces its storage footprint. In a concrete implementation, the lookahead-capable filter provided by the OpenFst library can be used to realize this predictive filtering function.

Preferably, to speed up the composition of the CL-structure WFST and the G-structure WFST, the vocabulary used to pre-train the language model in this embodiment is kept consistent with the words contained in the pronunciation dictionary. In general, the number of words in the vocabulary is larger than the number of words in the pronunciation dictionary, and the vocabulary size directly determines the size of the G-structure WFST; if the G-structure WFST is large, composing it with the CL-structure WFST is time-consuming. Therefore, when training the language model in the preparation phase, this embodiment reduces the vocabulary so that its words match those in the pronunciation dictionary, thereby shortening the composition time of the CL-structure and G-structure WFSTs.

At this point, through steps 101-1 to 101-4, the initialization process of this technical solution is completed and a WFST search space containing the client-side preset information has been generated.

It should be noted that in this embodiment the composition of the CL-structure WFST is completed in advance in the preparation phase and the G-structure WFST is generated there as well; in step 101, the client-side preset information is added to the G-structure WFST, and the CL structure and the G structure are then composed into a single WFST. Other embodiments may adopt other composition strategies: for example, the LG-structure WFST may be composed in advance in the preparation phase, the client-side preset information added to it in step 101, and the result then composed with the C-structure WFST generated in the preparation phase; alternatively, the CLG-structure WFST may be composed directly in the preparation phase and the client-side preset information added to it in step 101. Since the WFSTs generated in the preparation phase occupy storage space on the client, in application scenarios with multiple class-based language models (and correspondingly multiple G-structure WFSTs), composing every G-structure WFST with the other WFSTs during preparation would occupy considerable storage. The composition scheme adopted in this embodiment is therefore the preferred implementation, as it reduces the client storage occupied by the WFSTs generated in the preparation phase.

Step 102: Extract the feature vector sequence of the speech signal to be recognized.

The speech signal to be recognized is usually a time-domain signal. This step obtains a feature vector sequence that characterizes the speech signal through two processes, framing and feature vector extraction, which are further described below with reference to Figure 5.

Step 102-1: Divide the speech signal to be recognized into frames according to a preset frame length, obtaining multiple audio frames.

In a concrete implementation, the frame length can be preset as required, for example 10 ms or 15 ms; the speech signal to be recognized is then segmented frame by frame according to this frame length, splitting it into multiple audio frames. Depending on the segmentation strategy adopted, adjacent audio frames may or may not overlap.
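A minimal framing sketch, assuming a 16 kHz sampling rate and 25 ms frames with a 10 ms hop (so adjacent frames overlap); all parameter values are illustrative:

    import numpy as np

    def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
        """Split a 1-D time-domain signal into overlapping frames."""
        frame_len = int(sample_rate * frame_ms / 1000)
        hop_len = int(sample_rate * hop_ms / 1000)
        n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
        return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                         for i in range(n_frames)])

    frames = frame_signal(np.random.randn(16000))  # 1 s of dummy audio
    print(frames.shape)  # (n_frames, frame_len)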

Step 102-2: Extract the feature vector of each audio frame to obtain the feature vector sequence.

After the speech signal to be recognized has been split into multiple audio frames, a feature vector characterizing the speech signal can be extracted frame by frame. Since the descriptive power of a speech signal in the time domain is relatively weak, a Fourier transform is usually applied to each audio frame and frequency-domain features are extracted as the frame's feature vector; for example, MFCC (Mel Frequency Cepstrum Coefficient) features, PLP (Perceptual Linear Predictive) features, or LPC (Linear Predictive Coding) features can be extracted.

The extraction of MFCC features for a single audio frame is taken as an example to further describe the feature extraction process. The time-domain signal of the audio frame is first passed through an FFT (Fast Fourier Transform) to obtain the corresponding spectrum; the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; and cepstral analysis is performed on the Mel spectrum, the core of which is usually an inverse transform implemented with the DCT (Discrete Cosine Transform). A preset number N of coefficients (for example N=12 or 38) is then kept, yielding the feature vector of the audio frame: the MFCC features. Processing every audio frame in this way produces a series of feature vectors characterizing the speech signal, i.e., the feature vector sequence described in this application.
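A minimal MFCC sketch following the steps above (FFT, Mel filter bank, logarithm, DCT); the filter-bank construction is simplified and the parameter values (n_fft, n_mels, n_mfcc) are illustrative assumptions:

    import numpy as np
    from scipy.fftpack import dct

    def mel_filterbank(n_mels, n_fft, sr):
        """Triangular Mel filters laid over the FFT bins (simplified)."""
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        pts = inv(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
        bins = np.floor((n_fft + 1) * pts / sr).astype(int)
        fb = np.zeros((n_mels, n_fft // 2 + 1))
        for i in range(1, n_mels + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        return fb

    def mfcc(frame, sr=16000, n_fft=512, n_mels=26, n_mfcc=12):
        spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # FFT
        mel_spec = mel_filterbank(n_mels, n_fft, sr) @ spectrum  # Mel filter bank
        return dct(np.log(mel_spec + 1e-10), norm="ortho")[:n_mfcc]  # cepstrum

    print(mfcc(np.random.randn(400)).shape)  # (12,): one feature vector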

Step 103: Calculate the probabilities of the feature vectors with respect to the basic units of the search space.

In this embodiment, the basic units of the WFST search space are triphones, so this step calculates the probability of each feature vector with respect to each triphone. To improve the accuracy of speech recognition, this embodiment calculates these probabilities with an HMM model and a DNN model possessing strong feature-extraction capability. Other embodiments may use other approaches; for example, calculating the probabilities with the traditional GMM and HMM models can equally realize the technical solution of this application and also falls within its scope of protection.

In a concrete implementation, the probability of a feature vector with respect to each triphone can be computed on the basis of its probabilities with respect to the individual triphone states. The processing of this step is further described below with reference to Figure 6.

Step 103-1: Use a pre-trained DNN model to calculate the probabilities of the feature vectors with respect to the triphone states.

The DNN model has already been pre-trained in the preparation phase of this embodiment. Taking the feature vectors extracted in step 102 as the input of the DNN model yields the probability of each feature vector with respect to each triphone state. For example, if there are 1000 triphones and each triphone contains 3 states, there are 3000 triphone states in total, and in this step the DNN model outputs the probability of the feature vector with respect to each of the 3000 triphone states.
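A minimal sketch of this state-probability computation, a feed-forward network ending in a softmax over 3000 triphone states; the layer sizes and random weights are illustrative stand-ins for the pre-trained model:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((256, 12)) * 0.1, np.zeros(256)
    W2, b2 = rng.standard_normal((3000, 256)) * 0.1, np.zeros(3000)

    def dnn_state_probs(feature_vec):
        """Map one feature vector to a distribution over 3000 states."""
        h = np.maximum(W1 @ feature_vec + b1, 0.0)  # ReLU hidden layer
        logits = W2 @ h + b2
        e = np.exp(logits - logits.max())           # numerically stable softmax
        return e / e.sum()

    probs = dnn_state_probs(rng.standard_normal(12))
    print(probs.shape, probs.sum())  # (3000,) 1.0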

Preferably, since the amount of computation involved in the DNN model is usually large, this embodiment increases the speed of the DNN computation by exploiting the data-parallel processing capability provided by the hardware platform. For example, embedded and mobile devices currently mostly use ARM architecture platforms, and most current ARM platforms provide the NEON instruction set, a SIMD (single instruction, multiple data) instruction set that can process multiple data items in one instruction and thus offers a degree of data-parallel processing capability. In this embodiment, vectorized programming forms a single-instruction-stream, multiple-data-stream programming paradigm, so the data-parallel capability of the hardware platform can be fully exploited to accelerate the DNN computation.

When the technical solution of this application is implemented on a client device, the scale of the DNN model is usually reduced in order to match the hardware capability of the client, which often lowers the accuracy of the DNN model and, with it, its ability to discriminate different speech content. Because this embodiment exploits the hardware acceleration mechanism, the scale of the DNN model need not be reduced, or need be reduced only as little as possible, so the precision of the DNN model is preserved to the greatest extent and the recognition accuracy is improved.

Step 103-2: Based on the probabilities of the feature vectors with respect to the triphone states, use a pre-trained HMM model to calculate the probabilities of the feature vectors with respect to the triphones.

An HMM model for each triphone has already been trained in the preparation phase. In this step, based on the probabilities of several consecutively input feature vectors with respect to the triphone states, the HMM models are used to calculate the transition probabilities corresponding to each triphone, thereby obtaining the probabilities of the feature vectors with respect to the triphones.

This calculation is in effect the process of computing the corresponding transition probabilities from the propagation of consecutive feature vectors through the HMMs. The computation is further illustrated below by calculating the probability for one triphone (comprising 3 states), where pe(i,j) denotes the emission probability of the i-th frame's feature vector in the j-th state and pt(h,k) denotes the transition probability from state h to state k:
1) The feature vector of the first frame corresponds to state 1 of the HMM, with emission probability pe(1,1);
2) For the feature vector of the second frame, staying in state 1 of the HMM gives probability pe(1,1)*pt(1,1)*pe(2,1), while moving from state 1 to state 2 gives probability pe(1,1)*pt(1,2)*pe(2,2); whether to stay in state 1 or move to state 2 is decided by comparing these probabilities;
3) The feature vectors of the third and subsequent frames are handled in the same way until the model is exited from state 3, at which point the propagation through the HMM ends, yielding the probability of the consecutive frames' feature vectors for that HMM, i.e., the probability with respect to the triphone the HMM represents.
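A minimal sketch of the propagation just described, a Viterbi pass over one 3-state left-to-right triphone HMM; the emission and transition values are illustrative assumptions, since in practice they come from the DNN output and the trained HMM:

    import numpy as np

    pe = np.array([[0.8, 0.1, 0.1],   # per-frame emission probs per state
                   [0.3, 0.6, 0.1],
                   [0.1, 0.2, 0.7]])
    pt = np.array([[0.5, 0.5, 0.0],   # left-to-right transition probs
                   [0.0, 0.5, 0.5],
                   [0.0, 0.0, 1.0]])

    def hmm_score(pe, pt):
        """Best-path probability of the frames through the 3-state HMM."""
        score = np.zeros(3)
        score[0] = pe[0, 0]                   # frame 1 enters state 1
        for t in range(1, len(pe)):
            score = np.array([max(score[h] * pt[h, k] for h in range(3))
                              for k in range(3)]) * pe[t]
        return score[2]                       # exit from state 3

    print(hmm_score(pe, pt))  # probability for this triphone's HMM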

For the continuously input feature vectors, the transition probabilities of propagation through each HMM are computed in the above manner, yielding the probabilities with respect to each triphone.

Step 104: Using these probabilities as input, perform the decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence.

Based on the probabilities of the feature vectors with respect to the triphones output in step 103, the decoding operation is performed in the WFST network to obtain the word sequence corresponding to the feature vector sequence. This process is usually a graph-search procedure that finds the highest-scoring path. A commonly used search method is the Viterbi algorithm, whose advantages are that its dynamic-programming formulation saves computation and that it supports time-synchronous decoding.

Considering that in actual decoding the search space is huge and the computational load of the Viterbi algorithm is still large, the decoding process does not expand all possible continuations of every path; to reduce computation and increase speed, only the paths near the best path are expanded. That is, an appropriate pruning strategy can be applied during the Viterbi search to improve search efficiency, for example Viterbi beam search or a histogram pruning strategy.
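A minimal sketch of such time-synchronous Viterbi beam search over a toy arc-table WFST; the graph, the flat emission function standing in for the step-103 probabilities, and the beam width are all illustrative assumptions:

    import math

    # state -> list of (triphone id, output word, next state)
    arcs = {0: [(0, "打電話給", 1)],
            1: [(1, "張三", 2), (2, "李四", 2)]}

    def beam_decode(n_frames, emission, beam=5.0, start=0, final=2):
        """Time-synchronous Viterbi search with beam pruning."""
        tokens = {start: (0.0, [])}            # state -> (log prob, words)
        for t in range(n_frames):
            nxt = {}
            for state, (logp, words) in tokens.items():
                for tri_id, word, dst in arcs.get(state, []):
                    score = logp + math.log(emission(t, tri_id) + 1e-30)
                    if dst not in nxt or score > nxt[dst][0]:
                        nxt[dst] = (score, words + [word])
            if not nxt:
                break
            best = max(s for s, _ in nxt.values())
            # beam pruning: keep only paths near the best path
            tokens = {st: v for st, v in nxt.items() if v[0] > best - beam}
        cands = [v for st, v in tokens.items() if st == final]
        return max(cands, key=lambda v: v[0])[1] if cands else None

    print(beam_decode(2, lambda t, s: 0.5))  # ties keep the first hypothesis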

At this point, the word sequence corresponding to the feature vector sequence has been obtained through decoding, i.e., the recognition result corresponding to the speech signal to be recognized has been acquired. Because the client-side preset information was added when the search space for speech recognition was built in step 101, the above speech recognition process can usually identify speech content related to the client's local information fairly accurately.

Considering that the client's local information may be modified or deleted by the user, and to further guarantee the accuracy of the word sequence obtained through the above decoding process, this embodiment also provides a preferred implementation: the accuracy of the word sequence is verified by text matching against the client-side preset information, and the corresponding speech recognition result is generated according to the verification result.

In a concrete implementation, the above preferred implementation may comprise steps 104-1 to 104-4 listed below, which are further described with reference to Figure 7.

Step 104-1: Select from the word sequence the word to be verified that corresponds to the client-side preset information.

For example, for a phone-call application, the client-side preset information is the contact names in the address book. If the speech recognition result is the word sequence "給小明打電話" ("call Xiao Ming"), then template matching or a grammar-parsing process can determine that "小明" (Xiao Ming) in the word sequence is the word to be verified corresponding to the client-side preset information.

Step 104-2: Look up the word to be verified in the client-side preset information; if it is found, judge that the accuracy verification has passed and execute step 104-3; otherwise execute step 104-4.

This step judges whether the word to be verified belongs to the corresponding client-side preset information through exact matching at the text level, thereby verifying the accuracy of the word sequence.

Continuing the example of step 104-1, this step searches the client address book for the contact "小明", i.e., whether the address-book information related to contact names contains the string "小明". If it does, the accuracy verification is judged to have passed and execution continues with step 104-3; otherwise execution proceeds to step 104-4.
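A minimal sketch of this text-level exact match; the contact list is an illustrative assumption:

    contacts = ["張三", "李四", "小敏"]

    def verify_exact(word_to_verify, entries):
        """Step 104-2: exact text match against the client preset info."""
        return any(word_to_verify in entry for entry in entries)

    print(verify_exact("小明", contacts))  # False -> go to step 104-4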

Step 104-3: Take the word sequence as the speech recognition result.

Reaching this step means that the word to be verified contained in the decoded word sequence agrees with the client-side preset information; the word sequence can be output as the speech recognition result, triggering the application that uses this result to perform the corresponding operation.

Step 104-4: Correct the word sequence through pinyin-based fuzzy matching, and take the corrected word sequence as the speech recognition result.

Reaching this step usually means that the word to be verified contained in the decoded word sequence does not agree with the client-side preset information. If this word sequence were output as the speech recognition result, the related application would usually be unable to perform the correct operation; in this case, therefore, the necessary correction can be applied to the word sequence through fuzzy matching at the pinyin level.

In a concrete implementation, the above correction function can be realized as follows: by looking up the pronunciation dictionary, the word to be verified is converted into a pinyin sequence to be verified, and each word in the client-side preset information is likewise converted into a comparison pinyin sequence; the similarity between the pinyin sequence to be verified and each comparison pinyin sequence is then computed in turn, and the top-ranked word by similarity, from high to low, is selected from the client-side preset information; finally, the selected word replaces the word to be verified in the word sequence, yielding the corrected word sequence.

In a concrete implementation, the similarity between two pinyin sequences can be computed in different ways. This embodiment computes the similarity based on the edit distance, for example taking the reciprocal of the edit distance between the two pinyin sequences plus 1 as the similarity. The edit distance is the minimum number of edit operations required to transform one string into the other, where an edit operation is replacing one character with another, inserting a character, or deleting a character; in general, the smaller the edit distance, the greater the similarity of the two strings.

Still continuing the example of step 104-1, the word sequence is "給小明打電話" ("call Xiao Ming") and the word to be verified is "小明". If "小明" is not found among the contacts in the client address book, the pronunciation dictionary can be looked up to convert it into the pinyin sequence to be verified, "xiaoming", and each contact name in the address book is likewise converted into its pinyin sequence, i.e., a comparison pinyin sequence. The edit distance between "xiaoming" and each comparison pinyin sequence is then computed in turn, and the contact name corresponding to the comparison pinyin sequence with the smallest edit distance (highest similarity), for example "小敏" (Xiao Min) corresponding to "xiaomin", is selected to replace the word to be verified in the word sequence. This completes the correction of the word sequence, and the corrected word sequence can serve as the final speech recognition result.
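A minimal sketch of this pinyin-level fuzzy match, using the Levenshtein edit distance and the similarity 1/(edit distance + 1); the pinyin conversions are hard-coded illustrative assumptions standing in for the pronunciation dictionary lookup:

    def edit_distance(a, b):
        """Classic Levenshtein distance via dynamic programming."""
        d = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, cb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                       prev + (ca != cb))
        return d[-1]

    contacts = {"張三": "zhangsan", "李四": "lisi", "小敏": "xiaomin"}
    to_verify = "xiaoming"  # pinyin of the unmatched word "小明"

    scored = sorted(contacts.items(),
                    key=lambda kv: 1.0 / (edit_distance(to_verify, kv[1]) + 1),
                    reverse=True)
    print(scored[0][0])  # 小敏: replaces the word to be verified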

In a concrete implementation, it is also possible to first compute the similarities between the pinyin sequence to be verified and the comparison pinyin sequences, sort them from high to low, select the words corresponding to the several (for example, three) top-ranked comparison pinyin sequences, and present these words to the client user, for example via screen output; the user then selects the correct word, and the word to be verified in the word sequence is replaced according to the user's selection.

At this point, the specific implementation of the speech recognition method provided by this application has been described in detail through steps 101 to 104. For ease of understanding, refer to Figure 8, an overall framework diagram of the speech recognition process provided by this embodiment, in which the dashed box corresponds to the preparation phase described in this embodiment and the solid box corresponds to the specific speech recognition processing.

It should be noted that step 101 described in this embodiment may be executed each time a client application that uses speech as its interaction medium is launched, i.e., the search space containing the client-side preset information for decoding speech signals is regenerated on every launch; alternatively, the search space may be generated and stored only on the first launch of the client application and subsequently refreshed periodically. The latter reduces the time overhead of generating the search space on every application launch (the previously generated search space can be used directly), improves the execution efficiency of speech recognition, and improves the user experience.

In addition, the method provided by this application is usually implemented on a client device, including a smart mobile terminal, a smart speaker, a robot, or another device capable of running the method; this embodiment describes the specific implementation of the method on the client. In other implementations, however, the method provided by this application can also be implemented in application scenarios based on a client-server mode. In that case, the WFSTs generated in the preparation phase and the models used for acoustic probability computation need not be pre-installed on the client device; each time the client application starts, the corresponding client-side preset information can be uploaded to the server, and the subsequently captured speech signal to be recognized is also uploaded to the server. The server side carries out the method provided by this application and returns the decoded word sequence to the client. This can equally realize the technical solution of this application and obtain the corresponding beneficial effects.

In summary, because the speech recognition method provided by this application includes the client-side preset information when generating the search space for decoding speech signals, it can identify information locally relevant to the client relatively accurately when recognizing speech signals captured at the client, thereby improving the accuracy of speech recognition and the user experience.

In particular, when the method provided by this application is used for speech recognition on a client device, the addition of the client's local information can, to a certain extent, compensate for the drop in recognition accuracy caused by shrinking the probability computation models and the search space, so that the need for speech recognition without a network connection can be met while a certain recognition accuracy is still achieved. Furthermore, after the word sequence has been decoded, applying the text-level and pinyin-level match verification scheme given in this embodiment can further improve the accuracy of speech recognition. Actual test results show that the character error rate (CER) of conventional speech recognition methods is around 20%, while with the method provided by this application the character error rate is below 3%; these figures amply demonstrate that the beneficial effects of this method are significant.

The above embodiment provides a speech recognition method; correspondingly, this application also provides a speech recognition device. Refer to Figure 9, a schematic diagram of an embodiment of a speech recognition device of this application. Since the device embodiment is essentially similar to the method embodiment, it is described relatively simply; for relevant details, see the corresponding parts of the method embodiment. The device embodiment described below is merely illustrative.

A speech recognition device of this embodiment comprises: a search space generation unit 901 for generating, using preset speech knowledge sources, a search space containing client-side preset information for decoding speech signals; a feature vector extraction unit 902 for extracting the feature vector sequence of the speech signal to be recognized; a probability calculation unit 903 for calculating the probabilities of the feature vectors with respect to the basic units of the search space; and a decoding search unit 904 for performing, with these probabilities as input, a decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence.

Optionally, the search space generation unit is specifically configured to add, by tag replacement, client-side preset information corresponding to a preset topic category to a pre-generated weighted finite-state transducer based at least on the language model, and to obtain a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model. The language model is pre-generated by a language model training unit configured to replace the preset named entities in the text used for training the language model with tags corresponding to the preset topic category and to train the language model with that text.

Optionally, the search space generation unit comprises: a first client information adding subunit for adding, by tag replacement, the client-side preset information corresponding to the preset topic category to the pre-generated weighted finite-state transducer based on the language model; and a weighted finite-state transducer composition subunit for composing the weighted finite-state transducer to which the client-side preset information has been added with the pre-generated weighted finite-state transducer based on the triphone state-tying table and the pronunciation dictionary, to obtain the single weighted finite-state transducer.

Optionally, the decoding space generation unit comprises: a second client information adding subunit for adding, by tag replacement, the client-side preset information corresponding to the preset topic category to a pre-generated weighted finite-state transducer based at least on the language model; and a unified weighted finite-state transducer acquisition subunit for obtaining, after the second client information adding subunit completes the adding operation, a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model. The second client information adding subunit comprises: a topic determination subunit for determining the preset topic category to which the speech signal to be recognized belongs; a weighted finite-state transducer selection subunit for selecting the pre-generated weighted finite-state transducer based at least on the language model that corresponds to the preset topic category; and a tag replacement subunit for adding the client-side preset information to the selected weighted finite-state transducer by replacing the corresponding tags with the client-side preset information corresponding to the preset topic category.

Optionally, the topic determination subunit is specifically configured to determine the preset topic category according to the type of the client that captures the speech signal, or according to the application.

Optionally, the weighted finite-state transducer composition subunit is specifically configured to perform the composition operation using a prediction-based method to obtain the single weighted finite-state transducer.

Optionally, the probability calculation unit comprises: a triphone state probability calculation subunit for calculating, with a pre-trained DNN model, the probabilities of the feature vectors with respect to the triphone states; and a triphone probability calculation subunit for calculating, with a pre-trained HMM model and based on the probabilities of the feature vectors with respect to the triphone states, the probabilities of the feature vectors with respect to the triphones.

Optionally, the feature vector extraction unit comprises: a framing subunit for dividing the speech signal to be recognized into frames according to a preset frame length to obtain multiple audio frames; and a feature extraction subunit for extracting the feature vector of each audio frame to obtain the feature vector sequence.

Optionally, the device comprises: an accuracy verification unit for verifying, after the decoding search unit obtains the word sequence corresponding to the feature vector sequence, the accuracy of the word sequence through text matching against the client-side preset information, and for generating the corresponding speech recognition result according to the verification result.

Optionally, the accuracy verification unit comprises: a word-to-verify selection subunit for selecting from the word sequence the word to be verified that corresponds to the client-side preset information; a lookup subunit for looking up the word to be verified in the client-side preset information; a recognition result confirmation subunit for judging, after the lookup subunit finds the word to be verified, that the accuracy verification has passed and taking the word sequence as the speech recognition result; and a recognition result correction subunit for correcting, when the lookup subunit does not find the word to be verified, the word sequence through pinyin-based fuzzy matching and taking the corrected word sequence as the speech recognition result.

Optionally, the recognition result correction subunit comprises: a to-be-verified pinyin sequence conversion subunit for converting the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit for converting each word in the client-side preset information into a comparison pinyin sequence; a similarity calculation and selection subunit for computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence and selecting from the client-side preset information the top-ranked word by similarity from high to low; and a word-to-verify replacement subunit for replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.

In addition, this application provides another speech recognition method. Refer to Figure 10, a flowchart of an embodiment of another speech recognition method provided by this application. The parts of this embodiment identical to the previously provided method embodiment are not repeated; the differences are emphasized below. The other speech recognition method provided by this application comprises:

Step 1001: Obtain through decoding the word sequence corresponding to the speech signal to be recognized.

For speech recognition, decoding is the process of searching in the search space used for speech recognition to obtain the best word sequence corresponding to the speech signal to be recognized. The search space may be a WFST network based on various knowledge sources or a search space of another form; it may or may not contain client-side preset information, and this embodiment imposes no specific limitation in this regard.

Step 1002: Verify the accuracy of the word sequence through text matching against client-side preset information, and generate the corresponding speech recognition result according to the verification result.

This step comprises the following operations: selecting from the word sequence the word to be verified that corresponds to the client-side preset information; looking up the word to be verified in the client-side preset information; if it is found, judging that the accuracy verification has passed and taking the word sequence as the speech recognition result; otherwise, correcting the word sequence through pinyin-based fuzzy matching and taking the corrected word sequence as the speech recognition result.

Correcting the word sequence through pinyin-based fuzzy matching comprises: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client-side preset information into a comparison pinyin sequence; computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client-side preset information the top-ranked word by similarity from high to low; and replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.

The pinyin sequence conversion can be realized by looking up the pronunciation dictionary, and the similarity can be computed according to the edit distance between the two pinyin sequences.

The method provided by this application is usually applied in applications that use speech as the interaction medium. The speech to be recognized captured by such applications may involve client-side information, and the method provided by this application can verify the accuracy of the decoded word sequence by matching it textually against the client-side preset information, thereby providing a basis for the necessary correction of the word sequence. Furthermore, by applying fuzzy matching at the pinyin level, the word sequence can be corrected, improving the accuracy of speech recognition.

The above embodiment provides another speech recognition method; correspondingly, this application also provides another speech recognition device. Refer to Figure 11, a schematic diagram of an embodiment of another speech recognition device of this application. Since the device embodiment is essentially similar to the method embodiment, it is described relatively simply; for relevant details, see the corresponding parts of the method embodiment. The device embodiment described below is merely illustrative.

A speech recognition device of this embodiment comprises: a word sequence acquisition unit 1101 for obtaining through decoding the word sequence corresponding to the speech signal to be recognized; and a word sequence verification unit 1102 for verifying the accuracy of the word sequence through text matching against client-side preset information and generating the corresponding speech recognition result according to the verification result.

Optionally, the word sequence verification unit comprises: a word-to-verify selection subunit for selecting from the word sequence the word to be verified that corresponds to the client-side preset information; a lookup subunit for looking up the word to be verified in the client-side preset information; a recognition result confirmation subunit for judging, after the lookup subunit finds the word to be verified, that the accuracy verification has passed and taking the word sequence as the speech recognition result; and a recognition result correction subunit for correcting, when the lookup subunit does not find the word to be verified, the word sequence through pinyin-based fuzzy matching and taking the corrected word sequence as the speech recognition result.

Optionally, the recognition result correction subunit comprises: a to-be-verified pinyin sequence conversion subunit for converting the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit for converting each word in the client-side preset information into a comparison pinyin sequence; a similarity calculation and selection subunit for computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence and selecting from the client-side preset information the top-ranked word by similarity from high to low; and a word-to-verify replacement subunit for replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.

Although this application is disclosed above with preferred embodiments, they are not intended to limit it. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of this application; the scope of protection of this application shall therefore be that defined by the claims of this application.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.

The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

2. Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. This application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.

Claims (37)

1. A speech recognition method, characterized by comprising: using preset speech knowledge sources to generate a search space containing client-side preset information for decoding speech signals; extracting the feature vector sequence of the speech signal to be recognized; calculating the probabilities of the feature vectors with respect to the basic units of the search space; and performing, with these probabilities as input, a decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence.
2. The speech recognition method according to claim 1, wherein the search space comprises a weighted finite-state transducer.
3. The speech recognition method according to claim 2, wherein the basic units of the search space comprise context-dependent triphones, and the preset knowledge sources comprise a pronunciation dictionary, a language model, and a triphone state-tying table.
4. The speech recognition method according to claim 3, wherein generating, using preset speech knowledge sources, the search space containing client-side preset information for decoding speech signals comprises: adding, by tag replacement, client-side preset information corresponding to a preset topic category to a pre-generated weighted finite-state transducer based at least on the language model, and obtaining a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model; wherein the language model is pre-trained by replacing preset named entities in the text used for training the language model with tags corresponding to the preset topic category and training the language model with that text.
5. The speech recognition method according to claim 4, wherein adding, by tag replacement, the client-side preset information corresponding to the preset topic category to the pre-generated weighted finite-state transducer based at least on the language model and obtaining the single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model comprises: adding, by tag replacement, the client-side preset information corresponding to the preset topic category to the pre-generated weighted finite-state transducer based on the language model; and composing the weighted finite-state transducer to which the client-side preset information has been added with the pre-generated weighted finite-state transducer based on the triphone state-tying table and the pronunciation dictionary, to obtain the single weighted finite-state transducer.
6. The speech recognition method according to claim 4, wherein the text used for training the language model is text directed to the preset topic category.
7. The speech recognition method according to claim 4, wherein the number of preset topic categories is at least two; the number of language models and the number of weighted finite-state transducers based at least on the language model are each equal to the number of preset topic categories; and adding, by tag replacement, the client-side preset information corresponding to the preset topic category to the pre-generated weighted finite-state transducer based at least on the language model comprises: determining the preset topic category to which the speech signal to be recognized belongs; selecting the pre-generated weighted finite-state transducer based at least on the language model that corresponds to the preset topic category; and adding the client-side preset information to the selected weighted finite-state transducer by replacing the corresponding tags with the client-side preset information corresponding to the preset topic category.
8. The speech recognition method according to claim 7, wherein determining the preset topic category to which the speech signal to be recognized belongs is realized by determining the preset topic category according to the type of the client that captures the speech signal, or according to the application.
9. The speech recognition method according to claim 8, wherein the preset topic categories include making calls or sending text messages, playing music, or setting instructions, and the corresponding client-side preset information includes contact names in an address book, song titles in a music library, or instructions in an instruction set.
10. The speech recognition method according to claim 5, wherein the composition operation comprises composing with a prediction-based method.
11. The speech recognition method according to claim 4, wherein the vocabulary used to pre-train the language model is consistent with the words contained in the pronunciation dictionary.
12. The speech recognition method according to claim 3, wherein calculating the probabilities of the feature vectors with respect to the basic units of the search space comprises: using a pre-trained DNN model to calculate the probabilities of the feature vectors with respect to the triphone states; and using a pre-trained HMM model, based on the probabilities of the feature vectors with respect to the triphone states, to calculate the probabilities of the feature vectors with respect to the triphones.
13. The speech recognition method according to claim 12, wherein the execution speed of the step of using the pre-trained DNN model to calculate the probabilities of the feature vectors with respect to the triphone states is increased by exploiting the data-parallel processing capability provided by the hardware platform.
14. The speech recognition method of any one of claims 1 to 13, wherein extracting the feature vector sequence of the speech signal to be recognized comprises: dividing the speech signal into frames of a preset frame length to obtain a plurality of audio frames; and extracting a feature vector from each audio frame to obtain the feature vector sequence.

15. The speech recognition method of claim 14, wherein extracting the feature vector of each audio frame comprises extracting MFCC features, PLP features, or LPC features.

16. The speech recognition method of any one of claims 1 to 13, wherein after the word sequence corresponding to the feature vector sequence is obtained, the following is performed: verifying the accuracy of the word sequence by text matching against the client-side preset information, and generating the corresponding speech recognition result from the verification result.

17. The speech recognition method of claim 16, wherein the verifying and generating comprises: selecting, from the word sequence, a word to be verified that corresponds to the client-side preset information; looking up the word to be verified in the client-side preset information; if it is found, deeming the accuracy verification passed and taking the word sequence as the speech recognition result; otherwise, correcting the word sequence by pinyin-based fuzzy matching and taking the corrected word sequence as the speech recognition result.

18. The speech recognition method of claim 17, wherein correcting the word sequence by pinyin-based fuzzy matching comprises: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client-side preset information into a comparison pinyin sequence; computing, one by one, the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client-side preset information the word ranked highest by similarity; and replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.

19. The speech recognition method of claim 18, wherein the similarity comprises a similarity computed from edit distance.

20. The speech recognition method of any one of claims 1 to 13, wherein the method is implemented on a client device, the client device comprising a smart mobile terminal, a smart speaker, or a robot.
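A minimal sketch of the framing step in claim 14, assuming a typical 25 ms frame length with a 10 ms shift and a placeholder per-frame feature (log energy) standing in for the MFCC/PLP/LPC extraction of claim 15; the concrete values are conventional defaults, not figures specified by the patent.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

def log_energy(frames):
    """Placeholder per-frame feature; a real front end would compute
    MFCC, PLP, or LPC vectors here."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

sr = 16000
waveform = np.random.randn(sr)           # one second of dummy audio
frames = frame_signal(waveform, sr)      # shape: (98, 400) at 16 kHz
print(frames.shape, log_energy(frames)[:3])
```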
21. A speech recognition apparatus, comprising: a search space generation unit, configured to generate, from preset speech knowledge sources, a search space that contains client-side preset information and is used for decoding speech signals; a feature vector extraction unit, configured to extract the feature vector sequence of the speech signal to be recognized; a probability computation unit, configured to compute the probability that each feature vector corresponds to a basic unit of the search space; and a decoding search unit, configured to perform a decoding operation in the search space, with the probabilities as input, to obtain the word sequence corresponding to the feature vector sequence.

22. The speech recognition apparatus of claim 21, wherein the search space generation unit is specifically configured to add, by label substitution, client-side preset information corresponding to a preset topic category to a pre-generated WFST built at least from a language model, and to obtain a single WFST based on the triphone state-tying table, the pronunciation dictionary, and the language model; the language model being pre-generated by a language model training unit configured to replace preset named entities in the training text with labels corresponding to the preset topic category and to train the language model on that text.

23. The speech recognition apparatus of claim 22, wherein the search space generation unit comprises: a first client-side information adding subunit, configured to add, by label substitution, the client-side preset information corresponding to the preset topic category to a pre-generated language-model-based WFST; and a WFST merging subunit, configured to merge the WFST to which the client-side preset information has been added with a pre-generated WFST based on the triphone state-tying table and the pronunciation dictionary, to obtain the single WFST.
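Claims 5, 10, 23, and 26 all turn on merging (composing) a lexicon-level WFST with the language-model WFST. The following is a minimal sketch of naive transducer composition over toy dict-based machines; the state-pair construction is the textbook algorithm, the phone and word labels are invented, and real toolkits add epsilon filters and, per the lookahead method of claims 10 and 26, prediction to keep the composed machine small.

```python
from collections import deque

EPS = "<eps>"

def compose(t1, t2, start1, start2, finals1, finals2):
    """Naive composition of two transducers, each given as
    {state: [(in_label, out_label, next_state)]}: the output side of
    t1 is matched against the input side of t2, states become pairs,
    and an epsilon output on t1 advances t1 alone."""
    start = (start1, start2)
    arcs, finals, seen, queue = {}, set(), {start}, deque([start])

    def add(src, arc):
        arcs.setdefault(src, []).append(arc)
        if arc[2] not in seen:
            seen.add(arc[2])
            queue.append(arc[2])

    while queue:
        q1, q2 = queue.popleft()
        if q1 in finals1 and q2 in finals2:
            finals.add((q1, q2))
        for i1, o1, n1 in t1.get(q1, []):
            if o1 == EPS:
                add((q1, q2), (i1, EPS, (n1, q2)))
            else:
                for i2, o2, n2 in t2.get(q2, []):
                    if i2 == o1:
                        add((q1, q2), (i1, o2, (n1, n2)))
    return arcs, finals

# Toy lexicon transducer: phone sequence "k ao" -> word "call".
lexicon = {0: [("k", EPS, 1)], 1: [("ao", "call", 2)]}
# Toy grammar acceptor over words.
grammar = {0: [("call", "call", 1)]}
print(compose(lexicon, grammar, 0, 0, finals1={2}, finals2={1}))
```

The composed machine maps the phone sequence directly to the word sequence, which is exactly the shape of search space the decoding search unit of claim 21 traverses.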
24. The speech recognition apparatus of claim 22, wherein the search space generation unit comprises: a second client-side information adding subunit, configured to add, by label substitution, client-side preset information corresponding to a preset topic category to a pre-generated WFST built at least from a language model; and a unified WFST acquisition subunit, configured to obtain, after the second client-side information adding subunit completes the adding operation, a single WFST based on the triphone state-tying table, the pronunciation dictionary, and the language model; wherein the second client-side information adding subunit comprises: a topic determination subunit, configured to determine the preset topic category to which the speech signal to be recognized belongs; a WFST selection subunit, configured to select the pre-generated WFST, built at least from a language model, that corresponds to the preset topic category; and a label replacement subunit, configured to add the client-side preset information to the selected WFST by replacing the corresponding labels with the client-side preset information for that category.

25. The speech recognition apparatus of claim 24, wherein the topic determination subunit is specifically configured to determine the preset topic category from the type of client device, or the application, that collected the speech signal.

26. The speech recognition apparatus of claim 23, wherein the WFST merging subunit is specifically configured to perform the merging operation with a prediction-based (lookahead) method to obtain the single WFST.

27. The speech recognition apparatus of claim 21, wherein the probability computation unit comprises: a triphone state probability computation subunit, configured to compute, with a pre-trained DNN model, the probability that a feature vector corresponds to each triphone state; and a triphone probability computation subunit, configured to compute, with a pre-trained HMM model and from those state probabilities, the probability that the feature vector corresponds to each triphone.
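The topic determination and WFST selection of claims 24 and 25 amount to a lookup from the capturing client or application to one of several pre-built, label-bearing transducers. A minimal sketch, with every category name, client type, and file name invented for illustration:

```python
# Hypothetical mapping from client type / application to topic category.
CATEGORY_BY_CLIENT = {
    "phone_dialer":  "call_or_sms",   # address-book contacts
    "music_player":  "play_music",    # music-library track titles
    "smart_speaker": "commands",      # command-set instructions
}

# One pre-generated, label-bearing WFST per preset topic category
# (strings stand in for real transducer objects).
WFST_BY_CATEGORY = {
    "call_or_sms": "G_call.fst",
    "play_music":  "G_music.fst",
    "commands":    "G_cmd.fst",
}

def select_wfst(client_type):
    """Determine the topic category from the client type, then pick
    the matching pre-generated WFST for label substitution."""
    category = CATEGORY_BY_CLIENT[client_type]
    return category, WFST_BY_CATEGORY[category]

print(select_wfst("music_player"))   # ('play_music', 'G_music.fst')
```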
28. The speech recognition apparatus of any one of claims 21 to 27, wherein the feature vector extraction unit comprises: a framing subunit, configured to divide the speech signal to be recognized into frames of a preset frame length to obtain a plurality of audio frames; and a feature extraction subunit, configured to extract a feature vector from each audio frame to obtain the feature vector sequence.

29. The speech recognition apparatus of any one of claims 21 to 27, comprising: an accuracy verification unit, configured to verify, after the decoding search unit obtains the word sequence corresponding to the feature vector sequence, the accuracy of the word sequence by text matching against the client-side preset information, and to generate the corresponding speech recognition result from the verification result.

30. The speech recognition apparatus of claim 29, wherein the accuracy verification unit comprises: a word-to-be-verified selection subunit, configured to select, from the word sequence, a word to be verified that corresponds to the client-side preset information; a lookup subunit, configured to look up the word to be verified in the client-side preset information; a recognition result confirmation subunit, configured to deem, when the lookup subunit finds the word to be verified, the accuracy verification passed and to take the word sequence as the speech recognition result; and a recognition result correction subunit, configured to correct, when the lookup subunit does not find the word to be verified, the word sequence by pinyin-based fuzzy matching and to take the corrected word sequence as the speech recognition result.

31. The speech recognition apparatus of claim 30, wherein the recognition result correction subunit comprises: a to-be-verified pinyin sequence conversion subunit, configured to convert the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit, configured to convert each word in the client-side preset information into a comparison pinyin sequence; a similarity computation and selection subunit, configured to compute, one by one, the similarity between the pinyin sequence to be verified and each comparison pinyin sequence and to select from the client-side preset information the word ranked highest by similarity; and a word replacement subunit, configured to replace the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
32. A speech recognition method, comprising: obtaining, by decoding, the word sequence corresponding to the speech signal to be recognized; and verifying the accuracy of the word sequence by text matching against client-side preset information, and generating the corresponding speech recognition result from the verification result.

33. The speech recognition method of claim 32, wherein the verifying and generating comprises: selecting, from the word sequence, a word to be verified that corresponds to the client-side preset information; looking up the word to be verified in the client-side preset information; if it is found, deeming the accuracy verification passed and taking the word sequence as the speech recognition result; otherwise, correcting the word sequence by pinyin-based fuzzy matching and taking the corrected word sequence as the speech recognition result.

34. The speech recognition method of claim 33, wherein correcting the word sequence by pinyin-based fuzzy matching comprises: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client-side preset information into a comparison pinyin sequence; computing, one by one, the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client-side preset information the word ranked highest by similarity; and replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.

35. A speech recognition apparatus, comprising: a word sequence acquisition unit, configured to obtain, by decoding, the word sequence corresponding to the speech signal to be recognized; and a word sequence verification unit, configured to verify the accuracy of the word sequence by text matching against client-side preset information and to generate the corresponding speech recognition result from the verification result.

36. The speech recognition apparatus of claim 35, wherein the word sequence verification unit comprises: a word-to-be-verified selection subunit, configured to select, from the word sequence, a word to be verified that corresponds to the client-side preset information; a lookup subunit, configured to look up the word to be verified in the client-side preset information; a recognition result confirmation subunit, configured to deem, when the lookup subunit finds the word to be verified, the accuracy verification passed and to take the word sequence as the speech recognition result; and a recognition result correction subunit, configured to correct, when the lookup subunit does not find the word to be verified, the word sequence by pinyin-based fuzzy matching and to take the corrected word sequence as the speech recognition result.

37. The speech recognition apparatus of claim 36, wherein the recognition result correction subunit comprises: a to-be-verified pinyin sequence conversion subunit, configured to convert the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit, configured to convert each word in the client-side preset information into a comparison pinyin sequence; a similarity computation and selection subunit, configured to compute, one by one, the similarity between the pinyin sequence to be verified and each comparison pinyin sequence and to select from the client-side preset information the word ranked highest by similarity; and a word replacement subunit, configured to replace the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
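A minimal sketch of the pinyin-based fuzzy matching of claims 33-34 (and 17-18). It assumes a tiny hard-coded character-to-pinyin table in place of a real converter, and a similarity defined as one minus the edit distance normalized by the longer sequence length; claim 19 names edit distance, but the normalization is an assumption for illustration.

```python
# Hypothetical character-to-pinyin table; a real system would run a
# full pinyin converter over the client-side vocabulary.
PINYIN = {"张": "zhang", "章": "zhang", "三": "san",
          "伞": "san", "李": "li", "四": "si"}

def to_pinyin(word):
    return [PINYIN[ch] for ch in word]

def edit_distance(a, b):
    """Levenshtein distance over pinyin syllable sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def best_match(word_to_verify, preset_entries):
    """Pick the preset entry whose pinyin is most similar to the
    decoded word; similarity = 1 - distance / max length (assumed)."""
    target = to_pinyin(word_to_verify)

    def similarity(entry):
        cand = to_pinyin(entry)
        return 1 - edit_distance(target, cand) / max(len(target), len(cand))

    return max(preset_entries, key=similarity)

# The decoded word "章伞" is not in the address book; fuzzy matching
# corrects it to the homophonous contact "张三".
print(best_match("章伞", ["张三", "李四"]))
```

Matching in pinyin space is what lets the correction step recover homophones that the decoder transcribed with the wrong characters.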
TW106102245A 2017-01-20 2017-01-20 Speech recognition method and device TWI731921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW106102245A TWI731921B (en) 2017-01-20 2017-01-20 Speech recognition method and device


Publications (2)

Publication Number Publication Date
TW201828279A 2018-08-01
TWI731921B TWI731921B (en) 2021-07-01

Family

ID=63960483




Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102027534B (en) * 2008-05-16 2013-07-31 日本电气株式会社 Language model score lookahead value imparting device and method for the same
US9197736B2 (en) * 2009-12-31 2015-11-24 Digimarc Corporation Intuitive computing methods and systems
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
US10255911B2 (en) * 2014-12-17 2019-04-09 Intel Corporation System and method of automatic speech recognition using parallel processing for weighted finite state transducer-based speech decoding

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151020A (en) * 2019-06-28 2020-12-29 北京声智科技有限公司 Voice recognition method and device, electronic equipment and storage medium
TWI829312B (en) * 2021-12-01 2024-01-11 美商萬國商業機器公司 Methods, computer program products, and computer systems for training an automatic speech recognition system
US11908454B2 (en) 2021-12-01 2024-02-20 International Business Machines Corporation Integrating text inputs for training and adapting neural network transducer ASR models

