TWI636452B - Method and system of voice recognition

Method and system of voice recognition

Info

Publication number
TWI636452B
Authority
TW
Taiwan
Prior art keywords
language model
sentence
segmentation
speech recognition
preset
Prior art date
Application number
TW106135251A
Other languages
Chinese (zh)
Other versions
TW201901661A (en)
Inventor
王健宗
程寧
查高密
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Application granted
Publication of TWI636452B
Publication of TW201901661A

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F40/00 Handling natural language data
            • G06F40/20 Natural language analysis
              • G06F40/205 Parsing
                • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
              • G06F40/237 Lexical tools
                • G06F40/247 Thesauruses; Synonyms
              • G06F40/279 Recognition of textual entities
                • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
            • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L15/063 Training
            • G10L15/08 Speech classification or search
              • G10L15/18 Speech classification or search using natural language modelling
                • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models


Abstract

A speech recognition method and system. The method includes: obtaining information text of a specific type from predetermined data sources; splitting each obtained information text into sentences and performing word segmentation on each sentence to obtain its corresponding segmented words, each sentence and its corresponding segmented words forming a first mapping corpus; and training a first language model of a preset type from the obtained first mapping corpora, and performing speech recognition based on the trained first language model. The invention effectively improves the accuracy of speech recognition and effectively reduces its cost.

Description

Speech recognition method and system

The present invention relates to the field of computer technology, and in particular to a speech recognition method and system.

The language model plays an important role in speech recognition tasks. In existing speech recognition, a language model is generally built from annotated dialogue text, and the probability of each word is determined through that language model. However, because the scenarios in which users currently need speech recognition in daily life are still few (the more common ones being voice search, voice control, and similar fields), the types and range of corpora that can be collected are overly concentrated, and building a language model from annotated dialogue text therefore has two drawbacks: first, such corpora are expensive to purchase; second, it is difficult to obtain a sufficient amount of corpus data, annotated dialogue text is hard to acquire, and the timeliness and accuracy of upgrades and expansions are difficult to guarantee. This degrades the training effect and recognition precision of the language model and, in turn, the accuracy of speech recognition.

Therefore, how to use existing corpus resources to effectively improve the accuracy of speech recognition while effectively reducing its cost has become an urgent technical problem.

The main purpose of the present invention is to provide a speech recognition method and system that effectively improve the accuracy of speech recognition and effectively reduce its cost.

To achieve the above object, the present invention provides a speech recognition method that includes the following steps: A. obtaining information text of a specific type from predetermined data sources; B. splitting each obtained information text into sentences and performing word segmentation on each sentence to obtain its corresponding segmented words, each sentence and its corresponding segmented words forming a first mapping corpus; C. training a first language model of a preset type from the obtained first mapping corpora, and performing speech recognition based on the trained first language model.

In one embodiment, step C is replaced by: training a first language model of a preset type from the obtained first mapping corpora; training a second language model of a preset type from second mapping corpora consisting of predetermined sample sentences and their corresponding segmented words; and mixing the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and performing speech recognition based on the obtained mixed language model.

In one embodiment, the predetermined model mixing formula is: M = a*M1 + b*M2, where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of model M2.

In one embodiment, the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first or second language model of the preset type is as follows: S1, dividing the first mapping corpora or the second mapping corpora into a training set of a first proportion and a validation set of a second proportion; S2, training the first language model or the second language model with the training set; S3, verifying the accuracy of the trained first or second language model with the validation set; if the accuracy is greater than or equal to a preset accuracy, the training ends, or, if the accuracy is less than the preset accuracy, the number of first mapping corpora or second mapping corpora is increased and steps S1, S2 and S3 are executed again.

In one embodiment, the step of performing word segmentation on each split sentence includes: when a split sentence is selected for word segmentation, matching the sentence against a predetermined word dictionary using the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of phrases and a third number of single characters; matching the sentence against the predetermined word dictionary using the backward maximum matching method to obtain a second matching result, the second matching result containing a second number of phrases and a fourth number of single characters; if the first number equals the second number and the third number is less than or equal to the fourth number, taking the first matching result as the word segmentation result of the split sentence; if the first number equals the second number and the third number is greater than the fourth number, taking the second matching result as the word segmentation result; if the first number does not equal the second number and the first number is greater than the second number, taking the second matching result as the word segmentation result; and if the first number does not equal the second number and the first number is less than the second number, taking the first matching result as the word segmentation result.

In addition, to achieve the above object, the present invention further provides a speech recognition system that includes: an acquisition module for obtaining information text of a specific type from predetermined data sources; a word segmentation module for splitting each obtained information text into sentences and performing word segmentation on each sentence to obtain its corresponding segmented words, each sentence and its corresponding segmented words forming a first mapping corpus; and a training and recognition module for training a first language model of a preset type from the obtained first mapping corpora and performing speech recognition based on the trained first language model.

In one embodiment, the training and recognition module is further configured to: train a first language model of a preset type from the obtained first mapping corpora; train a second language model of a preset type from second mapping corpora consisting of predetermined sample sentences and their corresponding segmented words; and mix the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and perform speech recognition based on the obtained mixed language model.

In one embodiment, the predetermined model mixing formula is: M = a*M1 + b*M2, where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of model M2.

In one embodiment, the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first or second language model of the preset type is as follows: S1, dividing the first mapping corpora or the second mapping corpora into a training set of a first proportion and a validation set of a second proportion; S2, training the first language model or the second language model with the training set; S3, verifying the accuracy of the trained first or second language model with the validation set; if the accuracy is greater than or equal to a preset accuracy, the training ends, or, if the accuracy is less than the preset accuracy, the number of first mapping corpora or second mapping corpora is increased and steps S1, S2 and S3 are executed again.

In one embodiment, the word segmentation module is further configured to: when a split sentence is selected for word segmentation, match the sentence against a predetermined word dictionary using the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of phrases and a third number of single characters; match the sentence against the predetermined word dictionary using the backward maximum matching method to obtain a second matching result, the second matching result containing a second number of phrases and a fourth number of single characters; if the first number equals the second number and the third number is less than or equal to the fourth number, take the first matching result as the word segmentation result of the split sentence; if the first number equals the second number and the third number is greater than the fourth number, take the second matching result as the word segmentation result; if the first number does not equal the second number and the first number is greater than the second number, take the second matching result as the word segmentation result; and if the first number does not equal the second number and the first number is less than the second number, take the first matching result as the word segmentation result.

The speech recognition method and system provided by the present invention split information text of a specific type, obtained from predetermined data sources, into sentences and perform word segmentation on each split sentence, obtaining a first mapping corpus of each split sentence and its corresponding segmented words; a first language model of a preset type is trained from the first mapping corpora, and speech recognition is performed based on the trained first language model. Because corpus resources can be obtained by sentence splitting and corresponding word segmentation of information text collected from multiple predetermined data sources, and the language model is trained on these corpus resources, there is no need to acquire annotated dialogue text, a sufficient amount of corpus data can be obtained, and the training effect and recognition precision of the language model are guaranteed, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.

100‧‧‧electronic device

10‧‧‧speech recognition system

11‧‧‧storage

12‧‧‧processor

13‧‧‧display

01‧‧‧acquisition module

02‧‧‧word segmentation module

03‧‧‧training and recognition module

S10‧‧‧step

S20‧‧‧step

S30‧‧‧step

S40‧‧‧step

S50‧‧‧step

S60‧‧‧step

FIG. 1 is a schematic flowchart of a first embodiment of the speech recognition method of the present invention.

FIG. 2 is a schematic flowchart of a second embodiment of the speech recognition method of the present invention.

FIG. 3 is a schematic diagram of the operating environment of a preferred embodiment of the speech recognition system 10 of the present invention.

FIG. 4 is a schematic diagram of the functional modules of an embodiment of the speech recognition system of the present invention.

To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

The present invention provides a speech recognition method.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a first embodiment of the speech recognition method of the present invention.

In the first embodiment, the speech recognition method includes: Step S10, obtaining information text of a specific type from predetermined data sources.

In this embodiment, before the language model is trained, information text of specific types (for example, encyclopedia entries and their explanations, news headlines, news summaries, microblog posts, and so on) is obtained in real time or at scheduled times from multiple predetermined data sources (for example, websites such as Sina Weibo, Baidu Baike, Wikipedia, and Sina News). For example, specific types of information (for example, news headline information, index information, summary information, etc.) can be collected in real time or at scheduled times from predetermined data sources (for example, major news websites, forums, etc.) through tools such as web crawlers.
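
The patent does not name a specific crawling tool, so the following is only a minimal sketch of what such a scheduled fetch step might look like in Python, assuming the requests and beautifulsoup4 packages; the source URL and the CSS selector for headline elements are hypothetical placeholders.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical configuration: one (URL, CSS selector) pair per data source.
SOURCES = [
    ("https://news.example.com/", "h2.headline"),
]

def fetch_headlines(sources):
    """Collect headline texts from each configured data source."""
    texts = []
    for url, selector in sources:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        texts.extend(el.get_text(strip=True) for el in soup.select(selector))
    return texts
```

A production system would run such a fetch periodically and persist the collected texts for the sentence splitting and word segmentation steps described next.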

Step S20, splitting each obtained information text into sentences, and performing word segmentation on each sentence to obtain its corresponding segmented words, each sentence and its corresponding segmented words forming a first mapping corpus.

After the information texts of the specific types have been obtained from the multiple predetermined data sources, each obtained information text can be split into sentences, for example by cutting each text into complete sentences at punctuation marks. Word segmentation is then performed on each split sentence. For example, string-matching segmentation methods can be used: the forward maximum matching method, which segments the character string of a split sentence from left to right; the backward maximum matching method, which segments the character string of a split sentence from right to left; the shortest-path segmentation method, which requires the smallest possible number of words to be cut out of the character string; or the bidirectional maximum matching method, which performs forward and backward matching simultaneously. Word-sense segmentation may also be used; it is a segmentation method based on machine judgement of meaning, which uses syntactic and semantic information to resolve ambiguity during segmentation. Statistical segmentation may also be applied to each split sentence: from the current user's historical search records or the historical search records of the general user population, phrase statistics show that certain pairs of adjacent characters occur together frequently, and such adjacent characters can then be treated as a phrase during segmentation.
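
As an illustration of the first of the string-matching approaches above, here is a minimal Python sketch of forward maximum matching; the dictionary contents and the maximum word length are assumptions made for the example, not values given in the patent.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Left-to-right greedy segmentation: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Example with a tiny hypothetical dictionary:
# forward_max_match("語音識別方法", {"語音", "識別", "方法", "語音識別"})
# returns ["語音識別", "方法"]
```

Backward maximum matching is the right-to-left mirror of this routine; a sketch combining both passes appears with the bidirectional matching discussion below.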

After word segmentation has been completed for each of the obtained split sentences, the first mapping corpus composed of each split sentence and its corresponding segmented words is obtained. By obtaining information text from multiple predetermined data sources and splitting it into a large number of sentences for word segmentation, corpus resources that are rich in type, broad in range, and large in quantity can be collected from the multiple data sources.

Step S30, training a first language model of a preset type from the obtained first mapping corpora, and performing speech recognition based on the trained first language model.

A first language model of a preset type is trained on the first mapping corpora; the first language model may be a generative model, an analytical model, a discriminative model, or the like. Because the first mapping corpora are obtained from multiple data sources, the corpus resources are rich in type, broad in range, and large in quantity, so training the first language model on them yields a better training effect, which in turn yields higher recognition accuracy when speech recognition is performed based on the trained first language model.

In this embodiment, information text of a specific type obtained from predetermined data sources is split into sentences, and word segmentation is performed on each split sentence, yielding the first mapping corpora of the split sentences and their corresponding segmented words; a first language model of a preset type is trained from the first mapping corpora, and speech recognition is performed based on the trained first language model. Because corpus resources can be obtained through sentence splitting and corresponding word segmentation of information text collected from multiple predetermined data sources, and the language model is trained on these corpus resources, there is no need to acquire annotated dialogue text, a sufficient amount of corpus data can be obtained, and the training effect and recognition precision of the language model are guaranteed, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.

Further, in other embodiments, step S20 may include cleaning and denoising each obtained information text. For example, for microblog content the cleaning and denoising includes: deleting the user name, id, and similar information from each post, keeping only its actual content; deleting forwarded posts, because the collected microblog content usually contains a large amount of forwarded material and repeated forwards would distort word frequencies, so forwarded content must be filtered out, for example by deleting every post that contains "轉發" (repost) or "http"; filtering out special symbols, that is, removing all symbols of preset types from the content; and converting traditional characters to simplified characters, since microblog content contains a large number of traditional characters, using a predetermined traditional-to-simplified mapping table to convert them all to simplified characters; and so on.
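
A minimal sketch of such a cleaning pass is given below, assuming each post arrives as a plain string; the regular expressions, the mention format, and the tiny traditional-to-simplified table are illustrative assumptions rather than the patent's actual rules (a real system would use a complete predetermined conversion table).

```python
import re

# Illustrative traditional-to-simplified lookup; a real deployment would use
# a full predetermined mapping table covering all traditional characters.
T2S = {"語": "语", "識": "识", "別": "别", "轉": "转", "發": "发"}

def clean_posts(posts):
    """Drop forwarded posts and strip mentions / special symbols, then
    convert traditional characters to simplified ones."""
    cleaned = []
    for post in posts:
        if "轉發" in post or "转发" in post or "http" in post:
            continue  # filter out forwarded posts and posts carrying links
        post = re.sub(r"@\S+", "", post)                 # remove user names / ids
        post = re.sub(r"[#\[\]【】<>|~^*]", "", post)      # remove preset special symbols
        post = "".join(T2S.get(ch, ch) for ch in post)   # traditional -> simplified
        cleaned.append(post.strip())
    return cleaned
```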

Sentence splitting is then performed on each cleaned and denoised information text. For example, the text between two clause-ending punctuation marks of preset types (for example, commas, periods, exclamation marks, and so on) is taken as one sentence to be segmented, and word segmentation is performed on each split sentence to obtain the mapping corpus of each split sentence and its corresponding segmented words (including phrases and single characters).
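
For example, the punctuation-based splitting could be implemented as in the sketch below, where the set of clause-ending marks is an assumed configuration value.

```python
import re

# Assumed preset clause-ending punctuation marks.
SENTENCE_BREAKS = r"[，。！？；,.!?;]"

def split_sentences(text):
    """Cut the cleaned text into sentences at the preset punctuation marks."""
    return [part.strip() for part in re.split(SENTENCE_BREAKS, text) if part.strip()]
```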

As shown in FIG. 2, the second embodiment of the present invention provides a speech recognition method in which, on the basis of the above embodiment, step S30 is replaced by: Step S40, training a first language model of a preset type from the obtained first mapping corpora.

Step S50, training a second language model of a preset type from the second mapping corpora of predetermined sample sentences and their corresponding segmented words. For example, a number of sample sentences may be determined in advance, such as the sentences that appear most frequently or are most commonly used in the predetermined data sources, and the correct segmentation (including phrases and single characters) of each sample sentence is determined, so that the second language model of the preset type can be trained from the second mapping corpora of the predetermined sample sentences and their corresponding segmented words.

Step S60, mixing the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and performing speech recognition based on the obtained mixed language model. The predetermined model mixing formula may be: M = a*M1 + b*M2, where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of model M2.
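
In practice this mixing is a linear interpolation of the two models' probabilities. The sketch below assumes each model is exposed as a callable returning P(word | history) and that the weights a and b are preset values (often chosen so that a + b = 1, although the patent does not require this).

```python
def mixed_prob(word, history, m1_prob, m2_prob, a=0.7, b=0.3):
    """M = a*M1 + b*M2 applied to one conditional probability.

    m1_prob / m2_prob: callables returning P(word | history) under the first
    and second language models; a, b: preset weight coefficients.
    """
    return a * m1_prob(word, history) + b * m2_prob(word, history)
```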

In this embodiment, in addition to training the first language model on the first mapping corpora obtained from multiple data sources, a second language model is trained from the second mapping corpora of predetermined sample sentences and their corresponding segmented words. For example, the predetermined sample sentences may be a preset set of the most commonly used sentences whose segmentation is known to be correct, so the trained second language model can correctly recognize commonly used speech. Mixing the trained first and second language models with different preset weight proportions yields a mixed language model, and performing speech recognition based on the obtained mixed language model both ensures that the recognizable speech is rich in type and broad in range and guarantees that commonly used speech is recognized correctly, further improving the accuracy of speech recognition.

Further, in other embodiments, the training process of the first or second language model of the preset type is as follows: A. dividing the first mapping corpora or the second mapping corpora into a training set of a first proportion (for example, 70%) and a validation set of a second proportion (for example, 30%); B. training the first language model or the second language model with the training set; C. verifying the accuracy of the trained first or second language model with the validation set; if the accuracy is greater than or equal to a preset accuracy, the training ends, or, if the accuracy is less than the preset accuracy, the number of first mapping corpora or second mapping corpora is increased and steps A, B and C are executed again, until the accuracy of the trained first or second language model is greater than or equal to the preset accuracy.
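
The sketch below outlines that train-validate-expand loop; train_model, evaluate_accuracy, and collect_more_corpus are hypothetical helpers standing in for the training, validation, and corpus-expansion steps, and the 70/30 split is the example proportion mentioned above.

```python
import random

def train_until_accurate(corpus, target_acc, train_model, evaluate_accuracy,
                         collect_more_corpus, train_ratio=0.7):
    """Split the corpus, train, validate; grow the corpus and repeat until the
    validation accuracy reaches the preset threshold."""
    while True:
        random.shuffle(corpus)
        split = int(len(corpus) * train_ratio)
        train_set, validation_set = corpus[:split], corpus[split:]
        model = train_model(train_set)
        if evaluate_accuracy(model, validation_set) >= target_acc:
            return model
        corpus.extend(collect_more_corpus())  # add more mapping corpora and retry
```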

Further, in other embodiments, the first language model and/or the second language model of the preset type is an n-gram language model. The n-gram language model is commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model exploits collocation information between adjacent words in context: when continuous pinyin or strokes without spaces, or digits representing letters or strokes, have to be converted into a string of Chinese characters (that is, a sentence), the sentence with the greatest probability can be computed, which achieves automatic conversion to Chinese characters and avoids the problem that many different Chinese characters correspond to the same pinyin (or stroke string, or digit string). The n-gram is a statistical language model that predicts the n-th item from the preceding (n-1) items. At the application level these items can be phonemes (speech recognition), characters (input methods), words (word segmentation), or base pairs (genetic information), and an n-gram model can be generated from a large-scale text or audio corpus.

The n-gram language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words, and these probabilities can be obtained by directly counting, in the mapping corpora, how many times n words occur together. For a sentence T composed of the word sequence W1, W2, ..., Wn, the probability of T is P(T) = P(W1W2...Wn) = P(W1)P(W2|W1)P(W3|W1W2)...P(Wn|W1W2...Wn-1). In this embodiment, to deal with n-grams whose occurrence probability would be 0, the training of the first language model and/or the second language model uses maximum likelihood estimation, that is: P(Wn|W1W2...Wn-1) = C(W1W2...Wn) / C(W1W2...Wn-1). In other words, during language model training, counting the number of occurrences of the sequence W1W2...Wn and the number of occurrences of W1W2...Wn-1 gives the occurrence probability of the n-th word, from which the probability of the corresponding characters can be determined and speech recognition can be carried out.
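
A minimal sketch of that count-based estimate is shown below, assuming the corpus has already been tokenized into lists of words; it computes P(Wn | W1...Wn-1) = C(W1...Wn) / C(W1...Wn-1) for a bigram-sized history purely to illustrate the counting.

```python
from collections import Counter

def ngram_counts(token_lists, n):
    """Count every n-gram occurring in the tokenized corpus."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def conditional_prob(history, word, counts_n, counts_hist):
    """Maximum likelihood estimate C(history + word) / C(history)."""
    denom = counts_hist[tuple(history)]
    return counts_n[tuple(history) + (word,)] / denom if denom else 0.0

# Tiny tokenized example corpus (n = 2):
corpus = [["語音", "識別", "方法"], ["語音", "識別", "系統"]]
bigrams, unigrams = ngram_counts(corpus, 2), ngram_counts(corpus, 1)
p = conditional_prob(["語音"], "識別", bigrams, unigrams)  # -> 1.0
```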

Further, in other embodiments, the word segmentation step applied to each split sentence in step S20 may include: matching the character string to be processed in each split sentence against a predetermined word dictionary (for example, a general-purpose dictionary or an extensible, learning dictionary) using the forward maximum matching method to obtain a first matching result; and matching the character string to be processed in each split sentence against the predetermined word dictionary (for example, a general-purpose dictionary or an extensible, learning dictionary) using the backward maximum matching method to obtain a second matching result. The first matching result contains a first number of phrases and a third number of single characters, and the second matching result contains a second number of phrases and a fourth number of single characters.

If the first number equals the second number and the third number is less than or equal to the fourth number, the first matching result (including phrases and single characters) is output as the segmentation of the split sentence; if the first number equals the second number and the third number is greater than the fourth number, the second matching result (including phrases and single characters) is output; if the first number does not equal the second number and the first number is greater than the second number, the second matching result (including phrases and single characters) is output; and if the first number does not equal the second number and the first number is less than the second number, the first matching result (including phrases and single characters) is output.

In this embodiment the bidirectional matching method is used to segment each obtained split sentence: forward and backward matching are performed at the same time to analyse how strongly the characters in the string to be processed bind together, because in general phrases are more likely to carry the core information, that is, the core information is better expressed through phrases. By matching in both directions simultaneously and selecting the matching result with fewer single characters and more phrases as the segmentation result of the split sentence, the accuracy of word segmentation is improved, which in turn guarantees the training effect and recognition accuracy of the language model.
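
Putting the two matching passes and the selection rules above together, a sketch of the bidirectional decision could look like the following; forward_max_match is the earlier sketch, backward_max_match is its right-to-left mirror, and the dictionary is again an assumed input.

```python
def backward_max_match(sentence, dictionary, max_len=4):
    """Mirror of forward_max_match: take the longest dictionary match from the right."""
    tokens, j = [], len(sentence)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            candidate = sentence[j - length:j]
            if length == 1 or candidate in dictionary:
                tokens.insert(0, candidate)
                j -= length
                break
    return tokens

def bidirectional_segment(sentence, dictionary):
    """Apply the selection rules stated above to the forward and backward results."""
    first = forward_max_match(sentence, dictionary)     # first matching result
    second = backward_max_match(sentence, dictionary)   # second matching result
    phrases1 = sum(1 for t in first if len(t) > 1)      # first number (phrases)
    phrases2 = sum(1 for t in second if len(t) > 1)     # second number (phrases)
    singles1 = sum(1 for t in first if len(t) == 1)     # third number (single characters)
    singles2 = sum(1 for t in second if len(t) == 1)    # fourth number (single characters)
    if phrases1 == phrases2:
        # equal phrase counts: keep the result with no more single characters
        return first if singles1 <= singles2 else second
    # unequal phrase counts: per the stated rules, keep the second result when
    # the first has more phrases, and the first result otherwise
    return second if phrases1 > phrases2 else first
```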

The present invention further provides a speech recognition system. Please refer to FIG. 3, which is a schematic diagram of the operating environment of a preferred embodiment of the speech recognition system 10 of the present invention.

In this embodiment, the speech recognition system 10 is installed and runs in an electronic device 100. The electronic device 100 may include, but is not limited to, a storage 11, a processor 12, and a display 13. FIG. 3 shows only the electronic device 100 with the components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

In some embodiments the storage 11 may be an internal storage unit of the electronic device 100, such as a hard disk or internal memory of the electronic device 100. In other embodiments the storage 11 may also be an external storage device of the electronic device 100, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 100. Further, the storage 11 may include both the internal storage unit and the external storage device of the electronic device 100. The storage 11 is used to store the application software installed on the electronic device 100 and various kinds of data, such as the program code of the speech recognition system 10, and may also be used to temporarily store data that has been output or is to be output.

In some embodiments the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code stored in the storage 11 or to process data, for example to execute the speech recognition system 10.

In some embodiments the display 13 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 13 is used to display the information processed in the electronic device 100 and to present a visual user interface, such as a speech recognition menu interface or speech recognition results. The components 11-13 of the electronic device 100 communicate with one another through a system bus.

Please refer to FIG. 4, which is a functional module diagram of a preferred embodiment of the speech recognition system 10 of the present invention. In this embodiment the speech recognition system 10 may be divided into one or more modules, which are stored in the storage 11 and executed by one or more processors (in this embodiment the processor 12) to implement the present invention. For example, in FIG. 4 the speech recognition system 10 is divided into an acquisition module 01, a word segmentation module 02, and a training and recognition module 03. A module in the sense of the present invention is a series of computer program instruction segments that perform a specific function, and is better suited than a program for describing how the speech recognition system 10 executes in the electronic device 100. The following describes the functions of the acquisition module 01, the word segmentation module 02, and the training and recognition module 03.

The acquisition module 01 is used to obtain information text of a specific type from predetermined data sources.

In this embodiment, before the language model is trained, information text of specific types (for example, encyclopedia entries and their explanations, news headlines, news summaries, microblog posts, and so on) is obtained in real time or at scheduled times from multiple predetermined data sources (for example, websites such as Sina Weibo, Baidu Baike, Wikipedia, and Sina News). For example, specific types of information (for example, news headline information, index information, summary information, etc.) can be collected in real time or at scheduled times from predetermined data sources (for example, major news websites, forums, etc.) through tools such as web crawlers.

The word segmentation module 02 is used to split each obtained information text into sentences, perform word segmentation on each sentence to obtain its corresponding segmented words, and form the first mapping corpora from the sentences and their corresponding segmented words.

After the information texts of the specific types have been obtained from the multiple predetermined data sources, each obtained information text can be split into sentences, for example by cutting each text into complete sentences at punctuation marks. Word segmentation is then performed on each split sentence. For example, string-matching segmentation methods can be used: the forward maximum matching method, which segments the character string of a split sentence from left to right; the backward maximum matching method, which segments the character string of a split sentence from right to left; the shortest-path segmentation method, which requires the smallest possible number of words to be cut out of the character string; or the bidirectional maximum matching method, which performs forward and backward matching simultaneously. Word-sense segmentation may also be used; it is a segmentation method based on machine judgement of meaning, which uses syntactic and semantic information to resolve ambiguity during segmentation. Statistical segmentation may also be applied to each split sentence: from the current user's historical search records or the historical search records of the general user population, phrase statistics show that certain pairs of adjacent characters occur together frequently, and such adjacent characters can then be treated as a phrase during segmentation.

After word segmentation has been completed for each of the obtained split sentences, the first mapping corpus composed of each split sentence and its corresponding segmented words is obtained. By obtaining information text from multiple predetermined data sources and splitting it into a large number of sentences for word segmentation, corpus resources that are rich in type, broad in range, and large in quantity can be collected from the multiple data sources.

The training and recognition module 03 is used to train a first language model of a preset type from the obtained first mapping corpora and to perform speech recognition based on the trained first language model.

A first language model of a preset type is trained on the first mapping corpora; the first language model may be a generative model, an analytical model, a discriminative model, or the like. Because the first mapping corpora are obtained from multiple data sources, the corpus resources are rich in type, broad in range, and large in quantity, so training the first language model on them yields a better training effect, which in turn yields higher recognition accuracy when speech recognition is performed based on the trained first language model.

In this embodiment, information text of a specific type obtained from predetermined data sources is split into sentences, and word segmentation is performed on each split sentence, yielding the first mapping corpora of the split sentences and their corresponding segmented words; a first language model of a preset type is trained from the first mapping corpora, and speech recognition is performed based on the trained first language model. Because corpus resources can be obtained through sentence splitting and corresponding word segmentation of information text collected from multiple predetermined data sources, and the language model is trained on these corpus resources, there is no need to acquire annotated dialogue text, a sufficient amount of corpus data can be obtained, and the training effect and recognition precision of the language model are guaranteed, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.

Further, in other embodiments, the word segmentation module 02 is further used to clean and denoise each obtained information text. For example, for microblog content the cleaning and denoising includes: deleting the user name, id, and similar information from each post, keeping only its actual content; deleting forwarded posts, because the collected microblog content usually contains a large amount of forwarded material and repeated forwards would distort word frequencies, so forwarded content must be filtered out, for example by deleting every post that contains "轉發" (repost) or "http"; filtering out special symbols, that is, removing all symbols of preset types from the content; and converting traditional characters to simplified characters, since microblog content contains a large number of traditional characters, using a predetermined traditional-to-simplified mapping table to convert them all to simplified characters; and so on.

Sentence splitting is then performed on each cleaned and denoised information text. For example, the text between two clause-ending punctuation marks of preset types (for example, commas, periods, exclamation marks, and so on) is taken as one sentence to be segmented, and word segmentation is performed on each split sentence to obtain the mapping corpus of each split sentence and its corresponding segmented words (including phrases and single characters).

Further, in other embodiments, the training and recognition module 03 is further used to: train a first language model of a preset type from the obtained first mapping corpora.

Train a second language model of a preset type from the second mapping corpora of predetermined sample sentences and their corresponding segmented words. For example, a number of sample sentences may be determined in advance, such as the sentences that appear most frequently or are most commonly used in the predetermined data sources, and the correct segmentation (including phrases and single characters) of each sample sentence is determined, so that the second language model of the preset type can be trained from the second mapping corpora of the predetermined sample sentences and their corresponding segmented words.

Mix the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and perform speech recognition based on the obtained mixed language model. The predetermined model mixing formula may be: M = a*M1 + b*M2, where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of model M2.

In this embodiment, in addition to training the first language model on the first mapping corpora obtained from multiple data sources, a second language model is trained from the second mapping corpora of predetermined sample sentences and their corresponding segmented words. For example, the predetermined sample sentences may be a preset set of the most commonly used sentences whose segmentation is known to be correct, so the trained second language model can correctly recognize commonly used speech. Mixing the trained first and second language models with different preset weight proportions yields a mixed language model, and performing speech recognition based on the obtained mixed language model both ensures that the recognizable speech is rich in type and broad in range and guarantees that commonly used speech is recognized correctly, further improving the accuracy of speech recognition.

Further, in other embodiments, the training process of the first or second language model of the preset type is as follows: A. dividing the first mapping corpora or the second mapping corpora into a training set of a first proportion (for example, 70%) and a validation set of a second proportion (for example, 30%); B. training the first language model or the second language model with the training set; C. verifying the accuracy of the trained first or second language model with the validation set; if the accuracy is greater than or equal to a preset accuracy, the training ends, or, if the accuracy is less than the preset accuracy, the number of first mapping corpora or second mapping corpora is increased and steps A, B and C are executed again, until the accuracy of the trained first or second language model is greater than or equal to the preset accuracy.

Further, in other embodiments, the first language model and/or the second language model of the preset type is an n-gram language model. The n-gram language model is commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model exploits collocation information between adjacent words in context: when continuous pinyin or strokes without spaces, or digits representing letters or strokes, have to be converted into a string of Chinese characters (that is, a sentence), the sentence with the greatest probability can be computed, which achieves automatic conversion to Chinese characters and avoids the problem that many different Chinese characters correspond to the same pinyin (or stroke string, or digit string). The n-gram is a statistical language model that predicts the n-th item from the preceding (n-1) items. At the application level these items can be phonemes (speech recognition), characters (input methods), words (word segmentation), or base pairs (genetic information), and an n-gram model can be generated from a large-scale text or audio corpus.

The n-gram language model rests on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, so the probability of a whole sentence is the product of the probabilities of its words; these probabilities can be obtained directly from the mapping corpora by counting how often sequences of n words co-occur. For a sentence T composed of the word sequence W1, W2, …, Wn, the probability of T is P(T) = P(W1W2…Wn) = P(W1)P(W2|W1)P(W3|W1W2)…P(Wn|W1W2…Wn-1). In this embodiment, to deal with n-grams whose occurrence probability would be 0, the training of the first language model and/or second language model uses maximum likelihood estimation, namely P(Wn|W1W2…Wn-1) = C(W1W2…Wn) / C(W1W2…Wn-1). That is, during language model training, counting the occurrences of the sequence W1W2…Wn and of the sequence W1W2…Wn-1 yields the occurrence probability of the n-th word, from which the probability of the corresponding word string is determined to perform speech recognition.
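A minimal sketch of this counting-based estimate is shown below, assuming the mapping corpora have already been reduced to lists of segmented words; the function names and the bigram default are illustrative only.

```python
from collections import Counter

# Minimal sketch of the maximum likelihood n-gram estimate described above:
# P(Wn | W1..Wn-1) = C(W1..Wn) / C(W1..Wn-1), with P(T) as the product of the
# conditional word probabilities (initial unigram factor omitted for brevity).

def build_counts(sentences, n=2):
    ngram_counts, history_counts = Counter(), Counter()
    for words in sentences:                      # each sentence is a list of words
        for i in range(len(words) - n + 1):
            ngram_counts[tuple(words[i:i + n])] += 1
            history_counts[tuple(words[i:i + n - 1])] += 1
    return ngram_counts, history_counts

def sentence_probability(words, ngram_counts, history_counts, n=2):
    p = 1.0
    for i in range(n - 1, len(words)):
        gram = tuple(words[i - n + 1:i + 1])
        hist = gram[:-1]
        if history_counts[hist] == 0:            # unseen history: probability is 0
            return 0.0
        p *= ngram_counts[gram] / history_counts[hist]
    return p
```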

Further, in other embodiments, the word segmentation module 02 is also configured to: match the string to be processed in each segmented sentence against a predetermined word dictionary (for example, a general-purpose word dictionary or an expandable, learning word dictionary) using the forward maximum matching method to obtain a first matching result; and match the string to be processed in each segmented sentence against the predetermined word dictionary (again, either a general-purpose word dictionary or an expandable, learning word dictionary) using the reverse maximum matching method to obtain a second matching result. The first matching result contains a first number of first phrases and a third number of single characters; the second matching result contains a second number of second phrases and a fourth number of single characters.

If the first number equals the second number and the third number is less than or equal to the fourth number, the first matching result (phrases and single characters) is output for that segmented sentence; if the first number equals the second number and the third number is greater than the fourth number, the second matching result (phrases and single characters) is output; if the first number is not equal to the second number and the first number is greater than the second number, the second matching result (phrases and single characters) is output; and if the first number is not equal to the second number and the first number is less than the second number, the first matching result (phrases and single characters) is output.

In this embodiment, a bidirectional matching method is used to segment each of the obtained sentences: performing forward and reverse matching at the same time analyses how strongly neighbouring characters in the string to be processed bind together. Since phrases are usually more likely than single characters to carry the core information of a sentence, running forward and reverse matching together and selecting the matching result with fewer single characters and more phrases as the segmentation result improves segmentation accuracy, which in turn safeguards the training effect and recognition accuracy of the language model.
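The following sketch illustrates this bidirectional selection rule, assuming a plain Python set as the word dictionary and an illustrative maximum word length of four characters; neither choice is fixed by this embodiment.

```python
# Minimal sketch of forward/reverse maximum matching plus the selection rule above.

def max_match(text, dictionary, max_len=4, reverse=False):
    tokens, s, i = [], (text[::-1] if reverse else text), 0
    while i < len(s):
        # Try the longest candidate word first, falling back to a single character.
        for length in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + length]
            word = piece[::-1] if reverse else piece
            if length == 1 or word in dictionary:
                tokens.append(word)
                i += length
                break
    return tokens[::-1] if reverse else tokens

def bidirectional_segment(sentence, dictionary):
    forward = max_match(sentence, dictionary)                  # first matching result
    backward = max_match(sentence, dictionary, reverse=True)   # second matching result
    phrases_f = sum(1 for w in forward if len(w) > 1)          # first number
    phrases_b = sum(1 for w in backward if len(w) > 1)         # second number
    singles_f = len(forward) - phrases_f                       # third number
    singles_b = len(backward) - phrases_b                      # fourth number
    if phrases_f == phrases_b:
        return forward if singles_f <= singles_b else backward
    return backward if phrases_f > phrases_b else forward
```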

It should be noted that, in this document, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article or apparatus that comprises that element.

From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored on a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, which does not thereby limit the patent scope of the present invention. The serial numbers of the above embodiments are for description only and do not indicate the superiority or inferiority of the embodiments. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.

Those skilled in the art can implement the present invention in many variations without departing from its scope and spirit; for example, a feature of one embodiment can be used in another embodiment to yield yet another embodiment. Any modification, equivalent replacement or improvement made within the technical concept of the present invention shall fall within the patent scope of the present invention.

Claims (10)

1. A speech recognition method, comprising the following steps: A. obtaining information text of a specific type from a predetermined data source, the specific type including entries and their explanations, news headlines, news summaries and/or microblog (Weibo) content; B. splitting each obtained information text into sentences according to punctuation marks, performing word segmentation on each sentence to obtain the corresponding segmented words, each sentence and its corresponding segmented words forming a first mapping corpus; and C. training a first language model of a preset type from the obtained first mapping corpora, and performing speech recognition based on the trained first language model.

2. The speech recognition method of claim 1, wherein step C is replaced by: training a first language model of a preset type from the obtained first mapping corpora; training a second language model of the preset type from a second mapping corpus of predetermined sample sentences and their corresponding segmented words; and mixing the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and performing speech recognition based on the obtained mixed language model.

3. The speech recognition method of claim 2, wherein the predetermined model mixing formula is M = a*M1 + b*M2, where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of the model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of the model M2.

4. The speech recognition method of claim 2 or 3, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first language model or the second language model of the preset type is as follows: S1. dividing the first mapping corpora or the second mapping corpora into a training set of a first proportion and a validation set of a second proportion; S2. training the first language model or the second language model on the training set; S3. verifying the accuracy of the trained first language model or second language model on the validation set; if the accuracy is greater than or equal to a preset accuracy, the training ends, or, if the accuracy is less than the preset accuracy, increasing the number of first mapping corpora or second mapping corpora and performing steps S1, S2 and S3 again.
5. The speech recognition method of claim 1, wherein the step of performing word segmentation on each segmented sentence comprises: when a segmented sentence is selected for word segmentation, matching the segmented sentence against a predetermined word dictionary using the forward maximum matching method to obtain a first matching result containing a first number of first phrases and a third number of single characters; matching the segmented sentence against the predetermined word dictionary using the reverse maximum matching method to obtain a second matching result containing a second number of second phrases and a fourth number of single characters; if the first number equals the second number and the third number is less than or equal to the fourth number, taking the first matching result as the word segmentation result of the segmented sentence; if the first number equals the second number and the third number is greater than the fourth number, taking the second matching result as the word segmentation result of the segmented sentence; if the first number is not equal to the second number and the first number is greater than the second number, taking the second matching result as the word segmentation result of the segmented sentence; and if the first number is not equal to the second number and the first number is less than the second number, taking the first matching result as the word segmentation result of the segmented sentence.

6. A speech recognition system, comprising: an acquisition module for obtaining information text of a specific type from a predetermined data source, the specific type including entries and their explanations, news headlines, news summaries and/or microblog (Weibo) content; a word segmentation module for splitting each obtained information text into sentences according to punctuation marks and performing word segmentation on each sentence to obtain the corresponding segmented words, each sentence and its corresponding segmented words forming a first mapping corpus; and a training and recognition module for training a first language model of a preset type from the obtained first mapping corpora and performing speech recognition based on the trained first language model.
7. The speech recognition system of claim 6, wherein the training and recognition module is further configured to: train a first language model of a preset type from the obtained first mapping corpora; train a second language model of the preset type from a second mapping corpus of predetermined sample sentences and their corresponding segmented words; and mix the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and perform speech recognition based on the obtained mixed language model.

8. The speech recognition system of claim 7, wherein the predetermined model mixing formula is M = a*M1 + b*M2, where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of the model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of the model M2.

9. The speech recognition system of claim 7 or 8, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first language model or the second language model of the preset type is as follows: S1. dividing the first mapping corpora or the second mapping corpora into a training set of a first proportion and a validation set of a second proportion; S2. training the first language model or the second language model on the training set; S3. verifying the accuracy of the trained first language model or second language model on the validation set; if the accuracy is greater than or equal to a preset accuracy, the training ends, or, if the accuracy is less than the preset accuracy, increasing the number of first mapping corpora or second mapping corpora and performing steps S1, S2 and S3 again.
10. The speech recognition system of claim 6, wherein the word segmentation module is further configured to: when a segmented sentence is selected for word segmentation, match the segmented sentence against a predetermined word dictionary using the forward maximum matching method to obtain a first matching result containing a first number of first phrases and a third number of single characters; match the segmented sentence against the predetermined word dictionary using the reverse maximum matching method to obtain a second matching result containing a second number of second phrases and a fourth number of single characters; if the first number equals the second number and the third number is less than or equal to the fourth number, take the first matching result as the word segmentation result of the segmented sentence; if the first number equals the second number and the third number is greater than the fourth number, take the second matching result as the word segmentation result of the segmented sentence; if the first number is not equal to the second number and the first number is greater than the second number, take the second matching result as the word segmentation result of the segmented sentence; and if the first number is not equal to the second number and the first number is less than the second number, take the first matching result as the word segmentation result of the segmented sentence.
TW106135251A 2017-05-10 2017-10-13 Method and system of voice recognition TWI636452B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 201710327374.8 2017-05-10
CN201710327374.8A CN107204184B (en) 2017-05-10 2017-05-10 Audio recognition method and system

Publications (2)

Publication Number Publication Date
TWI636452B true TWI636452B (en) 2018-09-21
TW201901661A TW201901661A (en) 2019-01-01

Family

ID=59905515

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106135251A TWI636452B (en) 2017-05-10 2017-10-13 Method and system of voice recognition

Country Status (3)

Country Link
CN (1) CN107204184B (en)
TW (1) TWI636452B (en)
WO (1) WO2018205389A1 (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257593B (en) * 2017-12-29 2020-11-13 深圳和而泰数据资源与云技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN108831442A (en) * 2018-05-29 2018-11-16 平安科技(深圳)有限公司 Point of interest recognition methods, device, terminal device and storage medium
CN110648657B (en) * 2018-06-27 2024-02-02 北京搜狗科技发展有限公司 Language model training method, language model building method and language model building device
CN109033082B (en) * 2018-07-19 2022-06-10 深圳创维数字技术有限公司 Learning training method and device of semantic model and computer readable storage medium
CN109344221B (en) * 2018-08-01 2021-11-23 创新先进技术有限公司 Recording text generation method, device and equipment
CN109582791B (en) * 2018-11-13 2023-01-24 创新先进技术有限公司 Text risk identification method and device
CN109377985B (en) * 2018-11-27 2022-03-18 北京分音塔科技有限公司 Speech recognition enhancement method and device for domain words
CN109582775B (en) * 2018-12-04 2024-03-26 平安科技(深圳)有限公司 Information input method, device, computer equipment and storage medium
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing
CN109461459A (en) * 2018-12-07 2019-03-12 平安科技(深圳)有限公司 Speech assessment method, apparatus, computer equipment and storage medium
CN109558596A (en) * 2018-12-14 2019-04-02 平安城市建设科技(深圳)有限公司 Recognition methods, device, terminal and computer readable storage medium
CN109783648B (en) * 2018-12-28 2020-12-29 北京声智科技有限公司 Method for improving ASR language model by using ASR recognition result
CN109815991B (en) * 2018-12-29 2021-02-19 北京城市网邻信息技术有限公司 Training method and device of machine learning model, electronic equipment and storage medium
CN110223674B (en) * 2019-04-19 2023-05-26 平安科技(深圳)有限公司 Speech corpus training method, device, computer equipment and storage medium
CN110349568B (en) * 2019-06-06 2024-05-31 平安科技(深圳)有限公司 Voice retrieval method, device, computer equipment and storage medium
CN110222182B (en) * 2019-06-06 2022-12-27 腾讯科技(深圳)有限公司 Statement classification method and related equipment
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium
CN110784603A (en) * 2019-10-18 2020-02-11 深圳供电局有限公司 Intelligent voice analysis method and system for offline quality inspection
CN113055017A (en) * 2019-12-28 2021-06-29 华为技术有限公司 Data compression method and computing device
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text
CN112712794A (en) * 2020-12-25 2021-04-27 苏州思必驰信息科技有限公司 Speech recognition marking training combined system and device
CN113127621A (en) * 2021-04-28 2021-07-16 平安国际智慧城市科技股份有限公司 Dialogue module pushing method, device, equipment and storage medium
CN113658585B (en) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 Training method of voice interaction model, voice interaction method and device
CN113948065B (en) * 2021-09-01 2022-07-08 北京数美时代科技有限公司 Method and system for screening error blocking words based on n-gram model
US12019976B1 (en) * 2022-12-13 2024-06-25 Calabrio, Inc. Call tagging using machine learning model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004111999A1 (en) * 2003-06-13 2004-12-23 Kwangwoon Foundation An amplitude warping approach to intra-speaker normalization for speech recognition

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593518A (en) * 2008-05-28 2009-12-02 中国科学院自动化研究所 The balance method of actual scene language material and finite state network language material
CN102495837B (en) * 2011-11-01 2014-05-07 中国科学院计算技术研究所 Training method and system for digital information recommending and forecasting model
CN103577386B (en) * 2012-08-06 2018-02-13 腾讯科技(深圳)有限公司 A kind of method and device based on user's input scene dynamic load language model
CN103971677B (en) * 2013-02-01 2015-08-12 腾讯科技(深圳)有限公司 A kind of acoustics language model training method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004111999A1 (en) * 2003-06-13 2004-12-23 Kwangwoon Foundation An amplitude warping approach to intra-speaker normalization for speech recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Prosody-Assisted Chinese Speech Recognition System and Its Applications" (「一種韻律輔助中文語音辨認系統及其應用」), 楊智合, Institute of Communications Engineering, National Chiao Tung University, June 2012 *

Also Published As

Publication number Publication date
TW201901661A (en) 2019-01-01
WO2018205389A1 (en) 2018-11-15
CN107204184A (en) 2017-09-26
CN107204184B (en) 2018-08-03

Similar Documents

Publication Publication Date Title
TWI636452B (en) Method and system of voice recognition
WO2020119075A1 (en) General text information extraction method and apparatus, computer device and storage medium
KR102316063B1 (en) Method and apparatus for identifying key phrase in audio data, device and medium
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
AU2017408800B2 (en) Method and system of mining information, electronic device and readable storable medium
Chen et al. Chinese named entity recognition with conditional random fields
CN104050256A (en) Initiative study-based questioning and answering method and questioning and answering system adopting initiative study-based questioning and answering method
US9753905B2 (en) Generating a document structure using historical versions of a document
CN107680588B (en) Intelligent voice navigation method, device and storage medium
CN102693279A (en) Method, device and system for fast calculating comment similarity
WO2009026850A1 (en) Domain dictionary creation
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
CN101308512B (en) Mutual translation pair extraction method and device based on web page
JP5231484B2 (en) Voice recognition apparatus, voice recognition method, program, and information processing apparatus for distributing program
CN116151220A (en) Word segmentation model training method, word segmentation processing method and device
CN110619112A (en) Pronunciation marking method and device for Chinese characters, electronic equipment and storage medium
CN113761923A (en) Named entity recognition method and device, electronic equipment and storage medium
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN114490709B (en) Text generation method and device, electronic equipment and storage medium
US9898457B1 (en) Identifying non-natural language for content analysis
CN115169370A (en) Corpus data enhancement method and device, computer equipment and medium
CN111401034B (en) Semantic analysis method, semantic analysis device and terminal for text
CN114254642A (en) Entity information processing method, device, electronic equipment and medium
CN112509581A (en) Method and device for correcting text after speech recognition, readable medium and electronic equipment
CN110704623A (en) Method, device, system and storage medium for improving entity identification rate based on Rasa _ Nlu framework