TW201935460A - Speech recognition device and speech recognition method - Google Patents
- Publication number
- TW201935460A (application TW107109348A)
- Authority
- TW
- Taiwan
- Prior art keywords
- speech
- acoustic
- accent
- speech recognition
- probability
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
Abstract
Description
The present invention relates to recognition technology, and more particularly to a speech recognition device and a speech recognition method.
With the development of speech recognition technology, more and more electronic devices provide speech recognition functions. Speech recognition typically extracts feature parameters from an input speech signal and compares them against samples in a database to find the sample with the lowest dissimilarity to the input. However, if the user's speech carries a distinctive accent, the speech signal may not be recognized effectively. For this reason, conventional speech recognition approaches build multiple sets of acoustic models, multiple sets of language models, and multiple acoustic dictionaries for different accent types, producing multiple string probabilities and multiple pieces of string data respectively.
However, because different acoustic models rest on different phoneme sets and probability bases, and different language models likewise rest on different probability bases, the string probabilities produced by such conventional approaches are not actually comparable with one another, and accuracy is low. In addition, conventional approaches require a large amount of analysis and computation, so recognition efficiency is also poor. Accordingly, how to devise a speech recognition technique that adapts to users with different accent types, effectively recognizes the speech signals of the different accent types those users provide, and delivers recognition results efficiently is one of the important topics in this field.
The present invention provides a speech recognition device and a speech recognition method that can effectively recognize speech signals with different accents, performing effective decoding and analysis of the speech features of a speech signal according to its accent type.
The speech recognition device of the invention includes a speech recognition module and a probability comparison module. The speech recognition module receives speech features and includes an acoustic model, a language model, and a plurality of acoustic dictionaries. The speech recognition module analyzes the speech features using at least one of the acoustic dictionaries together with the acoustic model and the language model to produce at least one string probability and at least one piece of string data. The acoustic dictionaries correspond to a plurality of different accent types. The probability comparison module is coupled to the speech recognition module and determines the highest probability among the at least one string probability, so as to output the piece of string data corresponding to that highest probability.
In an embodiment of the invention, the speech recognition module uses the plurality of acoustic dictionaries to produce a plurality of string probabilities and pieces of string data corresponding to different accent types. The speech features are of the same accent type as one of the acoustic dictionaries, so that the string probability produced via that acoustic dictionary is the highest probability.
In an embodiment of the invention, the speech recognition device further includes a feature extraction module coupled to the speech recognition module. The feature extraction module receives a speech signal and analyzes it to provide the speech features to the speech recognition module.
In an embodiment of the invention, the speech recognition device further includes an accent recognition module coupled to the feature extraction module and the speech recognition module. The accent recognition module analyzes the speech signal to determine its accent type and selects one of the acoustic dictionaries to analyze the speech features.
In an embodiment of the invention, the accent recognition module selectively outputs the speech features, according to the accent type of the speech signal, to the one of the acoustic dictionaries corresponding to that accent type, so that the speech features are analyzed by that acoustic dictionary together with the acoustic model and the language model, and one piece of string data is output.
The speech recognition method of the invention includes the following steps: receiving speech features and analyzing them using at least one of a plurality of acoustic dictionaries, an acoustic model, and a language model to produce at least one string probability and at least one piece of string data, where the acoustic dictionaries correspond to a plurality of different accent types; and determining the highest probability among the at least one string probability, so as to output the piece of string data corresponding to that highest probability.
In an embodiment of the invention, the acoustic dictionaries produce a plurality of string probabilities and pieces of string data corresponding to different accent types, and the speech features are of the same accent type as one of the acoustic dictionaries, so that the string probability produced via that acoustic dictionary is the highest probability.
In an embodiment of the invention, the speech recognition method further includes the following steps: receiving a speech signal, and analyzing it to obtain the speech features.
In an embodiment of the invention, the speech recognition method further includes the following steps: analyzing the speech signal to determine its accent type, and selecting one of the acoustic dictionaries to analyze the speech features.
In an embodiment of the invention, the speech recognition method further includes the following steps: selectively outputting the speech features, according to the accent type of the speech signal, to the one of the acoustic dictionaries corresponding to that accent type, so that the speech features are analyzed by that acoustic dictionary together with the acoustic model and the language model; and outputting one piece of string data.
Based on the above, the speech recognition device and speech recognition method of the invention can analyze speech features through multiple acoustic dictionaries corresponding to multiple different accent types, so that the string probabilities output by the speech recognition module are comparable. Furthermore, the device and method can use an accent recognition module to determine the accent type of the speech signal and select the acoustic dictionary of the same accent type to produce the speech recognition result.
To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
To make the content of the invention easier to understand, the following embodiments are given as examples by which the invention can indeed be implemented. In addition, wherever possible, elements/components/steps bearing the same reference numbers in the drawings and embodiments denote the same or similar parts.
FIG. 1 is a schematic diagram of a speech recognition device according to an embodiment of the invention. Referring to FIG. 1, the speech recognition device 100 includes a processing device 110, an input device 120, a storage device 130, and an output device 140. The processing device 110 is coupled to the input device 120, the storage device 130, and the output device 140. The speech recognition device 100 is, for example, a device with computing capability such as a mobile phone, smartphone, personal digital assistant (PDA), tablet computer, notebook computer, desktop computer, or in-vehicle computer.
In this embodiment, the processing device 110 is, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), programmable logic device (PLD), other similar processing circuit, or a combination of these devices.
In this embodiment, the input device 120 receives a speech signal and may be, for example, a microphone. The input device 120 receives the analog speech signal uttered by the user, converts it into a digital speech signal, and transmits the digital speech signal to the processing device 110.
In this embodiment, the storage device 130 may be, for example, an electrically erasable programmable read-only memory (EEPROM), an embedded multimedia card (eMMC), a dynamic random access memory (DRAM), flash memory, or a non-volatile random access memory (NVRAM).
In this embodiment, the output device 140 is, for example, a display device such as a cathode ray tube (CRT) display, liquid crystal display (LCD), plasma display, or touch display. The output device 140 may display the string data corresponding to the highest of the produced string probabilities. In one embodiment, the output device 140 may also be a speaker that plays the string data corresponding to the highest of the produced string probabilities. Alternatively, in another embodiment, the output device 140 may provide the string data corresponding to the highest of the produced string probabilities to a specific application, so that the application can perform a corresponding function or operation.
In this embodiment, the storage device 130 stores a plurality of modules to be read and executed by the processing device 110, so as to implement the speech recognition operations described in the embodiments of the invention. Specifically, the modules stored in the storage device 130 may include analysis and computation modules such as a feature extraction module, an accent recognition module, a speech recognition module, and a probability comparison module. In this embodiment, the speech recognition device 100 can obtain speech information through the input device 120 and analyze it through these modules of the storage device 130 to produce corresponding analysis results. In other words, the speech recognition device 100 of this embodiment can provide a speech recognition function.
FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the invention. FIG. 3 is a schematic diagram of a speech recognition device according to an embodiment of the invention. Referring to FIGS. 1 to 3, in this embodiment the processing device 110 can execute the speech recognition module 330 and the probability comparison module 340 stored in the storage device 130. The speech recognition module 330 includes one acoustic model 331, a plurality of acoustic dictionaries 332A, 332B, 332C, one language model 333, and a plurality of decoders 334A, 334B, 334C.
A possible implementation is described with the flowchart of FIG. 2. In step S210, the speech recognition module 330 receives the speech features VC and analyzes them using at least one of the acoustic dictionaries 332A, 332B, 332C together with the acoustic model 331 and the language model 333, to produce at least one string probability and at least one piece of string data. In step S220, the probability comparison module 340 determines the highest probability among the at least one string probability and outputs the string data SD having the highest probability.
Specifically, in this embodiment the acoustic dictionaries 332A, 332B, 332C correspond to a plurality of different accent types, such as a Beijing accent, a Shanghai accent, a Guangzhou accent, or a Fujian accent. The speech recognition module 330 receives the speech features VC and correspondingly produces a plurality of string probabilities and pieces of string data for the probability comparison module 340. The probability comparison module 340 compares these string probabilities to determine the highest among them and outputs the string data SD corresponding to that highest probability to the output device 140.
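The flow of steps S210/S220 can be sketched as follows. This is only a toy illustration: the model functions below are invented stand-ins for the trained acoustic model, accent dictionaries, and language model, not the patent's actual implementation.

```python
# Sketch of steps S210/S220: score the speech features through every accent
# dictionary using one shared acoustic model and one shared language model,
# then keep the string with the highest probability.

def recognize(features, acoustic_model, language_model, dictionaries):
    """Return (string_data, probability, accent) with the highest probability."""
    candidates = []
    for accent, lexicon in dictionaries.items():
        phones = acoustic_model(features)   # features -> phone sequence
        words = lexicon.get(phones)         # accent-specific phones -> words
        if words is None:
            continue                        # this accent has no matching entry
        candidates.append((language_model(words), words, accent))
    prob, words, accent = max(candidates)   # step S220: pick the highest
    return words, prob, accent

# Toy stand-ins for the trained models (illustrative assumptions only).
acoustic_model = lambda feats: " ".join(feats)
language_model = lambda words: {"北京": 0.9}.get(words, 0.1)
dictionaries = {
    "beijing":   {"bei jing": "北京"},
    "guangzhou": {"bei jin": "北京"},
}
result = recognize(["bei", "jing"], acoustic_model, language_model, dictionaries)
```

Because every dictionary is scored against the same acoustic and language models, the resulting probabilities can be compared directly, which is the point the embodiment emphasizes.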
In this embodiment, the decoders 334A, 334B, 334C produce, based on the acoustic model 331, the acoustic dictionaries 332A, 332B, 332C, and the language model 333, the most appropriate (highest-probability) string data and string probabilities. Notably, to keep the string probabilities produced by the speech recognition module 330 comparable, the speech recognition module 330 of this embodiment analyzes the speech features VC through only one acoustic model 331 and one language model 333. The acoustic model 331 is trained from a speech database, for example modeled with a hidden Markov model (HMM). The language model 333 is trained from a text corpus, for example using probabilistic-statistical methods to reveal the statistical regularities inherent in language units. Furthermore, the speech recognition module 330 of this embodiment builds the acoustic dictionaries 332A, 332B, 332C for a plurality of different accent types; these dictionaries are likewise trained from speech databases but correspond to different accent types, and they handle the vocabulary and pronunciations of those respective accents.
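The text names hidden Markov models for the acoustic model. As a self-contained illustration of HMM decoding (the two states, the observations, and every probability value below are invented toy numbers, not values from the patent), Viterbi search for the most likely state path can be sketched as:

```python
# Toy Viterbi decoder over an HMM: find the most likely state sequence for a
# sequence of observations, given start/transition/emission probabilities.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_state_path, its_probability)."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor for state s at time t.
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p) for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, last = max((V[-1][s], s) for s in states)
    return path[last], prob

# Invented two-state example: state "b" always emits symbol 0, state "ei"
# always emits symbol 1, and "b" always transitions to "ei".
states = ["b", "ei"]
start = {"b": 1.0, "ei": 0.0}
trans = {"b": {"b": 0.0, "ei": 1.0}, "ei": {"b": 0.0, "ei": 1.0}}
emit = {"b": {0: 1.0, 1: 0.0}, "ei": {0: 0.0, 1: 1.0}}
best_path, best_prob = viterbi([0, 1], states, start, trans, emit)
```

A production decoder works in the log domain and prunes the search space, but the dynamic-programming recurrence is the same.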
In other words, the processing device 110 of this embodiment can execute the single acoustic model 331 to analyze the speech features VC and obtain the corresponding phones or syllables, then obtain the corresponding characters or words through at least one of the acoustic dictionaries 332A, 332B, 332C, and finally use the single language model 333 to determine the probability of a sequence of words forming a sentence. Notably, the acoustic model 331 and the language model 333 of this embodiment are both probabilistic models; because the speech recognition module 330 builds only a single acoustic model 331 and a single language model 333, the multiple string probabilities produced from the characters or words supplied by the accent-specific acoustic dictionaries 332A, 332B, 332C are comparable.
For example, the acoustic dictionary 332A may be built for the Guangzhou area and thus record acoustic data such as "bei jin" (北京), "ci fan" (吃飯), and "re qi" (熱氣). The acoustic dictionary 332B may be built for the Beijing area and thus record acoustic data such as "bei jing" (北京), "chi fan" (吃飯), and "re qi" (熱氣). The acoustic dictionary 332C may be built for the Fujian area and thus record acoustic data such as "bei jin" (北京), "ci fan" (吃飯), and "le qi" (熱氣). That is, because the acoustic dictionaries 332A, 332B, 332C are built for different accent types, as long as the speech features VC are of the same accent type as one of these dictionaries, the string probability produced via the acoustic model 331, the language model 333, and the dictionary of the matching accent type will be the highest.
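The dictionary entries in the example above can be written as accent-keyed lookup tables. The entries come from the text; the table structure and the `lookup` helper are illustrative assumptions, since the patent does not specify a data layout.

```python
# Accent-specific acoustic dictionaries from the example: the same word can
# have different recorded pronunciations depending on the accent.
ACOUSTIC_DICTIONARIES = {
    "guangzhou": {"bei jin": "北京", "ci fan": "吃飯", "re qi": "熱氣"},
    "beijing":   {"bei jing": "北京", "chi fan": "吃飯", "re qi": "熱氣"},
    "fujian":    {"bei jin": "北京", "ci fan": "吃飯", "le qi": "熱氣"},
}

def lookup(accent, pronunciation):
    """Map a pronunciation to a word using one accent's dictionary (None if absent)."""
    return ACOUSTIC_DICTIONARIES[accent].get(pronunciation)
```

A Beijing-accent pronunciation such as "bei jing" matches only the Beijing dictionary, which is why that dictionary's path yields the highest string probability.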
Note, however, that the number and types of acoustic dictionaries of the invention are not limited to those shown in FIG. 3; the schematic of FIG. 3 merely illustrates one feasible exemplary embodiment of the speech recognition module.
FIG. 4 is a schematic diagram of another speech recognition device according to an embodiment of the invention. FIG. 5 is a flowchart of another speech recognition method according to an embodiment of the invention. Referring to FIGS. 1, 4, and 5, in this embodiment the processing device 110 can execute the feature extraction module 410, the accent recognition module 420, the speech recognition module 430, and the probability comparison module 440 stored in the storage device 130. Compared with the embodiments of FIGS. 2 and 3, the processing device 110 of this embodiment further executes the feature extraction module 410 and the accent recognition module 420. The feature extraction module 410 receives the speech signal provided by the input device 120 and analyzes it to obtain the speech features VC'. The accent recognition module 420 analyzes the speech signal to determine which accent type it belongs to and, according to the determination result, routes the speech features VC' to the corresponding acoustic dictionary for computation and processing.
In this embodiment, similarly to the embodiments of FIGS. 2 and 3, the speech recognition module 430 may include one acoustic model 431, a plurality of acoustic dictionaries 432A, 432B, 432C, one language model 433, and a plurality of decoders 434A, 434B, 434C. The acoustic dictionaries 432A, 432B, 432C correspond to a plurality of different accent types. The decoders 434A, 434B, 434C produce, based on the acoustic model 431, the acoustic dictionaries 432A, 432B, 432C, and the language model 433, the most appropriate (highest-probability) string data and string probabilities.
In this embodiment, the accent recognition module 420 identifies the accent type of the speech signal provided by the user by, for example, extracting filter-bank or mel-frequency cepstral coefficient (MFCC) features and analyzing them with a Gaussian mixture model (GMM) or a deep neural network (DNN). The speech recognition module 430 of this embodiment can therefore select one of the acoustic dictionaries 432A, 432B, 432C for analysis and computation according to the recognition result of the accent recognition module 420.
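The accent classification idea can be sketched as follows. This is a deliberately simplified stand-in: a single diagonal Gaussian per accent replaces a full GMM or DNN, and the feature vectors and model parameters are invented values rather than real MFCC statistics.

```python
import math

# Simplified accent identification: score a feature vector under one diagonal
# Gaussian per accent and pick the accent with the highest log-likelihood.
def log_gauss(x, mean, var):
    """Log-density of x under a diagonal Gaussian (per-dimension sum)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def identify_accent(feature_vec, accent_models):
    """Return the accent whose model assigns the feature vector the highest score."""
    return max(accent_models, key=lambda a: log_gauss(feature_vec, *accent_models[a]))

accent_models = {  # (mean, variance) per accent; purely illustrative numbers
    "beijing":   ([0.0, 1.0], [1.0, 1.0]),
    "guangzhou": ([3.0, -1.0], [1.0, 1.0]),
}
detected = identify_accent([0.1, 0.9], accent_models)
```

A real GMM sums several weighted Gaussian components per accent, and a DNN would replace the likelihood with a learned posterior, but the decision rule (pick the highest-scoring accent) is the same.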
In other words, in this embodiment, if the accent recognition module 420 identifies that the accent type of the speech signal corresponds to one of the acoustic dictionaries 432A, 432B, 432C, then after receiving the speech features VC', the speech recognition module 430 performs analysis and computation through only the acoustic dictionary corresponding to that accent type, according to the recognition result of the accent recognition module 420, to obtain one piece of string data SD'. The probability comparison module 440 can output this string data SD' to the output device 140 directly.
However, in one embodiment, if the accent recognition module 420 cannot match the accent type of the speech signal to any of the acoustic dictionaries 432A, 432B, 432C, this indicates that no dictionary of the same accent type may exist. In that case, the speech recognition module 430 can perform the speech recognition operation of the embodiments of FIGS. 2 and 3 above to produce multiple sets of string data and string probabilities, and the probability comparison module 440 compares these string probabilities to output to the output device 140 the string data of the closest accent type, namely the one with the highest probability.
For example, if the user inputs the speech signal "bei jing" (北京) with a Beijing accent through the input device 120, then after the accent recognition module 420 identifies the accent, the speech recognition module 430 selects the acoustic dictionary 432B to process the speech features VC'. In this example, the speech recognition module 430 analyzes the speech features VC' through the acoustic model 431, the acoustic dictionary 432B, and the language model 433, so that the decoder 434B produces one piece of string data for the probability comparison module 440. The probability comparison module 440 outputs the string data provided by the speech recognition module 430 directly, without performing any probability comparison. That is, the speech recognition module 430 of this embodiment need not perform analysis and computation through all of the acoustic dictionaries 432A, 432B, 432C, and can thus provide speech recognition results efficiently.
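The routing described in the preceding paragraphs can be sketched as a small dispatcher: when the detected accent matches a dictionary, only that dictionary is used; otherwise every dictionary is scored and the highest probability wins. The `score` function below is a hypothetical stand-in for the full acoustic-model/dictionary/language-model analysis.

```python
# Hedged sketch of the accent-based routing: direct path when the accent is
# recognized, fall back to scoring all dictionaries when it is not.
def route(features, detected_accent, dictionaries, score):
    if detected_accent in dictionaries:
        # Direct path: one dictionary, no probability comparison needed.
        text, _prob = score(features, dictionaries[detected_accent])
        return text
    # Fallback: score every dictionary and keep the highest probability.
    results = [score(features, d) for d in dictionaries.values()]
    return max(results, key=lambda r: r[1])[0]

# Toy scorer: a dictionary hit gets high probability, a miss gets low.
score = lambda feats, lexicon: (lexicon.get(feats, "?"), 0.9 if feats in lexicon else 0.1)
dicts = {"beijing": {"bei jing": "北京"}, "fujian": {"bei jin": "北京"}}
direct = route("bei jing", "beijing", dicts, score)
fallback = route("bei jing", "unknown", dicts, score)
```

Both paths return the same string here, but the direct path runs a single dictionary, which mirrors the efficiency claim in the text.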
A possible implementation is described with the flowchart of FIG. 5. In step S510, the feature extraction module 410 receives the speech signal and analyzes it to obtain the speech features VC'. In step S520, the accent recognition module 420 analyzes the speech signal to determine its accent type and selects one of the acoustic dictionaries 432A, 432B, 432C to analyze the speech features VC'. In step S530, the speech recognition module 430 selectively outputs the speech features VC', according to the accent type of the speech signal, to the acoustic dictionary corresponding to that accent type, so that the speech features are analyzed by that dictionary together with the acoustic model 431 and the language model 433. In step S540, the probability comparison module 440 outputs one piece of string data SD' provided by the analysis result of the speech recognition module 430.
In addition, for the detailed implementation of each model in the speech recognition module 430 of this embodiment, reference may be made to the embodiments of FIG. 2 and FIG. 3, which provide sufficient teaching, suggestions, and implementation descriptions; they are therefore not repeated here.
In summary, the speech recognition device and speech recognition method of the present invention are applicable to analyzing speech signals with a variety of accent types. The speech recognition device builds, within the speech recognition module, one acoustic model, one language model, and multiple acoustic dictionaries corresponding to different accent types to analyze speech features, so the multiple string probabilities and string data produced by the speech recognition module are comparable with one another. Furthermore, the speech recognition device and method can use an accent recognition module to determine the accent type of the speech signal provided by the user and analyze the speech features directly through the acoustic dictionary corresponding to that accent type, allowing the speech recognition module to provide speech recognition results more efficiently.
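A minimal sketch of the multi-dictionary mode summarized above (the FIG. 3 embodiment, in which no accent recognizer is used): each acoustic dictionary's decoder produces a candidate string with a probability, and because all candidates were scored against the same acoustic and language models they are comparable, so the probability comparison module simply keeps the best one. The candidate strings and probability values below are invented for illustration.

```python
# Probability comparison sketch: one (string, probability) candidate per
# acoustic dictionary; the comparison module keeps the most probable string.
# All values here are illustrative, not from the patent.

candidates = {
    "432A": ("candidate A", 0.41),
    "432B": ("bei jing", 0.87),
    "432C": ("candidate C", 0.33),
}

best_string, best_probability = max(candidates.values(), key=lambda c: c[1])
```

This comparison is only meaningful because the string probabilities come from a shared acoustic model and language model, which is exactly the comparability property the summary emphasizes.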
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit it. Any person with ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be defined by the appended claims.
100‧‧‧speech recognition device
110‧‧‧processing device
120‧‧‧input device
130‧‧‧storage device
140‧‧‧output device
330, 430‧‧‧speech recognition module
331, 431‧‧‧acoustic model
332A, 332B, 332C, 432A, 432B, 432C‧‧‧acoustic dictionary
333, 433‧‧‧language model
334A, 334B, 334C, 434A, 434B, 434C‧‧‧decoder
340‧‧‧probability comparison module
410‧‧‧feature extraction module
420‧‧‧accent recognition module
440‧‧‧probability comparison module
S210, S220, S510, S520, S530, S540‧‧‧step
VC, VC'‧‧‧speech feature
SD, SD'‧‧‧string data
FIG. 1 is a schematic diagram of a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a speech recognition device according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of another speech recognition device according to an embodiment of the present invention.
FIG. 5 is a flowchart of another speech recognition method according to an embodiment of the present invention.
Claims (10)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101318.7A CN108346426B (en) | 2018-02-01 | 2018-02-01 | Speech recognition device and speech recognition method |
CN201810101318.7 | 2018-02-01 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201935460A true TW201935460A (en) | 2019-09-01 |
TWI683305B TWI683305B (en) | 2020-01-21 |
Family
ID=62959815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW107109348A (TWI683305B) | Speech recognition device and speech recognition method | 2018-02-01 | 2018-03-19 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108346426B (en) |
TW (1) | TWI683305B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI723634B (en) * | 2019-10-01 | 2021-04-01 | 創鑫智慧股份有限公司 | Data processing system and data processing method thereof |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108877146A (en) * | 2018-09-03 | 2018-11-23 | 深圳市尼欧科技有限公司 | It is a kind of that safety automatic-alarming devices and methods therefor is driven based on multiplying for intelligent sound identification |
CN109493857A (en) * | 2018-09-28 | 2019-03-19 | 广州智伴人工智能科技有限公司 | A kind of auto sleep wake-up robot system |
WO2021000068A1 (en) * | 2019-06-29 | 2021-01-07 | 播闪机械人有限公司 | Speech recognition method and apparatus used by non-native speaker |
CN112102816A (en) * | 2020-08-17 | 2020-12-18 | 北京百度网讯科技有限公司 | Speech recognition method, apparatus, system, electronic device and storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102016837B (en) * | 2007-11-26 | 2014-08-20 | 沃伦·丹尼尔·蔡尔德 | System and method for classification and retrieval of Chinese-type characters and character components |
CN103035241A (en) * | 2012-12-07 | 2013-04-10 | 中国科学院自动化研究所 | Model complementary Chinese rhythm interruption recognition system and method |
CN103578467B (en) * | 2013-10-18 | 2017-01-18 | 威盛电子股份有限公司 | Acoustic model building method, voice recognition method and electronic device |
CN103578471B (en) * | 2013-10-18 | 2017-03-01 | 威盛电子股份有限公司 | Speech identifying method and its electronic installation |
CN103578465B (en) * | 2013-10-18 | 2016-08-17 | 威盛电子股份有限公司 | Speech identifying method and electronic installation |
CN103578464B (en) * | 2013-10-18 | 2017-01-11 | 威盛电子股份有限公司 | Language model establishing method, speech recognition method and electronic device |
CN110797019B (en) * | 2014-05-30 | 2023-08-29 | 苹果公司 | Multi-command single speech input method |
CN105632501B (en) * | 2015-12-30 | 2019-09-03 | 中国科学院自动化研究所 | A kind of automatic accent classification method and device based on depth learning technology |
CN107274885B (en) * | 2017-05-31 | 2020-05-26 | Oppo广东移动通信有限公司 | Speech recognition method and related product |
2018
- 2018-02-01: CN application CN201810101318.7A, granted as patent CN108346426B (Active)
- 2018-03-19: TW application TW107109348A, granted as patent TWI683305B (active)
Also Published As
Publication number | Publication date |
---|---|
TWI683305B (en) | 2020-01-21 |
CN108346426B (en) | 2020-12-08 |
CN108346426A (en) | 2018-07-31 |
Similar Documents
Publication | Title |
---|---|
TWI683305B (en) | Speech recognition device and speech recognition method |
US9711139B2 | Method for building language model, speech recognition method and electronic apparatus |
US11450313B2 | Determining phonetic relationships |
US10586533B2 | Method and device for recognizing speech based on Chinese-English mixed dictionary |
US10614802B2 | Method and device for recognizing speech based on Chinese-English mixed dictionary |
US9613621B2 | Speech recognition method and electronic apparatus |
US11881210B2 | Speech synthesis prosody using a BERT model |
US20150112685A1 | Speech recognition method and electronic apparatus using the method |
TWI391915B | Method and apparatus for builiding phonetic variation models and speech recognition |
US20150112674A1 | Method for building acoustic model, speech recognition method and electronic apparatus |
US20220262352A1 | Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation |
US20080027725A1 | Automatic Accent Detection With Limited Manually Labeled Data |
US11232780B1 | Predicting parametric vocoder parameters from prosodic features |
TW202020854A | Speech recognition system and method thereof, and computer program product |
JP7544989B2 | Lookup Table Recurrent Language Models |
CN113393830A | Hybrid acoustic model training and lyric timestamp generation method, device and medium |
WO2023113784A1 | Lattice speech corrections |
US9928832B2 | Method and apparatus for classifying lexical stress |
US20140372118A1 | Method and apparatus for exemplary chip architecture |
KR20160062254A | Method for reasoning of semantic robust on speech recognition error |
Abudubiyaz et al. | The acoustical and language modeling issues on Uyghur speech recognition |
US12008986B1 | Universal semi-word model for vocabulary contraction in automatic speech recognition |
CN114267341 | Voice recognition processing method and device based on ATM service logic |
Sunitha et al. | Dynamic construction of Telugu speech corpus for voice enabled text editor |