TW201935460A - Speech recognition device and speech recognition method - Google Patents
- Publication number
- TW201935460A (application TW107109348A)
- Authority
- TW
- Taiwan
- Prior art keywords
- speech
- acoustic
- accent
- speech recognition
- probability
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
Abstract
Description
The present invention relates to recognition technology, and more particularly to a speech recognition device and a speech recognition method.
With the development of speech recognition technology, more and more electronic devices provide speech recognition functions. Speech recognition typically extracts feature parameters from an input speech signal and compares them against samples in a database to find the sample with the lowest dissimilarity to the input. However, if the user's speech carries a distinctive accent, the speech signal may not be recognized effectively. For this reason, conventional speech recognition approaches build multiple sets of acoustic models, multiple sets of language models, and multiple acoustic dictionaries for different accent types, producing multiple string probabilities and multiple pieces of string data respectively.
However, because different acoustic models rest on different phoneme sets and probability bases, and different language models likewise rest on different probability bases, the string probabilities produced by such conventional approaches are not actually comparable with one another, and accuracy is low. In addition, conventional approaches require a large amount of analysis and computation, so recognition efficiency is also poor. Accordingly, how to devise a speech recognition technique that adapts to users with different accent types, effectively recognizes the speech signals of the different accent types those users provide, and delivers recognition results efficiently is one of the important topics in this field.
The present invention provides a speech recognition device and a speech recognition method that can effectively recognize speech signals with different accents, performing effective decoding and analysis of the speech features of a speech signal according to its accent type.
The speech recognition device of the invention includes a speech recognition module and a probability comparison module. The speech recognition module receives speech features and includes an acoustic model, a language model, and a plurality of acoustic dictionaries. The speech recognition module analyzes the speech features using at least one of the acoustic dictionaries together with the acoustic model and the language model to produce at least one string probability and at least one piece of string data. The acoustic dictionaries correspond to a plurality of different accent types. The probability comparison module is coupled to the speech recognition module and determines the highest probability among the at least one string probability, so as to output the piece of string data corresponding to that highest probability.
In an embodiment of the invention, the speech recognition module uses the plurality of acoustic dictionaries to produce a plurality of string probabilities and pieces of string data corresponding to different accent types. The speech features are of the same accent type as one of the acoustic dictionaries, so that the string probability produced via that acoustic dictionary is the highest probability.
In an embodiment of the invention, the speech recognition device further includes a feature extraction module coupled to the speech recognition module. The feature extraction module receives a speech signal and analyzes it to provide the speech features to the speech recognition module.
In an embodiment of the invention, the speech recognition device further includes an accent recognition module coupled to the feature extraction module and the speech recognition module. The accent recognition module analyzes the speech signal to determine its accent type and selects one of the acoustic dictionaries to analyze the speech features.
In an embodiment of the invention, the accent recognition module selectively outputs the speech features, according to the accent type of the speech signal, to the one of the acoustic dictionaries corresponding to that accent type, so that the speech features are analyzed by that acoustic dictionary together with the acoustic model and the language model, and one piece of string data is output.
The speech recognition method of the invention includes the following steps: receiving speech features and analyzing them using at least one of a plurality of acoustic dictionaries, an acoustic model, and a language model to produce at least one string probability and at least one piece of string data, where the acoustic dictionaries correspond to a plurality of different accent types; and determining the highest probability among the at least one string probability, so as to output the piece of string data corresponding to that highest probability.
In an embodiment of the invention, the acoustic dictionaries produce a plurality of string probabilities and pieces of string data corresponding to different accent types, and the speech features are of the same accent type as one of the acoustic dictionaries, so that the string probability produced via that acoustic dictionary is the highest probability.
In an embodiment of the invention, the speech recognition method further includes the following steps: receiving a speech signal, and analyzing it to obtain the speech features.
In an embodiment of the invention, the speech recognition method further includes the following steps: analyzing the speech signal to determine its accent type, and selecting one of the acoustic dictionaries to analyze the speech features.
In an embodiment of the invention, the speech recognition method further includes the following steps: selectively outputting the speech features, according to the accent type of the speech signal, to the one of the acoustic dictionaries corresponding to that accent type, so that the speech features are analyzed by that acoustic dictionary together with the acoustic model and the language model; and outputting one piece of string data.
Based on the above, the speech recognition device and speech recognition method of the invention can analyze speech features through multiple acoustic dictionaries corresponding to multiple different accent types, so that the string probabilities output by the speech recognition module are comparable. Furthermore, the device and method can use an accent recognition module to determine the accent type of the speech signal and select the acoustic dictionary of the same accent type to produce the speech recognition result.
To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
To make the content of the invention easier to understand, the following embodiments are given as examples by which the invention can indeed be implemented. In addition, wherever possible, elements/components/steps bearing the same reference numbers in the drawings and embodiments denote the same or similar parts.
FIG. 1 is a schematic diagram of a speech recognition device according to an embodiment of the invention. Referring to FIG. 1, the speech recognition device 100 includes a processing device 110, an input device 120, a storage device 130, and an output device 140. The processing device 110 is coupled to the input device 120, the storage device 130, and the output device 140. The speech recognition device 100 is, for example, a device with computing capability such as a mobile phone, smartphone, personal digital assistant (PDA), tablet computer, notebook computer, desktop computer, or in-vehicle computer.
In this embodiment, the processing device 110 is, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), programmable logic device (PLD), other similar processing circuit, or a combination of these devices.
In this embodiment, the input device 120 receives a speech signal and may be, for example, a microphone. The input device 120 receives the analog speech signal uttered by the user, converts it into a digital speech signal, and transmits the digital speech signal to the processing device 110.
In this embodiment, the storage device 130 may be, for example, an electrically erasable programmable read-only memory (EEPROM), an embedded multimedia card (eMMC), a dynamic random access memory (DRAM), flash memory, or a non-volatile random access memory (NVRAM).
In this embodiment, the output device 140 is, for example, a display device such as a cathode ray tube (CRT) display, liquid crystal display (LCD), plasma display, or touch display. The output device 140 may display the string data corresponding to the highest of the produced string probabilities. In one embodiment, the output device 140 may also be a speaker that plays the string data corresponding to the highest of the produced string probabilities. Alternatively, in another embodiment, the output device 140 may provide the string data corresponding to the highest of the produced string probabilities to a specific application, so that the application can perform a corresponding function or operation.
In this embodiment, the storage device 130 stores a plurality of modules to be read and executed by the processing device 110, so as to implement the speech recognition operations described in the embodiments of the invention. Specifically, the modules stored in the storage device 130 may include analysis and computation modules such as a feature extraction module, an accent recognition module, a speech recognition module, and a probability comparison module. In this embodiment, the speech recognition device 100 can obtain speech information through the input device 120 and analyze it through these modules of the storage device 130 to produce corresponding analysis results. In other words, the speech recognition device 100 of this embodiment can provide a speech recognition function.
FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the invention. FIG. 3 is a schematic diagram of a speech recognition device according to an embodiment of the invention. Referring to FIGS. 1 to 3, in this embodiment the processing device 110 can execute the speech recognition module 330 and the probability comparison module 340 stored in the storage device 130. The speech recognition module 330 includes one acoustic model 331, a plurality of acoustic dictionaries 332A, 332B, 332C, one language model 333, and a plurality of decoders 334A, 334B, 334C.
A possible implementation is described with the flowchart of FIG. 2. In step S210, the speech recognition module 330 receives the speech features VC and analyzes them using at least one of the acoustic dictionaries 332A, 332B, 332C together with the acoustic model 331 and the language model 333, to produce at least one string probability and at least one piece of string data. In step S220, the probability comparison module 340 determines the highest probability among the at least one string probability and outputs the string data SD having the highest probability.
Specifically, in this embodiment the acoustic dictionaries 332A, 332B, 332C correspond to a plurality of different accent types, such as a Beijing accent, a Shanghai accent, a Guangzhou accent, or a Fujian accent. The speech recognition module 330 receives the speech features VC and correspondingly produces a plurality of string probabilities and pieces of string data for the probability comparison module 340. The probability comparison module 340 compares these string probabilities to determine the highest among them and outputs the string data SD corresponding to that highest probability to the output device 140.
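The flow of steps S210/S220 can be sketched as follows. This is only a toy illustration: the model functions below are invented stand-ins for the trained acoustic model, accent dictionaries, and language model, not the patent's actual implementation.

```python
# Sketch of steps S210/S220: score the speech features through every accent
# dictionary using one shared acoustic model and one shared language model,
# then keep the string with the highest probability.

def recognize(features, acoustic_model, language_model, dictionaries):
    """Return (string_data, probability, accent) with the highest probability."""
    candidates = []
    for accent, lexicon in dictionaries.items():
        phones = acoustic_model(features)   # features -> phone sequence
        words = lexicon.get(phones)         # accent-specific phones -> words
        if words is None:
            continue                        # this accent has no matching entry
        candidates.append((language_model(words), words, accent))
    prob, words, accent = max(candidates)   # step S220: pick the highest
    return words, prob, accent

# Toy stand-ins for the trained models (illustrative assumptions only).
acoustic_model = lambda feats: " ".join(feats)
language_model = lambda words: {"北京": 0.9}.get(words, 0.1)
dictionaries = {
    "beijing":   {"bei jing": "北京"},
    "guangzhou": {"bei jin": "北京"},
}
result = recognize(["bei", "jing"], acoustic_model, language_model, dictionaries)
```

Because every dictionary is scored against the same acoustic and language models, the resulting probabilities can be compared directly, which is the point the embodiment emphasizes.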
In this embodiment, the decoders 334A, 334B, 334C produce, based on the acoustic model 331, the acoustic dictionaries 332A, 332B, 332C, and the language model 333, the most appropriate (highest-probability) string data and string probabilities. Notably, to keep the string probabilities produced by the speech recognition module 330 comparable, the speech recognition module 330 of this embodiment analyzes the speech features VC through only one acoustic model 331 and one language model 333. The acoustic model 331 is trained from a speech database, for example modeled with a hidden Markov model (HMM). The language model 333 is trained from a text corpus, for example using probabilistic-statistical methods to reveal the statistical regularities inherent in language units. Furthermore, the speech recognition module 330 of this embodiment builds the acoustic dictionaries 332A, 332B, 332C for a plurality of different accent types; these dictionaries are likewise trained from speech databases but correspond to different accent types, and they handle the vocabulary and pronunciations of those respective accents.
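The text names hidden Markov models for the acoustic model. As a self-contained illustration of HMM decoding (the two states, the observations, and every probability value below are invented toy numbers, not values from the patent), Viterbi search for the most likely state path can be sketched as:

```python
# Toy Viterbi decoder over an HMM: find the most likely state sequence for a
# sequence of observations, given start/transition/emission probabilities.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_state_path, its_probability)."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor for state s at time t.
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p) for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, last = max((V[-1][s], s) for s in states)
    return path[last], prob

# Invented two-state example: state "b" always emits symbol 0, state "ei"
# always emits symbol 1, and "b" always transitions to "ei".
states = ["b", "ei"]
start = {"b": 1.0, "ei": 0.0}
trans = {"b": {"b": 0.0, "ei": 1.0}, "ei": {"b": 0.0, "ei": 1.0}}
emit = {"b": {0: 1.0, 1: 0.0}, "ei": {0: 0.0, 1: 1.0}}
best_path, best_prob = viterbi([0, 1], states, start, trans, emit)
```

A production decoder works in the log domain and prunes the search space, but the dynamic-programming recurrence is the same.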
In other words, the processing device 110 of this embodiment can execute the single acoustic model 331 to analyze the speech features VC and obtain the corresponding phones or syllables, then obtain the corresponding characters or words through at least one of the acoustic dictionaries 332A, 332B, 332C, and finally use the single language model 333 to determine the probability of a sequence of words forming a sentence. Notably, the acoustic model 331 and the language model 333 of this embodiment are both probabilistic models; because the speech recognition module 330 builds only a single acoustic model 331 and a single language model 333, the multiple string probabilities produced from the characters or words supplied by the accent-specific acoustic dictionaries 332A, 332B, 332C are comparable.
For example, the acoustic dictionary 332A may be built for the Guangzhou area and thus record acoustic data such as "bei jin" (北京), "ci fan" (吃飯), and "re qi" (熱氣). The acoustic dictionary 332B may be built for the Beijing area and thus record acoustic data such as "bei jing" (北京), "chi fan" (吃飯), and "re qi" (熱氣). The acoustic dictionary 332C may be built for the Fujian area and thus record acoustic data such as "bei jin" (北京), "ci fan" (吃飯), and "le qi" (熱氣). That is, because the acoustic dictionaries 332A, 332B, 332C are built for different accent types, as long as the speech features VC are of the same accent type as one of these dictionaries, the string probability produced via the acoustic model 331, the language model 333, and the dictionary of the matching accent type will be the highest.
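The dictionary entries in the example above can be written as accent-keyed lookup tables. The entries come from the text; the table structure and the `lookup` helper are illustrative assumptions, since the patent does not specify a data layout.

```python
# Accent-specific acoustic dictionaries from the example: the same word can
# have different recorded pronunciations depending on the accent.
ACOUSTIC_DICTIONARIES = {
    "guangzhou": {"bei jin": "北京", "ci fan": "吃飯", "re qi": "熱氣"},
    "beijing":   {"bei jing": "北京", "chi fan": "吃飯", "re qi": "熱氣"},
    "fujian":    {"bei jin": "北京", "ci fan": "吃飯", "le qi": "熱氣"},
}

def lookup(accent, pronunciation):
    """Map a pronunciation to a word using one accent's dictionary (None if absent)."""
    return ACOUSTIC_DICTIONARIES[accent].get(pronunciation)
```

A Beijing-accent pronunciation such as "bei jing" matches only the Beijing dictionary, which is why that dictionary's path yields the highest string probability.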
Note, however, that the number and types of acoustic dictionaries of the invention are not limited to those shown in FIG. 3; the schematic of FIG. 3 merely illustrates one feasible exemplary embodiment of the speech recognition module.
FIG. 4 is a schematic diagram of another speech recognition device according to an embodiment of the invention. FIG. 5 is a flowchart of another speech recognition method according to an embodiment of the invention. Referring to FIGS. 1, 4, and 5, in this embodiment the processing device 110 can execute the feature extraction module 410, the accent recognition module 420, the speech recognition module 430, and the probability comparison module 440 stored in the storage device 130. Compared with the embodiments of FIGS. 2 and 3, the processing device 110 of this embodiment further executes the feature extraction module 410 and the accent recognition module 420. The feature extraction module 410 receives the speech signal provided by the input device 120 and analyzes it to obtain the speech features VC'. The accent recognition module 420 analyzes the speech signal to determine which accent type it belongs to and, according to the determination result, routes the speech features VC' to the corresponding acoustic dictionary for computation and processing.
In this embodiment, similarly to the embodiments of FIGS. 2 and 3, the speech recognition module 430 may include one acoustic model 431, a plurality of acoustic dictionaries 432A, 432B, 432C, one language model 433, and a plurality of decoders 434A, 434B, 434C. The acoustic dictionaries 432A, 432B, 432C correspond to a plurality of different accent types. The decoders 434A, 434B, 434C produce, based on the acoustic model 431, the acoustic dictionaries 432A, 432B, 432C, and the language model 433, the most appropriate (highest-probability) string data and string probabilities.
In this embodiment, the accent recognition module 420 identifies the accent type of the speech signal provided by the user by, for example, extracting filter-bank or mel-frequency cepstral coefficient (MFCC) features and analyzing them with a Gaussian mixture model (GMM) or a deep neural network (DNN). The speech recognition module 430 of this embodiment can therefore select one of the acoustic dictionaries 432A, 432B, 432C for analysis and computation according to the recognition result of the accent recognition module 420.
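The accent classification idea can be sketched as follows. This is a deliberately simplified stand-in: a single diagonal Gaussian per accent replaces a full GMM or DNN, and the feature vectors and model parameters are invented values rather than real MFCC statistics.

```python
import math

# Simplified accent identification: score a feature vector under one diagonal
# Gaussian per accent and pick the accent with the highest log-likelihood.
def log_gauss(x, mean, var):
    """Log-density of x under a diagonal Gaussian (per-dimension sum)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def identify_accent(feature_vec, accent_models):
    """Return the accent whose model assigns the feature vector the highest score."""
    return max(accent_models, key=lambda a: log_gauss(feature_vec, *accent_models[a]))

accent_models = {  # (mean, variance) per accent; purely illustrative numbers
    "beijing":   ([0.0, 1.0], [1.0, 1.0]),
    "guangzhou": ([3.0, -1.0], [1.0, 1.0]),
}
detected = identify_accent([0.1, 0.9], accent_models)
```

A real GMM sums several weighted Gaussian components per accent, and a DNN would replace the likelihood with a learned posterior, but the decision rule (pick the highest-scoring accent) is the same.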
In other words, in this embodiment, if the accent recognition module 420 identifies that the accent type of the speech signal corresponds to one of the acoustic dictionaries 432A, 432B, 432C, then after receiving the speech features VC', the speech recognition module 430 performs analysis and computation through only the acoustic dictionary corresponding to that accent type, according to the recognition result of the accent recognition module 420, to obtain one piece of string data SD'. The probability comparison module 440 can output this string data SD' to the output device 140 directly.
However, in one embodiment, if the accent recognition module 420 cannot match the accent type of the speech signal to any of the acoustic dictionaries 432A, 432B, 432C, this indicates that no dictionary of the same accent type may exist. In that case, the speech recognition module 430 can perform the speech recognition operation of the embodiments of FIGS. 2 and 3 above to produce multiple sets of string data and string probabilities, and the probability comparison module 440 compares these string probabilities to output to the output device 140 the string data of the closest accent type, namely the one with the highest probability.
For example, if the user inputs the speech signal "bei jing" (北京) with a Beijing accent through the input device 120, then after the accent recognition module 420 identifies the accent, the speech recognition module 430 selects the acoustic dictionary 432B to process the speech features VC'. In this example, the speech recognition module 430 analyzes the speech features VC' through the acoustic model 431, the acoustic dictionary 432B, and the language model 433, so that the decoder 434B produces one piece of string data for the probability comparison module 440. The probability comparison module 440 outputs the string data provided by the speech recognition module 430 directly, without performing any probability comparison. That is, the speech recognition module 430 of this embodiment need not perform analysis and computation through all of the acoustic dictionaries 432A, 432B, 432C, and can thus provide speech recognition results efficiently.
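The routing described in the preceding paragraphs can be sketched as a small dispatcher: when the detected accent matches a dictionary, only that dictionary is used; otherwise every dictionary is scored and the highest probability wins. The `score` function below is a hypothetical stand-in for the full acoustic-model/dictionary/language-model analysis.

```python
# Hedged sketch of the accent-based routing: direct path when the accent is
# recognized, fall back to scoring all dictionaries when it is not.
def route(features, detected_accent, dictionaries, score):
    if detected_accent in dictionaries:
        # Direct path: one dictionary, no probability comparison needed.
        text, _prob = score(features, dictionaries[detected_accent])
        return text
    # Fallback: score every dictionary and keep the highest probability.
    results = [score(features, d) for d in dictionaries.values()]
    return max(results, key=lambda r: r[1])[0]

# Toy scorer: a dictionary hit gets high probability, a miss gets low.
score = lambda feats, lexicon: (lexicon.get(feats, "?"), 0.9 if feats in lexicon else 0.1)
dicts = {"beijing": {"bei jing": "北京"}, "fujian": {"bei jin": "北京"}}
direct = route("bei jing", "beijing", dicts, score)
fallback = route("bei jing", "unknown", dicts, score)
```

Both paths return the same string here, but the direct path runs a single dictionary, which mirrors the efficiency claim in the text.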
A possible implementation is described with the flowchart of FIG. 5. In step S510, the feature extraction module 410 receives the speech signal and analyzes it to obtain the speech features VC'. In step S520, the accent recognition module 420 analyzes the speech signal to determine its accent type and selects one of the acoustic dictionaries 432A, 432B, 432C to analyze the speech features VC'. In step S530, the speech recognition module 430 selectively outputs the speech features VC', according to the accent type of the speech signal, to the acoustic dictionary corresponding to that accent type, so that the speech features are analyzed by that dictionary together with the acoustic model 431 and the language model 433. In step S540, the probability comparison module 440 outputs one piece of string data SD' provided by the analysis result of the speech recognition module 430.
In addition, for the detailed implementation of each model in the speech recognition module 430 of this embodiment, reference may be made to the embodiments of FIG. 2 and FIG. 3, which provide sufficient teaching, suggestions, and implementation descriptions; they are therefore not repeated here.
In summary, the speech recognition device and speech recognition method of the present invention are applicable to analyzing speech signals with a variety of accent types. The speech recognition device builds, within the speech recognition module, one acoustic model, one language model, and multiple acoustic dictionaries corresponding to different accent types to analyze speech features, so the multiple string probabilities and string data produced by the speech recognition module are comparable with one another. Furthermore, the speech recognition device and method can use an accent recognition module to determine the accent type of the speech signal provided by the user and analyze the speech features directly through the acoustic dictionary corresponding to that accent type, allowing the speech recognition module to provide speech recognition results more efficiently.
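A minimal sketch of the multi-dictionary mode summarized above (the FIG. 3 embodiment, in which no accent recognizer is used): each acoustic dictionary's decoder produces a candidate string with a probability, and because all candidates were scored against the same acoustic and language models they are comparable, so the probability comparison module simply keeps the best one. The candidate strings and probability values below are invented for illustration.

```python
# Probability comparison sketch: one (string, probability) candidate per
# acoustic dictionary; the comparison module keeps the most probable string.
# All values here are illustrative, not from the patent.

candidates = {
    "432A": ("candidate A", 0.41),
    "432B": ("bei jing", 0.87),
    "432C": ("candidate C", 0.33),
}

best_string, best_probability = max(candidates.values(), key=lambda c: c[1])
```

This comparison is only meaningful because the string probabilities come from a shared acoustic model and language model, which is exactly the comparability property the summary emphasizes.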
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit it. Any person with ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be defined by the appended claims.
100‧‧‧speech recognition device
110‧‧‧processing device
120‧‧‧input device
130‧‧‧storage device
140‧‧‧output device
330, 430‧‧‧speech recognition module
331, 431‧‧‧acoustic model
332A, 332B, 332C, 432A, 432B, 432C‧‧‧acoustic dictionary
333, 433‧‧‧language model
334A, 334B, 334C, 434A, 434B, 434C‧‧‧decoder
340‧‧‧probability comparison module
410‧‧‧feature extraction module
420‧‧‧accent recognition module
440‧‧‧probability comparison module
S210, S220, S510, S520, S530, S540‧‧‧step
VC, VC'‧‧‧speech feature
SD, SD'‧‧‧string data
FIG. 1 is a schematic diagram of a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a speech recognition device according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of another speech recognition device according to an embodiment of the present invention.
FIG. 5 is a flowchart of another speech recognition method according to an embodiment of the present invention.
Claims (10)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101318.7A CN108346426B (en) | 2018-02-01 | 2018-02-01 | Speech recognition device and speech recognition method |
CN201810101318.7 | 2018-02-01 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201935460A true TW201935460A (en) | 2019-09-01 |
TWI683305B TWI683305B (en) | 2020-01-21 |
Family
ID=62959815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW107109348A (TWI683305B) | Speech recognition device and speech recognition method | 2018-02-01 | 2018-03-19 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108346426B (en) |
TW (1) | TWI683305B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI723634B (en) * | 2019-10-01 | 2021-04-01 | 創鑫智慧股份有限公司 | Data processing system and data processing method thereof |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108877146A (en) * | 2018-09-03 | 2018-11-23 | 深圳市尼欧科技有限公司 | It is a kind of that safety automatic-alarming devices and methods therefor is driven based on multiplying for intelligent sound identification |
CN109493857A (en) * | 2018-09-28 | 2019-03-19 | 广州智伴人工智能科技有限公司 | A kind of auto sleep wake-up robot system |
WO2021000068A1 (en) * | 2019-06-29 | 2021-01-07 | 播闪机械人有限公司 | Speech recognition method and apparatus used by non-native speaker |
CN112102816A (en) * | 2020-08-17 | 2020-12-18 | 北京百度网讯科技有限公司 | Speech recognition method, apparatus, system, electronic device and storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102016837B (en) * | 2007-11-26 | 2014-08-20 | 沃伦·丹尼尔·蔡尔德 | System and method for classification and retrieval of Chinese-type characters and character components |
CN103035241A (en) * | 2012-12-07 | 2013-04-10 | 中国科学院自动化研究所 | Model complementary Chinese rhythm interruption recognition system and method |
CN103578467B (en) * | 2013-10-18 | 2017-01-18 | 威盛电子股份有限公司 | Acoustic model building method, voice recognition method and electronic device |
CN103578471B (en) * | 2013-10-18 | 2017-03-01 | 威盛电子股份有限公司 | Speech identifying method and its electronic installation |
CN103578465B (en) * | 2013-10-18 | 2016-08-17 | 威盛电子股份有限公司 | Speech identifying method and electronic installation |
CN103578464B (en) * | 2013-10-18 | 2017-01-11 | 威盛电子股份有限公司 | Language model establishing method, speech recognition method and electronic device |
CN110797019B (en) * | 2014-05-30 | 2023-08-29 | 苹果公司 | Multi-command single speech input method |
CN105632501B (en) * | 2015-12-30 | 2019-09-03 | 中国科学院自动化研究所 | A kind of automatic accent classification method and device based on depth learning technology |
CN107274885B (en) * | 2017-05-31 | 2020-05-26 | Oppo广东移动通信有限公司 | Speech recognition method and related product |
2018
- 2018-02-01: CN application CN201810101318.7A, granted as patent CN108346426B (Active)
- 2018-03-19: TW application TW107109348A, granted as patent TWI683305B (active)
Also Published As
Publication number | Publication date |
---|---|
TWI683305B (en) | 2020-01-21 |
CN108346426B (en) | 2020-12-08 |
CN108346426A (en) | 2018-07-31 |
Similar Documents
Publication | Title |
---|---|
TWI683305B (en) | Speech recognition device and speech recognition method |
US9711139B2 | Method for building language model, speech recognition method and electronic apparatus |
US11450313B2 | Determining phonetic relationships |
US10586533B2 | Method and device for recognizing speech based on Chinese-English mixed dictionary |
US10614802B2 | Method and device for recognizing speech based on Chinese-English mixed dictionary |
US9613621B2 | Speech recognition method and electronic apparatus |
US11881210B2 | Speech synthesis prosody using a BERT model |
US20150112685A1 | Speech recognition method and electronic apparatus using the method |
TWI391915B | Method and apparatus for builiding phonetic variation models and speech recognition |
US20150112674A1 | Method for building acoustic model, speech recognition method and electronic apparatus |
US20220262352A1 | Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation |
US20080027725A1 | Automatic Accent Detection With Limited Manually Labeled Data |
US11232780B1 | Predicting parametric vocoder parameters from prosodic features |
TW202020854A | Speech recognition system and method thereof, and computer program product |
JP7544989B2 | Lookup Table Recurrent Language Models |
CN113393830A | Hybrid acoustic model training and lyric timestamp generation method, device and medium |
WO2023113784A1 | Lattice speech corrections |
US9928832B2 | Method and apparatus for classifying lexical stress |
US20140372118A1 | Method and apparatus for exemplary chip architecture |
KR20160062254A | Method for reasoning of semantic robust on speech recognition error |
Abudubiyaz et al. | The acoustical and language modeling issues on Uyghur speech recognition |
US12008986B1 | Universal semi-word model for vocabulary contraction in automatic speech recognition |
CN114267341 | Voice recognition processing method and device based on ATM service logic |
Sunitha et al. | Dynamic construction of Telugu speech corpus for voice enabled text editor |