TWI752437B - At least two phoneme-based voice input operation method and computer program product - Google Patents
At least two phoneme-based voice input operation method and computer program product
- Publication number
- TWI752437B (Application number: TW109108337A)
- Authority
- TW
- Taiwan
- Prior art keywords
- phoneme
- target
- phonemes
- voice
- computer system
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
A voice input operation method based on at least two phonemes includes: encoding a plurality of reference phonemes to define a plurality of phoneme labels respectively associated with a plurality of operation options, each label containing at least a first reference phoneme and a second reference phoneme selected from the reference phonemes; after confirming, based on stored speech recognition data associated with the reference phonemes or on the user's personal phoneme data, and using voice or voiceprint recognition technology, that a first phoneme and a second phoneme contained in a voice signal collected from the user are respectively similar to a first target reference phoneme and a second target reference phoneme, determining from the phoneme labels a target phoneme label whose first and second reference phonemes are respectively identical to the first and second target reference phonemes; and activating, among the operation options, a target operation option associated with the target phoneme label.
Description
The present invention relates to voice input, and in particular to a voice input operation method and computer program product based on at least two phonemes.
Voice input has been widely adopted to replace tedious manual input. In practice, a computer device typically passes collected user speech through a speech recognition engine that uses a language model and an acoustic model; once the words, operation commands, or applications related to the speech have been successfully recognized, the device can display the relevant words or execute the relevant commands or applications.
For patients with dysarthria, however, specific speech sounds often cannot be produced, and the speech that is produced is frequently slurred, hoarse, monotonous, intermittent, excessively loud, or otherwise abnormal. Because existing speech recognition technology cannot reliably recognize such speech, patients with dysarthria are unable to use existing voice input methods, such as the voice input function of a tablet computer serving as a voice communication device, to communicate with the outside world.
Therefore, to enable patients with dysarthria to operate electronic devices or communicate with the outside world through voice input, developing a voice input operation technique suitable for them has become an important issue.
Accordingly, an object of the present invention is to provide a voice input operation method and computer program product based on at least two phonemes that can overcome at least one drawback of the prior art.
Accordingly, the present invention provides a voice input operation method based on at least two phonemes, executed by a computer system with voice and voiceprint recognition technology, comprising the following steps: (A) storing speech recognition data associated with a plurality of mutually distinct reference phonemes, and personal phoneme data corresponding to a user, the personal phoneme data containing the voice content of a plurality of voices uttered by the user and respectively corresponding to the reference phonemes; (B) encoding the reference phonemes to define a plurality of mutually distinct phoneme labels, each of which contains at least a first reference phoneme selected from the reference phonemes and a second reference phoneme selected from the reference phonemes; (C) associating the phoneme labels respectively with a plurality of mutually distinct operation options; (D) after collecting a voice signal from the user containing at least a consecutive first phoneme and second phoneme, confirming, based on the speech recognition data using voice recognition technology, or based on the personal phoneme data using voiceprint recognition technology, whether the first phoneme is similar to one of the reference phonemes and whether the second phoneme is similar to one of the reference phonemes; (E) after a first target reference phoneme similar to the first phoneme and a second target reference phoneme similar to the second phoneme have been confirmed, determining a target phoneme label from the phoneme labels according to the first and second target reference phonemes, the first reference phoneme and the second reference phoneme contained in the target phoneme label being respectively identical to the first target reference phoneme and the second target reference phoneme; and (F) activating, among the operation options, a target operation option associated with the target phoneme label.
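The flow of steps (A) through (F) can be illustrated with a minimal Python sketch. This is an illustration only, not the patented implementation: the recognizer is stubbed out, and the phoneme set, label format, and option names are assumptions.

```python
from itertools import product

# (A)-(B): reference phonemes and two-phoneme labels (hypothetical set)
REFERENCE_PHONEMES = ["a", "i", "u", "e"]
PHONEME_LABELS = [" ".join(p) for p in product(REFERENCE_PHONEMES, repeat=2)]

# (C): associate each label with an operation option (hypothetical options)
OPTIONS = {label: f"option_{n}" for n, label in enumerate(PHONEME_LABELS)}

def recognize(voice_signal):
    """(D) Stub: return the two target reference phonemes confirmed as
    similar to the first and second phonemes in the voice signal."""
    return voice_signal  # stand-in: the signal is already a phoneme pair

def handle(voice_signal):
    first, second = recognize(voice_signal)  # (D) confirm the two phonemes
    target_label = f"{first} {second}"       # (E) determine the target label
    return OPTIONS[target_label]             # (F) activate the linked option

print(handle(("a", "i")))
```

In a real system, `recognize` would be the voice/voiceprint matching stage; everything after it is plain table lookup, which is the point of the label-based design.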
In the voice input operation method based on at least two phonemes of the present invention, each reference phoneme is a vowel or a syllable.
In the voice input operation method based on at least two phonemes of the present invention, each operation option is one of a symbol, a character, a text content, an operation command, an application program, and a file. When the target operation option is a symbol, the computer system activates it by displaying the symbol. When the target operation option is a character, the computer system activates it by displaying the character. When the target operation option is a text content, the computer system activates it at least by displaying the text content. When the target operation option is an operation command, the computer system activates it by executing the command. When the target operation option is an application program, the computer system activates it by executing the program. When the target operation option is a file, the computer system activates it by opening or playing the file.
In the voice input operation method based on at least two phonemes of the present invention, between steps (C) and (D) the method further includes: (G) displaying a plurality of images respectively representing the operation options, and displaying, near those images, the phoneme labels associated with the operation options.
In the voice input operation method based on at least two phonemes of the present invention, when the target operation option is a text content, the computer system activates it not only by displaying the text content but also by playing a voice content corresponding to the text content.
In the voice input operation method based on at least two phonemes of the present invention, the computer system includes a user terminal for executing steps (B), (C), (E), (F), and (G), and an identification server that communicates with the user terminal and executes steps (A) and (D). The method further includes: before step (A), (H) transmitting, by the user terminal, the collected personal phoneme data to the identification server; between steps (C) and (D), (I) collecting, by the user terminal, the voice signal and sending an identification request containing the voice signal and related to the user to the identification server, so that the identification server executes step (D) in response to the request; and between steps (D) and (E), (J) transmitting, by the identification server, when the first target reference phoneme and the second target reference phoneme have been confirmed, an identification reply containing the first and second target reference phonemes to the user terminal, so that the user terminal executes step (E) in response to the reply.
Accordingly, the present invention also provides a computer program product stored on a computer-readable medium and containing a plurality of program instructions; when a computer device executes the program instructions, the voice input operation method based on at least two phonemes described above is carried out.
The effect of the present invention is as follows. Because speech recognition data for voice recognition and personal phoneme data corresponding to the user for voiceprint recognition are stored in advance, the limited speech produced not only by users with normal articulation but also by users with dysarthria can be accurately recognized. Moreover, an encoding scheme based on at least two phonemes can define a relatively large number of phoneme labels, which can be widely applied to establish associations with a relatively large number of operation options. Compared with existing speech recognition techniques that rely on relatively complex language and acoustic models, a user's voice input containing at least two phonemes can thus be recognized relatively easily and quickly to determine the target operation option to be activated.
Before the present invention is described in detail, it should be noted that in the following description, similar elements are designated by the same reference numerals.
Referring to FIG. 1, the illustrated computer system is implemented as a computer device 100 such as a smartphone or tablet computer, which is used to implement the voice input operation method based on at least two phonemes (double-phoneme) according to the first embodiment of the present invention, and includes a voice collection module 1 (e.g., a microphone module) for collecting external speech, a touch display module 2 serving as a display and user interface, a storage module 3, a speaker module 4, and a processing unit 5 electrically connected to the voice collection module 1, the touch display module 2, the storage module 3, and the speaker module 4. In this embodiment, the processing unit 5 supports voice and voiceprint recognition technology.
Hereinafter, how the computer device 100 executes the voice input operation method of the first embodiment will be exemplarily described with reference to FIG. 1 and FIG. 2. Generally, the voice input operation method may include the following steps S20-S28.
First, in step S20, the computer device 100 stores, in the storage module 3, the speech recognition data associated with a plurality of mutually distinct reference phonemes. In this embodiment, each reference phoneme may be a vowel or a syllable. More specifically, the computer device 100 obtains the speech recognition data for recognizing the reference phonemes by, for example, acoustic-model training on voice content collected from a plurality of (normally articulating) users uttering the reference phonemes. For example, the reference phonemes may comprise four vowels and four syllables, as shown in Table 1 below.
Table 1
It should be noted that, in order to record distinguishable voices and thereby build personal voice data that matches the user's individual voiceprint characteristics, the selection of the reference phonemes may depend on the user's articulation ability. In other words, the user need not produce each reference phoneme (i.e., vowel or syllable) in a standard way; it suffices that the voices the user produces for the reference phonemes are distinguishable from one another.
Then, in step S22, the processing unit 5 encodes the reference phonemes to define a plurality of mutually distinct phoneme labels. In this embodiment, each phoneme label contains exactly one first reference phoneme selected from the reference phonemes and one second reference phoneme selected from the reference phonemes. Following the example of Table 1 above, the phoneme labels may be defined as in Table 2 below:
Table 2
It is worth noting that, in other implementations, if the number of distinguishable reference phonemes is relatively small because of the user's limited articulation ability, each phoneme label defined by the processing unit 5 may instead contain three or more reference phonemes so that a sufficient number of phoneme labels can still be defined.
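The combinatorial payoff of this encoding is easy to verify: with n distinguishable reference phonemes, ordered two-phoneme labels give n² combinations and k-phoneme labels give n^k. A quick check (the eight-phoneme inventory below mirrors the four-vowels-plus-four-syllables example; the particular syllables are assumptions, since Table 1 is not reproduced here):

```python
from itertools import product

def make_labels(reference_phonemes, k=2):
    """Enumerate all ordered k-phoneme labels over the reference set."""
    return [" ".join(combo) for combo in product(reference_phonemes, repeat=k)]

# Hypothetical inventory: four vowels and four syllables.
phonemes = ["a", "i", "u", "e", "ha", "hi", "hu", "he"]

two = make_labels(phonemes, k=2)
three = make_labels(phonemes, k=3)
print(len(two), len(three))  # 64 512
```

So even a user who can produce only eight distinguishable sounds can address 64 operation options with two-phoneme labels, and 512 with the three-phoneme variant mentioned above.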
Next, in step S23, the processing unit 5 associates the defined phoneme labels respectively with a plurality of mutually distinct operation options, and may store the association between the phoneme labels and the operation options in the storage module 3. In this embodiment, each operation option may be a symbol, a character, a text content, an operation command, an application program, or a file; the symbol, character, and operation command may be any symbol, character, or operation command on a virtual keyboard displayed by the touch display module 2, and the application program and file may be any application program or any text, graphic, audio, or multimedia file contained in any window displayed by the touch display module 2. It should be noted that, in actual use, the phoneme labels can be applied to establish associations with the operation options in different display windows provided by the touch display module 2.
Then, during use of the computer device 100, in step S24, the processing unit 5 causes the display window currently provided by the touch display module 2 to display a plurality of images respectively representing the phoneme labels and the operation options. More specifically, the image content displayed depends on the display window currently provided by the touch display module 2. Different usage scenarios are exemplified below.
Referring to FIG. 3, the illustrated example is a virtual keyboard displayed in a display window provided by the touch display module 2. The virtual keyboard displays images of a plurality of (edit-candidate) characters (e.g., 「我」, 「好」, etc.) with phoneme labels (e.g., 「a a」, 「a i」, etc.) displayed near and respectively associated with those characters; images of a plurality of (edit-candidate) Zhuyin symbols (e.g., 「ㄅ」, 「ㄆ」, etc.) with phoneme labels (e.g., 「i a」, 「e o」, etc.) displayed near and respectively associated with those symbols; an image of an (edit-candidate) mathematical symbol (e.g., 「>」) with its associated phoneme label (e.g., 「a hu」) displayed nearby; and images of a plurality of (edit-candidate) operation commands (e.g., 「123」 for switching to the numeric keypad, 「空格」 for inputting a space, and 「傳送」 for sending) with phoneme labels (e.g., 「ha e」, 「ha he」, 「ha ha」) displayed near and respectively associated with those commands.
Referring to FIG. 4, the illustrated example is the "desktop" display window provided by the touch display module 2, which contains images respectively representing a plurality of different applications (e.g., 「YouTube」, 「EVA Facial Mouse」, 「米家」, etc.) with phoneme labels (e.g., 「a a」, 「a u」, 「a e」, etc.) displayed near and respectively associated with those applications, as well as images respectively representing a plurality of operation commands (e.g., 「我的檔案」 (My Files), 「電話」 (Phone), 「聯絡人」 (Contacts), etc.) with phoneme labels (e.g., 「o ha」, 「ha a」, 「ha u」, etc.) displayed near and respectively associated with those commands.
Referring to FIG. 5, the illustrated example is the display window provided by the touch display module 2 after the computer device 100 executes the 「YouTube」 application. It contains images (e.g., image 1 to image 10) respectively representing a plurality of different multimedia files (e.g., video 1 to video 10) with phoneme labels (e.g., 「u e」, 「e u」, 「e e」, etc.) displayed near and respectively associated with those files, as well as images respectively representing a plurality of operation commands (e.g., 「首頁」 (Home), 「發燒影片」 (Trending), 「訂閱內容」 (Subscriptions), etc.) with phoneme labels (e.g., 「ha a」, 「ha u」, 「ha e」, etc.) displayed near and respectively associated with those commands.
Referring to FIG. 6, the illustrated example is a display window provided by the touch display module 2 after the computer device 100 executes a social communication application. It contains images respectively representing a plurality of text contents (e.g., 「我需要幫忙」 (I need help), 「我要小便」 (I need to urinate), etc.) with phoneme labels (e.g., 「a a」, 「a i」, etc.) displayed near and respectively associated with those text contents, as well as images respectively representing a plurality of operation commands (e.g., 「清除」 (Clear), 「傳送和發音」 (Send and pronounce), 「儲存」 (Save), etc.) with phoneme labels (e.g., 「u ha」, 「u hi」, 「u he」, etc.) displayed near and respectively associated with those commands.
In this embodiment, the user can then, according to the operation options contained in the current display window of the touch display module 2 and the phoneme labels associated with them, utter the first phoneme and the second phoneme of the phoneme label associated with the desired operation option. In other implementations, however, if each phoneme label contains three or more reference phonemes, the user must utter the plurality of phonemes of the phoneme label associated with the desired operation option, their number being equal to the number of reference phonemes contained in each phoneme label.
Thus, when the processing unit 5 receives a voice signal, collected by the voice collection module 1 from the user, that contains a consecutive first phoneme and second phoneme, then in step S25 the processing unit 5 can confirm whether the first phoneme is similar to one of the reference phonemes and whether the second phoneme is similar to one of the reference phonemes, either based on the speech recognition data stored in the storage module 3 using voice recognition technology, or based on the personal phoneme data stored in the storage module 3 using voiceprint recognition technology. If the processing unit 5 confirms a first target reference phoneme similar to the first phoneme and a second target reference phoneme similar to the second phoneme, the flow proceeds to step S26; otherwise, the flow proceeds to step S28. It should be noted that, in actual execution of step S25, the processing unit 5 may, for example, first attempt to confirm the first and second phonemes based on the speech recognition data using voice recognition technology and, if that confirmation fails, then attempt it based on the personal phoneme data using voiceprint recognition technology, although the invention is not limited in this respect.
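Step S25's two-stage confirmation, trying the generic speech recognition data first and falling back to the user's personal phoneme data, can be sketched with cosine similarity over feature vectors. This is a toy illustration under assumptions: real systems would use trained acoustic models over features such as MFCCs, and the 2-D "feature vectors" below are invented for demonstration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_phoneme(features, generic_refs, personal_refs, threshold=0.9):
    """Return the reference phoneme most similar to `features`, trying the
    generic speech-recognition data first and falling back to the user's
    personal phoneme data (voiceprint) when no match clears the threshold."""
    for refs in (generic_refs, personal_refs):
        best = max(refs, key=lambda p: cosine(features, refs[p]))
        if cosine(features, refs[best]) >= threshold:
            return best
    return None  # recognition failed: corresponds to step S28

# Toy 2-D "feature vectors" (assumptions, not real acoustic features)
generic = {"a": (1.0, 0.0), "i": (0.0, 1.0)}
personal = {"a": (0.9, 0.3), "i": (0.2, 0.95)}

print(match_phoneme((0.95, 0.05), generic, personal))  # a
print(match_phoneme((0.5, 0.5), generic, personal))    # None
```

The `None` branch is where the method's recognition-failure message (step S28) would be triggered.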
In step S26, the processing unit 5 determines a target phoneme label from the phoneme labels (i.e., the phoneme labels contained in the display window currently shown in step S24) according to the first target reference phoneme, the second target reference phoneme, and the association stored in the storage module 3. The first reference phoneme and the second reference phoneme contained in the target phoneme label are respectively identical to the first target reference phoneme and the second target reference phoneme.
Next, in step S27, the processing unit 5 activates, according to the association stored in the storage module 3, the target operation option (i.e., the desired operation option) among the operation options contained in the current display window of step S24 that is associated with the target phoneme label.
When the processing unit 5 determines that either the first phoneme or the second phoneme is not similar to any of the reference phonemes, then in step S28 the processing unit 5 causes the touch display module 2 to display a recognition failure message. The user can then utter again the first phoneme and second phoneme of the phoneme label associated with the desired operation option, and the computer device 100 re-executes step S25 until reference phonemes similar to the first and second phonemes can be confirmed.
Hereinafter, the manner in which the processing unit 5 activates the target operation option is further exemplified according to the actual form of that option.
If the target operation option is a symbol (e.g., the mathematical symbol 「>」 on the virtual keyboard shown in FIG. 3), the processing unit 5 activates it by causing the touch display module 2 to display the symbol in an editing area (not shown). Similarly, if the target operation option is a character (e.g., the character 「我」 on the virtual keyboard shown in FIG. 3), the processing unit 5 activates it by causing the touch display module 2 to display the character. If the target operation option is a text content (e.g., 「我需要幫忙」 in the display window shown in FIG. 6), the processing unit 5 activates it not only by causing the touch display module 2 to display the text content in a communication record area (as shown in FIG. 6) but also by causing the speaker module 4 to play a voice content corresponding to the text content. If the target operation option is an operation command (e.g., the command 「電話」 in the display window shown in FIG. 4), the processing unit 5 activates it by executing the command (e.g., causing the touch display module 2 to switch from the original desktop display window to the display window related to 「電話」). If the target operation option is an application (e.g., the 「YouTube」 application in the display window shown in FIG. 4), the processing unit 5 activates it by executing the application and causing the touch display module 2 to switch from the original display window (as shown in FIG. 4) to the display window related to the application (as shown in FIG. 5). If the target operation option is a file, the processing unit 5 activates it by opening or playing the file.
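The per-type activation behavior described above maps naturally onto a dispatch table. The sketch below is an illustration with assumed names; the display, speak, execute, launch, and open calls are placeholders standing in for the touch display module, speaker module, and operating system actions.

```python
def activate(option_type, payload, ui):
    """Activate a target operation option according to its type, mirroring
    the per-type behavior described for the processing unit 5."""
    dispatch = {
        "symbol":  lambda: ui.display(payload),
        "char":    lambda: ui.display(payload),
        "text":    lambda: (ui.display(payload), ui.speak(payload)),
        "command": lambda: ui.execute(payload),
        "app":     lambda: ui.launch(payload),
        "file":    lambda: ui.open_or_play(payload),
    }
    dispatch[option_type]()

class RecordingUI:
    """Records the actions taken, for demonstration."""
    def __init__(self): self.log = []
    def display(self, x): self.log.append(("display", x))
    def speak(self, x): self.log.append(("speak", x))
    def execute(self, x): self.log.append(("execute", x))
    def launch(self, x): self.log.append(("launch", x))
    def open_or_play(self, x): self.log.append(("open", x))

ui = RecordingUI()
activate("text", "我需要幫忙", ui)  # a text content is displayed AND spoken
print(ui.log)
```

Note how the "text" case performs two actions, matching the requirement that a text content be both displayed in the communication record area and played through the speaker module.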
It is worth noting that the voice input operation method of the first embodiment described above can be programmed as a computer program product containing a plurality of program instructions, with the computer program product stored on a computer-readable medium (e.g., the storage module 3). When the computer device 100 executes the program instructions, it carries out the voice input operation method based on at least two phonemes described above.
Referring to FIG. 7, another computer system is shown that includes not only the computer device 100 serving as a user terminal but also an identification server 200. The computer device 100 and the identification server 200 cooperate to implement the voice input operation method based on at least two phonemes (double-phoneme) according to the second embodiment of the present invention. In this embodiment, the identification server 200 can communicate with the computer device 100 via a communication network 300 and supports voice and voiceprint recognition technology.
Hereinafter, how the computer system executes the voice input operation method of the second embodiment will be exemplarily described with reference to FIG. 7 and FIG. 8. Generally, the voice input operation method of this embodiment is a variant of the method of the first embodiment described above and may include the following steps S80-S91.
First, in step S80, the identification server 200 stores the speech recognition data in advance.
In step S81, during the registration stage, the computer device 100 transmits, via the communication network 300, the voice content collected by the voice collection module 1 from a user uttering a plurality of voices corresponding to the plurality of reference phonemes to the identification server 200.
Then, in step S82, the identification server 200 stores the voice content received from the computer device 100 as the personal phoneme data corresponding to the user. It is worth mentioning that, in actual use, the identification server 200 can also serve as a cloud server for personal phoneme data, broadly collecting and storing the personal phoneme data of a large number of other users (e.g., users with dysarthria); by further analyzing this large body of data with artificial intelligence or machine learning, a speech database (not shown) for recognizing special speech (e.g., speech produced by users with dysarthria) can be obtained.
Next, the computer device 100 sequentially executes steps S83 to S85. Since the operation details of the computer device 100 in steps S83 to S85 are respectively identical to those of steps S22 to S24 (FIG. 2) described above, they are not repeated here.
Afterwards, when the processing unit 5 of the computer device 100 receives a voice signal, collected by the voice collection module 1 from the user, that contains a consecutive first phoneme and second phoneme, the computer device 100 transmits, via the communication network 300, an identification request containing the voice signal and related to the user to the identification server 200 (step S86).
Then, upon receiving the identification request from the computer device 100, the identification server 200 can confirm whether the first phoneme is similar to one of the reference phonemes and whether the second phoneme is similar to one of the reference phonemes, either based on the stored speech recognition data using voice recognition technology, or based on the stored personal phoneme data corresponding to the user using voiceprint recognition technology (or, alternatively, based on the above-mentioned speech database for special speech recognition using voice recognition technology) (step S87). If the identification server 200 confirms a first target reference phoneme similar to the first phoneme and a second target reference phoneme similar to the second phoneme (i.e., recognition succeeds), the flow proceeds to step S88; otherwise, the flow proceeds to step S91.
In step S88, the identification server 200 transmits, via the communication network 300, an identification reply containing the first target reference phoneme and the second target reference phoneme to the computer device 100.
Then, in step S89, the processing unit 5 of the computer device 100 determines a target phoneme label from the phoneme labels according to the first target reference phoneme and the second target reference phoneme contained in the identification reply and the association stored in the storage module 3. The first reference phoneme and the second reference phoneme contained in the target phoneme label are respectively identical to the first target reference phoneme and the second target reference phoneme.
Next, similarly to step S27 (FIG. 2) described above, the processing unit 5 of the computer device 100 activates, according to the association stored in the storage module 3, the target operation option (i.e., the desired operation option) among the operation options that is associated with the target phoneme label (step S90).
When the identification server 200 determines that either the first phoneme or the second phoneme is not similar to any of the reference phonemes (i.e., recognition fails), then in step S91 the identification server 200 transmits, via the communication network 300, a recognition failure message to the computer device 100. The processing unit 5 of the computer device 100 can then display the recognition failure message on the touch display module 2 for the user. The user can then utter again the first phoneme and second phoneme of the phoneme label associated with the desired operation option, and the computer system re-executes steps S86 and S87 until reference phonemes similar to the first and second phonemes can be confirmed.
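The second embodiment's request/reply exchange between the user terminal and the identification server (steps S86-S91) can be sketched as plain message passing. Transport and recognition are stubbed out, and the JSON field names are assumptions for illustration.

```python
import json

def server_handle(request_json, recognizer):
    """Identification server: recognize the two phonemes in the request's
    voice signal (S87) and reply with the target reference phonemes (S88),
    or with a failure message (S91)."""
    req = json.loads(request_json)
    result = recognizer(req["voice_signal"])
    if result is None:
        return json.dumps({"status": "fail"})
    first, second = result
    return json.dumps({"status": "ok", "first": first, "second": second})

def client_handle(reply_json, label_to_option):
    """User terminal: on success, look up the target phoneme label (S89) and
    return the option to activate (S90); on failure, report the message."""
    rep = json.loads(reply_json)
    if rep["status"] != "ok":
        return "recognition failed"
    return label_to_option[f'{rep["first"]} {rep["second"]}']

options = {"a i": "open YouTube"}  # hypothetical label-to-option association
reply = server_handle(json.dumps({"voice_signal": "..."}),
                      recognizer=lambda sig: ("a", "i"))
print(client_handle(reply, options))  # open YouTube
```

Splitting the work this way keeps the recognition models (and the cloud phoneme database) on the server while the terminal only stores the label-to-option associations, which matches the division of steps between the two devices.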
In summary, because the speech recognition data for voice recognition and the personal phoneme data corresponding to the user for voiceprint recognition are stored in advance, not only the speech produced by users with normal articulation but also the limited speech produced by users with dysarthria can be accurately recognized. Moreover, an encoding scheme based on at least two phonemes can define a relatively large number of phoneme labels, which can be widely applied to establish associations with a relatively large number of operation options. Compared with existing speech recognition techniques that rely on relatively complex language and acoustic models, a user's voice input containing at least two phonemes can thus be recognized relatively easily and quickly to determine the target operation option to be activated. The object of the present invention is therefore indeed achieved.
The foregoing, however, describes only embodiments of the present invention and shall not limit the scope of implementation of the present invention; all simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by the patent of the present invention.
100: computer device; 1: voice collection module; 2: touch display module; 3: storage module; 4: speaker module; 5: processing unit; 200: identification server; 300: communication network; S20-S28: steps; S80-S91: steps
Other features and effects of the present invention will be clearly presented in the embodiments with reference to the drawings, wherein: FIG. 1 is a block diagram exemplarily illustrating an architecture of a computer system for implementing the voice input operation method based on at least two phonemes according to the first embodiment of the present invention; FIG. 2 is a flow chart exemplarily illustrating how the computer system of FIG. 1 implements the first embodiment of the present invention; FIG. 3 is a schematic diagram exemplarily depicting a virtual keyboard displayed by a touch display module of the computer system and containing phoneme labels; FIG. 4 to FIG. 6 are schematic diagrams exemplarily depicting display windows, provided by the touch display module and containing phoneme labels, under different usage scenarios of the computer system; FIG. 7 is a block diagram exemplarily illustrating another architecture of a computer system for implementing the voice input operation method based on at least two phonemes according to the second embodiment of the present invention; and FIG. 8 is a flow chart exemplarily illustrating how the computer system of FIG. 7 implements the second embodiment of the present invention.
S20-S28: steps
Claims (7)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW109108337A TWI752437B (en) | 2020-03-13 | 2020-03-13 | At least two phoneme-based voice input operation method and computer program product |
PCT/US2020/038446 WO2021183169A1 (en) | 2020-03-13 | 2020-06-18 | Method of voice input operation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW109108337A TWI752437B (en) | 2020-03-13 | 2020-03-13 | At least two phoneme-based voice input operation method and computer program product |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202134855A TW202134855A (en) | 2021-09-16 |
TWI752437B true TWI752437B (en) | 2022-01-11 |
Family
ID=77672287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW109108337A TWI752437B (en) | 2020-03-13 | 2020-03-13 | At least two phoneme-based voice input operation method and computer program product |
Country Status (2)
Country | Link |
---|---|
TW (1) | TWI752437B (en) |
WO (1) | WO2021183169A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102272827A (en) * | 2005-06-01 | 2011-12-07 | 泰吉克通讯股份有限公司 | Method and apparatus utilizing voice input to resolve ambiguous manually entered text input |
CN106463113A (en) * | 2014-03-04 | 2017-02-22 | 亚马逊技术公司 | Predicting pronunciation in speech recognition |
US20180348970A1 (en) * | 2017-05-31 | 2018-12-06 | Snap Inc. | Methods and systems for voice driven dynamic menus |
US20190339936A1 (en) * | 2018-05-07 | 2019-11-07 | Imam Abdulrahman Bin Faisal University | Smart mirror |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7146319B2 (en) * | 2003-03-31 | 2006-12-05 | Novauris Technologies Ltd. | Phonetically based speech recognition system and method |
US20050154587A1 (en) * | 2003-09-11 | 2005-07-14 | Voice Signal Technologies, Inc. | Voice enabled phone book interface for speaker dependent name recognition and phone number categorization |
US7471775B2 (en) * | 2005-06-30 | 2008-12-30 | Motorola, Inc. | Method and apparatus for generating and updating a voice tag |
US8903847B2 (en) * | 2010-03-05 | 2014-12-02 | International Business Machines Corporation | Digital media voice tags in social networks |
WO2013192535A1 (en) * | 2012-06-22 | 2013-12-27 | Johnson Controls Technology Company | Multi-pass vehicle voice recognition systems and methods |
Legal events:
- 2020-03-13: TW application TW109108337A filed; patent TWI752437B (active)
- 2020-06-18: PCT application PCT/US2020/038446 (WO2021183169A1) filed
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102272827A (en) * | 2005-06-01 | 2011-12-07 | 泰吉克通讯股份有限公司 | Method and apparatus utilizing voice input to resolve ambiguous manually entered text input |
CN106463113A (en) * | 2014-03-04 | 2017-02-22 | 亚马逊技术公司 | Predicting pronunciation in speech recognition |
US20180348970A1 (en) * | 2017-05-31 | 2018-12-06 | Snap Inc. | Methods and systems for voice driven dynamic menus |
US20190339936A1 (en) * | 2018-05-07 | 2019-11-07 | Imam Abdulrahman Bin Faisal University | Smart mirror |
Also Published As
Publication number | Publication date |
---|---|
TW202134855A (en) | 2021-09-16 |
WO2021183169A1 (en) | 2021-09-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6588637B2 (en) | Learning personalized entity pronunciation | |
JP7432556B2 (en) | Methods, devices, equipment and media for man-machine interaction | |
US6377925B1 (en) | Electronic translator for assisting communications | |
JP2021103328A (en) | Voice conversion method, device, and electronic apparatus | |
WO2022078146A1 (en) | Speech recognition method and apparatus, device, and storage medium | |
KR101819459B1 (en) | Voice recognition system and apparatus supporting voice recognition error correction | |
CN112819664A (en) | Apparatus for learning foreign language and method for providing foreign language learning service using the same | |
JP2002116796A (en) | Voice processor and method for voice processing and storage medium | |
US20070055520A1 (en) | Incorporation of speech engine training into interactive user tutorial | |
WO2020024620A1 (en) | Voice information processing method and device, apparatus, and storage medium | |
CN111711834B (en) | Recorded broadcast interactive course generation method and device, storage medium and terminal | |
CN111653265A (en) | Speech synthesis method, speech synthesis device, storage medium and electronic equipment | |
CN113515586A (en) | Data processing method and device | |
CN110647613A (en) | Courseware construction method, courseware construction device, courseware construction server and storage medium | |
KR100593589B1 (en) | Multilingual Interpretation / Learning System Using Speech Recognition | |
CN109272983A (en) | Bilingual switching device for child-parent education | |
JP6166831B1 (en) | Word learning support device, word learning support program, and word learning support method | |
JP2007018290A (en) | Handwritten character input display supporting device and method and program | |
TWI752437B (en) | At least two phoneme-based voice input operation method and computer program product | |
CN115019787B (en) | Interactive homonym disambiguation method, system, electronic equipment and storage medium | |
KR102684930B1 (en) | Video learning systems for enable learners to be identified through artificial intelligence and method thereof | |
JP2000112610A (en) | Contents display selecting system and contents recording medium | |
KR102272567B1 (en) | Speech recognition correction system | |
CN113393831B (en) | Speech input operation method based on at least diphones and computer readable medium | |
CN111489742A (en) | Acoustic model training method, voice recognition method, device and electronic equipment |