TWI752437B - At least two phoneme-based voice input operation method and computer program product - Google Patents

Publication number
TWI752437B
Authority
TW
Taiwan
Prior art keywords
phoneme
target
phonemes
voice
computer system
Prior art date
Application number
TW109108337A
Other languages
Chinese (zh)
Other versions
TW202134855A (en)
Inventor
林海興
張嘉原
何冠旻
陳豫邦
翁恪誠
劉峻宇
林廷容
曾佳玉
Original Assignee
宇康生科股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 宇康生科股份有限公司
Priority to TW109108337A (patent TWI752437B)
Priority to PCT/US2020/038446 (patent WO2021183169A1)
Publication of TW202134855A
Application granted
Publication of TWI752437B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice input operation method based on at least two phonemes includes: encoding a plurality of reference phonemes to define a plurality of phoneme labels respectively associated with a plurality of operation options, each phoneme label comprising at least a first reference phoneme and a second reference phoneme selected from the reference phonemes; after confirming, according to stored speech recognition data associated with the reference phonemes or the user's personal phoneme data, and using speech recognition or voiceprint recognition technology, that a first phoneme and a second phoneme contained in a speech signal collected from the user are similar to a first target reference phoneme and a second target reference phoneme respectively, determining from the phoneme labels a target phoneme label whose first and second reference phonemes are identical to the first and second target reference phonemes respectively; and activating a target operation option, which is the one of the operation options associated with the target phoneme label.

Description

Voice input operation method based on at least two phonemes, and computer program product

The present invention relates to voice input, and in particular to a voice input operation method based on at least two phonemes and a computer program product.

Voice input is widely used in place of tedious manual input. In a typical application, a computer device passes the speech collected from a user through a speech recognition engine that uses a language model and an acoustic model. Once the engine recognizes the words, operation commands, or applications related to the speech, the computer device displays the words or executes the commands or applications.

However, patients with dysarthria often cannot produce particular speech sounds, and the speech they do produce is often slurred, hoarse, monotonous, intermittent, excessively loud, or otherwise abnormal. Because existing speech recognition technology cannot successfully recognize such speech, these patients cannot use existing voice input methods, such as the voice input function of a tablet computer serving as a speech communication device, to communicate with others.

Developing a voice input operation technique suitable for patients with dysarthria, so that they can use voice input to operate electronic devices or communicate with others, has therefore become an important issue.

Accordingly, an object of the present invention is to provide a voice input operation method based on at least two phonemes, and a computer program product, that overcome at least one drawback of the prior art.

Accordingly, the present invention provides a voice input operation method based on at least two phonemes, executed by a computer system having speech recognition and voiceprint recognition technology, and comprising the following steps:
(A) storing speech recognition data associated with a plurality of mutually different reference phonemes, and personal phoneme data corresponding to a user, the personal phoneme data comprising the speech content of a plurality of utterances produced by the user that correspond respectively to the reference phonemes;
(B) encoding the reference phonemes to define a plurality of mutually different phoneme labels, each phoneme label comprising at least a first reference phoneme selected from the reference phonemes and a second reference phoneme selected from the reference phonemes;
(C) associating the phoneme labels respectively with a plurality of mutually different operation options;
(D) after collecting from the user a speech signal containing at least a consecutive first phoneme and second phoneme, confirming, according to the speech recognition data and using speech recognition technology, or according to the personal phoneme data and using voiceprint recognition technology, whether the first phoneme is similar to one of the reference phonemes and whether the second phoneme is similar to one of the reference phonemes;
(E) after confirming a first target reference phoneme similar to the first phoneme and a second target reference phoneme similar to the second phoneme, determining from the phoneme labels, according to the first and second target reference phonemes, a target phoneme label whose first reference phoneme and second reference phoneme are identical to the first target reference phoneme and the second target reference phoneme respectively; and
(F) activating a target operation option, which is the one of the operation options associated with the target phoneme label.

In the voice input operation method of the present invention, each reference phoneme is a vowel or a syllable.

In the voice input operation method of the present invention, each operation option is one of a symbol, a character, a text content, an operation command, an application, and a file. When the target operation option is a symbol, the computer system activates it by displaying the symbol. When the target operation option is a character, the computer system activates it by displaying the character. When the target operation option is a text content, the computer system activates it at least by displaying the text content. When the target operation option is an operation command, the computer system activates it by executing the command. When the target operation option is an application, the computer system activates it by executing the application. When the target operation option is a file, the computer system activates it by opening or playing the file.

In the voice input operation method of the present invention, the following step is further included between step (C) and step (D): (G) displaying a plurality of images respectively representing the operation options, and displaying near the images the phoneme labels associated with the operation options.

In the voice input operation method of the present invention, when the target operation option is a text content, the computer system activates it not only by displaying the text content but also by playing speech content corresponding to the text content.

In the voice input operation method of the present invention, the computer system includes a user terminal that executes steps (B), (C), (E), (F), and (G), and a recognition server that can communicate with the user terminal and executes steps (A) and (D). The method further includes the following steps:
(H) before step (A), transmitting, by the user terminal, the collected personal phoneme data to the recognition server;
(I) between step (C) and step (D), collecting, by the user terminal, the speech signal, and sending a recognition request that contains the speech signal and relates to the user to the recognition server, so that the recognition server executes step (D) in response to the recognition request; and
(J) between step (D) and step (E), transmitting, by the recognition server, once the first target reference phoneme and the second target reference phoneme have been confirmed, a recognition reply containing the first and second target reference phonemes to the user terminal, so that the user terminal executes step (E) in response to the recognition reply.
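The exchange in steps (I) and (J) can be sketched as a simple request and reply between the user terminal and the recognition server. This is a minimal sketch under stated assumptions: the message classes, the field names, and the `recognize` callable are hypothetical illustrations, not part of the patent text, and returning `None` on a failed recognition is an assumed convention.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RecognitionRequest:
    """Step (I): sent by the user terminal (field names are assumptions)."""
    user_id: str          # tells the server whose personal phoneme data applies
    speech_signal: bytes  # the collected signal with the two consecutive phonemes


@dataclass
class RecognitionReply:
    """Step (J): sent by the recognition server back to the terminal."""
    first_target: str   # first target reference phoneme
    second_target: str  # second target reference phoneme


# Hypothetical stand-in for step (D): returns the two target reference
# phonemes, or None when either phoneme cannot be confirmed.
Recognize = Callable[[str, bytes], Optional[Tuple[str, str]]]


def build_request(user_id: str, speech_signal: bytes) -> RecognitionRequest:
    """Terminal side of step (I)."""
    return RecognitionRequest(user_id, speech_signal)


def serve(request: RecognitionRequest,
          recognize: Recognize) -> Optional[RecognitionReply]:
    """Server side: run step (D), then build the step (J) reply."""
    targets = recognize(request.user_id, request.speech_signal)
    if targets is None:
        return None  # assumed convention for a failed recognition
    return RecognitionReply(*targets)
```

The terminal would then use the two phonemes in the reply to carry out step (E) locally, which keeps the label-to-option association on the terminal and the recognition data on the server.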

Accordingly, the present invention also provides a computer program product that is stored on a computer-readable medium and includes a plurality of program instructions. When a computer device executes the program instructions, it carries out the voice input operation method based on at least two phonemes described above.

The effect of the present invention is as follows. Because the speech recognition data used for speech recognition, and the personal phoneme data that corresponds to the user and is used for voiceprint recognition, are stored in advance, the limited speech produced not only by users with normal articulation but also by users with dysarthria can be recognized accurately. In addition, encoding based on at least two phonemes defines a relatively large number of phoneme labels, which can be associated with a correspondingly large number of operation options. Consequently, compared with existing speech recognition technology that relies on relatively complex language and acoustic models, a voice input that the user utters and that contains at least two phonemes can be recognized relatively easily and quickly to determine the target operation option to be activated.

Before the present invention is described in detail, it should be noted that similar elements are denoted by the same reference numerals in the following description.

Referring to FIG. 1, a computer system is implemented as a computer device 100, such as a smartphone or tablet computer, that carries out the voice input operation method based on at least two phonemes (double-phoneme) according to a first embodiment of the present invention. The computer device 100 includes a voice collection module 1 (for example, a microphone module) for collecting external speech, a touch display module 2 serving as both display and user interface, a storage module 3, a speaker module 4, and a processing unit 5 electrically connected to the voice collection module 1, the touch display module 2, the storage module 3, and the speaker module 4. In this embodiment, the processing unit 5 supports speech recognition and voiceprint recognition technology.

The following describes, by way of example and with reference to FIG. 1 and FIG. 2, how the computer device 100 executes the voice input operation method of the first embodiment. In general, the method may include the following steps S20-S28.

First, in step S20, the computer device 100 stores speech recognition data associated with a plurality of mutually different reference phonemes in the storage module 3. In this embodiment, each reference phoneme may be a vowel or a syllable. More specifically, the computer device 100 obtains the speech recognition data used to recognize the reference phonemes from the speech content of a plurality of users (with normal articulation) uttering the reference phonemes, for example through acoustic model training. For example, the reference phonemes may comprise four vowels and four syllables, as shown in Table 1 below.

Table 1: Reference phonemes
Vowels: a, u, e, o
Syllables: ha, hu, he, ho

Then, in step S21, usually during a registration stage, the computer device 100 stores personal phoneme data corresponding to a user. The personal phoneme data comprises the speech content of a plurality of utterances produced by the user that correspond respectively to the mutually different reference phonemes. More specifically, before recording starts, the processing unit 5 may, for example, cause the touch display module 2 to display the reference phonemes as a prompt telling the user which reference phonemes to utter, although the method is not limited to this. During recording, the voice collection module 1 transmits the collected speech content of the utterances that the user produces for the reference phonemes to the processing unit 5, and the processing unit 5 takes this speech content as the personal phoneme data corresponding to the user and stores it in the storage module 3.

It should be noted in particular that, in order to record distinguishable utterances and so build personal voice data that matches the characteristics of the individual voiceprint, the choice of reference phonemes may depend on the user's articulation ability. In other words, the user does not need to pronounce each reference phoneme (that is, each vowel or syllable) in its standard form, as long as the utterances the user produces for the reference phonemes are distinguishable from one another.

Then, in step S22, the processing unit 5 encodes the reference phonemes to define a plurality of mutually different phoneme labels. In this embodiment, each phoneme label contains only a first reference phoneme selected from the reference phonemes and a second reference phoneme selected from the reference phonemes. Following the example of Table 1 above, the phoneme labels can be defined as shown in Table 2 below.

Table 2: Phoneme labels
a a, u a, e a, o a, ha a, hu a, he a, ho a
a u, u u, e u, o u, ha u, hu u, he u, ho u
a e, u e, e e, o e, ha e, hu e, he e, ho e
a o, u o, e o, o o, ha o, hu o, he o, ho o
a ha, u ha, e ha, o ha, ha ha, hu ha, he ha, ho ha
a he, u he, e he, o he, ha he, hu he, he he, ho he
a ho, u ho, e ho, o ho, ha ho, hu ho, he ho, ho ho
a hu, u hu, e hu, o hu, ha hu, hu hu, he hu, ho hu
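The encoding in step S22 amounts to enumerating every ordered pair of reference phonemes. The following minimal sketch reproduces the 64 labels of Table 2 from the 8 reference phonemes of Table 1. The function name and the space-separated string form of a label are assumptions made for illustration.

```python
from itertools import product
from typing import List

# Reference phonemes from Table 1: four vowels and four syllables.
REFERENCE_PHONEMES = ["a", "u", "e", "o", "ha", "hu", "he", "ho"]


def define_phoneme_labels(phonemes: List[str], length: int = 2) -> List[str]:
    """Return every ordered sequence of `length` reference phonemes.

    Each label is written as its phonemes joined by a space, for example
    "ha he". Repetition is allowed, so "a a" is a valid label.
    """
    return [" ".join(combo) for combo in product(phonemes, repeat=length)]


labels = define_phoneme_labels(REFERENCE_PHONEMES)
```

The enumeration order differs from the layout of Table 2, but the set of labels is the same.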

It is worth noting that, in other implementations, if the user's limited articulation ability leaves only a relatively small number of distinguishable reference phonemes, each phoneme label defined by the processing unit 5 may contain three or more reference phonemes so that a sufficient number of phoneme labels can still be defined.
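The trade-off between the number of distinguishable reference phonemes and the label length follows from simple counting: n phonemes and labels of length k give n^k labels. The sketch below computes the shortest label length that covers a given number of operation options. The function names are hypothetical, and the lower bound of two phonemes per label reflects the "at least two phonemes" wording of the method.

```python
def label_capacity(n_phonemes: int, label_length: int) -> int:
    """Number of distinct labels: n_phonemes ** label_length."""
    return n_phonemes ** label_length


def min_label_length(n_phonemes: int, n_options: int) -> int:
    """Smallest label length (at least 2) that yields n_options labels."""
    if n_phonemes < 2:
        # With a single phoneme every label length gives exactly one label.
        raise ValueError("at least two distinguishable phonemes are required")
    length = 2
    while label_capacity(n_phonemes, length) < n_options:
        length += 1
    return length
```

For instance, the 8 phonemes of Table 1 cover 64 options with two-phoneme labels, whereas a user who can produce only 3 distinguishable phonemes would need four-phoneme labels (3^4 = 81) to cover 60 options.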

Next, in step S23, the processing unit 5 associates the defined phoneme labels respectively with a plurality of mutually different operation options, and may store the association between the phoneme labels and the operation options in the storage module 3. In this embodiment, each operation option may be a symbol, a character, a text content, an operation command, an application, or a file. The symbols, characters, and operation commands may be any of those contained in a virtual keyboard displayed by the touch display module 2, and the applications and files may be any application and any text, graphics, audio, or multimedia file contained in any window displayed by the touch display module 2. It should be noted that, in actual use, the phoneme labels can be associated with the operation options in the different display windows provided by the touch display module 2.
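One way to picture the association of step S23 is a separate label-to-option table for each display window, so that the same label can be reused in different windows. The sketch below is illustrative only: the `OperationOption` class, the window names, and the specific label-to-option pairings (apart from ">" with "a hu", which the description of FIG. 3 states) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class OperationOption:
    kind: str     # "symbol", "character", "text", "command", "application", "file"
    payload: str  # what to display, execute, launch, or open


# window name -> {phoneme label -> operation option}
associations: Dict[str, Dict[str, OperationOption]] = {
    "keyboard": {
        "a a": OperationOption("character", "我"),   # assumed pairing
        "a hu": OperationOption("symbol", ">"),
    },
    "desktop": {
        "a a": OperationOption("application", "YouTube"),  # assumed pairing
        "ha a": OperationOption("command", "Phone"),       # assumed pairing
    },
}


def associate(window: str, label: str, option: OperationOption) -> None:
    """Bind a label to an option; labels must be unique within a window."""
    table = associations.setdefault(window, {})
    if label in table:
        raise ValueError(f"label {label!r} is already used in {window!r}")
    table[label] = option
```

Keeping one table per window means the 64 labels of Table 2 only need to cover the options visible at one time, not every option in the system.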

Afterwards, while the computer device 100 is in use, in step S24 the processing unit 5 causes the display window currently provided by the touch display module 2 to display a plurality of images respectively representing the phoneme labels and the operation options. More specifically, the image content displayed depends on the display window that the touch display module 2 currently provides. Examples for different usage scenarios follow.

Referring to FIG. 3, the illustrated example is a virtual keyboard shown in a display window provided by the touch display module 2. The virtual keyboard displays: images of a plurality of (editing candidate) characters (such as "我" and "好"), with phoneme labels (such as "a a" and "a i") near the images and respectively associated with the characters; images of a plurality of (editing candidate) phonetic symbols (such as "ㄅ" and "ㄆ"), with phoneme labels (such as "i a" and "e o") near the images and respectively associated with the phonetic symbols; an image of an (editing candidate) mathematical symbol (such as ">"), with a phoneme label (such as "a hu") near the image and associated with the symbol; and images of a plurality of (editing candidate) operation commands (such as "123" for the numeric key switching command, "Space" for the space input command, and "Send" for the send command), with phoneme labels (such as "ha e", "ha he", and "ha ha") near the images and respectively associated with the commands.

Referring to FIG. 4, the illustrated example is the "desktop" display window provided by the touch display module 2. It contains images respectively representing a plurality of different applications (such as "YouTube", "EVA Facial Mouse", and "米家"), with phoneme labels (such as "a a", "a u", and "a e") near the images and respectively associated with the applications, and images respectively representing a plurality of operation commands (such as "My Files", "Phone", and "Contacts"), with phoneme labels (such as "o ha", "ha a", and "ha u") near the images and respectively associated with the commands.

Referring to FIG. 5, the illustrated example is the display window provided by the touch display module 2 after the computer device 100 executes the "YouTube" application. It contains images (such as image 1 to image 10) respectively representing a plurality of different multimedia files (such as video 1 to video 10), with phoneme labels (such as "u e", "e u", and "e e") near the images and respectively associated with the multimedia files, and images respectively representing a plurality of operation commands (such as "Home", "Trending", and "Subscriptions"), with phoneme labels (such as "ha a", "ha u", and "ha e") near the images and respectively associated with the commands.

Referring to FIG. 6, the illustrated example is a display window provided by the touch display module 2 after the computer device 100 executes a social communication application. It contains images respectively representing a plurality of text contents (such as "I need help" and "I need to urinate"), with phoneme labels (such as "a a" and "a i") near the images and respectively associated with the text contents, and images respectively representing a plurality of operation commands (such as "Clear", "Send and speak", and "Save"), with phoneme labels (such as "u ha", "u hi", and "u he") near the images and respectively associated with the commands.

在本實施例中,在此情況下,該用戶可根據該觸控顯示模組2當前的顯示視窗所含的操作選項以及與其相關聯的音素標籤,發出與所欲操作選項相關聯的音素標籤有關的第一音素及第二音素。然而,在其他實施態樣中,若每一音素標籤含有三個或三個以上的參考音素時,用戶則必須發出與所欲操作選項相關聯的音素標籤有關的多個音素,其數量應與每一音素標籤所含的參考音素的數量一致。In this embodiment, in this case, the user can issue the phoneme label associated with the desired operation option according to the operation option contained in the current display window of the touch display module 2 and the phoneme label associated therewith related first and second phonemes. However, in other embodiments, if each phoneme tag contains three or more reference phonemes, the user must utter a plurality of phonemes related to the phoneme tag associated with the desired operation option, the number of which should be equal to The number of reference phonemes contained in each phoneme label is the same.

Then, when the processing unit 5 receives a speech signal that the voice collection module 1 has collected from the user and that contains a consecutive first phoneme and second phoneme, in step S25 the processing unit 5 may confirm whether the first phoneme is similar to one of the reference phonemes and whether the second phoneme is similar to one of the reference phonemes, either according to the speech recognition data stored in the storage module 3 and using speech recognition technology, or according to the personal phoneme data stored in the storage module 3 and using voiceprint recognition technology. If the processing unit 5 confirms a first target reference phoneme similar to the first phoneme and a second target reference phoneme similar to the second phoneme, the flow proceeds to step S26. Otherwise, the flow proceeds to step S28. It should be noted in particular that, when actually executing step S25, the processing unit 5 may, for example, first attempt to confirm the first and second phonemes according to the speech recognition data using speech recognition technology and, if that fails, then attempt the confirmation according to the personal phoneme data using voiceprint recognition technology, although the method is not limited to this.
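The two-stage confirmation described for step S25 (speech recognition first, voiceprint recognition as a fallback) can be sketched as follows. This is a sketch under stated assumptions: the two recognizers are hypothetical callables that map one phoneme segment to a reference phoneme or to `None`, and the speech signal is assumed to have been split into per-phoneme segments already. The patent text does not specify either detail.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical recognizer: returns the similar reference phoneme, or None.
Recognizer = Callable[[bytes], Optional[str]]


def confirm_phonemes(segments: Sequence[bytes],
                     speech_recognizer: Recognizer,
                     voiceprint_recognizer: Recognizer
                     ) -> Optional[Tuple[str, ...]]:
    """Return the target reference phonemes, or None if confirmation fails."""
    # First attempt: speech recognition data (trained on typical speakers).
    targets = [speech_recognizer(segment) for segment in segments]
    if any(target is None for target in targets):
        # Fallback: the user's personal phoneme data and voiceprint matching.
        targets = [voiceprint_recognizer(segment) for segment in segments]
    if any(target is None for target in targets):
        return None  # corresponds to proceeding to step S28
    return tuple(targets)
```

Trying the general model first and the personal data second matches the order given as an example in the text. Other orders, or using only one of the two, would equally fit the "not limited to this" wording.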

In step S26, the processing unit 5 determines a target phoneme label from the phoneme labels (that is, the phoneme labels contained in the display window currently shown in step S24), according to the first and second target reference phonemes and the association stored in the storage module 3. The first reference phoneme and the second reference phoneme contained in the target phoneme label are identical to the first target reference phoneme and the second target reference phoneme respectively.

Next, in step S27, the processing unit 5 activates, according to the association stored in the storage module 3, the target operation option (that is, the desired operation option), which is the one of the operation options (that is, the operation options contained in the display window currently shown in step S24) associated with the target phoneme label.
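Steps S26 and S27 reduce to a lookup in the association table of the current display window. The sketch below shows that lookup. The function name, the dictionary layout, and the example pairings are assumptions, and returning `None` when the current window has no such label is an assumed convention that the text does not address.

```python
from typing import Dict, Optional, Tuple


def select_target(first_target: str,
                  second_target: str,
                  window_options: Dict[str, str]
                  ) -> Optional[Tuple[str, str]]:
    """Return (target phoneme label, target operation option).

    `window_options` maps each phoneme label shown in the current display
    window to its operation option (step S24).
    """
    # Step S26: the target label is the one whose first and second reference
    # phonemes equal the two target reference phonemes, in order.
    target_label = f"{first_target} {second_target}"
    if target_label not in window_options:
        return None  # assumed: no such label in the current window
    # Step S27: the option associated with that label is the one to activate.
    return target_label, window_options[target_label]


# Illustrative desktop window; the pairings are assumptions.
desktop = {"a a": "YouTube", "ha a": "Phone"}
```

Order matters in the lookup: "ha a" and "a ha" are different labels in Table 2.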

When the processing unit 5 confirms that either the first phoneme or the second phoneme is not similar to any of the reference phonemes, in step S28 the processing unit 5 causes the touch display module 2 to display a recognition failure message. The user can then utter again the first and second phonemes of the phoneme label associated with the desired operation option, and the computer device 100 repeats step S25 until reference phonemes similar to the first and second phonemes are confirmed.
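The loop between steps S25 and S28 can be sketched as a retry loop around capture and confirmation. The three callables and the optional `max_attempts` limit are assumptions added for illustration. The text itself describes repeating until recognition succeeds.

```python
from typing import Callable, Optional, Tuple


def recognize_with_retry(capture: Callable[[], bytes],
                         confirm: Callable[[bytes], Optional[Tuple[str, str]]],
                         show_failure: Callable[[], None],
                         max_attempts: Optional[int] = None
                         ) -> Optional[Tuple[str, str]]:
    """Repeat step S25 until both phonemes are confirmed.

    capture      -- collects the next speech signal from the user
    confirm      -- step S25; returns the two target phonemes or None
    show_failure -- step S28; displays the recognition failure message
    max_attempts -- hypothetical safety limit; None means retry indefinitely
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        targets = confirm(capture())
        if targets is not None:
            return targets
        show_failure()
    return None
```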

The following further describes, by way of example, how the processing unit 5 activates the target operation option depending on its actual form.

If the target operation option is a symbol (such as the mathematical symbol ">" on the virtual keyboard shown in FIG. 3), the processing unit 5 activates it by causing the touch display module 2 to display the symbol in an editing area (not shown). Similarly, if the target operation option is a character (such as the character "我" ("I") on the virtual keyboard shown in FIG. 3), the processing unit 5 activates it by causing the touch display module 2 to display the character. If the target operation option is a text content (such as "I need help" in the display window shown in FIG. 6), the processing unit 5 activates it not only by causing the touch display module 2 to display the text content in a communication record area (as shown in FIG. 6), but also by causing the speaker module 4 to play voice content corresponding to the text content. If the target operation option is an operation command (such as the operation command "Phone" in the display window shown in FIG. 4), the processing unit 5 activates it by executing the operation command (for example, causing the touch display module 2 to switch from the original desktop display window to a display window related to "Phone"). If the target operation option is an application (such as the application "YouTube" in the display window shown in FIG. 4), the processing unit 5 activates it by executing the application and causes the touch display module 2 to switch from the original display window (as shown in FIG. 4) to a display window related to the application (as shown in FIG. 5). If the target operation option is a file, the processing unit 5 activates it by opening or playing the file.
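The activation rules above amount to a dispatch on the form of the target operation option. The following is a minimal Python sketch of that dispatch, not the patented implementation. The `Option` class, the `kind` strings, and the returned action descriptions are hypothetical names introduced only for illustration; a real device would drive the touch display module 2 and the speaker module 4 instead of returning strings.

```python
from dataclasses import dataclass


@dataclass
class Option:
    # Hypothetical representation of an operation option.
    kind: str   # "symbol", "character", "text", "command", "application" or "file"
    value: str


def activate(option: Option) -> list:
    """Return a description of the actions taken to activate the option."""
    if option.kind == "symbol":
        return [f"display {option.value!r} in the editing area"]
    if option.kind == "character":
        return [f"display {option.value!r}"]
    if option.kind == "text":
        # Text content is both displayed and played back as speech.
        return [
            f"display {option.value!r} in the communication record area",
            f"play speech for {option.value!r}",
        ]
    if option.kind == "command":
        return [f"execute command {option.value!r}"]
    if option.kind == "application":
        return [
            f"execute application {option.value!r}",
            f"switch display window to {option.value!r}",
        ]
    if option.kind == "file":
        return [f"open or play file {option.value!r}"]
    raise ValueError(f"unknown option kind: {option.kind}")
```

For example, `activate(Option("text", "I need help"))` yields two actions, matching the description that text content is shown in the communication record area and also spoken.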

Notably, the voice input operation method of the first embodiment can be programmed as a computer program product comprising a plurality of program instructions, and the computer program product can be stored on a computer-readable medium (for example, the storage module 3). When the computer device 100 executes the program instructions, it carries out the voice input operation method based on at least two phonemes described above.

Referring to FIG. 7, another computer system includes not only the computer device 100, which serves as a user terminal, but also a recognition server 200. The computer device 100 and the recognition server 200 cooperate to implement the voice input operation method based on at least two phonemes (double-phoneme) according to a second embodiment of the present invention. In this embodiment, the recognition server 200 can communicate with the computer device 100 via a communication network 300 and supports speech recognition and voiceprint recognition technology.

The following describes, with reference to FIG. 7 and FIG. 8, how the computer system performs the voice input operation method of the second embodiment. In general, this method is a variation of the voice input operation method of the first embodiment and may include the following steps S80 to S91.

First, in step S80, the recognition server 200 stores the speech recognition data in advance.

In step S81, during a registration stage, the computer device 100 transmits, via the communication network 300, voice content collected by the voice collection module 1 to the recognition server 200. The voice content comprises a plurality of voices uttered by a user and respectively corresponding to a plurality of reference phonemes.

Then, in step S82, the recognition server 200 stores the voice content from the computer device 100 as personal phoneme data corresponding to the user. In practice, the recognition server 200 may also serve as a cloud server for personal phoneme data, collecting and storing the personal phoneme data of a large number of other users (such as users with articulation disorders). Analyzing this body of data with artificial intelligence or machine learning can yield a speech database (not shown) for recognizing atypical speech (such as speech uttered by users with articulation disorders).
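The registration stage (steps S81 and S82) can be pictured as the server keeping one recording per reference phoneme for each user. The sketch below is an assumption-laden illustration only: the `RecognitionServer` class, the `enroll` method, the user identifier, and the in-memory dictionary are all hypothetical, and the patent does not say how the recognition server 200 actually stores the data.

```python
class RecognitionServer:
    """Hypothetical in-memory stand-in for the recognition server 200."""

    def __init__(self):
        # user id -> {reference phoneme -> recorded audio bytes}
        self.personal_phoneme_data = {}

    def enroll(self, user_id, recordings):
        """Store the user's recordings (step S82) and list the enrolled phonemes."""
        stored = self.personal_phoneme_data.setdefault(user_id, {})
        stored.update(recordings)
        return sorted(stored)
```

A later call to `enroll` for the same user adds to, or overwrites, that user's earlier recordings, which is one simple way to let a user re-register a phoneme.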

Next, the computer device 100 performs steps S83 to S85 in sequence. The operations of the computer device 100 in steps S83 to S85 are the same as those in steps S22 to S24 (FIG. 2), respectively, and are not repeated here.

Afterwards, when the processing unit 5 of the computer device 100 receives a voice signal that was collected by the voice collection module 1 from the user and contains a consecutive first phoneme and second phoneme, the computer device 100 transmits, via the communication network 300, a recognition request that contains the voice signal and relates to the user to the recognition server 200 (step S86).

Then, upon receiving the recognition request from the computer device 100, the recognition server 200 determines whether the first phoneme is similar to one of the reference phonemes and whether the second phoneme is similar to one of the reference phonemes (step S87). It may do so based on the stored speech recognition data using speech recognition technology, based on the stored personal phoneme data corresponding to the user using voiceprint recognition technology, or based on the above speech database for atypical speech using speech recognition technology. If the recognition server 200 identifies a first target reference phoneme similar to the first phoneme and a second target reference phoneme similar to the second phoneme (that is, recognition succeeds), the flow proceeds to step S88. Otherwise, the flow proceeds to step S91.
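The decision in step S87 can be sketched as a nearest-reference search with a similarity threshold. The patent does not specify how similarity is measured, so everything concrete here is an assumption made for illustration: phonemes are represented as small feature vectors, similarity is cosine similarity, and the threshold of 0.8 is arbitrary. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_phoneme(features, references, threshold=0.8):
    """Return the most similar reference phoneme, or None if none reaches the threshold."""
    best_name, best_score = None, threshold
    for name, ref in references.items():
        score = cosine(features, ref)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def recognize(first, second, references, threshold=0.8):
    """Step S87: return both target reference phonemes, or None on failure."""
    first_target = match_phoneme(first, references, threshold)
    second_target = match_phoneme(second, references, threshold)
    if first_target is None or second_target is None:
        return None  # recognition failed: proceed to step S91
    return first_target, second_target  # recognition succeeded: proceed to step S88
```

Recognition succeeds only when both phonemes find a match. If either one has no sufficiently similar reference phoneme, the whole request fails, which mirrors the branch to step S91.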

In step S88, the recognition server 200 transmits, via the communication network 300, a recognition reply containing the first target reference phoneme and the second target reference phoneme to the computer device 100.

Then, in step S89, the processing unit 5 of the computer device 100 determines a target phoneme label from among the phoneme labels, based on the first target reference phoneme and the second target reference phoneme contained in the recognition reply and on the association stored in the storage module 3. The first reference phoneme and the second reference phoneme contained in the target phoneme label are identical to the first target reference phoneme and the second target reference phoneme, respectively.

Next, similarly to step S27 (FIG. 2), the processing unit 5 of the computer device 100 activates, according to the association stored in the storage module 3, the target operation option (that is, the desired operation option), which is the one of the operation options associated with the target phoneme label (step S90).
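Steps S89 and S90 reduce to a table lookup: the pair of target reference phonemes identifies the phoneme label, and the stored association maps that label to an operation option. The sketch below assumes the association is a dictionary keyed by phoneme pairs. The specific phonemes assigned to each option are invented for illustration and do not come from the patent figures.

```python
# Hypothetical association between phoneme labels and operation options.
ASSOCIATIONS = {
    ("a", "i"): "I need help",
    ("a", "u"): "Phone",
    ("i", "u"): "YouTube",
}


def target_label(first_target, second_target, associations):
    """Step S89: return the label whose two phonemes equal the two targets, if any."""
    label = (first_target, second_target)
    return label if label in associations else None


def target_option(first_target, second_target, associations):
    """Step S90: return the operation option associated with the target label."""
    label = target_label(first_target, second_target, associations)
    return None if label is None else associations[label]
```

Because the label is an ordered pair, `("a", "i")` and `("i", "a")` are different labels, which is what lets a small set of reference phonemes address many options.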

When the recognition server 200 determines that either the first phoneme or the second phoneme is not similar to any of the reference phonemes (that is, recognition fails), in step S91 the recognition server 200 transmits a recognition failure message to the computer device 100 via the communication network 300. The processing unit 5 of the computer device 100 can then display the recognition failure message on the touch display module 2 for the user to view. After the user again utters the first phoneme and the second phoneme of the phoneme label associated with the desired operation option, the computer system repeats steps S86 and S87 until reference phonemes similar to the first phoneme and the second phoneme are identified.
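The retry behavior around steps S86, S87 and S91 can be sketched from the terminal's point of view as a loop. This is only an illustration under stated assumptions: `capture`, `send_request` and `show_message` are hypothetical callables standing in for the voice collection module 1, the network round trip to the recognition server 200, and the touch display module 2. The `max_attempts` limit is also an assumption added so that the sketch terminates; the description itself repeats until recognition succeeds.

```python
def input_until_recognized(capture, send_request, show_message, max_attempts=5):
    """Repeat capture and recognition until a reply arrives or attempts run out."""
    for _ in range(max_attempts):
        signal = capture()            # user utters the two phonemes
        reply = send_request(signal)  # step S86; reply is a phoneme pair or None
        if reply is not None:
            return reply              # success: continue with steps S89 and S90
        show_message("Recognition failed, please try again")  # step S91
    return None
```

Passing the three collaborators in as functions keeps the loop independent of any particular audio or network library, so it can be exercised with simple stand-ins.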

In summary, because the speech recognition data used for speech recognition and the personal phoneme data corresponding to the user and used for voiceprint recognition are stored in advance, both the speech uttered by users with normal articulation and the limited speech uttered by users with articulation disorders can be accurately recognized. In addition, the encoding based on at least two phonemes can define a relatively large number of phoneme labels, which can be associated with a correspondingly large number of operation options. Thus, compared with existing speech recognition technology that relies on relatively complex language models and acoustic models, a voice input uttered by the user and containing at least two phonemes can be recognized relatively easily and quickly to determine the target operation option to be activated. The object of the present invention is therefore achieved.
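The remark about the number of phoneme labels can be made concrete with a short count. Assuming that a label is an ordered sequence of reference phonemes and that a phoneme may be repeated within a label (the text does not rule this out), n reference phonemes give n × n two-phoneme labels. The five vowels used below are an illustrative choice, not the patent's reference set.

```python
from itertools import product


def phoneme_labels(reference_phonemes, length=2):
    """Enumerate every ordered label of the given length over the reference phonemes."""
    return list(product(reference_phonemes, repeat=length))


vowels = ["a", "e", "i", "o", "u"]
labels = phoneme_labels(vowels)
print(len(labels))                     # 5 * 5 = 25 two-phoneme labels
print(len(phoneme_labels(vowels, 3)))  # 5 * 5 * 5 = 125 three-phoneme labels
```

So a user who can reliably produce only five distinct sounds could still address 25 operation options with two-phoneme labels, under the stated assumptions.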

The foregoing describes only embodiments of the present invention and shall not limit the scope in which the invention may be practiced. All simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by this patent.

100: computer device
1: voice collection module
2: touch display module
3: storage module
4: speaker module
5: processing unit
200: recognition server
300: communication network
S20-S28: steps
S80-S91: steps

Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which:
FIG. 1 is a block diagram illustrating an architecture of a computer system for implementing the voice input operation method based on at least two phonemes according to the first embodiment of the present invention;
FIG. 2 is a flowchart illustrating how the computer system of FIG. 1 implements the first embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a virtual keyboard that is displayed by a touch display module of the computer system and contains phoneme labels;
FIGS. 4 to 6 are schematic diagrams illustrating display windows that are provided by the touch display module and contain phoneme labels under different usage scenarios of the computer system;
FIG. 7 is a block diagram illustrating another architecture of a computer system for implementing the voice input operation method based on at least two phonemes according to the second embodiment of the present invention; and
FIG. 8 is a flowchart illustrating how the computer system of FIG. 7 implements the second embodiment of the present invention.

S20-S28: steps

Claims (7)

1. A voice input operation method based on at least two phonemes, performed by a computer system having speech recognition and voiceprint recognition technology, the method comprising the following steps:
(A) storing speech recognition data associated with a plurality of mutually different reference phonemes, and personal phoneme data corresponding to a user, the personal phoneme data comprising voice content of a plurality of voices uttered by the user and respectively corresponding to the reference phonemes;
(B) encoding the reference phonemes to define a plurality of mutually different phoneme labels, each phoneme label comprising at least a first reference phoneme selected from the reference phonemes and a second reference phoneme selected from the reference phonemes;
(C) associating the phoneme labels respectively with a plurality of mutually different operation options;
(D) after a voice signal from the user containing at least a consecutive first phoneme and second phoneme is collected, determining, based on the speech recognition data using speech recognition technology or based on the personal phoneme data using voiceprint recognition technology, whether the first phoneme is similar to one of the reference phonemes and whether the second phoneme is similar to one of the reference phonemes;
(E) after a first target reference phoneme similar to the first phoneme and a second target reference phoneme similar to the second phoneme are identified, determining a target phoneme label from among the phoneme labels based on the first target reference phoneme and the second target reference phoneme, the first reference phoneme and the second reference phoneme contained in the target phoneme label being identical to the first target reference phoneme and the second target reference phoneme, respectively; and
(F) activating a target operation option, which is the one of the operation options associated with the target phoneme label.

2. The voice input operation method based on at least two phonemes according to claim 1, wherein, in step (A), each reference phoneme is a vowel or a syllable.

3. The voice input operation method based on at least two phonemes according to claim 1, wherein:
in step (C), each operation option is one of a symbol, a character, a text content, an operation command, an application, and a file; and
in step (F),
when the target operation option is a symbol, the computer system activates the target operation option by displaying the symbol,
when the target operation option is a character, the computer system activates the target operation option by displaying the character,
when the target operation option is a text content, the computer system activates the target operation option at least by displaying the text content,
when the target operation option is an operation command, the computer system activates the target operation option by executing the operation command,
when the target operation option is an application, the computer system activates the target operation option by executing the application, and
when the target operation option is a file, the computer system activates the target operation option by opening or playing the file.

4. The voice input operation method based on at least two phonemes according to claim 3, further comprising, between step (C) and step (D), the following step:
(G) displaying a plurality of images respectively representing the operation options, and displaying, near the images, the phoneme labels associated with the operation options.

5. The voice input operation method based on at least two phonemes according to claim 4, wherein, in step (F), when the target operation option is a text content, the computer system activates the target operation option not only by displaying the text content but also by playing voice content corresponding to the text content.

6. The voice input operation method based on at least two phonemes according to claim 5, wherein the computer system comprises a user terminal for performing steps (B), (C), (E), (F) and (G), and a recognition server that can communicate with the user terminal and is for performing steps (A) and (D), the method further comprising the following steps:
before step (A), (H) transmitting, by the user terminal, the collected personal phoneme data to the recognition server;
between step (C) and step (D), (I) collecting, by the user terminal, the voice signal, and transmitting a recognition request that contains the voice signal and relates to the user to the recognition server, so that the recognition server performs step (D) in response to the recognition request; and
between step (D) and step (E), (J) when the first target reference phoneme and the second target reference phoneme are identified, transmitting, by the recognition server, a recognition reply containing the first target reference phoneme and the second target reference phoneme to the user terminal, so that the user terminal performs step (E) in response to the recognition reply.

7. A computer program product, stored on a computer-readable medium and comprising a plurality of program instructions which, when executed by a computer system, carry out the voice input operation method based on at least two phonemes according to any one of claims 1 to 5.
TW109108337A 2020-03-13 2020-03-13 At least two phoneme-based voice input operation method and computer program product TWI752437B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW109108337A TWI752437B (en) 2020-03-13 2020-03-13 At least two phoneme-based voice input operation method and computer program product
PCT/US2020/038446 WO2021183169A1 (en) 2020-03-13 2020-06-18 Method of voice input operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109108337A TWI752437B (en) 2020-03-13 2020-03-13 At least two phoneme-based voice input operation method and computer program product

Publications (2)

Publication Number Publication Date
TW202134855A TW202134855A (en) 2021-09-16
TWI752437B true TWI752437B (en) 2022-01-11

Family

ID=77672287

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109108337A TWI752437B (en) 2020-03-13 2020-03-13 At least two phoneme-based voice input operation method and computer program product

Country Status (2)

Country Link
TW (1) TWI752437B (en)
WO (1) WO2021183169A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102272827A (en) * 2005-06-01 2011-12-07 泰吉克通讯股份有限公司 Method and apparatus utilizing voice input to resolve ambiguous manually entered text input
CN106463113A (en) * 2014-03-04 2017-02-22 亚马逊技术公司 Predicting pronunciation in speech recognition
US20180348970A1 (en) * 2017-05-31 2018-12-06 Snap Inc. Methods and systems for voice driven dynamic menus
US20190339936A1 (en) * 2018-05-07 2019-11-07 Imam Abdulrahman Bin Faisal University Smart mirror

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7146319B2 (en) * 2003-03-31 2006-12-05 Novauris Technologies Ltd. Phonetically based speech recognition system and method
US20050154587A1 (en) * 2003-09-11 2005-07-14 Voice Signal Technologies, Inc. Voice enabled phone book interface for speaker dependent name recognition and phone number categorization
US7471775B2 (en) * 2005-06-30 2008-12-30 Motorola, Inc. Method and apparatus for generating and updating a voice tag
US8903847B2 (en) * 2010-03-05 2014-12-02 International Business Machines Corporation Digital media voice tags in social networks
WO2013192535A1 (en) * 2012-06-22 2013-12-27 Johnson Controls Technology Company Multi-pass vehicle voice recognition systems and methods

Also Published As

Publication number Publication date
TW202134855A (en) 2021-09-16
WO2021183169A1 (en) 2021-09-16
