TW201510770A - Method for activating voice recognition of display device

Info

Publication number: TW201510770A
Application number: TW102131661A
Authority: TW
Inventors: Hung-Wang Hsu; Shih-Chieh Hsiao; Yu-Tsung Hsu
Original assignee: Top Victory Invest Ltd
Priority date: 2013-09-03
Filing date: 2013-09-03
Publication date: 2015-03-16

Abstract

A method for activating voice recognition of a display device that includes a camera, a speaker and a microphone is provided. The method includes: when a voice recognition function is enabled, turning on the camera to capture images and performing image recognition; when a specific gesture is detected, lowering or muting the volume of the sound from the speaker, turning on the microphone to record voice, and performing voice recognition; when a voice command is detected, performing a corresponding control action according to the detected voice command; and when a voice command for exiting voice recognition is detected, or no voice command is detected for a predetermined period of time, terminating voice recognition and restoring the volume of the sound from the speaker. The invention may reduce interference from the sound of the display device itself, shorten command input time, lower the complexity of use, and fit users' habits.

Description

Method for activating voice recognition of a display

The present invention relates to a method for activating voice recognition, and in particular to a method for activating voice recognition of a display.

For current displays, such as computer monitors and televisions, voice recognition is becoming increasingly common. There are three common methods for activating voice recognition on a display: "Free Talk" (speaking a voice command directly), "Voice Trigger to Talk" (activating input by voice and then speaking the voice command), and "Push to Talk" (activating input with a button press and then speaking the voice command). In the "Free Talk" method, the display records voice and performs voice recognition at all times. In the "Voice Trigger to Talk" method, the display records voice at all times and uses two-stage voice recognition: the first stage recognizes only a few preset voice commands, and when one of them is recognized, the display first lowers or mutes its sound volume and then enters the second stage, where it waits for and recognizes the complete voice command. In the "Push to Talk" method, after the display detects that a specific button on the remote control has been pressed, it first lowers or mutes its sound volume and then records voice and performs voice recognition.
Of these three methods, "Free Talk" is the one users accept most readily, but interference from the sound of the audio-visual content the display itself is playing often causes recognition to fail, resulting in a wrong action or no action. "Voice Trigger to Talk" uses two-stage voice recognition, which reduces the interference from the display's own sound that affects "Free Talk", but because it effectively requires recognizing two voice commands, the overall command input time is long and the complexity of use is high. "Push to Talk" both reduces that interference and avoids the long input time and high complexity of "Voice Trigger to Talk", but the design runs against users' habits: when a user is already holding the remote control, entering a command directly with the remote control is fast and intuitive, whereas entering it by voice is comparatively slow and is sometimes misrecognized.

An object of the present invention is to provide a method for activating voice recognition of a display that reduces interference from the sound played by the display itself, overcomes the drawbacks of long command input time and high complexity of use, and fits users' habits.
To achieve the above object, the present invention provides a method for activating voice recognition of a display, the display comprising a camera, a speaker and a microphone, the method comprising:
when the voice recognition function is enabled, turning on the camera to capture images and performing image recognition;
when a specific gesture is recognized, controlling the speaker to lower the sound volume or mute, then turning on the microphone to record voice and performing voice recognition;
when a voice command is recognized, performing a corresponding control action according to the recognized voice command; and
when a voice command for exiting voice recognition is recognized, or when no voice command is recognized within a predetermined period of time, ending voice recognition and controlling the speaker to restore the sound volume.
In an embodiment of the present invention, the method further comprises: when no specific gesture is recognized, controlling the camera to continue capturing images and performing image recognition.
In an embodiment of the present invention, the method further comprises: when voice recognition has not yet ended, controlling the microphone to continue recording voice and performing voice recognition.
In an embodiment of the present invention, the method further comprises: when voice recognition is ended, also controlling the microphone to stop recording voice.
In an embodiment of the present invention, the specific gesture includes waving or clenching a fist.
In an embodiment of the present invention, the display includes a computer monitor or a television.
Because the present invention lowers or mutes the display's sound volume when the display recognizes a specific gesture, and only then records voice and performs voice recognition, it reduces interference from the sound played by the display itself and thereby improves recognition accuracy, overcomes the drawbacks of long command input time and high complexity of use, and, by using gesture recognition to activate voice recognition, better fits users' habits.
To make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

S1‧‧‧Camera captures images
S2‧‧‧Is a specific gesture recognized?
S3‧‧‧Control the speaker to lower the sound volume or mute
S4‧‧‧Microphone records voice; perform voice recognition
S5‧‧‧End voice recognition?
S6‧‧‧Control the speaker to restore the sound volume
S7‧‧‧Perform the corresponding control action according to the voice command

FIG. 1 is a flowchart of a method for activating voice recognition of a display according to a preferred embodiment of the present invention.
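
Since the drawing itself is not reproduced in this text, the flow of FIG. 1 as laid out in the detailed description can be sketched as a small transition table. This is only an illustrative reading of the flowchart, not part of the disclosure: the names `Step`, `TRANSITIONS`, `next_step` and `walk` are hypothetical, and each decision step (S2, S4, S5) is reduced to a yes/no outcome.

```python
from enum import Enum


class Step(Enum):
    S1 = "camera captures images"
    S2 = "specific gesture recognized?"
    S3 = "lower or mute speaker volume"
    S4 = "record voice and run voice recognition"
    S5 = "end voice recognition?"
    S6 = "restore speaker volume"
    S7 = "perform control action for the voice command"


# (step, outcome) -> next step. An outcome of None marks an unconditional step.
TRANSITIONS = {
    (Step.S1, None): Step.S2,
    (Step.S2, False): Step.S1,  # no gesture: keep capturing images
    (Step.S2, True): Step.S3,
    (Step.S3, None): Step.S4,
    (Step.S4, False): Step.S5,  # no voice command recognized
    (Step.S4, True): Step.S7,   # voice command recognized
    (Step.S7, None): Step.S5,
    (Step.S5, False): Step.S4,  # not ended: keep listening
    (Step.S5, True): Step.S6,   # exit command or timeout
    (Step.S6, None): Step.S1,
}

DECISION_STEPS = {Step.S2, Step.S4, Step.S5}


def next_step(step, outcome=None):
    """Return the step that follows `step` for the given outcome."""
    return TRANSITIONS[(step, outcome)]


def walk(decisions, start=Step.S1):
    """Follow the flow from `start`, consuming one yes/no outcome per
    decision step, and stop at the first decision with no outcome left."""
    remaining = list(decisions)
    step, path = start, [start.name]
    while True:
        if step in DECISION_STEPS:
            if not remaining:
                return path
            outcome = remaining.pop(0)
        else:
            outcome = None
        step = next_step(step, outcome)
        path.append(step.name)
```

For example, `walk([True, True, True])` traces a gesture, one recognized command and then the end of voice recognition, returning to image capture afterwards.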

Please refer to FIG. 1, which is a flowchart of a method for activating voice recognition of a display according to a preferred embodiment of the present invention. The display may be a computer monitor, a television or another type of display, but it must have a built-in or externally connected camera, speaker and microphone. In step S1, when the voice recognition function of the display is enabled, the display turns on the camera to capture images and performs image recognition. In step S2, the display determines whether a specific gesture has been recognized; the specific gesture may be set to waving, clenching a fist or another type of gesture.
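
The loop between steps S1 and S2 can be sketched as follows. This is a minimal illustration under stated assumptions: the image recognizer is passed in as a hypothetical `classify` callable that maps a frame to a gesture label, and the labels `"wave"` and `"fist"` are placeholder names for the configurable trigger gestures; none of these names come from the disclosure.

```python
# Hypothetical labels for the configurable trigger gestures (waving, fist).
TRIGGER_GESTURES = frozenset({"wave", "fist"})


def wait_for_gesture(frames, classify, triggers=TRIGGER_GESTURES):
    """S1/S2 loop: examine captured frames one by one until the recognizer
    reports a trigger gesture.

    Returns (frame_index, label) for the first triggering frame, or None
    if the frames run out without a trigger gesture.
    """
    for index, frame in enumerate(frames):  # S1: next captured image
        label = classify(frame)             # image recognition (assumed)
        if label in triggers:               # S2: specific gesture?
            return index, label
    return None
```

Keeping the trigger set as a parameter mirrors the statement that the specific gesture can be configured.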
When the display determines in step S2 that no specific gesture has been recognized, it returns to step S1 and controls the camera to continue capturing images and performing image recognition. When the display determines in step S2 that the specific gesture has been recognized, step S3 is performed: the display first stores the volume value of the sound that the speaker is producing for the audio-visual content being played, and then controls the speaker to lower the sound volume or mute. Next, step S4 is performed: the display turns on the microphone to record voice and performs voice recognition.
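
The store-then-lower behavior of step S3, together with the later restore in step S6, might look like the sketch below. The `Speaker` class is a hypothetical stand-in for the display's volume control, and the guard against overwriting an already stored value is an added assumption rather than something the description specifies.

```python
class Speaker:
    """Hypothetical stand-in for the display's speaker volume control."""

    def __init__(self, volume):
        self.volume = volume


class VolumeDucker:
    """Stores the playback volume before lowering it (S3) and restores the
    stored value afterwards (S6)."""

    def __init__(self, speaker, ducked_volume=0):
        self.speaker = speaker
        self.ducked_volume = ducked_volume  # 0 means mute
        self._saved = None

    def duck(self):
        # S3: store the current volume first, then lower it.
        # Assumption: a repeated call must not overwrite the stored value.
        if self._saved is None:
            self._saved = self.speaker.volume
        # Only ever lower the volume, never raise it.
        self.speaker.volume = min(self.speaker.volume, self.ducked_volume)

    def restore(self):
        # S6: restore the volume stored in S3, if any.
        if self._saved is not None:
            self.speaker.volume = self._saved
            self._saved = None
```

Storing the value before lowering is what allows step S6 to return to the exact level the user had chosen, whether the speaker was lowered or muted.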
When the display does not recognize a voice command in step S4, step S5 is performed, in which the display determines whether no voice command has been recognized for a predetermined period of time. When the display determines in step S5 that the time without a recognized voice command has not yet reached the predetermined period, voice recognition has not ended, so the flow returns to step S4 and the display controls the microphone to continue recording voice and performing voice recognition. When the display determines in step S5 that no voice command has been recognized for the predetermined period, voice recognition is to be ended, so step S6 is performed: the display controls the speaker to restore the sound volume according to the volume value stored in step S3, and then returns to step S1 to continue the flow.
When the display recognizes a voice command in step S4, step S7 is performed, in which the display performs the corresponding control action according to the recognized voice command; step S5 is then performed, in which the display determines whether to end voice recognition. The voice command recognized by the display may be, for example, adjusting brightness, adjusting volume, switching channels, exiting voice recognition, or another type of voice command. When the voice command recognized in step S7 is the command for exiting voice recognition, the display, in the subsequently performed step S5, ends voice recognition because it determines that an exit command was given, and performs step S6. When the voice command recognized in step S7 is any command other than exiting voice recognition, the display, in the subsequently performed step S5, resets the timer that measures the predetermined period, because it determines that a voice command was input, and returns to step S4 to continue recording voice and performing voice recognition.
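
The interplay of steps S4, S5 and S7, in particular the timer reset on every non-exit command, can be illustrated with the sketch below. It makes several assumptions that the description leaves open: recognition results arrive as `(timestamp, command)` pairs with `None` meaning nothing was recognized, the timer starts at the first pass through S4, the exit command is the hypothetical string `EXIT_COMMAND`, and the exit command itself triggers no control action.

```python
EXIT_COMMAND = "exit voice recognition"  # hypothetical command text


def run_session(polls, timeout, act):
    """Simulate one listening session (steps S4, S5 and S7).

    polls   -- sequence of (timestamp, command or None), one per pass of S4
    timeout -- the predetermined period, in the same unit as the timestamps
    act     -- callable that performs the control action for a command (S7)

    Returns why the session ended: "exit", "timeout" or "input exhausted".
    """
    polls = list(polls)
    if not polls:
        return "input exhausted"
    last_activity = polls[0][0]  # assumption: timer starts on entering S4
    for now, command in polls:
        if command is None:
            # S5: has the predetermined period passed without a command?
            if now - last_activity >= timeout:
                return "timeout"
            continue  # not yet: back to S4
        if command == EXIT_COMMAND:
            return "exit"  # S5: exit command ends the session
        act(command)         # S7: perform the corresponding control action
        last_activity = now  # S5: reset the timer and return to S4
    return "input exhausted"
```

With a timeout of 5 and polls at times 0, 2 (a command), 6 and 7, the session ends at time 7 rather than time 6, because the command at time 2 reset the timer.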
In other words, in step S5, when the display determines that the voice command for exiting voice recognition has been recognized, or when no voice command has been recognized for the predetermined period of time, the display ends voice recognition and performs step S6. In step S5, when the display determines that a voice command other than exiting voice recognition has been recognized, it returns to step S4 to continue recording voice and performing voice recognition.
In summary, because the present invention lowers or mutes the display's sound volume when the display recognizes a specific gesture, and only then records voice and performs voice recognition, it reduces interference from the sound played by the display itself and thereby improves recognition accuracy, overcomes the drawbacks of long command input time and high complexity of use, and, by using gesture recognition to activate voice recognition, better fits users' habits.
Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.


Claims

1. A method for activating voice recognition of a display, the display comprising a camera, a speaker and a microphone, the method comprising:

when the voice recognition function is enabled, turning on the camera to capture images and performing image recognition;

when a specific gesture is recognized, controlling the speaker to lower the sound volume or mute, then turning on the microphone to record voice and performing voice recognition;

when a voice command is recognized, performing a corresponding control action according to the recognized voice command; and

when a voice command for exiting voice recognition is recognized, or when no voice command is recognized for a predetermined period of time, ending voice recognition and controlling the speaker to restore the sound volume.

2. The method for activating voice recognition of a display according to claim 1, further comprising:

when no specific gesture is recognized, controlling the camera to continue capturing images and performing image recognition.

3. The method for activating voice recognition of a display according to claim 1, further comprising:

when voice recognition has not yet ended, controlling the microphone to continue recording voice and performing voice recognition.

4. The method for activating voice recognition of a display according to claim 1, further comprising:

when voice recognition is ended, also controlling the microphone to stop recording voice.

5. The method for activating voice recognition of a display according to claim 1, wherein the specific gesture comprises waving or clenching a fist.

6. The method for activating voice recognition of a display according to claim 1, wherein the display comprises a computer monitor or a television.