TWI687917B - Voice system and voice detection method - Google Patents
- Publication number
- TWI687917B (application number TW107107771A)
- Authority
- TW
- Taiwan
- Prior art keywords
- audio
- speaker
- voice
- mouth
- processor
- Prior art date
Landscapes
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Studio Devices (AREA)
Abstract
Description
This disclosure relates to a voice system and a sound detection method, and in particular to a voice system and a sound detection method that use a camera device.
Today's smart voice assistant devices rely on a speaker to convert keywords into commands the system can understand before the pickup microphone, voice processing, speech recognition engine, and various cloud application services can do their work. The design of the pickup microphone is the first hurdle in whether a smart speaker can accurately recognize user commands. For example, in a multi-person meeting, the pickup microphone easily picks up ambient noise or the voices of people other than the presenter. As another example, a voice assistant device placed next to a television may be accidentally triggered by the sound of an advertisement or news broadcast and execute an application service that the user did not request.
Therefore, how to keep ambient noise from interfering with recognition, how to deal with the difficulty of extracting a person's voiceprint features when several speakers are mixed, and how to prevent a voice assistant from being accidentally triggered by commands that did not come from the user have become problems to be solved.
This disclosure provides a voice system comprising a speaker, a camera device, and a processor. The speaker includes a plurality of speaker channels. According to a volume, a sound source position, or a frequency of an initial audio, the speaker opens a current audio channel among the speaker channels and closes the other speaker channels. The camera device captures a frame of the scene in front of the speaker. The processor detects a mouth shape that is in an opening-and-closing state in the frame, identifies a mouth position corresponding to the mouth shape, and outputs a main audio according to the mouth position and the current audio channel.
According to one aspect of this disclosure, a sound detection method comprises: receiving an initial audio, wherein a speaker opens a current audio channel according to a volume, a sound source position, or a frequency of the initial audio and closes the other speaker channels; capturing a frame of the scene in front of the speaker; detecting a mouth shape that is in an opening-and-closing state in the frame; identifying a mouth position corresponding to the mouth shape; and outputting a main audio according to the mouth position and the current audio channel.
In summary, by detecting the opening and closing of a mouth and its position in the frame and using them to determine the sound source, the voice system and sound detection method of this disclosure filter out noise and make sound detection more accurate, and they keep ambient noise from interfering with the main audio. In a meeting or lecture with several speakers, the presenter's voice can still be isolated. The invention can also be applied to a voice assistant system: because the operator of the voice assistant is confirmed only when an opening-and-closing mouth is recognized, the voice assistant system is not accidentally triggered by commands that did not come from the user (for example, the sound of a television advertisement).
100: voice system
10: camera device
20: processor
30: speaker
SC1~SC5: speaker channels
USR1~USR3: users
200: sound detection method
210~250, 410~470: steps
To make the above and other objects, features, advantages, and embodiments of this disclosure easier to understand, the accompanying drawings are described as follows: FIG. 1 shows a voice system according to an embodiment of this disclosure; FIG. 2 is a flowchart of a sound detection method according to an embodiment of this disclosure; FIGS. 3A~3B are schematic diagrams of a usage scenario of a voice system according to an embodiment of this disclosure; and FIG. 4 is a flowchart of a sound detection method according to an embodiment of this disclosure.
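Please refer to FIGS. 1 and 2. FIG. 1 shows a voice system 100 according to an embodiment of this disclosure. FIG. 2 is a flowchart of a sound detection method 200 according to an embodiment of this disclosure. In one embodiment, the voice system 100 includes a speaker 30, a camera device 10, and a processor 20. The speaker 30 includes multiple speaker channels SC1~SC5. The number of speaker channels is not limited to five; five is only an example, and any plural number will do.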
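In one embodiment, the device having the voice system 100 may be a conference phone, a smart voice assistant device, a laptop, a desktop computer, a mobile phone, a tablet, or another device with a display function.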
In one embodiment, the speaker channels SC1~SC5 are structures designed to receive sound. They can open or close according to the volume, direction, or frequency of the sound, which makes it possible to identify the user's position and filter noise.
In one embodiment, the camera device 10 is composed of at least one charge-coupled device (CCD) sensor or one complementary metal-oxide-semiconductor (CMOS) sensor.
In step 210, a current audio channel SC5 among the speaker channels SC1~SC5 is opened to receive an initial audio.
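Please refer to FIGS. 3A~3B, which are schematic diagrams of a usage scenario of the voice system 100 according to an embodiment of this disclosure. For ease of explanation, FIGS. 3A~3B show only the speaker channels SC1~SC5 and the camera device 10 of the voice system 100 in FIG. 1. Those of ordinary skill in the art will understand that the speaker channels SC1~SC5 in FIGS. 3A~3B are located in the voice system 100, and that the usage scenario of FIGS. 3A~3B can be realized with the voice system 100 of FIG. 1.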
For example, in FIG. 3A, users USR1~USR3 are in a meeting and user USR3 begins to speak. Because speaker channel SC5 is the closest of the speaker channels SC1~SC5 to user USR3, the volume and/or vibration it receives is larger than that received by the other speaker channels SC1~SC4. The speaker 30 therefore opens speaker channel SC5 to receive the initial audio (the speech) from user USR3, while the other speaker channels SC1~SC4 remain closed. For ease of explanation, the speaker channel SC5 that is opened to receive the initial audio is referred to in this disclosure as the current audio channel SC5.
In one embodiment, the speaker 30 determines the current audio channel SC5 according to a volume, a sound source position, or a frequency of the initial audio.
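The volume-based case can be illustrated with a short sketch. It assumes each speaker channel delivers a buffer of samples and that the loudest channel, measured by root-mean-square level, becomes the current audio channel. The function names, the RMS criterion, and the sample values are illustrative assumptions; the disclosure does not specify how the volume is measured.

```python
import math

def rms(samples):
    """Root-mean-square level of one channel's samples (0.0 for no samples)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def select_current_channel(channel_samples):
    """Open the loudest channel and close the others.

    channel_samples maps a channel name to a list of float samples.
    Returns (current_channel_name, {name: is_open}).
    """
    levels = {name: rms(samples) for name, samples in channel_samples.items()}
    current = max(levels, key=levels.get)
    is_open = {name: name == current for name in channel_samples}
    return current, is_open

# Hypothetical short buffers; SC5 is nearest to the talker, so it is loudest.
buffers = {
    "SC1": [0.01, -0.02, 0.01, -0.01],
    "SC2": [0.02, -0.02, 0.03, -0.02],
    "SC3": [0.05, -0.04, 0.05, -0.06],
    "SC4": [0.10, -0.12, 0.11, -0.09],
    "SC5": [0.40, -0.35, 0.38, -0.42],
}
current, is_open = select_current_channel(buffers)
print(current, is_open)
```

Selecting by sound source position or by frequency would need a different measure in place of `rms`, but the open-one, close-the-rest structure would stay the same.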
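In step 220, the camera device 10 captures a frame. For example, in FIG. 3A, the camera device 10 captures a frame of the meeting scene in which users USR1~USR3 can be seen. In practice, the order of step 210 and step 220 may be swapped.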
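In step 230, the processor 20 detects a mouth shape that is in an opening-and-closing state in the frame. For example, in FIG. 3A, the processor 20 determines from one or more frames captured by the camera device 10 that the mouth of user USR3 is opening and closing, which indicates that user USR3 is speaking. Mouth detection can be implemented with known face detection algorithms, so it is not described further here.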
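One way to picture the opening-and-closing test is the following sketch. It assumes a face detection stage (not shown, and not specified by the disclosure) has already produced, for one face, a per-frame mouth aperture ratio such as lip gap divided by mouth width. The threshold, the transition count, and the function name are hypothetical choices for illustration only.

```python
def is_mouth_opening_and_closing(apertures, threshold=0.3, min_transitions=2):
    """Decide whether a mouth alternates between open and closed.

    apertures: per-frame mouth opening ratios for one face (assumed input).
    A mouth counts as "open" in a frame when its ratio exceeds threshold.
    Speech is assumed when the open/closed state flips at least
    min_transitions times across the frames.
    """
    states = [a > threshold for a in apertures]
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return transitions >= min_transitions

talking = [0.1, 0.5, 0.1, 0.6, 0.2]   # opens and closes repeatedly
silent = [0.10, 0.12, 0.10, 0.11]     # stays closed
yawning = [0.1, 0.8, 0.8, 0.8]        # opens once and stays open

print(is_mouth_opening_and_closing(talking))
print(is_mouth_opening_and_closing(silent))
print(is_mouth_opening_and_closing(yawning))
```

Counting state flips rather than checking a single frame is what separates a talking mouth from one that is merely open.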
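In step 240, the processor 20 identifies a mouth position corresponding to the mouth shape. For example, when the processor 20 detects that the mouth of user USR3 is opening and closing in the frame, it can describe the mouth position by the coordinates of user USR3's mouth in that frame.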
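In step 250, the processor 20 outputs a main audio according to the mouth position and the current audio channel SC5. For example, after receiving the initial audio from the current audio channel SC5, the processor 20 uses the position of the opening-and-closing mouth to determine that, among the users USR1~USR3 in the frame, it is user USR3 who is speaking. The processor 20 therefore takes the initial audio received by the current audio channel SC5 (the speech of user USR3) as the main audio.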
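The disclosure does not say how a mouth position is matched to a speaker channel. As a rough sketch, suppose each channel covers an equal horizontal slice of the camera frame, so a mouth's x-coordinate maps to one channel, and the initial audio is accepted as the main audio only when a moving mouth falls in the slice of the current audio channel. This slice mapping and both function names are assumptions made for illustration.

```python
def channel_for_mouth(mouth_x, frame_width, channel_names):
    """Map a mouth x-coordinate to the channel covering that slice of the frame."""
    if not 0 <= mouth_x < frame_width:
        raise ValueError("mouth_x lies outside the frame")
    index = int(mouth_x * len(channel_names) / frame_width)
    return channel_names[index]

def pick_main_audio(current_channel, channel_audio, moving_mouths, frame_width):
    """Return the current channel's audio if a moving mouth backs it up.

    channel_audio: {channel name: samples}, ordered left to right.
    moving_mouths: (x, y) coordinates of mouths found to be opening and closing.
    Returns None when no moving mouth lies in the current channel's slice.
    """
    names = list(channel_audio)
    for x, _y in moving_mouths:
        if channel_for_mouth(x, frame_width, names) == current_channel:
            return channel_audio[current_channel]
    return None

audio = {"SC1": [0.0], "SC2": [0.0], "SC3": [0.0], "SC4": [0.0], "SC5": [0.4, -0.3]}

# USR3's mouth is at x=900 in a 1000-pixel-wide frame, inside SC5's slice.
print(pick_main_audio("SC5", audio, [(900, 300)], frame_width=1000))
# A mouth moving at x=100 does not support SC5, so nothing is output.
print(pick_main_audio("SC5", audio, [(100, 300)], frame_width=1000))
```

Requiring visual confirmation is what would keep a sound with no visible talker, such as a television advertisement, from being treated as the main audio.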
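In one embodiment, the processor 20 adjusts a weight applied to the initial audio according to a volume of the initial audio received by the current audio channel SC5, so that the initial audio is made louder before it is used as the main audio, and outputs this main audio to a storage device. In one embodiment, the main audio can be transmitted to another voice system and played by that system.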
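A minimal sketch of such a volume-dependent weight follows. It assumes floating-point samples in the range -1.0 to 1.0 and chooses the weight so that quiet audio is raised toward a target RMS level, never attenuated, and never amplified beyond a cap. The target level, the cap, and the function names are hypothetical; the disclosure only states that the weight depends on the received volume.

```python
import math

def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for no samples)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def apply_volume_weight(samples, target_rms=0.25, max_gain=8.0):
    """Scale the initial audio by a weight derived from its own volume.

    The weight is target_rms / measured RMS, limited to [1.0, max_gain],
    so the audio is only ever made louder. Results are clipped to [-1, 1].
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    weight = max(1.0, min(target_rms / level, max_gain))
    return [max(-1.0, min(1.0, s * weight)) for s in samples]

quiet = [0.05, -0.05, 0.05, -0.05]
loud = [0.5, -0.5, 0.5, -0.5]
print(apply_volume_weight(quiet))
print(apply_volume_weight(loud))
```

Capping the weight keeps a nearly silent buffer from being amplified into pure noise.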
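In one embodiment, the voice system 100 further includes a storage device coupled to the camera device 10, the processor 20, and/or the speaker 30. The storage device can be implemented as flash memory, a floppy disk, a hard disk, an optical disc, a USB flash drive, magnetic tape, or any storage medium with the same function that a person skilled in the art could readily think of. The storage device stores the main audio as an audio recording of the meeting.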
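In one embodiment, the processor 20 further recognizes the content of the main audio, for example with a known speech recognition algorithm (any existing speech recognition technique can be used, so it is not described further here), and displays the content of the main audio as text on a display. For example, if the processor 20 determines that the content of the main audio is "大家好" ("Hello, everyone"), it displays that text on the display (for example, a screen). In this way all users USR1~USR3 in the meeting can hear and/or read in real time what the speaking user USR3 wants to convey, and the main audio can be recorded as the audio record of the meeting while the text serves as its written record.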
Please refer to FIG. 3B and FIG. 4. FIG. 4 is a flowchart of a sound detection method 400 according to an embodiment of this disclosure. Steps 410~440 and 470 in FIG. 4 are similar to steps 210~250 in FIG. 2, respectively, so they are not described again here. Steps 450~460 of FIG. 4 are described in detail below.
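In step 450, the processor 20 determines whether the initial audio contains noise. For example, the processor 20 can use the frequency or waveform of the initial audio, a known audio analysis method (for example, the signal-to-noise ratio (SNR)), or another known noise determination method to tell whether the initial audio contains noise. If the initial audio contains noise, the method proceeds to step 460; if it does not, the method proceeds to step 470.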
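The SNR-based branch can be sketched as follows. The standard definition SNR(dB) = 10 * log10(P_signal / P_noise) is used, with the two powers assumed to be estimated elsewhere. The 20 dB threshold and the function names are illustrative assumptions; the disclosure names SNR as one possible method but gives no threshold.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)

def contains_noise(signal_power, noise_power, min_snr_db=20.0):
    """Treat the initial audio as noisy when its SNR is below min_snr_db."""
    if noise_power <= 0.0:
        return False  # no measurable noise
    return snr_db(signal_power, noise_power) < min_snr_db

def next_step(signal_power, noise_power):
    """Step 450 branch: go to 460 (filter) if noisy, else 470 (output)."""
    return 460 if contains_noise(signal_power, noise_power) else 470

print(next_step(100.0, 50.0))    # about 3 dB, noisy
print(next_step(100.0, 0.001))   # about 50 dB, clean
```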
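In step 460, the processor 20 filters out another audio emitted from a position other than the mouth position. For example, as shown in FIG. 3B, while the presenter (user USR3) is speaking in the meeting room (at the mouth position in the frame), users USR1 and USR2 are talking to each other in lower voices (at another position in the frame). The initial audio received by the current audio channel SC5 then contains both the louder speech of user USR3 and the quieter discussion between users USR1 and USR2. The current audio channel SC5 passes the initial audio to the processor 20, which can use the frequency or waveform of the initial audio or a known audio analysis method (for example, the signal-to-noise ratio) to identify the speech of user USR3 as the main audio and the other discussion as noise, and then removes the noise and keeps only the main audio.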
In one embodiment, the presenter (user USR3) usually speaks louder and so shows larger changes in mouth shape, while users USR1 and USR2, who are talking in low voices, make quieter sounds and show smaller changes in mouth shape. The processor 20 can therefore also use the size of the mouth opening at each position in the frames captured by the camera device 10 as a further reference for deciding which sound is the main audio, which makes its analysis of the main audio more accurate.
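As an illustration of using mouth opening size as a reference, the sketch below picks the position whose mouth aperture varies the most across frames. The per-frame aperture ratios are assumed to come from an upstream face analysis stage, and the "largest swing" criterion and function name are hypothetical simplifications.

```python
def most_active_mouth(aperture_by_position):
    """Return the position whose mouth aperture varies the most.

    aperture_by_position maps an (x, y) mouth position to a list of
    per-frame aperture ratios (assumed input). The swing is max - min.
    """
    def swing(apertures):
        return max(apertures) - min(apertures) if apertures else 0.0

    return max(aperture_by_position, key=lambda pos: swing(aperture_by_position[pos]))

observed = {
    (900, 300): [0.1, 0.6, 0.1, 0.7],   # presenter, wide mouth movement
    (150, 320): [0.1, 0.2, 0.1, 0.2],   # low-voice discussion
    (400, 310): [0.1, 0.1, 0.1, 0.1],   # listening
}
print(most_active_mouth(observed))
```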
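In one embodiment, the processor 20 can emphasize the main audio by increasing the volume weight of the main audio (for example, the presenter) and decreasing the volume weight of the noise (for example, people talking in low voices or ambient noise).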
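The weighting can be pictured as a weighted sum of two components. The sketch assumes the main audio and the noise have already been separated into two equal-length sample lists, which is a strong simplification, and the default weights and function name are arbitrary illustrative values.

```python
def emphasize_main_audio(main, noise, main_weight=1.0, noise_weight=0.1):
    """Mix main audio and noise with a large weight on the former.

    main and noise are equal-length sample lists (assumed to be already
    separated). Returns the weighted sum sample by sample.
    """
    if len(main) != len(noise):
        raise ValueError("main and noise must have the same length")
    return [main_weight * m + noise_weight * n for m, n in zip(main, noise)]

presenter = [0.5, -0.5, 0.4]
whispers = [0.2, 0.2, -0.1]
print(emphasize_main_audio(presenter, whispers))
```

Setting `noise_weight` to 0.0 would correspond to removing the noise entirely, as in step 460.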
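As described above, the processor 20 records what each speaker (such as user USR3) says in the storage device. Because unnecessary noise has been filtered out of the initial audio and only the speaker's audio is stored, the load on the processor 20 in subsequent audio processing is reduced. In addition, in some embodiments, the volume weight of the position in the frame where the mouth opens and closes most visibly can be increased to compute an adjusted result, and this adjusted result is taken as the main audio. In some embodiments, speech is limited by volume and distance. As shown in FIG. 3B, the current audio channel SC5 is closer to user USR3 (the presenter), and the presenter's voice is usually louder, so the sound that the current audio channel SC5 receives from user USR3 is louder. The processor 20 can therefore confirm that user USR3 is speaking and filter out the voices of users USR1 and USR2 (the people talking privately).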
In summary, by detecting the opening and closing of a mouth and its position in the frame and using them to determine the sound source, the voice system and sound detection method of this disclosure filter out noise and make sound detection more accurate, and they keep ambient noise from interfering with the main audio. In a meeting or lecture with several speakers, the presenter's voice can still be isolated. The invention can also be applied to a voice assistant system: because the operator of the voice assistant is confirmed only when an opening-and-closing mouth is recognized, the voice assistant system is not accidentally triggered by commands that did not come from the user (for example, the sound of a television advertisement).
Although this disclosure has been described above by way of embodiments, they are not intended to limit it. Anyone skilled in the art may make various changes and refinements without departing from the spirit and scope of this disclosure, so the scope of protection is defined by the appended claims.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW107107771A TWI687917B (en) | 2018-03-07 | 2018-03-07 | Voice system and voice detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW107107771A TWI687917B (en) | 2018-03-07 | 2018-03-07 | Voice system and voice detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201939483A TW201939483A (en) | 2019-10-01 |
TWI687917B true TWI687917B (en) | 2020-03-11 |
Family
ID=69023298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW107107771A TWI687917B (en) | 2018-03-07 | 2018-03-07 | Voice system and voice detection method |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI687917B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230283740A1 (en) * | 2022-03-03 | 2023-09-07 | International Business Machines Corporation | Front-end clipping using visual cues |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100007665A1 (en) * | 2002-08-14 | 2010-01-14 | Shawn Smith | Do-It-Yourself Photo Realistic Talking Head Creation System and Method |
US20110131041A1 (en) * | 2009-11-27 | 2011-06-02 | Samsung Electronica Da Amazonia Ltda. | Systems And Methods For Synthesis Of Motion For Animation Of Virtual Heads/Characters Via Voice Processing In Portable Devices |
US20150228278A1 (en) * | 2013-11-22 | 2015-08-13 | Jonathan J. Huang | Apparatus and method for voice based user enrollment with video assistance |
TW201621758A (en) * | 2014-12-11 | 2016-06-16 | 由田新技股份有限公司 | Method and apparatus for detecting person to use handheld device |
-
2018
- 2018-03-07 TW TW107107771A patent/TWI687917B/en active
Also Published As
Publication number | Publication date |
---|---|
TW201939483A (en) | 2019-10-01 |