TWI687917B - Voice system and voice detection method

Info

Publication number: TWI687917B
Application number: TW107107771A
Authority: TW
Inventors: 張嘉凌; 李建緯
Original assignee: 宏碁股份有限公司
Priority date: 2018-03-07
Filing date: 2018-03-07
Publication date: 2020-03-11
Also published as: TW201939483A

Abstract

A voice detection method comprises: receiving an initial voice, wherein a current voice channel of a speaker is opened to receive the initial voice and the other voice channels of the speaker are closed; capturing a frame; detecting a mouth shape in an opening-and-closing state in the frame; recognizing a mouth position corresponding to the mouth shape; and outputting a main voice according to the mouth position and the current voice channel.

Description

Voice system and sound detection method

The present disclosure relates to a voice system and a sound detection method, and in particular to a voice system and a sound detection method that use a camera device.

Today's smart voice assistant devices rely on a speaker to convert keywords into commands the system can understand before the sound-pickup microphone, voice processing, the voice recognition engine, and the various cloud application services can operate. The design of the pickup microphone is the first hurdle that determines whether a smart speaker can accurately recognize a user's commands. For example, in a multi-person meeting the pickup microphone easily picks up ambient noise or the speech of people other than the presenter. As another example, a voice assistant device placed next to a television may be accidentally triggered by the sound of an advertisement or news program playing on the television, and then execute an application service that the user did not request.

Therefore, how to keep environmental noise from interfering with recognition, how to overcome the difficulty of extracting a person's voiceprint features when several speakers are mixed, and how to prevent a voice assistant from being accidentally triggered by commands that the user did not issue have become problems to be solved.

The present disclosure provides a voice system that includes a speaker, a camera device, and a processor. The speaker includes a plurality of speaker channels. According to a volume, a sound source position, or a frequency of an initial audio, the speaker opens a current audio channel among the speaker channels and closes the other speaker channels. The camera device captures a frame of the area in front of the speaker. The processor detects a mouth shape that is in an opening-and-closing state in the frame, identifies a mouth position corresponding to the mouth shape, and outputs a main audio according to the mouth position and the current audio channel.

According to one aspect of the present disclosure, a sound detection method includes: receiving an initial audio, wherein a speaker opens a current audio channel according to a volume, a sound source position, or a frequency of the initial audio and closes the other speaker channels; capturing a frame of the area in front of the speaker; detecting a mouth shape that is in an opening-and-closing state in the frame; identifying a mouth position corresponding to the mouth shape; and outputting a main audio according to the mouth position and the current audio channel.

In summary, by detecting the opening and closing of a mouth and the mouth position in the frame and using them to determine the sound source, the voice system and sound detection method filter out noise and make sound detection more accurate, and they keep environmental noise from interfering with the main audio. In a meeting or lecture with several speakers, the presenter's voice can still be isolated. The disclosure can also be applied to a voice assistant system: because the operator of the voice assistant is confirmed only after an opening-and-closing mouth is recognized, the voice assistant system is not accidentally triggered by commands that do not come from the user (for example, the sound of a television advertisement).

100: voice system

10: camera device

20: processor

30: speaker

SC1~SC5: speaker channels

USR1~USR3: users

200: sound detection method

210~250, 410~470: steps

To make the above and other objects, features, advantages, and embodiments of the present disclosure easier to understand, the accompanying drawings are described as follows: FIG. 1 shows a voice system according to an embodiment of the present disclosure; FIG. 2 is a flowchart of a sound detection method according to an embodiment of the present disclosure; FIGS. 3A~3B are schematic diagrams of usage scenarios of a voice system according to an embodiment of the present disclosure; and FIG. 4 is a flowchart of a sound detection method according to an embodiment of the present disclosure.

Please refer to FIGS. 1~2. FIG. 1 shows a voice system 100 according to an embodiment of the present disclosure. FIG. 2 is a flowchart of a sound detection method 200 according to an embodiment of the present disclosure. In one embodiment, the voice system 100 includes a speaker 30, a camera device 10, and a processor 20. The speaker 30 includes a plurality of speaker channels SC1~SC5. The number of speaker channels is not limited to five; five is only an example, and any plural number may be used.

In one embodiment, the voice system 100 may be a conference phone device, a smart voice assistant device, a laptop computer, a desktop computer, a mobile phone, a tablet, or another device with a display function.

In one embodiment, the speaker channels SC1~SC5 are structures designed to receive sound. They can open and close according to the volume, direction, or frequency of a sound, which makes it possible to identify the user's position and to filter noise.

In one embodiment, the camera device 10 is composed of at least one charge-coupled device (CCD) sensor or complementary metal-oxide-semiconductor (CMOS) sensor.

In step 210, a current audio channel SC5 among the speaker channels SC1~SC5 is opened to receive an initial audio.

Please refer to FIGS. 3A~3B, which are schematic diagrams of a usage scenario of the voice system 100 according to an embodiment of the present disclosure. For ease of description, FIGS. 3A~3B show only the speaker channels SC1~SC5 and the camera device 10 of the voice system 100 of FIG. 1. Those of ordinary skill in the art will understand that the speaker channels SC1~SC5 in FIGS. 3A~3B are located in the voice system 100, and that the usage scenarios of FIGS. 3A~3B can be realized with the voice system 100 of FIG. 1.

For example, in FIG. 3A the users USR1~USR3 are in a meeting and the user USR3 begins to speak. Because the speaker channel SC5 is the closest of the speaker channels SC1~SC5 to the user USR3, the volume and/or vibration it receives is larger than that received by the other speaker channels SC1~SC4. The speaker 30 therefore opens the speaker channel SC5 to receive the initial audio (the speech) from the user USR3, while the other speaker channels SC1~SC4 remain closed. For ease of description, the speaker channel SC5 that is opened to receive the initial audio is referred to as the current audio channel SC5.

In one embodiment, the speaker 30 determines the current audio channel SC5 according to a volume, a sound source position, or a frequency of the initial audio.
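The channel selection described above can be sketched as follows. This is a minimal illustration that uses only the volume criterion; the `ChannelReading` type, the `select_current_channel` function, and the sample levels are hypothetical and do not come from the patent, which does not specify how the speaker makes this decision.

```python
from dataclasses import dataclass


@dataclass
class ChannelReading:
    name: str      # channel label, e.g. "SC5"
    volume: float  # measured level of the initial audio (assumed unit: RMS amplitude)


def select_current_channel(readings):
    """Open the loudest channel and report the rest as closed.

    Returns (current_channel_name, closed_channel_names).
    """
    if not readings:
        raise ValueError("at least one channel reading is required")
    current = max(readings, key=lambda r: r.volume)
    closed = [r.name for r in readings if r.name != current.name]
    return current.name, closed


# Hypothetical levels for the FIG. 3A scenario, where USR3 sits nearest SC5.
readings = [
    ChannelReading("SC1", 0.02),
    ChannelReading("SC2", 0.03),
    ChannelReading("SC3", 0.05),
    ChannelReading("SC4", 0.12),
    ChannelReading("SC5", 0.31),
]
current, closed = select_current_channel(readings)
print(current, closed)
```

A sound source position or frequency criterion could be substituted by changing the key used in `max`.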

In step 220, the camera device 10 captures a frame. For example, in FIG. 3A the camera device 10 captures a frame of the meeting site in which the users USR1~USR3 appear. In practice, the order of step 210 and step 220 may be swapped.

In step 230, the processor 20 detects a mouth shape that is in an opening-and-closing state in the frame. For example, in FIG. 3A the processor 20 determines from one or more frames captured by the camera device 10 that the mouth of the user USR3 is opening and closing, which indicates that the user USR3 is speaking. Mouth-shape detection can be implemented with known face detection algorithms, so it is not described further here.
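One way to decide that a mouth is in an opening-and-closing state is sketched below. The patent leaves the detection algorithm to known face detection techniques, so everything here is an assumption: the sketch supposes that some landmark detector has already supplied lip coordinates for each frame, and the helper names and the `min_swing` threshold are hypothetical.

```python
def mouth_open_ratio(top_y, bottom_y, left_x, right_x):
    """Lip gap divided by mouth width, so the value is independent of face size."""
    width = abs(right_x - left_x)
    if width == 0:
        raise ValueError("mouth width must be non-zero")
    return abs(bottom_y - top_y) / width


def is_mouth_moving(ratios, min_swing=0.15):
    """Treat the mouth as opening and closing if the ratio varies enough across frames.

    min_swing is an assumed tuning value, not a figure from the patent.
    """
    if len(ratios) < 2:
        return False
    return max(ratios) - min(ratios) >= min_swing


# Hypothetical per-frame ratios for a talking user and a silent user.
talking = [0.05, 0.30, 0.10, 0.35]
silent = [0.05, 0.06, 0.05]
print(is_mouth_moving(talking), is_mouth_moving(silent))
```

Using the swing across several frames rather than a single frame avoids treating a mouth that is simply held open as speech.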

In step 240, the processor 20 identifies a mouth position corresponding to the mouth shape. For example, when the processor 20 detects that the mouth of the user USR3 is opening and closing in the frame, it can describe the mouth position by the coordinates of that mouth within the frame.

In step 250, the processor 20 outputs a main audio according to the mouth position and the current audio channel SC5. For example, after receiving the initial audio from the current audio channel SC5, the processor 20 uses the position of the opening-and-closing mouth to determine accurately that, among the users USR1~USR3 in the frame, it is the user USR3 who is speaking. The processor 20 therefore takes the initial audio received by the current audio channel SC5 (the speech of the user USR3) as the main audio.
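The combination of mouth position and current audio channel in step 250 might be sketched as below. The patent does not state how the two are matched, so this sketch assumes a hypothetical calibration table, `CHANNEL_X`, that gives the horizontal position of each channel in normalized frame coordinates (0.0 is the left edge and 1.0 is the right edge).

```python
# Assumed calibration: where each speaker channel sits along the frame's x-axis.
CHANNEL_X = {"SC1": 0.1, "SC2": 0.3, "SC3": 0.5, "SC4": 0.7, "SC5": 0.9}


def channel_nearest_to(mouth_x, channel_x=CHANNEL_X):
    """Return the channel whose assumed position is closest to the mouth."""
    return min(channel_x, key=lambda name: abs(channel_x[name] - mouth_x))


def confirm_main_audio(mouth_x, current_channel, initial_audio):
    """Accept the initial audio as main audio only if the moving mouth
    lies in front of the channel that is currently open; otherwise None."""
    if channel_nearest_to(mouth_x) == current_channel:
        return initial_audio
    return None


samples = [0.1, 0.2, -0.1]
print(confirm_main_audio(0.85, "SC5", samples))  # mouth near SC5: audio accepted
print(confirm_main_audio(0.12, "SC5", samples))  # mouth near SC1: audio rejected
```

Rejecting audio when no moving mouth lines up with the open channel is what would keep a sound such as a television advertisement from being treated as user speech.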

In one embodiment, the processor 20 adjusts a weight applied to the initial audio according to the volume of the initial audio received by the current audio channel SC5, so that the initial audio is made louder and used as the main audio, and outputs the main audio to a storage device. In one embodiment, the main audio can be transmitted to another voice system and played by that voice system.

In one embodiment, the voice system 100 further includes a storage device coupled to the camera device 10, the processor 20, and/or the speaker 30. The storage device may be implemented as a flash memory, a floppy disk, a hard disk, an optical disc, a USB flash drive, a magnetic tape, or any storage medium with the same function that those skilled in the art can readily conceive of. The storage device stores the main audio as an audio recording of the meeting.

In one embodiment, the processor 20 further recognizes the content of the main audio, for example with a known speech recognition algorithm (any existing speech recognition technique may be used, so it is not described further here), and displays the content of the main audio as text on a display. For example, if the processor 20 determines that the content of the main audio is "Hello everyone", the text "Hello everyone" is shown on the display (for example, a screen). All users USR1~USR3 in the meeting can thus hear and/or read in real time what the presenting user USR3 is saying. At the same time, the main audio can be recorded as an audio recording of the meeting and the text saved as a written record of the meeting.

Please refer to FIGS. 3B and 4. FIG. 4 is a flowchart of a sound detection method 400 according to an embodiment of the present disclosure. Steps 410~440 and 470 in FIG. 4 are similar to steps 210~250 in FIG. 2, respectively, so they are not described again here. Steps 450~460 in FIG. 4 are described in detail below.

In step 450, the processor 20 determines whether the initial audio contains noise. For example, the processor 20 can use the frequency and waveform of the initial audio and known audio analysis methods (for example, the signal-to-noise ratio (SNR)) or other known noise detection methods to determine whether the initial audio contains noise. If the initial audio contains noise, the method proceeds to step 460; if it does not, the method proceeds to step 470.
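The branch in step 450 can be illustrated with a signal-to-noise ratio check. The patent only names SNR as one known analysis method; the decision rule below (treat the audio as noisy when the SNR falls below a threshold), the 20 dB default, and the assumption that a separate noise-floor estimate is available are all illustrative choices rather than details from the patent.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a block of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """SNR in decibels: 10 * log10(P_signal / P_noise)."""
    p_noise = mean_power(noise)
    p_signal = mean_power(signal)
    if p_noise == 0:
        return math.inf
    if p_signal == 0:
        return -math.inf
    return 10 * math.log10(p_signal / p_noise)


def contains_noise(signal, noise, threshold_db=20.0):
    """Assumed rule: below the threshold go to step 460, otherwise step 470."""
    return snr_db(signal, noise) < threshold_db


speech = [1.0, -1.0, 1.0, -1.0]
background = [0.1, -0.1, 0.1, -0.1]
print(round(snr_db(speech, background), 3))
```

With these sample values the amplitude ratio is 10, which corresponds to a power ratio of 100, or about 20 dB.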

In step 460, the processor 20 filters out another audio that originates from a position other than the mouth position. For example, as shown in FIG. 3B, while the presenter (the user USR3) is speaking in the meeting room (at the mouth position in the frame), the users USR1 and USR2 are talking to each other more quietly (at another position in the frame). The initial audio received by the current audio channel SC5 then contains both the louder speech of the user USR3 and the quieter discussion between the users USR1 and USR2. The current audio channel SC5 sends the initial audio to the processor 20, which can use the frequency and waveform of the initial audio and known audio analysis methods (for example, the signal-to-noise ratio) to identify the speech of the user USR3 as the main audio and the discussion as noise, and then remove the noise so that only the main audio remains.

In one embodiment, the presenter (the user USR3) usually speaks more loudly, so the presenter's mouth shape changes more, while the users USR1 and USR2, who are talking quietly, produce less sound and their mouth shapes change less. The processor 20 can therefore also use the size of the mouth opening at each position in the frames captured by the camera device 10 as a further reference for deciding which sound is the main audio, which makes its analysis of the main audio more accurate.
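Using the size of the mouth opening as a tiebreaker could look like the sketch below. It assumes that a per-frame mouth-opening ratio has already been measured for each person in the frame; the `pick_main_speaker` function and the sample values are hypothetical.

```python
def opening_swing(ratios):
    """Difference between the widest and narrowest mouth opening observed."""
    return max(ratios) - min(ratios) if ratios else 0.0


def pick_main_speaker(openness_by_user):
    """Return the user whose mouth opening varies the most across frames."""
    if not openness_by_user:
        raise ValueError("no users in frame")
    return max(openness_by_user, key=lambda user: opening_swing(openness_by_user[user]))


# Hypothetical measurements for the FIG. 3B scenario.
openness = {
    "USR1": [0.05, 0.10, 0.06],  # quiet side discussion
    "USR2": [0.04, 0.09, 0.05],  # quiet side discussion
    "USR3": [0.05, 0.35, 0.10],  # presenter
}
print(pick_main_speaker(openness))
```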

In one embodiment, the processor 20 can emphasize the main audio by increasing the volume weight of the main audio (for example, the presenter) and decreasing the volume weight of the noise (for example, people talking quietly or environmental noise).
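The volume weighting can be sketched as a weighted sum of two sample streams. This assumes the main audio and the noise have already been separated into equal-length streams, which the patent does not detail; the function name and the default weights are hypothetical.

```python
def emphasize_main(main, noise, main_weight=1.5, noise_weight=0.2):
    """Mix two equal-length sample lists, boosting the main audio and
    attenuating the noise. The default weights are illustrative only."""
    if len(main) != len(noise):
        raise ValueError("main and noise must have the same length")
    return [main_weight * m + noise_weight * n for m, n in zip(main, noise)]


presenter = [1.0, 2.0]
side_talk = [1.0, 0.0]
print(emphasize_main(presenter, side_talk, main_weight=2.0, noise_weight=0.5))
```

Setting `noise_weight` to 0.0 removes the noise component entirely, which corresponds to the filtering case of step 460.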

As described above, the processor 20 records in the storage device what each speaker (for example, the user USR3) says. Because unnecessary noise is filtered out of the initial audio and only the speaker's audio is stored, the load on the processor 20 in subsequent audio processing is reduced. In addition, in some embodiments the volume weight for the position in the frame where the mouth opening and closing is more pronounced can be increased to compute an adjustment result, and this adjustment result is taken as the main audio. In some embodiments, speech is limited by volume and distance. As shown in FIG. 3B, the current audio channel SC5 is closer to the user USR3 (the presenter), and the presenter's voice is usually louder, so the sound that the current audio channel SC5 receives from the user USR3 is louder. The processor 20 can therefore determine that the user USR3 is speaking and filter out the voices of the users USR1 and USR2 (the people talking privately).

In summary, by detecting the opening and closing of a mouth and the mouth position in the frame and using them to determine the sound source, the voice system and sound detection method filter out noise and make sound detection more accurate, and they keep environmental noise from interfering with the main audio. In a meeting or lecture with several speakers, the presenter's voice can still be isolated. The disclosure can also be applied to a voice assistant system: because the operator of the voice assistant is confirmed only after an opening-and-closing mouth is recognized, the voice assistant system is not accidentally triggered by commands that do not come from the user (for example, the sound of a television advertisement).

Although the present disclosure has been described above by way of embodiments, they are not intended to limit it. Anyone skilled in the art may make various changes and refinements without departing from the spirit and scope of the present disclosure, so the scope of protection is defined by the appended claims.

Claims

1. A voice system, comprising: a speaker comprising a plurality of speaker channels, wherein the speaker, according to a sound source position of an initial audio, opens a current audio channel that is closest to the sound source position among the speaker channels and closes the other speaker channels; a camera device configured to capture a frame of the area in front of the speaker; and a processor configured to detect a mouth shape that is in an opening-and-closing state in the frame, identify a mouth position corresponding to the mouth shape, and output a main audio according to the mouth position and the current audio channel.

2. The voice system according to claim 1, further comprising: a storage device configured to store the main audio; wherein the processor is further configured to recognize the content of the main audio and to display the content of the main audio as at least one text on a display.

3. The voice system according to claim 1, wherein the processor is further configured to filter out another audio that originates from a position other than the mouth position.

4. A sound detection method, comprising: receiving an initial audio, wherein a speaker, according to a sound source position of the initial audio, opens a current audio channel that is closest to the sound source position and closes the other speaker channels; capturing a frame of the area in front of the speaker; detecting a mouth shape that is in an opening-and-closing state in the frame; identifying a mouth position corresponding to the mouth shape; and outputting a main audio according to the mouth position and the current audio channel.

5. The sound detection method according to claim 4, further comprising: storing the main audio; recognizing the content of the main audio; and displaying the content of the main audio as at least one text on a display.

6. The sound detection method according to claim 4, further comprising: filtering out another audio that originates from a position other than the mouth position.