TW201901527A - Video conference and video conference management method - Google Patents

Video conference and video conference management method

Info

Publication number
TW201901527A
TW201901527A
Authority
TW
Taiwan
Prior art keywords
image
video conference
target
speaker
sound
Prior art date
Application number
TW106117551A
Other languages
Chinese (zh)
Inventor
曾羽鴻
陳柏森
Original Assignee
和碩聯合科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 和碩聯合科技股份有限公司
Priority to TW106117551A (published as TW201901527A)
Priority to CN201810141603.1A (published as CN108933915B)
Publication of TW201901527A


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • H04N7/155: Conference systems involving storage of or access to video conference sessions
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformations in the plane of the image
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A video conference device and a corresponding video conference management method are provided. The method includes: receiving a sound produced in a conference space; determining a first position of the sound according to the received sound; capturing a panoramic image of the conference space; identifying face images of a plurality of participants in the panoramic image and identifying second positions of the face images in the panoramic image; determining a speaker among the participants according to the first position, the second positions, and the face images; and setting the panoramic image to be displayed in a first region of a video conference image, enlarging the determined speaker's image in the panoramic image, and setting the enlarged speaker's image to be displayed in a second region of the video conference image.

Description

Video conference device and video conference management method

The invention relates to a video device, and more particularly to a video conference device and a video conference management method suitable for video conferencing.

A traditional video conference system uses three or more cameras to capture the meeting participants, uses a microphone array to locate the speaker, and enlarges the located speaker in the video conference image. However, the traditional approach relies only on sound localization to determine the position of the sound source, assumes that this position is the speaker's position, and then enlarges the image at that position in the video conference image. Because of the influence of environmental noise, this approach lacks accuracy and cannot precisely determine the speaker's position.

The invention provides a video conference device and a video conference management method that accurately and automatically detect the speaker through sound localization and image recognition, so that the speaker's image can be enlarged and displayed in the video conference image.

An embodiment of the invention provides a video conference device. The device includes a microphone array, a sound localization unit, an image capturing device, an image recognition unit, and a video conference management unit. The microphone array includes a plurality of microphones and receives sounds produced in the conference space surrounding the video conference device. The sound localization unit is coupled to the microphone array and determines a first position of the sound according to the received sound. The image capturing device captures a panoramic image of the conference space. The image recognition unit is coupled to the image capturing device, identifies the facial image of at least one participant in the panoramic image, and identifies a second position of the at least one facial image in the panoramic image. The video conference management unit is coupled to the sound localization unit, the image recognition unit, and the storage unit, and determines the speaker among the at least one participant according to the first position, the at least one second position, and the at least one facial image. In addition, the video conference management unit sets the panoramic image to be displayed in a first region of a video conference image, enlarges the image of the determined speaker in the panoramic image, and sets the enlarged speaker image to be displayed in a second region of the video conference image.

In the above embodiment, the video conference management unit performs a speech-to-text operation on the sound produced by the speaker to convert the speaker's speech into a text message corresponding to the speaker, and stores the text message corresponding to the speaker in a meeting record database.

An embodiment of the invention provides a video conference management method suitable for a video conference held in a conference space with at least one participant. The method includes: receiving a sound produced in the conference space; determining a first position of the sound according to the received sound; capturing a panoramic image of the conference space; identifying the facial image of the at least one participant in the panoramic image and identifying a second position of the at least one facial image in the panoramic image; determining the speaker among the at least one participant according to the first position, the at least one second position, and the at least one facial image; and setting the panoramic image to be displayed in a first region of a video conference image, enlarging the image of the determined speaker in the panoramic image, and setting the enlarged speaker image to be displayed in a second region of the video conference image.

In the above embodiment, a speech-to-text operation is performed on the sound produced by the speaker to convert the speaker's speech into a text message corresponding to the speaker, and the text message corresponding to the speaker is stored in a meeting record database.

Based on the above, the video conference device and video conference management method provided by the invention use sound localization and image recognition to precisely identify the speaker in the conference space where the video conference is held, and enlarge and display the speaker's image in a video conference image that also contains a panoramic image of all participants in the conference space. In addition, a speech-to-text operation can be performed on the speaker's remarks, and the text message together with the speaker's identification name can be stored to build a meeting record of the video conference. In this way, all participants can intuitively focus on the speaker and conduct the video conference more efficiently, and the device and method can also build the meeting record in real time, improving the overall efficiency of the video conference.

To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

FIG. 1A is a schematic diagram of a video conference according to an embodiment of the invention. FIG. 1B is a schematic diagram of a panoramic image corresponding to the video conference in FIG. 1A according to an embodiment of the invention. Referring to FIGS. 1A and 1B, assume that four participants 2, 3, 4, and 5 are holding a video conference (or another activity that uses video, such as a meeting or remote teaching) in a conference space 1 (e.g., a conference room or lecture hall), and that a video conference device 10 is arranged in the conference space 1. The video conference device 10 provided in this embodiment captures the surrounding scene to obtain a panoramic image 11 (for example, through 360-degree panoramic photography). As shown in FIG. 1B, the panoramic image 11 contains all participants 2, 3, 4, and 5 as well as the conference space around them. In this embodiment, the video conference device 10 determines which participant is currently speaking and generates (outputs) a video conference image according to this determination. For example, when participant 2 speaks (i.e., participant 2 is the speaker), the video conference device 10 receives the sound 21 (also referred to as the remark 21) produced by participant 2, performs sound localization on the received sound 21, and determines that participant 2 is the speaker who produced the sound (remark) according to the change of participant 2's facial image together with the sound localization result. The video conference device provided by the invention is described in detail below with reference to FIG. 2.

FIG. 2 is a block diagram of a video conference device according to an embodiment of the invention. Referring to FIG. 2, in this embodiment the video conference device 10 includes a video conference management unit 110, a microphone array 120, a sound localization unit 121, an image capturing device 130, an image recognition unit 131, a storage unit 140, and a connection interface unit 150. The sound localization unit 121 is coupled to the microphone array 120. The image recognition unit 131 is coupled to the image capturing device 130. The video conference management unit 110 is coupled to the sound localization unit 121, the image recognition unit 131, the storage unit 140, and the connection interface unit 150.

In this embodiment, the video conference management unit 110 is hardware with computing capability (e.g., a chipset or processor) that controls the functions of the other components of the video conference device 10 and manages its overall operation. The video conference management unit 110 is, for example, a single-core or multi-core central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, another programmable processing unit, a digital signal processor (DSP), a programmable controller, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or another similar device.

The storage unit 140 temporarily stores data according to instructions of the video conference management unit 110, including data for managing the video conference device 10, data received from other electronic devices, and data to be transmitted to other electronic devices; the invention is not limited thereto. In addition, in this embodiment the storage unit 140 can also record data that needs to be stored for a long time, such as several databases, according to instructions of the video conference management unit 110. These databases include a face database 141 and a meeting record database 142. In another embodiment, the databases further include a voice database 143. Note that the databases may also be stored on a remote server and accessed via a communication (or network) connection with the video conference device. The face database 141 records facial images corresponding to different people, and may also record facial image feature data sets, each of which records a plurality of image feature values of the corresponding facial image. The face database 141 may further record the identification names (e.g., names, aliases, or identification codes) of the people corresponding to the facial images. The meeting record database 142 records the content of each meeting; in particular, the content may include text messages corresponding to the remarks of all speakers in each meeting. The voice database 143 may record voice messages of different people, and may also record voice feature data sets, each of which records a plurality of voice feature values of the corresponding person's voice.
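The patent does not specify how these databases are implemented. As a rough illustration only, the records they are described as holding could be modeled as follows; the class and field names are invented for this sketch and are not part of the patent.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class FaceRecord:                 # one entry of the face database 141
    person_id: str                # identification name (name, alias, or ID code)
    face_image: np.ndarray        # stored facial image
    features: List[float]         # facial image feature data set (image feature values)

@dataclass
class MeetingEntry:               # one entry of the meeting record database 142
    meeting_id: str
    speaker_id: str               # identification name of the speaker
    text: str                     # text message converted from the speaker's speech
    timestamp: float

@dataclass
class VoiceRecord:                # one entry of the voice database 143
    person_id: str
    voice_features: List[float]   # voice feature data set (voice feature values)
```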

The microphone array 120 includes a plurality of microphones arranged on the video conference device 10, and the spatial range over which the microphones receive sound covers the space around the device. Each microphone of the microphone array 120 converts the received sound into a sound signal and transmits it to the sound localization unit 121. Because each participant 2, 3, 4, 5 is at a different position relative to the microphones of the microphone array 120, when the microphones receive the sound of participant 2 speaking, each microphone receives a different sound intensity, so the strengths of the converted sound signals also differ.

The sound localization unit 121 is a circuit unit/chip that calculates the position of the received sound according to the sound signals received from the microphones of the microphone array 120. In the above example, the sound localization unit 121 can calculate the position of the sound source (also referred to as the first position) according to the different sound intensities received by the microphones from the same source (e.g., participant 2). The coordinates of the first position can be expressed in a coordinate system corresponding to the panoramic image 11 (e.g., a rectangular or angular coordinate system).
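The text only states that the first position is derived from the differing sound intensities at the microphones, without giving an algorithm. Below is a minimal, hypothetical sketch of one such approach: an intensity-weighted circular mean over the microphones' mounting azimuths, producing an angle that can serve as a first position in an angular coordinate aligned with the panoramic image. Practical systems often use time-difference-of-arrival methods instead.

```python
import numpy as np

def localize_by_intensity(mic_angles_deg, rms_levels):
    """Estimate the azimuth of a sound source from a circular microphone array.

    mic_angles_deg: mounting azimuth of each microphone on the device (degrees).
    rms_levels: RMS level of the signal captured by each microphone.
    Returns an azimuth in [0, 360).
    """
    angles = np.radians(np.asarray(mic_angles_deg, dtype=float))
    weights = np.asarray(rms_levels, dtype=float)
    weights = weights / (weights.sum() + 1e-9)   # normalize intensities
    # Weighted circular mean: louder microphones pull the estimate toward them.
    x = np.sum(weights * np.cos(angles))
    y = np.sum(weights * np.sin(angles))
    return float(np.degrees(np.arctan2(y, x)) % 360.0)

# Example: the microphone facing 90 degrees hears the speaker loudest.
print(localize_by_intensity([0, 90, 180, 270], [0.2, 0.9, 0.3, 0.1]))
```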

The image capturing device 130 is, for example, one or two cameras/lenses that can capture panoramic images, and its height can be adjusted. The panoramic image covers the conference space around the video conference device 10. The image capturing device 130 transmits the captured panoramic image to the image recognition unit 131.

The image recognition unit 131 is a circuit unit that performs image recognition operations (e.g., face detection and face recognition) on the received image. The image recognition unit 131 may also be coupled to the storage unit 140 and use the facial images in the face database 141 for machine learning to improve the speed and accuracy of its face detection or face recognition operations. The image recognition unit 131 may record the facial images of the identified participants (or the corresponding facial image feature data sets) in the face database 141.

The connection interface unit 150 is, for example, a circuit unit conforming to the Serial Advanced Technology Attachment (SATA) standard, the Parallel Advanced Technology Attachment (PATA) standard, the Institute of Electrical and Electronic Engineers (IEEE) 1394 standard, the Peripheral Component Interconnect Express (PCI Express) standard, the Universal Serial Bus (USB) standard, the Ultra High Speed-I (UHS-I) interface standard, the Ultra High Speed-II (UHS-II) interface standard, the Secure Digital (SD) interface standard, the Memory Stick (MS) interface standard, the Multi Media Card (MMC) interface standard, the Compact Flash (CF) interface standard, the Integrated Device Electronics (IDE) standard, the Personal Computer Memory Card International Association (PCMCIA) standard, the Video Graphics Array (VGA) standard, the Digital Visual Interface (DVI) standard, the High Definition Multimedia Interface (HDMI) standard, or another suitable standard. In this embodiment, the video conference management unit 110 can be connected through the connection interface unit 150 to a sound playback device 151 (e.g., a speaker), a display device 152 (e.g., a screen or projector), or another type of output device to output data (e.g., the video conference image generated by the video conference device 10). In addition, the video conference management unit 110 can be connected to an input device 153 through the connection interface unit 150 to receive input signals from the input device 153 or to receive operations from a user (e.g., a participant).

Note that the output device and the input device may also be integrated into the same electronic device (e.g., a touch screen). In particular, the connection interface unit 150 may also be connected to other storage units (e.g., a memory card or an external hard disk) so that the video conference management unit 110 can access data in the storage units externally connected through the connection interface unit 150. Furthermore, in another embodiment, the different input/output devices connected through the connection interface unit 150 may also be integrated into the video conference device 10.

In an embodiment, the video conference device 10 may also be connected to other electronic devices (e.g., a desktop computer, a notebook computer, a tablet computer, a server, or a smartphone) through the connection interface unit 150, so that these electronic devices can hold a video conference through the video conference device 10 and the applications running on them (e.g., communication software such as Skype, QQ, Line, FB Messenger, or Google Hangouts).

In another embodiment, the video conference device 10 further includes a communication unit 160 coupled to the video conference management unit 110. The communication unit 160 transmits or receives data via wireless communication. In this embodiment, the communication unit 160 may have a wireless communication module and support one of, or a combination of, the Global System for Mobile Communication (GSM) system, the Personal Handy-phone System (PHS), the Code Division Multiple Access (CDMA) system, the Wireless Fidelity (WiFi) system, the Worldwide Interoperability for Microwave Access (WiMAX) system, third-generation (3G) and fourth-generation (4G) wireless communication technologies, Long Term Evolution (LTE), infrared transmission, and Bluetooth (BT) communication technology, without being limited thereto. In addition, the communication unit 160 may also have a network interface card (NIC) to establish a network connection, so that the video conference device 10 can connect to a local area network or the Internet.

In yet another embodiment, the video conference device 10 further includes a speech recognition unit 122 coupled to the video conference management unit 110, the microphone array 120, and the storage unit 140. The speech recognition unit 122 is a circuit unit that performs speech recognition on the sound received by the microphone array 120 and can be used to determine whether the sound is a human voice. In the speech recognition operation, the speech recognition unit 122 may also compare the recognized speech against the voice messages or voice feature data sets in the voice database 143 to identify the identification name of the person who produced the speech. In addition, the speech recognition unit 122 may perform a speech-to-text operation to convert the recognized speech (voice message) into a text message. Note that the speech recognition unit 122 may use the voice messages or voice feature data sets in the voice database 143 for machine learning to improve its speech recognition capability. The operation of the video conference device provided by this embodiment and the video conference management method it uses are described in detail below with reference to FIGS. 3 and 4.

FIG. 3 is a flowchart of a video conference management method according to an embodiment of the invention. Referring to FIG. 3, assume that participant 2 among participants 2, 3, 4, and 5 (as shown in FIG. 1A) currently produces a sound 21. In step S301, the microphone array 120 receives the sound produced in the conference space; for example, the microphones of the microphone array 120 receive the sound 21. Then, in step S303, the sound localization unit 121 determines the first position of the sound according to the received sound. That is, the sound localization unit 121 performs a sound localization operation on the sound signals generated by the microphone array 120 upon receiving the sound 21, so as to calculate the position of the sound source that produced the sound 21.

應注意的是,在一實施例中,若語音辨識單元122判定所接收的聲音並不是人聲,則不會接續進行步驟S303來處理所接收的聲音。如此一來,可避免掉非人聲的環境噪音的干擾。此外,如上所述,若語音辨識單元122判定所接收的聲音是人聲,語音辨識單元122或視訊會議管理單元110除了可對應地進行語音轉文字操作外,還可根據所接收的人聲的聲音特徵以及利用經由語音資料庫所訓練的語音辨識模型來輔助校正所轉換的文字訊息。接著,再將所辨識到的發言者的識別名稱(如,可利用影像辨識或是語音辨識的方式,藉由人臉資料庫或語音資料庫來找尋發言者的識別名稱)與文字訊息記錄作為本次會議記錄的所述發言者的言論儲存至會議記錄資料庫中。It should be noted that, in an embodiment, if the voice recognition unit 122 determines that the received sound is not a human voice, step S303 will not be continued to process the received sound. In this way, the interference of non-human ambient noise can be avoided. In addition, as described above, if the voice recognition unit 122 determines that the received voice is a human voice, the voice recognition unit 122 or the video conference management unit 110 may perform a voice-to-text operation corresponding to the voice characteristics of the human voice received. And using a speech recognition model trained through a speech database to assist in correcting the converted text message. Then, use the recognized name of the speaker (for example, you can use the image recognition or speech recognition method to find the speaker's recognition name through the face database or voice database) and the text message record as the The speeches of the speakers of this meeting record are stored in the meeting record database.
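As an illustration of the flow just described (filter out non-speech, transcribe, look up the speaker's identification name, and archive the remark), here is a short sketch. `asr`, `voice_db`, and `record_db` are hypothetical interfaces standing in for the speech recognition unit 122 and the databases 142/143; they are not a real library API.

```python
def handle_incoming_audio(audio_frame, asr, voice_db, record_db, meeting_id):
    """Filter non-speech audio, transcribe speech, and archive it per speaker."""
    if not asr.is_speech(audio_frame):                # skip non-human environmental noise
        return None
    text = asr.transcribe(audio_frame)                # speech-to-text operation
    speaker_id = voice_db.identify(audio_frame)       # identification name via voice matching
    record_db.append(meeting_id, speaker_id, text)    # store as the speaker's remark
    return speaker_id, text
```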

In step S305, the image capturing device 130 captures a panoramic image of the conference space. As in the example of FIGS. 1A and 1B, the image capturing device 130 captures and generates the panoramic image 11 of the conference space 1 and transmits the image data corresponding to the panoramic image 11 to the image recognition unit 131.

In step S307, the facial image of at least one participant in the panoramic image is identified, and the second position of the at least one facial image is identified.

Specifically, the image recognition unit 131 sets a coordinate system for the received panoramic image 11 and continuously detects whether the panoramic image contains facial images (via a face detection operation). If the facial image of at least one participant (one or more participants) is detected, the image recognition unit 131 assigns a coordinate value to the detected facial image according to its position in the panoramic image; this coordinate value represents the position of the detected facial image in the panoramic image. For example, the coordinate value may represent the center point of the facial image, the center point of the mouth region of the facial image (e.g., the center point 503 of the mouth region in FIG. 5A), or a point of a specific region covering the facial image. The invention does not limit how the coordinate value corresponding to a facial image is chosen.
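A small sketch of how a detected face bounding box might be reduced to a single representative coordinate (the second position): either the box center or the center of a nominal mouth region. The function is an assumption for illustration; the mouth-region fractions follow the layout described later in the text.

```python
def face_anchor_point(bbox, mode="mouth"):
    """Reduce a face bounding box (x, y, w, h) in panoramic-image pixels to one point."""
    x, y, w, h = bbox
    if mode == "center":
        return (x + w / 2.0, y + h / 2.0)
    # Mouth region assumed to span 1/3W..2/3W horizontally and 3/5H..4/5H vertically,
    # so its center sits at (x + W/2, y + 0.7 * H).
    return (x + w / 2.0, y + 0.7 * h)
```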

Furthermore, in an embodiment, the image recognition unit 131 attempts to perform face recognition on the detected facial image (via a face recognition operation). In the face recognition operation, the image recognition unit 131 compares the detected facial image against the face database 141; if there is a matching facial image, the identification name of the participant to whom the matched facial image belongs can be found accordingly. In an embodiment, if there is no matching facial image, the image recognition unit 131 can add the detected facial image to the face database 141 (the corresponding identification name can be obtained by receiving a user input operation, by speech recognition, or by accessing meeting information containing the identification names of all participants).

It is worth mentioning that the video conference management unit 110 performs the operations of steps S301 to S303 and of steps S305 to S307 in parallel (synchronously). In other words, the video conference management unit 110 can simultaneously and continuously identify the position of the source of the currently received sound, continuously capture panoramic images, and recognize the facial images in the panoramic image together with the positions corresponding to the detected facial images.

Next, in step S309, the video conference management unit 110 determines the speaker among the at least one participant according to the first position, the at least one second position, and the at least one facial image.

FIG. 4 is a flowchart of step S309 of the video conference management method according to an embodiment of the invention. FIG. 5A is a schematic diagram of a panoramic image according to an embodiment of the invention. FIG. 5B is a schematic diagram of a feature recognition region according to an embodiment of the invention. Referring to FIG. 5A, there are four participants in the panoramic image 500 of FIG. 5A. As described above, the image recognition unit 131 identifies each participant's facial image and the corresponding second positions.

Referring to FIGS. 4 and 5A, in step S3091 the video conference management unit 110 sets a target region in the panoramic image according to the first position, and identifies at least one target facial image in the target region according to the target region and the at least one second position. For example, assume that the video conference management unit 110 determines the first position 502 corresponding to the received sound. The video conference management unit 110 sets a target region 501 centered on the first position 502, and determines, according to the set target region and the identified second positions, the position 503 of the target facial image in the target region and the corresponding target facial image. In more detail, the video conference management unit 110 determines, according to the coverage of the target region (the coordinate values of the corresponding region boundaries), whether at least one second position falls within the target region; if so, the target region is presumed to contain the speaker's facial image.
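A possible sketch of step S3091, assuming for simplicity that both the first position and the second positions are expressed as azimuth angles in the panoramic image; the half-width of the target region is a made-up parameter, not a value from the patent.

```python
def faces_in_target_region(first_pos, second_positions, half_width):
    """Return the indices of face positions that fall inside a target region
    centered on the sound's first position (angles in degrees, wrapping at 360)."""
    def angular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    return [i for i, pos in enumerate(second_positions)
            if angular_distance(first_pos, pos) <= half_width]

# Example: sound localized at 118 degrees, faces detected at 40, 120, and 250 degrees.
print(faces_in_target_region(118.0, [40.0, 120.0, 250.0], half_width=30.0))  # -> [1]
```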

Next, in step S3093, the video conference management unit 110 determines the speaker among the at least one target participant to whom the at least one target facial image belongs according to the image change of the at least one target facial image.

For example, referring to FIG. 5B, the video conference management unit 110 or the image recognition unit 131 sets reference coordinate values for the four corners of the target facial image 511, where the length of the target facial image 511 is "H" and its width is "W". In this embodiment, the mouth region of the target facial image 511 may be preset as a region within the target facial image 511. Assume the upper-left corner of the target facial image 511 is O(0, 0), the upper-right corner is W(W, 0), the lower-left corner is H(0, H), and the lower-right corner is WH(W, H). In this example, the mouth region can be preset as the range from 3/5H to 4/5H vertically and from 1/3W to 2/3W horizontally within the target facial image. That is, relative to the target facial image, the upper-left corner of the mouth region is O1(1/3W, 3/5H), the upper-right corner is W1(2/3W, 3/5H), the lower-left corner is H1(1/3W, 4/5H), and the lower-right corner is WH1(2/3W, 4/5H).
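The mouth-region corners above follow directly from the stated fractions; a small helper that reproduces them relative to the face image's own origin might look like this (the function name is an assumption for illustration).

```python
def mouth_region(face_w, face_h):
    """Return the mouth-region corners of a face image of width W and length H,
    using the fractions given in the text: 1/3W..2/3W and 3/5H..4/5H."""
    return {
        "O1":  (face_w / 3.0,       3.0 * face_h / 5.0),   # upper-left corner
        "W1":  (2.0 * face_w / 3.0, 3.0 * face_h / 5.0),   # upper-right corner
        "H1":  (face_w / 3.0,       4.0 * face_h / 5.0),   # lower-left corner
        "WH1": (2.0 * face_w / 3.0, 4.0 * face_h / 5.0),   # lower-right corner
    }

print(mouth_region(90, 120))
```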

In this embodiment, the video conference management unit 110 instructs the image recognition unit 131 to use the mouth region as the feature recognition region and to further calculate the image change for that region. In more detail, the video conference management unit 110 instructs the image recognition unit 131 to set the feature recognition region 520 of the target facial image according to the target position 502, among the second positions, corresponding to the target facial image 511. After the feature recognition region 520 is set, the image recognition unit 131 calculates the pixel change value of the feature recognition region 520 of the target facial image 511 over a period of time.

For example, at the time point of each video frame, the image recognition unit 131 calculates the average pixel value (e.g., RGB value, grayscale value, luminance value, or another type of pixel value) of the feature recognition region 520 in the panoramic image of the current frame. The image recognition unit 131 then calculates the absolute difference between the average pixel value of the current frame and the average pixel value of the feature recognition region 520 in the panoramic image of each of the previous M frames. Finally, the image recognition unit 131 takes the largest of these differences as the pixel change value of the feature recognition region 520 corresponding to the current frame.
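A sketch of the pixel-change computation just described: keep the average pixel value of the feature recognition region for the previous M frames and, for each new frame, report the largest absolute difference (the pixel change value). The class name and the choice M = 5 are illustrative assumptions.

```python
import numpy as np
from collections import deque

class MouthChangeDetector:
    """Track the average pixel value of a feature (mouth) region over the last
    M frames and report the largest absolute difference for each new frame."""

    def __init__(self, m_frames=5):
        self.history = deque(maxlen=m_frames)

    def update(self, region_pixels):
        mean_val = float(np.mean(region_pixels))   # average pixel value (e.g., grayscale)
        change = max((abs(mean_val - prev) for prev in self.history), default=0.0)
        self.history.append(mean_val)
        return change
```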

The image recognition unit 131 then uses the calculated pixel change value corresponding to the target facial image 511 as the feature image change value of the target facial image 511. If the feature image change value exceeds a predetermined threshold, the video conference management unit 110 determines that the corresponding participant is the speaker.

Note that in the above example the target region contains only one target participant. If the target region contains multiple target participants, the image recognition unit 131 sets feature recognition regions corresponding to the facial images of all these participants, calculates the feature image change values of all of them, and finds the largest one (the maximum feature image change value). If the maximum feature image change value exceeds the predetermined threshold, the video conference management unit 110 determines that the participant corresponding to the maximum feature image change value is the speaker among the target participants.
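The decision rule described in the last two paragraphs (take the candidate with the largest feature image change value and accept it only if it exceeds the predetermined threshold) can be sketched as follows; the function name and the example threshold are assumptions.

```python
def pick_speaker(change_values, threshold):
    """Given the feature image change value of each candidate face in the target
    region, return the index of the speaker, or None if no change exceeds the
    predetermined threshold."""
    if not change_values:
        return None
    best = max(range(len(change_values)), key=lambda i: change_values[i])
    return best if change_values[best] > threshold else None

print(pick_speaker([1.2, 7.8, 0.4], threshold=5.0))   # -> 1
```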

Returning to FIG. 3, after the speaker is determined, in step S311 the video conference management unit 110 sets the panoramic image to be displayed in the first region of the video conference image, enlarges the image of the determined speaker in the panoramic image, and sets the enlarged speaker image to be displayed in the second region of the video conference image.

FIG. 5C is a schematic diagram of a video conference image according to an embodiment of the invention. Referring to FIG. 5C and continuing the example of FIGS. 5A and 5B, after determining that the leftmost participant in the panoramic image 500 is the speaker, the video conference management unit 110 sets the speaker's image (e.g., image 510) according to the second position of the target facial image 511, and generates a video conference image 530 from the received panoramic image 500. For example, the generated video conference image 530 has a first region and a second region. The video conference management unit 110 sets the panoramic image 500 to be displayed in the first region, enlarges the speaker's image 510, and sets the enlarged image 510 to be displayed in the second region.
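A rough sketch of composing the video conference image 530, assuming OpenCV is available: the panoramic image goes into the first (upper) region and the enlarged speaker crop into the second (lower) region. The output sizes are arbitrary placeholders, not values from the patent.

```python
import cv2
import numpy as np

def compose_conference_frame(panorama, speaker_bbox, out_w=1280, top_h=240, bottom_h=480):
    """Stack the panorama (first region) above the enlarged speaker crop (second region).

    speaker_bbox = (x, y, w, h) of the speaker's image inside the panorama.
    """
    x, y, w, h = speaker_bbox
    speaker = panorama[y:y + h, x:x + w]
    top = cv2.resize(panorama, (out_w, top_h))        # first region: full panoramic image
    bottom = cv2.resize(speaker, (out_w, bottom_h))   # second region: enlarged speaker image
    return np.vstack([top, bottom])
```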

In an embodiment, the video conference management unit 110 can use machine learning to perform a super-resolution operation so that the enlarged speaker image 510 remains sharp (the enlargement does not blur the image).

Note that in the above example the first region is arranged above the second region, but the invention is not limited thereto. For example, in another embodiment the first region may be arranged below the second region.

Returning to FIG. 3, after the video conference image is set/generated, in step S313 the video conference management unit 110 outputs the video conference image. Specifically, in this embodiment the video conference management unit 110 can convert the generated video conference image into a corresponding video signal and transmit it, over the connection established by the communication unit 160, to other electronic devices, so that the screens or display devices of those electronic devices can display the generated video conference image.

In an embodiment, the video conference device 10 can be connected through the connection interface unit 150 or the communication unit 160 to an electronic device in the conference space and serve as the camera of that device, so that an ordinary video conferencing application running on the electronic device (e.g., instant messaging software such as Skype or Line) can hold a video conference using the video conference image generated by the video conference device 10. In this way, users can hold smart video conferences with general instant messaging software currently on the market; that is, the overall local conference view (the panoramic image in the first region of the video conference image) and the image of the current speaker (the image in the second region of the video conference image) are provided to the remote users of the instant messaging software.

It is worth mentioning that, in another embodiment, the video conference management unit 110 can also attach the text message obtained through the speech-to-text operation to the generated video conference image as an additional layer, so that the text message serves as a subtitle of the speaker's remark. In yet another embodiment, the video conference management unit 110 can further input the text message obtained through the speech-to-text operation into a translation unit to obtain a translated text message (e.g., converting and translating the speaker's Chinese speech into English text) and attach the translated text message to the video conference image as the corresponding subtitle.
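A minimal sketch of attaching the (possibly translated) text message as a subtitle on the conference frame, again assuming OpenCV; the font, color, and placement are arbitrary choices for illustration.

```python
import cv2

def add_subtitle(frame, text, margin=20):
    """Overlay the speech-to-text (or translated) message along the bottom edge."""
    h, _w = frame.shape[:2]
    cv2.putText(frame, text, (margin, h - margin),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2, cv2.LINE_AA)
    return frame
```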

In summary, the video conference device and video conference management method provided by the invention use sound localization and image recognition to precisely identify the speaker in the conference space where the video conference is held, and enlarge and display the speaker's image in a video conference image that also contains a panoramic image of all participants in the conference space. In addition, a speech-to-text operation can be performed on the speaker's remarks, and the text message together with the speaker's identification name can be stored to build a meeting record of the video conference. In this way, all participants can intuitively focus on the speaker and conduct the video conference more efficiently, and the device and method can also build the meeting record in real time, improving the overall efficiency of the video conference.

Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone with ordinary knowledge in the technical field may make some modifications and refinements without departing from the spirit and scope of the invention; therefore, the protection scope of the invention shall be defined by the appended claims.

1‧‧‧Conference space
2, 3, 4, 5‧‧‧Participants
10‧‧‧Video conference device
11, 500‧‧‧Panoramic image
21‧‧‧Sound/voice/remark
110‧‧‧Video conference management unit
120‧‧‧Microphone array
121‧‧‧Sound localization unit
122‧‧‧Speech recognition unit
130‧‧‧Image capturing device
131‧‧‧Image recognition unit
140‧‧‧Storage unit
141, 142, 143‧‧‧Databases
150‧‧‧Connection interface unit
151‧‧‧Sound playback device
152‧‧‧Display device
153‧‧‧Input device
160‧‧‧Communication unit
S301, S303, S305, S307, S309, S311, S313‧‧‧Steps of the video conference management method
S3091, S3093‧‧‧Steps of step S309 of the video conference management method of FIG. 3
501‧‧‧Target region
510‧‧‧Speaker image
511‧‧‧Target facial image
520‧‧‧Feature recognition region
H‧‧‧Length of the target facial image
W‧‧‧Width of the target facial image
O(0, 0), W(W, 0), H(0, H), WH(W, H)‧‧‧Coordinate values
530‧‧‧Video conference image

FIG. 1A is a schematic diagram of a video conference according to an embodiment of the invention. FIG. 1B is a schematic diagram of a panoramic image corresponding to the video conference in FIG. 1A according to an embodiment of the invention. FIG. 2 is a block diagram of a video conference device according to an embodiment of the invention. FIG. 3 is a flowchart of a video conference management method according to an embodiment of the invention. FIG. 4 is a flowchart of step S309 of the video conference management method according to an embodiment of the invention. FIG. 5A is a schematic diagram of a panoramic image according to an embodiment of the invention. FIG. 5B is a schematic diagram of a feature recognition region according to an embodiment of the invention. FIG. 5C is a schematic diagram of a video conference image according to an embodiment of the invention.

Claims (10)

一種視訊會議裝置,包括: 一麥克風陣列,包括多個麥克風,用以接收一會議空間內所發出的一聲音; 一聲音定位單元,耦接至該麥克風陣列,用以根據所接收到的該聲音來判斷該聲音的一第一位置; 一影像擷取裝置,用以擷取該會議空間的一全景影像; 一影像辨識單元,耦接該影像擷取裝置,用以辨識該全景影像中的至少一與會者的臉部影像,並且判斷該至少一臉部影像的一第二位置;以及 一視訊會議管理單元,耦接該聲音定位單元與該影像辨識單元,用以根據該第一位置、該至少一第二位置與該至少一臉部影像來判定該至少一與會者中的一發言者, 其中該視訊會議管理單元設定該全景影像顯示於一視訊會議影像的一第一區域,放大該全景影像中的所判定之該發言者的影像,並且設定所放大之該發言者的該影像顯示於該視訊會議影像的一第二區域。A video conference device includes: a microphone array including a plurality of microphones for receiving a sound emitted from a conference space; a sound positioning unit coupled to the microphone array for receiving the sound according to the received sound To determine a first position of the sound; an image capture device to capture a panoramic image of the conference space; an image recognition unit coupled to the image capture device to identify at least one of the panoramic images A participant's face image, and determining a second position of the at least one face image; and a video conference management unit, coupled to the sound positioning unit and the image recognition unit, for using the first position, the At least a second position and the at least one face image to determine a speaker of the at least one participant, wherein the video conference management unit sets the panoramic image to be displayed in a first area of a video conference image and enlarges the panoramic image The image of the speaker determined in the image, and the enlarged image of the speaker is set to be displayed in a first portion of the video conference image Area. 如申請專利範圍第1項所述的視訊會議裝置,其中 該視訊會議管理單元對該發言者所發出的該聲音進行一語音轉文字操作,以將該發言者的該聲音轉換為對應該發言者的一文字訊息,其中該視訊會議管理單元儲存對應該發言者的一識別名稱與該文字訊息至一會議記錄資料庫。The video conference device according to item 1 of the scope of patent application, wherein the video conference management unit performs a voice-to-text operation on the voice issued by the speaker to convert the voice of the speaker into a corresponding speaker A text message, wherein the video conference management unit stores an identification name corresponding to the speaker and the text message to a conference record database. 如申請專利範圍第1項所述的視訊會議裝置,其中 該視訊會議管理單元根據該第一位置,設定該全景影像中的一目標區域,並且指示該影像辨識單元根據該目標區域與該至少一第二位置辨識在該目標區域中的至少一目標臉部影像, 其中該視訊會議管理單元根據該至少一目標臉部影像的影像變化判定該至少一目標臉部影像所屬之至少一目標與會者中的該發言者。The video conference device according to item 1 of the scope of patent application, wherein the video conference management unit sets a target area in the panoramic image according to the first position, and instructs the image recognition unit according to the target area and the at least one The second position recognizes at least one target facial image in the target area, wherein the video conference management unit determines, among the at least one target participant to which the at least one target facial image belongs, according to an image change of the at least one target facial image. Of that speaker. 
如申請專利範圍第3項所述的視訊會議裝置,其中 該視訊會議管理單元指示該影像辨識單元計算該至少一目標臉部影像的特徵影像變化值, 若該至少一特徵影像變化值的一最大特徵影像變化值超過一預定門檻值,該視訊會議管理單元判定該最大特徵影像變化值所對應之與會者為該至少一目標與會者中的該發言者。The video conference device according to item 3 of the scope of patent application, wherein the video conference management unit instructs the image recognition unit to calculate a characteristic image change value of the at least one target facial image, and if one of the at least one characteristic image change value is a maximum The characteristic image change value exceeds a predetermined threshold, and the video conference management unit determines that the participant corresponding to the maximum characteristic image change value is the speaker of the at least one target participant. 如申請專利範圍第4項所述的視訊會議裝置,其中 該視訊會議管理單元指示該影像辨識單元根據該至少一第二位置中對應該至少一目標臉部影像的至少一目標位置來設定該至少一目標臉部影像的特徵辨識區域, 其中該影像辨識單元計算一預定時間內,該至少一目標臉部影像的特徵辨識區域的像素變化值,並且將所計算出的對應該至少一目標臉部影像的像素變化值作為該至少一目標臉部影像的所述特徵影像變化值。The video conference device according to item 4 of the scope of patent application, wherein the video conference management unit instructs the image recognition unit to set the at least one target position corresponding to at least one target position image of the at least one target face in the at least one second position. A feature recognition area of a target face image, wherein the image recognition unit calculates a pixel change value of the feature recognition area of the at least one target face image within a predetermined time, and matches the calculated corresponding to at least one target face The pixel change value of the image is used as the characteristic image change value of the at least one target facial image. 一種視訊會議管理方法,適用於在一會議空間所進行之一視訊會議,其中該會議空間具有至少一與會者,所述方法包括: 接收該會議空間內所發出的一聲音; 根據所接收到的該聲音來判斷該聲音的一第一位置; 擷取該會議空間的一全景影像; 辨識該全景影像中的至少一與會者的臉部影像,並且判斷該至少一臉部影像的第二位置; 根據該第一位置、該至少一第二位置與該至少一臉部影像來判定該至少一與會者中的一發言者;以及 設定該全景影像顯示於一視訊會議影像的一第一區域,放大該全景影像中的所判定之該發言者的影像,並且設定所放大之該發言者的該影像顯示於該視訊會議影像的一第二區域。A video conference management method suitable for a video conference in a conference space, wherein the conference space has at least one participant, the method includes: receiving a sound emitted in the conference space; according to the received Using the sound to determine a first position of the sound; capturing a panoramic image of the conference space; identifying a facial image of at least one participant in the panoramic image, and determining a second position of the at least one facial image; Determine a speaker of the at least one participant according to the first position, the at least one second position, and the at least one facial image; and set the panoramic image to be displayed in a first area of a video conference image and zoom in The image of the speaker in the panoramic image is determined, and the enlarged image of the speaker is set to be displayed in a second area of the video conference image. 如申請專利範圍第6項所述的視訊會議管理方法,更包括: 對該發言者所發出的該聲音進行一語音轉文字操作,以將該發言者的該聲音轉換為對應該發言者的一文字訊息,並且儲存對應該發言者的一識別名稱與該文字訊息至一會議記錄資料庫。The video conference management method according to item 6 of the scope of patent application, further comprising: performing a voice-to-text operation on the voice issued by the speaker to convert the voice of the speaker into a text corresponding to the speaker Message, and store an identifying name corresponding to the speaker and the text message to a conference record database. 
8. The video conference management method of claim 6, wherein the step of determining the speaker among the at least one participant according to the first position, the at least one second position, and the at least one facial image comprises: setting a target area in the panoramic image according to the first position, and recognizing, according to the target area and the at least one second position, at least one target facial image located in the target area; and determining the speaker among at least one target participant to whom the at least one target facial image belongs according to an image change of the at least one target facial image.

9. The video conference management method of claim 8, wherein the step of determining the speaker among the at least one target participant to whom the at least one target facial image belongs according to the image change of the at least one target facial image comprises: calculating a feature image change value of the at least one target facial image; and if a maximum feature image change value among the at least one feature image change value exceeds a predetermined threshold, determining that the participant corresponding to the maximum feature image change value is the speaker among the at least one target participant.

10. The video conference management method of claim 9, wherein the step of calculating the feature image change value of the at least one target facial image comprises: setting a feature recognition area of the at least one target facial image according to at least one target position, among the at least one second position, corresponding to the at least one target facial image; and calculating, within a predetermined time, a pixel change value of the feature recognition area of the at least one target facial image, and using the calculated pixel change value corresponding to the at least one target facial image as the feature image change value of the at least one target facial image.
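Claims 4, 5, 9, and 10 resolve the case where several target faces sit near the sound position: each target face gets a feature recognition area, the pixel change of that area over a predetermined time is taken as its feature image change value, and the face with the largest value is reported as the speaker only if that maximum exceeds a predetermined threshold. The claims do not fix the region, the window length, or the threshold, so the following is a minimal numpy sketch under assumed values; mean absolute frame differencing, the mouth-region choice, and the threshold of 4.0 are illustrative, not the claimed method itself.

```python
import numpy as np


def feature_change_value(gray_frames, feature_box):
    # Mean absolute pixel change of one face's feature recognition area
    # (here assumed to be the mouth region) over consecutive grayscale
    # frames covering the predetermined time window.
    x, y, w, h = feature_box
    regions = [f[y:y + h, x:x + w].astype(np.float32) for f in gray_frames]
    diffs = [np.abs(regions[i + 1] - regions[i]).mean()
             for i in range(len(regions) - 1)]
    return float(np.mean(diffs)) if diffs else 0.0


def determine_speaker(gray_frames, target_faces, threshold=4.0):
    # target_faces: list of (participant_name, feature_box) pairs.
    # The face with the largest feature image change value is the
    # speaker, but only if that maximum exceeds the threshold.
    changes = [feature_change_value(gray_frames, box)
               for _, box in target_faces]
    if not changes:
        return None
    best = int(np.argmax(changes))
    return target_faces[best][0] if changes[best] > threshold else None
```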
TW106117551A 2017-05-26 2017-05-26 Video conference and video conference management method TW201901527A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW106117551A TW201901527A (en) 2017-05-26 2017-05-26 Video conference and video conference management method
CN201810141603.1A CN108933915B (en) 2017-05-26 2018-02-11 Video conference device and video conference management method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW106117551A TW201901527A (en) 2017-05-26 2017-05-26 Video conference and video conference management method

Publications (1)

Publication Number Publication Date
TW201901527A true TW201901527A (en) 2019-01-01

Family

ID=64448908

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106117551A TW201901527A (en) 2017-05-26 2017-05-26 Video conference and video conference management method

Country Status (2)

Country Link
CN (1) CN108933915B (en)
TW (1) TW201901527A (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109560941A (en) * 2018-12-12 2019-04-02 深圳市沃特沃德股份有限公司 Minutes method, apparatus, intelligent terminal and storage medium
CN111629126A (en) * 2019-02-28 2020-09-04 钉钉控股(开曼)有限公司 Audio and video acquisition device and method
CN111835995A (en) * 2019-04-16 2020-10-27 泰州市朗嘉尚网络科技有限公司 Adaptive sharpness adjustment method
CN110381233A (en) * 2019-04-16 2019-10-25 泰州市朗嘉尚网络科技有限公司 Adaptive clarity regulating system
TWI699120B (en) * 2019-04-30 2020-07-11 陳筱涵 Conference recording system and conference recording method
CN111918018B (en) * 2019-05-08 2022-05-06 奥图码股份有限公司 Video conference system, video conference apparatus, and video conference method
CN110082723B (en) * 2019-05-16 2022-03-15 浙江大华技术股份有限公司 Sound source positioning method, device, equipment and storage medium
CN110544479A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Denoising voice recognition method and device
CN112866617A (en) * 2019-11-28 2021-05-28 中强光电股份有限公司 Video conference device and video conference method
CN111093028A (en) * 2019-12-31 2020-05-01 联想(北京)有限公司 Information processing method and electronic equipment
CN111343413A (en) * 2020-04-09 2020-06-26 深圳市明日实业有限责任公司 Video conference system and display method thereof
CN113676622A (en) * 2020-05-15 2021-11-19 杭州海康威视数字技术股份有限公司 Video processing method, image pickup apparatus, video conference system, and storage medium
CN112073613B (en) * 2020-09-10 2021-11-23 广州视源电子科技股份有限公司 Conference portrait shooting method, interactive tablet, computer equipment and storage medium
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium
CN115022698B (en) * 2022-04-28 2023-12-29 上海赛连信息科技有限公司 Method and device for clearly displaying picture content based on picture layout

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7598975B2 (en) * 2002-06-21 2009-10-06 Microsoft Corporation Automatic face extraction for use in recorded meetings timelines
KR100703699B1 (en) * 2005-02-05 2007-04-05 삼성전자주식회사 Apparatus and method for providing multilateral video communication
CN102368816A (en) * 2011-12-01 2012-03-07 中科芯集成电路股份有限公司 Intelligent front end system of video conference
CN105915798A (en) * 2016-06-02 2016-08-31 北京小米移动软件有限公司 Camera control method in video conference and control device thereof

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112752059A (en) * 2019-10-30 2021-05-04 纬创资通股份有限公司 Video conference system and video conference method
US11120524B2 (en) 2019-10-30 2021-09-14 Wistron Corporation Video conferencing system and video conferencing method
TWI749391B (en) * 2019-10-30 2021-12-11 緯創資通股份有限公司 Video conferencing system and video conferencing method
CN112752059B (en) * 2019-10-30 2023-06-30 纬创资通股份有限公司 Video conference system and video conference method
TWI742481B (en) * 2019-12-09 2021-10-11 茂傑國際股份有限公司 Video conference panoramic image expansion method
TWI817213B (en) * 2020-10-23 2023-10-01 南韓商納寶股份有限公司 Method, system, and computer readable record medium to record conversations in connection with video communication service
TWI791314B (en) * 2020-12-17 2023-02-01 仁寶電腦工業股份有限公司 Video conference system and method thereof, sensing device and interface generation method
TWI799048B (en) * 2021-12-30 2023-04-11 瑞軒科技股份有限公司 Panoramic video conference system and method
TWI810798B (en) * 2022-01-24 2023-08-01 瑞軒科技股份有限公司 Video screen composition method and electronic device

Also Published As

Publication number Publication date
CN108933915A (en) 2018-12-04
CN108933915B (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN108933915B (en) Video conference device and video conference management method
US11343445B2 (en) Systems and methods for implementing personal camera that adapts to its surroundings, both co-located and remote
US10083710B2 (en) Voice control system, voice control method, and computer readable medium
JP5450739B2 (en) Image processing apparatus and image display apparatus
WO2020082902A1 (en) Sound effect processing method for video, and related products
US9692959B2 (en) Image processing apparatus and method
WO2019184499A1 (en) Video call method and device, and computer storage medium
US8749607B2 (en) Face equalization in video conferencing
US20150146078A1 (en) Shift camera focus based on speaker position
US11527242B2 (en) Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view
US10681308B2 (en) Electronic apparatus and method for controlling thereof
US20110157299A1 (en) Apparatus and method of video conference to distinguish speaker from participants
JP2015529354A (en) Method and apparatus for face recognition
TWI588590B (en) Video generating system and method thereof
WO2018121385A1 (en) Information processing method and apparatus, and computer storage medium
US11611713B2 (en) Image capturing system
WO2020248950A1 (en) Method for determining validness of facial feature, and electronic device
US20230283888A1 (en) Processing method and electronic device
JP2015126451A (en) Recording method for image, electronic equipment and computer program
SG187168A1 (en) Image processing apparatus, image processing method, and computer-readable recording medium
CN110673811B (en) Panoramic picture display method and device based on sound information positioning and storage medium
US9298971B2 (en) Method and apparatus for processing information of image including a face
TWI755938B (en) Image capturing system
CN112291507A (en) Video picture adjusting method and device, electronic equipment and storage medium
CN107025638B (en) Image processing method and device