TW201916669A - Method and apparatus for gaze recognition and interaction
- Publication number: TW201916669A (application number TW106138396A)
- Authority: TW (Taiwan)
- Prior art keywords: video frame, face, camera, current video, rotating
- Prior art date: 2017-09-27
Landscapes
- Image Analysis (AREA)
- Image Processing (AREA)
- Studio Devices (AREA)
Description
The present invention relates to an interactive method and apparatus, and more particularly to a method and apparatus for gaze recognition and interaction.

Current interactive devices (such as electronic dolls, electronic pets, or smart robots) can interact with users through limb movements or sound-and-light effects to provide entertainment. For example, an electronic pet can detect the user's voice and change its expression or perform a responsive action accordingly. By responding in real time, such a device achieves an interactive effect with the user.

However, the actions or responses of these interactive devices must be defined in advance, and during interaction they can only make simple responses to specific commands (such as pressing a button or making a sound). They cannot respond appropriately to the user's facial expressions or body language, and therefore fail to reproduce the effect of person-to-person interaction in a real scene.

In view of this, the present invention provides a gaze recognition and interaction method and apparatus that can simulate the eye-contact effect of person-to-person conversation in a real scene.
The gaze recognition and interaction method of the present invention is applicable to an electronic device having a camera and a steering gear, where the steering gear is used to steer the camera. The method includes the following steps: capturing a plurality of video frames with the camera; detecting at least one face in the current video frame of the video frames and in a rotated video frame generated by rotating the current video frame about a direction axis; using a pre-trained classifier to identify whether each detected face is gazing at the camera; and, if the recognition result confirms that a face is gazing at the camera, controlling the steering gear to turn the camera toward the face recognized as gazing, according to the position of that face in the current video frame or its position mapped back from the rotated video frame to the current video frame.

The gaze recognition and interaction apparatus of the present invention includes a camera, a steering gear, a storage device, and a processor. The camera captures a plurality of video frames. The steering gear steers the camera. The storage device stores a plurality of modules. The processor accesses and executes the modules stored in the storage device. These modules include a video frame rotation module, a face detection module, a gaze recognition module, and a steering module. The video frame rotation module rotates the current video frame of the video frames about a direction axis to obtain a rotated video frame. The face detection module detects at least one face in the current video frame and the rotated video frame. The gaze recognition module uses a pre-trained classifier to identify whether each detected face is gazing at the camera. When the recognition result of the gaze recognition module confirms that a face is gazing at the camera, the steering module controls the steering gear to turn the camera toward the face recognized as gazing, according to the position of that face in the current video frame or its position mapped back from the rotated video frame to the current video frame.

Based on the above, the gaze recognition and interaction method and apparatus of the present invention perform face detection on the video frame captured by the camera and on the same frame rotated about different axes, so that faces in various poses can be detected. A pre-trained classifier then performs gaze recognition on each detected face to confirm whether that face is gazing at the camera, and the camera is controlled to turn toward it. In this way, the eye-contact effect of person-to-person conversation in a real scene can be simulated.

To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
The invention integrates technologies such as voice recognition, face detection, and gaze recognition into a smart robot or other smart device that can interact with people. When the user's voice is received, the robot turns toward the direction of the sound, so that the camera mounted on the robot can capture video frames of the user. When the user gazes at the robot, the robot detects the face in the video frame, uses a pre-trained classifier to identify whether the detected face is gazing at the robot, and then turns its head toward the center of the face (representing the user's eyes). In this way, the eye-contact effect of person-to-person conversation in a real scene can be simulated.

FIG. 1 is a block diagram of a gaze recognition and interaction apparatus according to an embodiment of the invention. Referring to FIG. 1, the gaze recognition and interaction apparatus 10 of this embodiment is, for example, a smart robot or another electronic device that can interact with people, and includes a camera 12, a steering gear 14, a storage device 16, and a processor 18, whose functions are described as follows:

The camera 12 is composed of, for example, a lens, an aperture, a shutter, and an image sensor. The lens includes a plurality of optical lenses driven by an actuator such as a stepping motor or a voice coil motor (VCM) to change the relative positions of the lenses and thereby the focal length. The aperture is a ring-shaped opening formed by a number of metal blades; the opening enlarges or shrinks with the aperture value, thereby controlling the amount of light entering the lens. The shutter controls how long light enters the lens, and its combination with the aperture affects the exposure of the images captured by the image sensor. The image sensor is composed of, for example, a charge-coupled device (CCD), a complementary metal-oxide semiconductor (CMOS) element, or another kind of photosensitive element, and senses the intensity of light entering the lens to produce video frames of the subject.
The steering gear 14 is, for example, a servo motor that can be disposed under or around the camera 12, and pushes the camera 12 to change its position and/or angle according to a control signal from the processor 18.

The storage device 16 can be any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid state drive (SSD), a similar element, or a combination of the above elements. In this embodiment, the storage device 16 stores the software programs of the face detection module 162, the video frame rotation module 164, the gaze recognition module 166, and the steering module 168.

The processor 18 is, for example, a central processing unit (CPU), or another programmable microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), a similar element, or a combination of the above elements. In this embodiment, the processor 18 accesses and executes the modules stored in the storage device 16 to implement the gaze recognition and interaction method of the embodiments of the invention.

FIG. 2 is a flowchart of a gaze recognition and interaction method according to an embodiment of the invention. Referring to FIG. 1 and FIG. 2 together, the method of this embodiment is applicable to the gaze recognition and interaction apparatus 10 described above. The detailed flow of the method is described below with reference to the elements of the gaze recognition and interaction apparatus 10 in FIG. 1.
First, the processor 18 controls the camera 12 to capture a plurality of video frames (step S202). Next, the processor 18 executes the video frame rotation module 164 to rotate the current video frame about a direction axis into a rotated video frame, and executes the face detection module 162 to detect at least one face in the current video frame and the rotated video frame (step S204). The face detection module 162 performs, for example, the Viola-Jones detection method or another face detection algorithm to process the video frames captured by the camera 12 and the rotated video frames in real time, and detects the faces appearing in these frames.
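As an illustration of step S204 (not part of the patent text), the detection on the current frame and on rotated copies of it could be sketched with OpenCV's Haar-cascade implementation of the Viola-Jones detector; the cascade file, the example rotation angles, and the detector parameters below are assumptions made for the sketch.

```python
import cv2

# Haar cascade bundled with OpenCV; any frontal-face cascade would do for this sketch.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return face bounding boxes (x, y, w, h) found in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return list(face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))

def rotate_frame(frame, angle_deg):
    """Rotate the frame about its centre (counter-clockwise positive), keeping its size."""
    h, w = frame.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(frame, matrix, (w, h))

# Step S204: detect faces in the current frame and in rotated copies of it.
capture = cv2.VideoCapture(0)
ok, current = capture.read()
if ok:
    detections = {0: detect_faces(current)}
    for angle in (30, -30):          # example rotation angles, assumed for the sketch
        detections[angle] = detect_faces(rotate_frame(current, angle))
```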
Specifically, in an initial scene of interaction with a person, the face may not be directly facing the gaze recognition and interaction apparatus 10, so the face may appear sideways or tilted in the video frames captured by the camera 12. To address this, this embodiment rotates the current video frame about the horizontal or vertical axis, clockwise or counterclockwise, by a certain angle so that the face detection module 162 can perform face detection. By repeating the steps of rotating the video frame and detecting faces, a face that is originally tilted in the video frame has a chance of being turned upright, allowing the face detection module 162 to detect it successfully.

For example, FIG. 3 is a schematic diagram of rotating a video frame according to an embodiment of the invention. Referring to FIG. 3, assume that the x-axis, y-axis, and z-axis are the three direction axes of three-dimensional space, where the xz plane is the horizontal plane and the xy plane is the vertical plane. In FIG. 3, rotating from the z-axis toward the x-axis (clockwise about the y-axis) represents rotation in the horizontal direction, while rotating from the y-axis toward the x-axis (counterclockwise about the z-axis) represents rotation in the vertical direction. By rotating the video frame clockwise or counterclockwise about different direction axes and performing face detection after each rotation, faces can still be detected under various face poses.
It should be noted that the same face may be detected in both the original video frame and the rotated video frame, although it actually represents the same person. To address this, an embodiment of the invention uses an area ratio to exclude duplicate faces: if the effective area ratio of a face detected in another direction (that is, after rotation) is greater than a certain threshold, that face is regarded as the same face, its information is discarded, and no gaze recognition is performed on it. The effective area ratio can be understood as an overlap area ratio: if a face detected in the rotated video frame overlaps a face detected in the original video frame and the overlap ratio exceeds a certain threshold, subsequent gaze recognition is performed only on the face detected in the original video frame, not on the face detected in the rotated video frame. This guarantees that gaze recognition is performed only once for each face in a video frame, avoiding duplication. Note that face detection targets all faces in the original video frame and the rotated video frame, and each face is recognized separately and only once.

In detail, after detecting a face in the rotated video frame, the face detection module 162 further maps that face back to the current video frame and compares it with the face at the corresponding position in the current video frame, determining whether the ratio of the overlap area between the mapped-back face and the face originally in the current video frame to the original area of the face in the current video frame is greater than a threshold. If the ratio is greater than the threshold, the face detected in the rotated video frame and the face detected in the current video frame belong to the same person; in that case, the face detection module 162 discards the information of the face detected in the rotated video frame and does not perform gaze recognition on it, avoiding repeated recognition.
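A minimal sketch of the overlap check described above; the (x, y, w, h) box format and the 0.5 threshold are assumptions, and the boxes from the rotated frame are taken to be already mapped back to current-frame coordinates.

```python
def overlap_ratio(mapped_box, original_box):
    """Overlap area divided by the area of the face found in the current frame.

    Both boxes are (x, y, w, h); mapped_box is a face detected in the rotated
    frame after being mapped back to current-frame coordinates.
    """
    ax, ay, aw, ah = mapped_box
    bx, by, bw, bh = original_box
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    return (ix * iy) / float(bw * bh)

def merge_detections(current_faces, mapped_faces, threshold=0.5):  # threshold assumed
    """Keep a rotated-frame face only if it does not duplicate a current-frame face."""
    kept = list(current_faces)
    for box in mapped_faces:
        if all(overlap_ratio(box, cur) <= threshold for cur in current_faces):
            kept.append(box)
    return kept
```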
Then, the processor 18 executes the gaze recognition module 166, which uses a pre-trained classifier to identify whether each face detected by the face detection module 162 is gazing at the camera 12, so as to confirm whether any face is gazing at the camera (step S206). In detail, the gaze recognition module 166, for example, collects a large number of face images in advance, and a user judges whether the face in each image is gazing at the camera, so that each face image is annotated with a gaze label. The gaze recognition module 166 can then use these face images and their corresponding gaze labels to train a neural network and obtain a classifier for recognizing whether a face is gazing at the camera. The neural network includes, for example, two convolutional layers, two fully connected layers, and one output layer using the softmax function, but is not limited thereto. Those skilled in the art may, according to actual needs, use a convolutional neural network that includes different numbers and combinations of convolutional layers, pooling layers, fully connected layers, and output layers, or another kind of neural network.
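As an illustration only, a classifier with the topology mentioned above (two convolutional layers, two fully connected layers, and a softmax output) might look like the following PyTorch sketch; the channel counts, kernel sizes, and the 64x64 grey-scale input are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class GazeClassifier(nn.Module):
    """Binary classifier: is the detected face gazing at the camera or not."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                 # two convolutional layers
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(               # two fully connected layers
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, 2),                         # gaze / no-gaze logits
        )

    def forward(self, x):                              # x: (batch, 1, 64, 64)
        return torch.softmax(self.classifier(self.features(x)), dim=1)

# Training would minimise cross-entropy between the labels annotated by the user
# and the network output computed from the collected face images.
```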
Finally, when the recognition result of the gaze recognition module 166 confirms that a face is gazing at the camera, the processor 18 executes the steering module 168 to control the steering gear 14 to turn the camera 12 toward the face recognized as gazing, according to the position of that face in the current video frame or its position mapped back from the rotated video frame to the current video frame (step S208). In detail, when a face in the rotated video frame is recognized as gazing, the gaze recognition module 166 first maps the position of that face back to the current video frame and uses it as the basis for controlling the steering of the camera 12. For example, assume the rotation angle of the rotated video frame is α, the detected face position is (x0, y0), and the width and height of the original video frame are w and h; then the position (x, y) mapped back to the original video frame is:
Counterclockwise rotation by α (taking the rotation to be about the center of the frame): x = (x0 - w/2)·cos α - (y0 - h/2)·sin α + w/2, y = (x0 - w/2)·sin α + (y0 - h/2)·cos α + h/2;

Clockwise rotation by α: x = (x0 - w/2)·cos α + (y0 - h/2)·sin α + w/2, y = -(x0 - w/2)·sin α + (y0 - h/2)·cos α + h/2.
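In an implementation the same mapping can be obtained numerically rather than in closed form. The sketch below assumes the rotated frame was produced by rotating the original about its centre with OpenCV, and maps a detected point back by inverting that affine transform; it is an illustration under that assumption, not the patent's formula.

```python
import cv2
import numpy as np

def map_back(point, angle_deg, frame_size):
    """Map a point detected in the rotated frame back to the original frame.

    Assumes the rotated frame was produced by rotating the original about its
    centre by angle_deg (counter-clockwise positive) at scale 1.
    """
    w, h = frame_size
    forward = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)  # 2x3 affine
    inverse = cv2.invertAffineTransform(forward)
    x0, y0 = point
    x, y = inverse @ np.array([x0, y0, 1.0])
    return float(x), float(y)
```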
It should be noted that, in one embodiment, the steering module 168, for example, divides the current video frame into a plurality of equal regions, and controls the steering gear 14 to turn the camera 12 toward the face according to the distance and direction by which the position of the face in the current video frame (or its position mapped back from the rotated video frame to the current video frame) deviates from the central region, so that the face will be located in the central region of the video frames captured by the camera 12 after steering. In another embodiment, by mapping the steering range of the camera to the width w of the video frame, the direction and angle by which the camera should rotate can be calculated from the pixel difference between the face and the central region. In yet another embodiment, the camera may be translated, or translated and rotated at the same time, so that the face is located in the central region of the video frames captured by the camera 12 after the translation and/or rotation; the disclosure is not limited in this respect.

For example, FIG. 4 is a schematic diagram of controlling the steering of the camera according to an embodiment of the invention. Referring to FIG. 4, assume that the video frame 40 is a video frame captured by the camera, and the face 42 is a face recognized as gazing. As shown in FIG. 4, the video frame 40 is divided into nine regions, and the face 42 recognized as gazing is located in the lower-right region 40b. According to the distance and direction by which the position of the face 42 (for example, the position of its center point) deviates from the central region 40a (its center point), the camera can be controlled to turn in the opposite direction (in this embodiment, toward the lower right), so that the face 42 is located in the central region 40a of the video frames captured by the camera after steering. By keeping the face 42 in the central region 40a of the video frame 40, the camera is turned toward the gazing position.
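For the embodiment that ties the steering range to the frame width w, the conversion from pixel offset to steering angles could be sketched as follows; the field-of-view values and the sign conventions are assumptions made for the sketch.

```python
def steering_command(face_center, frame_size, hfov_deg=60.0, vfov_deg=40.0):
    """Convert the face-centre offset from the frame centre into pan/tilt angles.

    face_center and frame_size are in pixels; hfov_deg and vfov_deg are assumed
    horizontal and vertical fields of view of the camera.
    """
    (fx, fy), (w, h) = face_center, frame_size
    dx, dy = fx - w / 2, fy - h / 2       # pixel offset from the central region
    pan = dx / w * hfov_deg               # positive: turn toward the right
    tilt = dy / h * vfov_deg              # positive: turn toward the bottom
    return pan, tilt

# A face centred in the lower-right region of a 640x480 frame:
print(steering_command((520, 400), (640, 480)))   # about (18.75, 13.33) degrees
```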
With the gaze recognition and interaction method described above, it is possible to recognize whether anyone nearby is gazing at the gaze recognition and interaction apparatus 10 of this embodiment, and to turn the gaze recognition and interaction apparatus 10 toward the gazing face, thereby simulating the eye-contact effect of conversing with a person in a real scene.

It should be noted that, in an initial scene of interaction with a person, the face may not appear within the field of view of the camera of the apparatus; and even if the face appears within the field of view and is gazing at the camera, this may be only a passing glance rather than deliberate attention. To address this, the invention provides another embodiment that solves the above problems and thereby achieves a better recognition effect.

In detail, FIG. 5 is a flowchart of a gaze recognition and interaction method according to an embodiment of the invention. Referring to FIG. 1 and FIG. 5 together, the method of this embodiment is applicable to the gaze recognition and interaction apparatus 10 described above. The detailed flow of the method is described below with reference to the elements of the gaze recognition and interaction apparatus 10 in FIG. 1.

First, the processor 18 receives audio through a sound receiving device and determines the source direction of the audio, so as to control the steering gear 14 to turn the camera 12 toward that source direction (step S502). The sound receiving device is, for example, a microphone, a directional microphone, a microphone array, or another device that can identify the direction of the sound source; the disclosure is not limited in this respect. By turning the camera 12 toward the source direction of the audio, it is ensured that the camera 12 can capture video frames containing the face of the person making the sound, so that subsequent gaze recognition can be performed.
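Step S502 reduces to a very small control step once a direction-of-arrival estimate is available; in the sketch below, estimate_direction_of_arrival and the servo interface are hypothetical placeholders standing in for whatever microphone-array processing and steering-gear API the device actually uses.

```python
def turn_toward_sound(samples, servo, estimate_direction_of_arrival):
    """Step S502: steer the camera toward the audio source.

    estimate_direction_of_arrival is a hypothetical helper returning an azimuth
    in degrees relative to the camera's current heading; servo.pan_by is an
    assumed interface for a relative pan of the steering gear.
    """
    azimuth = estimate_direction_of_arrival(samples)
    servo.pan_by(azimuth)
    return azimuth
```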
Next, the processor 18 controls the camera 12 to capture a plurality of video frames (step S504). The processor 18 executes the video frame rotation module 164 to rotate the current video frame about a direction axis into a rotated video frame, and executes the face detection module 162 to detect at least one face in the current video frame and the rotated video frame (step S506). The processor 18 also executes the gaze recognition module 166, which uses the pre-trained classifier to identify whether each face detected by the face detection module 162 is gazing at the camera 12, so as to confirm whether any face is gazing at the camera (step S508). Steps S504 to S508 are the same as or similar to steps S202 to S206 of the previous embodiment, so their details are not repeated here.

In contrast to the previous embodiment, in which a gazing face is confirmed as soon as it is recognized in the current video frame, this embodiment requires gaze to be recognized in a plurality of consecutive video frames before a gazing face is confirmed. Accordingly, after the gaze recognition module 166 recognizes in step S508 that a face in the current video frame is gazing at the camera, it determines whether the number of consecutive video frames in which gaze has been determined is greater than a preset number (step S510).

If the number of video frames in which gaze has been determined is not greater than the preset number, the flow returns to step S504, in which the processor 18 controls the camera 12 to capture the next video frame, the face detection module 162 continues to detect faces in the next video frame and its rotated video frame, and the gaze recognition module 166 identifies whether each detected face is gazing at the camera 12, so as to determine whether a face in the next video frame is gazing at the camera. If so, the count of consecutive video frames in which gaze has been determined is incremented, and the flow proceeds to step S510 for the determination.

If the number of video frames in which gaze has been determined is greater than the preset number, it is confirmed that a face is gazing at the camera. The processor 18 then executes the steering module 168 to control the steering gear 14 to turn the camera 12 toward the face recognized as gazing, according to the position of the face in the current video frame or its position mapped back from the rotated video frame to the current video frame (step S512). The steering method has been disclosed in the previous embodiment, so its details are not repeated here.
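The consecutive-frame confirmation of steps S504 to S512 amounts to a simple counter; in the sketch below the preset number and the two helper callables are assumptions standing in for the modules described above.

```python
PRESET_NUMBER = 5   # assumed value for the required number of consecutive frames

def confirm_and_steer(frame_stream, detect_gazing_face, steer_toward):
    """Confirm gaze only after it is seen in enough consecutive frames.

    detect_gazing_face(frame) is assumed to return the position of a gazing
    face in current-frame coordinates, or None; steer_toward(position) stands
    in for the steering module controlling the steering gear.
    """
    consecutive = 0
    for frame in frame_stream:
        position = detect_gazing_face(frame)
        if position is None:
            consecutive = 0                  # gaze interrupted: restart the count
            continue
        consecutive += 1
        if consecutive > PRESET_NUMBER:      # step S510
            steer_toward(position)           # step S512
            consecutive = 0
```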
By turning the camera 12 toward the source direction of the audio, it is ensured that the camera 12 can capture video frames containing the face of the person making the sound; and by checking across a plurality of consecutive video frames whether a face is gazing at the camera, it can be confirmed whether the user's intention is really to gaze at the apparatus. A better recognition effect can thereby be obtained.

In summary, the gaze recognition and interaction method and apparatus of the present invention can perform real-time face detection and gaze recognition in the background while the camera captures video frames, and automatically control and adjust the steering of the camera. Whenever a gaze is detected, the camera (or the head of a robot containing the camera) immediately turns toward it, achieving an effect that approximates the eye contact between people conversing in a real scene.

Although the invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Any person having ordinary knowledge in the art may make slight changes and refinements without departing from the spirit and scope of the invention; therefore, the scope of protection of the invention is defined by the appended claims.
10‧‧‧gaze recognition and interaction apparatus
12‧‧‧camera
14‧‧‧steering gear
16‧‧‧storage device
18‧‧‧processor
40‧‧‧video frame
40a‧‧‧central region
40b‧‧‧lower-right region
42‧‧‧face
S202~S208, S502~S512‧‧‧steps
FIG. 1 is a block diagram of a gaze recognition and interaction apparatus according to an embodiment of the invention.
FIG. 2 is a flowchart of a gaze recognition and interaction method according to an embodiment of the invention.
FIG. 3 is a schematic diagram of rotating a video frame according to an embodiment of the invention.
FIG. 4 is a schematic diagram of controlling the steering of the camera according to an embodiment of the invention.
FIG. 5 is a flowchart of a gaze recognition and interaction method according to an embodiment of the invention.
Claims (12)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710887858.8A CN107622248B (en) | 2017-09-27 | 2017-09-27 | Gaze identification and interaction method and device
CN201710887858.8 | 2017-09-27 | |
CN201710887858.8 | 2017-09-27 | |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201916669A | 2019-04-16 |
TWI683575B TWI683575B (en) | 2020-01-21 |
Family
ID=61090845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW106138396A TWI683575B (en) | 2017-09-27 | 2017-11-07 | Method and apparatus for gaze recognition and interaction |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107622248B (en) |
TW (1) | TWI683575B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388857A (en) * | 2018-02-11 | 2018-08-10 | 广东欧珀移动通信有限公司 | Method for detecting human face and relevant device |
CN108319937A (en) * | 2018-03-28 | 2018-07-24 | 北京市商汤科技开发有限公司 | Method for detecting human face and device |
CN113635833A (en) * | 2020-04-26 | 2021-11-12 | 晋城三赢精密电子有限公司 | Vehicle-mounted display device, method and system based on automobile A column and storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010026520A2 (en) * | 2008-09-03 | 2010-03-11 | Koninklijke Philips Electronics N.V. | Method of performing a gaze-based interaction between a user and an interactive display system |
CN102143314A (en) * | 2010-02-02 | 2011-08-03 | 鸿富锦精密工业(深圳)有限公司 | Control system and method of pan/tile/zoom (PTZ) camera as well as adjusting device with control system |
CN101917548A (en) * | 2010-08-11 | 2010-12-15 | 无锡中星微电子有限公司 | Image pickup device and method for adaptively adjusting picture |
EP2790126B1 (en) * | 2013-04-08 | 2016-06-01 | Cogisen SRL | Method for gaze tracking |
CN105763829A (en) * | 2014-12-18 | 2016-07-13 | 联想(北京)有限公司 | Image processing method and electronic device |
CN105898136A (en) * | 2015-11-17 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Camera angle adjustment method, system and television |
CN106407882A (en) * | 2016-07-26 | 2017-02-15 | 河源市勇艺达科技股份有限公司 | Method and apparatus for realizing head rotation of robot by face detection |
CN106412420B (en) * | 2016-08-25 | 2019-05-03 | 安徽华夏显示技术股份有限公司 | It is a kind of to interact implementation method of taking pictures |
CN206200967U (en) * | 2016-09-09 | 2017-05-31 | 南京玛锶腾智能科技有限公司 | Robot target positioning follows system |
Also Published As
Publication number | Publication date |
---|---|
CN107622248A (en) | 2018-01-23 |
TWI683575B (en) | 2020-01-21 |
CN107622248B (en) | 2020-11-10 |