CN107622248B - Gaze identification and interaction method and device


Info

Publication number
CN107622248B
CN107622248B (application CN201710887858.8A)
Authority
CN
China
Prior art keywords
face
video frame
camera
eye
current video
Prior art date
Legal status
Active
Application number
CN201710887858.8A
Other languages
Chinese (zh)
Other versions
CN107622248A (en)
Inventor
蒋静
Current Assignee
Via Technologies Inc
Original Assignee
Via Technologies Inc
Priority date
Filing date
Publication date
Application filed by Via Technologies Inc
Priority to CN201710887858.8A (CN107622248B)
Priority to TW106138396A (TWI683575B)
Publication of CN107622248A
Application granted
Publication of CN107622248B

Abstract

A gaze recognition and interaction method and device are suitable for an electronic device having a camera and a steering engine for steering the camera. The method comprises the following steps: acquiring a plurality of video frames with the camera; detecting at least one face in a current video frame of the video frames and in a rotated video frame generated by rotating the current video frame about a direction axis; identifying, with a pre-trained classifier, whether each detected face is gazing at the camera; and if a gazing face is confirmed from the identification result, controlling the steering engine to steer the camera toward the face identified as gazing, according to the position of that face in the current video frame or its position in the rotated video frame mapped back to the current video frame.

Description

Gaze identification and interaction method and device
Technical Field
The present disclosure relates to an interaction method and apparatus, and more particularly, to a gaze recognition and interaction method and apparatus.
Background
Existing interactive devices (such as electronic dolls, electronic pets, or intelligent robots) can interact with users through body movement or audio-visual effects to provide entertainment. For example, an electronic pet may detect the user's voice and change its expression or move in response. By responding in real time in this way, it achieves an interactive effect with the user.
However, the actions or responses of these interactive devices must be predefined. During interaction with the user, they can only perform simple responses to specific commands (such as pressing a key or making a sound); they cannot respond appropriately to the user's facial expressions or body language, and therefore cannot reproduce the person-to-person interaction of a real scene.
Disclosure of Invention
In view of this, the present application provides a gaze recognition and interaction method and apparatus that can simulate the eye contact that occurs during person-to-person conversation in a real scene.
The gaze identification and interaction method is suitable for an electronic device having a camera and a steering engine, the steering engine being used for steering the camera. The method comprises the following steps: acquiring a plurality of video frames with the camera; detecting at least one face in a current video frame of the video frames and in a rotated video frame generated by rotating the current video frame about a direction axis; identifying, with a pre-trained classifier, whether each detected face is gazing at the camera; and if a gazing face is confirmed from the identification result, controlling the steering engine to steer the camera toward the face identified as gazing, according to the position of that face in the current video frame or its position in the rotated video frame mapped back to the current video frame.
The gaze recognition and interaction device comprises a camera, a steering engine, a storage device, and a processor. The camera is used for acquiring a plurality of video frames. The steering engine is used for steering the camera. The storage device is used for storing a plurality of modules. The processor is used for accessing and executing the modules stored in the storage device. The modules comprise a video frame rotation module, a face detection module, a gaze recognition module, and a steering module. The video frame rotation module rotates a current video frame of the video frames about a direction axis into a rotated video frame. The face detection module detects at least one face in the current video frame and the rotated video frame. The gaze recognition module uses a pre-trained classifier to recognize whether each detected face is gazing at the camera. When a gazing face is confirmed from the recognition result of the gaze recognition module, the steering module controls the steering engine to steer the camera toward the face recognized as gazing, according to the position of that face in the current video frame or its position in the rotated video frame mapped back to the current video frame.
Based on the above, the gaze recognition and interaction method and device of the present application rotate the video frame acquired by the camera about different direction axes and perform face detection on both the original and rotated frames, so that faces in various postures can be detected. The pre-trained classifier then performs gaze recognition on each detected face to confirm whether that face is gazing at the camera, and the camera is controlled to turn toward that face. In this way, the eye contact that occurs during person-to-person conversation in a real scene can be simulated.
In order to make the aforementioned and other features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a block diagram of a gaze identification and interaction device according to an embodiment of the present application.
Fig. 2 is a flow chart illustrating a gaze recognition and interaction method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a rotated video frame according to an embodiment of the present application.
Fig. 4 is a schematic diagram illustrating steering of a camera according to an embodiment of the present application.
Fig. 5 is a flow chart illustrating a gaze recognition and interaction method according to an embodiment of the present application.
Detailed Description
The present application integrates technologies such as voice recognition, face detection, and gaze recognition into an intelligent robot or other intelligent device capable of interacting with people. When the robot receives the user's voice, it turns toward the direction of the sound, so that the camera mounted on the robot can acquire video frames of the user. When the user looks at the robot, the robot detects the face in the video frames and uses a pre-trained classifier to identify whether the detected face is gazing at the robot, and then turns its head toward the center of the face (representing the user's eyes). The eye contact of person-to-person communication in a real scene can thereby be simulated.
Fig. 1 is a block diagram of a gaze identification and interaction device according to an embodiment of the present application. Referring to fig. 1, a gaze recognition and interaction device 10 of the present embodiment is, for example, an intelligent robot or other electronic device capable of interacting with a person, and includes a camera 12, a steering engine 14, a storage device 16, and a processor 18, and the functions thereof are as follows:
the camera 12 is composed of elements such as a lens, an aperture, a shutter, and an image sensor. Among them, the lens includes a plurality of optical lenses, which are driven by an actuator such as a stepping Motor or a Voice Coil Motor (VCM) to change a relative position between the lenses, thereby changing a focal length. The aperture is an annular opening formed by a plurality of metal blades, and the opening is enlarged or reduced according to the aperture value, so as to control the light inlet quantity of the lens. The shutter is used to control the time of light entering the lens, and the combination of the shutter and the aperture will affect the exposure of the image obtained by the image sensor. The image sensor is composed of, for example, a Charge Coupled Device (CCD), a Complementary Metal-Oxide Semiconductor (CMOS) Device or other types of photosensitive devices, which can sense the intensity of light entering the lens to generate a video frame of a subject.
The steering engine 14 is, for example, a servo motor, which can be disposed under or around the camera 12, and can push the camera 12 to change its position and/or angle according to a control signal of the processor 18.
The storage device 16 may be any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid state drive (SSD), the like, or any combination thereof. In this embodiment, the storage device 16 stores the software programs of the face detection module 162, the video frame rotation module 164, the gaze recognition module 166, and the steering module 168.
The Processor 18 is, for example, a Central Processing Unit (CPU), or other programmable Microprocessor (Microprocessor), Digital Signal Processor (DSP), programmable controller, Application Specific Integrated Circuit (ASIC), or other similar components or combinations thereof. In the present embodiment, the processor 18 is configured to access and execute the modules stored in the storage device 16, so as to implement the gaze identification and interaction method of the embodiment of the present application.
Fig. 2 is a flow chart illustrating a gaze recognition and interaction method according to an embodiment of the present application. Referring to fig. 1 and fig. 2, the method of the present embodiment is applied to the above-mentioned gaze identification and interaction device 10, and the detailed flow of the method of the present embodiment will be described below by combining various elements of the gaze identification and interaction device 10 in fig. 1.
First, the processor 18 controls the camera 12 to acquire a plurality of video frames (step S202). Next, the processor 18 executes the video frame rotation module 164 to rotate the current video frame about a direction axis into a rotated video frame, and executes the face detection module 162 to detect at least one face in the current video frame and the rotated video frame (step S204). The face detection module 162 may, for example, execute the Viola-Jones detection method or another face detection algorithm to process the video frames or rotated video frames acquired by the camera 12 in real time and detect the faces appearing in them.
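By way of illustration only, such a detection step could be sketched with OpenCV's Haar-cascade (Viola-Jones) detector as follows; the cascade file and the detection parameters are assumptions and do not come from the patent:

```python
# Minimal sketch (not from the patent): per-frame Viola-Jones face detection
# with OpenCV. The cascade file and detection parameters are assumptions.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return a list of (x, y, w, h) face rectangles found in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return list(face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
```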
In particular, in the initial scene of interaction with a person, the face may not be directly facing the gaze recognition and interaction device 10, so the face may appear sideways or tilted in the video frames acquired by the camera 12. To handle this, in this embodiment the current video frame is, for example, rotated by a certain angle clockwise or counterclockwise about the horizontal axis or the vertical axis before the face detection module 162 performs face detection. By repeating the steps of rotating the video frame and detecting faces, a face that was originally tilted in the video frame has a chance of being brought upright, so that the face detection module 162 can detect it successfully.
For example, fig. 3 is a schematic diagram of rotating a video frame according to an embodiment of the present application. Referring to fig. 3, assume that the x axis, y axis, and z axis are the three direction axes of a three-dimensional space, where the xz plane is the horizontal plane and the xy plane is the vertical plane. The rotation from the z axis toward the x axis shown in fig. 3 (clockwise about the y axis) represents rotation in the horizontal direction, while the rotation from the y axis toward the x axis (counterclockwise about the z axis) represents rotation in the vertical direction. By rotating the video frame clockwise or counterclockwise about these different direction axes and performing face detection after each rotation, faces can still be detected under various face postures.
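For illustration, rotating a frame by a given angle can be sketched with OpenCV as follows; rotating about the frame center, keeping the frame size, and the particular angles tried are all assumptions, since the patent only speaks of "a certain angle":

```python
# Minimal sketch (assumptions: rotation about the frame center, frame size kept).
import cv2

def rotate_frame(frame, angle_deg):
    """Rotate a frame by angle_deg (positive = counterclockwise in OpenCV)."""
    h, w = frame.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(frame, m, (w, h))

# Example: run face detection on the original frame and on a few rotated copies.
# candidate_frames = [frame] + [rotate_frame(frame, a) for a in (-60, -30, 30, 60)]
```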
It should be noted that the same face may be detected both in the original video frame and in the rotated video frame, even though it belongs to the same person. For this, the embodiment of the present application excludes duplicate faces by an area ratio: if the effective area ratio of a face detected in another direction (i.e., a rotated direction) is greater than a certain threshold, it is regarded as the same face, its information is discarded, and gaze recognition is not performed on it. The effective area ratio can be understood as an overlap ratio: if a face detected in the rotated video frame overlaps a face detected in the original video frame and the overlap ratio exceeds the threshold, only the face detected in the original video frame undergoes subsequent gaze recognition, and the face detected in the rotated video frame does not. This ensures that each face in a video frame undergoes gaze recognition only once and avoids repetition. Note that face detection targets all faces in the original video frame and the rotated video frames, and each face is recognized exactly once.
Specifically, after detecting a face in the rotated video frame, the face detection module 162 maps that face back to the current video frame, compares it with the face at the corresponding position in the current video frame, and determines whether the ratio of the overlapping area between the mapped-back face and the face originally in the current video frame to the original area of the face in the current video frame is greater than a threshold. If the ratio is greater than the threshold, the face detected in the rotated video frame and the face detected in the current video frame belong to the same person; in that case the face detection module 162 discards the information of the face detected in the rotated video frame and does not perform gaze recognition on it, so as to avoid repeated recognition.
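A minimal sketch of this duplicate check, assuming axis-aligned face rectangles and a threshold of 0.5 (the patent does not specify the value), might be:

```python
# Minimal sketch (assumptions: 0.5 threshold; face boxes are (x, y, w, h) rectangles).

def overlap_ratio(mapped_box, original_box):
    """Overlap area divided by the area of the face found in the current frame."""
    x1, y1, w1, h1 = mapped_box      # face from the rotated frame, mapped back
    x2, y2, w2, h2 = original_box    # face detected directly in the current frame
    ix = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    iy = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    return (ix * iy) / float(w2 * h2)

def is_duplicate(mapped_box, faces_in_current_frame, threshold=0.5):
    return any(overlap_ratio(mapped_box, f) > threshold for f in faces_in_current_frame)
```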
Then, the processor 18 executes the gaze recognition module 166 to recognize, with a pre-trained classifier, whether each face detected by the face detection module 162 is gazing at the camera 12, so as to confirm whether there is a gazing face (step S206). Specifically, the gaze recognition module 166 may, for example, acquire a large number of face images in advance, and a user determines whether the face in each face image is looking at the camera, so as to label each face image with a gaze label. The gaze recognition module 166 can then train a neural network with the face images and their corresponding gaze labels to obtain a classifier for recognizing whether a face is gazing. The neural network includes, for example, 2 convolutional layers, 2 fully connected layers, and 1 output layer using a softmax function, but is not limited thereto. Those skilled in the art may use convolutional neural networks with different numbers and combinations of convolutional layers, pooling layers, fully connected layers, and output layers, or other kinds of neural networks, as desired.
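A network with the layer counts mentioned above could be sketched in PyTorch as follows; the pooling layers, channel sizes, kernel sizes, and the 64x64 face-crop resolution are assumptions, since the patent fixes only the number of layers:

```python
# Minimal sketch (assumptions: 64x64 RGB face crops, 16/32 channels, 3x3 kernels,
# max pooling added between layers).
import torch
import torch.nn as nn

class GazeClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # 2 convolutional layers
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(          # 2 fully connected layers
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, 2),                    # "gazing" / "not gazing"
        )

    def forward(self, x):                         # x: (N, 3, 64, 64)
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```

During training one would normally apply a cross-entropy loss to the pre-softmax logits; the softmax here simply reflects the output layer described above.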
Finally, when a gazing face is confirmed from the recognition result of the gaze recognition module 166, the processor 18 executes the steering module 168 to control the steering engine 14 to steer the camera 12 toward the face recognized as gazing, according to the position of that face in the current video frame or its position in the rotated video frame mapped back to the current video frame (step S208). Specifically, when a face in the rotated video frame is recognized as gazing, the gaze recognition module 166 maps the position of that face back to the current video frame as the basis for controlling the steering of the camera 12. For example, assume the rotation angle of the rotated video frame is α, the detected face position is (x0, y0), and the width and height of the original video frame are w and h; then the position (x, y) mapped back to the original video frame is given by one coordinate-transformation formula for counterclockwise rotation and another for clockwise rotation (both formulas appear only as images in the original publication).
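Because the two formulas are reproduced only as images in the publication, the following is merely a reconstruction under the assumption that the frame is rotated about its center and keeps its original width and height; the sign convention may differ from the patent's exact formulas:

```python
# Reconstruction under assumptions (rotation about the frame center, frame size
# unchanged); not guaranteed to match the patent's exact formulas.
import math

def map_back(x0, y0, alpha, w, h, counterclockwise=True):
    """Map a face position (x0, y0) in a frame rotated by alpha (radians)
    back to the original w-by-h frame."""
    a = -alpha if counterclockwise else alpha    # undo the rotation
    cx, cy = w / 2.0, h / 2.0
    dx, dy = x0 - cx, y0 - cy
    x = dx * math.cos(a) - dy * math.sin(a) + cx
    y = dx * math.sin(a) + dy * math.cos(a) + cy
    return x, y
```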
It should be noted that in this embodiment the steering module 168, for example, equally divides the current video frame into a plurality of regions, and controls the steering engine 14 to steer the camera 12 toward the face according to the distance and direction by which the position of the face in the current video frame (or its position mapped back to the current video frame from the rotated video frame) deviates from the central region, so that after the steering the face is located in the central region of the video frames acquired by the camera 12. In another embodiment, the direction and angle by which the camera should rotate can be calculated from the pixel offset of the face from the central region by mapping the steering range of the camera to the width w of the video frame. In yet another embodiment, the camera may be translated, or both translated and rotated, so that the face ends up in the center of the video frames acquired by the camera 12 after the translation and/or rotation; the disclosure is not limited in this respect.
For example, fig. 4 is a schematic diagram illustrating steering of a camera according to an embodiment of the present application. Referring to fig. 4, assume that the video frame 40 is a video frame acquired by the camera and that the face 42 has been recognized as gazing. As shown in fig. 4, the video frame 40 is divided into 9 regions, and the face 42 recognized as gazing is located in the lower-right region 40b. According to the distance and direction of the position of the face 42 (e.g., the center point of the face 42) from the central region 40a (its center point), the camera can be controlled to turn toward the face (for example, toward the lower right), so that the face 42 is located in the central region 40a of the video frames acquired by the camera after the turn. By keeping the face 42 in the central region 40a of the video frame 40, the camera remains turned toward the gazing face.
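A minimal sketch of this region-based decision, assuming a 3x3 grid as in fig. 4 (how the resulting offset is converted into a concrete steering-engine command is device-specific and left out):

```python
# Minimal sketch (assumption: 3x3 grid as in fig. 4; face_box is (x, y, w, h)).

def region_offset(face_box, frame_w, frame_h):
    """Return (dx, dy): how many regions the face center lies from the center region.
    (0, 0) means the face is already in the central region."""
    x, y, w, h = face_box
    cx, cy = x + w / 2.0, y + h / 2.0
    col = min(int(cx / (frame_w / 3.0)), 2)    # 0, 1, 2 from left to right
    row = min(int(cy / (frame_h / 3.0)), 2)    # 0, 1, 2 from top to bottom
    return col - 1, row - 1

# Example: for the face 42 in fig. 4 the offset would be (1, 1), i.e. pan right
# and tilt down until the offset becomes (0, 0).
```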
With the above gaze recognition and interaction method, the device can recognize whether a person nearby is gazing at the gaze recognition and interaction device 10 of this embodiment and turn toward the gazing face, thereby simulating the eye contact of a real conversation.
It should be noted that, in the initial scene of interaction, the face may not appear in the field of view of the device's camera at all; and even if a face appears in the field of view and is looking toward the camera, it may merely be glancing past rather than deliberately gazing. In view of this, the present application provides another embodiment that addresses these problems and thereby achieves a better recognition effect.
Specifically, fig. 5 is a flowchart illustrating a gaze recognition and interaction method according to an embodiment of the present application. Referring to fig. 1 and fig. 5, the method of the present embodiment is applied to the above-mentioned gaze identification and interaction device 10, and the detailed flow of the method of the present embodiment will be described below by combining various elements of the gaze identification and interaction device 10 in fig. 1.
First, the processor 18 receives audio with a sound receiving device and determines the source direction of the audio so as to control the steering engine 14 to steer the camera 12 toward that direction (step S502). The sound receiving device is, for example, a microphone, a directional microphone, or a microphone array capable of identifying the direction of the sound source; the disclosure is not limited in this respect. By turning the camera 12 toward the direction of the audio source, the camera 12 can be ensured to acquire video frames containing the face that produced the audio, for the subsequent gaze recognition.
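Purely as an illustration of this step, the sound-driven coarse steering could look like the sketch below; estimate_source_direction() and steer_to() are hypothetical helpers standing in for the microphone array's direction-of-arrival output and for the steering-engine control, neither of which is specified by the patent:

```python
# Hypothetical helpers only; the patent does not specify how the direction of
# arrival is obtained from the sound receiving device.

def face_the_speaker(mic_array, steering_engine, audio_chunk):
    angle_deg = mic_array.estimate_source_direction(audio_chunk)  # e.g. 0 = straight ahead
    steering_engine.steer_to(angle_deg)   # coarse turn toward the sound source
```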
Next, the processor 18 controls the camera 12 to acquire a plurality of video frames (step S504). The processor 18 executes the video frame rotation module 164 to rotate the current video frame about a direction axis into a rotated video frame, executes the face detection module 162 to detect at least one face in the current video frame and the rotated video frame (step S506), and executes the gaze recognition module 166 to recognize, with the pre-trained classifier, whether each face detected by the face detection module 162 is gazing at the camera 12, so as to determine whether there is a gazing face (step S508). Steps S504 to S508 are the same as or similar to steps S202 to S206 of the previous embodiment, so their details are not repeated here.
In contrast to the previous embodiment, in which a gazing face is confirmed as soon as one is identified in the current video frame, this embodiment confirms a gazing face only when gazing faces are identified in a number of consecutive video frames. Accordingly, after the gaze recognition module 166 recognizes that there is a gazing face in the current video frame in step S508, it determines whether the number of consecutive video frames judged to contain a gazing face is greater than a preset number (step S510).
If the number of consecutive video frames containing a gazing face is not greater than the preset number, the flow returns to step S504: the processor 18 controls the camera 12 to acquire the next video frame, the face detection module 162 detects the faces in the next video frame and in the rotated video frames generated from it, and the gaze recognition module 166 recognizes whether each detected face is gazing at the camera 12, so as to determine whether the next video frame contains a gazing face. If it does, the count of consecutive video frames containing a gazing face is incremented, and the flow proceeds to step S510 for the determination.
If the number of consecutive video frames containing a gazing face is greater than the preset number, that is, a gazing face is confirmed, the processor 18 executes the steering module 168 to control the steering engine 14 to steer the camera 12 toward the face identified as gazing, according to the position of that face in the current video frame or its position in the rotated video frame mapped back to the current video frame (step S512). The steering method has been described in the foregoing embodiment and is not repeated here.
By turning the camera 12 toward the direction of the audio source, the camera 12 is ensured to acquire video frames containing the face that produced the audio; and by checking whether that face is gazing across a number of consecutive video frames, it can be confirmed whether the user actually intends to gaze. A better recognition effect can thereby be obtained.
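The flow of steps S502 to S512 can be summarized with the following sketch; detect_gazing_faces() and steer_toward() are hypothetical wrappers around the detection, recognition, and steering steps described above, and the preset number of 5 consecutive frames is an assumption:

```python
# Minimal sketch of the confirmation loop (steps S504-S512); the helper
# functions and the threshold of 5 consecutive frames are assumptions.
PRESET_NUMBER = 5

def confirm_and_steer(camera):
    consecutive = 0
    while True:
        frame = camera.read()
        gazing_faces = detect_gazing_faces(frame)     # detection + gaze recognition
        if gazing_faces:
            consecutive += 1
            if consecutive > PRESET_NUMBER:           # gazing confirmed
                steer_toward(gazing_faces[0], frame)  # step S512
                consecutive = 0
        else:
            consecutive = 0                           # reset on any non-gazing frame
```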
In summary, with the gaze recognition and interaction method and device of the present application, when the camera captures video frames, the background system performs real-time face detection and gaze recognition and automatically adjusts the steering of the camera. Therefore, when a gazing face is detected, the camera (or the head of a robot that includes the camera) can immediately turn to meet that gaze, approximating the eye contact that occurs when people communicate with each other in a real scene.
Although the present application has been described with reference to the above embodiments, the application is not limited thereto, and various changes and modifications can be made by those skilled in the art without departing from the spirit and scope of the present application.
Description of the reference numerals
10: gaze recognition and interaction device
12: camera head
14: steering engine
16: storage device
18: processor with a memory having a plurality of memory cells
40: video frame
40 a: central region
40 b: lower right area
42: human face
S202 to S208, S502 to S512: Steps

Claims (8)

1. A gaze identification and interaction method is applicable to an electronic device with a camera and a steering engine, wherein the steering engine is used for steering the camera, and the method comprises the following steps:
acquiring a plurality of video frames by using the camera;
detecting at least one face in a current video frame of the video frames and a rotating video frame generated after the current video frame rotates relative to a direction axis;
identifying, with a pre-trained classifier, whether each detected face looks at the camera, wherein the classifier is trained with face images and gaze labels, each gaze label being marked on a face image and indicating whether the face in that face image looks at the camera; and
if a gazing face is confirmed from the identification result, controlling the steering engine to steer the camera toward the face identified as gazing, according to the position of the face in the current video frame or the position in the rotating video frame mapped back to the current video frame,
wherein, after the step of detecting the current video frame of the video frames and the face in the rotated video frame generated after the current video frame is rotated relative to the direction axis, the method further comprises:
judging whether a ratio of an overlapping area, between the face in the rotating video frame after it is mapped back to the current video frame and the face at the corresponding position in the current video frame, to an original area of the face in the current video frame is greater than a threshold; and
if the ratio is greater than the threshold, forgoing saving the information of the face in the rotating video frame and not identifying whether the face in the rotating video frame looks at the camera; wherein after the step of identifying whether each detected face looks at the camera with the pre-trained classifier, the method further comprises:
detecting faces in a next video frame of the current video frame and in a rotated video frame generated after the next video frame is rotated, and identifying whether each detected face looks at the camera, so as to judge whether the next video frame has a gazing face; and
repeating the above steps, and confirming that there is a gazing face when the number of consecutive video frames judged to have a gazing face is greater than a preset number.
2. The gaze recognition and interaction method of claim 1, wherein the electronic device further comprises a sound receiving device, and further comprising, prior to the step of acquiring the video frames with the camera:
receiving audio with the sound receiving device, and judging a source direction of the audio so as to control the steering engine to steer the camera toward the source direction.
3. The gaze recognition and interaction method of claim 1, further comprising, prior to the step of recognizing whether each of the detected faces is looking at the camera using the pre-trained classifier:
collecting a large number of face images, and labeling each face image with a gaze label according to whether the face in the face image looks at the camera; and
training a neural network with the face images and their corresponding gaze labels to obtain the classifier for recognizing gazing.
4. The gaze recognition and interaction method of claim 1, wherein controlling the steering engine to steer the camera toward the face recognized as gazing comprises:
equally dividing the current video frame into a plurality of regions, and controlling the steering engine to steer the camera toward the face according to a distance and a direction by which the position of the face in the current video frame, or the position in the rotating video frame mapped back to the current video frame, deviates from a central region of the regions, so that the face is located in the central region of the video frames acquired by the camera after the steering.
5. A gaze identification and interaction device, comprising:
a camera to obtain a plurality of video frames;
the steering engine is used for steering the camera;
a storage device storing a plurality of modules; and
a processor accessing and executing the modules, the modules comprising:
the video frame rotating module rotates the current video frame of the video frames relative to the direction axis into a rotating video frame;
the face detection module is used for detecting at least one face in the current video frame and the rotating video frame;
the gaze recognition module is used for recognizing, with a pre-trained classifier, whether each detected face looks at the camera, wherein the classifier is trained with face images and gaze labels, each gaze label being marked on a face image and indicating whether the face in that face image looks at the camera; and
a steering module, which, when a gazing face is confirmed from the recognition result of the gaze recognition module, controls the steering engine to steer the camera toward the face recognized as gazing, according to the position of the face in the current video frame or the position in the rotating video frame mapped back to the current video frame,
wherein the face detection module further judges whether a ratio of an overlapping area, between the face in the rotating video frame after it is mapped back to the current video frame and the face at the corresponding position in the current video frame, to an original area of the face in the current video frame is greater than a threshold, and if the ratio is greater than the threshold, forgoes saving the information of the face in the rotating video frame and does not use the gaze recognition module to recognize whether the face in the rotating video frame looks at the camera;
the face detection module further detects faces in a next video frame of the current video frame and in a rotated video frame generated after the next video frame is rotated; and
the gaze recognition module further recognizes whether each detected face looks at the camera to judge whether the next video frame has a gazing face, and confirms that there is a gazing face when the number of consecutive video frames judged to have a gazing face is greater than a preset number.
6. The gaze recognition and interaction device of claim 5, further comprising:
a sound receiving device for receiving audio, wherein a source direction of the audio is judged so as to control the steering engine to steer the camera toward the source direction.
7. The gaze recognition and interaction device of claim 5, wherein the gaze recognition module further collects a plurality of face images, labels each face image with a gaze label according to whether the face in the face image is gazing, and trains a neural network with the face images and their corresponding gaze labels to obtain the classifier for recognizing gazing.
8. The gaze recognition and interaction device of claim 5, wherein the steering module equally divides the current video frame into a plurality of regions and controls the steering engine to steer the camera toward the face according to a distance and a direction by which the position of the face in the current video frame, or the position in the rotating video frame mapped back to the current video frame, deviates from a central region of the regions, such that the face is located in the central region of the video frames acquired by the camera after the steering.
CN201710887858.8A 2017-09-27 2017-09-27 Gaze identification and interaction method and device Active CN107622248B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710887858.8A CN107622248B (en) 2017-09-27 2017-09-27 Gaze identification and interaction method and device
TW106138396A TWI683575B (en) 2017-09-27 2017-11-07 Method and apparatus for gaze recognition and interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710887858.8A CN107622248B (en) 2017-09-27 2017-09-27 Gaze identification and interaction method and device

Publications (2)

Publication Number Publication Date
CN107622248A CN107622248A (en) 2018-01-23
CN107622248B 2020-11-10

Family

ID=61090845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710887858.8A Active CN107622248B (en) 2017-09-27 2017-09-27 Gaze identification and interaction method and device

Country Status (2)

Country Link
CN (1) CN107622248B (en)
TW (1) TWI683575B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388857A (en) * 2018-02-11 2018-08-10 广东欧珀移动通信有限公司 Method for detecting human face and relevant device
CN108319937A (en) * 2018-03-28 2018-07-24 北京市商汤科技开发有限公司 Method for detecting human face and device
CN113635833A (en) * 2020-04-26 2021-11-12 晋城三赢精密电子有限公司 Vehicle-mounted display device, method and system based on automobile A column and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101917548A (en) * 2010-08-11 2010-12-15 无锡中星微电子有限公司 Image pickup device and method for adaptively adjusting picture
CN106407882A (en) * 2016-07-26 2017-02-15 河源市勇艺达科技股份有限公司 Method and apparatus for realizing head rotation of robot by face detection

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102144201A (en) * 2008-09-03 2011-08-03 皇家飞利浦电子股份有限公司 Method of performing a gaze-based interaction between a user and an interactive display system
CN102143314A (en) * 2010-02-02 2011-08-03 鸿富锦精密工业(深圳)有限公司 Control system and method of pan/tile/zoom (PTZ) camera as well as adjusting device with control system
EP2790126B1 (en) * 2013-04-08 2016-06-01 Cogisen SRL Method for gaze tracking
CN105763829A (en) * 2014-12-18 2016-07-13 联想(北京)有限公司 Image processing method and electronic device
CN105898136A (en) * 2015-11-17 2016-08-24 乐视致新电子科技(天津)有限公司 Camera angle adjustment method, system and television
CN106412420B (en) * 2016-08-25 2019-05-03 安徽华夏显示技术股份有限公司 It is a kind of to interact implementation method of taking pictures
CN206200967U (en) * 2016-09-09 2017-05-31 南京玛锶腾智能科技有限公司 Robot target positioning follows system


Also Published As

Publication number Publication date
TWI683575B (en) 2020-01-21
TW201916669A (en) 2019-04-16
CN107622248A (en) 2018-01-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant