CN110826396A - Method and device for detecting eye state in video - Google Patents


Info

Publication number
CN110826396A
Authority
CN
China
Prior art keywords: eye, eye region, detecting, video, features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910883511.5A
Other languages
Chinese (zh)
Other versions
CN110826396B (en)
Inventor
张晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN201910883511.5A priority Critical patent/CN110826396B/en
Publication of CN110826396A publication Critical patent/CN110826396A/en
Application granted granted Critical
Publication of CN110826396B publication Critical patent/CN110826396B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 40/161 Human faces: detection; localisation; normalisation
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06N 3/08 Neural networks: learning methods
    • G06V 40/171 Human faces: local features and components; facial parts; occluding parts, e.g. glasses
    • G06V 40/193 Eye characteristics: preprocessing; feature extraction


Abstract

The invention discloses a method for detecting eye states in a video, which comprises the following steps: performing face detection on a current video frame of a current user to obtain first eye region features of the current video frame; calling a pre-trained SVM classifier; and sending the first eye region features into the SVM classifier for classification, and detecting the eye state of the current user according to the classification result. The eye region features can be detected, acquired, and used for training with fewer neural networks, the detection process is simple, and the efficiency of acquiring eye region features is greatly improved; this solves the problem that the prior art performs complex operations with two neural networks, and the eye state of the current user can be judged accurately.

Description

Method and device for detecting eye state in video
Technical Field
The present disclosure relates to the field of face recognition technologies, and in particular, to a method and an apparatus for detecting eye states in a video.
Background
Eyes are the most important part of the human face and the most direct sense organ for perceiving external objects, and methods for detecting the eye state are finding an ever wider range of applications. For example, a motor vehicle driver becomes fatigued after driving for a long time; detecting the eye state can reveal the driver's eye fatigue and thus help prevent traffic accidents. Parents can use eye state detection to check whether a child is talking or sleeping, and so on; for example, an online learning platform can use such state detection to remind the user to concentrate and study carefully.
At present, most eye state detection methods for video are based on feature analysis, with the following steps: capture the face picture from the video, detect it with a face neural network, and then acquire the eye region features with the face neural network and an eye neural network for training. This approach needs several neural networks to obtain the eye region features, which is cumbersome, time-consuming, and inefficient, and judging the eye state of the person in the video from a single frame is not stable enough.
In view of the above problems, a stable and efficient method for detecting the eye state in video is needed.
Disclosure of Invention
To address the problems set out above, the present method detects the eye state with a pattern classification approach: it obtains the eye region features of the current video frame and judges the eye state by means of an SVM (support vector machine), a generalized linear classifier that performs binary classification on data in a supervised learning manner.
The eye state includes an open eye state and a closed eye state.
A method for detecting eye states in videos comprises the following steps:
s101, carrying out face detection on a current video frame of a current user to obtain a first eye region characteristic of the current video frame;
s102, calling a pre-trained SVM classifier;
s103, sending the first eye region characteristics into the SVM classifier for classification, and detecting the eye state of the current user according to the classification result.
Preferably, the performing face detection on the current video frame to obtain the first eye region feature of the current video frame includes:
detecting key points of the human face by using a neural network;
acquiring human eye key points from human face key points;
and combining the human eye key points and the human face key points to map and output the first eye region characteristics.
Preferably, the performing face detection on the current video frame to obtain the first eye region feature of the current video frame further includes:
generating two minimum circumscribed rectangles of the left eye and the right eye according to the landmarks coordinates of the eyes of the face region;
and mapping the minimum circumscribed rectangle to a representation layer of the human face features in the neural network according to a receptive field calculation formula to obtain the high-dimensional features of the eye region.
Preferably, before invoking the pre-trained SVM classifier, the method further comprises:
detecting a second eye region feature of a preset user in a preset video;
and carrying out interpolation processing on the second eye region characteristics to obtain characteristics with the same dimensionality for training to obtain the SVM classifier.
Preferably, the detecting a second eye region feature of the preset user in the preset video includes:
acquiring N continuous preset video frames of a preset video in each preset time period;
N is a positive integer greater than or equal to 2;
taking the eye region features of N continuous preset video frames in each preset time period as a second eye region feature to obtain M second eye region features;
the M second eye region features include the eye region features of the left and right eyes of the preset user in the N frames, where generally M = 2N;
carrying out interpolation processing on the second eye region characteristics to change the second eye region characteristics into characteristics with the same dimensionality, and training to obtain the SVM classifier, wherein the method comprises the following steps:
and sequentially carrying out interpolation processing on each second eye region feature in the M second eye region features to obtain features with the same dimensionality, and training to obtain the SVM classifier.
An eye state detection device in video, comprising:
the detection module is used for carrying out face detection on a current video frame of a current user to obtain a first eye region characteristic of the current video frame;
the calling module is used for calling a pre-trained SVM classifier;
and the classification module is used for sending the first eye region characteristics into the SVM classifier for classification, and detecting the eye state of the current user according to a classification result.
Preferably, the detection module includes:
the detection submodule is used for detecting key points of the human face by utilizing a neural network;
the first acquisition submodule is used for acquiring human eye key points from the human face key points;
and the output submodule is used for combining the human eye key points and the human face key points to map and output the first eye region characteristics.
Preferably, the detection module further includes:
the generation submodule is used for generating minimum circumscribed rectangles for the left and right eyes according to the landmarks coordinates of the eyes of the face region;
and the second acquisition submodule is used for mapping the minimum circumscribed rectangle to a representation layer of the face features in the neural network according to a receptive field calculation formula to acquire the high-dimensional features of the eye region.
Preferably, the detection module is further configured to detect a second eye region feature of a preset user in a preset video before the pre-trained SVM classifier is called;
the eye state detection device further includes:
and the training submodule is used for carrying out interpolation processing on the second eye region characteristic to obtain the characteristic with the same dimension for training to obtain the SVM classifier.
Preferably, the detection module further includes:
the third acquisition submodule is used for acquiring N continuous preset video frames of the preset video in each preset time period;
taking eye region features of N continuous preset video frames in each preset time period as a second eye region feature, and obtaining M second eye region features in total;
the training submodule is used for:
and sequentially carrying out interpolation processing on each second eye region feature in the M second eye region features to obtain features with the same dimensionality, and training to obtain the SVM classifier.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart illustrating a method for detecting eye state in video according to the present invention;
FIG. 2 is a flow chart of a method for training an SVM classifier according to the present invention;
FIG. 3 is a block diagram of an apparatus for detecting eye state in video according to the present invention;
fig. 4 is another structural diagram of an eye state detection apparatus in a video according to the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Eyes are the most important part of the human face and the most direct sense organ for perceiving external objects, and methods for detecting the eye state are finding an ever wider range of applications. For example, a motor vehicle driver becomes fatigued after driving for a long time; detecting the eye state can reveal the driver's eye fatigue and thus help prevent traffic accidents. Parents can use eye state detection to check whether a child is talking or sleeping, and so on. For example, an online learning platform can use such state detection to remind the user to concentrate and study carefully.
At present, most eye state detection methods for video are based on feature analysis, with the following steps: detect the captured video face picture with a face neural network, and then acquire the eye region features with the face neural network and an eye neural network for training. This approach needs several neural networks to obtain the eye region features, which is cumbersome, time-consuming, and inefficient, and judging the eye state of the person in the video from a single frame is not stable enough. To solve the above technical problem, an embodiment of the present disclosure provides a method for detecting an eye state in a video, as shown in fig. 1:
the invention detects the eye state based on a mode classification detection method, extracts the eye region characteristics of the current video frame, and judges the eye state by means of an SVM classifier, comprising the following steps:
s101, carrying out face detection on a current video frame of a current user in a current video, and simultaneously obtaining a first eye region characteristic of the current video frame;
the current video frame of the current user is usually a continuous R frame in a preset time in the current video, and the first eye feature is obtained by combining G eye features obtained according to the R frame (that is, the eye feature sent to the SVM classifier is an eye feature of a multi-frame combination), where R is a positive integer greater than or equal to 2, G is an eye region feature of left and right eyes in the R frame, and usually R is 2G. Compared with the scheme of detecting and determining the eye state by using a single frame in the prior art, the eye state detection effect is clearer and more accurate.
S102, calling a pre-trained SVM classifier;
s103, sending the first eye region characteristics into an SVM classifier for classification, and detecting the eye state of the current user according to the classification result.
The working principle of the method is as follows: face detection is performed on the video frame with a face neural network, and the first eye region features of the current user are obtained with a feature extraction network (in this acquisition step, key point detection is performed on the detected face and the eye features are obtained at the same time). Several second eye region features of a preset user in a preset video are processed and combined for training, obtaining an SVM classifier; the first eye region features of the current user are then compared in the SVM classifier to determine the eye state of the current user.
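To make this working principle concrete, the following is a minimal sketch in Python. The patent does not name specific networks or libraries, so dlib's face detector and 68-point landmark predictor stand in for the face detection and key point networks, raw landmark coordinates stand in for the mapped CNN features, and scikit-learn's SVC plays the role of the SVM classifier; all component names here are illustrative assumptions, not the patent's own implementation.

```python
# A minimal sketch of the working principle, assuming dlib for face and
# landmark detection and scikit-learn for the SVM; the patent does not
# name specific libraries, so these are placeholder components.
import dlib
import numpy as np
from sklearn.svm import SVC

detector = dlib.get_frontal_face_detector()
# 68-point iBUG scheme: indices 36-41 and 42-47 are the two eyes
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_features(frame_gray):
    """Detect the face, extract the eye landmarks, and return a per-frame
    eye feature vector (flattened landmark coordinates as a stand-in for
    the mapped high-dimensional CNN features)."""
    faces = detector(frame_gray)
    if not faces:
        return None
    pts = predictor(frame_gray, faces[0])
    eye_a = np.array([(pts.part(i).x, pts.part(i).y) for i in range(36, 42)])
    eye_b = np.array([(pts.part(i).x, pts.part(i).y) for i in range(42, 48)])
    return np.concatenate([eye_a.ravel(), eye_b.ravel()]).astype(float)

def train_classifier(X, y):
    """Train the SVM on second eye region features X (one row per sample)
    with open/closed labels y; used later to classify the first features."""
    clf = SVC(kernel="linear")  # generalized linear binary classifier
    clf.fit(X, y)
    return clf
```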
The beneficial effects of the method are as follows: the eye state of the current user can be detected by training an SVM classifier with eye region features acquired from a preset user, without adding a dedicated eye neural network to acquire eye region features for recognition as in the related art. The eye region features can be detected, acquired, and used for training with fewer neural networks, the detection process is simple, and the efficiency of acquiring eye region features is greatly improved; this solves the problem that the current technology performs complex operations with three neural networks, and the eye state of the current user can be judged more accurately.
In one embodiment, performing face detection on a current video frame to obtain a first eye region feature of the current video frame includes:
detecting a face in a current video frame by using a face neural network;
acquiring key points of the face (by using a feature extraction network);
acquiring human eye key points from the human face key points (by using a feature extraction network);
and (utilizing a feature extraction network) combining the human face key points and the human eye key points to map and output first eye region features.
The specific process of this embodiment is as follows: the face is detected first; the face key points are then acquired by the feature extraction network; the eye key points are acquired from the face key points; and the eye key points and the face key points are then mapped in the feature extraction network to obtain the eye features. Compared with the prior art, the number of networks is reduced, the efficiency of acquiring eye features is improved, and the time consumed in acquiring eye features is shortened.
Alternatively, the specific implementation procedure of this embodiment is: a neural network first detects the face; a key point detection network then extracts the eye features, which are obtained by mapping the positions of the key points (namely the eye key points) onto the face features of the intermediate network layer (namely the features of the face key points, including contours, facial features, and so on).
The beneficial effects of the method are as follows: the required eye region features can be obtained quickly for a video frame with the two neural networks, avoiding three separate neural network detection passes, which shortens the time and improves efficiency; moreover, the eye region features are obtained without training a dedicated network.
In one embodiment, performing face detection on the current video frame to obtain the first eye region features of the current video frame (the first eye region features being obtained by combining and mapping the face key points and the eye key points) further includes:
generating two minimum circumscribed rectangles of the left eye and the right eye according to the landmarks coordinates of the eyes of the face region;
and mapping the minimum circumscribed rectangles to a representation layer of the face features in the neural network (namely the feature extraction network; the content represented by the representation layer is the features of the face key points) according to a receptive field calculation formula, acquiring the high-dimensional features of the eye region (namely the first eye region features). The high-dimensional features may be 64-dimensional features, for example.
The landmarks coordinates are the coordinates of the eye key points in the current video frame.
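As an illustration of this mapping step, here is a short sketch under stated assumptions: the patent only refers to "a receptive field calculation formula", so the simplest such formula is used, in which an image coordinate x corresponds to roughly x / S on a feature map with cumulative stride S. The value total_stride=16 and the (C, H, W) feature-map layout are assumptions, not taken from the patent.

```python
import numpy as np

def min_circumscribed_rect(eye_landmarks):
    """Minimum axis-aligned circumscribed rectangle of one eye's landmarks.

    eye_landmarks: (K, 2) array of (x, y) landmarks coordinates.
    Returns (x_min, y_min, x_max, y_max) in image coordinates.
    """
    x_min, y_min = eye_landmarks.min(axis=0)
    x_max, y_max = eye_landmarks.max(axis=0)
    return x_min, y_min, x_max, y_max

def map_rect_to_feature_layer(rect, total_stride=16):
    """Project an image-space rectangle onto the face feature representation
    layer. With cumulative stride S, an image coordinate x corresponds to
    roughly x / S on the feature map (the simplest receptive-field formula);
    total_stride=16 is an assumed value, not taken from the patent."""
    x0, y0, x1, y1 = rect
    fx0 = int(np.floor(x0 / total_stride))
    fy0 = int(np.floor(y0 / total_stride))
    fx1 = int(np.ceil(x1 / total_stride))
    fy1 = int(np.ceil(y1 / total_stride))
    return fx0, fy0, fx1, fy1

def crop_high_dim_features(feature_map, rect_on_map):
    """Slice the mapped rectangle out of a (C, H, W) feature map and flatten
    it into a high-dimensional eye region feature vector."""
    fx0, fy0, fx1, fy1 = rect_on_map
    return feature_map[:, fy0:fy1 + 1, fx0:fx1 + 1].reshape(-1)
```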
The beneficial effects of the method are as follows: the position of the eyes in the video frame can be narrowed to a smaller range according to the minimum circumscribed rectangles generated from the landmarks coordinates, and the high-dimensional eye region features obtained through the receptive field calculation formula serve as a same-dimension standard for the later training of the SVM.
In one embodiment, before invoking the pre-trained SVM classifier, the method further comprises:
detecting a second eye region feature of a preset user in a preset video;
and carrying out interpolation processing on the second eye region features to obtain features with the same dimensionality for training, obtaining the SVM classifier.
The beneficial effects of the method are as follows: the second eye region features of a preset user in a preset video are acquired and converted to the dimensionality of the high-dimensional features, so that they can be combined and trained more precisely to obtain the SVM classifier.
In one embodiment, detecting a second eye region feature of the preset user in the preset video, as shown in fig. 2, includes:
acquiring N continuous preset video frames of a preset video in each preset time period;
taking the eye region features of the N continuous preset video frames in each preset time period as second eye region features to obtain M second eye region features, where generally M = 2N, i.e., the eye region features of the left and right eyes;
Carrying out interpolation processing on the second eye region characteristics to obtain characteristics with the same dimensionality for training, and obtaining the SVM classifier, wherein the method comprises the following steps:
and sequentially carrying out interpolation processing on each second eye region feature in the M second eye region features to obtain features with the same dimensionality, and training to obtain the SVM classifier.
The beneficial effects of the method are as follows: the N continuous video frames within the preset time enable multi-frame processing. Compared with the single-frame classification training network in the prior art, putting multi-frame training into the SVM classifier requires only a small amount of data, and using the SVM classifier speeds up the overall flow of judging the eye state.
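A minimal sketch of this interpolation-and-training step follows, assuming np.interp for the resampling and scikit-learn's SVC as the SVM; the target dimension of 64 echoes the 64-dimensional example above, and the label encoding is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def interp_to_dim(feature, target_dim=64):
    """Resample a 1-D eye region feature vector to a fixed dimensionality by
    linear interpolation, so features cropped from rectangles of different
    sizes become comparable; target_dim=64 follows the 64-dimensional example."""
    src = np.linspace(0.0, 1.0, num=len(feature))
    dst = np.linspace(0.0, 1.0, num=target_dim)
    return np.interp(dst, src, feature)

def build_sample(per_frame_features, target_dim=64):
    """Interpolate each of the M second eye region features to the same
    dimension and concatenate them into one training sample (the multi-frame
    combination that is sent to the SVM)."""
    return np.concatenate([interp_to_dim(f, target_dim) for f in per_frame_features])

def train_svm(samples, labels):
    """Train the SVM classifier; labels: 1 = eyes open, 0 = eyes closed
    (an illustrative encoding, not specified by the patent)."""
    clf = SVC(kernel="linear")
    clf.fit(np.stack(samples), labels)
    return clf
```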
In one embodiment, face detection is performed on a current video frame;
the face key points are detected by the face key point network, and the first eye region features are output by combining the face key points with the face feature mapping from the middle of the network;
the input of the face key point network is the detected face; the network detects the face key point information, including the eye key points, and outputs the first eye region features by combining them with the intermediate-layer face feature mapping of the network.
Minimum circumscribed rectangles for the left eye and the right eye are generated according to the landmarks coordinates of the eyes of the face region. The circumscribed rectangles are mapped to the face feature representation layer in the network by referring to the receptive field calculation formula, and the high-dimensional first eye region features are acquired;
10 preset video frames within a preset time in a preset video are acquired, 20 second eye region features are obtained from the 10 frames, and the second eye region features are interpolated to obtain a uniform dimension;
the processed second eye region features of the continuous 10 frames are connected and combined for training, obtaining the SVM classifier;
the first eye region features extracted from 10 frames of the current user are placed into the SVM classifier for classification, and the eye state of the current user is determined according to the classification result.
The beneficial effects of the method are as follows: the SVM classifier for judging the open/closed eye state does not need a large amount of labeled data to train a classification model; the eye region features are obtained without training a dedicated network, since the region coordinates are mapped to an intermediate feature layer through the landmarks network, as in Faster R-CNN; and training needs only a small amount of data. Meanwhile, using the SVM classifier makes the whole open/closed eye judgment process fast. Under video monitoring, compared with the single-frame classification training network in the prior art, integrating multi-frame information allows open and closed eyes to be judged more stably and accurately.
In the above case, the eye region features of the left and right eyes can also be processed separately, which includes the following steps (see the sketch after this embodiment):
10 preset video frames within a preset time in a preset video are acquired, and 20 second eye region features are obtained from the 10 frames; the 10 second eye region features of the left eye and the 10 of the right eye are kept separate, and the second eye region features are interpolated and otherwise processed into a unified dimension (for example, a unified dimension such as 18 or 64 features);
the processed second eye region features of the left eye and of the right eye over the continuous 10 frames are connected and combined for training separately, obtaining the SVM classifiers;
the first eye region features of the left eye and of the right eye of the current user are placed into the respective SVM classifier for classification, so that the respective states of the current user's left and right eyes can be determined according to the classification results.
The beneficial effects of the method are as follows: with second eye region features trained separately for the left and right eyes, the states of the current user's left and right eyes can each be detected; compared with the embodiment above, in which all second eye features are combined for training and the left and right eyes are not distinguished, the detection result is more accurate.
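Under the same assumptions as the earlier training sketch (reusing its build_sample and train_svm helpers), training the left and right eyes separately is a small variation; left_eye_sequences, right_eye_sequences, and the label arrays are hypothetical placeholders for the separated per-eye training data.

```python
# Train one SVM per eye so the left-eye and right-eye states can be judged
# independently; the *_sequences and *_labels variables are hypothetical
# placeholders for the per-eye second eye region features and their labels.
left_clf = train_svm([build_sample(seq) for seq in left_eye_sequences], left_labels)
right_clf = train_svm([build_sample(seq) for seq in right_eye_sequences], right_labels)

# Classify the current user's left and right first eye region features
# (each a sequence of 10 per-frame feature vectors) separately.
left_state = left_clf.predict([build_sample(current_left_10_frames)])[0]
right_state = right_clf.predict([build_sample(current_right_10_frames)])[0]
```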
In one embodiment, face detection is performed on a current video frame;
a face neural network detects the face, a feature extraction network detects the face key points, the eye key points are acquired from the face key points, and at the same time the first eye region features are output by combining the mapping of the eye key points and the face key points;
Minimum circumscribed rectangles for the left eye and the right eye are generated according to the landmarks coordinates of the eyes of the face region. The circumscribed rectangles are mapped to the face feature representation layer in the network by referring to the receptive field calculation formula, and the high-dimensional first eye region features are acquired;
5 preset video frames within a preset time in a preset video are acquired, 10 second eye region features are obtained from the 5 frames, and the second eye region features are interpolated and otherwise processed into a unified dimension;
the processed second eye region features of the continuous 5 frames are connected and combined for training, obtaining the SVM classifier;
the first eye features of the current user are sent into the SVM classifier for classification. In the initial condition (that is, when fewer than 5 frames are available), the single frame input is copied 4 times so that it is combined into 5 frames for output; the second frame input is copied 3 times to combine into 5 frames for output; the third frame input is copied 2 times to combine into 5 frames for output; and so on. The eye region features of 5 continuous frames of the current user are acquired cyclically, and once 5 frames have been input they are placed into the SVM classifier for classification; each time, the eye region features of the 5 continuous frames are classified as a whole, so that the eye state of the current user can be determined more accurately according to the classification result.
An apparatus for detecting eye state in video, as shown in fig. 3, comprises:
the detection module is used for carrying out face detection on a current video frame of a current user to obtain a first eye region characteristic of the current video frame;
the calling module is used for calling a pre-trained SVM classifier;
and the classification module is used for sending the first eye region characteristics to an SVM classifier for classification and detecting the eye state of the current user according to the classification result.
In one embodiment, the detection module, as shown in fig. 4, includes:
the detection submodule is used for detecting key points of the human face by utilizing a neural network;
the first acquisition submodule is used for acquiring human eye key points from the human face key points;
and the output submodule is used for combining the human eye key points and the human face key points to map and output the first eye region characteristics.
In one embodiment, the detection module further includes:
the generation submodule is used for generating minimum circumscribed rectangles for the left and right eyes according to the landmarks coordinates of the eyes of the face region;
and the second acquisition submodule is used for mapping the minimum circumscribed rectangle to a representation layer of the face features in the neural network according to a receptive field calculation formula and acquiring the high-dimensional features of the eye region.
In one embodiment, the detection module is further configured to detect a second eye region feature of a preset user in a preset video before the pre-trained SVM classifier is called;
the eye state detection device further includes:
and the training submodule is used for carrying out interpolation processing on the second eye region characteristic to change the second eye region characteristic into a characteristic with the same dimensionality for training to obtain the SVM classifier.
In one embodiment, the detection module further includes:
the third acquisition submodule is used for acquiring N continuous preset video frames of the preset video in each preset time period;
taking the eye region features of N continuous preset video frames in each preset time period as a second eye region feature to obtain M second eye region features;
the training submodule is configured to:
and sequentially carrying out interpolation processing on each second eye region feature in the M second eye region features to obtain features with the same dimensionality, and training to obtain the SVM classifier.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention refer to different application stages. For example, the first eye region features are the eye features in the detection stage, and the second eye region features are those in the training stage; the eye features may be the size of the eyes, the distance between the upper and lower eyelids, and so on. The key points may be positions; for example, the face key points may be the contour of the face and the positions of the facial features, and the eye key points may be the positions of the eyes.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for detecting eye state in video is characterized by comprising the following steps:
carrying out face detection on a current video frame of a current user to obtain a first eye region characteristic of the current video frame;
calling a pre-trained SVM classifier;
and sending the first eye region feature into the SVM classifier for classification, and detecting the eye state of the current user according to the classification result.
2. The method for detecting eye state in video according to claim 1, wherein the performing face detection on the current video frame to obtain the first eye region feature of the current video frame comprises:
detecting key points of the human face by using a neural network;
acquiring human eye key points from the human face key points;
and combining the human eye key points and the human face key points to map and output the first eye region characteristics.
3. The method for detecting eye state in video according to claim 2, wherein said performing face detection on the current video frame to obtain the first eye region feature of the current video frame further comprises:
generating two minimum circumscribed rectangles of the left eye and the right eye according to the landmarks coordinates of the eyes of the face region;
and mapping the minimum circumscribed rectangle to a representation layer of the human face features in the neural network according to a receptive field calculation formula to obtain the high-dimensional features of the eye region.
4. The method of eye state detection in video of any one of claims 1 to 3, wherein prior to invoking the pre-trained SVM classifier, the method further comprises:
detecting a second eye region feature of a preset user in a preset video;
and carrying out interpolation processing on the second eye region characteristics to obtain characteristics with the same dimensionality for training, and obtaining the SVM classifier.
5. The method for detecting eye state in video according to claim 4, wherein the detecting the second eye region feature of the preset user in the preset video comprises:
acquiring N continuous preset video frames of the preset video in each preset time period;
taking the eye region features of the N continuous preset video frames in each preset time period as a second eye region feature, and obtaining M second eye region features in total;
the interpolating the second eye region feature to obtain a feature with the same dimension, and training to obtain the SVM classifier includes:
and sequentially carrying out interpolation processing on each second eye region feature in the M second eye region features to obtain features with the same dimensionality, and training to obtain the SVM classifier.
6. An apparatus for detecting an eye state in a video, comprising:
the detection module is used for carrying out face detection on a current video frame of a current user to obtain a first eye region characteristic of the current video frame;
the calling module is used for calling a pre-trained SVM classifier;
and the classification module is used for sending the first eye region characteristics into the SVM classifier for classification, and detecting the eye state of the current user according to a classification result.
7. The apparatus for detecting eye state in video according to claim 6, wherein the detecting module comprises:
the detection submodule is used for detecting key points of the human face by utilizing a neural network;
the first acquisition submodule is used for acquiring human eye key points from the human face key points;
and the output sub-module is used for combining the human eye key points and the human face key points, mapping and outputting the first eye region characteristics.
8. The apparatus for detecting eye state in video according to claim 7, wherein the detecting module further comprises:
the generation submodule is used for generating minimum circumscribed rectangles for the left and right eyes according to the landmarks coordinates of the eyes of the face region;
and the second acquisition submodule is used for mapping the minimum circumscribed rectangle to a representation layer of the face features in the neural network according to a receptive field calculation formula to acquire the high-dimensional features of the eye region.
9. The apparatus for detecting eye state in video according to any one of claims 6 to 8,
the detection module is further used for detecting a second eye region feature of a preset user in a preset video before the pre-trained SVM classifier is called;
the eye state detection device further includes:
and the training submodule is used for carrying out interpolation processing on the second eye region characteristic to obtain the characteristic with the same dimension for training to obtain the SVM classifier.
10. An apparatus for detecting eye state in video according to claim 9,
the detection module further comprises:
the third obtaining submodule is used for obtaining N continuous preset video frames of the preset video in each preset time period;
taking the eye region features of the N continuous preset video frames in each preset time period as a second eye region feature, and obtaining M second eye region features in total;
the training submodule is configured to:
and sequentially carrying out interpolation processing on each second eye region feature in the M second eye region features to obtain features with the same dimensionality, and training to obtain the SVM classifier.
CN201910883511.5A 2019-09-18 2019-09-18 Method and device for detecting eye state in video Active CN110826396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910883511.5A CN110826396B (en) 2019-09-18 2019-09-18 Method and device for detecting eye state in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910883511.5A CN110826396B (en) 2019-09-18 2019-09-18 Method and device for detecting eye state in video

Publications (2)

Publication Number Publication Date
CN110826396A (en) 2020-02-21
CN110826396B CN110826396B (en) 2022-04-22

Family

ID=69547999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910883511.5A Active CN110826396B (en) 2019-09-18 2019-09-18 Method and device for detecting eye state in video

Country Status (1)

Country Link
CN (1) CN110826396B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198332A (en) * 2012-12-14 2013-07-10 华南理工大学 Real-time robust far infrared vehicle-mounted pedestrian detection method
CN107704805A (en) * 2017-09-01 2018-02-16 深圳市爱培科技术股份有限公司 method for detecting fatigue driving, drive recorder and storage device
CN107992864A (en) * 2018-01-15 2018-05-04 武汉神目信息技术有限公司 A kind of vivo identification method and device based on image texture
CN108446661A (en) * 2018-04-01 2018-08-24 桂林电子科技大学 A kind of deep learning parallelization face identification method
CN108960071A (en) * 2018-06-06 2018-12-07 武汉幻视智能科技有限公司 A kind of eye opening closed-eye state detection method
CN109460704A (en) * 2018-09-18 2019-03-12 厦门瑞为信息技术有限公司 A kind of fatigue detection method based on deep learning, system and computer equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076885A (en) * 2021-04-09 2021-07-06 中山大学 Concentration degree grading method and system based on human eye action characteristics
CN113076885B (en) * 2021-04-09 2023-11-10 中山大学 Concentration degree grading method and system based on human eye action characteristics
CN115953389A (en) * 2023-02-24 2023-04-11 广州视景医疗软件有限公司 Strabismus discrimination method and device based on face key point detection
CN115953389B (en) * 2023-02-24 2023-11-24 广州视景医疗软件有限公司 Strabismus judging method and device based on face key point detection

Also Published As

Publication number Publication date
CN110826396B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN105354986B (en) Driver's driving condition supervision system and method
US20230116801A1 (en) Image authenticity detection method and device, computer device, and storage medium
CN108446645B (en) Vehicle-mounted face recognition method based on deep learning
CN109815867A (en) A kind of crowd density estimation and people flow rate statistical method
CN111709497B (en) Information processing method and device and computer readable storage medium
CN110021051A (en) One kind passing through text Conrad object image generation method based on confrontation network is generated
Ibrahim et al. Embedded system for eye blink detection using machine learning technique
Wimmer et al. Low-level fusion of audio and video feature for multi-modal emotion recognition
CN106022317A (en) Face identification method and apparatus
Chen et al. Driver fatigue detection based on facial key points and LSTM
CN111209878A (en) Cross-age face recognition method and device
CN111950497B (en) AI face-changing video detection method based on multitask learning model
CN110826396B (en) Method and device for detecting eye state in video
Ashwin et al. An e-learning system with multifacial emotion recognition using supervised machine learning
Liu et al. A 3 GAN: an attribute-aware attentive generative adversarial network for face aging
CN112036276A (en) Artificial intelligent video question-answering method
CN112906617A (en) Driver abnormal behavior identification method and system based on hand detection
CN114708658A (en) Online learning concentration degree identification method
Sinha et al. Identity-preserving realistic talking face generation
Li et al. Image manipulation localization using attentional cross-domain CNN features
CN106326980A (en) Robot and method for simulating human facial movements by robot
CN111626197B (en) Recognition method based on human behavior recognition network model
Ren et al. Student behavior detection based on YOLOv4-Bi
Liu et al. A3GAN: An attribute-aware attentive generative adversarial network for face aging
Guo et al. Design of a smart art classroom system based on Internet of Things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant