CN108960076B - Ear recognition and tracking method based on convolutional neural network - Google Patents

Ear recognition and tracking method based on convolutional neural network

Info

Publication number
CN108960076B
CN108960076B CN201810586771.1A
Authority
CN
China
Prior art keywords
ear
network
neural network
data set
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810586771.1A
Other languages
Chinese (zh)
Other versions
CN108960076A (en)
Inventor
林云智
王雁刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201810586771.1A priority Critical patent/CN108960076B/en
Publication of CN108960076A publication Critical patent/CN108960076A/en
Application granted granted Critical
Publication of CN108960076B publication Critical patent/CN108960076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

The invention discloses an ear recognition and tracking method based on a convolutional neural network, comprising the following steps: building a first-layer convolutional neural network on an existing face data set with face-box labels, and detecting the person's head in an image to obtain a face image containing the ear region; building a second-layer convolutional neural network on an ear data set with ear-box labels, and training it to detect the ear region in the image output by step 1; and building a third-layer neural network on an ear data set with ear feature-point labels, and training it to automatically label the ear feature points in the image output by step 2. The invention adopts a three-layer cascade structure, which effectively solves the detection and feature-point labeling problems given that existing ear data sets are relatively small. The multi-layer design also markedly compresses the network size: the parameter count is relatively small, the demand on GPU memory during training is modest, training converges more easily, and performance is better under complex conditions.

Description

Ear recognition and tracking method based on convolutional neural network
Technical Field
The invention belongs to the technical fields of computer vision and image processing, relates to object detection and feature-point localization technology, and in particular to an ear recognition and tracking method based on a convolutional neural network.
Background
Ear recognition and modeling have a very important impact on the realistic rendering of virtual objects. In the field of biometrics, automatic identification from ear images is an active research area. The ability to capture an ear image covertly from a distance makes this technology an attractive option for surveillance and security applications, among others. Compared with traditional biometric schemes such as fingerprint, face and iris recognition, the ear has unique advantages: it has a stable, rich structure that changes little with age and is unaffected by facial expressions. Recent studies have even verified empirically that certain ear characteristics differ even between monozygotic (identical) twins.
Ear images can therefore supplement other biometric modalities in an automatic identification system and provide an identity cue when other information is unreliable or unavailable. For example, in surveillance applications, the ear can serve as a source of identity information when the camera captures only a side view of the face and face recognition is therefore unreliable.
The basis of ear recognition is ear detection and ear feature selection. In recent years, significant effort in this field has produced a number of algorithms based on locally encoded features, and a number of data sets for training and testing the technology are publicly available, but some unsolved research issues still hinder its wider commercial application.
First, robust ear detection is the cornerstone of the overall recognition system. Many techniques exist for automatically extracting ears from 2D face images. Most of them, however, do not perform well when the test images are taken under uncontrolled conditions. Moreover, occlusion and illumination variation are common in practical applications, which poses a challenging problem in urgent need of a solution.
Meanwhile, existing ear recognition technology mainly focuses on extracting and analyzing geometric features of the ear image, and is limited by its dependence on edge detectors, which are sensitive to illumination change and noise. Traditional work on ear feature-point localization positions only a few feature points as recognition aids and does not meet the requirement of multi-point localization, while many techniques also require precise ear feature-point locations. Developing an automatic feature-labeling method has therefore become urgent.
A convolutional neural network (CNN) is a feed-forward neural network whose artificial neurons respond to stimuli within a local receptive field, and it performs well on large-scale image processing. Incorporating CNNs into the ear recognition and tracking problem raises several key technical issues:
The first issue is the collection and curation of data samples. Over the last decade, deep learning algorithms have greatly advanced computer vision: performance on visual tasks such as image classification, face recognition and object detection has improved significantly. Such systems are data-driven, highly robust, respond well to many kinds of challenge, and do not depend on hand-crafted features. The invention therefore adopts a deep learning method to recognize and track ears. Traditional AdaBoost or SVM methods have low requirements on sample volume, which existing data sets can satisfy, but deep learning places high demands on both the quantity and the accuracy of training samples. At the same time, the ear must remain detectable and labelable under multi-angle conditions, which further raises the requirements on data acquisition.
The selection of ear feature points also strongly influences the final recognition result. The number of selected points is often excessive, reaching 48 or even 55; although this theoretically improves the subsequent recognition rate, it places unreasonably high demands on the labeling workload and on the real-time performance of data processing. It is therefore crucial to choose an appropriate number of points, and their specific positions, that serve the recognition goal while reducing the manual labeling workload and relaxing the real-time processing requirement.
One difficulty of CNN-based ear recognition and tracking is how to extract the ear with high precision while overcoming adverse effects such as illumination, occlusion and noise. In images with complex backgrounds the ear region is often highly variable, with large changes in scale and angle; naively applying a CNN to ear extraction easily causes misidentification when the sample size is insufficient and sample selection is limited. It is precisely these disturbances that have kept ear detection algorithms from reaching a practically usable level. A second difficulty is how to design a reliable network structure that can label ear feature points at low resolution: the ear is very small relative to the face, and its features are densely concentrated in a small area, making labeling extremely difficult.
Disclosure of Invention
To solve these problems, the invention discloses an ear recognition and tracking method based on a convolutional neural network (CNN). Through data augmentation, a three-layer cascaded deep convolutional network, a pyramid-style sliding-window network and related methods, it accurately detects ears and localizes their feature points in video, provides higher accuracy for human-ear detection and feature-point labeling, and is more robust in complex environments.
Based on the technical key points, the invention provides the following technical scheme:
the ear recognition and tracking method based on the convolutional neural network comprises the following steps:
step 1, building a first layer of convolutional neural network aiming at an existing face data set and a face frame label, and detecting the head of a person in an image to obtain a face image containing an ear region;
step 2, building a second laminated neural network for the ear data set and the ear labeling frame label, and detecting the ear region in the output image in the step 1 through training;
and 3, building a third layer neural network aiming at the ear data set and the ear characteristic point label, and automatically labeling the ear characteristic points in the output image in the step 2 through training.
Further, step 2 comprises a training part and a detection part.
The training part first acquires data and expands the data set, trains the network with the expanded data set, and obtains the network weights in combination with a bounding-box regression step.
The detection part generates a deployment network, reads the network weights obtained by the training part, and then detects the ear region in the image output by step 1.
Further, the process of expanding the data set includes: obtaining multiple positive samples from each original picture by four methods (translation, rotation, cropping and scaling); obtaining multiple negative samples by randomly cropping pictures of different sizes from a certain area around the ear; and doubling the data by horizontal flipping.
A fixed length-to-width ratio is used when samples are obtained during data-set expansion.
Further, the bounding-box regression step specifically comprises: performing linear regression on the difference between the coordinates predicted by the network and the ground-truth annotation, so as to fine-tune the obtained box coordinates and make them more accurate.
Further, the detection part specifically comprises the following sub-steps:
picture preprocessing: scaling the original picture;
pyramid model: downsampling the preprocessed picture through the pyramid model to obtain 9 pictures of different sizes, so that the network can detect pictures of different sizes;
heat-map generation: passing the 9 pictures generated by the pyramid model through the sliding-window network in turn to obtain the corresponding heat maps, and mapping the heat maps back onto the original image by coordinate-scale conversion;
non-maximum suppression: putting all the detection boxes obtained from the 9 pictures together, searching for local maxima of the detection-box scores within each local area using a non-maximum suppression algorithm, and deleting detection boxes whose scores fall below a threshold.
Further, step 3 comprises the following sub-steps:
expanding the data set: obtaining multiple samples from each original picture by three methods (horizontal flipping, contrast modification, and rotation by small positive and negative angles), and producing an HDF5 multi-label file after cropping and scaling;
detection with the third-layer network architecture: adopting a network structure in which convolutional and pooling layers alternate, with a fully connected layer producing the final output.
Further, the third-layer network architecture adopts the ReLU activation function and uses a dropout layer to randomly discard weights with a certain probability.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention adopts a three-layer cascade structure, and can effectively solve the problems of detection and feature point marking under the condition that the existing ear data set is relatively small through a mature face detection network of a first layer, an ear detection network of a second layer and an ear feature point marking network of a third layer. And the multi-layer network can obviously compress the size of the network, the parameter quantity of the network structure is relatively small, the requirement on the video memory in the training stage is not high, and the training is easier to converge. Because a data-driven deep learning network is adopted without depending on the traditional contour detection or local feature coding technology, the method has better performance under the complex conditions of multiple angles, multiple scales, shielding and the like.
Drawings
Fig. 1 is a flowchart of an ear recognition and tracking method based on a convolutional neural network according to the present invention.
Fig. 2 is a schematic diagram of a calibration frame for face detection modification in the present invention.
Fig. 3 is a schematic diagram of a deep learning network structure for ear detection according to the present invention.
FIG. 4 is an expanded view of the sample in the ear test of the present invention.
Fig. 5 is a schematic diagram of a deep learning network structure for ear feature point labeling according to the present invention.
Fig. 6 is a diagram of ear feature point labeling effect.
Detailed Description
The technical solutions provided by the invention are described in detail below with reference to specific examples. It should be understood that the following embodiments are only illustrative and do not limit the scope of the invention.
The invention provides a three-layer network structure; the complete processing pipeline is shown in Fig. 1. It is divided into three stages. The first stage performs face detection and corrects the face box so that it includes the ear region. The second stage performs ear detection on the local head region from the first stage to obtain a more accurate ear position. The third stage labels the ear feature points within the ear region. The three stages comprise several supervised learning processes. Specifically, the ear recognition and tracking method based on a convolutional neural network comprises the following steps:
firstly, adopting a first-stage convolution neural network to detect the head of a person
We use a sophisticated face detection network to obtain the face and expand the final face calibration box coordinates by modification to include the ear region, as shown in fig. 2.
The invention firstly uses the improved human face detection deep learning network to obtain the head area containing the ear area. This significantly reduces the workload of the next local detection.
Second, build a second convolutional neural network from the existing ear image data set and the ear bounding-box labels to extract the ear region from the output image of the first step.
Dual-task architecture: the second-stage network adopts a multitask design, mainly comprising an ear classifier and ear candidate-box coordinate regression (as shown in Fig. 3).
This step includes a training part and a detection part.
The training part covers data acquisition and expansion. Because existing ear data sets are small, data augmentation is needed to enlarge the sample set. First, positive and negative samples for the ear classifier are obtained: we use four methods (translation, rotation, cropping and scaling) to obtain 30 positive samples from each original picture (as shown in Fig. 4), and obtain 60 negative samples by randomly cropping pictures of different sizes from a certain area around the ear. Meanwhile, so that the training samples suit the sliding-window network, a fixed aspect ratio is used when obtaining samples (determined from the data-set statistics as the median ear-box aspect ratio, 0.512). For the ear candidate-box coordinate data set, we additionally expand the data by horizontal flipping, yielding twice as many samples.
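The augmentation described above can be sketched as follows, assuming row-major nested-list images and (x, y, w, h) boxes; the helper names `jitter_box`, `hflip` and `scale_box`, and the shift fraction, are illustrative choices, not taken from the patent:

```python
import random

def jitter_box(box, img_w, img_h, max_shift=0.1):
    """Translate an (x, y, w, h) box by a random fraction of its size,
    clamped to the image bounds - one of the four positive-sample
    transforms (translation, rotation, cropping, scaling)."""
    x, y, w, h = box
    dx = int(w * random.uniform(-max_shift, max_shift))
    dy = int(h * random.uniform(-max_shift, max_shift))
    x = min(max(x + dx, 0), img_w - w)
    y = min(max(y + dy, 0), img_h - h)
    return (x, y, w, h)

def hflip(image):
    """Horizontal flip of a row-major nested-list image (doubles the set)."""
    return [list(reversed(row)) for row in image]

def scale_box(box, factor, img_w, img_h):
    """Scale a box about its centre while preserving the fixed aspect
    ratio required by the sliding-window network."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    w2, h2 = int(w * factor), int(h * factor)  # ratio preserved
    x2 = int(min(max(cx - w2 / 2, 0), img_w - w2))
    y2 = int(min(max(cy - h2 / 2, 0), img_h - h2))
    return (x2, y2, w2, h2)
```

In practice each original crop would be passed through several such transforms to produce the 30 positive samples per picture mentioned above.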
Bounding-box regression: linear regression is performed on the difference between the coordinates predicted by the network and the ground-truth annotation, fine-tuning the obtained box coordinates to make them more accurate.
The training part yields the weights of the network.
The detection part generates a deployment network and reads the network weights obtained by the training part. It specifically comprises the following steps:
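The box-regression idea can be sketched as below. The patent states only that the predicted-versus-ground-truth difference is fitted by linear regression, so the centre/log-scale parameterisation used here (the common R-CNN-style choice) is an assumption. Boxes are (cx, cy, w, h):

```python
import math

def regression_targets(pred, gt):
    """Offsets a linear regressor is trained to predict so that a
    coarse detected box can be nudged onto the ground-truth box."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    return ((gx - px) / pw,        # box-relative centre shift (x)
            (gy - py) / ph,        # box-relative centre shift (y)
            math.log(gw / pw),     # width correction
            math.log(gh / ph))     # height correction

def apply_offsets(pred, t):
    """Inverse transform: refine a predicted box with regressed offsets."""
    px, py, pw, ph = pred
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

At inference time the network outputs the four offsets directly and `apply_offsets` produces the fine-tuned box.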
preprocessing the picture: the picture mean is subtracted from each pixel of the original picture and scaled to [0,1 ].
The pyramid model is as follows: the pyramid model is used for sampling the image to be detected (namely, the preprocessed image) downwards to obtain 9 groups of images with different sizes, so that the network can be suitable for detecting the images with different sizes.
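The pyramid step can be sketched as repeated downsampling. The per-level scale factor below (about 2^(-1/3)) and the nearest-neighbour resampling are assumptions; the patent fixes only the count of 9 levels:

```python
def pyramid_scales(n=9, factor=0.7937):
    """Geometric scale sequence; 0.7937 ~ 2**(-1/3) is an assumed step."""
    return [factor ** i for i in range(n)]

def downsample(image, scale):
    """Nearest-neighbour downsampling of a nested-list image - a
    stand-in for the pyramid resize, so that a fixed-size
    sliding-window net sees ears at many effective sizes."""
    h, w = len(image), len(image[0])
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    return [[image[int(r / scale)][int(c / scale)] for c in range(nw)]
            for r in range(nh)]

def build_pyramid(image, n=9, factor=0.7937):
    """The n differently sized copies fed to the sliding-window net."""
    return [downsample(image, s) for s in pyramid_scales(n, factor)]
```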
Generating a thermodynamic diagram: and (3) sequentially passing 9 pictures generated by the pyramid model through a sliding window network to obtain a corresponding thermodynamic diagram, and mapping the thermodynamic diagram to an original image through coordinate proportion transformation.
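Mapping a heat map back onto the original image amounts to inverting the window stride and the pyramid scale. The window size, stride and threshold below are illustrative, as the patent does not specify them:

```python
def heatmap_to_boxes(heatmap, scale, window=24, stride=2, thr=0.6):
    """Convert sliding-window scores into original-image boxes.
    Each heat-map cell (r, c) corresponds to a window whose top-left
    corner in the *scaled* image is (c*stride, r*stride); dividing by
    the pyramid scale recovers original-image coordinates."""
    boxes, scores = [], []
    for r, row in enumerate(heatmap):
        for c, s in enumerate(row):
            if s >= thr:
                x1, y1 = c * stride / scale, r * stride / scale
                boxes.append((x1, y1,
                              x1 + window / scale, y1 + window / scale))
                scores.append(s)
    return boxes, scores
```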
Non-maxima suppression algorithm: all the test frames obtained from the 9 pictures (obtained in the step of generating the thermodynamic diagram) are put together. A non-maximum suppression algorithm (NMS) is adopted to search local maximum values of the scores of detection frames in a local area, a certain threshold value is set, and the detection frames below the threshold value score are deleted.
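A minimal sketch of the NMS step on (x1, y1, x2, y2) boxes; the IoU and score thresholds are illustrative values:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, score_thr=0.5, iou_thr=0.3):
    """Drop boxes under the score threshold, then greedily keep local
    maxima, suppressing any remaining box that overlaps a kept one."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```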
In this step, ear classification information and the key-point positions of the ear bounding box are integrated in a unified network architecture to obtain the ear region. By fusing these two kinds of information within one stage of the deep convolutional network, supervised learning over richer ear information is achieved. We found that fusing the coordinate information improves ear-region detection accuracy. Compared with traditional methods that depend on the ear contour structure, the deep convolutional network adapts to different ear shapes and is more robust.
A third-layer deep convolutional network is then built to complete feature-point labeling. To address the sparsity of data samples and the concentration of feature points in a small area, the data are expanded with traditional image-processing methods such as cropping, rotation and deformation, and a shallower network structure is designed so that training converges while overfitting on the small sample set is prevented. Specifically:
and thirdly, building a third layer neural network aiming at the ear data set and the ear characteristic point label, and realizing automatic marking of the ear characteristic point in the output image in the step 2 through training.
Data acquisition and expansion: firstly, because the existing ear labeling data set is small, the data expansion method is adopted to increase the data set samples. We adopt 3 methods of horizontal turning, contrast modification and rotation of-5 ° - +5 ° to obtain 8 samples from each original picture, and for the convenience of network input, we adopt 1: 1 aspect ratio (as shown). Finally we scaled it to 96 × 96 size to make hdf5 multi-label file.
Network architecture: a network structure with a convolutional layer and a pooling layer alternated is adopted, and finally, a result is output by a full connection layer. To better converge the network, we use the Relu activation function in the middle of the network and use the dropout layer to randomly drop weights with 50% probability. The overall architecture is shown in fig. 5. The final labeling of the ear feature points through the above three steps is shown in fig. 6.
The technical means disclosed in the invention are not limited to those of the above embodiments, and also include technical solutions formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered within the scope of the invention.

Claims (4)

1. An ear recognition and tracking method based on a convolutional neural network, characterized by comprising the following steps:
step 1, building a first-layer convolutional neural network on an existing face data set with face-box labels, and detecting the person's head in an image to obtain a face image containing the ear region;
step 2, building a second-layer convolutional neural network on the ear data set with ear-box labels, and through training detecting the ear region in the image output by step 1;
the second-layer convolutional neural network adopts a multitask design, comprising an ear classifier and ear candidate-box coordinate regression;
it comprises a training part and a detection part;
the training part first acquires data and expands the data set, trains the network with the expanded data set, and obtains the network weights in combination with a bounding-box regression step;
the detection part generates a deployment network and, after reading the network weights obtained by the training part, detects the ear region in the image output by step 1; the detection part specifically comprises the following sub-steps:
picture preprocessing: scaling the original picture;
pyramid model: downsampling the preprocessed picture through the pyramid model to obtain 9 pictures of different sizes, so that the network can detect pictures of different sizes;
heat-map generation: passing the 9 pictures generated by the pyramid model through the sliding-window network in turn to obtain the corresponding heat maps, and mapping the heat maps back onto the original image by coordinate-scale conversion;
non-maximum suppression: putting all the detection boxes obtained from the 9 pictures together, searching for local maxima of the detection-box scores within each local area using a non-maximum suppression algorithm, and deleting detection boxes whose scores fall below a threshold;
the bounding-box regression step specifically comprises: performing linear regression on the difference between the coordinates predicted by the network and the ground-truth annotation, fine-tuning the obtained box coordinates to make them more accurate;
the second-layer convolutional neural network integrates ear classification information and the key-point positions of the ear bounding box to obtain the ear region;
step 3, building a third-layer neural network on the ear data set with ear feature-point labels, and through training automatically labeling the ear feature points in the image output by step 2;
step 3 comprises the following sub-steps:
expanding the data set: obtaining multiple samples from each original picture by three methods (horizontal flipping, contrast modification, and rotation by small positive and negative angles), and producing an HDF5 multi-label file after cropping and scaling;
detection with the third-layer network architecture: adopting a network structure in which convolutional and pooling layers alternate, with a fully connected layer producing the final output.
2. The convolutional-neural-network-based ear recognition and tracking method of claim 1, wherein the process of expanding the data set comprises: obtaining multiple positive samples from each original picture by four methods (translation, rotation, cropping and scaling); obtaining multiple negative samples by randomly cropping pictures of different sizes from a certain area around the ear; and doubling the data by horizontal flipping.
3. The convolutional-neural-network-based ear recognition and tracking method of claim 2, wherein a fixed length-to-width ratio is used when obtaining samples during expansion of the data set.
4. The convolutional-neural-network-based ear recognition and tracking method of claim 1, wherein the third-layer network architecture adopts a ReLU activation function and uses a dropout layer to randomly drop weights with a certain probability.
CN201810586771.1A 2018-06-08 2018-06-08 Ear recognition and tracking method based on convolutional neural network Active CN108960076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810586771.1A CN108960076B (en) 2018-06-08 2018-06-08 Ear recognition and tracking method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810586771.1A CN108960076B (en) 2018-06-08 2018-06-08 Ear recognition and tracking method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN108960076A CN108960076A (en) 2018-12-07
CN108960076B true CN108960076B (en) 2022-07-12

Family

ID=64493464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810586771.1A Active CN108960076B (en) 2018-06-08 2018-06-08 Ear recognition and tracking method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN108960076B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858435B (en) * 2019-01-29 2020-12-01 四川大学 Small panda individual identification method based on face image
CN110110702A (en) * 2019-05-20 2019-08-09 哈尔滨理工大学 It is a kind of that algorithm is evaded based on the unmanned plane for improving ssd target detection network
CN111062248A (en) * 2019-11-08 2020-04-24 宇龙计算机通信科技(深圳)有限公司 Image detection method, device, electronic equipment and medium
CN111260608A (en) * 2020-01-08 2020-06-09 来康科技有限责任公司 Tongue region detection method and system based on deep learning
CN111401211B (en) * 2020-03-11 2023-01-06 山东大学 Iris identification method adopting image augmentation and small sample learning
CN112580462A (en) * 2020-12-11 2021-03-30 深圳市豪恩声学股份有限公司 Feature point selection method, terminal and storage medium
CN113887428B (en) * 2021-09-30 2022-04-19 西安工业大学 Deep learning paired model human ear detection method based on context information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551853A (en) * 2008-11-14 2009-10-07 重庆大学 Human ear detection method under complex static color background
CN101673340A (en) * 2009-08-13 2010-03-17 重庆大学 Method for identifying human ear by colligating multi-direction and multi-dimension and BP neural network
CN107316007A (en) * 2017-06-07 2017-11-03 浙江捷尚视觉科技股份有限公司 A kind of monitoring image multiclass object detection and recognition methods based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN107748858A (en) * 2017-06-15 2018-03-02 华南理工大学 A kind of multi-pose eye locating method based on concatenated convolutional neutral net

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551853A (en) * 2008-11-14 2009-10-07 重庆大学 Human ear detection method under complex static color background
CN101673340A (en) * 2009-08-13 2010-03-17 重庆大学 Method for identifying human ear by colligating multi-direction and multi-dimension and BP neural network
CN107316007A (en) * 2017-06-07 2017-11-03 浙江捷尚视觉科技股份有限公司 A kind of monitoring image multiclass object detection and recognition methods based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research on Human Ear Recognition Based on Convolutional Neural Networks"; Hu Ying; Journal of North University of China (Natural Science Edition); Dec. 31, 2015; Vol. 36, No. 5; pp. 597-601 *

Also Published As

Publication number Publication date
CN108960076A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108960076B (en) Ear recognition and tracking method based on convolutional neural network
Rahmad et al. Comparison of Viola-Jones Haar Cascade classifier and histogram of oriented gradients (HOG) for face detection
CN109344693B (en) Deep learning-based face multi-region fusion expression recognition method
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
CN111401257B (en) Face recognition method based on cosine loss under non-constraint condition
Zhan et al. Face detection using representation learning
CN108520216B (en) Gait image-based identity recognition method
CN107392182B (en) Face acquisition and recognition method and device based on deep learning
Wu et al. A detection system for human abnormal behavior
CN109800643B (en) Identity recognition method for living human face in multiple angles
CN107808376B (en) Hand raising detection method based on deep learning
CN109063626B (en) Dynamic face recognition method and device
KR102132407B1 (en) Method and apparatus for estimating human emotion based on adaptive image recognition using incremental deep learning
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN115880784A (en) Scenic spot multi-person action behavior monitoring method based on artificial intelligence
CN111639577A (en) Method for detecting human faces of multiple persons and recognizing expressions of multiple persons through monitoring video
Waheed et al. A novel deep learning model for understanding two-person interactions using depth sensors
CN108898623A (en) Method for tracking target and equipment
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
Ardiansyah et al. Systematic literature review: American sign language translator
CN108985216B (en) Pedestrian head detection method based on multivariate logistic regression feature fusion
Jindal et al. Sign Language Detection using Convolutional Neural Network (CNN)
CN110766093A (en) Video target re-identification method based on multi-frame feature fusion
Curran et al. The use of neural networks in real-time face detection
Bai et al. Exploration of computer vision and image processing technology based on OpenCV

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant