CN110852214A - Lightweight face recognition method for edge computing - Google Patents

Lightweight face recognition method for edge computing

Info

Publication number
CN110852214A
CN110852214A (application CN201911043719.2A)
Authority
CN
China
Prior art keywords
layer
pooling layer
dense block
AntCNN
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911043719.2A
Other languages
Chinese (zh)
Inventor
龚征
杨顺志
叶开
魏运根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN201911043719.2A
Publication of CN110852214A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a lightweight face recognition method for edge computing, comprising the following steps: S1, constructing a lightweight face recognition network model, AntCNN, oriented to edge computing devices, whose network structure comprises a first convolutional layer, a first pooling layer, a first dense block, a second pooling layer, a second dense block, a third pooling layer, a third dense block and a fourth pooling layer; S2, capturing a face image, compressing it to a small pixel size as the input of AntCNN, and using AntCNN for feature extraction and classification; and S3, passing the resulting multi-dimensional feature map through a fully connected layer to obtain a specific score for each category, where the maximum score gives the classification of the picture. The method uses the dlib library from traditional machine learning to locate the face region; it runs successfully on Raspberry Pi edge computing devices and locates faces in video very smoothly, fully meeting real-time requirements.

Description

Lightweight face recognition method for edge computing
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a lightweight face recognition method for edge computing.
Background
Deep learning is robust to diverse variations in a target. Running deep learning network models directly on edge computing devices is therefore considered the most promising approach and has seen widespread research and application. However, deep learning is computationally intensive, while the computing power and storage space of edge computing devices are limited. This means that a deep learning network model designed for edge computing must consider accuracy while also paying attention to the computation and parameters the network requires. To make deep learning network models runnable on edge devices, lightweight network models such as MobileNet and ShuffleNet have been proposed. However, these networks were developed as general-purpose models, mainly for multi-object recognition, so their input pictures are large; under otherwise equal conditions this demands more computation (FLOPs) and more parameters. Object recognition must both locate and classify a target, whereas object classification only classifies a target whose position is already known. In some specific cases, only classification is needed. The common lightweight network models typically take 224 × 224 inputs containing multiple objects and their background; applying these networks to pure classification therefore wastes resources and performs poorly.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a lightweight face recognition method oriented to edge computing, so that the accuracy of face recognition is greatly improved.
To achieve this purpose, the invention adopts the following technical scheme:
The invention provides a lightweight face recognition method oriented to edge computing, comprising the following steps:
S1, constructing a lightweight face recognition network model, AntCNN, oriented to edge computing devices, the network structure of AntCNN comprising: a first convolutional layer, a first pooling layer, a first dense block, a second pooling layer, a second dense block, a third pooling layer, a third dense block and a fourth pooling layer;
S2, capturing a face image, compressing it to a small pixel size as the input of AntCNN, and performing feature extraction and classification with AntCNN as follows:
S21, extracting the bottom-layer features of the input image through the first convolutional layer;
S22, halving the length and width of the feature map using the first pooling layer;
S23, increasing the feature dimension of the map output by the first pooling layer by 32 using the first dense block;
S24, halving the length and width of the map output by the first dense block using the second pooling layer;
S25, increasing the feature dimension of the map output by the second pooling layer by 32 using the second dense block;
S26, halving the length and width of the map output by the second dense block using the third pooling layer;
S27, increasing the feature dimension of the map output by the third pooling layer by 56 using the third dense block;
S28, reducing the length and width of the map output by the third dense block to one fifth using the fourth pooling layer;
S29, obtaining a multidimensional feature map whose length and width are 1;
and S3, passing the obtained multi-dimensional feature map through a fully connected layer to obtain a specific score for each category, the maximum score giving the classification of the picture.
As a preferred technical solution, in step S2, the dlib library is used to capture the face image; the captured face image is compressed to a uniform size of 44 × 44 pixels, giving network model input dimensions of (44, 44, 3), where 44 is the length and width of the picture and 3 indicates that the picture is in color.
As a preferred technical solution, in step S1, the first convolutional layer is a 3 × 3 convolutional layer with pad = 1 and bias = True;
the first pooling layer is a 3 × 3 max pooling layer with stride = 2;
the first dense block stacks 4 dense layers (composition as in Table 2; original formula image not reproduced);
the second pooling layer is a 2 × 2 average pooling layer with stride = 2;
the second dense block stacks 4 dense layers (composition as in Table 2; original formula image not reproduced);
the third pooling layer is a 2 × 2 average pooling layer with stride = 2;
the third dense block stacks 7 dense layers (composition as in Table 2; original formula image not reproduced);
the fourth pooling layer is a 5 × 5 global average pooling layer.
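For concreteness, here is a minimal PyTorch sketch of the layout just described. It is an illustration, not the patented implementation: the dense blocks are plain-convolution stand-ins so the shapes line up (their real composition follows Table 2 and is sketched later in the description), the padding on the first max pooling layer is an assumption needed to reproduce the stated 44 → 22 reduction, and the class count of 7 (FER-2013's emotion categories) is likewise assumed.

```python
import torch
import torch.nn as nn

class DenseBlockStub(nn.Module):
    """Stand-in for a dense block: a plain 3x3 conv that only maps
    in_ch -> out_ch so the shape trace matches the description."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=True)

    def forward(self, x):
        return self.conv(x)

class AntCNNSketch(nn.Module):
    def __init__(self, num_classes=7):  # 7 classes is an assumption (FER-2013)
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1, bias=True)  # (44, 44, 32)
        self.pool1 = nn.MaxPool2d(3, stride=2, padding=1)       # (22, 22, 32)
        self.dense1 = DenseBlockStub(32, 64)                    # +32 dims
        self.pool2 = nn.AvgPool2d(2, stride=2)                  # (11, 11, 64)
        self.dense2 = DenseBlockStub(64, 96)                    # +32 dims
        self.pool3 = nn.AvgPool2d(2, stride=2)                  # (5, 5, 96)
        self.dense3 = DenseBlockStub(96, 152)                   # +56 dims
        self.pool4 = nn.AvgPool2d(5)                            # (1, 1, 152)
        self.fc = nn.Linear(152, num_classes)

    def forward(self, x):
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.dense1(x))
        x = self.pool3(self.dense2(x))
        x = self.pool4(self.dense3(x))
        return self.fc(torch.flatten(x, 1))

logits = AntCNNSketch()(torch.randn(1, 3, 44, 44))  # -> shape (1, 7)
```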
As a preferable technical solution, the bottom-layer features of the picture are acquired with a single 3 × 3 convolutional layer; the dimensions after bottom-layer feature extraction are (44, 44, 32), where 44 is the length and width of the network and 32 denotes the 32-dimensional features.
Preferably, the first dense block, the second dense block and the third dense block each employ two consecutive 3 × 3 learnable group convolutions.
Preferably, a dense block is composed, in order, of: a 3 × 3 learnable group convolutional layer with stride 1, outputting a feature map of 4 times the growth rate; a batch normalization layer; an activation layer (ReLU); a 3 × 3 learnable group convolutional layer with stride 1, outputting a feature map of 1 times the growth rate; and a batch normalization layer.
As an optimal technical scheme, AntCNN runs successfully on a Raspberry Pi 3B+ at 0.87 FPS; its accuracy on the FER-2013 and RAF-DB emotion classification datasets is higher than that of other popular lightweight feature extraction networks, with 0.4 MB of parameters and a computation cost of 2.7 MFLOPs.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention adopts the principle of ants carrying food and splits object recognition into two independent parts: object localization and object classification. Localization uses a traditional machine learning method, while classification uses deep learning. Traditional machine learning is far less computationally complex than deep learning, but it cannot be used for the whole recognition pipeline because it is not robust to the diversity of target variation. It is, however, excellent at finding the position of a target: little computation and high accuracy. The method therefore uses the dlib library from traditional machine learning to locate the face; it runs successfully on Raspberry Pi edge computing devices and locates faces in video very smoothly, fully meeting real-time requirements.
2. The object classification part can only use deep learning methods. Previous networks were large (224 × 224 input) and had to cover much information about objects and background. Since the target position is already known via the dlib library, only classification remains, so the network input can be set to 44 × 44; as long as the network is designed properly, AntCNN's classification accuracy does not suffer from the smaller input. In addition, since the learned features are relatively large, the fully connected layer only needs a few feature dimensions. This not only reduces complexity but also lets the network focus on learning relatively large features.
3. The AntCNN network design makes full use of the principle of feature reuse to build a lightweight network model, and a new dense block is additionally designed. The invention fully demonstrates the efficiency of this approach on the face recognition task.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic illustration of the fully connected layer in the deep learning convolutional network of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
In the edge-computing-oriented lightweight face recognition method of the present embodiment, a lightweight convolutional neural network classification model with small-size input, named AntCNN, is designed for edge computing. As shown in FIG. 1, the method of the invention comprises the following steps:
S1, constructing a lightweight face recognition network model, AntCNN, oriented to edge computing devices, the network structure of AntCNN comprising: a first convolutional layer, a first pooling layer, a first dense block, a second pooling layer, a second dense block, a third pooling layer, a third dense block and a fourth pooling layer;
S2, capturing a face image and compressing it to a small pixel size as the input of AntCNN;
the present embodiment first captures a face image using the dlib library, which captures a face at a random time of 0.6 to 3 seconds. Capturing a face from a video is very fluid and does not jam. In the present embodiment, the face image captured by the dlib library is compressed to a uniform size of 44 × 44 pixels. This is because the dlib face detector can identify faces very accurately. Furthermore, the dlib library consumes less computing power than the deep convolutional network model. The face image of 44 × 44 pixels can be sufficiently classified. If the input size of the network is reduced, the learning characteristics will be coarse relative to networks with large inputs.
Further, the last layer of current deep learning classification or recognition networks is a fully connected layer. Taking the task of recognizing whether the target in a picture is a "person" as an example, as shown in FIG. 2, "head", "body", etc. represent the feature maps learned by the last layer. Where the last layer of AntCNN learns relatively large feature maps such as head and body, the last layer of a 224 × 224 network learns finer features such as the ears and eyes within the head. Clearly, large 224 × 224 networks are designed for large-size inputs and require much computation, so they perform poorly when given small-size inputs.
Although the features learned by the small-input AntCNN are coarse compared with a large network, the recognition target, whether this is a person, is given comprehensively by learning the similarity of each feature. This means that as long as the network is designed properly, AntCNN's recognition accuracy does not suffer because the network input size is small.
After the small input face picture is obtained, the CNN model is used for feature extraction and classification. This is equivalent to converting a multi-object recognition problem into a single-object classification problem, which reduces the network's input size, its parameters and the computing power required. The network model input dimensions are (44, 44, 3), where 44 is the length and width of the picture and 3 indicates that the picture is in color. The steps are as follows:
s21, extracting the bottom layer characteristics of the input image, and acquiring the bottom layer characteristics of the image through the first convolution layer;
Where AntCNN starts, at convolutional layer 1 in Table 1, the bottom-layer features of the input picture are extracted. This part of the network carries much detailed information, and as many features as possible should be extracted, because if the early layers learn too few features it becomes difficult to extract efficient, high-level features later. The kernel certainly should not be as large as possible, which would waste memory and time: a larger convolution kernel means a larger receptive field and can learn more sufficient information, but more parameters come with it. DenseNet uses a 7 × 7 convolutional layer here, which is very expensive. The present invention instead uses a single 3 × 3 convolutional layer to acquire the picture's bottom-layer features, which is sufficient for the 44 × 44 input used here. The dimensions after extracting the bottom-layer features are (44, 44, 32), where 44 is the length and width of the network and 32 denotes the 32-dimensional features.
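As a back-of-the-envelope illustration (assumed layer dimensions, not figures from the patent), the saving from a single 3 × 3 first convolution over DenseNet's 7 × 7 is easy to check:

```python
in_ch, out_ch = 3, 32               # color input -> 32 bottom-layer features
for k in (3, 7):
    weights = k * k * in_ch * out_ch
    print(f"{k}x{k} first conv: {weights + out_ch} parameters")
# 3x3 -> 896 parameters; 7x7 -> 4736 parameters (about 5.3x more)
```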
Table 1: Network architecture of AntCNN (table image not reproduced)
S3, pooling layers;
The role of a pooling layer is to reduce the model's computation and to suppress noise in the features; the invention uses four pooling layers in total. The first, applied after the bottom-layer feature extraction, is a 3 × 3 max pooling layer with stride 2; the feature dimensions after processing are (22, 22, 32), i.e., the length and width of the feature map are halved, with 32 the feature dimension.
The second, after the first dense block, is a 2 × 2 average pooling layer with stride 2; the dimensions after processing are (11, 11, 64), i.e., the length and width are halved again, with 64 the feature dimension.
The third, after the second dense block, is a 2 × 2 average pooling layer with stride 2; the dimensions after processing are (5, 5, 96), i.e., the length and width are halved again, with 96 the feature dimension.
The fourth, after the third dense block, is a 5 × 5 global average pooling layer; the dimensions after processing are (1, 1, 152), i.e., the length and width are reduced to one fifth (from 5 to 1), with 152 the feature dimension.
S4, dense blocks;
The dense block of this embodiment uses the conventional post-activation order, i.e. a convolutional layer, then a normalization layer, and finally an activation layer, as shown in Table 2. The activation layer is removed from convolution block 2 of each dense block, mainly to prevent the non-linearity from corrupting the final feature information.
The learnable group convolutions in Tables 1 and 2 are denoted L-Conv, where the parameter groups is the number of groups and condense_factor is the condensation factor. The condensation factor means that each group connects to only a 1/condense_factor fraction of the input feature channels (the original formula image is not reproduced).
As shown in Table 2, the present invention employs two consecutive 3 × 3 learnable group convolutions in each dense block, because the larger receptive field can learn richer features. In addition, the 1 × 1 convolutional layer is eliminated, because it would add extra memory consumption and did not perform well in the experiments here. All convolutional layers used by the invention, including the learnable group convolutional layers, are set to bias = True, which gives the network extra flexibility to better fit the data. The activation layer is removed from the second convolution block of each dense block; this prevents the non-linearity from corrupting too much information, and it also reduces element-level operations.
The learnable group convolutional layer in convolution block 1 of a dense block outputs a feature map of 4 times the growth rate, as shown in Table 2, whereas the layer in convolution block 2 outputs a feature map of only 1 times the growth rate. This means that each pass through a dense layer grows the network's feature map by one growth rate. AntCNN uses a growth rate of 8: although only 8 feature dimensions are added per dense layer, after many passes the last layer of AntCNN reaches a 152-dimensional feature map.
In summary, as shown in Table 2, the sequence of a dense layer is: a 3 × 3 learnable group convolutional layer with stride 1 (outputting a feature map of 4 times the growth rate), a batch normalization layer, an activation layer (ReLU), a 3 × 3 learnable group convolutional layer with stride 1 (outputting a feature map of 1 times the growth rate), and a batch normalization layer. The growth rate is 8.
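As a concrete rendering of this sequence, the PyTorch sketch below builds one dense layer under stated assumptions: CondenseNet's learnable group convolution is approximated by an ordinary grouped 3 × 3 convolution (the learned condensation/pruning is omitted), and groups = 4 is an illustrative value rather than a figure from the patent.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_ch, growth=8, groups=4):
        super().__init__()
        self.body = nn.Sequential(
            # conv block 1: outputs 4x the growth rate
            nn.Conv2d(in_ch, 4 * growth, 3, padding=1, groups=groups, bias=True),
            nn.BatchNorm2d(4 * growth),
            nn.ReLU(inplace=True),
            # conv block 2: outputs 1x the growth rate; the ReLU is omitted
            # here, per the text, to avoid destroying feature information
            nn.Conv2d(4 * growth, growth, 3, padding=1, groups=groups, bias=True),
            nn.BatchNorm2d(growth),
        )

    def forward(self, x):
        # dense connectivity: the new features are concatenated onto the input
        return torch.cat([x, self.body(x)], dim=1)

# a dense block is a stack of such layers, e.g. the first block (32 -> 64):
block1 = nn.Sequential(*[DenseLayer(32 + i * 8) for i in range(4)])
out = block1(torch.randn(1, 32, 22, 22))  # -> shape (1, 64, 22, 22)
```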
In the invention, a total of three dense blocks are used. After the first pooling layer, the input is (22, 22, 32), where 32 is the feature dimension and 22 the length and width of the map. After 4 dense layers, the output of the network is (22, 22, 64), where 64 = 32 + 4 × 8.
After the second pooling layer, the input is (11, 11, 64), where 64 is the feature dimension and 11 the length and width. After 4 dense layers, the output is (11, 11, 96), where 96 = 64 + 4 × 8.
After the third pooling layer, the input is (5, 5, 96), where 96 is the feature dimension and 5 the length and width. After 7 dense layers, the output is (5, 5, 152), where 152 = 96 + 7 × 8.
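These counts all follow a single bookkeeping rule, restated here with $c_{\text{in}}$ the block's input dimension, $n$ the number of dense layers in the block, and $k = 8$ the growth rate:

$$c_{\text{out}} = c_{\text{in}} + nk, \qquad 32 + 4 \cdot 8 = 64, \quad 64 + 4 \cdot 8 = 96, \quad 96 + 7 \cdot 8 = 152.$$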
Table 2: Details of the dense block (L-Conv denotes learnable group convolution, groups is the number of groups, condense_factor is the condensation factor; table image not reproduced)
S5, a classification layer;
the complete network structure of AntCNN is shown in Table 1, and the obtained network size (1, 152) is 152-dimensional feature map 152. The 152-dimensional feature map gets the specific scores of each category through the full connection layer, and the maximum score represents the specific classification of the picture.
The invention runs AntCNN successfully on a Raspberry Pi 3B+ at 0.87 FPS (frames per second). Its accuracy on the FER-2013 [1] and RAF-DB [2] emotion classification datasets is higher than that of other popular lightweight feature extraction networks, with 0.4 MB of parameters and a computation cost of 2.7 MFLOPs.
Table 3 compares the performance of AntCNN with other lightweight network models on a Raspberry Pi 3B+. The speed and memory-consumption entries for IGCV1 and Pelee are Null, indicating that those models are too large to run on the Raspberry Pi 3B+. The AntCNN model of the invention is therefore advantageous in accuracy, parameter count and computation. Although computation cost is not synonymous with speed, the amount of computation a model requires is particularly important when the edge computing device needs to perform multiple tasks.
Table 3: Comparison with other lightweight networks on a Raspberry Pi 3B+ (table image not reproduced)
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A lightweight face recognition method oriented to edge computing, characterized by comprising the following steps:
S1, constructing a lightweight face recognition network model, AntCNN, oriented to edge computing devices, the network structure of AntCNN comprising: a first convolutional layer, a first pooling layer, a first dense block, a second pooling layer, a second dense block, a third pooling layer, a third dense block and a fourth pooling layer;
S2, capturing a face image, compressing it to a small pixel size as the input of AntCNN, and performing feature extraction and classification with AntCNN as follows:
S21, extracting the bottom-layer features of the input image through the first convolutional layer;
S22, halving the length and width of the feature map using the first pooling layer;
S23, increasing the feature dimension of the map output by the first pooling layer by 32 using the first dense block;
S24, halving the length and width of the map output by the first dense block using the second pooling layer;
S25, increasing the feature dimension of the map output by the second pooling layer by 32 using the second dense block;
S26, halving the length and width of the map output by the second dense block using the third pooling layer;
S27, increasing the feature dimension of the map output by the third pooling layer by 56 using the third dense block;
S28, reducing the length and width of the map output by the third dense block to one fifth using the fourth pooling layer;
S29, obtaining a multidimensional feature map whose length and width are 1;
and S3, passing the obtained multi-dimensional feature map through a fully connected layer to obtain a specific score for each category, the maximum score giving the classification of the picture.
2. The edge-computing-oriented lightweight face recognition method according to claim 1, wherein in step S2, the dlib library is used to capture the face image; the captured face image is compressed to a uniform size of 44 × 44 pixels, giving network model input dimensions of (44, 44, 3), where 44 is the length and width of the picture and 3 indicates that the picture is in color.
3. The edge-computing-oriented lightweight face recognition method according to claim 1, wherein in step S1, the first convolutional layer is a 3 × 3 convolutional layer with pad = 1 and bias = True;
the first pooling layer is a 3 × 3 max pooling layer with stride = 2;
the first dense block stacks 4 dense layers (composition as in Table 2; original formula image not reproduced);
the second pooling layer is a 2 × 2 average pooling layer with stride = 2;
the second dense block stacks 4 dense layers (composition as in Table 2; original formula image not reproduced);
the third pooling layer is a 2 × 2 average pooling layer with stride = 2;
the third dense block stacks 7 dense layers (composition as in Table 2; original formula image not reproduced);
the fourth pooling layer is a 5 × 5 global average pooling layer.
4. The edge-computing-oriented lightweight face recognition method according to claim 3, wherein a single 3 × 3 convolutional layer is adopted to acquire the bottom-layer features of the picture; the dimensions after bottom-layer feature extraction are (44, 44, 32), where 44 is the length and width of the network and 32 denotes the 32-dimensional features.
5. The edge-computing-oriented lightweight face recognition method according to claim 3, wherein the first dense block, the second dense block and the third dense block each employ two consecutive 3 × 3 learnable group convolutions.
6. The edge-computing-oriented lightweight face recognition method according to claim 1, wherein a dense block is composed, in order, of: a 3 × 3 learnable group convolutional layer with stride 1, outputting a feature map of 4 times the growth rate; a batch normalization layer; an activation layer (ReLU); a 3 × 3 learnable group convolutional layer with stride 1, outputting a feature map of 1 times the growth rate; and a batch normalization layer.
7. The edge-computing-oriented lightweight face recognition method according to claim 1, wherein AntCNN runs successfully on a Raspberry Pi 3B+ at 0.87 FPS; its accuracy on the FER-2013 and RAF-DB emotion classification datasets is higher than that of other popular lightweight feature extraction networks, with 0.4 MB of parameters and a computation cost of 2.7 MFLOPs.
CN201911043719.2A 2019-10-30 2019-10-30 Lightweight face recognition method for edge computing Pending CN110852214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911043719.2A CN110852214A (en) 2019-10-30 2019-10-30 Lightweight face recognition method for edge computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911043719.2A CN110852214A (en) 2019-10-30 2019-10-30 Lightweight face recognition method for edge computing

Publications (1)

Publication Number Publication Date
CN110852214A true CN110852214A (en) 2020-02-28

Family

ID=69598804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911043719.2A Pending CN110852214A (en) 2019-10-30 2019-10-30 Lightweight face recognition method for edge computing

Country Status (1)

Country Link
CN (1) CN110852214A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465792A (en) * 2020-12-04 2021-03-09 北京华捷艾米科技有限公司 Human face quality evaluation method and related device
CN117456287A (en) * 2023-12-22 2024-01-26 天科院环境科技发展(天津)有限公司 Method for observing population number of wild animals by using remote sensing image
CN117558050A (en) * 2023-11-17 2024-02-13 西安理工大学 Edge computing end-oriented real-time facial expression recognition method and human-computer interaction system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830211A (en) * 2018-06-11 2018-11-16 厦门中控智慧信息技术有限公司 Face identification method and Related product based on deep learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830211A (en) * 2018-06-11 2018-11-16 厦门中控智慧信息技术有限公司 Face identification method and Related product based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAO HUANG et al.: "CondenseNet: An Efficient DenseNet using Learned Group Convolutions", CVF Conference on Computer Vision and Pattern Recognition *
康一帅: "Research on Image Recognition Algorithms Based on Convolutional Neural Networks", China Master's Theses Full-text Database (Master), Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465792A (en) * 2020-12-04 2021-03-09 北京华捷艾米科技有限公司 Human face quality evaluation method and related device
CN117558050A (en) * 2023-11-17 2024-02-13 西安理工大学 Edge computing end-oriented real-time facial expression recognition method and human-computer interaction system
CN117456287A (en) * 2023-12-22 2024-01-26 天科院环境科技发展(天津)有限公司 Method for observing population number of wild animals by using remote sensing image
CN117456287B (en) * 2023-12-22 2024-03-12 天科院环境科技发展(天津)有限公司 Method for observing population number of wild animals by using remote sensing image

Similar Documents

Publication Publication Date Title
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN108647694B (en) Context-aware and adaptive response-based related filtering target tracking method
US20170169315A1 (en) Deeply learned convolutional neural networks (cnns) for object localization and classification
CN107967484B (en) Image classification method based on multi-resolution
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
Xia et al. Loop closure detection for visual SLAM using PCANet features
WO2016159199A1 (en) Method for re-identification of objects
CN110852214A (en) Light-weight face recognition method facing edge calculation
CN109214353B (en) Training method and device for rapid detection of face image based on pruning model
CN111401106B (en) Behavior identification method, device and equipment
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
Yang et al. Facial expression recognition based on dual-feature fusion and improved random forest classifier
Lu et al. Learning transform-aware attentive network for object tracking
CN112364931A (en) Low-sample target detection method based on meta-feature and weight adjustment and network model
CN113034545A (en) Vehicle tracking method based on CenterNet multi-target tracking algorithm
CN111640138B (en) Target tracking method, device, equipment and storage medium
WO2023142602A1 (en) Image processing method and apparatus, and computer-readable storage medium
WO2022127814A1 (en) Method and apparatus for detecting salient object in image, and device and storage medium
WO2023124278A1 (en) Image processing model training method and apparatus, and image classification method and apparatus
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN113297959A (en) Target tracking method and system based on corner attention twin network
CN115482523A (en) Small object target detection method and system of lightweight multi-scale attention mechanism
CN115223219A (en) Goat face identification method based on improved YOLOV4
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200228