CN116704577A - Face recognition and clustering method and system - Google Patents
Face recognition and clustering method and system
- Publication number
- CN116704577A (application number CN202310676420.0A)
- Authority
- CN
- China
- Prior art keywords
- face
- clustering
- model
- unknown
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
- G06V10/7635—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks based on graphs, e.g. graph cuts or spectral clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a face recognition and clustering method and system. The method comprises the following steps: inputting a picture to be detected into a face detection model and outputting a rectangular detection frame for each face position in the picture; cropping the rectangular detection frame, enlarging and correcting the face image, inputting it into a face recognition model, and outputting the corresponding feature vector; computing the similarity between this feature vector and the feature vectors of all faces with known identities, and, if every similarity is smaller than a set threshold, marking the face as an unknown face; inputting unknown faces into a face clustering model and aggregating those that potentially share the same identity. The face clustering model is an incremental clustering model based on graph connections. The advantages of the application are: the graph-connection-based incremental clustering method reduces the computational cost of updating clustering results, and multithreaded acceleration combined with batch clustering achieves a balance between accuracy and speed.
Description
Technical Field
The application belongs to the technical fields of artificial intelligence, computer vision, face detection, face recognition, face clustering and incremental clustering, and particularly relates to a face recognition and clustering method and system.
Background
The closest prior art to the present application includes face detection, face recognition, face clustering, and the like. Face detection and recognition are mainly based on deep neural networks. The face detection model, like a general object detection model, is trained on a large-scale face detection dataset; it receives a picture as input and outputs rectangular detection frames for the face positions in the picture. Representative methods include models such as SCRFD, which mainly adopt training techniques such as model architecture search and augmentation of small-scale face data to improve model efficiency and robustness.
Taking the cropped, enlarged, and corrected face image as input, the face recognition model outputs the corresponding feature vector. Similarity is then computed between this feature vector and those of all faces with known identities: if the maximum similarity exceeds a certain threshold, the face is considered to belong to the identity with the highest similarity; otherwise it is an unknown face. CNN models such as ResNet-100 are generally combined with training methods such as Partial-FC, which accelerate training on large-scale face recognition data mainly through model-parallel training and partial-class approximation.
In open-source face analysis libraries represented by InsightFace, face detection and recognition are constructed as two cascaded stages: in the first stage, the face detection result is used to crop, enlarge, and correct the face image; in the second stage, the face image is input into the face recognition model to obtain its features. However, such open-source libraries do not further integrate other modules or methods such as face clustering.
Similar to face recognition, given feature vectors already determined to be unknown faces, face clustering methods group vectors that may belong to the same person into the same cluster, typically using rule-based clustering. Known-face recognition must compute the similarity between the face to be recognized and all faces with known identities in order to determine identity, so there is little room for optimization. Unknown-face clustering, however, only needs to determine which cluster each face to be clustered belongs to; similarity need not be computed against every face, which leaves room for optimization.
In offline clustering methods, pairwise similarity is computed among all unknown face feature vectors; whenever a new face appears, the existing clustering result is discarded and clustering is performed again. K-Means is representative of this approach, and once unknown faces accumulate to a certain scale its computational cost becomes excessive. Online clustering methods, by contrast, keep the previous clustering result and compute the similarity between a new face and the members of each cluster to determine its cluster, so existing cluster members need not be recomputed.
However, this approach fails in a practical case: when an unknown person has few faces early on and they are absorbed into another cluster, the later growth of that person's face data cannot split them off into an independent cluster. Moreover, comparisons with all members of a cluster (or a random sample of K members) are still needed, so if a cluster grows too large the computational overhead remains high.
Online clustering based on graph connections introduces the concepts of large clusters and small clusters: the large cluster a face belongs to is updated through the connections between small clusters, which serve as graph nodes, and only the small-cluster center feature vectors are updated, reducing comparisons with individual cluster members. However, this clustering method has not been applied directly to the face clustering task, nor has it been integrated with face detection and recognition models into a unified face recognition and clustering system capable of processing video data.
Open-source network videos from different scenes contain a large number of faces, with both known and unknown identities, which gives rise to the following problems:
(1) Face angles and sizes vary widely, making it difficult to recognize all faces in a uniform way;
(2) As data accumulates, the number of unknown faces keeps growing. Existing open-source libraries do not incorporate a clustering method that groups unknown face feature vectors possibly belonging to the same identity, confirms the cluster's identity, and expands the known-face database; because new data continuously arrives, the computational cost of updating the clustering result each time becomes very large;
(3) Efficient optimization for video processing is lacking: because of data compression formats such as MP4, each frame is accumulated step by step from a key frame and differential frames, and to detect as many faces as possible, frames must be extracted at fixed time intervals; the prior art also does not optimize for hardware characteristics, so computational efficiency is low.
Disclosure of Invention
The application aims to overcome the defects of the prior art, namely the very high computational cost of face clustering and the lack of optimization for video processing during computation.
In order to achieve the above object, the present application provides a face recognition and clustering method, which includes:
using a face detection model, inputting a picture to be detected, and outputting a rectangular detection frame of the face position in the picture to be detected;
cropping the rectangular detection frame, enlarging and correcting the face image, inputting it into a face recognition model, and outputting the corresponding feature vector; performing similarity calculation between this feature vector and those of all faces with known identities, and, if every similarity is smaller than a set threshold, the face belongs to the unknown faces;
inputting unknown faces into a face clustering model and aggregating those unknown faces that potentially share the same identity;
the face detection model and the face recognition model are trained neural network models;
the face clustering model is an incremental clustering model based on graph connection.
As an improvement of the method, the incremental clustering model based on graph connection is realized by the following steps:
step 1: inputting an unknown face;
step 2: if the number of current small clusters is less than 1, create a new small cluster and take the feature vector of the unknown face as the small-cluster center;
if the number of current small clusters is greater than or equal to 1, calculate the similarity between the unknown face and all small-cluster centers; if every similarity is below a set threshold, create a new small cluster and take the feature vector of the unknown face as the small-cluster center;
if the similarity between the unknown face and the center of one small cluster is above the set threshold and is greater than its similarity to all other cluster centers, add the unknown face to that small cluster;
step 3: recompute the mean of the feature vectors of all faces in the updated small cluster and use it as the updated cluster center;
step 4: compute the pairwise similarity between small clusters; add a connection between two small clusters when their similarity is greater than a set threshold, and remove the connection when it is smaller than the threshold;
step 5: construct a graph with the small clusters as nodes and the connections between them as edges, and compute the connectivity between nodes. If any two nodes of the current graph are reachable from each other through connections within the graph, there is only one connected subgraph, namely the current graph itself; otherwise the graph splits into several connected subgraphs such that no node in one subgraph can reach any node in another. The small clusters within each connected subgraph form a large cluster and belong to the same face identity.
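The five steps above can be sketched in Python. This is a minimal illustration under assumed thresholds and cosine similarity, not the patented implementation itself; `GraphIncrementalClusterer` and its parameter values are hypothetical names:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class GraphIncrementalClusterer:
    """Minimal sketch of graph-connection incremental clustering."""

    def __init__(self, assign_thresh=0.5, link_thresh=0.6):
        self.assign_thresh = assign_thresh  # step 2 threshold (assumed value)
        self.link_thresh = link_thresh      # step 4 threshold (assumed value)
        self.members = []   # per small cluster: list of member feature vectors
        self.centers = []   # per small cluster: mean feature vector

    def add_face(self, feat):
        # Step 2: assign to the most similar small-cluster center, or create one.
        sims = [cosine_sim(feat, c) for c in self.centers]
        if not sims or max(sims) < self.assign_thresh:
            self.members.append([feat])
            self.centers.append(feat.copy())
        else:
            i = int(np.argmax(sims))
            self.members[i].append(feat)
            # Step 3: the small-cluster center is the mean of its members.
            self.centers[i] = np.mean(self.members[i], axis=0)

    def large_clusters(self):
        # Step 4: connect small clusters whose centers are similar enough.
        n = len(self.centers)
        adj = [[] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if cosine_sim(self.centers[i], self.centers[j]) > self.link_thresh:
                    adj[i].append(j)
                    adj[j].append(i)
        # Step 5: connected subgraphs (found by DFS) are the large clusters.
        seen, groups = set(), []
        for s in range(n):
            if s in seen:
                continue
            stack, comp = [s], []
            seen.add(s)
            while stack:
                u = stack.pop()
                comp.append(u)
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
            groups.append(comp)
        return groups
```

Each new face costs at most one similarity computation per small-cluster center (step 2), and the large clusters are recovered on demand as connected subgraphs (step 5).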
As an improvement of the above method, the calculation method of the connected graph is a depth-first search or a breadth-first search.
As an improvement of the method, the picture to be detected is a picture extracted from a video.
As an improvement of the above method, the specific implementation process of the face recognition and clustering method is as follows:
establishing a frame extraction thread, a target detection thread, and a data queue;
in the frame extraction thread, extracting pictures from the video using a CPU and putting them into the data queue;
in the target detection thread, the face detection model, the face recognition model, and the face clustering model acquire pictures from the data queue using a picture processing chip and perform detection, recognition, and clustering.
As an improvement of the above method, there are one or more frame extraction threads and one or more target detection threads.
As an improvement of the method, the picture processing chip is a GPU, NPU or TPU chip.
As an improvement of the method, multiple unknown faces are formed into a batch, the feature vectors of the unknown face images are computed in batches and cached, and when the cache is full or a set time is reached, face clustering is performed on all cached feature vectors.
The application also provides a face recognition and clustering system, which is realized based on the method, and comprises the following steps:
the human face detection module is used for inputting a picture to be detected by using a human face detection model and outputting a rectangular detection frame of the human face position in the picture to be detected; the face detection model is a trained neural network model;
the face recognition module is used for cropping the rectangular detection frame, enlarging and correcting the face image, inputting it into a face recognition model, and outputting the corresponding feature vector; similarity is calculated between this feature vector and those of all faces with known identities, and, if every similarity is smaller than a set threshold, the face belongs to the unknown faces; the face recognition model is a trained neural network model; and
the face clustering module is used for inputting unknown faces into a face clustering model and aggregating those unknown faces that potentially share the same identity; the face clustering model is an incremental clustering model based on graph connections.
As an improvement of the above system, the system further comprises:
and the picture acquisition module is used for extracting the picture to be detected from the video.
Compared with the prior art, the application has the advantages that:
1. A graph-connection-based incremental clustering method is provided: a small-cluster center updating mechanism reduces the computational cost of updating clustering results, and graph connections between small clusters are added or removed so that the large clusters they belong to are dynamically updated, reducing the clustering computation;
2. Based on the characteristics of video data compression formats and of CPU and GPU hardware, multithreaded and batch optimizations for video are provided, increasing the prediction speed;
3. The face detection model, the face recognition model, and the face clustering method are integrated into a unified multitask model; the graph-connection-based incremental clustering method reduces the computational cost of updating clustering results, and, targeting the characteristics of video data, multithreaded acceleration and batch clustering achieve a balance between accuracy and speed.
Drawings
FIG. 1 is a diagram of an overall system architecture of a face recognition and clustering method;
FIG. 2 is a schematic diagram of a multitasking model of a face recognition and clustering method;
FIG. 3 is a flow chart of incremental clustering based on graph connections;
FIG. 4 is a schematic diagram of multi-threaded optimization for video;
FIG. 5 is a schematic diagram of batch optimization for video.
Detailed Description
The technical scheme of the application is described in detail below with reference to the accompanying drawings.
The face recognition and clustering method and system provided by the application integrate a face detection model, a face recognition model, and a face clustering method into a unified multitask model, and reduce the computational cost of updating clustering results by using a graph-connection-based incremental clustering method; targeting the characteristics of video data, multithreaded acceleration and batch clustering are used to balance accuracy and speed.
1. The general system architecture of the technical proposal of the application
As shown in fig. 1, the overall system architecture of the present application includes three modules: (1) the multi-task model integrates a face detection model, a face recognition model and a face clustering method and is used for two tasks of known face recognition and unknown face clustering; (2) the incremental clustering method based on graph connection is used for reducing the cost of updating the clustering result and ensuring that the clustering clusters are continuously updated; (3) optimization for video, including multithreading and batch optimization.
2. Multitasking model
Besides the two main modules of the face detection model and the face recognition model, the multitask model integrates a face clustering method, which aggregates those unknown faces, as determined by the face recognition model, that potentially share the same identity.
The human face detection model is similar to the general target detection model, and is trained by a large-scale human face detection data set, receives picture input and outputs a rectangular detection frame of the human face position in the picture.
Taking the cropped, enlarged, and corrected face image as input, the face recognition model outputs the corresponding feature vector. Similarity is then computed between this feature vector and those of all faces with known identities: if the maximum similarity exceeds a certain threshold, the face is considered to belong to the identity with the highest similarity; otherwise it is an unknown face.
As shown in FIG. 2, the three tasks of detection, recognition, and clustering are connected in sequence, with the division between known and unknown faces in between. An unknown face is one whose feature vector, when compared against the face feature vectors in all known-face databases, yields similarities that are all below the threshold. Meanwhile, each of the three tasks can independently output intermediate results for users.
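The known/unknown decision described here can be sketched as follows. This is an illustrative fragment, not the patented implementation; `identify`, the gallery layout, and the threshold value are hypothetical:

```python
import numpy as np

def identify(feat, gallery, thresh=0.5):
    """Return the identity of the most similar known face, or None when the
    face is unknown (every similarity below the threshold). `gallery` maps
    identity names to reference feature vectors."""
    feat = feat / np.linalg.norm(feat)
    best_name, best_sim = None, -1.0
    for name, ref in gallery.items():
        sim = float(np.dot(feat, ref / np.linalg.norm(ref)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= thresh else None
```

Faces for which `identify` returns `None` are the ones handed to the clustering model.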
The face detection model and the face recognition model may be neural network models such as ViT or CNN.
3. Incremental clustering based on graph connection
In practical application scenarios, as new video data continuously arrives, the number of unknown faces obtained after face detection and recognition also keeps growing. As the set of unknown faces expands, the computational cost of each update of the clustering result must be weighed.
If an offline clustering method is used directly, the entire clustering result must be recomputed, and similarity must be computed against all members of a cluster once a single cluster grows large, at huge cost. With online clustering, once a face's cluster is determined it cannot be changed: if an unknown person has few faces early on and they are absorbed into another cluster, the later growth of that person's face data cannot split them off into an independent cluster.
Therefore, the application proposes an incremental clustering method based on graph connections, which extends the concept of a cluster into small clusters and large clusters. The cluster formed by all faces belonging to the same identity is denoted a large cluster; clusters formed by subsets of those faces, which arise because of differences in posture, appearance, and the like, are denoted small clusters. One large cluster can comprise one or more small clusters.
As shown in fig. 3, the clustering method comprises the following steps:
step 1: inputting a new unknown face;
step 2: if the number of current small clusters is less than 1, directly create a new small cluster and take the feature vector of the face as the small-cluster center;
if the number of current small clusters is greater than or equal to 1, decide whether a new small cluster must be created by calculating the similarity between the feature vector of the current face and each small-cluster center, so that the number of similarity computations equals the number of clusters rather than the number of all members of all clusters;
if the similarity between the unknown face and every small-cluster center is below the threshold, directly create a new small cluster and take the feature vector of the face as the small-cluster center;
if the similarity between the unknown face and the center of some small cluster is above the threshold and is greater than its similarity to all other cluster centers, add the face to that small cluster;
step 3: recompute the mean of the feature vectors of all faces in the updated small cluster and use it as the updated cluster center;
step 4: compute the pairwise similarity between small clusters; add a connection between two small clusters when their similarity is greater than a set threshold, and remove the connection when it is smaller than the threshold;
step 5: construct a graph with the small clusters as nodes and the connections between them as edges, and compute the connectivity between nodes (whether nodes are mutually reachable). If any two nodes of the current graph are reachable from each other through connections within the graph, there is only one connected subgraph, namely the current graph itself; otherwise the graph splits into several connected subgraphs such that no node in one subgraph can reach any node in another. The small clusters within each connected subgraph form a large cluster and belong to the same face identity.
There are various classical algorithms for finding the connected subgraphs of the current graph, including depth-first search, breadth-first search, and the like.
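As a sketch, a breadth-first search over the small-cluster graph recovers the connected subgraphs; `connected_components` and the adjacency-list representation are illustrative choices, not part of the patent:

```python
from collections import deque

def connected_components(n, edges):
    """BFS over n nodes; returns one list of node indices per connected subgraph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:          # undirected connections between small clusters
        adj[u].append(v)
        adj[v].append(u)
    seen, comps = [False] * n, []
    for s in range(n):
        if seen[s]:
            continue
        q, comp = deque([s]), []
        seen[s] = True
        while q:
            u = q.popleft()
            comp.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    q.append(v)
        comps.append(comp)
    return comps
```

Each returned component corresponds to one large cluster, i.e., one face identity.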
For the similarity calculation between a feature vector and each small-cluster center, and for the pairwise similarity between small-cluster centers, cosine similarity, Euclidean distance, or Minkowski distance is adopted.
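The three measures named above can be written out directly; this is a plain NumPy sketch with illustrative function names:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # L2 distance; smaller means more similar.
    return float(np.linalg.norm(a - b))

def minkowski_distance(a, b, p=3):
    # Generalizes Euclidean (p=2) and Manhattan (p=1) distance.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))
```

Note that cosine similarity grows with similarity while the two distances shrink, so the thresholds in steps 2 and 4 must be interpreted accordingly.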
It can be seen that, after the concepts of large and small clusters are introduced, graph-connection-based incremental clustering allows large clusters to be updated in units of small clusters; the amount of computation is clearly smaller than in the common offline or online clustering methods described earlier, achieving an efficient balance between accuracy and speed.
4. Optimization for video
Unlike an ordinary single-frame image, video has a special compressed storage format, such as the common MP4 format. This format does not keep complete pictures for all frames; only a few key frames are stored in full, and the remaining frames are stored as differential frames relative to the nearest preceding key frame. To obtain the complete picture of such a frame, the differential frames must be accumulated repeatedly, starting from that key frame, up to the frame in question.
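The accumulation of differential frames onto a key frame can be sketched as follows. Real codecs use motion-compensated prediction rather than simple addition, so this additive model and the function name are simplifying assumptions:

```python
import numpy as np

def reconstruct_frame(key_frame, diffs):
    """Rebuild the picture of a frame by accumulating the differential frames
    that follow the most recent preceding key frame (simplified additive model)."""
    frame = key_frame.astype(np.int32)   # widen to avoid uint8 overflow
    for d in diffs:
        frame += d                       # apply each differential frame in order
    return frame.clip(0, 255).astype(np.uint8)
```

The cost of this serial accumulation is why frame extraction occupies the CPU for a noticeable amount of time.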
Meanwhile, for face detection and recognition in video, as many faces as possible should be detected, so extracting only key frames is not enough: frames must be extracted densely, at fixed time intervals, as input images. The serial frame extraction described above is therefore unavoidable and takes a certain amount of time. At the hardware level, the CPU is mainly responsible for video frame extraction and the GPU for face detection and recognition; the two kinds of hardware generally do not interfere with each other when processing data.
Accordingly, the application proposes a multithreaded optimization method based on the characteristics of video data and of CPU and GPU hardware. As shown in fig. 4, a frame extraction thread and a detection-recognition thread (i.e., the main thread) are set up; because the two threads produce and consume data at different speeds, a data queue between the threads is established as a buffer for input and output data.
As can be seen from the figure, target detection on the GPU proceeds in parallel during each interval in which frames are being extracted on the CPU. In addition, if the frame extraction thread has finished but pictures remain in the queue, the main thread continues target detection until the queue is empty. Compared with waiting for all frames to be extracted before starting target detection, the total time needed to output results is greatly shortened.
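This producer-consumer arrangement can be sketched with Python's `threading` and `queue` modules. The frame source and the detector below are hypothetical stand-ins for the real frame-extraction and GPU inference code:

```python
import queue
import threading

def run_pipeline(num_frames, num_extractors=1, num_detectors=1):
    """Frame-extraction threads feed a bounded shared queue; detection
    threads drain it concurrently."""
    frames = queue.Queue(maxsize=8)      # buffer between CPU and GPU work
    frame_ids = iter(range(num_frames))
    lock = threading.Lock()
    results = []

    def extract():                       # stands in for CPU video decoding
        while True:
            with lock:
                fid = next(frame_ids, None)
            if fid is None:
                break
            frames.put(("frame", fid))

    def detect():                        # stands in for GPU detection/recognition
        while True:
            item = frames.get()
            if item is None:             # sentinel: no more frames
                break
            results.append(item[1])

    extractors = [threading.Thread(target=extract) for _ in range(num_extractors)]
    detectors = [threading.Thread(target=detect) for _ in range(num_detectors)]
    for t in extractors + detectors:
        t.start()
    for t in extractors:
        t.join()
    for _ in detectors:                  # one sentinel per detector thread
        frames.put(None)
    for t in detectors:
        t.join()
    return sorted(results)
```

Detection keeps running whenever frames are available, so the two kinds of work overlap instead of running serially.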
The GPU may also be replaced by other non-CPU hardware that accelerates neural network computation, such as dedicated chips like an NPU on mobile devices or a TPU on servers.
For simplicity, the figure shows one main thread and one frame extraction thread; in engineering practice, the number of each may be greater than one (with only one queue, shared by all threads) to further increase the overall prediction speed on video data.
If clustering is initiated repeatedly during the face detection and recognition of each video, thread tasks switch frequently, affecting the input-output communication efficiency between the pieces of hardware. As shown in fig. 5, the application therefore proposes batch optimization for video: all unknown faces are collected and then computed in parallel batches, for example computing in one batch the similarity between all faces to be clustered and all small-cluster centers. The unknown-face feature vectors of each batch are cached to the hard disk; when the cache is full or a timer fires, a clustering task is initiated with that batch of data, and the cache is emptied after clustering finishes.
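The cache-and-flush policy can be sketched as a small buffer class. `FeatureBuffer`, its in-memory cache (the patent caches to hard disk), and the capacity/timer values are illustrative assumptions:

```python
import time

class FeatureBuffer:
    """Caches unknown-face feature vectors and hands a whole batch to the
    clustering step when the cache is full or a time limit has elapsed."""

    def __init__(self, cluster_fn, capacity=64, max_age_s=30.0):
        self.cluster_fn = cluster_fn      # callback that clusters one batch
        self.capacity = capacity
        self.max_age_s = max_age_s
        self.items = []
        self.last_flush = time.monotonic()

    def add(self, feat):
        self.items.append(feat)
        if (len(self.items) >= self.capacity
                or time.monotonic() - self.last_flush >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.items:
            self.cluster_fn(self.items)   # batch clustering on cached vectors
        self.items = []                   # empty the cache after clustering
        self.last_flush = time.monotonic()
```

Batching this way initiates clustering far less often, avoiding the frequent thread-task switching described above.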
The application aims to reduce the computational cost of updating clustering results by using an incremental clustering method based on graph connectivity, and, given the characteristics of video data, to balance accuracy and speed through multi-threaded acceleration and batched clustering.
The face recognition and clustering method of the application was compared in accuracy and speed against a common online clustering method (5 members are randomly selected from each cluster for similarity calculation; if a cluster has fewer members, the whole cluster participates). The test data comprises 4635 face pictures belonging to 932 identities; the comparison is shown in the following table:
Method | Clustering accuracy 1 | Clustering accuracy 2 | Number of clusters | Clustering time |
---|---|---|---|---|
Common online clustering | 98.79 (4579/4635) | 97.67 (4527/4635) | 967 | 3 min 27 s |
Present application | 99.78 (4625/4635) | 98.40 (4561/4635) | 988 | 2 min 45 s |
Two evaluation criteria were used: (1) clustering accuracy 1, under which several clusters may belong to the same identity, so the score is higher; (2) clustering accuracy 2, under which each cluster counts toward at most one identity, penalizing fragmented clustering results, so the score is lower. The fraction in brackets after each accuracy is its calculation (correctly clustered face pictures / total face pictures). It can be seen that the face recognition and clustering method of the present application has clear advantages in both accuracy and speed.
With and without the batch optimization for video, the pipeline was run on a 4-minute 720P MP4 video (including frame extraction, detection and recognition; without batch optimization, clustering is initiated directly each time; frame extraction interval of 3 seconds), repeated 5 times: total time dropped from 40±1 seconds to 35±1 seconds, a significant reduction in the total time of the prediction stage.
With and without the multi-threading optimization for video, the test on the same video (including frame extraction, detection and recognition, 3-second frame extraction interval) was repeated 5 times: total time dropped from 35±1 seconds to 30±2 seconds, again significantly reducing the total time of the prediction stage.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present application and are not limiting. Although the present application has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present application, which is intended to be covered by the appended claims.
Claims (10)
1. A face recognition and clustering method, the method comprising:
using a face detection model, inputting a picture to be detected, and outputting a rectangular detection frame of the face position in the picture to be detected;
cropping the rectangular detection frame, then enlarging and correcting it, inputting the result into a face recognition model, and outputting a feature vector corresponding to the face image; performing similarity calculation between this feature vector and the feature vectors of all faces with known identities, and if every similarity is smaller than a set threshold, the face is an unknown face;
inputting unknown faces into a face clustering model, and aggregating faces among the unknown faces that potentially share the same identity;
the face detection model and the face recognition model are trained neural network models;
the face clustering model is an incremental clustering model based on graph connection.
2. The face recognition and clustering method according to claim 1, wherein the incremental clustering model based on graph connection comprises the following specific steps:
step 1: inputting an unknown face;
step 2: if the number of current small clusters is less than 1, creating a new small cluster and taking the feature vector of the unknown face as its center;
if the number of current small clusters is greater than or equal to 1, calculating the similarity between the unknown face and every small-cluster center; if all similarities are below a set threshold, creating a new small cluster with the feature vector of the unknown face as its center;
if the similarity between the unknown face and the center of some small cluster is above the set threshold and is greater than its similarity to every other cluster center, adding the unknown face to that small cluster;
step 3: recalculating the mean of the feature vectors of all faces in the updated small cluster and using it as the updated cluster center;
step 4: calculating the similarity between every pair of small clusters; when the similarity is greater than a set threshold, adding a connection between the two small clusters, and when it is smaller than the set threshold, removing any connection between them;
step 5: constructing a graph with small clusters as nodes and the connections between them as edges; computing the connectivity between nodes: if any two nodes of the current graph are reachable from each other through its connections, the graph has a single connected component (itself); otherwise the graph splits into several connected components such that no node in one component is reachable from a node in another; the small clusters within each connected component form one large cluster and belong to the same face identity.
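The steps of claim 2 can be sketched as follows. The thresholds and the 2-D toy features are illustrative assumptions, not values from the source; the connected components in step 5 are found with a depth-first search, one of the options named in claim 3:

```python
import numpy as np
from collections import defaultdict

JOIN_T = 0.6   # illustrative thresholds, not from the source
EDGE_T = 0.7

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class IncrementalClusterer:
    def __init__(self):
        self.members = []   # members[i]: feature vectors in small cluster i
        self.centers = []   # centers[i]: mean feature vector of small cluster i

    def add(self, feat):
        # Steps 1-3: join the most similar small cluster, or start a new one,
        # then refresh that cluster's center as the mean of its members.
        sims = [cos(feat, c) for c in self.centers]
        if not sims or max(sims) < JOIN_T:
            self.members.append([feat])
            self.centers.append(feat.copy())
            return
        i = int(np.argmax(sims))
        self.members[i].append(feat)
        self.centers[i] = np.mean(self.members[i], axis=0)

    def large_clusters(self):
        # Steps 4-5: connect similar small clusters, then take the connected
        # components of the graph (via depth-first search) as identities.
        n = len(self.centers)
        adj = defaultdict(list)
        for i in range(n):
            for j in range(i + 1, n):
                if cos(self.centers[i], self.centers[j]) > EDGE_T:
                    adj[i].append(j)
                    adj[j].append(i)
        seen, components = set(), []
        for s in range(n):
            if s in seen:
                continue
            stack, comp = [s], []
            while stack:
                v = stack.pop()
                if v in seen:
                    continue
                seen.add(v)
                comp.append(v)
                stack.extend(adj[v])
            components.append(sorted(comp))
        return components

clu = IncrementalClusterer()
for f in [np.array([1.0, 0.0]), np.array([0.95, 0.05]), np.array([0.0, 1.0])]:
    clu.add(f)
print(clu.large_clusters())  # [[0], [1]]
```

The first two toy faces merge into one small cluster, the third starts its own, and no inter-cluster edge forms, so the two small clusters remain separate identities.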
3. The face recognition and clustering method of claim 2, wherein the connected components are computed by depth-first search or breadth-first search.
4. The face recognition and clustering method according to claim 1, wherein the picture to be detected is a picture extracted from a video.
5. The face recognition and clustering method according to claim 4, wherein the face recognition and clustering method is specifically performed as follows:
establishing a frame extraction thread, a target detection thread and a data queue;
extracting pictures from the video by using a CPU in the frame extraction thread and putting the pictures into the data queue;
in the target detection thread, using a picture processing chip, obtaining pictures from the data queue and performing detection, recognition and clustering with the face detection model, the face recognition model and the face clustering model.
6. The face recognition and clustering method of claim 5, wherein the number of frame extraction threads and the number of target detection threads are each 1 or more.
7. The face recognition and clustering method of claim 5, wherein the picture processing chip is a GPU, NPU or TPU chip.
8. The face recognition and clustering method according to claim 1, wherein a plurality of unknown faces are combined into a batch, the feature vectors of the unknown face images are calculated in batches and cached, and when the cache is full or a set time is reached, face clustering is performed on all cached feature vectors.
9. A face recognition and clustering system implemented based on any one of the methods of claims 1-8, the system comprising:
the face detection module, configured to input a picture to be detected into a face detection model and output a rectangular detection frame of the face position in the picture to be detected; the face detection model is a trained neural network model;
the face recognition module, configured to crop the rectangular detection frame, enlarge and correct it, input the result into a face recognition model, and output a feature vector corresponding to the face image; similarity calculation is performed between this feature vector and the feature vectors of all faces with known identities, and if every similarity is smaller than a set threshold, the face is an unknown face; the face recognition model is a trained neural network model; and
the face clustering module, configured to input unknown faces into a face clustering model and aggregate faces among the unknown faces that potentially share the same identity; the face clustering model is an incremental clustering model based on graph connection.
10. The face recognition and clustering system of claim 9, wherein the system further comprises:
and the picture acquisition module is used for extracting the picture to be detected from the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310676420.0A CN116704577A (en) | 2023-06-08 | 2023-06-08 | Face recognition and clustering method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116704577A true CN116704577A (en) | 2023-09-05 |
Family
ID=87838742
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310676420.0A Pending CN116704577A (en) | 2023-06-08 | 2023-06-08 | Face recognition and clustering method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
WO2022068196A1 (en) | Cross-modal data processing method and device, storage medium, and electronic device | |
CN104679818B (en) | A kind of video key frame extracting method and system | |
CN111539480B (en) | Multi-category medical image recognition method and equipment | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
CN112861695B (en) | Pedestrian identity re-identification method and device, electronic equipment and storage medium | |
CN107609105B (en) | Construction method of big data acceleration structure | |
CN112802054A (en) | Mixed Gaussian model foreground detection method fusing image segmentation | |
CN111309718B (en) | Distribution network voltage data missing filling method and device | |
CN110769259A (en) | Image data compression method for tracking track content of video target | |
CN112948613B (en) | Image incremental clustering method, system, medium and device | |
CN116704577A (en) | Face recognition and clustering method and system | |
CN112434798A (en) | Multi-scale image translation method based on semi-supervised learning | |
CN112560731A (en) | Feature clustering method, database updating method, electronic device and storage medium | |
CN117036897A (en) | Method for detecting few sample targets based on Meta RCNN | |
CN113743251B (en) | Target searching method and device based on weak supervision scene | |
CN115578765A (en) | Target identification method, device, system and computer readable storage medium | |
Cai et al. | An online face clustering algorithm for face monitoring and retrieval in real-time videos | |
CN116342466A (en) | Image matting method and related device | |
CN113706459A (en) | Detection and simulation restoration device for abnormal brain area of autism patient | |
CN110310297B (en) | Image segmentation method and system based on multi-resolution search particle swarm algorithm | |
CN113420608A (en) | Human body abnormal behavior identification method based on dense space-time graph convolutional network | |
CN112070023B (en) | Neighborhood prior embedded type collaborative representation mode identification method | |
CN117132777B (en) | Image segmentation method, device, electronic equipment and storage medium | |
CN117235137B (en) | Professional information query method and device based on vector database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||