CN118097755A - Intelligent face identity recognition method based on YOLO network - Google Patents

Intelligent face identity recognition method based on YOLO network

Info

Publication number
CN118097755A
CN118097755A (application CN202410289557.5A)
Authority
CN
China
Prior art keywords
yolo
face
image
bounding box
model
Prior art date
Legal status
Pending
Application number
CN202410289557.5A
Other languages
Chinese (zh)
Inventor
万敏
邹赛
邹源
Current Assignee
Guizhou University
Yibin University
Original Assignee
Guizhou University
Yibin University
Priority date
Filing date
Publication date
Application filed by Guizhou University, Yibin University
Priority to CN202410289557.5A
Publication of CN118097755A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an intelligent face identity recognition method based on a YOLO network, which comprises the following steps: acquiring video data, and obtaining a labeled face recognition image dataset based on the video data; building a YOLO intelligent face identity recognition model based on the YOLO v5 model, and training the YOLO intelligent face identity recognition model based on the labeled face recognition image dataset to obtain a trained YOLO intelligent face identity recognition model; and inputting video data to be detected into the trained YOLO intelligent face identity recognition model for detection, so as to obtain a target detection result. The invention not only improves the accuracy of evaluation, but also reduces the dependence on a large amount of training data, so that the system is easier to deploy and apply in educational environments with limited resources.

Description

Intelligent face identity recognition method based on YOLO network
Technical Field
The invention relates to the technical field of education, in particular to an intelligent face identity recognition method based on a YOLO network.
Background
Labor education has a unique and critical position in the primary and secondary education system, and aims to cultivate students' practical skills, spirit of teamwork and sense of social responsibility. However, in the traditional education system, the evaluation of labor education mainly depends on direct observation by teachers and self-reporting by students. Because of its inherent subjectivity and lack of quantification, this approach often yields evaluation results that are inconsistent with the actual situation and unreliable, and often cannot accurately evaluate the performance and progress of students in labor education. Such inaccurate assessment not only affects the improvement of educational quality, but may also lead to unbalanced development of students' labor skills and values. Therefore, establishing an objective, accurate and efficient labor education detection method is important for improving the achievement and overall quality of labor education.
In recent years, with the development of computer vision and deep learning technology, intelligent data acquisition and analysis techniques have been increasingly applied in the education field. In particular, face detection and identity recognition technology based on the YOLO network provides a new solution for the evaluation of labor education. Compared with traditional methods, this technology automatically identifies and records students' participation in labor education activities by analyzing video data, thereby improving the objectivity and efficiency of evaluation and reducing the consumption of human resources.
Although YOLO network-based face detection technology performs excellently on still image recognition, its accuracy and stability still need to be improved for real-time video processing and dynamic scene recognition, especially in complex labor education environments. Furthermore, efficient operation of these systems typically requires training on large amounts of data, with high demands on the quality and diversity of the datasets. Therefore, in order to solve these technical problems, the invention provides an intelligent face identity recognition method based on the YOLO network.
Disclosure of Invention
The invention aims to provide an intelligent face identity recognition method based on a YOLO network, so as to solve the problems in the prior art.
The invention provides an intelligent face identity recognition method based on a YOLO network, which comprises the following steps:
acquiring video data, and obtaining a labeled face recognition image dataset based on the video data;
building a YOLO intelligent face identity recognition model based on a YOLO v5 model, and training the YOLO intelligent face identity recognition model based on the labeled face recognition image dataset to obtain a trained YOLO intelligent face identity recognition model;
and inputting video data to be detected into the trained YOLO intelligent face identity recognition model for detection, so as to obtain a target detection result.
Optionally, the process of obtaining the labeled face recognition image dataset based on the video data includes:
carrying out image interception on the video data frame by frame to obtain a plurality of face images;
recognizing the plurality of face images based on a YOLO v5 model to obtain a face bounding box corresponding to each image;
obtaining corresponding initial bounding box coordinates and face confidence scores based on the face bounding boxes;
transforming the initial bounding box coordinates corresponding to each face bounding box to obtain transformed bounding box coordinates;
and carrying out category identification labeling on the face images based on LabelImg software to obtain labeled face images, wherein each labeled face image corresponds to an annotation file, and the annotation file comprises: the file name, face confidence score, class name and transformed bounding box coordinates of each labeled object.
Optionally, the process of transforming the initial bounding box coordinates corresponding to the face bounding box to obtain transformed bounding box coordinates includes:
carrying out normalization processing on the initial bounding box coordinates based on the original image size and the image size input to the model, to obtain normalized coordinates;
performing inverse normalization on the normalized coordinates to obtain the transformed bounding box coordinates;
the calculation formulas for normalizing the initial bounding box coordinates are:
x_norm = ((x_min + x_max) / 2) / W_orig
y_norm = ((y_min + y_max) / 2) / H_orig
w_norm = (x_max - x_min) / W_orig
h_norm = (y_max - y_min) / H_orig
where W_orig denotes the width of the original image, H_orig denotes the height of the original image, x_min denotes the x-coordinate of the upper-left corner of the bounding box, y_min denotes the y-coordinate of the upper-left corner of the bounding box, x_max denotes the x-coordinate of the lower-right corner of the bounding box, y_max denotes the y-coordinate of the lower-right corner of the bounding box, x_norm denotes the normalized value of the x-coordinate of the bounding box center with respect to the image width, y_norm denotes the normalized value of the y-coordinate of the bounding box center with respect to the image height, w_norm denotes the normalized value of the bounding box width with respect to the image width, and h_norm denotes the normalized value of the bounding box height with respect to the image height.
Optionally, the calculation formulas for the transformed bounding box coordinates obtained by performing inverse normalization on the normalized coordinates are:
x_min = (x_norm - w_norm / 2) × W_orig
y_min = (y_norm - h_norm / 2) × H_orig
x_max = (x_norm + w_norm / 2) × W_orig
y_max = (y_norm + h_norm / 2) × H_orig
optionally, the YOLO intelligent face identity recognition model includes a Backbone module and a Head module, and the process of obtaining the prediction result based on the Backbone module and the Head module includes:
performing feature extraction on the labeled face recognition image based on the Backbone module to obtain a multi-layer feature map;
the Head module fuses the multi-layer feature images based on a C3 layer to obtain a final output feature image;
Predicting the final output feature map to obtain a prediction result, wherein the prediction result comprises: the feature map predicts bounding boxes, categories, and confidence of the object.
Optionally, a total loss function is constructed based on the prediction result and the true values in the annotation file, and model optimization is performed based on the total loss function, wherein the mathematical model of the total loss function is:
L_total = λ_CIoU · L_CIoU + λ_Focal · L_Focal + λ_cross-entropy · L_cross-entropy
where L_CIoU denotes the bounding box loss function, L_Focal denotes the background noise loss function, and L_cross-entropy denotes the multi-class loss function.
Optionally, the YOLO intelligent face identity recognition model further includes a process of filtering the bounding boxes based on non-maximum suppression.
Optionally, the video data to be detected is input into the trained YOLO intelligent face identity recognition model for detection to obtain the class label and confidence of the target person, and the target detection result is obtained based on the class label and confidence of the target person;
the calculation formulas of the class label and the confidence are:
category = arg max_c P_class(c)
P = P_obj × P_class(c)
where arg max_c represents finding the value of c that maximizes P_class(c), P_class(c) represents the probability that the prediction box belongs to class c, P_obj represents the confidence that an object exists, and P represents the integrated confidence.
The invention has the following technical effects:
the model of the invention shows higher accuracy and robustness when processing dynamic scenes and complex backgrounds. In addition, the model is particularly optimized for the labor education environment, so that different illumination conditions and diversified student behaviors can be processed more effectively. The model not only improves the accuracy of evaluation, but also reduces the dependence on a large amount of training data, so that the system is easier to deploy and apply in an educational environment with limited resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a schematic diagram of a network structure of a YOLO v5 model in an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure of each network module of YOLO v5 in an embodiment of the present invention;
FIG. 3 is a graph of a YOLO v5 model training and validation loss function and evaluation index in an embodiment of the invention;
Fig. 4 is a flowchart of a YOLO v5 intelligent face identification method in an embodiment of the invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
A face detection and identity recognition method based on an improved YOLO network model is specially designed for a middle and primary school labor education scene. The main purpose of the method is to improve the objectivity and efficiency of labor education assessment through intelligent data acquisition and analysis technology. The method includes collecting and processing video data, automatically labeling with YOLO v5, generating an image dataset, and improving accuracy and stability of face detection by training and optimizing a YOLO model. At the same time, the model is particularly optimized for the labor education environment to deal with challenges of dynamic scenes and complex backgrounds. The invention provides a more accurate and efficient assessment method for middle and primary school labor education through an advanced computer vision technology.
As shown in fig. 4, the present embodiment optimizes the labor education scenario of primary and secondary schools to improve the accuracy of face detection and identity recognition, and optimizes the data training and processing flow to better adapt to the complex education environment, and the present embodiment discloses a method for intelligent face identity recognition based on a YOLO network, which comprises the following steps:
Acquiring video data, and acquiring a labeled face recognition image dataset based on the video data; the specific implementation process comprises the following steps:
s1, collecting video data related to face recognition, and intercepting images in each video according to the number of frames to form a picture sequence so as to construct a face recognition image data set; and (3) automatically marking by using YOLO, generating a file of the name of the png/. Jpg and the name of the txt, and checking and identifying missing by using Labelimg software on the face image dataset and improving the precision.
Based on the YOLO v5 model, a YOLO intelligent face identity recognition model is built, and based on the labeled face recognition image data set, the YOLO intelligent face identity recognition model is trained to obtain a trained YOLO intelligent face identity recognition model, and the specific implementation process comprises the following steps:
and S2, training and optimizing the YOLO v5 model by using the image data set in the S1, and storing the trained optimal weight file to obtain the YOLO intelligent face identification model.
Inputting video data to be detected into a trained YOLO intelligent face identity recognition model to detect to obtain a target detection result, wherein the specific implementation process comprises the following steps:
S3, inputting the video data to be detected or the pictures to be detected into the YOLO intelligent face identity recognition model, detecting the video data and outputting the corresponding target detection result.
Further, step S1 includes:
S101: collecting video image data containing person images in multiple different forms; intercepting images from each video according to the frame number to form a picture sequence, and constructing a face image dataset after eliminating the pictures that contain no faces.
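By way of illustration only (not part of the claimed method), a minimal Python sketch of this frame-interception step could look as follows; the OpenCV calls are standard, while the file paths and the sampling interval every_n_frames are assumptions chosen for the example:

```python
# Sketch of frame interception (S101): save every n-th frame of a video as a .jpg image.
import cv2
import os

def extract_frames(video_path, out_dir, every_n_frames=10):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()          # read the next frame
        if not ok:
            break
        if idx % every_n_frames == 0:   # keep only every n-th frame
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```

Frames without any face would then be discarded before the dataset is assembled, as described above.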
S102: automatic labeling with YOLO. YOLO v5 runs on the input image and outputs a series of detected face bounding boxes, each containing the bounding box coordinates x_min, y_min, x_max, y_max together with a confidence score indicating whether a face is contained within the box. Assuming the original image size is W_orig × H_orig and the image size input to the model is W_input × H_input, the detected bounding box coordinates need to be adjusted to the original image size to ensure that the detection result fits the original image. The scaled center point coordinates can be expressed as:
x_center = ((x_min + x_max) / 2) × (W_orig / W_input)
y_center = ((y_min + y_max) / 2) × (H_orig / H_input)
S1021: the scaling of width and height can be expressed as:
w = (x_max - x_min) × (W_orig / W_input)
h = (y_max - y_min) × (H_orig / H_input)
S103: in order to enable the object detection model to process images of multiple resolutions independently of the image size, and to improve the applicability and detection accuracy of the model on different image sizes, the coordinates are normalized, which can be expressed as:
x_norm = x_center / W_orig
y_norm = y_center / H_orig
w_norm = w / W_orig
h_norm = h / H_orig
S104: to ensure that the normalized output of the model maps precisely back to the original image size, so that the detection result has direct usability and accuracy, coordinate inverse normalization is performed on the detection result, which can be expressed as:
x_min = (x_norm - w_norm / 2) × W_orig
y_min = (y_norm - h_norm / 2) × H_orig
x_max = (x_norm + w_norm / 2) × W_orig
y_max = (y_norm + h_norm / 2) × H_orig
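As a plain illustration of the normalization (S103) and inverse normalization (S104) above, the following minimal Python sketch implements the formulas directly; function names and argument order are assumptions for the example only:

```python
# Sketch of box coordinate normalization and its inverse, following S103-S104.
def normalize_box(x_min, y_min, x_max, y_max, w_orig, h_orig):
    x_norm = ((x_min + x_max) / 2.0) / w_orig   # center x relative to image width
    y_norm = ((y_min + y_max) / 2.0) / h_orig   # center y relative to image height
    w_norm = (x_max - x_min) / w_orig           # width relative to image width
    h_norm = (y_max - y_min) / h_orig           # height relative to image height
    return x_norm, y_norm, w_norm, h_norm

def denormalize_box(x_norm, y_norm, w_norm, h_norm, w_orig, h_orig):
    x_min = (x_norm - w_norm / 2.0) * w_orig
    y_min = (y_norm - h_norm / 2.0) * h_orig
    x_max = (x_norm + w_norm / 2.0) * w_orig
    y_max = (y_norm + h_norm / 2.0) * h_orig
    return x_min, y_min, x_max, y_max
```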
s105: converting the calculated information of each bounding box into a standardized data format and generating .png/.jpg and .txt files. Assuming that I is an annotated image, the save process can be expressed as:
SaveAsJPG(I, path)
S1051: assuming that there are N detection results, each detection result Detection_i containing bounding box coordinates and a confidence, the process of outputting the .txt file can be expressed as:
OutputTXT({Detection_1, Detection_2, ..., Detection_N}, path)
Detection_i = {B_i, C_i}
B_i = (x_min,i, y_min,i, x_max,i, y_max,i)
where B_i is the bounding box coordinates of the i-th detection result and C_i is the confidence of the i-th detection result;
s106: checking the face image dataset for missed detections with LabelImg software and improving the annotation precision; each image (.png/.jpg) corresponds to a .txt file; the file includes the following information: the file name, confidence score, category name and coordinate information of each labeled target.
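For illustration, a small Python sketch of writing one image's detections into a .txt annotation file in the style of S105-S1051 is given below; the exact column order (class id, normalized coordinates, confidence) is an assumption, since the patent only lists which fields the file contains:

```python
# Sketch of saving one image's detections as a YOLO-style .txt label file.
def write_label_file(txt_path, detections, class_id=0):
    """detections: list of (x_norm, y_norm, w_norm, h_norm, confidence) tuples."""
    with open(txt_path, "w", encoding="utf-8") as f:
        for x, y, w, h, conf in detections:
            # one line per labeled object: class, normalized box, confidence
            f.write(f"{class_id} {x:.6f} {y:.6f} {w:.6f} {h:.6f} {conf:.4f}\n")
```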
Further, as shown in FIGS. 1-2, step S2 includes:
S201: the backbone network of the YOLO v5 model is a CSPDarknet structure, and the model comprises a Backbone module and a Head module connected with each other.
S2011: the Backbone module is used for extracting features from an input image and converting them into multi-layer feature maps; the Head module fuses the feature maps of different layers output by the Backbone module, enhancing the model's ability to detect objects of different scales; the feature maps predict the bounding boxes, categories and confidences of the objects, which are then output.
S2012: both the Backbone module and the Head module contain a C3 layer (CSP Bottleneck with 3 Convolutions), which splits, convolves and recombines the feature maps to output a new feature map; assume that the input feature map is:
F ∈ R^(H×W×D)
where H, W and D are the height, width and depth, respectively; the C3 layer first partitions the input feature map into two parts F_1 and F_2 along the depth dimension, such that:
F = Concat[F_1, F_2], with F_1 ∈ R^(H×W×(D/2)) and F_2 ∈ R^(H×W×(D/2))
S2013: F_1 is selected for convolution; denoting this operation as a function C(·), the processed feature map is C(F_1). The convolved C(F_1) is then recombined with the unprocessed F_2, and the output feature map F_out can be expressed as:
F_out = Concat[C(F_1), F_2]
where F_out ∈ R^(H×W×D). To better accommodate the subsequent identity detection task, an additional convolutional layer C'(·) is applied to F_out for feature extraction enhancement, expressed as:
F'_out = C'(F_out)
where F'_out is the final output feature map.
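A simplified PyTorch sketch of the split/convolve/concatenate idea behind S2012-S2013 is given below for illustration; it is not the exact YOLO v5 C3 layer, and the kernel sizes, activation choice and the assumption of an even channel count are choices made only for this example:

```python
# Sketch of a C3-style block: split in depth, convolve one half (C),
# concatenate with the untouched half, then apply an extra conv (C').
import torch
import torch.nn as nn

class SimpleC3(nn.Module):
    def __init__(self, channels):            # channels assumed even
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(          # C(.): conv on the first half F1
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(),
        )
        self.extra = nn.Sequential(           # C'(.): extra conv after concat
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x):
        f1, f2 = torch.chunk(x, 2, dim=1)               # split in depth: F1, F2
        f_out = torch.cat([self.branch(f1), f2], dim=1)  # Concat[C(F1), F2]
        return self.extra(f_out)                         # F'_out = C'(F_out)
```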
S202: prediction is performed on the fused feature maps of the Head module; the prediction calculation formulas are:
b_x = 2σ(t_x) - 0.5 + c_x
b_y = 2σ(t_y) - 0.5 + c_y
b_w = λ · p_w · (2σ(t_w))^2
b_h = λ · p_h · (2σ(t_h))^2
where b_x, b_y represent the x and y coordinates of the center of the prediction bounding box obtained by the convolution operation; t_x, t_y represent the model's raw outputs for the center position offset within each grid cell; c_x, c_y denote the coordinates of the grid cell; b_w denotes the width of the prediction bounding box and b_h denotes its height; p_w, p_h denote the width and height of the prior (anchor) box; t_w, t_h represent the model's raw outputs for resizing; λ represents a scale factor that can be adjusted according to the face data characteristics;
S203: to detect whether a face exists in a specific bounding box, an object confidence is defined, giving the probability that an object is detected in that bounding box; the calculation formula is:
Object confidence = σ(t_o)
where t_o is the raw output of the model.
S204: the Head module is used for predicting the category to which the object in each bounding box may belong. Specifically, for multi-category detection tasks, the module outputs a corresponding probability score for each category, so that objects of each category can be accurately identified and classified. The class score vector is the set of raw scores that the model outputs for each detected object, each score representing the relative confidence that the object belongs to a particular class. Assuming that the model detects K classes, for each detected object the model outputs a class score vector of length K, expressed as:
t = [t_1, t_2, t_3, ..., t_K]
where t_i represents the raw score predicted by the model for the object belonging to the i-th class.
S205: in order to determine the probability that each detected object belongs to each predefined category (in face detection, each object belongs to exactly one category), a multi-class probability distribution is obtained for each bounding box from the class score vector output by the model; the function is expressed as:
P_i = e^(t_i) / Σ_{j=1}^{K} e^(t_j)
where P_i represents the predicted probability that the object belongs to the i-th class.
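By way of illustration, the following Python sketch decodes raw outputs for one grid cell into a box center, an object confidence and class probabilities according to S202-S205; the grid/anchor bookkeeping of a full detector is deliberately omitted and the function signature is an assumption for the example:

```python
# Sketch of decoding raw predictions: sigmoid for center/objectness, softmax for classes.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_cell(t_x, t_y, t_o, t_cls, c_x, c_y):
    b_x = 2 * sigmoid(t_x) - 0.5 + c_x            # predicted center x
    b_y = 2 * sigmoid(t_y) - 0.5 + c_y            # predicted center y
    obj_conf = sigmoid(t_o)                       # object confidence
    scores = np.asarray(t_cls, dtype=float)
    probs = np.exp(scores - scores.max())         # numerically stable softmax
    probs = probs / probs.sum()
    return b_x, b_y, obj_conf, probs
```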
S206: setting model loss function
S2061: a bounding box loss function is introduced to quantify the difference between the bounding box predicted by the model and the corresponding real bounding box; the bounding box loss function can be expressed as:
L_CIoU = 1 - IoU + ρ²(b_predict, b_gt) / c² + α·v
where IoU denotes the intersection-over-union between the predicted box b_predict and the real box b_gt; ρ(b_predict, b_gt) represents the Euclidean distance between the center points of the predicted box and the real box; c represents the diagonal length of the smallest enclosing box containing the predicted box and the real box; α represents a trade-off parameter; and v quantifies the aspect-ratio consistency of the predicted and real boxes.
S2062: an object confidence loss function is added to improve the overall accuracy of object detection and reduce false positive predictions, effectively distinguishing object regions from non-object regions; the object confidence loss function can be expressed as:
L_BCE(O, y) = -[y · log(O) + (1 - y) · log(1 - O)]
where O represents the object confidence predicted by the model; y is the true label, which is 1 if there is an object and 0 if there is no object.
S2063: in the face detection task for identity recognition, the background area is usually far larger than the area containing objects, which leads to class imbalance; Focal Loss is therefore added, which can be expressed as:
L_Focal(O, y) = -α · [y · (1 - O)^γ · log(O) + (1 - y) · O^γ · log(1 - O)]
where γ represents an exponent that adjusts the contribution of easily classified samples to the loss, and α represents the weight balancing positive and negative samples.
S2064: in the face recognition identity detection task, the differences between individuals may be very subtle. In order to ensure that each face is matched with the correct identity, and to ensure consistency between the probability distribution predicted by the model and the actual label distribution, a multi-class loss function is added, which can be expressed as:
L_cross-entropy = -(1/N) · Σ_{i=1}^{N} Σ_{c=1}^{C} y_i,c · log(P_i,c)
where N represents the number of predicted bounding boxes; C represents the number of possible identity categories; y_i,c is an indicator of whether the true class of the i-th bounding box is class c (1 if so, 0 otherwise); and P_i,c represents the probability predicted by the model that the i-th bounding box belongs to category c.
S2065: in the model for face detection and identity recognition, the total loss function constitutes the key evaluation index for model training. It ensures that performance along several detection dimensions is considered comprehensively during training, and drives the model toward the dual goals of efficiency and accuracy in object detection and classification. It is expressed as:
L_total = λ_CIoU · L_CIoU + λ_Focal · L_Focal + λ_cross-entropy · L_cross-entropy
where each λ represents a hyper-parameter balancing the weight of the corresponding part of the loss function.
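For illustration only, the numpy sketch below implements the Focal Loss and multi-class cross-entropy of S2062-S2064 and the weighted total of S2065; the CIoU term is passed in as an already computed value (its definition is given in S2061), and the λ values shown are example numbers, not the weights used by the invention:

```python
# Sketch of the loss terms: focal loss, multi-class cross-entropy, weighted total.
import numpy as np

def focal_loss(o, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """o: predicted object confidences, y: 0/1 labels (same shape)."""
    o = np.clip(np.asarray(o, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    loss = -alpha * (y * (1 - o) ** gamma * np.log(o)
                     + (1 - y) * o ** gamma * np.log(1 - o))
    return float(np.mean(loss))

def cross_entropy_loss(p, y, eps=1e-7):
    """p, y: arrays of shape (N, C); y is one-hot over identity categories."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(np.sum(y * np.log(p), axis=1)))

def total_loss(l_ciou, l_focal, l_ce,
               lam_ciou=0.05, lam_focal=1.0, lam_ce=0.5):   # example weights
    return lam_ciou * l_ciou + lam_focal * l_focal + lam_ce * l_ce
```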
S207: in the YOLO v5 architecture for face detection and identity recognition, an optimization method based on the back-propagation algorithm is adopted to refine the parameter configuration of the neural network. First, the value of the loss function is calculated through the forward propagation process; then the gradient of the loss function with respect to each network parameter is computed layer by layer in the backward direction using the chain rule. The network parameters are adjusted and updated through the gradient descent rule so as to reduce the total loss function value. Through a continuous iterative optimization process, the network model is ensured, after repeated learning, to have a high recognition rate and positioning accuracy for objects in the detection task, so that the generalization capability of the model to new data is significantly improved. The update rule can be expressed as:
θ_new = θ_old - η · ∇_θ L_total
where θ_old represents the value of the parameter before the current iteration; η represents the learning rate, a hyper-parameter used to control the scale of the gradient applied in each parameter update; and ∇_θ L_total represents the gradient of the total loss function L_total with respect to the parameter θ, i.e. the rate of change of the loss function with respect to θ, indicating the direction and extent in which θ needs to be adjusted so that the loss function value decreases.
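A minimal sketch of this update rule is shown below for illustration; in practice YOLO v5 training would use an optimizer such as SGD with momentum or Adam rather than this bare step:

```python
# Sketch of one plain gradient-descent step: theta_new = theta_old - lr * dL/dtheta.
def sgd_step(theta_old, grad, lr=0.01):
    return theta_old - lr * grad
```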
S208: after the training phase of the YOLO v5 intelligent face identity recognition model is completed, non-maximum suppression (NMS) is adopted as a key post-processing technique. It is used to analyze the multiple face bounding boxes identified by the model, remove redundant boxes with a high degree of overlap, and keep only the box with the highest confidence, thereby optimizing the output quality of the model and ensuring its efficiency and reliability in complex image processing tasks. For each of the other bounding boxes B_i, its intersection-over-union with the bounding box B_max having the highest confidence is calculated as:
IoU(B_max, B_i) = |B_max ∩ B_i| / |B_max ∪ B_i|
If the IoU of B_i and B_max exceeds a preset threshold, i.e. IoU(B_max, B_i) > IoU_threshold, then B_i is suppressed; the box with the highest confidence is then selected again from the remaining bounding boxes, and the above procedure is repeated until all bounding boxes have been considered.
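An illustrative Python sketch of this NMS step follows; boxes are given as (x_min, y_min, x_max, y_max) tuples and the IoU threshold value is an example choice:

```python
# Sketch of IoU computation and greedy non-maximum suppression (S208).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                       # highest-confidence box B_max
        keep.append(best)
        # suppress boxes whose IoU with B_max exceeds the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```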
S209: after the non-maximum suppression (NMS) step, a threshold filtering technique is applied, aiming to further refine the post-NMS detection results and to ensure that the output bounding boxes have high confidence and accuracy. For each post-NMS bounding box B_i, its object confidence O_i and class probability P_i are computed, and the threshold filtering can be expressed as:
Reserved bounding boxes = { B_i | O_i > θ_confidence, P_i > θ_probability }
where θ_confidence and θ_probability represent the explicitly set thresholds for confidence and class probability, respectively.
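A short sketch of this filtering rule, with example threshold values, is given for illustration:

```python
# Sketch of post-NMS threshold filtering (S209): keep boxes passing both thresholds.
def filter_boxes(boxes, obj_confs, class_probs,
                 conf_threshold=0.25, prob_threshold=0.25):
    return [b for b, o, p in zip(boxes, obj_confs, class_probs)
            if o > conf_threshold and p > prob_threshold]
```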
S210: when YOLO v5 is used as the intelligent face identity recognition model, the bounding boxes finally retained after threshold filtering are regarded as positive sample blocks of the model; the YOLO v5 model is trained and optimized with the image dataset of S1 using these positive sample blocks, and the optimal weight file best.pt is saved after training, yielding the YOLO intelligent face identity recognition model and the training results.
S211: dividing a face image data set into a training set, a verification set and a test set; the training set is used for training the YOLO v5 model, and the verification set is used for verifying the YOLO v5 model after training is completed so as to evaluate the training result of the YOLO v5 model; the test set is used for testing the YOLO v5 model to judge the recognition accuracy of the YOLO v5 model.
S212: as shown in FIG. 3, in step S2 the evaluation indexes of the YOLO face detection and identity recognition model include precision (P), recall (R), mean average precision (mAP) and F1 score (F1 Score), calculated as:
P = TP / (TP + FP)
R = TP / (TP + FN)
mAP = (1/N) · Σ_{i=1}^{N} AP_i
F1 = 2 · P · R / (P + R)
where TP represents the number of correctly detected samples, FP represents the number of erroneously detected samples, FN represents the number of undetected samples, N represents the number of categories, and AP_i denotes the average precision of the i-th category.
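The following sketch illustrates these metrics in Python; the per-class average precision values AP_i are assumed to be supplied by the evaluation pipeline (the precision-recall curve integration itself is not shown):

```python
# Sketch of the S212 evaluation metrics from TP/FP/FN counts and per-class AP values.
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def mean_average_precision(ap_per_class):
    """ap_per_class: list of AP values, one per category."""
    return sum(ap_per_class) / len(ap_per_class) if ap_per_class else 0.0
```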
Further, step S3 includes:
S301: inputting video data to be detected into a YOLO intelligent face identity recognition model;
The YOLO intelligent face identity recognition model processes the video data to be detected and outputs the processed video. If a target person appears in a frame of the processed video, a prediction box is used to mark the target person and determine the category label, and the confidence is displayed for each prediction box. The calculation formulas for determining the category label and the confidence (P) are:
category = arg max_c P_class(c)
P = P_obj × P_class(c)
Combining the category label with the corresponding integrated confidence gives the final output:
Final output = (category, P)
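As a final illustration, the sketch below combines the class probabilities and object confidence into the (category, P) output of S301; the class_names list and the example values are hypothetical:

```python
# Sketch of producing the final output: label = argmax of class probabilities,
# displayed confidence = P_obj * P_class.
def final_output(p_obj, class_probs, class_names):
    c = max(range(len(class_probs)), key=lambda i: class_probs[i])
    p = p_obj * class_probs[c]
    return class_names[c], p

# Example usage (hypothetical values):
# final_output(0.9, [0.10, 0.85, 0.05], ["student_A", "student_B", "student_C"])
# -> ("student_B", 0.765)
```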
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (8)

1. An intelligent face identity recognition method based on a YOLO network, characterized by comprising the following steps:
acquiring video data, and obtaining a labeled face recognition image dataset based on the video data;
building a YOLO intelligent face identity recognition model based on a YOLO v5 model, and training the YOLO intelligent face identity recognition model based on the labeled face recognition image dataset to obtain a trained YOLO intelligent face identity recognition model;
and inputting video data to be detected into the trained YOLO intelligent face identity recognition model for detection, so as to obtain a target detection result.
2. The YOLO network-based intelligent face identity recognition method of claim 1, wherein the process of obtaining the labeled face recognition image dataset based on the video data comprises:
carrying out image interception on the video data frame by frame to obtain a plurality of face images;
recognizing the plurality of face images based on a YOLO v5 model to obtain a face bounding box corresponding to each image;
obtaining corresponding initial bounding box coordinates and face confidence scores based on the face bounding boxes;
transforming the initial bounding box coordinates corresponding to each face bounding box to obtain transformed bounding box coordinates;
and carrying out category identification labeling on the face images based on LabelImg software to obtain labeled face images, wherein each labeled face image corresponds to an annotation file, and the annotation file comprises: the file name, face confidence score, class name and transformed bounding box coordinates of each labeled object.
3. The YOLO network-based intelligent face identity recognition method according to claim 2, wherein the process of transforming the initial bounding box coordinates corresponding to the face bounding box to obtain transformed bounding box coordinates comprises:
carrying out normalization processing on the initial bounding box coordinates based on the original image size and the image size input to the model, to obtain normalized coordinates;
performing inverse normalization on the normalized coordinates to obtain the transformed bounding box coordinates;
the calculation formulas for normalizing the initial bounding box coordinates are:
x_norm = ((x_min + x_max) / 2) / W_orig
y_norm = ((y_min + y_max) / 2) / H_orig
w_norm = (x_max - x_min) / W_orig
h_norm = (y_max - y_min) / H_orig
where W_orig denotes the width of the original image, H_orig denotes the height of the original image, x_min denotes the x-coordinate of the upper-left corner of the bounding box, y_min denotes the y-coordinate of the upper-left corner of the bounding box, x_max denotes the x-coordinate of the lower-right corner of the bounding box, y_max denotes the y-coordinate of the lower-right corner of the bounding box, x_norm denotes the normalized value of the x-coordinate of the bounding box center with respect to the image width, y_norm denotes the normalized value of the y-coordinate of the bounding box center with respect to the image height, w_norm denotes the normalized value of the bounding box width with respect to the image width, and h_norm denotes the normalized value of the bounding box height with respect to the image height.
4. The YOLO network-based intelligent face identity recognition method according to claim 3, wherein the calculation formulas for the transformed bounding box coordinates obtained by performing inverse normalization on the normalized coordinates are:
x_min = (x_norm - w_norm / 2) × W_orig
y_min = (y_norm - h_norm / 2) × H_orig
x_max = (x_norm + w_norm / 2) × W_orig
y_max = (y_norm + h_norm / 2) × H_orig
5. The YOLO network-based intelligent face identity recognition method of claim 4, wherein the YOLO intelligent face identity recognition model comprises a Backbone module and a Head module, and face recognition is performed based on the Backbone module and the Head module; the process of obtaining the prediction result comprises the following steps:
performing feature extraction on the labeled face recognition image based on the Backbone module to obtain a multi-layer feature map;
the Head module fuses the multi-layer feature images based on a C3 layer to obtain a final output feature image;
Predicting the final output feature map to obtain a prediction result, wherein the prediction result comprises: the feature map predicts bounding boxes, categories, and confidence of the object.
6. The YOLO network-based intelligent face identity recognition method according to claim 5, wherein a total loss function is constructed based on the prediction result and the true values in the annotation file, and model optimization is performed based on the total loss function, wherein the mathematical model of the total loss function is:
L_total = λ_CIoU · L_CIoU + λ_Focal · L_Focal + λ_cross-entropy · L_cross-entropy
where L_CIoU denotes the bounding box loss function, L_Focal denotes the background noise loss function, and L_cross-entropy denotes the multi-class loss function.
7. The YOLO network-based intelligent face identity recognition method of claim 6, wherein the YOLO intelligent face identity recognition model further comprises a process of filtering the bounding boxes based on non-maximum suppression.
8. The YOLO network-based intelligent face identity recognition method according to claim 1, wherein video data to be detected is input into the trained YOLO intelligent face identity recognition model for detection to obtain the class label and confidence of the target person, and the target detection result is obtained based on the class label and confidence of the target person;
the calculation formulas of the class label and the confidence are:
category = arg max_c P_class(c)
P = P_obj × P_class(c)
where arg max_c represents finding the value of c that maximizes P_class(c), P_class(c) represents the probability that the prediction box belongs to class c, P_obj represents the confidence that an object exists, and P represents the integrated confidence.
CN202410289557.5A 2024-03-14 2024-03-14 Intelligent face identity recognition method based on YOLO network Pending CN118097755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410289557.5A CN118097755A (en) 2024-03-14 2024-03-14 Intelligent face identity recognition method based on YOLO network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410289557.5A CN118097755A (en) 2024-03-14 2024-03-14 Intelligent face identity recognition method based on YOLO network

Publications (1)

Publication Number Publication Date
CN118097755A

Family

ID=91161516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410289557.5A Pending CN118097755A (en) 2024-03-14 2024-03-14 Intelligent face identity recognition method based on YOLO network

Country Status (1)

Country Link
CN (1) CN118097755A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118279573A (en) * 2024-06-03 2024-07-02 广东师大维智信息科技有限公司 Method for monitoring moving target based on YOLO network
CN118366205A (en) * 2024-06-17 2024-07-19 长城信息股份有限公司 Attention mechanism-based light face tracking method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224609A1 (en) * 2019-03-07 2021-07-22 Institute Of Automation, Chinese Academy Of Sciences Method, system and device for multi-label object detection based on an object detection network
CN113269142A (en) * 2021-06-18 2021-08-17 中电科大数据研究院有限公司 Method for identifying sleeping behaviors of person on duty in field of inspection
CN113435330A (en) * 2021-06-28 2021-09-24 平安科技(深圳)有限公司 Micro-expression identification method, device, equipment and storage medium based on video
CN115240240A (en) * 2022-04-29 2022-10-25 清远蓄能发电有限公司 Infrared face recognition method and system based on YOLO network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224609A1 (en) * 2019-03-07 2021-07-22 Institute Of Automation, Chinese Academy Of Sciences Method, system and device for multi-label object detection based on an object detection network
CN113269142A (en) * 2021-06-18 2021-08-17 中电科大数据研究院有限公司 Method for identifying sleeping behaviors of person on duty in field of inspection
CN113435330A (en) * 2021-06-28 2021-09-24 平安科技(深圳)有限公司 Micro-expression identification method, device, equipment and storage medium based on video
CN115240240A (en) * 2022-04-29 2022-10-25 清远蓄能发电有限公司 Infrared face recognition method and system based on YOLO network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HYWMJ: "Mutual conversion between VOC-format xml labels annotated with LabelImg and YOLO-format txt labels" (LabelImg标注的VOC格式xml标签与YOLO格式txt标签相互转换), 18 May 2021 (2021-05-18), pages 1 - 4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118279573A (en) * 2024-06-03 2024-07-02 广东师大维智信息科技有限公司 Method for monitoring moving target based on YOLO network
CN118366205A (en) * 2024-06-17 2024-07-19 长城信息股份有限公司 Attention mechanism-based light face tracking method and system

Similar Documents

Publication Publication Date Title
CN110287932B (en) Road blocking information extraction method based on deep learning image semantic segmentation
CN111507370A (en) Method and device for obtaining sample image of inspection label in automatic labeling image
CN109977191B (en) Problem map detection method, device, electronic equipment and medium
CN104680542B (en) Remote sensing image variation detection method based on on-line study
CN108875600A (en) A kind of information of vehicles detection and tracking method, apparatus and computer storage medium based on YOLO
CN118097755A (en) Intelligent face identity recognition method based on YOLO network
CN111914642B (en) Pedestrian re-identification method, device, equipment and medium
CN110264444B (en) Damage detection method and device based on weak segmentation
CN112418278A (en) Multi-class object detection method, terminal device and storage medium
CN110633711B (en) Computer device and method for training feature point detector and feature point detection method
CN111738319B (en) Clustering result evaluation method and device based on large-scale samples
CN114722958A (en) Network training and target detection method and device, electronic equipment and storage medium
CN113158777A (en) Quality scoring method, quality scoring model training method and related device
CN116912796A (en) Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device
CN111539456A (en) Target identification method and device
CN116415020A (en) Image retrieval method, device, electronic equipment and storage medium
CN113988222A (en) Forest fire detection and identification method based on fast-RCNN
CN117437555A (en) Remote sensing image target extraction processing method and device based on deep learning
CN116152576B (en) Image processing method, device, equipment and storage medium
CN115082713A (en) Method, system and equipment for extracting target detection frame by introducing space contrast information
CN115424000A (en) Pointer instrument identification method, system, equipment and storage medium
CN114627534A (en) Living body discrimination method, electronic device, and storage medium
CN112396648B (en) Target identification method and system capable of positioning mass center of target object
CN112348062A (en) Meteorological image prediction method, meteorological image prediction device, computer equipment and storage medium
CN111369532A (en) Method and device for processing mammary gland X-ray image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination