CN111401418A - Employee dressing specification detection method based on improved Faster r-cnn


Info

Publication number
CN111401418A
CN111401418A (application CN202010147949.XA)
Authority
CN
China
Prior art keywords
layer
network
convolution
block
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010147949.XA
Other languages
Chinese (zh)
Inventor
包晓安
黄友
张娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University Of Science And Technology Tongxiang Research Institute Co ltd
Original Assignee
Zhejiang University Of Science And Technology Tongxiang Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University Of Science And Technology Tongxiang Research Institute Co ltd filed Critical Zhejiang University Of Science And Technology Tongxiang Research Institute Co ltd
Priority to CN202010147949.XA priority Critical patent/CN111401418A/en
Publication of CN111401418A publication Critical patent/CN111401418A/en
Pending legal-status Critical Current

Classifications

    • G (PHYSICS) › G06 (COMPUTING; CALCULATING OR COUNTING) › G06F (ELECTRIC DIGITAL DATA PROCESSING) › G06F18/00 Pattern recognition › G06F18/20 Analysing › G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation › G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques › G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06N (COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS) › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology › G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V (IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING) › G06V10/00 Arrangements for image or video recognition or understanding › G06V10/20 Image preprocessing › G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V20/00 Scenes; Scene-specific elements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an employee dressing specification detection method based on an improved Faster r-cnn. The method collects, labels and augments a sample data set for different application scenarios; establishes an improved Faster r-cnn network model; trains the improved network with the enhanced training sample set; detects the test sample set with the trained network model; and analyzes whether the detection results meet the predefined dressing standard, feeding back any non-standard dressing that is detected. Compared with one-stage target detection algorithms, the improved algorithm achieves higher accuracy; the improved network both detects multi-scale targets more effectively and improves the detection rate. The method can be applied directly to real-time detection on surveillance video and serves to monitor employee dress compliance and issue reminders, thereby effectively improving the efficiency of staff management.

Description

Employee dressing specification detection method based on improved Faster r-cnn
Technical Field
The invention relates to the fields of computer vision, target detection and deep learning for detecting target behaviors, and particularly to an employee dressing specification detection method based on an improved Faster r-cnn.
Background
In recent years, with the rapid development of computers, networks, and image processing and transmission technologies, companies increasingly manage their employees through intelligent, information-based systems. To supervise employee behavior, many companies install monitoring equipment: cameras record the employees, and dedicated staff watch the screens for abnormal behavior. However, as a company grows, the numbers of employees and cameras grow with it; supervising employee behavior manually is time-consuming and labor-intensive, and violations are easily missed through visual fatigue or a moment's inattention. Given the present situation of large volumes of surveillance video, low utilization, complex management and heavy staffing requirements, and following the requirements and specifications that enterprise management defines for different scenes, the invention applies a deep-learning-based Faster r-cnn network model to detect the images captured by the cameras in real time. The main task is to apply deep-learning-based target detection within a monitoring system to detect moving targets of interest in the video in real time, which can effectively improve the efficiency of safety monitoring and save substantial financial and material resources.
Existing target detection algorithms fall into two main classes: image recognition methods based on traditional feature operators, and target detection methods based on deep learning. Methods based on traditional feature operators are not robust to scene changes and detect poorly once the scene varies. Deep-learning-based detection algorithms in turn divide into single-stage and two-stage algorithms, and both kinds struggle to balance detection precision against detection speed. The improved Faster r-cnn network model proposed by the invention achieves a higher detection speed while maintaining high detection accuracy.
Disclosure of Invention
In order to solve the problem of detecting the dressing standard of employees, the invention provides an employee dressing specification detection method based on an improved Faster r-cnn that extracts multi-level, more robust image features, so that the model attains a better detection speed while maintaining good detection precision. The specific technical scheme is as follows:
A. collecting and labeling sample data sets aiming at different application scenes;
B. performing data enhancement on the sample data set;
C. establishing an improved Faster r-cnn network model, which comprises a feature pyramid network, a guided anchor point frame generation network, region-of-interest mapping and feature map pooling, a classification sub-network and a regression frame sub-network; an image from the application scene is sent into the feature pyramid network to extract multi-scale semantic feature maps, the multi-scale semantic feature maps are sent into the guided anchor point frame generation network to generate anchor point frames, the multi-scale semantic feature maps with their anchor point frames are pooled to obtain feature maps of consistent scale, and the feature maps are sent into the classification sub-network to predict the category of each anchor point frame and into the regression frame sub-network to predict its position (the wiring of these components is sketched in the code following step E);
D. training the improved Faster r-cnn network model by adopting the sample data set after data enhancement to generate a training model;
E. acquiring an employee dressing image and inputting it into the training model generated in step D to obtain the category and position of the dressing to be detected; if the detected dressing category or position is found not to conform to the preset standard of the employee dressing specification, a reminding signal is sent.
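By way of illustration only, the following minimal sketch shows how the five components of step C could be wired together in a forward pass. It is written in PyTorch (an assumption; the disclosure specifies no framework), and the constructor arguments are hypothetical stand-in modules for the networks described above.

```python
import torch.nn as nn

class ImprovedFasterRCNN(nn.Module):
    """Wiring of the five step-C components; each argument is a hypothetical
    stand-in module for the corresponding network described above."""
    def __init__(self, fpn, anchor_net, roi_pool, cls_net, reg_net):
        super().__init__()
        self.fpn = fpn                 # feature pyramid network
        self.anchor_net = anchor_net   # guided anchor point frame generation network
        self.roi_pool = roi_pool       # region-of-interest mapping and pooling
        self.cls_net = cls_net         # classification sub-network
        self.reg_net = reg_net         # regression frame sub-network

    def forward(self, images):
        feats = self.fpn(images)                       # multi-scale semantic feature maps
        anchors, adapted = self.anchor_net(feats)      # anchor point frames + adapted features
        rois = self.roi_pool(adapted, anchors)         # feature maps of consistent scale
        return self.cls_net(rois), self.reg_net(rois)  # categories and positions
```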
Further, the step A specifically comprises:
A1, collecting a sample data set, including manual collection of images in the actual application scenes and downloading of network images;
A2, labeling the sample data set: the targets to be detected in the sample images are annotated with the LabelImg tool according to the employee dressing standard to be checked; the labels of the annotated rectangular boxes comprise employee clothing (staff), non-employee clothing (notstaff), apron (pinafore), hat (hat) and mask (mask), and after labeling is completed the annotations are automatically stored as an annotation file corresponding to the sample image.
Further, the step B specifically includes:
B1, data expansion, including flipping, scaling and brightness change;
flipping: the sample data set images are flipped vertically and horizontally, and each flipped image is used as a new sample image;
scaling: scaling operations are performed on the sample data set images, with scale ratios of 0.5, 0.8, 1.2 and 1.5 respectively;
brightness change: the brightness of the sample images is varied to simulate the change of illumination intensity under real conditions, with brightness ratios of 0.5, 0.75, 1.25 and 1.50 respectively;
B2, fusion: the sample data set images containing targets to be detected are fused with randomly selected normal images containing no target to be detected, with fusion coefficients of 0.3, 0.5 and 0.7 respectively, and the annotation files of the sample images are updated;
B3, cropping: the sample data set images are cropped randomly, in length only, in width only, or overall, with the random interval being 10% of the length and of the width respectively;
after each of the above operations, the annotation file of the sample image is updated as well (a code sketch of these operations follows).
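For concreteness, a minimal sketch of the step-B operations using the Pillow library (an assumption; the disclosure names no tooling). The coefficients follow the text; updating the bounding boxes in each derived annotation file is noted but omitted here, and the fusion coefficient is assumed to weight the annotated image.

```python
import random
from PIL import Image, ImageEnhance

FLIPS = [Image.FLIP_TOP_BOTTOM, Image.FLIP_LEFT_RIGHT]   # B1: flipping
SCALES = [0.5, 0.8, 1.2, 1.5]                             # B1: scaling ratios
BRIGHTNESS = [0.5, 0.75, 1.25, 1.50]                      # B1: brightness ratios
FUSION = [0.3, 0.5, 0.7]                                  # B2: fusion coefficients

def augment(sample: Image.Image, normal: Image.Image):
    """Yield augmented copies of one annotated sample image (steps B1-B3).
    The bounding boxes in the annotation file must be updated per copy."""
    for f in FLIPS:
        yield sample.transpose(f)
    w, h = sample.size
    for s in SCALES:
        yield sample.resize((int(w * s), int(h * s)))
    for b in BRIGHTNESS:
        yield ImageEnhance.Brightness(sample).enhance(b)
    normal = normal.resize(sample.size).convert(sample.mode)
    for a in FUSION:   # coefficient a weights the annotated image (assumed)
        yield Image.blend(normal, sample, a)
    dx, dy = int(0.10 * w), int(0.10 * h)   # B3: crop within 10% of each side
    yield sample.crop((random.randint(0, dx), random.randint(0, dy),
                       w - random.randint(0, dx), h - random.randint(0, dy)))
```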
Further, the feature pyramid network in the step C is composed of a bottom-up feedforward calculation network and a top-down lateral connection network;
the feedforward calculation network consists of an initialization convolutional layer, an initialization pooling layer, a first block layer, a second block layer, a third block layer and a fourth block layer which are sequentially stacked; the initialization convolutional layer consists of a convolutional layer, a batch normalization layer and a nonlinear activation layer, the size of the convolutional kernel is 7 x 7, the step length of the convolutional kernel is 2, and the number of generated feature channels is 64; the step length of the initialization pooling layer is 2; the output of the initialization pooling layer is connected to four block layers: the first block layer comprises 3 residual modules, the second block layer comprises 4 residual modules, the third block layer comprises 6 residual modules and the fourth block layer comprises 3 residual modules; each residual module comprises three convolution layers, with kernels of 1 x 1, 3 x 3 and 1 x 1 respectively, together with batch normalization layers and activation function layers, and the numbers of convolution kernels of the convolution layers in the four block layers are 64, 128, 256 and 512 in sequence; each block layer also comprises a branch consisting of a batch normalization layer, a nonlinear activation layer and a convolution layer, the convolution size of this convolution layer is 1 x 1 and the numbers of generated feature map channels are 256, 512, 1024 and 512 respectively; the input of the branch is the same as the input of the first residual module in each block layer, and the output of the branch and the output of the first residual module are added to form the input of the next residual module; the feature map output by each block layer serves on the one hand as the input of the next block layer and on the other hand as an input of the lateral connection network;
the lateral connection network takes the feature maps generated by the first layer block, the second layer block, the third layer block and the fourth layer block in the feedforward calculation network as the input of lateral connection, and uses convolution layers with convolution kernel size of 1 x 1 and step length of 1 to respectively operate, and then adds the result with the result of top-down up sampling to output four semantic feature maps with different levels; and operating the feature map generated by the fourth layer block in the feedforward calculation network through a convolution layer with the convolution kernel size of 1 x 1 and the step length of 2 to obtain a fifth semantic feature map.
Furthermore, the guided anchor frame generation network in the step C guides the generation of the anchor frame by using a semantic feature map, and comprises a position prediction branch, a shape prediction branch and a feature adaptive branch;
the position prediction branch is used for predicting which areas should be used as central points to generate anchor points, dividing the area of the whole feature map into a target central area, a peripheral area and an ignored area; the small area at the center of a real target frame, as it maps onto the feature map, is marked as the target central area and used as a positive sample during training, and the remaining areas are marked as ignored or negative samples according to their distance from the center; the position prediction branch consists of a convolution layer, a nonlinear activation layer and a loss layer, the convolution kernel size of the convolution layer is 1 x 1, the number of generated feature channels is 1, the output of the position prediction branch is the probability that each position of the feature map is a target center, and the positions predicted to lie in the target central area serve as candidate central areas of anchor points;
the shape prediction branch is used for predicting the optimal length and width by giving an anchor point central point, belongs to a regression problem, firstly, 9 groups of w and h are sampled in a target central area by adopting an approximate method, the overlapping degree of the 9 groups and a target real frame is calculated, and the w and h with the maximum overlapping degree are the w and h of the current anchor point position; the shape prediction network is a convolution layer with convolution kernel size of 1 x 1, the number of generated characteristic channels is 2, and the output of the network is the predicted value of the length and width of the anchor point frame at each position of the characteristic diagram;
the feature self-adaptive branch directly blends the shape information of the anchor frame into the feature map by using a deformable convolution operation, so that the newly obtained feature map can adapt to the shape of the anchor frame at each position; using the predicted values of the length and width of the anchor frame at each position of the feature map, the position offsets of the next layer of convolution kernels are obtained by a 1 x 1 convolution, and the original feature map is then corrected by a 3 x 3 deformable convolution operation to obtain a feature map adapted to the shape of the anchor frame.
Further, the region-of-interest mapping and feature map pooling in step C maps the anchor point frames into the feature map according to the coordinates and position information of the anchor point frames obtained from the guided anchor point frame generation network, and pooling is then performed to obtain feature maps of consistent scale.
Further, the classification sub-network and the regression frame sub-network described in step C both calculate the corresponding class and position coordinates through a fully connected network; the output size of the classification sub-network is 2 × k × A, and the output size of the regression frame sub-network is 4 × A, where k represents the number of classes and A represents the number of anchor points.
Further, the classification sub-network calculates a classification loss using a focus loss function.
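A minimal sketch of the focus (focal) loss referred to here, in PyTorch; the alpha and gamma values follow the common defaults of Lin et al., which the disclosure does not specify.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy down-weighted for easy examples so
    that hard and minority samples dominate the gradient."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # positive/negative weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```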
Further, the step D specifically includes:
D1, pre-training of the network: the improved Faster r-cnn network model is pre-trained on the VOC data set, and the trained network model parameters are stored as a pth file;
D2, secondary training of the network: the pre-trained model is loaded and trained a second time with the data-enhanced sample data set; the initial learning rate is set to 0.01 and its value decreases stepwise as the number of training iterations grows; the batch training size is set to 16; the trained model is finally obtained and its parameter file is saved as a pth file.
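The D1/D2 schedule might look as follows in PyTorch; model, train_set and model_loss are placeholders from the surrounding pipeline, and the momentum value, decay interval and epoch count are assumptions, since only the initial rate (0.01), the stepwise decay and the batch size (16) are stated.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

model.load_state_dict(torch.load("pretrained_voc.pth"))     # D1: VOC pre-trained weights
loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)  # initial lr 0.01
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)      # stepwise lr decay

for epoch in range(30):                 # number of epochs is an assumption
    for images, targets in loader:
        optimizer.zero_grad()
        loss = model_loss(model, images, targets)   # combined cls + box loss
        loss.backward()
        optimizer.step()
    scheduler.step()                    # lr decreases as training progresses

torch.save(model.state_dict(), "trained_model.pth")         # D2: saved as .pth
```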
Further, the step E specifically includes:
E1, the test set samples are detected with the trained model; the detection results, namely the predicted category, the confidence and the corresponding position coordinates, are stored as a pth file;
E2, the precision of each category to be detected and the average precision over all categories are calculated from the detection results of E1.
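A small sketch of the E2 computation under an assumed record layout: each detection carries its class, confidence, box and a flag saying whether it was matched to a ground-truth box beforehand (e.g. at IoU >= 0.5).

```python
def category_precision(detections):
    """Per-category precision and its average over categories (step E2).
    detections: iterable of (cls, conf, box, matched) records (assumed layout)."""
    tp, fp = {}, {}
    for cls, conf, box, matched in detections:
        bucket = tp if matched else fp
        bucket[cls] = bucket.get(cls, 0) + 1
    classes = set(tp) | set(fp)
    precision = {c: tp.get(c, 0) / (tp.get(c, 0) + fp.get(c, 0)) for c in classes}
    return precision, sum(precision.values()) / len(precision)
```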
Compared with the prior art, the invention has the beneficial effects that:
(1) the data enhancement operation is carried out on the collected sample set, so that the problems of insufficient network learning and poor detection effect caused by too small actual sample amount can be solved;
(2) by adopting a residual network built from pre-activated residual units and using the PReLU function instead of the ReLU function as the nonlinear activation function, image features with better robustness can be extracted at the cost of very few additional network parameters, improving the accuracy of network target detection;
(3) the image characteristic pyramid network is adopted to fuse the characteristics, so that multi-scale characteristics can be obtained, and good detection effects are achieved for targets with different scales;
(4) the network is generated by adopting the guide anchor point frame, so that the detection time can be greatly shortened while higher detection accuracy is ensured; the classification sub-network calculates the classification loss by adopting a focus loss function, so that the problems of unbalance of positive and negative samples of a candidate frame and unbalance of difficult and easy samples in target detection can be effectively solved;
(5) along with the increase of the training times, the learning rate is gradually reduced in a step-type mode, and the training speed can be effectively accelerated. Therefore, the invention is a technical breakthrough for the detection method of the dressing specification of the staff and solves the problems in the existing detection methods.
Drawings
FIG. 1 is a diagram of the steps of the method of the present invention;
FIG. 2 is a schematic diagram of an improved Faster r-cnn network;
FIG. 3 is a diagram of a pre-activation residual block architecture;
FIG. 4 is a schematic diagram of the PReLU function;
FIG. 5 is a diagram of an image pyramid network structure;
FIG. 6 is a diagram of a network structure generated by a guided anchor box;
FIG. 7 is a flow chart of employee dressing specification detection.
Detailed Description
The invention is further described by the following detailed description in conjunction with the accompanying drawings.
Referring to fig. 1, the implementation steps of the present invention are as follows:
A. collecting and labeling sample data set aiming at different application scenes
The sample data set is collected from two sources: images of the various targets to be detected downloaded from the network, and images collected manually in the actual application scene. The data set is then labeled with the LabelImg software: every target to be detected that appears in each image is annotated so that none is missed, and after annotation the labels are automatically stored in an annotation file with the same name as the image.
B. Data enhancement for training sample set
The sample set obtained in step A is enhanced mainly in three ways: data expansion, fusion and cropping. Data expansion comprises flipping, scaling and brightness change. Flipping: the sample data set images are flipped vertically and horizontally, and each flipped image is used as a new sample image. Scaling: scaling operations are performed on the sample data set images, with scale ratios of 0.5, 0.8, 1.2 and 1.5 respectively. Brightness change: the brightness of the sample data set images is varied to simulate the change of illumination intensity under real conditions, with coefficients of 0.5, 0.75, 1.25 and 1.50 respectively. Fusion: each sample data set image containing a target to be detected is fused with a randomly selected normal image containing no target to be detected, with fusion coefficients of 0.3, 0.5 and 0.7 respectively. Cropping: the sample data set images are cropped randomly, in length only, in width only, or overall, with the random interval being 10% of the length and of the width respectively. After each of these operations the annotation file of the affected sample image is updated. The new data obtained in these three ways are combined with the original data into the sample data set.
C. Establishing the improved Faster r-cnn network model
As shown in fig. 2, the improved Faster r-cnn network model consists of a feature pyramid network, a guided anchor point frame generation network, region-of-interest mapping and feature map pooling, a classification sub-network, and a regression frame sub-network. The image is sent into the feature pyramid network to extract multi-scale features; the extracted feature maps are sent into the guided anchor point frame generation network to generate anchor point frames; the multi-scale feature maps are then pooled to obtain feature maps of consistent scale; finally, the feature maps are sent into the classification sub-network to predict the category of each anchor point frame and into the regression sub-network to predict its position.
The characteristic pyramid network is composed of a bottom-up feedforward calculation network and a top-down lateral connection network, wherein the bottom-up feedforward calculation network is composed of an initialization convolution layer, an initialization pooling layer, a first block layer, a second block layer, a third block layer and a fourth block layer which are sequentially stacked.
The initialization convolutional layer consists of a convolutional layer, a batch normalization layer and a nonlinear activation layer. The convolution kernel size is 7 × 7, the convolution kernel step size is 2, and the number of generated feature channels is 64. The step size for initializing the pooling layer is 2.
As shown in fig. 3, each block layer is composed of several pre-activation residual modules, each module containing three groups of batch normalization layer, nonlinear activation layer and convolution layer. The first block layer has 3 blocks, where the first block consists of two parallel branches. The first branch consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes of the convolution layers are 1 x 1, 3 x 3 and 1 x 1 respectively, and the numbers of generated feature map channels are 64, 64 and 256 respectively. The second branch consists of a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution size of the convolution layer is 1 x 1 and the number of generated feature map channels is 256. The feature maps of the first branch and the second branch are added to form the input of the next block. The second block consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes are 1 x 1, 3 x 3 and 1 x 1 respectively, and the numbers of generated feature map channels are 64, 64 and 256 respectively; its output is then added to the input of the current block to form the input of the next block. The third block has the same structure as the second block.
The second block layer has 4 blocks, where the first block consists of two parallel branches. The first branch consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes of the convolution layers are 1 x 1, 3 x 3 and 1 x 1 in sequence, the numbers of generated feature map channels are 128, 128 and 512 respectively, and the step length of the first convolution layer is 2. The second branch consists of a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution size of the convolution layer is 1 x 1, the number of generated feature map channels is 512, and the step length of the convolution layer is 2. The feature maps of the first branch and the second branch are added to form the input of the next block. The second block consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes are 1 x 1, 3 x 3 and 1 x 1 respectively, and the numbers of generated feature map channels are 128, 128 and 512 respectively; its output is then added to the input of the current block to form the input of the next block. The latter two blocks have the same structure as the second block.
The third block layer has 6 blocks, where the first block consists of two parallel branches. The first branch consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes of the convolution layers are 1 x 1, 3 x 3 and 1 x 1 in sequence, the numbers of generated feature map channels are 256, 256 and 1024 respectively, and the step length of the first convolution layer is 2. The second branch consists of a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution size of the convolution layer is 1 x 1, the number of generated feature map channels is 1024, and the step length of the convolution layer is 2. The feature maps of the first branch and the second branch are added to form the input of the next block. The second block consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes are 1 x 1, 3 x 3 and 1 x 1 respectively, and the numbers of generated feature map channels are 256, 256 and 1024 respectively; its output is then added to the input of the current block to form the input of the next block. The remaining four blocks have the same structure as the second block.
The fourth block layer has 3 blocks, where the first block consists of two parallel branches. The first branch consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes of the convolution layers are 1 x 1, 3 x 3 and 1 x 1 in sequence, the numbers of generated feature map channels are 128, 128 and 512 respectively, and the step length of the first convolution layer is 2. The second branch consists of a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution size of the convolution layer is 1 x 1, the number of generated feature map channels is 512, and the step length of the convolution layer is 2. The feature maps of the first branch and the second branch are added to form the input of the next block. The second block consists of a batch normalization layer, a nonlinear activation layer, a convolution layer, a batch normalization layer, a nonlinear activation layer and a convolution layer; the convolution kernel sizes are 1 x 1, 3 x 3 and 1 x 1 respectively, and the numbers of generated feature map channels are 128, 128 and 512 respectively; its output is then added to the input of the current block to form the input of the next block. The third block has the same structure as the second block.
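Gathering the description above into code, a sketch of one pre-activation residual block and the first block layer; PyTorch is assumed, and the block follows the (BN, PReLU, conv) ordering of FIG. 3 rather than any official implementation.

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block per FIG. 3: three (BN, PReLU, conv)
    groups with 1x1, 3x3, 1x1 kernels; the first block of a block layer
    adds a parallel (BN, PReLU, 1x1 conv) projection branch."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, project=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.PReLU(),
            nn.Conv2d(in_ch, mid_ch, 1, stride=stride, bias=False),  # text: stride on 1st conv
            nn.BatchNorm2d(mid_ch), nn.PReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.PReLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
        )
        self.shortcut = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.PReLU(),
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
        ) if project else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.shortcut(x)   # add the two branches

# First block layer: 3 blocks, channels 64/64/256, as described above.
block_layer1 = nn.Sequential(
    PreActBlock(64, 64, 256, project=True),
    PreActBlock(256, 64, 256),
    PreActBlock(256, 64, 256),
)
```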
The image is input into the feedforward computing network, which performs the series of operations shown in figure 3 to extract the multi-scale feature maps.
The nonlinear activation function is the PReLU function shown in FIG. 4:

PReLU(x) = x for x > 0, and PReLU(x) = a · x for x ≤ 0,

where a is a learnable parameter; experiments show that setting a to 0.8 gives a better effect.
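In PyTorch the same activation is available directly; a tiny usage sketch with the stated a = 0.8:

```python
import torch
import torch.nn as nn

act = nn.PReLU(init=0.8)   # learnable parameter a, initialised to the stated 0.8
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
print(act(x))              # -> [-1.6, -0.4, 0.0, 1.0, 3.0] before any training
```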
The top-down lateral connection network takes the multi-scale feature maps computed by the feedforward calculation network as input and outputs feature maps in which high- and low-level features are fused. As shown in fig. 5, the feature maps generated by the first, second, third and fourth block layers of the feedforward computing network are used as the inputs of the lateral connections and denoted m1, m2, m3 and m4. Operating on m4 with a convolution layer of kernel size 1 x 1 gives a feature map p4 with 256 channels; operating on m3 with a 1 x 1 convolution layer gives a 256-channel feature map p31, and p4 is up-sampled and added to p31 to obtain p3; operating on m2 with a 1 x 1 convolution layer gives a 256-channel feature map p21, and p3 is up-sampled and added to p21 to obtain p2; operating on m1 with a 1 x 1 convolution layer gives a 256-channel feature map p11, and p2 is up-sampled and added to p11 to obtain p1; operating on m4 with a 1 x 1 convolution layer of step length 2 gives a feature map p5 with 256 channels.
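The lateral-connection arithmetic just described, as a PyTorch sketch; the input channel counts of m1-m4 (256, 512, 1024, 512) are taken from the block-layer description above, and nearest-neighbour up-sampling by a factor of 2 is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class LateralFPN(nn.Module):
    """Top-down lateral connections: 1x1 convs reduce m1-m4 to 256 channels,
    up-sampled maps are added top-down, and p5 is a strided 1x1 conv on m4."""
    def __init__(self, in_channels=(256, 512, 1024, 512)):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in in_channels)
        self.p5_conv = nn.Conv2d(in_channels[3], 256, 1, stride=2)

    def forward(self, m1, m2, m3, m4):
        p4 = self.lateral[3](m4)
        p3 = self.lateral[2](m3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[1](m2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lateral[0](m1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        p5 = self.p5_conv(m4)          # fifth semantic feature map
        return p1, p2, p3, p4, p5
```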
The guided anchor frame generation network utilizes semantic features to guide the generation of anchor frames. As shown in fig. 6, the guided anchor block generation network is composed of three parts, namely a position prediction branch, a shape prediction branch and a feature adaptive branch:
The position prediction branch is used to predict which areas should serve as central points for generating anchor points. The area of the whole feature map is divided into a target central area, a peripheral area and an ignored area; the small area at the center of a real target frame, as it maps onto the feature map, is marked as the target central area and used as a positive sample during training, and the remaining areas are marked as ignored or negative samples according to their distance from the center. The position prediction network consists of a convolution layer, a nonlinear activation layer and a loss layer; the convolution kernel size of the convolution layer is 1 x 1 and the number of generated feature channels is 1. The output of the network is the probability that each location of the feature map is the center of a target, and the positions predicted to lie in the target central area serve as candidate central areas of anchor points;
The shape prediction branch predicts the optimal length and width given an anchor point center, which is a regression problem. First, nine candidate pairs of w and h are sampled in the target central area using an approximate method. The overlap of each of the nine pairs with the target real frame is then calculated, and the w and h with the maximum overlap become the w and h of the current anchor point position. The shape prediction network is a convolution layer with kernel size 1 x 1 generating 2 feature channels, and its output is the predicted value of the length and width of the anchor point frame at each position of the feature map;
The feature self-adaptive branch blends the shape information of the anchor point frame directly into the feature map using a deformable convolution operation, so that the newly obtained feature map adapts to the shape of the anchor point frame at each position; that is, a larger anchor point frame corresponds to a larger receptive field and a smaller anchor point frame to a smaller one. From the w and h predicted by the shape prediction branch, a 1 x 1 convolution yields the position offsets of the next convolution kernel, and a 3 x 3 deformable convolution operation then corrects the original feature map to obtain a feature map adapted to the anchor point frame shape;
The feature map obtained above is input, on the one hand, into the position prediction branch of the guided anchor point frame generation network to predict the target central region and, on the other hand, into its shape prediction branch to predict the optimal length and width at each anchor point. Combined with the relevant information on the image scale change, the anchor point candidate frames are obtained. A 1 x 1 convolution of the shape prediction result gives the position offsets of the next layer of convolution kernels, and a 3 x 3 deformable convolution operation then corrects the original feature map to obtain the feature map adapted to the anchor point frame shapes.
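A compact sketch of the three branches, using torchvision's DeformConv2d for the deformable convolution; the sigmoid on the location output and the offset-channel count (2 · 3 · 3 = 18) follow the standard guided-anchoring construction, which the text paraphrases.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class GuidedAnchorHead(nn.Module):
    """Position branch (1x1 conv -> 1 channel, sigmoid centre probability),
    shape branch (1x1 conv -> 2 channels, predicted w and h), and feature
    adaptation (1x1 conv on the shape output -> offsets for a 3x3 deformable
    conv that corrects the original feature map)."""
    def __init__(self, channels=256):
        super().__init__()
        self.loc = nn.Conv2d(channels, 1, 1)      # position prediction branch
        self.shape = nn.Conv2d(channels, 2, 1)    # shape prediction branch
        self.offset = nn.Conv2d(2, 18, 1)         # offsets for a 3x3 kernel (2*3*3)
        self.adapt = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, feat):
        centre_prob = torch.sigmoid(self.loc(feat))  # prob. each position is a target centre
        wh = self.shape(feat)                        # predicted anchor w, h per position
        adapted = self.adapt(feat, self.offset(wh))  # feature map adapted to anchor shape
        return centre_prob, wh, adapted
```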
Region-of-interest mapping and feature map pooling: according to the coordinates and position information of the anchor point frames obtained from the guided anchor point frame generation network, the anchor point frames are mapped into the feature map, and a pooling operation then yields feature maps of fixed size. The fixed-size feature maps are sent into the classification sub-network to obtain the category prediction of each anchor point frame and, at the same time, into the regression sub-network to obtain its coordinate prediction. The classification sub-network and the regression frame sub-network each calculate the corresponding category and position coordinates through a fully connected network; the output size of the classification sub-network is 2 × k × A and that of the regression frame sub-network is 4 × A, where k represents the number of categories and A the number of anchor points. The classification sub-network calculates the classification loss with a focus loss function, which effectively alleviates the sample imbalance problem.
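The pooling and the two sub-networks might be sketched as follows; the pooled size (7 x 7), the hidden width and the feature stride are assumptions, while K = 5 matches the five labels of step A and the output sizes 2 × k × A and 4 × A follow the text.

```python
import torch.nn as nn
from torchvision.ops import roi_align

K, A, POOL = 5, 1, 7   # K = 5 labels from step A; A anchors per position; 7x7 pool (assumed)

class DetectionHeads(nn.Module):
    def __init__(self, channels=256, width=1024):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(channels * POOL * POOL, width), nn.PReLU())
        self.cls = nn.Linear(width, 2 * K * A)   # classification sub-network: 2 x k x A
        self.reg = nn.Linear(width, 4 * A)       # regression frame sub-network: 4 x A

    def forward(self, feature_map, boxes):
        # Map the anchor point frames into the feature map and pool to a fixed
        # scale; boxes is a list of [N, 4] (x1, y1, x2, y2) tensors per image.
        rois = roi_align(feature_map, boxes, output_size=(POOL, POOL),
                         spatial_scale=1 / 16)   # stride of this feature level (assumed)
        h = self.fc(rois)
        return self.cls(h), self.reg(h)
```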
D. Pre-training, training and selection of networks
The improved Faster r-cnn network of step C is pre-trained on the VOC data set, and the trained network model parameters are stored as a pth file. The sample set enhanced in step B is divided into a training set and a test set in a ratio of 8:2. The pre-trained model is then loaded and trained with the training sample set: the initial learning rate is set to 0.01 and decreases stepwise as the number of training iterations grows; the batch size is set to 16; and the trained network parameter file is saved as a pth file.
E. Detecting a test sample set by using a trained network model
And D, detecting the test set sample by using the model trained in the step D, wherein detection results are respectively a prediction type, a confidence coefficient and a corresponding position coordinate, and are stored as a pth file. And respectively calculating the precision of the to-be-detected category and the average precision of all the categories according to the detection result.
The flow of employee dressing specification detection is shown in fig. 7. First, an image to be detected is obtained from the video monitoring terminal every 15 s. Second, the image is fed into the neural network loaded with the trained model parameters and the detection result is computed; the result comprises the detected object categories, confidences and corresponding position coordinates. Then, according to the predefined dressing standard, the detection result is analyzed to decide whether the dressing of the staff in the image meets the standard. Finally, the detection result is output; if the dressing is non-standard, the reminding service is started and the detection result is stored and sent to the relevant responsible person. Tests show that the method can detect 4 images per minute with a detection precision of 92%.
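Putting the FIG. 7 flow into a sketch: a frame is taken every 15 s, detected, and checked against the predefined dress code. Here detect() and notify() are hypothetical stand-ins for the trained model and the reminding service, and the camera URL and required-item set are examples only.

```python
import time
import cv2

REQUIRED = {"staff", "pinafore", "hat", "mask"}   # example dress code only

cap = cv2.VideoCapture("rtsp://camera.example/stream")   # assumed video source
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    detections = detect(frame)          # (category, confidence, box) triples
    found = {cls for cls, conf, box in detections if conf > 0.5}
    if (REQUIRED - found) or ("notstaff" in found):   # dressing not up to standard
        notify(frame, detections)       # store the result and alert the person in charge
    time.sleep(15)                      # one image to detect every 15 s
```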
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (9)

1. An employee dressing specification detection method based on improved Faster r-cnn is characterized by comprising the following steps:
A. collecting and labeling sample data sets aiming at different application scenes;
B. performing data enhancement on the sample data set;
C. establishing an improved Faster r-cnn network model, which comprises a feature pyramid network, a guided anchor point frame generation network, region-of-interest mapping and feature map pooling, a classification sub-network and a regression frame sub-network; an image from the application scene is sent into the feature pyramid network to extract multi-scale semantic feature maps, the multi-scale semantic feature maps are sent into the guided anchor point frame generation network to generate anchor point frames, the multi-scale semantic feature maps with their anchor point frames are pooled to obtain feature maps of consistent scale, and the feature maps are sent into the classification sub-network to predict the category of each anchor point frame and into the regression frame sub-network to predict its position;
D. training the improved Faster r-cnn network model by adopting the sample data set after data enhancement to generate a training model;
E. acquiring an employee dressing image and inputting it into the training model generated in step D to obtain the category and position of the dressing to be detected; if the detected dressing category or position is found not to conform to the preset standard of the employee dressing specification, a reminding signal is sent.
2. The employee dressing specification detection method based on improved Faster r-cnn according to claim 1, wherein the step A specifically comprises:
A1, collecting a sample data set, including manual collection of images in the actual application scenes and downloading of network images;
A2, labeling the sample data set: the targets to be detected in the sample images are annotated with the LabelImg tool according to the employee dressing standard to be checked; the labels of the annotated rectangular boxes comprise employee clothing (staff), non-employee clothing (notstaff), apron (pinafore), hat (hat) and mask (mask), and after labeling is completed the annotations are automatically stored as an annotation file corresponding to the sample image.
3. The employee dressing specification detection method based on improved Faster r-cnn according to claim 1, wherein the step B specifically comprises:
B1, data expansion, including flipping, scaling and brightness change;
flipping: the sample data set images are flipped vertically and horizontally, each flipped image is used as a new sample image, and the annotation file of the sample image is updated;
scaling: scaling operations are performed on the sample data set images, with scale ratios of 0.5, 0.8, 1.2 and 1.5 respectively, and the annotation file of the sample image is updated;
brightness change: random brightness changes are applied to the sample images to simulate the change of illumination intensity under real conditions, with brightness ratios of 0.5, 0.75, 1.25 and 1.50 respectively, and the annotation file of the sample image is updated;
B2, fusion: the sample data set images containing targets to be detected are fused with randomly selected normal images containing no target to be detected, with fusion coefficients of 0.3, 0.5 and 0.7 respectively, and the annotation files of the sample images are updated;
B3, cropping: the sample data set images are cropped randomly, in length only, in width only, or overall, with the random interval being 10% of the length and of the width respectively, and the annotation file of the sample image is updated.
4. The employee dressing specification detection method based on improved Faster r-cnn according to claim 1, wherein the feature pyramid network of step C is composed of a bottom-up feedforward computing network and a top-down lateral connection network;
the feedforward calculation network consists of an initialization convolutional layer, an initialization pooling layer, a first block layer, a second block layer, a third block layer and a fourth block layer which are sequentially stacked; the initialization convolutional layer consists of a convolutional layer, a batch normalization layer and a nonlinear activation layer, the size of a convolutional layer convolutional kernel is 7 x 7, the step length of the convolutional kernel is 2, and the number of generated characteristic channels is 64; initializing the step length of the pooling layer to be 2; the output of the initialization pooling layer is connected to four block layers: the first block layer comprises 3 residual modules, the second block layer comprises 4 residual modules, the third block layer comprises 6 residual modules, the fourth block layer comprises 3 residual modules, each residual module comprises three convolution layers with 3 convolution kernels respectively being 1 x 1, 3 x 3 and 1 x 1, a batch normalization layer and an activation function layer, and the convolution kernels of the convolution layers in the four block layers are 64, 128, 256 and 512 in sequence; each block layer also comprises a branch consisting of a batch normalization layer, a nonlinear activation layer and a convolution layer, the convolution size of the convolution layer is 1 x 1, the number of generated characteristic diagram channels is 256, 512, 1024 and 512 respectively, the input of the branch is the same as the input of the first residual module in each block layer, and the output of the branch and the output of the first residual module are added to be used as the input of the next residual module; the feature diagram output by each block layer is used as the input of the next block layer on one hand and used as the input of a lateral connection network on the other hand;
the lateral connection network takes the feature maps generated by the first layer block, the second layer block, the third layer block and the fourth layer block in the feedforward calculation network as the input of lateral connection, and uses convolution layers with convolution kernel size of 1 x 1 and step length of 1 to respectively operate, and then adds the result with the result of top-down up sampling to output four semantic feature maps with different levels; and operating the feature map generated by the fourth layer block in the feedforward calculation network through a convolution layer with the convolution kernel size of 1 x 1 and the step length of 2 to obtain a fifth semantic feature map.
5. The employee dressing specification detection method based on improved Faster r-cnn according to claim 1, wherein the guided anchor point frame generation network of step C uses a semantic feature map to guide the generation of anchor point frames and comprises a position prediction branch, a shape prediction branch and a feature adaptive branch;
the position prediction branch is used for predicting which areas should be used as central points to generate anchor points, dividing the area of the whole feature map into a target central area, a peripheral area and an ignored area; the small area at the center of a real target frame, as it maps onto the feature map, is marked as the target central area and used as a positive sample during training, and the remaining areas are marked as ignored or negative samples according to their distance from the center; the position prediction branch consists of a convolution layer, a nonlinear activation layer and a loss layer, the convolution kernel size of the convolution layer is 1 x 1, the number of generated feature channels is 1, the output of the position prediction branch is the probability that each position of the feature map is a target center, and the positions predicted to lie in the target central area serve as candidate central areas of anchor points;
the shape prediction branch is used for predicting the optimal length and width by giving an anchor point central point, belongs to a regression problem, firstly, 9 groups of w and h are sampled in a target central area by adopting an approximate method, the overlapping degree of the 9 groups and a target real frame is calculated, and the w and h with the maximum overlapping degree are the w and h of the current anchor point position; the shape prediction network is a convolution layer with convolution kernel size of 1 x 1, the number of generated characteristic channels is 2, and the output of the network is the predicted value of the length and width of the anchor point frame at each position of the characteristic diagram;
the feature self-adaptive branch directly blends the shape information of the anchor frame into the feature map by using a deformable convolution operation, so that the newly obtained feature map can adapt to the shape of the anchor frame at each position; using the predicted values of the length and width of the anchor frame at each position of the feature map, the position offsets of the next layer of convolution kernels are obtained by a 1 x 1 convolution, and the original feature map is then corrected by a 3 x 3 deformable convolution operation to obtain a feature map adapted to the shape of the anchor frame.
6. The employee dressing specification detection method based on improved Faster r-cnn according to claim 1, wherein the region-of-interest mapping and feature map pooling of step C maps the anchor point frames into the feature map according to the coordinates and position information of the anchor point frames obtained from the guided anchor point frame generation network, and pooling is then performed to obtain feature maps of consistent scale.
7. The employee dressing specification detection method based on improved Faster r-cnn according to claim 1, wherein the classification sub-network and the regression frame sub-network of step C both calculate the corresponding category and position coordinates through a fully connected network; the output size of the classification sub-network is 2 × k × A and that of the regression frame sub-network is 4 × A, where k represents the number of categories and A represents the number of anchor points.
8. The employee dressing specification detection method based on improved Faster r-cnn according to claim 7, wherein said classification sub-network calculates the classification loss using a focus loss function.
9. The employee dressing specification detection method based on improved Faster r-cnn according to claim 1, wherein the step D specifically comprises:
D1, pre-training of the network: the improved Faster r-cnn network model is pre-trained on the VOC data set, and the trained network model parameters are stored as a pth file;
D2, secondary training of the network: the pre-trained model is loaded and trained a second time with the data-enhanced sample data set; the initial learning rate is set to 0.01 and its value decreases stepwise as the number of training iterations grows; the batch training size is set to 16; the trained model is finally obtained and its parameter file is saved as a pth file.
CN202010147949.XA 2020-03-05 2020-03-05 Employee dressing specification detection method based on improved Faster r-cnn Pending CN111401418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010147949.XA CN111401418A (en) 2020-03-05 2020-03-05 Employee dressing specification detection method based on improved Faster r-cnn

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010147949.XA CN111401418A (en) 2020-03-05 2020-03-05 Employee dressing specification detection method based on improved Faster r-cnn

Publications (1)

Publication Number Publication Date
CN111401418A true CN111401418A (en) 2020-07-10

Family

ID=71432212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010147949.XA Pending CN111401418A (en) 2020-03-05 2020-03-05 Employee dressing specification detection method based on improved Faster r-cnn

Country Status (1)

Country Link
CN (1) CN111401418A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183417A (en) * 2020-09-30 2021-01-05 重庆天智慧启科技有限公司 Business consultant service capability evaluation system and method
CN112434670A (en) * 2020-12-14 2021-03-02 武汉纺织大学 Equipment and method for detecting abnormal behavior of power operation
CN113012139A (en) * 2021-03-29 2021-06-22 南京奥纵智能科技有限公司 Deep learning algorithm for detecting defects of conductive particles of liquid crystal display
CN113139530A (en) * 2021-06-21 2021-07-20 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113496260A (en) * 2021-07-06 2021-10-12 浙江大学 Grain depot worker non-standard operation detection method based on improved YOLOv3 algorithm
CN113642574A (en) * 2021-07-30 2021-11-12 中国人民解放军军事科学院国防科技创新研究院 Small sample target detection method based on feature weighting and network fine tuning
CN113869249A (en) * 2021-09-30 2021-12-31 广州文远知行科技有限公司 Lane line marking method, device, equipment and readable storage medium
CN113902958A (en) * 2021-10-12 2022-01-07 广东电网有限责任公司广州供电局 Anchor point self-adaption based infrastructure field personnel detection method
WO2023077821A1 (en) * 2021-11-07 2023-05-11 西北工业大学 Multi-resolution ensemble self-training-based target detection method for small-sample low-quality image
CN116503517A (en) * 2023-06-27 2023-07-28 江西农业大学 Method and system for generating image by long text

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170176190A1 (en) * 2017-03-09 2017-06-22 Thomas Danaher Harvey Devices and methods to facilitate escape from a venue with a sudden hazard
CN107463892A (en) * 2017-07-27 2017-12-12 北京大学深圳研究生院 Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics
CN108052900A (en) * 2017-12-12 2018-05-18 成都睿码科技有限责任公司 A kind of method by monitor video automatic decision dressing specification
CN109299688A (en) * 2018-09-19 2019-02-01 厦门大学 Ship Detection based on deformable fast convolution neural network
CN109472226A (en) * 2018-10-29 2019-03-15 上海交通大学 A kind of sleep behavioral value method based on deep learning
CN109711401A (en) * 2018-12-03 2019-05-03 广东工业大学 A kind of Method for text detection in natural scene image based on Faster Rcnn
CN110059674A (en) * 2019-05-24 2019-07-26 天津科技大学 Standard dressing detection method based on deep learning
CN110097129A (en) * 2019-05-05 2019-08-06 西安电子科技大学 Remote sensing target detection method based on profile wave grouping feature pyramid convolution

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170176190A1 (en) * 2017-03-09 2017-06-22 Thomas Danaher Harvey Devices and methods to facilitate escape from a venue with a sudden hazard
CN107463892A (en) * 2017-07-27 2017-12-12 北京大学深圳研究生院 Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics
CN108052900A (en) * 2017-12-12 2018-05-18 成都睿码科技有限责任公司 A kind of method by monitor video automatic decision dressing specification
CN109299688A (en) * 2018-09-19 2019-02-01 厦门大学 Ship Detection based on deformable fast convolution neural network
CN109472226A (en) * 2018-10-29 2019-03-15 上海交通大学 A kind of sleep behavioral value method based on deep learning
CN109711401A (en) * 2018-12-03 2019-05-03 广东工业大学 A kind of Method for text detection in natural scene image based on Faster Rcnn
CN110097129A (en) * 2019-05-05 2019-08-06 西安电子科技大学 Remote sensing target detection method based on profile wave grouping feature pyramid convolution
CN110059674A (en) * 2019-05-24 2019-07-26 天津科技大学 Standard dressing detection method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIFENG DAI ET AL.: "Deformable Convolutional Networks" *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183417B (en) * 2020-09-30 2023-12-05 重庆天智慧启科技有限公司 System and method for evaluating service capability of consultant in department of industry
CN112183417A (en) * 2020-09-30 2021-01-05 重庆天智慧启科技有限公司 Business consultant service capability evaluation system and method
CN112434670A (en) * 2020-12-14 2021-03-02 武汉纺织大学 Equipment and method for detecting abnormal behavior of power operation
CN113012139A (en) * 2021-03-29 2021-06-22 南京奥纵智能科技有限公司 Deep learning algorithm for detecting defects of conductive particles of liquid crystal display
CN113139530A (en) * 2021-06-21 2021-07-20 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113139530B (en) * 2021-06-21 2021-09-03 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113496260A (en) * 2021-07-06 2021-10-12 浙江大学 Grain depot worker non-standard operation detection method based on improved YOLOv3 algorithm
CN113496260B (en) * 2021-07-06 2024-01-30 浙江大学 Grain depot personnel non-standard operation detection method based on improved YOLOv3 algorithm
CN113642574A (en) * 2021-07-30 2021-11-12 中国人民解放军军事科学院国防科技创新研究院 Small sample target detection method based on feature weighting and network fine tuning
CN113642574B (en) * 2021-07-30 2022-11-29 中国人民解放军军事科学院国防科技创新研究院 Small sample target detection method based on feature weighting and network fine tuning
CN113869249A (en) * 2021-09-30 2021-12-31 广州文远知行科技有限公司 Lane line marking method, device, equipment and readable storage medium
CN113869249B (en) * 2021-09-30 2024-05-07 广州文远知行科技有限公司 Lane marking method, device, equipment and readable storage medium
CN113902958A (en) * 2021-10-12 2022-01-07 广东电网有限责任公司广州供电局 Anchor point self-adaption based infrastructure field personnel detection method
WO2023077821A1 (en) * 2021-11-07 2023-05-11 西北工业大学 Multi-resolution ensemble self-training-based target detection method for small-sample low-quality image
CN116503517B (en) * 2023-06-27 2023-09-05 江西农业大学 Method and system for generating image by long text
CN116503517A (en) * 2023-06-27 2023-07-28 江西农业大学 Method and system for generating image by long text

Similar Documents

Publication Publication Date Title
CN111401418A (en) Employee dressing specification detection method based on improved Faster r-cnn
Rijal et al. Ensemble of deep neural networks for estimating particulate matter from images
CN113870260B (en) Welding defect real-time detection method and system based on high-frequency time sequence data
CN110084165B (en) Intelligent identification and early warning method for abnormal events in open scene of power field based on edge calculation
CN111401419A (en) Improved RetinaNet-based employee dressing specification detection method
CN112801146B (en) Target detection method and system
CN112613569B (en) Image recognition method, training method and device for image classification model
CN111209958B (en) Substation equipment detection method and device based on deep learning
CN111723657B (en) River foreign matter detection method and device based on YOLOv3 and self-optimization
CN116310785B (en) Unmanned aerial vehicle image pavement disease detection method based on YOLO v4
CN112613454A (en) Electric power infrastructure construction site violation identification method and system
CN111368636A (en) Object classification method and device, computer equipment and storage medium
CN110175519B (en) Method and device for identifying separation and combination identification instrument of transformer substation and storage medium
US20230048386A1 (en) Method for detecting defect and method for training model
CN113222149A (en) Model training method, device, equipment and storage medium
CN112288700A (en) Rail defect detection method
CN114494168A (en) Model determination, image recognition and industrial quality inspection method, equipment and storage medium
CN115830399A (en) Classification model training method, apparatus, device, storage medium, and program product
CN116823793A (en) Device defect detection method, device, electronic device and readable storage medium
CN113408630A (en) Transformer substation indicator lamp state identification method
CN115984158A (en) Defect analysis method and device, electronic equipment and computer readable storage medium
CN112396104A (en) Plasma discharge identification method and system based on machine learning
CN117114420B (en) Image recognition-based industrial and trade safety accident risk management and control system and method
CN117670755B (en) Detection method and device for lifting hook anti-drop device, storage medium and electronic equipment
CN111626409B (en) Data generation method for image quality detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination