CN112800977A - Teacher blackboard writing action identification method based on multi-granularity convolutional neural network pruning - Google Patents


Info

Publication number
CN112800977A
Authority
CN
China
Prior art keywords
network
regularization
pruning
value
formula
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110130937.0A
Other languages
Chinese (zh)
Inventor
张文博
包振山
周晚晴
杜嘉磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Application filed by Beijing University of Technology
Priority to CN202110130937.0A
Publication of CN112800977A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Abstract

A teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning, belonging to the field of deep learning. Targeting the practical application of intelligent recording of high-quality classroom courses in smart education, the invention applies multi-granularity convolutional neural network pruning to a human action recognition algorithm, thereby increasing its processing speed. The blackboard-writing action recognition algorithm consists of three steps: feature extraction with OpenPose, coordinate normalization, and classification with a BP neural network. In addition, within the OpenPose algorithm, a multi-granularity convolutional neural network pruning framework operating at both the filter level and the connection level is used to compress the OpenPose backbone network, and a corresponding training strategy is designed and implemented so that the two types of pruning can be combined. Experimental results show that the accuracy and speed of the pruned network fully meet the requirements of practical application.

Description

Teacher blackboard writing action identification method based on multi-granularity convolutional neural network pruning
Technical Field
The invention belongs to the field of deep learning, and more specifically to the technical field of deep neural network model compression; it relates to a teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning.
Background
In recent years, intelligent recording systems for high-quality classroom courses have used high-definition cameras to capture images of teachers and students. By analyzing the video images, such systems can switch between teaching scenes, automatically film the teacher and the students, and recognize their actions.
In such a recording system, the teacher in the input video must be detected and tracked, and the teacher's blackboard-writing actions must be recognized. The OpenPose algorithm is commonly chosen to detect the teacher in the image and to recognize blackboard-writing actions. OpenPose extracts human key points from the input image using a bottom-up strategy: it first locates the key points in the image and then assembles human skeletons from the learned relationships between key points. The teacher's position in the image can then be computed from the key point information, which solves the teacher detection problem.
Disclosure of Invention
Targeting the practical application of intelligent recording of high-quality classroom courses in smart education, the invention provides a teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning, in which a multi-granularity pruning algorithm compresses the network by combining filter-level pruning with connection-level pruning under an effective training strategy.
The flow of the teacher blackboard-writing action recognition method is shown in Fig. 1. First, a camera captures high-definition video images; a region of interest is extracted from each original image, and the result is fed into the recognition method.
Within the recognition method, the OpenPose key point extraction algorithm first extracts human key points from the image; the key point coordinates are then position-normalized; finally, the normalized coordinates are fed into a trained BP neural network for classification to obtain the output.
Each stage of the blackboard-writing action recognition method is explained in detail below:
(1) OpenPose performs feature extraction
First, an RGB image of size w × h is taken as input. The OpenPose backbone network computes basic features in a feedforward pass, from which a set of two-dimensional confidence maps S predicting human key points and a set of two-dimensional vector fields V representing the association between human key points are extracted. The set $S = (S_1, S_2, \dots, S_j, \dots, S_J)$, $S_j \in \mathbb{R}^{w \times h}$, contains J confidence maps, each the size of the w × h input image; each map corresponds to one type of human joint key point, and each response peak in a map indicates the presence of one key point. The set $V = (V_1, V_2, \dots, V_c, \dots, V_C)$, $V_c \in \mathbb{R}^{w \times h \times 2}$, contains C two-dimensional vector fields, one per limb, each encoding the direction in which one part of the limb points toward the other. Finally, the confidence maps and affinity fields are parsed with the Hungarian algorithm, and the key point information of all people in the image is output.
Fig. 2 shows the network framework of OpenPose, which consists of a base VGG19 network followed by two iterated branches. Branch one predicts the locations of the key points, and branch two predicts the affinity fields between limbs, commonly known as PAFs (part affinity fields). In the first stage, both branches take the feature map F output by VGG as input and produce the outputs $S^1 = \rho^1(F)$ and $V^1 = \phi^1(F)$,
where ρ(·) and φ(·) denote regression functions, specifically
[Formula image not reproduced]
where D is the convolution kernel (3 × 3 convolutions are used) and F is the input feature map. Each subsequent stage takes the outputs $S^{t-1}$ and $V^{t-1}$ of the previous stage together with the feature map F as input and produces the new outputs $S^t$ and $V^t$. Repeating this process t times finally yields the human key point confidence maps S and the affinity field matrix V representing the relationships between key points. Here t is the number of iterations, t ≥ 2, and iteration generally continues until $S^t$ converges, meaning that the values of $S^t$ no longer change. The computation is given by formulas (1) and (2).
$$S^t = \rho^t\left(F, S^{t-1}, V^{t-1}\right), \quad t \ge 2 \qquad (1)$$
$$V^t = \phi^t\left(F, S^{t-1}, V^{t-1}\right), \quad t \ge 2 \qquad (2)$$
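The stage cascade of formulas (1) and (2) can be sketched as follows. This is a minimal NumPy sketch, not the OpenPose implementation: each branch of each stage is a hypothetical stand-in (a random 1 × 1 channel-mixing layer followed by ReLU) instead of the real stacks of 3 × 3 convolutions, the number of stages is fixed rather than tested for convergence, and each vector field V_c is stored as two channels of a (2C, h, w) array. The names `make_stage` and `run_cascade` are illustrative.

```python
import numpy as np


def make_stage(in_ch, out_ch, rng):
    """Hypothetical stand-in for one branch of one stage: a random 1x1
    convolution (channel mixing at every pixel) followed by ReLU. It only
    mimics the input/output shapes of the real convolutional branch."""
    weight = rng.standard_normal((out_ch, in_ch)) * 0.1

    def stage(x):
        # x: (in_ch, h, w) -> (out_ch, h, w)
        return np.maximum(np.einsum("oc,chw->ohw", weight, x), 0.0)

    return stage


def run_cascade(F, num_stages, J, C, seed=0):
    """Stage 1: S^1 = rho^1(F), V^1 = phi^1(F).
    Stage t >= 2: S^t = rho^t(F, S^{t-1}, V^{t-1}), V^t = phi^t(F, S^{t-1}, V^{t-1})."""
    rng = np.random.default_rng(seed)
    c_f = F.shape[0]
    S = make_stage(c_f, J, rng)(F)        # J confidence maps
    V = make_stage(c_f, 2 * C, rng)(F)    # C vector fields, 2 channels each
    for _ in range(2, num_stages + 1):
        x = np.concatenate([F, S, V], axis=0)       # F, S^{t-1}, V^{t-1}
        rho_t = make_stage(x.shape[0], J, rng)
        phi_t = make_stage(x.shape[0], 2 * C, rng)
        S, V = rho_t(x), phi_t(x)                   # both read the previous outputs
    return S, V
```

For example, with an assumed 128-channel 46 × 46 feature map (a 368 × 368 input downsampled by 8) and COCO-style sizes J = 18, C = 19, `run_cascade(F, 3, 18, 19)` returns arrays of shape (18, 46, 46) and (38, 46, 46). Truncating refinement stages, as in step (1.1) below, corresponds to a smaller `num_stages`.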
In addition, OpenPose provides several output formats for human key points, including the BODY_25, COCO, Face and Hand models. A teacher can be considered to be writing on the blackboard when facing the blackboard with one hand raised. Requirements analysis of the recognition method shows that attention must be paid to the key points of the upper limbs and the head, while the lower limbs can be ignored. The invention therefore adopts the COCO output model. Moreover, because of occlusion by desks, camera angle and similar problems, only the upper body is considered when recognizing the teacher's writing movement, so the 12 key points numbered 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16 and 17 in the COCO model are activated, and their coordinates are used as the raw input data for the next stage. The COCO model is shown in Fig. 3.
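Selecting the activated key points is a simple indexing step. The sketch below assumes the pose estimator returns an (18, 2) array of (x, y) coordinates in COCO order (OpenPose also reports a per-point confidence score, assumed already dropped); `select_upper_body` is a hypothetical helper name. In OpenPose's 18-point COCO layout the chosen ids are the nose, neck, shoulders, elbows, wrists, eyes and ears.

```python
import numpy as np

# COCO key point ids kept by the method (upper body and head).
UPPER_BODY_IDS = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17]


def select_upper_body(keypoints):
    """Return the 12 activated (x, y) key points from an (18, 2) COCO array."""
    kp = np.asarray(keypoints, dtype=float)
    if kp.shape != (18, 2):
        raise ValueError("expected an (18, 2) array of COCO key points")
    return kp[UPPER_BODY_IDS]


# Example: dummy coordinates, flattened to the 24 values the BP network takes.
person = np.arange(36, dtype=float).reshape(18, 2)
features = select_upper_body(person).reshape(-1)
```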
(2) Coordinate normalization
Because the teacher's position in the image is not fixed, and blackboard-writing recognition depends on the relative positions of the person's key points rather than on the person's position in the image, the key point coordinates are position-normalized as shown in formula (3). In the formula, the coordinate origin $(A_0, B_0)$ is the neck key point, $(A_{max}, B_{max})$ and $(A_{min}, B_{min})$ are the maximum and minimum values in the sample data, and $(A_b, B_b)$ and $(A'_b, B'_b)$ are the key point coordinates before and after normalization, respectively.
[Formula (3): image not reproduced]
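Formula (3) is only available as an image, so the sketch below rests on an explicit assumption: each coordinate is shifted so that the neck becomes the origin and then divided by the range (max minus min) of that axis over the sample's key points, i.e. A'_b = (A_b - A_0) / (A_max - A_min) and likewise for B. `normalize_keypoints` is a hypothetical name; the neck is index 1 among the 12 selected points (COCO id 1).

```python
import numpy as np

NECK = 1  # position of the neck (COCO id 1) within the 12 selected key points


def normalize_keypoints(points, neck_index=NECK):
    """Hypothetical reading of formula (3): neck-centred, range-scaled coordinates."""
    pts = np.asarray(points, dtype=float)
    origin = pts[neck_index]                    # (A0, B0)
    span = pts.max(axis=0) - pts.min(axis=0)    # (Amax - Amin, Bmax - Bmin)
    span[span == 0] = 1.0                       # guard against division by zero
    return (pts - origin) / span
```

Because the neck is subtracted first, the result no longer depends on where the teacher stands in the frame, which is the property the text asks for.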
(3) BP neural network classification
The BP (back-propagation) neural network is an effective multilayer feedforward neural network with strong nonlinear classification ability and good robustness, and it is widely used for pattern recognition and classification tasks. The normalized key point coordinates obtained in step (2) are generally not linearly separable, but a fully connected layer of the BP network can first map such low-dimensional, linearly inseparable data into a higher-dimensional space in which it becomes linearly separable, so that it can be classified by a hyperplane. Taking linearly inseparable two-dimensional data as an example, Fig. 4 shows how a fully connected layer classifies it. In Fig. 4(a), the linearly inseparable two-dimensional data $[x_1, x_2]$ is mapped by formula (4) into three-dimensional data $[y_1, y_2, y_3]$, which a single hyperplane can then split into two classes, as shown in Fig. 4(b). The classified three-dimensional data $[y_1, y_2, y_3]$ is then mapped by formula (5) into two-dimensional data $[z_1, z_2]$, as shown in Figs. 4(c) and 4(d).
[Formula (4): image not reproduced]
[Formula (5): image not reproduced]
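The lifting idea of Fig. 4 can be illustrated with the classic XOR example. The map [x1, x2] -> [x1, x2, x1 * x2] and the separating plane below are illustrative choices, not the patent's formulas (4) and (5), which are only available as images; in a BP network the lifting map is learned by the fully connected layer rather than fixed by hand.

```python
import numpy as np

# XOR-style data: no straight line in 2D separates the two classes.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])


def lift(x):
    """Illustrative 2D -> 3D map [x1, x2] -> [x1, x2, x1 * x2]."""
    return np.column_stack([x[:, 0], x[:, 1], x[:, 0] * x[:, 1]])


# In the lifted space the plane y1 + y2 - 2*y3 = 0.5 separates the classes.
normal, offset = np.array([1.0, 1.0, -2.0]), 0.5
scores = lift(X) @ normal - offset
predicted = (scores > 0).astype(int)
```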
The normalized key point coordinates are therefore fed into the BP neural network, which is trained on them. The BP neural network is structured as follows: the input layer has 1 × 24 neurons, obtained by flattening the 12 two-dimensional key point coordinates into one-dimensional data; the hidden layer has 32 neurons; and the output layer has 2 neurons, representing the blackboard-writing and non-blackboard-writing states, which are distinguished by a Softmax classifier. The output of the Softmax function and the loss function L(·) are given by formulas (6) and (7), respectively. In formula (6), $y_q$ is the q-th component of the network output computed from the normalized key points of step (2), with one Softmax output per q; n is the number of output classes, and since teacher actions are divided into two classes, n = 2. $y'_q$ denotes the output value of the Softmax function. The loss function L(·) measures the difference between the model's prediction and the actual result; by iteratively training the neural network, the loss value is made progressively smaller, so the model's results become more accurate. Unless otherwise specified, L(·) below has the same meaning and is computed by formula (7).
$$y'_q = \frac{e^{y_q}}{\sum_{j=1}^{n} e^{y_j}} \qquad (6)$$
[Formula (7): image not reproduced]
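The 24-32-2 classifier can be sketched in NumPy. The layer sizes and the Softmax output follow the text; the sigmoid hidden activation, the small random initialization, the learning rate, and the use of cross-entropy as the loss of formula (7) are assumptions, since formula (7) is only available as an image. `BPNet` is a hypothetical name.

```python
import numpy as np


def softmax(y):
    """Formula (6), shifted by the row maximum for numerical stability."""
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class BPNet:
    """24-32-2 fully connected network trained by back-propagation."""

    def __init__(self, n_in=24, n_hidden=32, n_out=2, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.standard_normal((n_hidden, n_out)) * 0.1
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        self.h = 1.0 / (1.0 + np.exp(-(x @ self.W1 + self.b1)))  # sigmoid hidden layer
        self.p = softmax(self.h @ self.W2 + self.b2)               # class probabilities
        return self.p

    def train_step(self, x, target, lr=0.5):
        """One gradient step on the mean cross-entropy; target is one-hot (N, 2)."""
        p = self.forward(x)
        n = x.shape[0]
        loss = -np.sum(target * np.log(p + 1e-12)) / n
        d_out = (p - target) / n                            # gradient w.r.t. logits
        d_h = (d_out @ self.W2.T) * self.h * (1 - self.h)   # uses W2 before its update
        self.W2 -= lr * self.h.T @ d_out
        self.b2 -= lr * d_out.sum(axis=0)
        self.W1 -= lr * x.T @ d_h
        self.b1 -= lr * d_h.sum(axis=0)
        return loss
```

In use, each row of `x` would be the 24 normalized coordinates of one person, and `forward(x).argmax(axis=1)` gives the predicted state.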
Further, in the teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning, the OpenPose algorithm of step (1) accounts for most of the computation and therefore needs to be optimized. The optimization proceeds as follows:
(1.1) Truncating the refinement stages
Analysis of OpenPose shows that image feature extraction uses the first 4 convolution modules of the VGG19 network, after which two convolutional layers (Conv4-3, Conv4-4) reduce the dimensionality of the extracted feature map. The reduced feature map is fed into the two branches, which respectively regress the human key points and predict the part affinity vector fields representing the association between pairs of key points. The two branches share the same cascaded structure, consisting of one initial stage followed by a refinement stage repeated five times. Fig. 5 lists the computation of each stage of OpenPose and the mean Average Precision (mAP) on the COCO dataset for different numbers of stages, with the input image uniformly resized to 368 × 368. As Fig. 5 shows, with 2 refinement stages the accuracy of OpenPose already reaches 46.2%, while the remaining 3 stages do not significantly improve accuracy yet account for more than 40% of the computation. Only 2 refinement stages are therefore kept, which reduces the computation of the algorithm from 136.6 GFLOPs to 80.8 GFLOPs and significantly improves its efficiency.
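A quick arithmetic check of the figures above: truncating the last three refinement stages removes 136.6 - 80.8 = 55.8 GFLOPs, which is about 40.8% of the original computation and matches the "more than 40%" figure given in the text.

$$\frac{136.6 - 80.8}{136.6} = \frac{55.8}{136.6} \approx 0.408$$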
(1.2) model compression of VGG19 backbone network
The VGG19 network contains 16 convolutional layers (all with 3 × 3 kernels) and 3 fully connected layers; the model has 144 million parameters, about 574 MB. Fig. 6 shows the optimization steps, in which the multi-granularity convolutional neural network pruning framework proposed by the invention compresses the OpenPose backbone network. Specifically, the parameters of the initial stage and the refinement stages of the two OpenPose branches and of the BP neural network are first fixed; multi-granularity pruning is then applied to the VGG19 network, and the pruned network is retrained on a dataset to recover its performance. During pruning, the parameters of all other layers are kept fixed and only the first 10 layers of the network (up to conv4-2) are pruned; during retraining, the parameters of the remaining layers are likewise kept fixed and only the parameters of the first 10 layers (up to conv4-2) are updated. Controlling a single variable in this way ensures that, after pruning, the ability of the first 10 layers of VGG19 to extract image features is not reduced. Finally, the entire OpenPose algorithm is retrained on the COCO dataset to recover the accuracy lost through the changes to VGG19, and the pruned network model replaces the original one, completing the optimization of OpenPose.
Further, the multi-granularity convolutional neural network pruning algorithm of step (1.2) comprises the following steps:
(1.2.1) Filter level pruning
First, a number of images are randomly selected as an evaluation set. For each image, the mean of each filter's output feature map is computed and used as that filter's response to the image, which yields a response tensor for the image set. The degree of variation of this tensor is then measured with information entropy: the value range of the tensor elements is divided evenly into m bins, the number of elements falling into each bin is counted to obtain its occurrence probability $p_j$, and the information entropy is computed with formula (8).
$$H_{i,k} = -\sum_{j=1}^{m} p_j \log p_j \qquad (8)$$
where $H_{i,k}$ is the information entropy of the response tensor generated by the k-th filter of the i-th convolutional layer ($1 \le i \le N$, $1 \le k \le C_i$), j indexes the bins, and N and $C_i$ are the number of network layers and the number of channels of the i-th convolutional layer, respectively. After the entropies have been computed, the filters of the i-th convolutional layer are sorted in ascending order of entropy. Based on an assessment of the convolutional neural network to be compressed, the user can set an expected compression ratio $C_r$ ($0 \le C_r \le 1$), which can be understood intuitively as the proportion of filters the user expects to remain in the network after compression. The number of filters to be pruned from the corresponding layer is computed with formula (9).
$$n_i = C_i (1 - C_r) \qquad (9)$$
The first $n_i$ filters of the i-th layer in this sorted order (those with the lowest entropy) are deleted, and the corresponding two-dimensional convolution kernels in layer i + 1 are removed, completing the pruning.
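The entropy ranking and filter removal can be sketched in NumPy. This is a minimal sketch under stated assumptions: formula (8) uses the natural logarithm, the bins are taken over each filter's own response range, n_i is rounded down, and `filter_entropies` / `prune_filters` are hypothetical names. It physically removes filters, whereas the implementation described next achieves the same effect with a 0-1 mask.

```python
import numpy as np


def filter_entropies(responses, m=10):
    """responses: (num_images, num_filters); entry [e, k] is the mean of filter
    k's output feature map on image e. Each column's value range is split into
    m equal bins and H = -sum_j p_j * log(p_j) is returned per filter."""
    ent = np.empty(responses.shape[1])
    for k in range(responses.shape[1]):
        counts, _ = np.histogram(responses[:, k], bins=m)
        p = counts[counts > 0] / counts.sum()   # empty bins contribute nothing
        ent[k] = -np.sum(p * np.log(p))
    return ent


def prune_filters(W_layer, W_next, entropies, keep_ratio):
    """Drop the n = C * (1 - C_r) lowest-entropy filters (formula (9)).

    W_layer: (num_filters, in_ch, k, k) filters of the layer being pruned.
    W_next:  (out_ch, num_filters, k, k) filters of the following layer; the
             kernels that read the removed feature maps are deleted too.
    keep_ratio: the compression ratio C_r (fraction of filters kept).
    """
    num_filters = W_layer.shape[0]
    n_prune = int(num_filters * (1 - keep_ratio))     # rounded down
    order = np.argsort(entropies, kind="stable")      # ascending entropy
    keep = np.sort(order[n_prune:])
    return W_layer[keep], W_next[:, keep], keep


# Example: filters 0 and 3 give constant responses, hence zero entropy.
rng = np.random.default_rng(0)
responses = rng.standard_normal((50, 8))
responses[:, 0] = 1.0
responses[:, 3] = 2.0
W1 = rng.standard_normal((8, 3, 3, 3))
W2 = rng.standard_normal((16, 8, 3, 3))
W1p, W2p, kept = prune_filters(W1, W2, filter_entropies(responses), keep_ratio=0.75)
```

With C_r = 0.75 and 8 filters, two filters are removed, and they are exactly the two whose responses never vary across the evaluation images.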
In the implementation, a binary mask matrix T with exactly the same shape as the convolutional neural network model is maintained. T is a 0-1 matrix representing the pruning state: each of its elements corresponds to one parameter of the network model and is initialized to 1, and when a filter is pruned, all matrix elements corresponding to that filter are set to 0. Thus, for the filter bank $W_{i,k}$ with input feature map $F_i$, the convolution becomes formula (10).
$$F_{i+1,k} = f\left( (W_{i,k} \odot T_{i,k}) \circledast F_i \right) \qquad (10)$$
where f(·) is the activation function, $T_{i,k}$ is the mask matrix corresponding to $W_{i,k}$, $\circledast$ denotes the convolution operation, and $\odot$ denotes the Hadamard (element-wise) product.
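The masked convolution of formula (10) can be illustrated with a direct NumPy implementation. This is a minimal sketch assuming "valid" cross-correlation (the usual deep-learning convention), no bias, and a ReLU activation; `masked_conv` is a hypothetical name. Zeroing a filter's mask makes that output channel identically zero, which is how the mask represents a pruned filter without changing any tensor shapes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def relu(z):
    return np.maximum(z, 0.0)


def masked_conv(F, W, T, activation=relu):
    """Compute f((W * T) conv F) for one layer.

    F: (C_in, H, Wd) input feature maps.
    W, T: (C_out, C_in, k, k) weights and 0-1 mask of the same shape.
    """
    k = W.shape[-1]
    # patches[c, x, y, i, j] = F[c, x + i, y + j]
    patches = sliding_window_view(F, (k, k), axis=(1, 2))
    return activation(np.einsum("cxyij,ocij->oxy", patches, W * T))


rng = np.random.default_rng(0)
F = rng.standard_normal((3, 6, 6))
W = rng.standard_normal((4, 3, 3, 3))
T = np.ones_like(W)
T[2] = 0.0                 # prune filter 2 by clearing its mask
out = masked_conv(F, W, T)
```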
A convolutional neural network is a feedforward neural network whose basic unit is the neuron. Groups of neurons form two-dimensional arrays that extract basic image features (hereinafter called feature maps), multiple feature maps form a convolutional layer, adjacent convolutional layers pass information through connections between neurons, and neurons within the same layer are independent. Convolutional layers extract features from their input, and each consists of a number of filters trained by the back-propagation algorithm. Let $w_i$ and $h_i$ be the width and height of the input three-dimensional feature tensor $X_i \in \mathbb{R}^{C_i \times h_i \times w_i}$. After convolution, $X_i$ becomes the output feature tensor $X_{i+1} \in \mathbb{R}^{C_{i+1} \times h_{i+1} \times w_{i+1}}$, which serves as the input of the next convolutional layer. A convolutional layer applies $C_{i+1}$ filters to $C_i$ input channels; each filter produces one feature map and consists of $C_i$ two-dimensional kernels $K \in \mathbb{R}^{k \times k}$. The number of operations of the (i+1)-th convolutional layer is therefore $C_{i+1} C_i k^2 h_{i+1} w_{i+1}$. Pruning one filter of the i-th layer removes $C_i k^2 h_{i+1} w_{i+1}$ operations, and because the corresponding input feature map of the (i+1)-th layer is removed as well, a further $C_{i+2} k^2 h_{i+2} w_{i+2}$ operations are saved. Pruning m filters of the i-th layer therefore reduces the computation of both the i-th and the (i+1)-th layers by a fraction $m / C_{i+1}$.
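The operation counts above can be checked with a few lines of Python. The helper names are hypothetical. Layer A produces c_mid feature maps and layer B consumes them, so c_mid plays the role of C_{i+1}; pruning m of A's filters removes the same fraction m / c_mid of both layers' work.

```python
def conv_ops(c_out, c_in, k, h_out, w_out):
    """Multiply-accumulate count C_out * C_in * k^2 * h_out * w_out of one conv layer."""
    return c_out * c_in * k * k * h_out * w_out


def pruning_savings(c_in, c_mid, c_next, k, h1, w1, h2, w2, m):
    """Layer A maps c_in -> c_mid channels (output h1 x w1); layer B maps
    c_mid -> c_next channels (output h2 x w2). Pruning m of A's filters removes
    m*c_in*k^2*h1*w1 ops from A and m*c_next*k^2*h2*w2 ops from B.
    Returns the fraction of work saved in A and in B."""
    saved_a = m * c_in * k * k * h1 * w1
    saved_b = m * c_next * k * k * h2 * w2
    return (saved_a / conv_ops(c_mid, c_in, k, h1, w1),
            saved_b / conv_ops(c_next, c_mid, k, h2, w2))
```

For instance, pruning 32 of 128 filters in a 3 × 3 layer yields `pruning_savings(64, 128, 256, 3, 56, 56, 56, 56, 32) == (0.25, 0.25)`, i.e. 32/128 of each layer's operations.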
(1.2.2) Connection-level pruning
A dynamic pruning method is used, with thresholds $TH_A$ and $TH_B$ ($TH_B \ge TH_A \ge 0$) obtained from formula (11). Connections below the threshold $TH_A$ are pruned, and connections above the threshold $TH_B$ are restored. This recovery mechanism prevents the network from being irreparably damaged when important connections are mistakenly deleted during pruning.
[Formula (11): image not reproduced]
In formula (11), $W_{i,k}$ denotes the set of parameters of the k-th filter of layer i, mean(·) averages this set of parameters, and std(·) is its standard deviation. s takes the value -1, and $\Delta t = -2 \times s \times \mathrm{std}(W_{i,k})$. Connections are pruned and restored by clearing and setting the corresponding elements of the mask matrix. Let $W_{i,k}(p)$ be the p-th parameter of filter $W_{i,k}$ and $T_{i,k}(p)$ the corresponding element of the mask matrix; the update rule for each mask element is given by formula (12).
[Formula (12): image not reproduced]
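The mask update can be sketched as below. Because formulas (11) and (12) survive only as images, this sketch takes TH_A and TH_B as inputs instead of computing them, compares weight magnitudes |W| with the thresholds (an assumption), and, following the dynamic network surgery approach, leaves the mask unchanged for weights between the two thresholds (also an assumption). `update_mask` is a hypothetical name.

```python
import numpy as np


def update_mask(W, T, th_a, th_b):
    """Prune where |W| < TH_A, restore where |W| > TH_B, otherwise keep T."""
    if not 0 <= th_a <= th_b:
        raise ValueError("need 0 <= TH_A <= TH_B")
    magnitude = np.abs(W)
    T_new = T.copy()
    T_new[magnitude < th_a] = 0    # clear: connection is pruned
    T_new[magnitude > th_b] = 1    # set: connection is restored
    return T_new
```

The gap between the two thresholds acts as hysteresis: a weight hovering near a single cut-off would otherwise be pruned and restored on alternate iterations.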
When the network parameters are subsequently updated, stochastic gradient descent is used, as shown in formula (13).
[Formula (13): image not reproduced]
where I is the set of all filters in the deep network and L(·) is the loss function of the network during pruning; the partial derivative of the loss function in formula (13) is taken first,
[Formula image not reproduced]
and β is the learning rate for parameter updates ($0 < \beta \le 1$). To prevent the parameters from effectively ceasing to update because β is too small, the minimum value of β is set to $10^{-4}$, i.e. $0.0001 \le \beta \le 1$, and each convolutional layer requires more than 10,000 training iterations.
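One plausible reading of formula (13), sketched under an assumption: as in dynamic network surgery, the gradient of the loss with respect to the masked weights W ⊙ T is applied to every entry of W, including pruned ones, so a pruned weight keeps learning and can later exceed TH_B and be restored. The toy least-squares loss and all names are hypothetical; only the range of β comes from the text.

```python
import numpy as np


def sgd_step(W, grad_effective, beta):
    """Update every weight, pruned or not, with the gradient w.r.t. W * T."""
    if not 1e-4 <= beta <= 1:
        raise ValueError("beta must satisfy 0.0001 <= beta <= 1")
    return W - beta * grad_effective


def toy_grad(W, T, x, y):
    """Gradient of mean 0.5 * (x @ (W * T) - y)**2 w.r.t. the effective weights W * T."""
    residual = x @ (W * T) - y
    return x.T @ residual / len(y)


rng = np.random.default_rng(0)
x = rng.standard_normal((20, 3))
y = x @ np.array([1.0, 2.0, 0.5])     # the third weight actually matters...
W = np.array([1.0, 2.0, 0.0])
T = np.array([1.0, 1.0, 0.0])         # ...but it is currently pruned
W_new = sgd_step(W, toy_grad(W, T, x, y), beta=0.1)
```

Here the pruned third weight still receives a gradient and moves toward its useful value, which is what later allows the mask update to restore it.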
(1.2.3) Accuracy recovery training strategy: L1 and L2 regularization
The objective function to be minimized is given by formula (14).
$$\tilde{\omega} = \arg\min_{\omega} \sum_{e} L\left(f(x_e; \omega), y_e\right) + \lambda\, \Omega(\omega) \qquad (14)$$
In formula (14), ω denotes the parameters of the network model to be processed, and $\tilde{\omega}$ denotes the parameters obtained after regularization. λ is the regularization parameter; its value is specified separately in the descriptions of L1 and L2 regularization below. The first term of formula (14), $\sum_{e} L(f(x_e; \omega), y_e)$, represents the error between the network model's prediction $f(x_e; \omega)$ for the e-th sample and the training label $y_e$. The second term, Ω(ω), is a regularization function of the parameters ω. Many choices of Ω(ω) exist; the ones used here are L1 and L2 regularization. Specifically, L1 regularization is introduced in the recovery training after filter-level pruning, and L2 regularization in the recovery training after connection-level pruning.
L1 regularization
After filter-level pruning is completed, L1 regularization is introduced into the recovery training. The L1 regularization term for the parameters ω of the network model is given by formula (15).
$$\Omega(\omega) = \|\omega\|_1 = \sum_{i=1}^{n} |\omega_i| \qquad (15)$$
where ω contains a batch of n parameters; the L1 regularization term is the sum of their absolute values.
By the definition of regularization, the loss function with the L1 regularization term is given by formula (16).
[Formula (16): image not reproduced]
Differentiating the objective function with the L1 regularization term gives formula (17).
[Formula (17): image not reproduced]
With L1 regularization, ω is updated by gradient descent as shown in formula (18), where β is the learning rate for parameter updates as above ($0.0001 \le \beta \le 1$).
[Formula (18): image not reproduced]
In the gradient descent process of formula (18), the regularization parameter satisfies λ ≥ 0. Setting λ as [formula image not reproduced] drives some elements of ω to 0, so a sparse model is obtained and parameter overfitting is mitigated.
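A sketch of the L1-regularized update, assuming the penalized objective is J(ω) + λ Σ|ω_i| (formulas (16) to (18) are only images, so the exact scaling of the penalty is an assumption). The first function is the plain (sub)gradient step; on its own it tends to push small weights back and forth across 0 rather than leave them at exactly 0. The second function swaps in proximal soft-thresholding, a standard alternative that does produce exact zeros; it is not taken from the patent. Both names are hypothetical.

```python
import numpy as np


def l1_subgradient_step(w, grad_loss, beta, lam):
    """Step on J(w) + lam * sum(|w|); np.sign(0) = 0 serves as the subgradient at 0."""
    return w - beta * (grad_loss + lam * np.sign(w))


def l1_proximal_step(w, grad_loss, beta, lam):
    """Gradient step on J, then shrink every weight toward 0 by beta * lam,
    setting weights smaller than that exactly to 0."""
    z = w - beta * grad_loss
    return np.sign(z) * np.maximum(np.abs(z) - beta * lam, 0.0)


w = np.array([0.5, -0.003, 0.0])
no_data_grad = np.zeros_like(w)   # isolate the effect of the penalty
```

With β = λ = 0.1, the subgradient step turns -0.003 into +0.007 (it overshoots zero), while the proximal step sets it to exactly 0.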
L2 regularization
L2 regularization is introduced into the recovery training for connection-level pruning, as described below. The L2 regularization term for the parameters ω of the network model is given by formula (19), where n is the number of parameters in the batch contained in ω.
$$\Omega(\omega) = \|\omega\|_2^2 = \sum_{i=1}^{n} \omega_i^2 \qquad (19)$$
The L2 regularization term is the sum of the squares of these parameters.
By the definition of regularization, the loss function with the L2 regularization term is given by formula (20).
[Formula (20): image not reproduced]
where the two function symbols in formula (20) (images not reproduced) denote, respectively, the original function before regularization and the function obtained after L2 regularization.
Differentiating the objective function with L2 regularization gives formula (21).
[Formula (21): image not reproduced]
With L2 regularization, ω is updated by gradient descent as shown in formula (22), where n and β are defined as above and λ is the regularization parameter. First, the learning rate β is determined from the expected number of training iterations; 0.0001 is recommended. The value of λ is then tuned from coarse to fine: starting from an initial value of 1, λ is gradually increased or decreased, the parameters are learned on the training set, and the error is evaluated on the test set, seeking the λ that gives a smaller test error. This process is repeated until the test error is minimized. Concretely, the regularization parameter is first set to 1 and then increased by a factor of 10 at a time within the range of the validation set; if the test error is unchanged or increases after 2 to 3 attempts, λ is instead decreased by a factor of 10 at a time, and so on, until the order of magnitude that minimizes the test error is found. Then, within that order of magnitude, a further fine adjustment is made: starting from 0 at the lowest digit, the value of the lowest digit is incremented by 1 each time until the value that minimizes the test error is found.
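The L2 update and the coarse-to-fine search for λ can be sketched as follows. Assumptions: the penalty is λ Σ ω_i² (so its gradient is 2λω; formula (22) itself is only an image), `val_error` is a caller-supplied function returning the validation error for a given λ, a search direction is abandoned after the first non-improving step (the text allows 2 to 3 attempts), and the fine stage is read as trying d × 10^k for d = 1, ..., 9 at the best order of magnitude 10^k. All names are hypothetical.

```python
import math


def l2_step(w, grad_loss, beta, lam):
    """Gradient step on J(w) + lam * sum(w**2): plain weight decay."""
    return w - beta * (grad_loss + 2.0 * lam * w)


def coarse_to_fine_lambda(val_error, start_exp=0, max_steps=10):
    """Coarse search over powers of ten from 10**start_exp (= 1): try larger
    values first, and smaller ones if that does not help. Then fine search
    over d * 10**k, d = 1..9, at the best exponent k."""
    best_k, best_err = start_exp, val_error(10.0 ** start_exp)
    for step in (1, -1):                 # increase first, then decrease
        k, improved = best_k, False
        for _ in range(max_steps):
            k += step
            err = val_error(10.0 ** k)
            if err < best_err:
                best_k, best_err, improved = k, err, True
            else:
                break
        if improved:
            break
    best_lam = 10.0 ** best_k
    for d in range(1, 10):               # fine search within the best magnitude
        lam = d * 10.0 ** best_k
        err = val_error(lam)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam


def synthetic_error(lam):
    """Stand-in validation error with its minimum at lambda = 0.03."""
    return (math.log10(lam) - math.log10(0.03)) ** 2


best = coarse_to_fine_lambda(synthetic_error)
```

On the synthetic curve, the coarse stage settles on 10^-2 and the fine stage then picks 3 × 10^-2. In practice each `val_error` call is a full training and validation run, which is why a coarse pass over orders of magnitude comes first.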
Drawings
FIG. 1 is a flow chart of the teacher blackboard-writing action recognition method;
FIG. 2 shows the network framework of OpenPose;
FIG. 3 shows the COCO key point model;
FIG. 4 illustrates how a fully connected layer classifies linearly inseparable two-dimensional data;
FIG. 5 is a table comparing the computation and accuracy of different OpenPose stages;
FIG. 6 shows the backbone network optimization process for OpenPose;
FIG. 7 shows the server configuration;
FIG. 8 compares the performance of the initial and optimized OpenPose algorithms;
FIG. 9 shows the change in computation of each VGG19 network layer during network pruning.
Detailed Description
In order to make the objects, technical solutions and features of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
In this work, the BP neural network is trained and the pruned OpenPose backbone network is retrained on a server; Fig. 7 shows the server configuration. The trained models are then ported to NVIDIA Jetson TX2, an embedded GPU platform designed for deep learning, for testing.
Fig. 8 shows, for the initial algorithm and for the optimized algorithm, the mAP of OpenPose on the MS COCO dataset, the computation of the teacher recognition method, its single-frame computation time, and its accuracy on the teacher posture validation set. As Fig. 8 shows, after the last three refinement stages of the OpenPose model are truncated, its accuracy on the COCO key point validation set drops by 2.4%, but the algorithm runs significantly faster: the single-frame processing time on TX2 falls from 251.7 ms to 150.1 ms. At this point the accuracy of the teacher blackboard-writing recognition method is 98.1%, a drop of only 0.5%. This is because the teacher is always standing, which is an ideal input for OpenPose key point extraction, so the drop in OpenPose accuracy has little effect on the blackboard-writing recognition method.
The details and results of pruning VGG19 are shown in Fig. 9. The two pruning stages reduce computation by 18.3 GFLOPs and 6.0 GFLOPs, respectively. After pruning, the features extracted by each convolution kernel of the backbone network change slightly, so the parameters of the original OpenPose algorithm are less well adapted to the backbone, and the accuracy of OpenPose on the COCO validation set drops by 43.7%. The network therefore needs to be retrained to recover its accuracy: with the hyperparameters base_size = 10 and base_lr = $10^{-5}$, the network is trained for 60K iterations, and its accuracy finally recovers to 46.0%. Network pruning thus costs only 0.1% in accuracy while reducing computation by 30%, fully meeting the requirements of practical application.
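The 30% figure can be checked the same way, taking the 80.8 GFLOPs model left after stage truncation as the baseline: the two pruning stages together remove 18.3 + 6.0 = 24.3 GFLOPs.

$$\frac{18.3 + 6.0}{80.8} = \frac{24.3}{80.8} \approx 0.301$$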

Claims (2)

1. A teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning is characterized by comprising the following steps:
(1) OpenPose performs feature extraction
first, an RGB image of size w × h is taken as input; the OpenPose backbone network then computes basic features in a feedforward pass, and a set of two-dimensional confidence maps S predicting human key points and a set of two-dimensional vector fields V representing the association between human key points are extracted; the set $S = (S_1, S_2, \dots, S_j, \dots, S_J)$, $S_j \in \mathbb{R}^{w \times h}$, contains J confidence maps, each the size of the w × h input image, each representing one type of human joint key point, with each response peak in a map indicating the presence of one key point; the set $V = (V_1, V_2, \dots, V_c, \dots, V_C)$, $V_c \in \mathbb{R}^{w \times h \times 2}$, contains C two-dimensional vector fields, one per limb, each encoding the direction in which one part of the limb points toward the other; finally, the confidence maps and affinity fields are parsed with the Hungarian algorithm, and the key point information of all people in the image is output;
OpenPose consists of a basic VGG19 network and two iterative branches: the first branch predicts the positions of the keypoints, and the second predicts the affinity fields between limbs, commonly called PAFs. In the first stage, the two branches take the feature map F output by VGG as input and produce the outputs $S^1 = \rho^1(F)$ and $V^1 = \varphi^1(F)$,
where ρ() and φ() denote regression functions, specifically:
[Formula image not reproduced in the source text]
where D is a 3×3 convolution kernel and F is the input feature map. Each subsequent stage takes the outputs $S^{t-1}$ and $V^{t-1}$ of the previous stage together with the feature map F as input and produces the new outputs $S^t$ and $V^t$. Repeating this process t times finally yields the human keypoint confidence maps S and the affinity field matrix V representing the relations between keypoints, where t is the number of iterations and t ≥ 2; the iteration continues until $S^t$ converges, i.e. until the values of $S^t$ no longer change. The computation is given by formulas (1) and (2);
$S^t = \rho^t(F, S^{t-1}, V^{t-1}), \quad t \ge 2$  (1)
$V^t = \varphi^t(F, S^{t-1}, V^{t-1}), \quad t \ge 2$  (2)
The output of the OpenPose algorithm uses the COCO output model. Of the 18 keypoints in the COCO model, the 12 numbered 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16 and 17 are activated, and the coordinates of the activated keypoints are taken as the raw input data of the next stage;
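The keypoint selection above can be illustrated with a minimal Python sketch. It assumes each detected person arrives as a list of 18 (x, y) tuples in the OpenPose COCO order (0 nose, 1 neck, 2 to 4 right shoulder/elbow/wrist, 5 to 7 left shoulder/elbow/wrist, 14 to 17 eyes and ears); the input format and the name `select_keypoints` are hypothetical. Flattening the 12 kept points gives the 24 values consumed by the 1 × 24 input layer described in step (3).

```python
# Keypoint indices kept by the method, in the OpenPose COCO 18-keypoint layout:
# 0 nose, 1 neck, 2-4 right arm, 5-7 left arm, 14-17 eyes and ears.
ACTIVE_IDS = (0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17)


def select_keypoints(pose):
    """Keep the 12 active keypoints of one person and flatten them.

    `pose` is a list of 18 (x, y) tuples in COCO order (hypothetical format).
    Returns [x0, y0, x1, y1, ...] with 24 floats.
    """
    if len(pose) != 18:
        raise ValueError("expected 18 COCO keypoints")
    flat = []
    for idx in ACTIVE_IDS:
        x, y = pose[idx]
        flat.extend((float(x), float(y)))
    return flat
```

For example, `select_keypoints([(i, 10 * i) for i in range(18)])` starts with `[0.0, 0.0, 1.0, 10.0, ...]`.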
(2) Coordinate normalization
The obtained keypoint coordinates are position-normalized by the method shown in formula (3). In the formula, the coordinate origin $(A_0, B_0)$ is the neck keypoint of the human body, $(A_{max}, B_{max})$ and $(A_{min}, B_{min})$ are the maximum and minimum values in the sample data, and $(A_b, B_b)$ and $(A_b', B_b')$ are the keypoint coordinates before and after normalization, respectively;
[Formula (3) image not reproduced in the source text]
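Because formula (3) itself is not reproduced, the following Python sketch only illustrates one plausible reading of the description: coordinates are shifted so that the neck keypoint $(A_0, B_0)$ becomes the origin and are then divided by the range $A_{max} - A_{min}$ (respectively $B_{max} - B_{min}$) of the sample. This exact form, taking the maximum and minimum over a single sample's keypoints, and the name `normalize_keypoints` are all assumptions.

```python
def normalize_keypoints(points, neck):
    """Hypothetical form of formula (3): shift to the neck origin, scale by range.

    ASSUMPTION: A' = (A - A0) / (Amax - Amin) and B' = (B - B0) / (Bmax - Bmin),
    with max/min taken over this sample's keypoints.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Guard against a zero range (e.g. all points on one vertical line).
    x_range = (max(xs) - min(xs)) or 1.0
    y_range = (max(ys) - min(ys)) or 1.0
    a0, b0 = neck
    return [((a - a0) / x_range, (b - b0) / y_range) for a, b in points]
```

Anchoring at the neck makes the features independent of where the teacher stands in the frame, and dividing by the range makes them independent of the person's apparent size.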
(3) BP neural network classification
The normalized keypoints obtained in step (2) are input into a BP neural network, which is then trained. The BP neural network is structured as follows: the input layer has 1 × 24 neurons, the 24 being obtained by flattening the 12 two-dimensional keypoint coordinates into one-dimensional data; the hidden layer has 32 neurons; and the output layer has 2 neurons, representing the blackboard-writing state and the non-blackboard-writing state respectively, which are distinguished by a Softmax classifier. The output of the Softmax function and the loss function L() are given by formula (4) and formula (5), respectively. In formula (4), $y_q$ is the value of the q-th output category computed from the normalized keypoint vector obtained in step (2), with one Softmax output for each q; n is the number of output categories, and since teacher actions are divided into two categories, n = 2. $y_q'$ is the output value of the Softmax function; for convenience, $y_q'$ is used hereinafter to denote the Softmax output. Unless otherwise specified, the loss function L() below has the same meaning and is computed by formula (5);
$y_q' = \dfrac{e^{y_q}}{\sum_{j=1}^{n} e^{y_j}}, \quad q = 1, \ldots, n$  (4)
[Formula (5) image not reproduced in the source text]
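A minimal pure-Python sketch of the 24-32-2 BP network's forward pass follows. The layer sizes and the Softmax output come from the text; the sigmoid hidden activation, the random initialization and the use of cross-entropy as loss (5) are assumptions, since formula (5) is not reproduced. Back-propagation training is omitted, and which output index stands for the blackboard-writing state is a hypothetical convention.

```python
import math
import random


def softmax(z):
    """Formula (4)-style Softmax; subtracting max(z) avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """ASSUMED form of loss (5): negative log-probability of the true class."""
    return -math.log(probs[label])


class BPNet:
    """24-32-2 fully connected network (layer sizes from the text).

    ASSUMPTIONS: sigmoid hidden activation, uniform random initialization.
    """

    def __init__(self, n_in=24, n_hidden=32, n_out=2, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        # Hidden layer: weighted sum plus bias, then sigmoid.
        hidden = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
                  for row, b in zip(self.w1, self.b1)]
        # Output layer: one logit per class, turned into probabilities.
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        return softmax(logits)
```

Subtracting the largest logit before exponentiating leaves the Softmax result unchanged but prevents overflow for large inputs.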
2. The teacher blackboard-writing action recognition method based on multi-granularity convolutional neural network pruning as claimed in claim 1, wherein the OpenPose algorithm in step (1) is specifically as follows:
(2.1) Truncating the refinement stage
Analysis of OpenPose shows that the algorithm uses the first 4 convolution modules of the VGG19 network for image feature extraction; after feature extraction, two convolutional layers, Conv4-3 and Conv4-4, reduce the dimensionality of the feature map. The dimension-reduced feature maps are fed into two branches, which respectively regress the human keypoints and predict the part affinity vector fields representing the degree of association between two keypoints. The two branches have the same cascaded network structure, consisting of an initial stage and a refinement stage repeated 2 times;
(2.2) Model compression of the VGG19 backbone network
The VGG19 network comprises 16 convolutional layers and 3 fully connected layers, and all of its convolution kernels are 3×3. The proposed multi-granularity convolutional neural network pruning framework is used to compress the OpenPose backbone network. First, the parameters of the two iterative branches of OpenPose (the initial stage and the refinement stage) and of the BP neural network are fixed, and multi-granularity convolutional neural network pruning is applied to the VGG19 network. During pruning, the parameters of the other layers of the network are kept fixed and only the first 10 layers are pruned; during retraining, the parameters of the other layers are likewise kept fixed and only the parameters of the first 10 layers are updated. Finally, the whole OpenPose algorithm is retrained on the COCO data set to recover the accuracy lost through the change to the VGG19 network. Replacing the original network model with the pruned one completes the optimization of the OpenPose algorithm; apart from the pruning, no other part of the VGG19 network is changed;
The pruning process is as follows:
(3.1) Filter-level pruning is performed layer by layer on the convolutional layers of the input network model, specifically:
First, a number of images are randomly selected as an evaluation set. For each input image, the mean of each filter's output feature map is computed and used as that filter's response to the image, which yields the response tensor of the image set. The information entropy is then used to measure how much this tensor varies: the value range of the tensor elements is divided into m equal bins (m = 10 is recommended), the number of elements falling into each bin is counted, the occurrence probability $p_j$ of each bin is computed, and the information entropy is calculated by formula (6);
$H_{i,k} = -\sum_{j=1}^{m} p_j \log p_j, \quad i = 1, \ldots, N, \; k = 1, \ldots, C_i$  (6)
where $H_{i,k}$ is the information entropy of the tensor generated by the k-th filter of the i-th layer, j indexes the current bin, and N and $C_i$ are the number of network layers and the number of channels in the i-th convolutional layer, respectively. After the entropies are computed, the filters in the i-th convolutional layer are sorted in ascending order of information entropy. Based on an assessment of the convolutional neural network to be compressed, the user can set an expected compression ratio $C_r$, which can be understood intuitively as the fraction of filters the user expects to remain in the network after compression; $C_r$ lies between 0 and 1, and 0.5 is recommended. The number of filters to be pruned from the corresponding layer is calculated by formula (7);
$R_i = C_i (1 - C_r)$  (7)
After sorting the i-th layer, the first $R_i$ filters (those with the lowest information entropy) are deleted, and the corresponding two-dimensional convolution kernels in layer i+1 are removed, which completes the pruning;
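The entropy criterion of formulas (6) and (7) can be sketched in plain Python as follows. Each filter is represented by its list of mean responses over the evaluation images; the helper names, the natural logarithm and rounding $R_i$ down are assumptions not fixed by the text.

```python
import math


def response_entropy(responses, m=10):
    """Entropy (formula (6)) of one filter's responses over the evaluation set.

    The value range is split into m equal bins; p_j is the fraction of
    responses in bin j. Natural log is used (the log base is an assumption).
    """
    lo, hi = min(responses), max(responses)
    if hi == lo:
        return 0.0  # constant response: no variation, zero entropy
    counts = [0] * m
    width = (hi - lo) / m
    for r in responses:
        j = min(int((r - lo) / width), m - 1)  # the maximum falls into the last bin
        counts[j] += 1
    n = len(responses)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def filters_to_prune(layer_responses, c_r=0.5, m=10):
    """Indices of the R_i = C_i * (1 - C_r) lowest-entropy filters (formula (7)).

    `layer_responses[k]` holds filter k's mean activation for each image.
    Rounding R_i down with int() is an assumption.
    """
    c_i = len(layer_responses)
    r_i = int(c_i * (1 - c_r))
    entropies = [response_entropy(resp, m) for resp in layer_responses]
    order = sorted(range(c_i), key=lambda k: entropies[k])  # ascending entropy
    return sorted(order[:r_i])
```

A filter whose response barely changes across images carries little information about the input, which is why the lowest-entropy filters are removed first; ties are broken by filter index because Python's `sorted` is stable.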
In implementation, this process requires a binary mask matrix T with exactly the same dimensions as the convolutional neural network model. T is a 0-1 matrix representing the pruning state: each element of T corresponds to one parameter of the network model, every element is initialized to 1, and when a filter is pruned, all matrix elements corresponding to that filter are set to 0. Thus, for the filter bank $W_{i,k}$ with input feature map $F_i$, the convolution operation becomes formula (8);
$F_{i+1,k} = f\left( (W_{i,k} \odot T_{i,k}) \circledast F_i \right)$  (8)
where $F_{i+1,k}$ is the resulting output feature map, f() denotes the activation function, $T_{i,k}$ is the mask matrix corresponding to $W_{i,k}$, $\circledast$ denotes the convolution operation, and $\odot$ denotes the Hadamard product;
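Formula (8) amounts to multiplying each filter element-wise by its mask before convolving. The sketch below shows this for a single-channel filter in plain Python; the single input channel, the ReLU chosen as f and the function names are assumptions for illustration.

```python
def conv2d_valid(x, w):
    """'Valid' 2-D cross-correlation, which is what CNN layers call convolution."""
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(w[a][b] * x[r + a][c + b] for a in range(kh) for b in range(kw))
             for c in range(out_w)] for r in range(out_h)]


def masked_filter_output(x, w, t, f=lambda v: max(v, 0.0)):
    """Sketch of formula (8): f((W ⊙ T) ⊛ F) for one single-channel filter.

    ASSUMPTIONS: one input channel, no bias, ReLU as the activation f.
    """
    # Hadamard product of the filter weights with the 0-1 mask.
    w_masked = [[wij * tij for wij, tij in zip(wr, tr)] for wr, tr in zip(w, t)]
    conv = conv2d_valid(x, w_masked)
    return [[f(v) for v in row] for row in conv]
```

With ReLU and no bias, a filter whose mask is all zeros produces an all-zero output, which is why a masked filter can later be physically removed without changing the network's output.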
A convolutional neural network is a feedforward neural network whose basic unit is the neuron. A number of neurons form a two-dimensional array that extracts basic image features, referred to below as a feature matrix, and a number of such two-dimensional arrays form a convolutional layer. Two adjacent convolutional layers are connected through neurons that pass information between them, while neurons within the same convolutional layer are independent of one another. Convolutional layers extract features from their input, and each convolutional layer consists of a number of filter banks trained by the back-propagation algorithm. Let $w_i$ and $h_i$ denote the width and height of the input three-dimensional feature tensor $X_i \in \mathbb{R}^{C_i \times h_i \times w_i}$; after the convolution, $X_i$ becomes the output feature tensor $X_{i+1} \in \mathbb{R}^{C_{i+1} \times h_{i+1} \times w_{i+1}}$, which in turn serves as the input of the next convolutional layer. The convolutional layer applies $C_{i+1}$ filters to the $C_i$ input channels, each filter generating one feature map, and each filter is composed of $C_i$ convolution kernels $K \in \mathbb{R}^{k \times k}$. The number of operations of this convolutional layer is therefore $C_{i+1} C_i k^2 h_{i+1} w_{i+1}$. Pruning one filter of the i-th layer removes $C_i k^2 h_{i+1} w_{i+1}$ operations, and because the corresponding input feature map of the (i+1)-th layer is removed as well, a further $C_{i+2} k^2 h_{i+2} w_{i+2}$ operations are saved. Pruning M filters of the i-th layer (M taking the same value as above) therefore reduces the computation of both the i-th and the (i+1)-th layers by the fraction $M / C_{i+1}$;
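The operation counts above can be checked with a short Python calculation. The layer sizes used in the example (64 → 128 → 128 channels, 3 × 3 kernels, 56 × 56 output maps) are hypothetical and chosen only to make the $M / C_{i+1}$ ratio visible.

```python
def conv_ops(c_in, c_out, k, h_out, w_out):
    """Operations of one conv layer: C_out * C_in * k^2 * h_out * w_out."""
    return c_out * c_in * k * k * h_out * w_out


def ops_saved(m, c_i, c_i2, k, h1, w1, h2, w2):
    """Operations removed by pruning m filters of layer i.

    Layer i maps C_i channels to C_{i+1} channels (output h1 x w1);
    layer i+1 maps C_{i+1} to C_{i+2} channels (output h2 x w2).
    """
    own = m * c_i * k * k * h1 * w1    # the m pruned filters of layer i
    nxt = m * c_i2 * k * k * h2 * w2   # their kernels in every filter of layer i+1
    return own + nxt


# Hypothetical sizes: C_i = 64, C_{i+1} = 128, C_{i+2} = 128, k = 3, 56 x 56 maps.
layer_i = conv_ops(64, 128, 3, 56, 56)
layer_i1 = conv_ops(128, 128, 3, 56, 56)
saved = ops_saved(32, 64, 128, 3, 56, 56, 56, 56)
print(saved / (layer_i + layer_i1))  # M / C_{i+1} = 32 / 128 = 0.25
```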
(3.2) Connection-level pruning is performed on the network model obtained after filter-level pruning, specifically:
A dynamic pruning method is adopted, with thresholds $TH_A$ and $TH_B$ set by formula (9), where $TH_B \ge TH_A \ge 0$. Connections whose absolute value is below the threshold $TH_A$ are pruned, and connections above the threshold $TH_B$ are restored; this recoverable mechanism solves the problem that important connections deleted by mistake during pruning could otherwise never be recovered;
[Formula (9) image not reproduced in the source text]
In formula (9), $W_{i,k}$ denotes the group of parameters of the k-th filter in the i-th layer, mean() denotes the mean of this group of parameters, and std() denotes their standard deviation; s takes the value −1, and Δt takes the value $-2 \times s \times \mathrm{std}(W_{i,k})$. Pruning and restoring connections are realized by clearing and setting the corresponding elements of the mask matrix. Let $W_{i,k}(p)$ be the p-th parameter of the filter $W_{i,k}$ and $T_{i,k}(p)$ the corresponding element of the mask matrix; the update strategy for each element of the mask matrix is given by formula (10);
$T_{i,k}(p) = \begin{cases} 0, & |W_{i,k}(p)| < TH_A \\ T_{i,k}(p), & TH_A \le |W_{i,k}(p)| \le TH_B \\ 1, & |W_{i,k}(p)| > TH_B \end{cases}$  (10)
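The recoverable mask update of formula (10) can be sketched as below. The thresholds are passed in rather than computed, because formula (9) is not reproduced; comparing absolute weight values against $TH_A$ and $TH_B$, and the function name `update_mask`, are assumptions.

```python
def update_mask(weights, mask, th_a, th_b):
    """Sketch of formula (10): hysteresis update of one filter's mask.

    |w| < TH_A -> 0 (prune); |w| > TH_B -> 1 (restore); otherwise keep the
    current mask value. Comparing absolute values is an assumption.
    """
    if not 0 <= th_a <= th_b:
        raise ValueError("require TH_B >= TH_A >= 0")
    new_mask = []
    for w, t in zip(weights, mask):
        if abs(w) < th_a:
            new_mask.append(0)
        elif abs(w) > th_b:
            new_mask.append(1)
        else:
            new_mask.append(t)
    return new_mask
```

Weights whose magnitude lies between the two thresholds keep their previous state, so a connection is not toggled back and forth by small fluctuations.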
When the network parameters are updated, the stochastic gradient descent update strategy shown in formula (11) is adopted;
$W_{i,k} \leftarrow W_{i,k} - \beta \dfrac{\partial L(W_{i,k} \odot T_{i,k})}{\partial (W_{i,k} \odot T_{i,k})}, \quad \forall (i,k) \in I$  (11)
where I denotes the set of all filters in the deep network and L() denotes the loss function of the network during pruning; formula (11) first takes the partial derivative of the loss function with respect to the masked parameters $W_{i,k} \odot T_{i,k}$. β is the learning rate of the parameter update; to avoid the parameters ceasing to update because β is too small, β is taken in the range 0.0001 ≤ β ≤ 1;
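One parameter update in the spirit of formula (11) is sketched below in plain Python. The gradient is evaluated at the masked weights $W \odot T$ but applied to every weight, including pruned ones, so that a pruned weight can grow past $TH_B$ and be restored by the mask update. This reading of formula (11), the quadratic toy loss and the name `sgd_step` are assumptions.

```python
def sgd_step(weights, mask, grad_fn, beta=0.01):
    """Sketch of formula (11): W <- W - beta * dL/d(W ⊙ T).

    `grad_fn` is a hypothetical callable returning dL/dv at the masked
    weight vector v. The update is applied to all weights, masked or not.
    """
    if not 0.0001 <= beta <= 1:
        raise ValueError("the text recommends 0.0001 <= beta <= 1")
    masked = [w * t for w, t in zip(weights, mask)]
    grads = grad_fn(masked)
    return [w - beta * g for w, g in zip(weights, grads)]


target = [1.0, 1.0]


def toy_grad(v):
    # Gradient of the toy loss sum((v_p - target_p)^2).
    return [2.0 * (vi - ti) for vi, ti in zip(v, target)]


new_w = sgd_step([0.5, 0.05], [1, 0], toy_grad, beta=0.1)
```

In the example, the pruned second weight still moves toward its target (from 0.05 to 0.25) even though its mask is 0.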
(3.3) Accuracy-recovery training strategy: use of L1 and L2 regularization
The objective function shown in formula (12) is minimized;
$\omega^{*} = \arg\min_{\omega} \sum_{e} L\big(f(x_e; \omega), y_e\big) + \lambda\, \Omega(\omega)$  (12)
In formula (12), ω denotes the parameters of the network model to be processed and $\omega^{*}$ the parameters obtained after regularization; λ is the regularization-term parameter, whose value is defined below in the introductions to L1 and L2 regularization, respectively. The first term of formula (12), $\sum_{e} L\big(f(x_e; \omega), y_e\big)$, represents the error between the network model's prediction $f(x_e; \omega)$ for the e-th sample and the training label $y_e$; the second term Ω(ω) of formula (12) is a regularization function of the parameters ω. Ω(ω) can be chosen in many ways; L1 regularization and L2 regularization are introduced here. The method uses both: L1 regularization is introduced in the recovery training after filter-level pruning is completed, and L2 regularization is introduced in the recovery training of the connection-level pruning method;
L1 regularization
After filter-level pruning is completed, L1 regularization is introduced into the recovery training. The L1 regularization term of the parameters ω of the network model to be processed is given by formula (13);
$\Omega(\omega) = \lVert \omega \rVert_1 = \sum_{i=1}^{n_{L1}} |\omega_i|$  (13)
where $n_{L1}$ is the number of batch parameters contained in ω; the L1 regularization term is computed by summing the absolute values of the parameters;
According to the definition of regularization, the loss function with the L1 regularization term is given by formula (14);
$\tilde{L}(\omega) = L(\omega) + \lambda \lVert \omega \rVert_1$  (14)
where L(ω) is the original loss before regularization and $\tilde{L}(\omega)$ the loss with the L1 term. Differentiating the objective function with the L1 regularization term gives formula (15);
$\nabla_{\omega} \tilde{L}(\omega) = \nabla_{\omega} L(\omega) + \lambda\, \mathrm{sign}(\omega)$  (15)
With L1 regularization, when ω is updated by gradient descent, the update is given by formula (16), where β is the learning rate of the parameter update and 0.0001 ≤ β ≤ 1;
$\omega \leftarrow \omega - \beta \left( \nabla_{\omega} L(\omega) + \lambda\, \mathrm{sign}(\omega) \right)$  (16)
In the gradient descent process of formula (16), the regularization-term parameter satisfies λ ≥ 0;
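A plain-Python sketch of the L1 term (13) and the update (16) follows; the names are hypothetical, and using sign(0) = 0 as the subgradient at zero is an assumption.

```python
def sign(v):
    """Return 1, 0 or -1; sign(0) = 0 is the subgradient choice at zero."""
    return (v > 0) - (v < 0)


def l1_penalty(omega):
    """Formula (13): sum of absolute parameter values."""
    return sum(abs(w) for w in omega)


def l1_step(omega, grad_loss, beta, lam):
    """Sketch of formula (16): omega <- omega - beta * (dL/domega + lam * sign(omega))."""
    if lam < 0:
        raise ValueError("lambda must be >= 0")
    return [w - beta * (g + lam * sign(w)) for w, g in zip(omega, grad_loss)]
```

Apart from the data gradient, every nonzero weight moves toward zero by the same amount βλ per step regardless of its size, which is what makes L1 regularization favor sparse weights.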
L2 regularization
L2 regularization is introduced into the recovery training of connection-level pruning; the L2 regularization process is described below. The L2 regularization term of the parameters ω of the network model to be processed is given by formula (17), where $n_{L2}$ is the number of batch parameters contained in ω;
$\Omega(\omega) = \lVert \omega \rVert_2^2 = \sum_{i=1}^{n_{L2}} \omega_i^2$  (17)
the L2 regularization term is computed by summing the squares of the parameters;
According to the definition of regularization, the loss function with the L2 regularization term is given by formula (18);
$\tilde{L}(\omega) = L(\omega) + \lambda \lVert \omega \rVert_2^2$  (18)
where L(ω) denotes the original function before regularization and $\tilde{L}(\omega)$ denotes the function obtained after L2 regularization;
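Differentiating the objective function with L2 regularization gives formula (19);
$\nabla_{\omega} \tilde{L}(\omega) = \nabla_{\omega} L(\omega) + 2\lambda\, \omega$  (19)
$\omega \leftarrow \omega - \beta \left( \nabla_{\omega} L(\omega) + 2\lambda\, \omega \right) = (1 - 2\beta\lambda)\,\omega - \beta\, \nabla_{\omega} L(\omega)$  (20)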
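The corresponding L2 sketch below implements the sum-of-squares term (17) and the resulting gradient-descent update (20) in plain Python; the factor 2 follows from the penalty being the plain sum of squares, and the function names are hypothetical.

```python
def l2_penalty(omega):
    """Formula (17): sum of squared parameters."""
    return sum(w * w for w in omega)


def l2_step(omega, grad_loss, beta, lam):
    """Sketch of formula (20): omega <- (1 - 2*beta*lam) * omega - beta * dL/domega.

    The factor 2 comes from differentiating lam * sum(w^2).
    """
    if lam < 0:
        raise ValueError("lambda must be >= 0")
    decay = 1.0 - 2.0 * beta * lam
    return [decay * w - beta * g for w, g in zip(omega, grad_loss)]
```

Unlike the L1 step, each weight shrinks in proportion to its own value (by 10% per step with β = 0.1 and λ = 0.5), so weights become small but rarely exactly zero.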
With L2 regularization, when ω is updated by gradient descent, the update is given by formula (20), where $n_{L2}$ and β are as defined above and λ is the regularization-term parameter. First, the learning rate β is determined according to the expected number of training iterations; 0.0001 is recommended. The value of λ is then tuned from coarse to fine: starting from an initial value of 1, λ is gradually increased or decreased, the parameters are learned on the training set, and the error is then checked on the test set, seeking a value that makes the test-set error smaller; this process is repeated until the error on the test set is minimal. Concretely, the regularization parameter is first set to 1 and then increased by a factor of 10 at a time within the range allowed by the validation set; if the test-set error stays the same or increases after 2 to 3 trials, the search switches to decreasing λ by a factor of 10 at a time, and so on, until the order of magnitude that minimizes the test-set error is found. Then, within that order of magnitude, a further "fine" adjustment is made: starting from 0 in the lowest digit, the value of the lowest digit is incremented by 1 each time until the value that minimizes the test-set error is found.
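The coarse-to-fine search for λ can be sketched as follows. `val_error(lam)` is a hypothetical callable that trains with the given λ and returns the test-set error; the patience of 3 trials follows the "2 to 3" in the text, and interpreting the fine stage as trying $d \times 10^{e}$ for d = 2 to 9 at the best order of magnitude $10^{e}$ is an assumption.

```python
def coarse_to_fine_lambda(val_error, patience=3, max_steps=8):
    """Coarse stage over powers of 10 starting at 1, then a fine digit scan."""
    best_exp = 0                      # lambda = 10 ** best_exp, starting at 1
    best_err = val_error(1.0)
    for step in (1, -1):              # first increase lambda, then decrease it
        if step == -1 and best_exp > 0:
            break                     # increasing already helped; no downward search
        exp, misses = 0, 0
        for _ in range(max_steps):
            exp += step
            err = val_error(10.0 ** exp)
            if err < best_err:
                best_exp, best_err, misses = exp, err, 0
            else:
                misses += 1
                if misses >= patience:
                    break
    # Fine stage (interpretation): d * 10^e for d = 2..9; d = 1 was already tried.
    best_lam = 10.0 ** best_exp
    for d in range(2, 10):
        lam = d * 10.0 ** best_exp
        err = val_error(lam)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam


# Toy error curve with its minimum at lambda = 0.03 (stands in for real training).
result = coarse_to_fine_lambda(lambda lam: (lam - 0.03) ** 2)
```

With this toy curve, the coarse stage settles on $10^{-2}$ and the fine stage returns approximately 0.03. Tracking the integer exponent rather than repeatedly multiplying by 0.1 avoids accumulating floating-point error in the order of magnitude.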
CN202110130937.0A 2021-01-30 2021-01-30 Teacher blackboard writing action identification method based on multi-granularity convolutional neural network pruning Pending CN112800977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110130937.0A CN112800977A (en) 2021-01-30 2021-01-30 Teacher blackboard writing action identification method based on multi-granularity convolutional neural network pruning

Publications (1)

Publication Number Publication Date
CN112800977A true CN112800977A (en) 2021-05-14

Family

ID=75813109

Country Status (1)

Country Link
CN (1) CN112800977A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175628A (en) * 2019-04-25 2019-08-27 北京大学 A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation
CN110874631A (en) * 2020-01-20 2020-03-10 浙江大学 Convolutional neural network pruning method based on feature map sparsification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
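Zhou Wanqing (周晚晴): "Research on Multi-Granularity Convolutional Neural Network Pruning Algorithms" (多粒度卷积神经网络剪枝算法的研究), China Master's Theses Full-text Database, 15 March 2020 (2020-03-15), pages 3 - 4 *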

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116451771A (en) * 2023-06-14 2023-07-18 中诚华隆计算机技术有限公司 Image classification convolutional neural network compression method and core particle device data distribution method
CN116451771B (en) * 2023-06-14 2023-09-15 中诚华隆计算机技术有限公司 Image classification convolutional neural network compression method and core particle device data distribution method
CN117497194B (en) * 2023-12-28 2024-03-01 苏州元脑智能科技有限公司 Biological information processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination